Preliminary Incident Report: 24.6.11 Outages

Please note: The preliminary report below has been superseded by the final report, published on 28 June.

Incident

Commencing at approximately 12:00 pm on 24 June, Uber experienced three separate outages which caused significant disruption to most services hosted in both Canberra and Sydney.  Outage 1 lasted from approximately 12:10 pm to 2:30 pm.  Outage 2 lasted from approximately 4:00 pm to 6:00 pm.  Outage 3 commenced at approximately 6:30 pm and was resolved for most customers by 8:30 pm and for all customers by 9:00 pm.  Initial investigations indicate that all three outages share the same root cause.

Background

Uber makes extensive use of Cisco switching products, chosen for their high reliability.  One of the default features in Cisco IOS is the Error Disable (errdisable) feature (http://www.cisco.com/en/US/tech/tk389/tk621/technologies_tech_note09186a00806cd87b.shtml), which disables a physical switch port when certain error conditions are detected.  The purpose of the feature is to prevent an error condition propagating through a network and causing extensive damage.
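
For readers unfamiliar with the feature, the sketch below illustrates roughly how an err-disabled port shows up from the switch CLI.  It is a minimal illustration only, assuming remote CLI access via the netmiko library; the hostname and credentials are placeholders, not details of Uber's environment.

```python
# Minimal sketch: list the error conditions a switch will err-disable on,
# and the ports it has currently shut down. Hostname/credentials are placeholders.
from netmiko import ConnectHandler

switch = {
    "device_type": "cisco_ios",
    "host": "access-sw01.example.net",  # placeholder
    "username": "netops",               # placeholder
    "password": "********",             # placeholder
}

with ConnectHandler(**switch) as conn:
    # Error conditions that trigger errdisable on this switch
    print(conn.send_command("show errdisable detect"))
    # Ports currently err-disabled, with the reason (e.g. "loopback")
    print(conn.send_command("show interfaces status err-disabled"))
```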

During the evening of 21/22 June Uber relocated customer specific hardware providing services to a small number of customers within its data centre.  It is currently believed that the root cause of the outages is one of the items of customer specific hardware moved during this relocation.  Additionally, the inter-capital Canberra-Sydney trunk provider was swapped during the same activity.

Outage 1

Outage 1 commenced at approximately 12:10 pm on 24 June.  The outage took down large parts of the Uber infrastructure in Canberra and portions of the Sydney network.  Examination of the network revealed that a number of access switch trunk connections to the switching core, along with the inter-capital trunk connection to the Sydney infrastructure, had been disabled by the errdisable feature.  The switches reported loopback errors.  Network services were restored by manually re-enabling the physical ports on more than forty switches.  As ports were re-enabled customer services came back online, and by approximately 2:30 pm all services had been restored.  Investigations into the source of the loopback error commenced.
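
Re-enabling that many ports by hand is slow.  For illustration only, the sketch below shows one way the same bounce (shutdown, then no shutdown) could be scripted across a list of switches using the netmiko library; the switch names, credentials and output parsing are assumptions made for the example, not a description of the tooling actually used during the restoration.

```python
# Hedged sketch: bounce every err-disabled port on a list of switches.
# Device names and credentials are placeholders; the parsing assumes the usual
# "Port  Name  Status  Reason" layout of the err-disabled listing.
from netmiko import ConnectHandler

SWITCHES = ["access-sw01.example.net", "access-sw02.example.net"]  # placeholders
CREDS = {"device_type": "cisco_ios", "username": "netops", "password": "********"}

def errdisabled_ports(conn):
    output = conn.send_command("show interfaces status err-disabled")
    ports = []
    for line in output.splitlines():
        fields = line.split()
        # Skip headers and blank lines; the first column is an interface name
        # such as Gi1/0/1, Fa0/24, Te1/1 or Po2.
        if fields and fields[0][:2] in ("Gi", "Fa", "Te", "Po"):
            ports.append(fields[0])
    return ports

for host in SWITCHES:
    with ConnectHandler(host=host, **CREDS) as conn:
        for port in errdisabled_ports(conn):
            print(f"{host}: re-enabling {port}")
            conn.send_config_set([f"interface {port}", "shutdown", "no shutdown"])
```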

Outage 1 occurred approximately 20 minutes after a new IBM BladeCentre was added to the infrastructure in Canberra, and so initial investigations focused on the configuration of the switches in this device. Network infrastructure is designed to allow for the addition of BladeCentre capacity, but as a precaution the BladeCentre was immediately removed from the network.  Part way through the investigation the second outage occurred.

Outage 2

Outage 2 commenced at approximately 4:00 pm, part way through the investigation of outage 1.  This outage manifested very similarly to outage 1, with large-scale error disablement across Cisco access switch trunk ports.  It was clear at this stage that the BladeCentre was not the cause, and suspicion turned to the equipment relocations of the evening of 21 June.  This outage was resolved using a similar approach to outage 1 by approximately 6:00 pm.  During this outage the inter-capital trunk change from the evening of 21/22 June was also reversed to eliminate it as a cause of the loopback error.

Outage 3

Outage 3 commenced at approximately 6:30 pm, 30 minutes after the restoration of services following outage 2.  During the previous two outages Uber had focused primarily on restoring services as quickly as possible.  During outage 3 it became clear that the emphasis needed to shift from quick restoration to more considered troubleshooting.  By this stage suspicion had turned to the customer specific hardware moved on the evening of 21 June.  These devices were connected directly to the core, and the error disablement had occurred on trunk connections to the core.  Error messages observed on a number of servers during outage 2 seemed to point to one of these devices as the problem.  Finally, customers making use of services provided by the customer specific hardware seemed to be disproportionately affected.  In restoring services after outage 3, Uber first disabled spanning tree functionality on one of the customer specific hardware devices and then progressively restored switch configurations from the peripheries of the network inwards in an attempt to identify the device causing the issue.  The full network was restored from outage 3 without incident and has remained stable since restoration.

Preliminary Root Cause

At this stage Uber believes the root cause is a fault with one of the customer specific hardware devices moved on the evening of 21/22 June.  It needs to be stressed that this is a preliminary finding, and the devices need to be tested separately in an attempt to replicate the problem.

Unresolved Questions

If the problem is with a customer specific hardware device, why did the fault not manifest itself until approximately ten hours after the devices were moved?  At this stage the answer is not known.  Two theories are: (1) the fault is an intermittent hardware fault (Uber has recently experienced intermittent faults with interface cards on other, newer hardware of a similar type), and (2) the core switches were able to buffer the error condition for ten hours before the buffers became overloaded and the error condition spread through the network.  It needs to be stressed that these are presently theories only, and Uber has no direct evidence at this stage to support either of them.

Is the root cause actually a customer specific hardware device?  The evidence Uber has points this way, but as yet this is not definitive.  Further investigation is required.

Should Uber implement errdisable recovery on its switches, which automatically reverses port disablement after a specified period of time for specified error conditions?  This issue needs investigation.  While the errdisable feature caused significant network outages, its Boolean nature allowed relatively quick network restoration.  If a similar error were to occur with switches configured to automatically disable and then re-enable ports, the network could enter a highly unstable state as the problem rippled across it, a state that would probably prove more difficult to troubleshoot and restore.  Conversely, this is the first time Uber has experienced a problem related to errdisable in over six years of operation.
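
For illustration, the configuration change itself is small.  The sketch below shows, under the same placeholder assumptions as the earlier examples, how errdisable recovery for loopback errors could be pushed to a switch with a deliberately conservative interval; it illustrates the option being weighed, not a change Uber has decided to make.

```python
# Hedged sketch: enable automatic recovery of ports err-disabled for loopback,
# with a conservative 10 minute interval. Device details are placeholders.
from netmiko import ConnectHandler

switch = {
    "device_type": "cisco_ios",
    "host": "core-sw01.example.net",  # placeholder
    "username": "netops",             # placeholder
    "password": "********",           # placeholder
}

RECOVERY_CONFIG = [
    "errdisable recovery cause loopback",  # recover ports disabled for loopback errors
    "errdisable recovery interval 600",    # wait 600 seconds before re-enabling
]

with ConnectHandler(**switch) as conn:
    conn.send_config_set(RECOVERY_CONFIG)
    print(conn.send_command("show errdisable recovery"))  # confirm the new settings
```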

Malicious Activity

There is no evidence to suggest that the outages were caused by malicious activity.  Uber remains conscious of this possibility as investigations continue, but as yet nothing would indicate that the incident is related to an attack on our infrastructure.

Next Steps

At this stage the network appears stable.  Uber has implemented additional monitoring measures and placed additional technical staff on call.  All network changes have been frozen, and the performance of the network will be monitored for the next 48 hours before any further actions are taken.  At this stage Uber is likely to remove the suspect customer specific device from the network in consultation with those customers currently receiving services from it in order to perform further bench testing.

5 Comments

  1. Dylan
    Posted June 25, 2011 at 1:56 pm | Permalink

    During outage 3 status.aussiehq.com.au was not online. The failure of the status website needs to be documented and resolved in this article. Please update.

  2. Posted June 25, 2011 at 2:18 pm | Permalink

    Good point. The status page is balanced between two sites, the primary in Australia and secondary in Dallas USA for cases like yesterday’s network issue. The failover check is located on the US network so it should have kicked in.
    We’ll include the failover process in the list of items to investigate further.

  3. Brian
    Posted June 25, 2011 at 4:34 pm | Permalink

    Good report, thanks for your work on a Saturday.

    To recover a switch, would it be enough to simply power cycle the switch? IE: is the errdisable remembered across a reboot? Just that if you have to re-enable 40 in future, a power switch, if you can physically access it, is a darn good workaround.

    I notice that your status site is “auto failed over” to a site outside your network. Can I just ask that you simplify this a little by just hosting it outside your network 100%. Auto-failover is always going to cause you problems, simply because your auto-failover may not detect situations stopping some people from accessing the status site – it may work from the US, but not from sites in Australia. Thinking you’re going to be able to get this right is probably hubris; simplicity and robustness tend to hang out together a lot. :)

    No doubt this was a complex and confusing issue to debug, since it didn’t happen till 10 hours after the move. I think it’s worth trying to track it down further but good luck with reproducing it!

    What procedural changes could you make that would have sped up recovery? That is the only thing I didn’t see addressed here.

    Thanks again for the thorough update. This was the longest outage (by far) I’ve experienced from you, and I got some heat for it, so it’s good to see that it’s being taken very seriously internally.

  4. Posted June 28, 2011 at 1:44 pm | Permalink

    Further to the comment above, and the comment on the final report, we have remained focused on the core incident. The secondary issue of the status page remaining unavailable is being addressed separately.

  5. Patrick
    Posted June 29, 2011 at 1:53 pm | Permalink

    While the err-disable is not carried over after a reboot, this is a highly inefficient way of doing things. It takes 10 secs to re-enable a port; it takes minutes for a switch to boot.

    I see the way forward as one of 3 ways:
    1. Upgrade to a newer version of code that doesn’t send loop keepalives
    2. Disable loop keepalives
    3. Enable err-disable recovery on all core trunks
