Please note: The preliminary report below is replaced by the final report, published on 28 June.
Commencing at approximately 12:00 pm on 22 June, Uber experienced three separate outages that caused significant disruption to most services hosted in both Canberra and Sydney. Outage 1 lasted from approximately 12:10 pm to 2:30 pm. Outage 2 lasted from approximately 4:00 pm to 6:00 pm. Outage 3 commenced at approximately 6:30 pm and was resolved for most customers by 8:30 pm, and for all customers by 9:00 pm. Initial investigations indicate that all three outages share the same root cause.
Uber makes extensive use of Cisco switching products, chosen for their high reliability. One of the default features in Cisco IOS is the Error Disable (errdisable) feature (http://www.cisco.com/en/US/tech/tk389/tk621/technologies_tech_note09186a00806cd87b.shtml), which disables a physical switch port when certain error conditions are detected. The purpose of the feature is to prevent an error condition propagating through a network and causing extensive damage.
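For illustration (this session is representative, not captured from the incident, and the interface name is hypothetical), the feature's detection settings and any affected ports can be inspected from the IOS command line:

```
! Representative Cisco IOS commands; output is illustrative only.
Switch# show errdisable detect                 ! which error causes trigger errdisable
Switch# show interfaces status err-disabled    ! list ports currently err-disabled
Port      Name      Status         Reason
Gi1/0/1             err-disabled   loopback
```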
During the evening of 21/22 June Uber relocated customer-specific hardware providing services to a small number of customers within its data centre. It is currently believed that the root cause of the outages is one of the items of customer-specific hardware moved during this relocation. The inter-capital Canberra-Sydney trunk provider was also swapped during the same activity.
Outage 1 commenced at approximately 12:10 pm on 22 June. The outage took down large parts of the Uber infrastructure in Canberra and portions of the Sydney network. Examination of the network revealed that a number of access switch trunk connections to the switching core, as well as the inter-capital trunk connection to the Sydney infrastructure, had been disabled by the errdisable feature. The switches reported loopback errors. Network services were restored by manually re-enabling the physical ports on more than forty switches. As ports were re-enabled customer services came back online, and by approximately 2:30 pm all services had been restored. Investigations into the source of the loopback error commenced.
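Manually clearing an err-disabled port on Cisco IOS is done by bouncing the interface. A sketch of the procedure repeated on each affected port (the interface name is hypothetical):

```
Switch# configure terminal
Switch(config)# interface GigabitEthernet1/0/1
Switch(config-if)# shutdown         ! take the port administratively down
Switch(config-if)# no shutdown      ! clears the err-disabled state and re-enables the port
Switch(config-if)# end
```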
Outage 1 occurred approximately 20 minutes after a new IBM BladeCentre was added to the infrastructure in Canberra, and so initial investigations focused on the configuration of the switches in this device. Network infrastructure is designed to allow for the addition of BladeCentre capacity, but as a precaution the BladeCentre was immediately removed from the network. Part way through the investigation the second outage occurred.
Outage 2 commenced at approximately 4:00 pm, part way through the investigation of outage 1. It manifested very similarly to outage 1, with large-scale error disablement across Cisco access switch trunk ports. It was clear at this stage that the BladeCentre was not the cause, and suspicion turned to the equipment relocations of the evening of 21/22 June. The outage was resolved using a similar approach to outage 1 by approximately 6:00 pm. During this outage the inter-capital trunk change from the evening of 21/22 June was also reversed to eliminate it as a cause of the loopback error.
Outage 3 commenced at approximately 6:30 pm, 30 minutes after the restoration of services following outage 2. During the previous two outages Uber had focused primarily on restoring services as quickly as possible. During outage 3 it became clear that the emphasis needed to shift from quick restoration to more considered troubleshooting. By this stage suspicion had turned to the customer-specific hardware moved on the evening of 21/22 June. These devices were connected directly to the core, and the errdisable events had occurred on trunk connections to the core. Error messages observed on a number of servers during outage 2 seemed to point to one of these devices as the problem. Finally, customers making use of services provided by the customer-specific hardware seemed to be disproportionately affected. In recovering from outage 3, Uber first disabled spanning tree functionality on one of the customer-specific hardware devices and then progressively restored switch configurations from the peripheries of the network inwards in an attempt to identify the device causing the issue. The full network was restored without incident and has remained stable since.
Preliminary Root Cause
At this stage Uber believes the root cause is a fault with one of the customer-specific hardware devices moved on the evening of 21/22 June. It needs to be stressed that this is a preliminary finding; the devices must be tested separately in an attempt to replicate the problem.
If the problem is with a customer-specific hardware device, why did it not manifest until approximately ten hours after the devices were moved? At this stage the answer is not known. Two theories are: (1) the fault is an intermittent hardware fault (Uber has recently experienced intermittent faults with interface cards on other, more recent hardware of a similar type), and (2) the core switches were able to buffer the error condition for ten hours before the buffers became overloaded and the error condition then spread through the network. It needs to be stressed that these are, at present, only theories, and Uber has no direct evidence to support either.
Is the root cause actually a customer-specific hardware device? The evidence Uber has points this way, but it is not yet definitive. Further investigation is required.
Should Uber implement errdisable recovery on its switches, which automatically re-enables disabled ports after a specified period of time for specified error conditions? This issue needs investigation. While the errdisable feature caused significant network outages, its Boolean nature (ports stay down until manually re-enabled) allowed relatively quick network restoration. If a similar error were to occur with switches configured to automatically disable and then re-enable ports, the network could enter a highly unstable state as the problem rippled across it, which would probably prove more difficult to troubleshoot and restore. Conversely, this is the first problem related to errdisable that Uber has experienced in over six years of operation.
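For reference, enabling automatic recovery for the loopback cause on Cisco IOS would look like the following sketch; the 300-second interval is illustrative, not a recommendation:

```
Switch(config)# errdisable recovery cause loopback    ! auto-recover ports disabled for loopback
Switch(config)# errdisable recovery interval 300      ! retry re-enabling after 300 seconds
Switch# show errdisable recovery                      ! verify recovery settings and timers
```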
There is no evidence to suggest that the outages were caused by malicious activity. Uber remains conscious of this possibility as investigations continue, but as yet nothing indicates that the incident is related to an attack on our infrastructure.
At this stage the network appears stable. Uber has implemented additional monitoring measures and placed additional technical staff on call. All network changes have been frozen, and the performance of the network will be monitored for the next 48 hours before any further action is taken. Uber is likely to remove the suspect customer-specific device from the network, in consultation with those customers currently receiving services from it, in order to perform further bench testing.