Uber operates hosting environments in Canberra and Sydney. Each environment connects to the internet through a high-availability pair of routers, with the individual routers connecting to different transit providers so that the network has no single point of failure.
Uber operates several hundred shared hosting servers providing SME web hosting services. Each of these servers supports several hundred individual customers. Uber is responsible for the security of the server operating system and certain shared software products. Customers are responsible for the security of their own application code and software packages they choose to load onto the servers to support their own applications. The network connection of these servers is throttled to 10 Mbps to prevent excessive bandwidth consumption.
At approximately 6:50 am AEDT, Uber experienced a very large increase in outbound traffic from its Canberra environment, passing through border router 1 and then out through Uber’s Canberra Internode transit link. This volume of traffic completely filled a 1 Gbps router interface, slowed border router 1, and saturated the Canberra Internode transit link. As a result, customers with services located in Canberra that either used border router 1 resources or were accessed by customers who reached the Uber network through the Canberra Internode link experienced service degradation, ranging from very slow load times to total loss of service.
At approximately 9:15 am, one source of the very high traffic volumes was discovered within the shared hosting fleet. The server’s network connection was shaped and traffic volumes dropped. Service degradation was reduced, but traffic volumes remained high. At 10:00 am a second source was discovered and its network connection was also shaped. Volumes dropped further and the degradation was mitigated for most customers. At 11:00 am a third shared server was identified and shaped, and traffic volumes returned to normal. By 11:30 am all changes made during troubleshooting had been reversed and all services had returned to normal.
A short time later, five more servers, these with shaped ports, were identified as hosting malicious code and cleaned. During the remainder of the day the ports of all shared servers were audited.
Customers either (1) located in Sydney, or (2) receiving services through Canberra border router 2 and non-Internode transit and peering providers were not affected.
Shared hosting accounts on eight separate shared servers, three with unshaped internet connections, were compromised on 7 Feb and malicious code was loaded. The servers with unshaped ports generated excessive volumes of outbound traffic that consumed excess resources on Canberra border router 1 and filled the Canberra Internode link.
How was the malicious code loaded?
The accounts were compromised through known weaknesses in out-of-date Joomla and WordPress applications. These applications are loaded by the customers themselves, and patching them is the customers’ responsibility. At approximately 6:50 am the malicious code was activated as part of what appears to be a DDoS attack against a web service located in the United States. Ordinarily this would not cause a significant problem, since the network interface servicing each server is throttled to 10 Mbps. This type of event is not uncommon and is normally detected and cleaned quickly by Uber Engineers.
Why weren’t the ports on three of these servers shaped?
In the incident on 7 Feb, three of the servers had their network ports set to 1 Gbps instead of 10 Mbps. The content on these servers had recently been migrated from lower capacity infrastructure to higher capacity infrastructure, and during migrations the port speeds are opened up to allow fast data transfer. At the conclusion of a migration the port speeds are supposed to be dropped back to 10 Mbps, but this final step had not been carried out. When the apparent DDoS was activated, the servers with throttled ports generated a small volume of traffic that under normal circumstances would have been quickly detected and addressed. However, since three servers had their ports set to 1 Gbps and are deployed on Enterprise grade infrastructure, they were able to generate massive amounts of traffic. This traffic passed easily through the Uber Canberra core network (the core is scoped to support multiple tens of Gbps of traffic) to border router 1 and then out through the Canberra Internode link towards the United States. The volume of traffic consumed significant border router 1 resources, making the router extremely slow and choking the Internode link.
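The difference between shaped and unshaped ports can be illustrated with a rough calculation. The sketch below uses only the figures quoted in this report (10 Mbps shaped ports, 1 Gbps unshaped ports, a 1 Gbps router interface). It assumes every compromised server saturates its port, which is a worst-case assumption for illustration and not a measurement taken during the incident.

```python
# Port and interface speeds quoted in this report, in Mbps.
SHAPED_PORT_MBPS = 10
UNSHAPED_PORT_MBPS = 1000
ROUTER_INTERFACE_MBPS = 1000


def worst_case_outbound_mbps(shaped_servers: int, unshaped_servers: int) -> int:
    """Upper bound on outbound traffic if every server saturates its port."""
    return (shaped_servers * SHAPED_PORT_MBPS
            + unshaped_servers * UNSHAPED_PORT_MBPS)


# All eight compromised servers shaped, as intended.
all_shaped = worst_case_outbound_mbps(shaped_servers=8, unshaped_servers=0)
# Five shaped and three unshaped, as on 7 Feb.
as_found = worst_case_outbound_mbps(shaped_servers=5, unshaped_servers=3)

print(f"All shaped: {all_shaped} Mbps "
      f"({all_shaped / ROUTER_INTERFACE_MBPS:.0%} of the interface)")
print(f"As found:   {as_found} Mbps "
      f"({as_found / ROUTER_INTERFACE_MBPS:.0%} of the interface)")
```

Under these assumptions, eight shaped servers could offer at most 80 Mbps, a small fraction of the 1 Gbps interface, while the three unshaped servers alone could offer more traffic than the interface could carry.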
Is monitoring architecture sufficient?
Previously, monitoring for DDoS-type events on the border routers has been sufficient. Inbound attacks are naturally limited by the bandwidth of the transit provider, and since they must pass through the border routers, collecting and analysing border router syslog data has proved to be the most reliable detection mechanism. Historically, even under the heaviest external DDoS attacks, we have not experienced significant degradation of border router performance or syslog data production. Since the Uber network is engineered to distribute content to the internet, it is configured to support high volumes of outbound traffic. Indeed, except where demanded by the business model, we work very hard to ensure we don’t artificially limit outbound traffic. However, this incident shows that we need to look at mechanisms to monitor outbound traffic volumes. Our internal Technology Group has taken this on as a priority.
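As an illustration of what outbound volume monitoring could look like, the sketch below flags an interface whose outbound rate stays above a fraction of its capacity for several consecutive polling intervals. It is not a description of Uber's monitoring tooling: the `CounterSample` type, the function names, the 80% threshold and the three-interval window are all hypothetical, and counter wrap-around is ignored for brevity.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CounterSample:
    """One poll of an interface's cumulative outbound byte counter (hypothetical)."""
    timestamp_s: float
    out_octets: int


def outbound_mbps(prev: CounterSample, curr: CounterSample) -> float:
    """Average outbound rate in Mbps between two polls."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (curr.out_octets - prev.out_octets) * 8 / elapsed / 1_000_000


def sustained_breach(samples: Sequence[CounterSample],
                     capacity_mbps: float,
                     threshold: float = 0.8,
                     consecutive: int = 3) -> bool:
    """True if the rate is at or above threshold * capacity for
    `consecutive` polling intervals in a row."""
    run = 0
    for prev, curr in zip(samples, samples[1:]):
        if outbound_mbps(prev, curr) >= threshold * capacity_mbps:
            run += 1
            if run >= consecutive:
                return True
        else:
            run = 0
    return False
```

Requiring several consecutive intervals is one way to avoid alerting on short, legitimate bursts of outbound traffic; the right threshold and window would depend on normal traffic patterns for each interface.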
If there are High Availability routers at each site why didn’t they fail over?
While the routers are HA, there is only a single connection to each of the transit providers. Uber balances its transit connections between the routers. Border router 1 had the connection to Internode, and since the traffic was bound for the US it was all sent this way. Had the Internode link failed, the traffic would have been routed out a different interface, but the link remained up, albeit congested.
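The behaviour described above can be shown with a deliberately simplified model in which the egress link is chosen from link state and preference only. This is a toy illustration, not how any particular router or routing protocol makes its decision; the `TransitLink` type, the preference values and the second provider name are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TransitLink:
    name: str
    is_up: bool
    utilisation: float  # 0.0 to 1.0; recorded but not used for selection
    preference: int     # lower value is preferred for this destination


def select_egress(links: Iterable[TransitLink]) -> Optional[TransitLink]:
    """Pick the most preferred link that is up; congestion is not considered."""
    candidates = [link for link in links if link.is_up]
    if not candidates:
        return None
    return min(candidates, key=lambda link: link.preference)


links = [
    TransitLink("Internode", is_up=True, utilisation=1.0, preference=1),
    TransitLink("other-transit", is_up=True, utilisation=0.2, preference=2),
]
# The congested link is still selected because it is up.
print(select_egress(links).name)
```

In this model traffic only moves to the second link once the preferred link is marked down, which is why a link that is full but still up does not trigger a failover.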
Is Shared Hosting security sufficient?
Given the high number of shared hosting accounts and the low skill levels of many of the customers using those services, compromised shared hosting accounts are relatively common. The business model of shared hosting does not support extensive security countermeasures, so the industry standard is to ‘delay, detect, react’. Delay is provided by throttling network speed so that no significant damage can be done, detect is provided by monitoring tools, and react is provided by Engineers who first isolate and then clean the server. Uber follows this approach; however, due to the oversight in the migration process, the ‘delay’ measure (the port shaping) had not been applied to three of the servers. The other five servers, which had been throttled, also participated in the apparent DDoS but did not generate traffic volumes that threatened services for other customers. The security of shared hosting is appropriate to its cost and level of service, as long as the ‘delay, detect, react’ measures are properly implemented.
Does the fact that eight shared hosting servers in total were compromised indicate that Uber was specifically targeted?
Our view is probably not. These servers were all in contiguous IP address space and the vector for compromise was well-known, out-of-date, open-source software, so it is likely that they were subjected to automated scanning. Since they were all near neighbours they were all detected by the same automated process.
Are there any more servers with unshaped network connections?
Uber has completed an audit of all shared servers. A handful of servers with unshaped network connections were detected and corrected. We are confident that we are no longer exposed to a similar event.
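An audit of this kind reduces to comparing each server's configured port speed with the expected shaped speed. The sketch below is a minimal illustration and assumes the configured speeds have already been collected into a mapping; the server names and the `find_unshaped` function are hypothetical, and how the speeds are gathered from the switches is outside its scope.

```python
from typing import Dict, List

EXPECTED_PORT_MBPS = 10  # shaped speed for shared hosting servers, per this report


def find_unshaped(port_speeds_mbps: Dict[str, int],
                  expected_mbps: int = EXPECTED_PORT_MBPS) -> List[str]:
    """Return the names of servers whose port speed exceeds the expected shaped speed."""
    return sorted(name for name, speed in port_speeds_mbps.items()
                  if speed > expected_mbps)


# Hypothetical inventory: server name -> configured port speed in Mbps.
inventory = {
    "shared-001": 10,
    "shared-002": 1000,  # left open after a migration
    "shared-003": 10,
}
print(find_unshaped(inventory))
```

Running a check like this on a schedule, and at the end of every migration, would catch a port that was opened for data transfer and never returned to its shaped speed.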