True to its word, Google provided an update on the root cause of Sunday's large network outage, which it pinned on a server configuration change.
Sunday's outage, which started on the East Coast and lasted more than four hours, impacted social media companies, such as Snapchat, that rely on Google Cloud as well as Google's own services, including YouTube, Gmail, Google Search, G Suite, Google Drive, Nest, and Google Docs.
Benjamin Treynor Sloss, Google's vice president of engineering, explained in a blog post that the configuration change was intended for a small number of servers in a single region.
"The configuration was incorrectly applied to a larger number of servers across several neighboring regions, and it caused those regions to stop using more than half of their available network capacity," according to Treynor Sloss. "The network traffic to/from those regions then tried to fit into the remaining network capacity, but it did not. The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows, much as urgent packages may be couriered by bicycle through even the worst traffic jam."
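The triage Treynor Sloss describes can be illustrated with a minimal sketch. This is not Google's actual mechanism, which the blog post does not detail; the `Flow` type, the `triage` function, the flow names, and the bandwidth figures are all hypothetical, chosen only to show latency-sensitive and smaller flows being admitted ahead of larger ones when capacity is cut.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    # Hypothetical traffic flow; names and numbers are illustrative only.
    name: str
    demand_gbps: float
    latency_sensitive: bool


def triage(flows, capacity_gbps):
    """Admit flows in priority order until the remaining capacity runs out."""
    # Latency-sensitive flows sort first, then smaller flows before larger ones.
    ordered = sorted(flows, key=lambda f: (not f.latency_sensitive, f.demand_gbps))
    admitted, dropped = [], []
    remaining = capacity_gbps
    for flow in ordered:
        if flow.demand_gbps <= remaining:
            admitted.append(flow)
            remaining -= flow.demand_gbps
        else:
            # Flow does not fit in what is left, so it is dropped.
            dropped.append(flow)
    return admitted, dropped


flows = [
    Flow("video-stream", 60.0, False),
    Flow("bulk-storage-sync", 30.0, False),
    Flow("search-query", 1.0, True),
    Flow("mail-sync", 5.0, True),
]

# Assume the region has lost more than half of a nominal 100 Gbps.
admitted, dropped = triage(flows, capacity_gbps=40.0)
print("admitted:", [f.name for f in admitted])
print("dropped:", [f.name for f in dropped])
```

With these made-up numbers, the small latency-sensitive flows fit first and the largest bulk flow is the one that gets dropped, which matches the general behavior described in the quote.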
Users of services that fall into the "less latency-sensitive traffic" category may not be pleased with that reasoning for being dropped. Google mentioned providing information on service level agreements (SLAs) in a Sunday post on its Google Cloud status dashboard, but Treynor Sloss didn't provide any additional SLA information in his blog post on Monday.
Treynor Sloss said that Google's engineering teams detected the issue within seconds, "but diagnosis and correction took far longer than our target of a few minutes." Google's efforts to fix the issue were hamstrung by the very same network congestion.
"Once alerted, engineering teams quickly identified the cause of the network congestion, but the same network congestion which was creating service degradation also slowed the engineering teams’ ability to restore the correct configurations, prolonging the outage," according to Treynor Sloss. "The Google teams were keenly aware that every minute which passed represented another minute of user impact, and brought on additional help to parallelize restoration efforts."
High-bandwidth services, such as YouTube video, were more affected, while low-bandwidth services such as Google Search saw only a brief increase in latency.
"Overall, YouTube measured a 10% drop in global views during the incident, while Google Cloud Storage measured a 30% reduction in traffic," according to Treynor Sloss. "Approximately 1% of active Gmail users had problems with their account; while that is a small fraction of users, it still represents millions of users who couldn’t receive or send email. As Gmail users ourselves, we know how disruptive losing an essential tool can be! Finally, low-bandwidth services like Google Search recorded only a short-lived increase in latency as they switched to serving from unaffected regions, then returned to normal."
Treynor Sloss said that Google's engineering teams were conducting a thorough post-mortem of the outage "to ensure we understand all the contributing factors to both the network capacity loss and the slow restoration." "We will then have a focused engineering sprint to ensure we have not only fixed the direct cause of the problem, but also guarded against the entire class of issues illustrated by this event," he said.
He apologized for the network outage in his blog post, and said that Google would learn from its mistakes going forward. For Google Cloud customers, the big takeaway should be not to have all of your cloud services centralized with one provider.