Monitoring vendor ThousandEyes provided a detailed postmortem for CenturyLink's widespread outage Sunday morning, which included a few lessons learned.
As previously reported, the outage, which started just after 6 a.m. ET, was due to an "offending Flowspec announcement" that originated in CenturyLink's CA3 data center in Mississauga, a city in Ontario, Canada.
"On August 30, we experienced an IP outage in several global markets due to a problematic Flowspec announcement that prevented Border Gateway Protocol from establishing correctly," CenturyLink said in a statement. "We made a global configuration change to block the Flowspec announcement, which allowed the BGP to begin routing properly. We restored all services by 11:30 a.m. Eastern time."
ThousandEyes said the Flowspec update on CenturyLink's Level 3 network prevented the establishment of Border Gateway Protocol (BGP) sessions across elements of its network, which "led to a catastrophic scenario given the role that BGP plays in internal router traffic routing, as well as internet traffic routing," which includes peering and traffic exchange between autonomous networks.
Flowspec is an extension to BGP that is used to distribute firewall-like rules via BGP updates. Flowspec allows service providers to push rules across an entire network almost instantly, but doing so incorrectly led to CenturyLink/Level 3's outage on Sunday.
"In an expanded analysis provided to its customers, CenturyLink has indicated that the improperly configured Flowspec was part of a botched effort to block unwanted traffic on behalf of a customer — a routine internal use case for Flowspec," said ThousandEyes' Angelique Medina, director for product marketing, in a blog. "In this instance, the customer requested CenturyLink block traffic from a specific IP address. The Flowspec for this request was accidentally implemented with wildcards, rather than isolated to a specific IP address.
"This misconfiguration, along with the failure of filter mechanisms, and other factors, ultimately led to the incident. The timing of the outage (4 a.m. MDT relative to Level 3 headquarters in Denver, Colorado), suggests that this announcement could have been introduced as part of routine network updates, which typically take place during early morning hours to avoid broad user impact."
When the offending Flowspec update was received and executed by a router, BGP sessions were dropped, which in turn severed the control plane connection, including the communication of the bad Flowspec rule.
"Since BGP is dynamic and the Flowspec rule would not persist past termination of BGP, the routers would then attempt to reestablish BGP, at which point they would receive BGP announcements, including the offending Flowspec rule," according to Medina. "At this point, the routers would go down the rule list, implementing as they go along, until hitting, yes, the dreaded Flowspec rule, and the control plane connections would terminate. If this sounds like a nightmarish infinity loop, you’re not wrong."
At the start of the outage, ThousandEyes detected traffic terminating on a large number of Level 3 interfaces and on Level 3's infrastructure, as well as on other ISPs' networks on nodes directly connected to Level 3. ThousandEyes said 72 interfaces were impacted as the outage first unfolded, a number that grew to 522 at the peak of the outage across Level 3 and other ISPs that peered with it.
Despite its efforts, CenturyLink continued to advertise stale routes even after services had withdrawn those routes.
"What made this outage so disruptive to Level 3’s enterprise customers and peers, is that efforts to revoke announcements to Level 3 (a common method to reroute around outages and restore service reachability) were not effective, as Level 3 was not able to honor any BGP changes from peers during the incident, most likely due to an overwhelmed control plane," Medina said. "Revoking the announcement of prefixes from Level 3, or preventing route propagation through a no-export community string and even shutting down an interface connection to the provider would have been fruitless."
In an effort to remediate the bad Flowspec update, CenturyLink asked large backbone carriers, such as Telia Carrier and NTT, to de-peer with Level 3. After withdrawing all of their announcements from Level 3, ThousandEyes and NTT used Cogent as their replacement provider.
"It’s important to keep in mind that the internet is a web of interdependencies and even if you or your immediate peers were not directly connected to Level 3, you and your users could have been impacted," according to Medina. "Internet routing is highly dynamic and influenced by a number of factors that include path length, route specificity, commercial agreements, and provider-specific peering preferences. Many users, services and ISPs impacted as a result of this outage were not customers or direct peers of CenturyLink/Level 3, yet found that their traffic was routed through that provider at some point."
While GoToMeeting uses Level 3, it went to its back-up provider, GTT, shortly after the outage started. Even though the prefixes announced by GoToMeeting were the same ones that were announced through Level 3, the routes announced through GTT seemed to be preferred to those of Level 3.
"The reason why GTT routes were preferred to Level 3 (mitigating the impact of the outage on GoToMeeting) likely came down to GTT’s peering density and peering relations that may have made GTT routes more attractive (for cost or other reasons)," according to Medina. "The dynamic, uncontrolled (and contextual) nature of Internet routing was on full display during this incident, underscoring the significant impact of peering and provider choices — not only your own, but those of your peers and their peers.
"The deeply interconnected and interdependent nature of the Internet means that no enterprise is an island — every enterprise is a part of the greater Internet whole, and subject to its collective issues. Understanding your risk factors requires an understanding of who is in your wider circle of dependencies and how their performance and availability could impact your business if something were to go wrong."
Medina said maintaining visibility into the routing, availability, and performance of critical providers is also extremely important, as external communication on status and root cause can vary widely by provider and is often slow to arrive.
"When it does, it may be past its usefulness in addressing an issue proactively, "Medina said. "Finally, consider the context of the outage (and any outage). In the case of this CenturyLink/Level 3 incident, the timing of it dramatically reduced its impact on many businesses, as it occurred in the early hours (at least in the U.S.) on a Sunday morning.
"Perhaps, that’s one bit of good news we can take away from this incident. And we all could use some good news about now."