Level 3 technician's misstep causes largest outage ever reported

On Oct. 4, 2016, phone service on Level 3’s network was blocked nationwide for nearly an hour and a half. Shortly thereafter, Level 3 copped to “a configuration error,” but said little more publicly. The company got more specific with its customers, revealing that a Level 3 technician had made a clerical error. The Federal Communications Commission has now made the specific mechanism public.

Yesterday the FCC’s Public Safety and Homeland Security Bureau posted its report (PDF) on the Level 3 outage. The bureau administers the Network Outage Reporting System (NORS) and conducts investigations into service disruptions. Its summary of the error points to a decidedly pedestrian source:

“As part of its regular network maintenance practices, which involve network changes once or twice a day, a technician made changes to Level 3’s network management software, which manages soft switches and gateways. Specifically, the outage occurred while the technician was conducting routine anti-fraud operations in Level 3’s vendor-supplied network management software. The anti-fraud operations were intended to block calls originating from telephone numbers that are not native to Level 3’s network that are suspected of association with malicious activity. The technician left empty a field that would normally contain a target telephone number. The network management software interpreted the empty field as a 'wildcard,' meaning that the software understood the blank field as an instruction to block all calls, instead of as a null entry. This caused the switch to block calls from every number in Level 3’s non-native telephone number database.”
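The failure mode the FCC describes, a blank field read as "match everything," can be illustrated with a minimal sketch. Nothing here is Level 3's or its vendor's actual software, which the report does not identify; the names `BlockRule`, `matches_lenient`, `validate_rule` and `blocked_callers` are hypothetical, and the phone numbers are fictional.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockRule:
    # Originating number to block. An empty string stands in for the
    # field the technician left blank (hypothetical data model).
    target: str


def matches_lenient(rule: BlockRule, caller: str) -> bool:
    """Treat a blank target as a wildcard, as the report describes."""
    return rule.target == "" or rule.target == caller


def validate_rule(rule: BlockRule) -> BlockRule:
    """Assumed safeguard: refuse any rule without an explicit number."""
    digits = rule.target.strip().lstrip("+")
    if not digits:
        raise ValueError("block rule needs an explicit target number")
    if not digits.isdigit():
        raise ValueError(f"target is not a phone number: {rule.target!r}")
    return rule


def blocked_callers(rule: BlockRule, callers: list[str]) -> list[str]:
    """Return the callers that the lenient matcher would block."""
    return [c for c in callers if matches_lenient(rule, c)]


if __name__ == "__main__":
    callers = ["2025550101", "2025550102", "2025550103"]

    # A properly filled-in rule blocks exactly one caller.
    print(blocked_callers(BlockRule("2025550102"), callers))

    # A blank rule blocks every caller in the list.
    print(blocked_callers(BlockRule(""), callers))

    # With validation in front, the blank rule never reaches the matcher.
    try:
        validate_rule(BlockRule(""))
    except ValueError as err:
        print(f"rejected: {err}")
```

In this sketch, the wildcard behavior is only dangerous because nothing checks the rule before it is applied. Rejecting a blank entry at input time, or requiring an explicit wildcard token, would turn the same slip into an error message rather than an outage.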

The report does not identify the vendor that supplied the network management software. Cisco is a supplier (PDF) of network management systems to Level 3.

Level 3 was aware it had a problem within four minutes, the FCC report said. The problem was difficult to diagnose, however, because no one at Level 3 was aware of the consequences of leaving that particular field empty, nor had anyone at the company previously seen the system behave the way it was behaving.

The outage affected approximately 29.4 million interconnected VoIP users and approximately 2.3 million wireless users. The full tally of calls that failed to go through exceeded 111 million. “This nationwide outage was the largest ever reported in NORS,” the FCC said.

The FCC report said Level 3 subsequently adopted measures to prevent a recurrence of the problem. Those measures, the FCC drily noted, are in accord with best practices that the Communications Security, Reliability and Interoperability Council had adopted five years earlier.

Less than three weeks after the Level 3 disruption, on Oct. 21, the U.S. experienced what at the time was one of the worst disruptions of the internet ever. Many prominent websites ground nearly to a halt for hours, particularly on the East Coast but elsewhere as well. The cause was a series of distributed denial-of-service (DDoS) attacks against DNS provider Dyn, together the largest to that date.

Some immediately accused Level 3 of being the culprit behind that outage as well. In fact, the company had nothing to do with the attack on Dyn, but the accusations did expose how Level 3’s reputation had been compromised.

Then again, Level 3 hasn’t done much to repair its image either. About a year later, on Nov. 7, 2017, the company experienced yet another outage, this one in backhaul systems, which disrupted service to customers of Comcast, Charter, Cox Communications and Verizon, among other service providers.

CenturyLink agreed to buy Level 3 at the end of October 2016; the acquisition closed on Nov. 1, 2017.