An outage on Comcast’s Xfinity network left customers in several major U.S. cities without internet access on Tuesday, sparking questions about what went wrong and why. The operator has yet to issue an official account of the incident, but experts from two network monitoring companies shed some light on what was happening behind the scenes during the event and what the problem might have been.
Angelique Medina, director of Product Marketing at Cisco-owned ThousandEyes, told Fierce the outage was unusual due to its scope, duration and some strange routing activity that occurred during the incident.
She stressed her insights were not based on direct knowledge of the incident from Comcast, but instead on the highly granular performance data ThousandEyes collects by sending test traffic over major service provider networks. That data allows it to track things like latency, congestion and where traffic is getting dropped on a per-router basis, she said.
Regarding the incident’s scope, Medina noted it appears to have actually consisted of two back-to-back outages rather than just one. ThousandEyes data showed the first hit the San Francisco Bay Area and lasted from approximately 9:40 p.m. to 11:25 p.m. PT on Monday evening (12:40 a.m. to 2:25 a.m. ET Tuesday). A second, larger event began at 5:05 a.m. PT (8:05 a.m. ET) on Tuesday and impacted service in New York, Chicago, Pittsburgh, Atlanta, Virginia and Sunnyvale, California, among other locations. While the first incident was characterized by a continuous loss of service, the second produced more intermittent disruption, she said.
“We’ve definitely seen incidents in which it’s either a localized event or it’s maybe something that’s like really broad and it’s impacting the entire network of a service provider,” Medina said. But the two-part nature of the Comcast incident and “the fact that there was something that seemed to not go right in Sunnyvale and then hours go by and there’s sort of an issue that’s even broader in impact…that definitely was sort of interesting.”
Doug Madory, director of internet analysis at network monitoring startup Kentik, made a similar point, telling Fierce the Comcast incident was “pretty unusual” due to its breadth. Normal outages tend to be “single events,” but in this case there were “multiple discrete outages across different parts of the country,” creating a sort of constellation of incidents, he said.
While some traffic on Comcast’s network was dying locally during the second outage, Medina noted there was also some strange routing activity that popped up which hadn’t been happening before the incident began. For instance, she said traffic on some healthy routes on the East Coast was getting rerouted to one of the impacted locations in California.
“So, for example, there was an instance in which I believe the destination was either Tennessee or some location in Georgia and it was coming from New York, and at some point during the outage it was getting routed to Sunnyvale where the traffic was getting blackholed,” she explained. “It seemed like the nature of this outage was such that not only was there failure traffic loss at a pretty broad surface area within Comcast’s network but also that some either side effect or the same cause was also creating some routing abnormalities as well.”
Medina said it was also notable that the outage lasted as long as it did, calling the duration “pretty lengthy” for a provider like Comcast.
Not an attack
Asked whether the widespread nature of the outage could indicate it was some sort of attack, Medina said the event didn’t have the hallmarks of something like a distributed denial-of-service (DDoS) event. First, she said a DDoS would have to be “really massive” to impact such a wide surface area. But even putting that factor aside, she said there was no real ramp-up like you would typically see during a DDoS attack.
“This was from zero to 100% packet loss like that and typically I’ve not seen that in a DDoS situation,” Medina explained. She and Madory both pointed to some sort of control plane or management software issue as a more likely culprit.
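The distinction Medina draws, between loss that jumps straight to 100% and loss that builds gradually, can be illustrated with a small sketch. This is not how ThousandEyes or Kentik analyze their data; the function name `onset_shape`, the thresholds and the sample series are all hypothetical, chosen only to show the idea of comparing the sample just before total loss with the sample where total loss begins.

```python
from typing import Sequence


def onset_shape(loss_pct: Sequence[float],
                low: float = 5.0,
                high: float = 95.0) -> str:
    """Classify how a packet-loss series reached (near-)total loss.

    loss_pct: packet loss per measurement interval, in percent.
    low/high: illustrative thresholds (assumptions, not industry values).

    Returns:
      "step"    - loss went from healthy (<= low) to total (>= high)
                  between two consecutive samples
      "ramp"    - loss passed through intermediate values first
      "none"    - loss never reached `high`
      "unknown" - the series starts already at total loss
    """
    for i, value in enumerate(loss_pct):
        if value >= high:
            if i == 0:
                return "unknown"  # no earlier sample to compare against
            previous = loss_pct[i - 1]
            return "step" if previous <= low else "ramp"
    return "none"


# Hypothetical series: an abrupt cutoff versus a gradual build-up.
abrupt = [0.0, 0.0, 100.0, 100.0]
gradual = [0.0, 20.0, 55.0, 80.0, 100.0]
print(onset_shape(abrupt), onset_shape(gradual))
```

A real analysis would also have to account for sampling interval, since a fast ramp can look like a step if measurements are taken too far apart.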
“It’s not going to be a physical piece of infrastructure, it’s going to be some sort of software that failed in some way,” Madory said. “These are all large, very complex networks…you have to have pretty elaborate automation to manage the complexity but the automation ends up being kind of complex in and of itself. Sometimes that creates space for an error to creep in despite your best efforts and best practices.”
Medina said it’s not out of the realm of possibility that an update or some other kind of maintenance created an issue with the control plane.
“We’ve seen that with things like, for example, Facebook. They did some update and they took down their whole backbone,” she said, pointing to CenturyLink’s outage from last year as another example. “It has sort of the markings of something that damaged how the network is managed.”
Comcast did not respond to a request for comment about what was behind the Xfinity outage.
Regardless of the cause, Recon Analytics founder Roger Entner said the incident provides an opening for fiber players to hammer home the idea that their product is superior to cable.
Referring to Comcast, he said, “I don’t think it’ll hurt them much. But that will not stop their fiber competitors from probably pointing it out, [from saying] that fiber is more reliable than cable.”