Company that's supposed to prevent outages caused a big one

  • A bad CrowdStrike update crashed Microsoft Windows machines around the world, causing widespread outages

  • CrowdStrike said a fix is now available, but implementing it could be tricky and take time

  • Analysts were left scratching their heads and wondering how this could happen

IT professionals woke up to quite the headache early Friday morning. A bad update from cloud cybersecurity company CrowdStrike bricked millions of Microsoft Windows machines around the world – those used by banks, airlines, governments and even 911 systems – leaving them stuck on the dreaded blue screen.

The incident left analysts and techies across the globe wondering how such a catastrophic failure happened. And involving one of the very companies whose sole purpose is preventing outages, no less.

As AvidThink Founder Roy Chua noted on LinkedIn, “One would think this error should have been caught much earlier as part of a good software development QA, CI/CD process. Or a staged rollout for a Falcon channel update?”

He added that on one hand, the scope of the outage reflects CrowdStrike’s impressively large global footprint. But on the other hand, he said it highlights the fact that “developers who provide auto-updates for near-system-level software (drivers, security sensors, monitoring and telemetry sensors) that hook deep into OSes need to vet their updates with much more diligence.”

Similarly, Moor Insights and Strategy Founder Patrick Moorhead questioned “why enterprises globally update a .sys file without an airgapped test prior to deployment. Speed? Confidence because ‘it never happened before’?”

Meanwhile, Patrick Kelly, Founder of Appledore Research, had an eye on the fallout. He warned that the outage will “cost companies hundreds of billions of $$$ in lost productivity. Although a fix is in process, it will take many weeks to unwind the damage.”

CrowdStrike, what happened?

According to a statement from Microsoft on X, an issue impacting customers’ ability to use their Microsoft 365 services was first noticed on Thursday evening.

Microsoft’s outage dashboard indicated that the breakdown was caused by a “configuration change in a portion of our Azure backend workloads” which interrupted the connection between storage and compute resources. That in turn resulted in access failures downstream.

Intel from Fierce Network's internal IT team suggests the outage impacted the Central US region of Azure.

CrowdStrike CEO George Kurtz said on X that Mac and Linux host devices were not impacted by the faulty update. Kurtz added that as of 5:45 am ET Friday, “The issue has been identified, isolated and a fix has been deployed.”

CrowdStrike has apologized and has been posting updates on the situation on its website.

But just because a fix is available doesn’t necessarily mean that implementing it will be easy.

As Moorhead pointed out on X, “Modern enterprise Windows PCs have BIOS-based tools that can put a device into a known good state on an automated, mass-scale. These tools can also go in and delete a specific file and reboot the machine.”

The question, he continued, is how many of the impacted PCs have that capability. If they don’t, “a human will likely have to intervene to go in and delete that offending CrowdStrike .sys file.”
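
For a sense of what that manual intervention involves, here is a minimal Python sketch of the cleanup step. It assumes the widely reported workaround (boot the affected Windows host into Safe Mode, delete the faulty CrowdStrike channel file, then reboot); the directory path and the "C-00000291*.sys" filename pattern reflect CrowdStrike's published guidance at the time rather than anything stated in this article, so treat them as assumptions to verify against official advisories.

from pathlib import Path

# Directory and filename pattern are assumptions based on CrowdStrike's
# widely reported workaround for this incident; confirm against the
# vendor's official advisories before running anything like this.
CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
FAULTY_PATTERN = "C-00000291*.sys"  # channel file implicated in the crashes

def remove_faulty_channel_files(dry_run=True):
    """List (and optionally delete) channel files matching the faulty pattern."""
    matches = list(CROWDSTRIKE_DIR.glob(FAULTY_PATTERN))
    for channel_file in matches:
        print(("Would delete: " if dry_run else "Deleting: ") + str(channel_file))
        if not dry_run:
            channel_file.unlink()  # needs admin rights, typically from Safe Mode
    return matches

if __name__ == "__main__":
    # Dry run by default; pass dry_run=False only after booting into Safe Mode
    # and confirming the steps against official guidance.
    remove_faulty_channel_files()

Even with a script like this, each affected machine still has to be booted into Safe Mode by hand, which is why analysts expect recovery to take time.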

As of 10 am ET Friday, Microsoft stated that mitigation efforts remained underway but that its metrics indicated “that the remaining impacted scenarios are progressing towards a full recovery.”

More cloud security woes

Cloud security is already under scrutiny this week as AT&T recovers from a data breach due to a security lapse with Snowflake.

“The breach was caused by exploiting the inherent vulnerability of single-factor credentials – stolen Snowflake customer credentials – that were then used in a credential-stuffing attack to gain access to the customer's databases,” Semperis principal technologist Sean Deuby told Fierce earlier in the week.

This is a developing story.