CrowdStrike failure: What you need to know

Cybersecurity vendor CrowdStrike initiated a series of computer system outages across the world on Friday, July 19, disrupting nearly every industry and sowing chaos at airports, financial institutions, and healthcare systems, among others.

At issue was a flawed update to CrowdStrike Falcon, the company’s popular endpoint detection and response (EDR) platform, which crashed Windows machines and sent them into an endless reboot cycle, taking down servers and rendering ‘blue screens of death’ on displays across the world.

How did the CrowdStrike outage unfold?

Australian businesses were among the first to report problems on Friday morning, with some continuing to feel the effects throughout the day. Travelers at Sydney Airport experienced delays and cancellations. At 6pm Australian Eastern Standard Time (08:00 UTC), Bank Australia posted an announcement to its home page saying that its contact center services were still experiencing problems.

Businesses across the globe followed suit as their days began. Travelers at airports in Hong Kong, India, Berlin, and Amsterdam encountered delays and cancellations. According to the New York Times, the Federal Aviation Administration reported that US airlines grounded all flights for a period of time.

What has been the impact of the CrowdStrike outage?

As one of the largest cybersecurity companies, CrowdStrike makes software that is widely used by businesses across the globe. For example, over half of Fortune 500 companies use security products from CrowdStrike, which CSO ranks No. 6 on its list of the most powerful cybersecurity companies.

Because of this, fallout from the flawed update has been widespread and substantial, with some calling it the “largest IT outage in history.”

To give a sense of the scale, more than 3,000 flights within, into, or out of the US were canceled on July 19, and more than 11,000 were delayed. Flights continued to be disrupted in the days that followed, with nearly 2,500 more flights within, into, or out of the US canceled and more than 38,000 delayed as of three days after the outage.

The outage also significantly impacted the healthcare industry, with some healthcare systems and hospitals postponing all or most procedures and clinicians resorting to pen and paper, unable to access EHRs.

On July 20, Microsoft reported that an estimated 8.5 million Windows devices had been impacted by the outage.

Given the nature of the fix for many enterprises, and the popularity of CrowdStrike’s software, IT organizations have been working around the clock to restore their systems, with many still mired in recovery days after CrowdStrike served up the initial faulty update.

What caused the CrowdStrike outage?

In a blog post on July 19, CrowdStrike CEO George Kurtz apologized to the company’s customers and partners for crashing their Windows systems. Separately, the company provided initial details about what caused the disaster.

According to CrowdStrike, a defective content update to its Falcon EDR platform was pushed to Windows machines at 04:09 UTC (12:09 a.m. ET) on Friday, July 19. CrowdStrike typically pushes updates to configuration files (called “Channel Files”) for Falcon endpoint sensors several times a day.

The defect that triggered the outage was in Channel File 291, which is stored in “C:\Windows\System32\drivers\CrowdStrike\” with a filename beginning “C-00000291-” and ending “.sys”. Channel File 291 passes information to the Falcon sensor about how to evaluate “named pipe” execution, which Windows systems use for intersystem or interprocess communication. Named pipes are not inherently malicious but can be abused by attackers.
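Named pipes are an ordinary Windows IPC primitive, not something unique to malware. As a rough illustration (not anything drawn from CrowdStrike’s code), the short Python sketch below uses the standard library’s multiprocessing.connection module, which backs addresses of the form \\.\pipe\... with Windows named pipes; the pipe name demo_pipe is made up for the example, and it only runs on Windows.

```python
from multiprocessing.connection import Listener, Client
from threading import Thread

PIPE_ADDRESS = r"\\.\pipe\demo_pipe"   # hypothetical pipe name, Windows-only

# Create the named pipe up front so the "client" below cannot race ahead of it.
listener = Listener(PIPE_ADDRESS, family="AF_PIPE")

def server() -> None:
    # Accept one connection, read a message, and send an acknowledgement.
    with listener.accept() as conn:
        print("server received:", conn.recv())
        conn.send("ack")

t = Thread(target=server)
t.start()

# A second party (a thread here for brevity; normally another process)
# connects to the same pipe and exchanges ordinary messages. This is the
# benign named-pipe traffic the defective update was meant to distinguish
# from command-and-control misuse.
with Client(PIPE_ADDRESS, family="AF_PIPE") as conn:
    conn.send("hello over a named pipe")
    print("client received:", conn.recv())

t.join()
listener.close()
```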

“The update that occurred at 04:09 UTC was designed to target newly observed, malicious named pipes being used by common C2 [command and control] frameworks in cyberattacks,” the technical post explained.

However, according to CrowdStrike, “The configuration update triggered a logic error that resulted in an operating system crash.”

Upon automatic reboot, the Windows systems with the defective Channel File 291 installed would crash again, causing an endless reboot cycle.

In a follow-up post on July 24, CrowdStrike provided further details on the logic error: “When received by the sensor and loaded into the Content Interpreter, problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception. This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD).”
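CrowdStrike has not published the channel-file format, but the failure class it describes, an out-of-bounds read raising an exception that nothing is prepared to handle, is easy to sketch. The Python below is purely illustrative: the record layout and field counts are invented, and a user-mode script merely catches an IndexError where kernel-mode sensor code instead took down Windows.

```python
# Conceptual sketch only. The record layout and the 20-vs-21 field counts
# below are invented to illustrate the failure class, not the real data.

def interpret_channel_content(record: list[str], expected_fields: int) -> None:
    """Walk a fixed number of fields, the way a content interpreter might."""
    for i in range(expected_fields):
        _field = record[i]  # raises IndexError once i runs past the real data

# A hypothetical record from the problematic channel file: fewer fields
# than the interpreter expects.
record_from_channel_file = ["named-pipe-detection-rule"] * 20

try:
    interpret_channel_content(record_from_channel_file, expected_fields=21)
except IndexError as exc:
    # User-mode Python surfaces the bad read as a catchable exception.
    # In kernel-mode sensor code, an equivalent unhandled out-of-bounds read
    # brings down the whole operating system (the BSOD described above).
    print("out-of-bounds read:", exc)
```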

The defective update, which included new exploit signatures, was part of CrowdStrike’s Rapid Response Content program, which the company says goes through less rigorous testing than do updates to Falcon’s software agents. Whereas customers have the option of operating with the latest version of Falcon’s Sensor Content, or with either of the two previous versions if they prefer reliability over coverage of the most recent attacks, Rapid Response Content is deployed automatically to compatible sensor versions.

The flawed update only impacted machines running Windows. Linux and macOS machines using CrowdStrike were unaffected, according to the company.

How has CrowdStrike responded?

According to the company, CrowdStrike pushed out a fix removing the defective content in Channel File 291 just 79 minutes after the initial flawed update was sent. Machines that had not yet updated to the faulty Channel File 291 update would not be impacted by the flaw. But those machines that had already downloaded the defective content weren’t so lucky.

To remediate those systems caught up in endless reboot, CrowdStrike published another blog post with a far longer set of actions to perform. Included were suggestions for remotely detecting and automatically recovering affected systems, with detailed sets of instructions for temporary workarounds for affected physical machines or virtual servers, including manual reboots.
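The manual workaround that circulated most widely (and that the article returns to below) boils down to booting an affected machine into safe mode, deleting the defective Channel File 291, and rebooting. As a minimal sketch, assuming the machine can boot far enough to run a script, the Python below locates files matching the naming pattern documented above; it defaults to a dry run, and CrowdStrike’s own published remediation guidance should be followed on real systems.

```python
# Sketch of the core file-deletion step from the widely circulated manual
# workaround. This is an illustration, not CrowdStrike's official tooling.
from pathlib import Path

CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")

def remove_defective_channel_file(dry_run: bool = True) -> list[Path]:
    """Find (and optionally delete) files matching the Channel File 291 pattern."""
    matches = sorted(CROWDSTRIKE_DIR.glob("C-00000291-*.sys"))
    for path in matches:
        print(("would delete" if dry_run else "deleting"), path)
        if not dry_run:
            path.unlink()
    return matches

if __name__ == "__main__":
    # Default to a dry run so nothing is removed unless explicitly requested.
    remove_defective_channel_file(dry_run=True)
```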

On July 24, CrowdStrike reported on the testing process lapses that led to the flawed update being pushed out to customer systems. In its post-mortem, the company blamed a hole in its testing software that caused its Content Validator tool to miss a flaw in the defective Channel File 291 content update. The company has pledged to improve its testing processes by ensuring updates are tested locally before being sent to clients, adding additional stability and content interface testing, improving error handling procedures, and introducing a staggered deployment strategy for Rapid Response Content.
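CrowdStrike has not detailed what its staggered deployment will look like, but the general technique, often called a ring or canary rollout, pushes an update to progressively larger slices of the fleet and halts on bad telemetry. The sketch below is hypothetical: the ring names, percentages, and health check are illustrative stand-ins, not CrowdStrike’s design.

```python
# Illustrative sketch of a staggered ("ring") rollout. All names, fractions,
# and the health check are made up for this example.
import random

ROLLOUT_RINGS = [
    ("internal canary hosts", 0.001),
    ("early-adopter customers", 0.01),
    ("broad fleet, wave 1", 0.25),
    ("broad fleet, wave 2", 1.0),
]

def fleet_is_healthy(ring_name: str) -> bool:
    # Stand-in for real telemetry (crash rates, sensor heartbeats, etc.).
    return random.random() > 0.02

def deploy_content_update(update_id: str) -> None:
    for ring_name, cumulative_fraction in ROLLOUT_RINGS:
        print(f"pushing {update_id} to {ring_name} "
              f"({cumulative_fraction:.1%} of fleet, cumulative)")
        if not fleet_is_healthy(ring_name):
            print(f"health check failed in {ring_name}; halting rollout")
            return
    print(f"{update_id} fully deployed")

deploy_content_update("example-content-update")  # hypothetical update id
```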

CrowdStrike has also sent $10 in Uber Eats credits to IT staff for the “additional work” they put in helping CrowdStrike clients recover, TechCrunch reported. The email, sent by CrowdStrike Chief Business Officer Daniel Bernard, said in part, “To express our gratitude, your next cup of coffee or late night snack is on us!” A CrowdStrike representative confirmed to TechCrunch that the Uber Eats coupons were flagged as fraud by Uber due to high usage rates.

On July 25, CrowdStrike CEO Kurtz took to LinkedIn to assure customers that the company “will not rest until we achieve full recovery.”

“Our recovery efforts have been enhanced thanks to the development of automatic recovery techniques and by mobilizing all our resources to support our customers,” he wrote.

What went wrong with CrowdStrike testing?

CrowdStrike’s review of its testing shortcomings noted that, whereas rigorous testing processes are applied to new versions of its Sensor Content, Rapid Response Content, which is delivered as a configuration update to Falcon sensors, goes through less-rigorous validation.

In developing Rapid Response Content, CrowdStrike uses its Content Configuration System to create Template Instances that describe the hallmarks of malicious activity to be detected, storing them in Channel Files that it then tests with a tool called the Content Validator.

According to the company, disaster struck when two Template Instances were deployed on July 19. “Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data,” CrowdStrike said in its review.

How has recovery from the outage fared?

For many organizations, recovering from the outage is an ongoing issue. With one suggested solution being to manually reboot each affected machine into safe mode, delete the defective file, and restart the computer, doing so at scale remains a challenge.

Some organizations with hardware refresh plans in place are reportedly considering accelerating those plans, replacing affected machines rather than committing the resources necessary to apply the manual fix across their fleets.

On July 25, CrowdStrike CEO Kurtz posted to LinkedIn that “over 97% of Windows sensors are back online as of July 25.”

What is CrowdStrike Falcon?

CrowdStrike Falcon is endpoint detection and response (EDR) software that monitors end-user hardware devices across a network for suspicious activities and behavior, reacting automatically to block perceived threats and saving forensics data for further investigation.

Like all EDR platforms, CrowdStrike has deep visibility into everything happening on an endpoint device — processes, changes to registry settings, file and network activity — which it combines with data aggregation and analytics capabilities to recognize and counter threats by either automated processes or human intervention. 

Because of this, Falcon is privileged software with deep administrative access to the systems it monitors. It is tightly integrated with the core operating system and can shut down activity it deems malicious. That tight integration proved to be a weakness for IT organizations in this instance, with the flawed Falcon update rendering Windows machines inoperable.

CrowdStrike has also introduced AI-powered automation capabilities into Falcon for IT, which the company says help bridge the gap between IT and security operations.

What has been the fallout of CrowdStrike’s failure?

In addition to dealing with fixing their Windows machines, IT leaders and their teams are evaluating lessons that can be gleaned from the incident, with many looking at ways to avoid single points of failure and re-evaluating their cloud strategies. Industry thought leaders are also questioning the viability of administrative software with privileged access, like CrowdStrike’s.

As for CrowdStrike, US Congress has called on CEO Kurtz to testify at a hearing about the tech outage. According to the New York Times, Kurtz was sent a letter by Representative Mark Green (R-Tenn.), chairman of the Homeland Security Committee, and Representative Andrew Garbarino (R-NY).

Americans “deserve to know in detail how this incident happened and the mitigation steps CrowdStrike is taking,” they wrote in their letter to Kurtz. Kurtz was involved in a similar situation in 2010, when he was CTO of McAfee and the company pushed out a faulty antivirus update that impacted thousands of customers, triggering BSODs and creating the effect of a denial-of-service attack.

The full financial impact of the outage has yet to be tallied, but Derek Kilmer, a professional liability broker at Burns & Wilcox, said he expects insured losses of up to $1 billion or “much higher,” according to the Financial Times. Insurer Parametrix pegs the losses at $5.4 billion for US Fortune 500 companies alone, excluding Microsoft, Reuters reported.

Based on Microsoft’s initial estimate of 8.5 million Windows devices impacted, research firm J. Gold Associates has projected IT remediation costs of $701 million, based on the 12.75 million resource-hours it estimates internal technical support teams will need to repair the machines. Couple that with Parametrix’s finding that “loss covered under cyber insurance policies is likely to be no more than 10% to 20%, due to many companies’ large risk retentions,” and the financial hit from the CrowdStrike outage is likely to be enormous.

Ongoing coverage of the CrowdStrike failure

  • July 19: Blue screen of death strikes crowd of CrowdStrike servers 
  • July 20: CrowdStrike CEO apologizes for crashing IT systems around the world, details fix 
  • July 20: Put not your trust in Windows — or CrowdStrike 
  • July 22: CrowdStrike incident has CIOs rethinking their cloud strategies 
  • July 22: Microsoft pins Windows outage on EU-enforced ‘interoperability’ deal 
  • July 22: Focusing open source on security, not ideology 
  • July 22: Early IT takeaways from the CrowdStrike outage 
  • July 24: CrowdStrike blames testing shortcomings for Windows meltdown
  • July 24: CrowdStrike meltdown highlights IT’s weakest link: Too much administration
  • July 25: CIOs must reassess cloud concentration risk post-CrowdStrike
  • July 26: 97 per cent of CrowdStrike Windows sensors back online
  • July 26: Counting the cost of CrowdStrike: the bug that bit billions

Originally published on July 23, 2024, this article has been updated to reflect evolving developments.
