Tech Perspectives

CrowdStrike Update Outage Exposes Four Critical Issues: Next Steps for CIOs

Organizations must redefine their long-term cyber recovery strategy.

IDC’s Quick Take

The recent IT outage caused by silent updates that CrowdStrike pushed to its Falcon agent exposes an issue at the heart of how the IT industry operates. It highlights the contrasting trust and attestation mechanisms that operating system vendors like Microsoft, Apple, and Red Hat use when granting their ecosystems of independent software vendors (ISVs) direct access to certain parts of the operating system stack, especially to software that can severely damage the system kernel.

While this issue impacted Windows devices (both network- and human-centric) managed by CrowdStrike, no iOS, macOS, or even Linux devices were affected. That is very telling and should compel vendors like Microsoft and Apple to take a long, hard look at what “openness” means in the wake of regulations like the EU’s Digital Markets Act (DMA). It should also compel the largely Windows-dependent customer base to redefine its long-term cyber recovery strategy, including a shift to more modern operating system environments.

Event Highlights

On July 19, 2024, at 04:09 UTC, CrowdStrike released a sensor configuration update for Windows systems as part of the Falcon platform’s protection mechanisms. This update contained a logic error that triggered a “blue screen of death” (BSOD) on affected systems. A remediation was implemented by 05:27 UTC the same day.

According to CrowdStrike, the impact was specific to customers running the Falcon sensor for Windows, version 7.11 and above. It bears pointing out that to make their endpoint protection products effective, vendors like CrowdStrike require access to system files. Any configuration issue with these files can lead to unpredictable behavior at best and leave the system in an unrecoverable state at worst.

The resulting outage disrupted airlines, businesses, and emergency services and may prove to be the largest IT outage in history. In time, we will know whether its scale and impact reach the level of the “NotPetya” cyberattack of 2017. At the time of writing, two days later, airlines, the biggest group of affected enterprises, were still reeling from the outage.

It is important to note that this incident was not caused by a cyberattack but rather by a routine update to configuration files, often referred to as “Channel Files.” In the context of the Falcon sensor, Channel Files are integral to the behavioral protection mechanisms that safeguard systems against cyber threats. These files are dynamically updated multiple times daily, and the Falcon sensor’s architecture, designed to incorporate such updates seamlessly, has been a foundational component of the product.

In Windows environments, Channel Files are typically located within the directory path C:\Windows\System32\drivers\CrowdStrike\, identifiable by their “C-” prefix. Each file is uniquely numbered, serving as an identifier that aids in the management and deployment of updates. For instance, Channel File 291, denoted by the filename “C-00000291-“, plays a crucial role in how Falcon assesses the execution of named pipes—a standard method for interprocess communication within Windows systems.
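To make the naming scheme concrete, here is a minimal sketch of how an administrator might enumerate these files and spot the Channel File 291 variants. This is illustrative Python, not CrowdStrike tooling; it assumes the directory path and “C-” prefix described above, and the “.sys” extension is an assumption based on public incident reports.

```python
# Illustrative sketch only -- not CrowdStrike tooling. Assumes the
# directory path and "C-" naming convention described above; the
# ".sys" extension is an assumption from public incident reports.
from pathlib import Path

CHANNEL_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")

def list_channel_files(directory: Path = CHANNEL_DIR) -> None:
    """Enumerate channel files and flag Channel File 291 variants."""
    for f in sorted(directory.glob("C-*.sys")):
        # Channel File 291 carries the "C-00000291" filename prefix.
        flag = "  <-- Channel File 291" if f.name.startswith("C-00000291") else ""
        print(f"{f.name}{flag}")

if __name__ == "__main__":
    list_channel_files()
```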

The significance of Channel File 291 came to the forefront during an update aimed at neutralizing the threat posed by malicious named pipes associated with prevalent Command and Control (C2) frameworks. The update introduced a logic error, leading to a system crash.
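The specifics of the defect are internal to CrowdStrike, but for intuition, here is a hedged sketch of the general failure class: code that indexes into a parsed configuration record without first validating its shape. In user space this raises a catchable exception; in a kernel-mode component, the equivalent out-of-bounds access can take down the entire system.

```python
# Hedged illustration of the general failure class only -- this is
# not CrowdStrike's code, file format, or actual defect.

def parse_record_unsafe(record: str) -> str:
    fields = record.split(",")
    return fields[20]  # assumes at least 21 fields; raises IndexError if fewer

def parse_record_defensive(record: str, expected: int = 21):
    fields = record.split(",")
    if len(fields) < expected:
        return None  # reject malformed input instead of crashing
    return fields[20]
```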

IDC’s Point of View

For historical context, this is not the first time something like this has happened. In 2010, McAfee’s “DAT” file version 5958 caused a reboot loop and loss of network access on Windows XP SP3 systems: a false positive misidentified the Windows binary “svchost.exe” as the virus “W32/Wecorl.a.” In 2017, Webroot released an update that misidentified Windows system files as malware and Facebook as a phishing site; the update quarantined essential files, destabilizing numerous computers. In 2021, a mass internet outage was caused by a bad software update from Fastly. There have been many others.

This situation – which is not unique to CrowdStrike – exposes four key issues that are fundamental to the IT industry and its complex ecosystem of ISVs.

  • First, it exposes the fact that by giving ecosystem ISVs direct access to the system kernel, the operating system vendor essentially removes itself from the trust value chain, which then includes only the ISV and its customers.
  • Second, the process of silent updates, in which customers implicitly rely on the ISV’s QA process, leaves them inadequately prepared for drastic and timely intervention when a mass outage leaves systems in an unrecoverable state.
  • Third, this situation is a wake-up call for the industry on what a system of checks and balances means and what accountability operating system vendors, ISVs, and customers must accept to keep this kind of situation from repeating itself.
  • Fourth, this situation indirectly exposes the fragility of the human-centric Windows stack, which, unlike modern network-centric Unix and Linux operating systems, cannot robustly handle exception errors and instead defaults to a state that must be recovered manually.

The first point exposes the contrasting approaches taken by leading operating system vendors. On one side are vendors like Apple that take a prescriptive, closed approach to endpoint protection, making it almost impossible for an ecosystem ISV/provider like CrowdStrike to push out configuration changes that could catastrophically impact the operating system (e.g., iOS or macOS) kernels. Apple has been a fierce advocate of the “walled garden,” implementing stringent attestation mechanisms to ensure that no one (and we mean no one) modifies the system kernel without express approval from Apple. This has put Apple afoul of the European Commission and its hawkish regulatory push to open up operating systems under the premise of fair competition. Microsoft, on the other hand, takes (or, more importantly, was forced to take) a more open approach, enabling at least a dozen ISVs to offer modern endpoint protection software. Here, too, regulation forced its hand: Microsoft says it cannot legally wall off its operating system the way Apple does because of an understanding it reached with the European Commission, following a 2009 complaint, to give makers of security software the same level of access to Windows that Microsoft gets.

And then there is the ecosystem of Linux vendors, staunch proponents of validating third-party software to keep enterprise customers from inadvertently triggering kernel-level “panics” en masse.

The second and third points above speak to the process of “silent patching” or “silent upgrades,” in which customers often do not have the luxury of QA’ing updates, and especially endpoint updates, before they are rolled out. For one, the updates are too frequent. For another, as history shows, they are usually harmless. Yet all it takes is one outage to call that implicit trust into question.
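One way to revisit that trust without QA’ing every update is to stage even “silent” content through deployment rings, promoting to the next ring only after a soak period and health check. The sketch below illustrates such a policy; the deploy/rollback hooks and hostnames are hypothetical, and whether an endpoint vendor exposes controls like these varies by product.

```python
# Hedged sketch of a ring-based rollout gate for vendor updates.
# deploy()/rollback() are hypothetical hooks into vendor- or
# MDM-specific APIs; hostnames and ring sizes are illustrative.

RINGS = [
    ("canary", ["host-001", "host-002"]),               # a few low-risk hosts first
    ("early",  ["host-010", "host-011", "host-012"]),   # a small slice of the fleet
    ("broad",  ["host-100", "host-101", "host-102"]),   # everyone else
]

def healthy(hosts: list[str]) -> bool:
    """Placeholder health check. In practice, query monitoring for
    crash or boot-loop telemetry on the just-updated hosts."""
    return True

def staged_rollout(update_id: str) -> bool:
    for ring, hosts in RINGS:
        print(f"Deploying {update_id} to ring '{ring}' ({len(hosts)} hosts)")
        # deploy(update_id, hosts)   # vendor-specific call
        if not healthy(hosts):       # gate promotion on the soak result
            print(f"Halting rollout at ring '{ring}'")
            # rollback(update_id, hosts)  # vendor-specific call
            return False
    return True

staged_rollout("channel-file-update-291")
```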

The final point speaks to a fundamental difference between modern operating systems, which are Unix- or Linux-based and designed from the ground up to be network-centric, and Windows variants, especially those used in embedded environments like the terminals used by banks and airlines, which are modified human-centric operating systems. While the IT industry has slowly migrated its infrastructure stacks to Unix or Linux environments, the embedded side often remains, for convenience, on Windows running on “bare metal.” Unfortunately, that leaves those systems exposed to cybersecurity threats, hence the reliance on external providers. When there is a catastrophic failure of any kind, the only recourse is to dispatch a technician with a USB drive to fix every affected unit. System rollback and recovery, available in many virtualized environments, is unfortunately not an option.
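To illustrate the contrast: in a virtualized fleet, mass recovery from a bad update can be scripted against the hypervisor. The sketch below assumes a hypothetical Hypervisor client standing in for a real API such as vSphere’s or libvirt’s bindings; bare-metal Windows endpoints offer no equivalent hook, hence the technician with the USB drive.

```python
# Hedged sketch of snapshot-based mass rollback in a virtualized
# environment. `Hypervisor` is a hypothetical stand-in for a real
# client library (e.g., vSphere or libvirt bindings).

class Hypervisor:
    """Toy client used only for illustration."""
    def list_vms(self) -> list[str]:
        return ["vm-terminal-01", "vm-terminal-02"]  # placeholder inventory

    def revert_to_snapshot(self, vm: str, name: str) -> None:
        print(f"{vm}: reverting to snapshot '{name}'")

    def power_on(self, vm: str) -> None:
        print(f"{vm}: powering on")

def mass_rollback(hv: Hypervisor, snapshot: str = "pre-update") -> None:
    # Revert every guest to the last known-good snapshot taken before
    # the faulty update, then bring it back online -- no USB drives needed.
    for vm in hv.list_vms():
        hv.revert_to_snapshot(vm, snapshot)
        hv.power_on(vm)

mass_rollback(Hypervisor())
```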

For those unfamiliar with CrowdStrike, the company is positioned as a Leader in the 2024 IDC MarketScape for worldwide modern endpoint security for enterprises. With its cloud-native platform, lightweight sensor agent, and early-to-market EDR capabilities, CrowdStrike quickly gained prominence in the modern endpoint security market. It expanded its cloud-native platform into adjacent security categories that also benefit from large and diverse data sets, integrated threat intelligence, and centralized analysis and, in the process, increased its wallet share among security-spending decision makers.

In this instance, CrowdStrike’s strength aggravated the issue. Its cloud-based analytics backend, paired with a lightweight agent on the endpoint, allows the solution to scale exceptionally well, according to CISOs with whom IDC speaks. Here, the same architecture allowed the logic error to scale just as efficiently.

To be clear, we are discussing more than a cybersecurity issue. CrowdStrike and its competitors simply rely on the endpoint configuration management and, more broadly, the IT service management framework that Microsoft developed for managing Windows devices. The problem with this framework is that it is a one-way street: once the system goes into recovery mode, there is no automatic way to recover it. Network recovery is possible, but only if the enterprise has set it up in advance.

For any affected entity, correcting the logic error in the CrowdStrike platform was the easy part. Identifying affected systems is even easier, as blue screens of death are hard to miss; you can perhaps still see them in many public places or in lights-out environments where institutions lacked the wherewithal for fast manual intervention. Recovering hyperconnected, hardened systems into a known good state or safe mode for remedial action is the hard part. Ensuring it does not happen again is perhaps the hardest part.

This incident, and the scale at which the entire global IT industry is reeling from it, should prompt CIOs to ask their organizations to:

  • Update Cyber Recovery Procedures: Revise recovery strategies to include automated or remote recovery for situations in which the system would otherwise require manual intervention.
  • Shift to a Better Operating Environment: Push for more reliable and verifiable solutions from operating system environment (OSE) vendors and endpoint security providers, built on network-centric operating systems.
  • Refocus on the Basics in the Short Term: Prioritize fundamental IT service management issues over emerging technologies like AI and generative AI. The depth and breadth of the impact from a basic unsupervised configuration change shows that sometimes the small things matter most.

The insights provided in this blog are sourced from IDC’s Endpoint Security and Future of Trust research. If you are interested in learning more about the cybersecurity best practices businesses should embrace, you can listen to the recent IDC webinar, Cybersecurity Norms and Trends: How Does Your Business Stack Up, via the link below.

IDC analysts Ashish Nadkarni and Matthew Eastwood contributed to this article.

Frank Dickson is the Group Vice President for IDC’s Security & Trust research practice. In this role, he leads the team that delivers compelling research in the areas of Security Services; Information and Data Security; Endpoint Security; Trust; Governance, Risk & Compliance; Identity & Digital Trust; IoT Security; Network Security; Privacy & Legal Tech; Security Analytics; Video Surveillance; and, new for 2022, Application Security & Fraud. Topically, he provides thought leadership and guidance for clients on a wide range of security topics, including ransomware and emerging products designed to protect transforming architectures and business models.