Why did CrowdStrike cause the Windows Blue Screen?
The ‘blue screen of death’ signals a catastrophic Windows failure, which is exactly what many people faced on 19 July 2024 – but why did it happen?
David William Plummer, a former Microsoft software engineer who developed Windows Task Manager, has posted a video describing how the CrowdStrike update could have caused Windows to halt.
He described CrowdStrike Falcon as anti-malware for Windows servers, which “proactively detects new attacks” and analyses application behaviour. To do this, CrowdStrike needs to run as a kernel device driver.
Kernel device drivers usually provide a way to abstract hardware, such as graphics cards, from applications. When they run, they generally have full access to the computer and operating system and, in operating system terminology, they are said to run at “Ring Zero”. This is different to application code, which users run in the operating system’s user space, known as “Ring Three”.
The difference, as Plummer notes, is that when a user application crashes, nothing else on the computer should be affected. However, a fault in code running at Ring Zero is considered so serious that the operating system immediately halts, which, in Windows, results in the so-called Blue Screen of Death.
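The user-space half of that distinction can be shown with a short Python sketch. It is only an illustration: the "application" here is a hypothetical child process that fails on purpose, and nothing in it touches the kernel. The equivalent fault at Ring Zero cannot be demonstrated safely, because it would halt the machine.

```python
import subprocess
import sys

# The child process stands in for a faulty user-space application.
child_code = "raise RuntimeError('simulated application crash')"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
)

# The child ended with a non-zero exit status...
print("child exit status:", result.returncode)
# ...but the parent process is unaffected and carries on.
print("parent still running")
```

The operating system cleans up the failed process and reports its exit status to the parent, which is why one crashing application does not take the rest of the system with it.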
“Even though there’s no hardware device that it’s really talking to, by writing the code as a device driver, CrowdStrike lives down in the kernel Ring Zero and has complete and unfettered access to the system data structures and the services that CrowdStrike believes it needs to do its job,” said Plummer.
Certified device drivers
Plummer noted that Microsoft, and likely also CrowdStrike, are aware of the stakes when software is running code in kernel mode, adding: “That’s why Microsoft offers the WHQL [Windows Hardware Quality Labs] certification.”
According to Plummer, the certification requires device driver software providers to test their code on various platforms and system configurations. The code is then signed digitally by Microsoft, which certifies that it is compatible with the Windows operating system. Plummer said the certification process means that Windows users can be reasonably confident that the driver software is robust and trustworthy.
Certification, however, is too slow for anti-malware products such as CrowdStrike, which need to ship an update every time there is a new threat. Plummer believes it is more likely that CrowdStrike frequently releases a definition file that is processed by its Windows kernel driver. This gets around the WHQL device driver certification process and means users have access to the latest protection.
“You can already perhaps see the problem,” he added. “Let’s speculate for a moment that the CrowdStrike dynamic definition file is not merely a malware definition but a complete program written in pseudocode that the driver can then execute.”
He said this would allow the device driver from CrowdStrike to execute the definition file as code running within the Windows kernel at Ring Zero, even though the update itself has never been signed. “Executing p-code [pseudocode] in the kernel is risky at best and, at worst, is asking for trouble,” said Plummer.
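To make Plummer's speculation concrete, here is a toy p-code interpreter in Python. It is a sketch only: the opcodes, the byte layout and the `run_pcode` function are invented for illustration and say nothing about how CrowdStrike's driver actually works. The point is that once a program interprets a data file in this way, the file's contents decide what the program does.

```python
# Toy p-code interpreter. The opcodes and byte layout are invented for
# illustration; they are not CrowdStrike's actual format.
OP_HALT = 0x00   # stop execution
OP_PUSH = 0x01   # push the next byte onto the stack
OP_ADD = 0x02    # pop two values, push their sum
OP_LOAD = 0x03   # pop an index, push memory[index]


def run_pcode(program, memory):
    """Execute the bytes in `program` and return the final stack."""
    stack = []
    pc = 0  # program counter: offset of the next byte to read
    while pc < len(program):
        op = program[pc]
        pc += 1
        if op == OP_HALT:
            break
        elif op == OP_PUSH:
            stack.append(program[pc])
            pc += 1
        elif op == OP_ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == OP_LOAD:
            index = stack.pop()
            # Deliberately no bounds check: a bad index raises IndexError here.
            # An unchecked read like this inside a kernel driver would instead
            # halt the whole machine.
            stack.append(memory[index])
        else:
            raise ValueError(f"unknown opcode {op:#x} at offset {pc - 1}")
    return stack


memory = [10, 20, 30]
program = bytes([OP_PUSH, 2, OP_LOAD, OP_PUSH, 5, OP_ADD, OP_HALT])
print(run_pcode(program, memory))  # [35]
```

A file such as `bytes([OP_PUSH, 9, OP_LOAD])` makes the same interpreter read past the end of `memory`. In Python that is a catchable exception. At Ring Zero there is nothing above the driver to catch it.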
By looking at crash dumps posted on X (formerly Twitter), Plummer said the crash was a “null pointer reference”, and that the CrowdStrike device driver appeared to have loaded a file containing only zeros rather than the actual pseudocode.
“We don’t know how or why this happened, but what we know is that the CrowdStrike driver that handles and processes these updates is not very resilient and appears to have inadequate error-checking and parameter validation,” he added.
Such checks are needed to ensure that the data values the software relies on are valid. If they are not, the error should not cause the entire system to crash, Plummer said.
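As a rough illustration of the kind of checking Plummer describes, the Python sketch below validates an update file before anything tries to interpret it. The file layout (a `DEFS` magic value followed by a length field) and the `load_definitions` and `apply_update` functions are hypothetical, chosen only to keep the example short. They are not CrowdStrike's format or code.

```python
import struct

# Hypothetical definition-file layout, invented for this sketch:
#   bytes 0-3  magic value b"DEFS"
#   bytes 4-7  payload length, little-endian unsigned 32-bit integer
#   bytes 8-   payload
MAGIC = b"DEFS"
HEADER_SIZE = 8


def load_definitions(blob):
    """Return the payload if `blob` is well formed, otherwise None."""
    if blob is None or len(blob) < HEADER_SIZE:
        return None                      # missing or truncated file
    if not any(blob):
        return None                      # every byte is zero
    if blob[:4] != MAGIC:
        return None                      # wrong file type or damaged header
    (length,) = struct.unpack_from("<I", blob, 4)
    if length == 0 or length != len(blob) - HEADER_SIZE:
        return None                      # declared size does not match the data
    return blob[HEADER_SIZE:]


def apply_update(blob, current):
    """Keep the previous definitions when the new file fails validation."""
    payload = load_definitions(blob)
    return current if payload is None else payload


good = MAGIC + struct.pack("<I", 3) + b"abc"
print(apply_update(good, b"old"))        # b'abc'
print(apply_update(bytes(64), b"old"))   # b'old'
```

The design choice is to reject a bad file and carry on with the last known good definitions, so that a malformed update costs some protection rather than the whole system.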
While it is often possible to restart Windows from the last known “good state”, which can remove rogue kernel drivers that prevent the operating system from booting up, Plummer said the situation was made worse by the fact that CrowdStrike is marked as a boot-start driver, which means it is needed for Windows to start up correctly.
While it is too early to understand how to ensure this never happens again, it is clear that there are serious limitations in Microsoft’s WHQL certification that allowed CrowdStrike to install an anti-malware update that had such a devastating impact across the Windows community.