CrowdStrike IT Outage Explained by a Windows Developer

Short Summary:
This video explains the July 2024 CrowdStrike outage that caused blue screens on Windows machines worldwide. The issue stemmed from a faulty update to the CrowdStrike software, which is processed by its kernel driver. The video covers the differences between kernel mode and user mode and the risks of running code in kernel mode. It explains how CrowdStrike's dynamic definition files, while intended to enhance security, may lead the driver to execute untrusted code in the kernel and crash the system. The video concludes with a step-by-step guide to fixing affected machines by deleting the faulty update file.
Detailed Summary:
Section 1: Introduction and Background
- The video starts with a brief introduction by Dave, a retired Microsoft software engineer, who explains his experience with blue screens and how the recent CrowdStrike outage impacted him.
- He states that the outage was caused by a faulty update to the CrowdStrike software.
- Dave outlines the three key points he will discuss:
  - Why CrowdStrike software is installed on machines.
  - What happens when a kernel driver fails.
  - Why the CrowdStrike update caused the issue and how to fix it.
Section 2: Kernel Mode vs. User Mode
- Dave explains the concept of kernel mode and user mode in operating systems, emphasizing that kernel mode is more privileged and has access to the entire system memory.
- He highlights the risk of running code in kernel mode: a crash there cannot be contained, so it brings down the entire system.
- He compares this to user mode, where application crashes only affect the specific application.
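The user-mode half of this contrast can be observed directly. The sketch below is an illustration rather than anything shown in the video: it starts a child process that reads from address 0 and shows that only the child dies while the parent keeps running. The helper name `run_crashing_child` is hypothetical. The kernel-mode equivalent of this mistake has no parent to fall back on, which is why it ends in a blue screen.

```python
import subprocess
import sys

# One-line program for the child process: read a C string from address 0.
# This is a deliberate null pointer dereference in user mode.
CRASHING_CHILD = "import ctypes; ctypes.string_at(0)"


def run_crashing_child() -> int:
    """Run the crashing program in a separate process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", CRASHING_CHILD],
        capture_output=True,
    )
    return result.returncode


if __name__ == "__main__":
    code = run_crashing_child()
    # The child ended abnormally (for example, killed by a signal on Linux),
    # but this parent process and the rest of the system are unaffected.
    print(f"child exited with code {code}; parent is still running")
```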
Section 3: CrowdStrike Falcon and Kernel Drivers
- Dave introduces CrowdStrike Falcon, a security product that uses a kernel driver to analyze application behavior and detect threats.
- He explains that the driver needs to run in kernel mode to access the necessary system data and services.
- He emphasizes the importance of driver certification through Microsoft's WHQL program to ensure driver robustness and trustworthiness.
Section 4: Dynamic Definition Files and Untrusted Code
- Dave discusses CrowdStrike's use of dynamic definition files, which are processed by the driver but not included in the driver itself.
- He raises concerns about the potential risks of executing untrusted code in the kernel through these definition files.
- He illustrates the concern with the scenario of a driver executing PE code contained in the definition files, where a bug or vulnerability in that code could crash the system.
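To make the concern concrete, here is a minimal sketch of a component that treats a data file as instructions. Everything in it is hypothetical: the opcodes, the `run` function, and the lookup table are invented for illustration and say nothing about how CrowdStrike's definition files actually work. The point is only that once a file's bytes steer execution, a malformed file can steer it somewhere invalid.

```python
# Hypothetical opcodes for a toy definition format (not CrowdStrike's).
OP_HALT = 0x00  # stop interpreting
OP_PUSH = 0x01  # push the next byte as a literal value
OP_LOAD = 0x02  # push table[next byte]
OP_ADD = 0x03   # pop two values, push their sum


def run(definition: bytes, table: list[int]) -> list[int]:
    """Interpret `definition` with no validation and return the final stack."""
    stack: list[int] = []
    pc = 0
    while pc < len(definition):
        op = definition[pc]
        if op == OP_HALT:
            break
        if op == OP_PUSH:
            stack.append(definition[pc + 1])
            pc += 2
        elif op == OP_LOAD:
            # The index comes straight from the file and is never checked.
            stack.append(table[definition[pc + 1]])
            pc += 2
        elif op == OP_ADD:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
            pc += 1
        else:
            raise ValueError(f"unknown opcode {op:#04x} at offset {pc}")
    return stack


if __name__ == "__main__":
    table = [10, 20]
    print(run(bytes([OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT]), table))
    try:
        run(bytes([OP_LOAD, 200]), table)  # index 200 is far outside the table
    except IndexError as exc:
        print("malformed definition:", exc)
```

In Python the bad index surfaces as an `IndexError` that the caller can catch. In a kernel driver written in C, the comparable unchecked read touches invalid memory, and there is nothing above the kernel to catch it.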
Section 5: Postmortem Debugging and the Root Cause
- Dave analyzes a crash dump report from the CrowdStrike outage, identifying the offending instruction that caused the crash.
- He explains that the crash was caused by a null pointer dereference, likely due to a faulty update file containing only zeros instead of the expected data.
- He highlights the lack of parameter validation in the CrowdStrike driver, noting that validating the input before using it could have prevented the crash.
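The kind of validation described above can be sketched as follows. The file layout here is an assumption made up for the example (a 4-byte magic value `DEFS`, a 32-bit entry count, then fixed-size entries); the real format is not described in the video. The function `parse_definitions` and the exception `InvalidDefinitionFile` are likewise hypothetical names.

```python
import struct

# Hypothetical layout: 4-byte magic, little-endian uint32 entry count,
# followed by `count` entries of two little-endian uint32 values each.
MAGIC = b"DEFS"
HEADER = struct.Struct("<4sI")
ENTRY = struct.Struct("<II")


class InvalidDefinitionFile(ValueError):
    """Raised when a definition file fails validation."""


def parse_definitions(blob: bytes) -> list[tuple[int, int]]:
    """Validate `blob` before reading any entry from it."""
    if len(blob) < HEADER.size:
        raise InvalidDefinitionFile("file is shorter than the header")
    if not any(blob):
        raise InvalidDefinitionFile("file contains only zero bytes")
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise InvalidDefinitionFile("unexpected magic value")
    expected_size = HEADER.size + count * ENTRY.size
    if len(blob) != expected_size:
        raise InvalidDefinitionFile(
            f"size mismatch: header implies {expected_size} bytes, got {len(blob)}"
        )
    return [
        ENTRY.unpack_from(blob, HEADER.size + i * ENTRY.size)
        for i in range(count)
    ]


if __name__ == "__main__":
    good = HEADER.pack(MAGIC, 2) + ENTRY.pack(1, 2) + ENTRY.pack(3, 4)
    print(parse_definitions(good))
    try:
        parse_definitions(bytes(64))  # a file of zeros, as in the outage
    except InvalidDefinitionFile as exc:
        print("rejected:", exc)
```

Rejecting the file up front turns a would-be crash into an error the caller can handle, for example by keeping the previous definitions in place.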
Section 6: Fixing the CrowdStrike Outage
- Dave provides a step-by-step guide on how to fix machines affected by the CrowdStrike outage.
- He recommends booting into safe mode to avoid loading the faulty driver.
- He instructs users to delete the update file located in the "C:\Windows\System32\drivers\CrowdStrike" folder.
- He concludes by stating that deleting the update file should resolve the issue and prevent further crashes.
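The deletion step can be scripted once the machine is in safe mode. This is a sketch, not a vetted recovery tool. The summary above does not name the file; the pattern `C-00000291*.sys` used below comes from CrowdStrike's public remediation guidance for this incident and should be treated as an assumption relative to the video. The function `remove_faulty_files` is a hypothetical helper, and it defaults to a dry run so that nothing is deleted until the matches have been reviewed.

```python
from pathlib import Path

# Folder named in the video.
DEFAULT_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
# File pattern from CrowdStrike's public guidance (not stated in the summary).
PATTERN = "C-00000291*.sys"


def remove_faulty_files(
    directory: Path = DEFAULT_DIR,
    pattern: str = PATTERN,
    dry_run: bool = True,
) -> list[Path]:
    """Return files in `directory` matching `pattern`; delete them unless dry_run."""
    matches = sorted(p for p in directory.glob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


if __name__ == "__main__":
    # Dry run first: list what would be removed without touching anything.
    for path in remove_faulty_files():
        print("would delete:", path)
```

After checking the dry-run output, calling `remove_faulty_files(dry_run=False)` from an administrator session performs the deletion, and a normal reboot follows.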
Notable Quotes:
- "When kernel mode crashes, the system crashes."
- "It's Risky Business at best and could be asking for trouble."
- "Parameter validation means checking to ensure that the data and arguments being passed to a function are valid and good."
- "CrowdStrike marked their driver as a boot driver, which means the system won't boot without it."
- "Deleting the update file should resolve the issue and prevent further crashes."