Published: Thu Jul 25 2024

Crowdstruck

Last week, a faulty update to CrowdStrike's Falcon security software (broadly speaking, an anti-virus product) caused global downtime. There are a lot of lessons to take from this.

Kernel mode software is special

CrowdStrike's faulty component was a kernel module.

Running software in your OS kernel is not something that comes for free. It has huge security and reliability implications and should be used only in very special circumstances where you fully trust the software.

In theory, anti-virus software needs to run in the kernel because a virus that gets into the kernel could otherwise hide from anti-virus products by intercepting their user-land calls. So, essentially, it's a trade-off. Is a virus a big enough threat to justify the risks of having third-party code sitting in your kernel? For me, absolutely not. For a large enterprise organisation full of non-technical users, well, maybe, but it's still a bad option.

macOS is trying to deprecate kernel extensions entirely and replace them with user-land equivalents (system extensions). I think Apple is right to do this. Developers shouldn't expect to have access to the kernel of a user's operating system. Users (and people who make organisational IT decisions) rarely understand the implications of allowing this. Of course, Apple is in the privileged position of fully owning both the hardware and the software, whereas other operating systems' kernels need extensive and broad hardware support.

Microsoft has said that they'd like to lock down access to the Windows kernel but are prevented by regulation. This is slightly misleading. They sell a competing product to CrowdStrike called Microsoft Defender (which is not the same as the Defender that comes with Windows), which runs in the kernel. The EU decided that MS needs to keep a level playing field and not shut out kernel access to third parties while selling their own products that benefit from special kernel access. This is a reasonable stance. But one wonders why Microsoft needs to sell additional security software for their own operating system.

The Linux model also works well in general: the kernel maintainers take responsibility for a huge number of necessary kernel modules like device drivers, so users generally don't need third-party blobs in their kernel (except Nvidia users).

Parsers are sources of insecurity

The actual problem was that CrowdStrike pushed an update to a virus definition file, and their parser crashed while reading it.

Parsers are classic sources of vulnerabilities in memory-unsafe languages, because parsing is fiddly and complex. Having a parser running inside the kernel seems like a really bad idea on the face of it. It may be necessary, but if so, it needs to be tested extensively.

The best way to test parsers is by fuzzing. Basically, you throw random data at the parser and check that it fails in a controlled manner. The best way to generate random data is to take valid data and randomly mutate it - a few bytes here and there to begin with, and then more substantial mutations once that stops revealing bugs.
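To make the idea concrete, here is a minimal sketch of a mutation fuzzer in Python. Everything in it is an assumption for illustration: the length-prefixed record format, the names parse_records, mutate and fuzz, and the ParseError exception are all made up, and none of it reflects CrowdStrike's actual file format or tooling. The point is only the shape of the loop: start from valid input, mutate it a little, mutate it more as the run goes on, and treat anything other than the parser's own controlled error as a bug.

```python
import random


class ParseError(Exception):
    """The controlled failure: the only exception the parser is allowed to raise."""


def parse_records(data: bytes) -> list[bytes]:
    """Toy format (hypothetical): each record is one length byte, then that many payload bytes."""
    records = []
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        # Bounds check before reading the payload. Forgetting checks like this
        # is the classic parser bug.
        if i + length > len(data):
            raise ParseError(f"record at offset {i - 1} overruns the input")
        records.append(data[i:i + length])
        i += length
    return records


def mutate(seed: bytes, n_bytes: int, rng: random.Random) -> bytes:
    """Overwrite n_bytes randomly chosen positions of a valid input with random values."""
    buf = bytearray(seed)
    for _ in range(n_bytes):
        if not buf:
            break
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def fuzz(parser, seed: bytes, iterations: int = 10_000, rng_seed: int = 0):
    """Return every (input, exception) pair where the parser failed in an uncontrolled way."""
    rng = random.Random(rng_seed)  # fixed seed so any failure is reproducible
    failures = []
    for i in range(iterations):
        # Start with single-byte mutations and escalate to eight as the run progresses.
        n_bytes = 1 + i * 8 // iterations
        sample = mutate(seed, n_bytes, rng)
        try:
            parser(sample)
        except ParseError:
            pass  # rejected cleanly: this is what we want
        except Exception as exc:
            failures.append((sample, exc))  # anything else is a bug
    return failures


if __name__ == "__main__":
    valid = b"\x03abc\x02hi"  # two records: b"abc" and b"hi"
    print(len(fuzz(parse_records, valid)), "uncontrolled failures")
```

In a memory-unsafe language the uncontrolled failure would be a crash or memory corruption rather than a Python exception, and in practice you would reach for a coverage-guided fuzzer rather than a hand-rolled loop, but the principle is the same.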

Given recent events, it appears unlikely that CrowdStrike's parser had been tested particularly well. There will now be a lot of attention on CrowdStrike to see if that parser can be exploited in any other way. Again, this is the problem with anti-virus products - they run in such a way that they increase the attack surface available to a virus.

Config changes are code changes

That this was a virus definition file update rather than a code update highlights that all changes pushed to production systems are code changes and should be treated as such. You might think that a minor config change doesn't need testing as much as a code change, but it does, because a config or definition change can cause different and previously untested code paths to be executed.
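One way to act on this is to put every config or definition file through the same loading code that production uses before it is allowed to ship. The sketch below is a rough illustration in Python, not a description of any vendor's pipeline: the JSON rule format, the field names, and the functions load_definitions and release_gate are all hypothetical.

```python
import json

# Hypothetical schema for a definition file: a JSON list of rules.
REQUIRED_FIELDS = ("id", "pattern", "action")
KNOWN_ACTIONS = {"log", "block"}


def load_definitions(raw: bytes) -> list[dict]:
    """Stand-in for the production loader. The gate must call the same code that production runs."""
    rules = json.loads(raw)
    if not isinstance(rules, list):
        raise ValueError("top level must be a list of rules")
    for n, rule in enumerate(rules):
        if not isinstance(rule, dict):
            raise ValueError(f"rule {n} is not an object")
        missing = [field for field in REQUIRED_FIELDS if field not in rule]
        if missing:
            raise ValueError(f"rule {n} is missing fields: {missing}")
        if rule["action"] not in KNOWN_ACTIONS:
            raise ValueError(f"rule {n} has unknown action {rule['action']!r}")
    return rules


def release_gate(raw: bytes) -> tuple[bool, str]:
    """Decide whether a candidate definition file may ship, treating any loader failure as a blocker."""
    try:
        rules = load_definitions(raw)
    except Exception as exc:  # deliberately broad: any failure here would be a failure in the field
        return False, f"rejected: {exc}"
    return True, f"accepted: {len(rules)} rules"


if __name__ == "__main__":
    candidate = b'[{"id": 1, "pattern": "evil.exe", "action": "block"}]'
    print(release_gate(candidate))
```

A gate like this only proves that the file loads. It says nothing about what the new rules do once loaded, which is why a definition update deserves the same staged rollout as a code update.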

Reliance on foreign tech companies is bad

A lot of UK (and international) infrastructure was taken offline by an American company. It is objectively insane that essential infrastructure in areas like healthcare has been set up to rely on foreign tech companies.

In this case the problem seems to have been a simple mistake, but the implications go beyond a scenario where the vendor has good intentions.

We have little idea what happens inside Microsoft or CrowdStrike. All we really have is a black box called Windows (or CrowdStrike) that we trust to work in our interests. We don't know if the American government or the NSA is able to exert any direct influence to insert backdoors into Windows. We certainly don't know if hostile foreign nations have successfully placed employees within Microsoft (and they would have to be very lacking in imagination not to at least try).

We have to rely entirely on Microsoft's intentions and internal review processes to ensure that Windows remains safe.

It would make a lot more sense if infrastructure was run on Linux or OpenBSD, which are developed completely in the open and can be customised as much as anyone wants. Somehow I can't see the government deciding to support its own Linux distribution (and I wouldn't trust it to do a good job in the medium term anyway), but in theory this would be far preferable to relying on American corporations.

As an aside, I think it's equally insane that our nuclear deterrent's missile system is American. America has very different geopolitical incentives than us and they might be willing to sacrifice Europe if it means American soil doesn't face a nuclear strike. The warheads are ours, but the missiles aren't. So if we launch against America's wishes, are those missiles going to go where we aim them?