Lessons Learned from 'Microsoft Global IT Outage'

On Friday I woke up to discover that CrowdStrike, a cybersecurity vendor that tries to protect organizations from things like availability outages caused by cyber attacks, had caused an availability outage of its own. Essentially, it is the opposite of what businesses hire them to do.

Affected PCs entered a 'boot loop', hitting a blue screen on boot, and were essentially unusable without repair steps (or very lucky timing with manual reboots).

It was branded as a Microsoft Global IT outage and will forever be known as such. However, it was not a Microsoft outage.

By my count, it is the largest IT and cyber incident to date.

When it comes to cybersecurity, 'shit happens' is where this post comes in. CrowdStrike is not the first company to do this, nor will it be the last. However, they were the first security vendor to crash so many systems so quickly.

To give credit to CrowdStrike, they moved to get on top of the situation within an hour and a half (but by that point, many had already been affected globally). I think CrowdStrike responded really well overall, and I hope they don’t blame one person, a point I touch on later.

I think this incident highlights many broad concerns.

But before getting into the interesting stuff, enjoy the video of the engineer fixing the PC.

Media Reporting Was Very Fast

When I first saw the headline on my phone, it was about the Microsoft Global IT outage unfolding. My first thought was ransomware. So I logged in and started to look around to see what was happening. I am a CrowdStrike customer.

Two separate things had happened:

Microsoft Azure experienced an outage earlier that day. This was resolved before I woke up. Azure has frequent outages (don’t kill me, Microsoft), so this is not unusual.
CrowdStrike made a boo-boo and pushed a faulty channel update to a significant percentage of customers.

The media connected and confused these two incidents. They were not connected.

At this point, I got involved and said so. Some publications corrected course, but the majority... just didn’t, and branded it a Microsoft outage.

Microsoft did not do itself any favors, viewed from the outside. It took them hours to start briefing on the record that it was a “third-party” issue, and they still did not name CrowdStrike until late in the incident. Tip for MS: if it’s the headline story for every media outlet worldwide and it’s not your incident, name the company quickly and loudly, even if it’s a partner or customer.

However, the real fault lies with the media here. It was very clear that some major publications did not have the knowledge they needed or sources to walk them through this. They had to point people toward the BBC while trying to figure out what was actually happening.

We Should Not Trust Cybersecurity Vendors

This is controversial. I think we, me included, trust cybersecurity giants too much, and there is a lack of transparency and accountability.

To be honest, the security industry and governments have not done the best job so far, and the threat of ransomware has led to the cybersecurity industry failing its way to the point where it has access to almost every PC on the planet. By necessity.

Organizations are rushing to install EDR (endpoint detection and response) agents. These agents are often updated automatically, in relatively uncontrolled ways, to keep up with cybersecurity threats.

Overall - this is a good thing and has helped keep companies and the economy safe from some cyber incidents.

But there is a flip side.

The security industry has gotten into bed with governments worldwide, and regulatory standards have mandated these EDR agents across multiple sectors. Look at many of the standards and terms that governments are pushing, and it is actually a few cybersecurity vendors that set them, by whispering in the ears of government and industry groups.

There are transparency and control issues.

Almost all major EDR vendors have kernel access on Windows (think of it as the highest level 'God Mode') and are installed on millions of systems. In itself, there is essentially no problem with this.

However, where it starts to get sketchy is that almost all vendors obfuscate their software to obscure it from analysis (stopping crooks, but also avoiding scrutiny and locking research and testing behind non-disclosure agreements). They are pushing updates multiple times a day with zero customer visibility, zero accountability, and zero regulatory scrutiny.

Some of these EDR vendors, including CrowdStrike, can execute detection code delivered through updates in an unsafe manner in the kernel, potentially triggering blue screens.

The 'R' in EDR stands for response, allowing the organization to respond to cyber incidents in real-time by isolating systems and retrieving files. Most EDR vendors have pivoted toward cloud solutions. Therefore, the EDR vendor itself can fully control this setup.

… and, by proxy, so can any group that gains access to the EDR vendor: hackers, or nation-states using national security laws that prohibit vendors from disclosing when they have had to provide access.

The whole situation feels like handing the keys to the kingdom (essentially the world economy, as far as it runs on Microsoft Windows) over to a small group of private cybersecurity companies without external governance or assurance. It has always felt sketchy, and today it feels very sketchy.

In my opinion, customers should demand the following from vendors:

Provide more transparency on how endpoint security updates are tested and how they can be rolled back.
Disclose where the product has risky interactions with the Windows kernel, and develop a roadmap for safer drivers.
Commit to disclosing all security incidents affecting managed platforms in a timely manner, providing transparency about their own security so customers can make informed choices about temporarily disconnecting.
Disclose in advance if national security laws apply when accessing customer data.
Commit to full transparency when bad updates are pushed, with reports explaining what happened, why it happened, and what actions were taken to improve things going forward.
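
To make the first demand a little more concrete, here is a minimal sketch of what a testable, reversible rollout could look like. It is an illustration only, not how CrowdStrike or any other vendor actually ships updates. Every name in it (`Ring`, `staged_rollout`, the `deploy`, `rollback`, and `is_healthy` callbacks, and the 1% failure threshold) is a hypothetical I made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Ring:
    """A hypothetical deployment ring: a named group of endpoints."""
    name: str
    hosts: List[str]


def staged_rollout(
    rings: List[Ring],
    deploy: Callable[[str], None],
    rollback: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_rate: float = 0.01,  # assumed threshold, purely illustrative
) -> bool:
    """Deploy ring by ring; halt and roll back if a ring looks unhealthy.

    Returns True if every ring was updated, False if the rollout was halted.
    """
    updated: List[str] = []
    for ring in rings:
        for host in ring.hosts:
            deploy(host)
            updated.append(host)
        failures = sum(1 for host in ring.hosts if not is_healthy(host))
        failure_rate = failures / len(ring.hosts) if ring.hosts else 0.0
        if failure_rate > max_failure_rate:
            # Stop before the next, larger ring and undo everything so far.
            for host in reversed(updated):
                rollback(host)
            return False
    return True


if __name__ == "__main__":
    versions = {}
    rings = [Ring("canary", ["pc-1"]), Ring("broad", ["pc-2", "pc-3"])]
    completed = staged_rollout(
        rings,
        deploy=lambda host: versions.__setitem__(host, "new"),
        rollback=lambda host: versions.__setitem__(host, "old"),
        is_healthy=lambda host: host != "pc-2",  # pretend pc-2 crashes
    )
    print(completed, versions)
```

The design point is that the health check sits between rings, so a bad update stops at a small group. A machine stuck in a boot loop cannot be rolled back remotely, which is why the halt has to happen before the broad rings are reached.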

I believe these steps will help customers make informed decisions about risk levels, and vendors should commit to them.

Looking at how performance is judged for current EDR vendors, most in the industry judge it based on detection performance and CPU impact. I think there should be a category reporting on stability, which should form part of independent testing.

For example, anyone can reproduce the CrowdStrike issue by placing a broken .sys file in the CrowdStrike system folder. The validation covers the first few bytes of the file. If it’s an invalid channel file, the machine blue-screens and will not boot. That is arguably a security vulnerability in itself. This kind of thing should be picked up by independent testing. But no one is looking at it, because vendors actively prevent this level of testing from being made public.

I am not here to slam all vendors, nor do I want to slam CrowdStrike. The point I am making is that, as a customer, you have no visibility or data on which to base trust in a vendor's resilience. That is a big problem.
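
As a rough illustration of the difference between checking the first few bytes and validating a whole file, here is a minimal sketch. The file layout (a `CHNL` magic value, a length field, and a CRC32 checksum) is entirely my own assumption for the example. It is not CrowdStrike's channel file format, and the function names are hypothetical.

```python
import struct
import zlib

MAGIC = b"CHNL"  # hypothetical magic bytes, not any vendor's real format
HEADER = struct.Struct("<4sII")  # magic, payload length, CRC32 of payload


def build_channel_file(payload: bytes) -> bytes:
    """Build a file in the made-up format used by this example."""
    return HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)) + payload


def validate_channel_file(blob: bytes) -> bool:
    """Return True only if the whole file is structurally sound."""
    if len(blob) < HEADER.size:
        return False
    magic, length, checksum = HEADER.unpack_from(blob)
    if magic != MAGIC:
        return False  # this is roughly where a magic-bytes-only check stops
    payload = blob[HEADER.size:]
    if len(payload) != length:
        return False
    return zlib.crc32(payload) == checksum


def load_or_keep_previous(blob: bytes, previous: bytes) -> bytes:
    """Fail safe: reject a bad update and keep the last known-good content."""
    return blob if validate_channel_file(blob) else previous


if __name__ == "__main__":
    good = build_channel_file(b"detection rules")
    broken = MAGIC + b"\x00" * 64  # right magic bytes, garbage after them
    print(validate_channel_file(good), validate_channel_file(broken))
```

The point of `load_or_keep_previous` is the failure mode: a file that does not validate is refused and the previous content stays in place, rather than the bad file being acted on. Independent stability testing could feed products files like `broken` and report what happens.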

I personally know that some vendors are better than others in this area. But that cannot be extended beyond my experience.

I believe customers have the right and responsibility to push cybersecurity vendors to make endpoint products much better and more transparent. Cybersecurity vendors are uniquely positioned in these access points and have responsibilities they have yet to fulfill. Detection percentages should not be the only prism through which endpoint protection products are viewed. Businesses need availability. It has been lost over time.

I know of boards at organizations that purchased CrowdStrike at great cost to avoid cyber incidents and have now been disrupted anyway. Now they are facing bills for major recovery efforts to get the business back online.

That is a cyber industry problem. Everyone feels they have gone backward because of the CrowdStrike incident. Which other businesses could lose control in the same way? Who knows. How is that allowed?

History shows that the cyber and IT industries have goldfish memories when it comes to this kind of thing. In my career, I have twice seen security vendors push bad updates to organizations, from two different vendors. Each was a really bad day. Hopefully, now is the time for customers to push and demand better.

I also hope vendors will lead from the front. Both CrowdStrike and Microsoft have a window to jointly look at measures to prevent this kind of thing from happening again.

Microsoft has tried to act on kernel drivers and system stability in the past but ran into regulatory and competition issues. We currently have a small number of cybersecurity companies effectively operating in God Mode over the world economy, and these discussions need to be revisited because there needs to be a way for all vendors to implement less risky behavior over time. This should also include Microsoft’s security solutions.

I think the cybersecurity industry is still very immature, with a long journey ahead, and we are currently just a lot of idiots with too much power.

People Prefer Conspiracy Theories Over Boring Truths

Things I have recently read about the incident, in tweets that got far more attention than actual factual information:

It was caused by diversity, equity, and inclusion hires at CrowdStrike or Microsoft.
CrowdStrike wiped customer PCs to cover up an assassination attempt on Donald Trump.
The Ukrainian company CrowdStrike is coming back for Clinton.

All of this is complete and utter nonsense.

It is also spreading on LinkedIn. I have seen cybersecurity people copy and paste this nonsense on LinkedIn from profiles that carry their employer's name. It’s a remarkably foolish mindset.

What actually happened is really simple. CrowdStrike made a testing mistake while holding extraordinary access to millions of PCs.

By the way, that mistake is not the fault of one analyst. It points to a series of failures at CrowdStrike. It also highlights structural issues in the cybersecurity industry.

In Summary

I believe we, as customers, should demand more transparency from the cybersecurity vendors selling us endpoint tooling.

I have seen numerous LinkedIn posts from people at cybersecurity vendors (not CrowdStrike) claiming that the disaster recovery is the fault of the customer.

I also think it points to a broader problem. Some people in the cybersecurity vendor industry think customers are fools. In their view, they are the true wizards. They have all the threat intelligence. The performance of EDR is purely about detection. We will keep buying and take what is given.

Maybe we are not fools, and perhaps we will not keep buying.
