Blamestorming for fun and profit: The debris has mostly cleared from the CrowdStrike outage. But we still haven’t recognized some of the IT impacts that matter most.

In early 2000, following IT’s unprecedentedly effective response to the Y2K situation, the world botched its after-action review. Consumed by the need to have someone to blame, influencers from around the world proclaimed it was a hoax perpetrated by IT to inflate technology budgets and its perceived importance.

Happy to have a scapegoat, the world ignorantly sneered and then moved on to the next pseudo-culprit.

So now we have CrowdStrike. Consumed by the need to have someone to blame, and with Microsoft as a sitting duck perched precariously atop a dubious patch pile, influencers around the world are determined to expend their energies and (speaking of dubious) expertise on blamestorming instead of constructing a systems view of the situation.

But before we can even get started: It appears that, no matter how appealing the story, Southwest Airlines didn’t dodge the CrowdStrike bug because its servers run on Windows 3.1. (For an in-depth view, see “No, Southwest Airlines is not still using Windows 3.1 — OSnews.”) Then there’s this simple plausibility test: We’re talking about a network that has to support several tens of thousands of end-users. Which is the more likely cause of systems failure in a Windows 3.1-based network that has to scale this far — a bad CrowdStrike patch or Windows 3.1 itself? It would be a bit like Southwest basing its engine technology on duct tape and aluminum foil. Maybe you could do it, but it would be just as crash-prone.

Implausibility, sadly, won’t persuade business executives whose confirmation bias tells them IT’s requests to fund lifecycle management are, like the Y2K remediation effort, unnecessary.

Sigh.

Just my opinion: In an era of AI-powered cyberattacks, the last thing you ought to do is embrace obsolescence as a strategy.

Instead, it’s best to heed what follows from the CrowdStrike mess.

Insight #1: The CrowdStrike outage was more than a technical defect

Yes, Microsoft granted third-party security vendors access to its kernel while Apple and most Linux variants did not, enabling the bad patches that caused the problem. This wasn’t laziness or sloppy decision-making on Microsoft’s part, though. Microsoft did this because EU regulators insisted on it.

Nor did the EU regulators insist because they were fools. Their goal was ensuring fair competition in the European OS marketplace. It was a trade-off that didn’t pay off. But then, trade-offs don’t always pay off. That’s why they’re called “trade-offs” and not “un-mixed blessings.”

Insight #2: Want someone to blame? Blame the Red Queen

CrowdStrike is in the cybersecurity business. Many, and perhaps most, cybersecurity providers recognize they’re trapped in a “Red Queen strategy.” Like Alice’s looking-glass nemesis, they have to run as fast as they can just to stay in one place.

They are, that is, under unrelenting pressure to release newer and more sophisticated responses to newer and more sophisticated threats.

It’s another way this is a systemic problem. Cybersecurity vendors like CrowdStrike have to deploy patches and releases more quickly than prudence would otherwise dictate, with “more quickly” translating to “insufficiently tested.”

Vendors are trapped by the Red Queen. Either they defend against new malware on the bad actors’ release schedules, accepting the risk of shipping buggy patches, or they hold patches back for more testing and leave their customers vulnerable to the new malware in the meantime.

The faster new malware releases barge in, the more likely cybersecurity vendors are to miss defects in their patches and releases.

As CIO, you aren’t immune to the Red Queen effect either. IT is under constant pressure to deliver stuff quickly, and nobody wants to hear that slowing things down to reduce risk is a necessity.

Rock, meet hard place. Then, meet DevOps.

Insight #3: We need to take a close, hard look at DevOps

DevOps isn’t just the place user acceptance testing has gone to die anymore. It’s the place where continuous integration / continuous delivery (CI/CD) was supposed to be “best practice.” But too many adopters have substituted deployment for delivery, the difference being that delivery means creating releasable builds that go on to further quality assurance, not pushing them straight to PROD.
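To make the distinction concrete, here’s a minimal sketch in Python (the pipeline, function names, and QA gate are all hypothetical, not any particular vendor’s tooling): continuous delivery ends with a releasable, QA-verified build, while shipping it to PROD is a separate, deliberately gated decision.

```python
# Hypothetical sketch of delivery vs. deployment: the pipeline always produces
# a releasable build, but pushing it to PROD is a separate, gated step.
from dataclasses import dataclass


@dataclass
class Build:
    version: str
    passed_qa: bool = False


def run_automated_qa(build: Build) -> bool:
    """Stand-in for the test suite a real pipeline would run (unit, integration, canary)."""
    return True


def continuous_delivery(version: str) -> Build:
    """Build and verify a release candidate; stop short of production."""
    build = Build(version)
    build.passed_qa = run_automated_qa(build)
    return build  # releasable, but not yet released


def promote_to_prod(build: Build, approved: bool) -> None:
    """Continuous *deployment* would skip the approval; delivery keeps the gate."""
    if not build.passed_qa:
        raise RuntimeError(f"{build.version}: QA gate failed, not releasable")
    if not approved:
        print(f"{build.version} is releasable and waiting for promotion")
        return
    print(f"Deploying {build.version} to PROD")


if __name__ == "__main__":
    candidate = continuous_delivery("2024.07.19")
    promote_to_prod(candidate, approved=False)  # delivered, not deployed
    promote_to_prod(candidate, approved=True)   # now it ships
```

The point isn’t the code; it’s that the promotion step exists at all. Collapse it, and you’ve quietly traded continuous delivery for continuous deployment.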

Insight #4: The lines have blurred

Once upon a time there were bugs. Once upon the same time there was malware. Now, the only difference between bugs and the destructive forms of malware is the author’s intent.

Insight #5: Preparation is everything

Businesses that were resilient and recoverable in the face of the CrowdStrike bug were resilient and recoverable because they had prepared themselves for ransomware attacks and other recovery situations. See “Inside CIOs’ response to the CrowdStrike outage — and the lessons they learned” for insights into this perspective.

Insight #6: Proselytizing the ELT on IT’s trade-offs will pay off

All of which gets us back to a challenge all CIOs must overcome if they’re going to retain even the slightest shred of sanity: Making sure the company’s executive leadership team embraces the trade-off-laden nature of the IT trade. The CrowdStrike debacle gives you a case study you can use to highlight key IT trade-offs. The Red Queen dilemma — the speed vs. risk choice described above — is a good place to launch the conversation.

Then you can enlist the ELT’s help in setting the right balance point for some key trade-offs your own IT organization has to contend with.