Every affected company should be extremely thankful that this was an accidental bug, because if CrowdStrike gets hacked, it means the bad actors could basically ransom I don’t know how many millions of computers overnight.
Not to mention that CrowdStrike will now be a massive target for hackers trying to do exactly this.
And it’s not just operational costs. What happens when an outage lasts 3+ days and affects all communication and travel? That’s another massive shock to the system.
If I had to bet my money, a bad machine with corrupted memory pushed the file at the very final stage of the release.
The astonishing part is that for security software I would expect every file to be verified against a signature (that would have prevented this issue and some kinds of attacks).
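To make the "verify against a signature" idea concrete, here is a rough sketch of what a client could do before loading an update file. It is only an illustration under assumptions: a real product would almost certainly use an asymmetric signature (the vendor signs with a private key, clients check with a public one), while this uses an HMAC from the Python standard library as a stand-in. The names `sign_update` and `is_update_intact` are made up for the example. It also only helps if the file got corrupted after it was signed, which is the scenario guessed at above.

```python
import hashlib
import hmac


def sign_update(payload: bytes, key: bytes) -> str:
    """Compute a hex tag for the payload (stand-in for a real signature)."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def is_update_intact(payload: bytes, key: bytes, expected_tag: str) -> bool:
    """Return True only if the payload still matches the tag it shipped with."""
    actual_tag = sign_update(payload, key)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(actual_tag, expected_tag)


if __name__ == "__main__":
    key = b"hypothetical-shared-key"      # assumption: key handling is out of scope
    good = b"detection rules v2"
    tag = sign_update(good, key)          # computed at build time

    corrupted = b"\x00" * len(good)       # what a bad machine might push instead
    print(is_update_intact(good, key, tag))
    print(is_update_intact(corrupted, key, tag))
```

The point is just that the check happens on the receiving machine, right before the file is used, so a file mangled anywhere between build and delivery gets rejected instead of loaded.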
So here’s my uneducated question: Don’t huge software companies like this usually do updates in “rollouts” to a small portion of users (companies) at a time?
I mean yes, but one of the issues with “state of the art AV” is they are trying to roll out updates faster than bad actors can push out code to exploit discovered vulnerabilities.
The code/config/software push may have worked on some test systems but MS is always changing things too.
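For anyone wondering what a "rollout to a small portion of users" can look like in practice, here is a minimal sketch. Everything in it is an assumption for illustration (the stage percentages, the crash-rate threshold, the function names); it is not how CrowdStrike or anyone else actually does it. Hosts are assigned to a stable bucket from 0 to 99 by hashing their ID, and the rollout only widens if the crash rate among already-updated hosts stays low.

```python
import hashlib

# Hypothetical rollout stages, as a percentage of the fleet.
STAGES = [1, 5, 25, 100]


def in_cohort(host_id: str, percent: int) -> bool:
    """Deterministically decide whether a host is in the first `percent` of the fleet."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0..99
    return bucket < percent


def next_stage(current_percent: int, crashed: int, updated: int,
               max_crash_rate: float = 0.001) -> int:
    """Return the next rollout percentage, or 0 to halt the rollout."""
    if updated == 0:
        return current_percent            # no data yet, stay where we are
    if crashed / updated > max_crash_rate:
        return 0                          # too many crashes: stop pushing
    for stage in STAGES:
        if stage > current_percent:
            return stage
    return 100                            # already at full rollout
```

The tradeoff mentioned above is real though: every stage you wait at is time the rest of the fleet spends without the new detection, so the stages for fast-moving content updates would have to be short.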
The CEO made a statement to the effect of “It’s not an attack, it’s just me and my company being shockingly incompetent.” He didn’t use exactly those words but that was the gist.
That’s what the BSOD is. It tries to bring the system back to a nice safe freshly-booted state where e.g. the fans are running and the GPU is not happily drawing several kilowatts and trying to catch fire.
I’m gonna take from this that we should have AI doing disaster recovery on all deployments. Tech CEOs have been hyping AI up so much, what could possibly go wrong?
Problem is that software cannot deal with unexpected situations like a human brain can. Computers do exactly what a programmer tells them to do, nothing more, nothing less. So if a situation arises that the programmer hasn’t written code for, then there will be a crash.
When talking about the driver level, you can’t always just proceed to the next thing when an error happens.
Imagine if you went in for open heart surgery but the doctor forgot to put in the new valve while he was in there. He can’t just stitch you up and tell you to get on with it, you’ll be bleeding away inside.
In this specific case we’re talking about security for business devices and critical infrastructure. If a security driver is compromised, in a lot of cases it may legitimately be better for the computer to not run at all, because a security compromise could mean it’s open season for hackers on your sensitive device. We’ve seen hospitals held ransom, we’ve seen customer data swiped from major businesses. A day of downtime is arguably better than those outcomes.
The real answer here is CrowdStrike needs a more reliable CI/CD pipeline. A failure of this magnitude is inexcusable and represents a major systemic failure in their development process. But the OS crashing as a result of that systemic failure may actually be the most reasonable outcome compared to any other possible outcome.
The file is used to store values to use as denominators in some divisions further down the process. Being all zeros caused a division-by-zero error. Pretty rookie mistake, you should do IFERROR(;0) when using divisions to avoid that.
I disagree. I’d rather things crash than silently succeed or change the computation. They should have done better input and output validation, and gracefully failed into a recoverable state that sends a message to an admin to correct. A divide by zero doesn’t crash a system, it’s a recoverable error they should 100% detect and handle, not sweep under the rug.
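Roughly what I mean by validating and failing into a recoverable state, as a sketch. It takes the "file was all zeros" claim above at face value, and all the names here (`InvalidUpdate`, `validate`, `load_update`) are made up for the example. The bad file is rejected loudly, the previous known-good content stays in use, and an error is logged for an admin to look at.

```python
import logging

logger = logging.getLogger("updater")


class InvalidUpdate(ValueError):
    """Raised when an update file fails basic sanity checks."""


def validate(payload: bytes) -> bytes:
    """Reject obviously broken payloads before anything tries to use them."""
    if not payload:
        raise InvalidUpdate("update file is empty")
    if not any(payload):                  # every byte is 0x00
        raise InvalidUpdate("update file contains only zero bytes")
    return payload


def load_update(new_payload: bytes, last_known_good: bytes) -> bytes:
    """Return the new payload if it validates, otherwise keep the old one."""
    try:
        return validate(new_payload)
    except InvalidUpdate as exc:
        # Catch only the specific, expected failure and report it;
        # nothing is silently swallowed.
        logger.error("Rejected update, keeping previous version: %s", exc)
        return last_known_good
```

Note it catches one specific exception and reports it. That is the opposite of wrapping everything in a bare except and carrying on.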
Life pro tip: if you’re a Python programmer you should use try: func() except: continue every time you run a function, that way you would never have errors in your code.