this post was submitted on 19 Nov 2025
Technology
Honest question (I don't work in IT): this sounds like a contradiction, or at the very least a deliberately placating choice of words. Isn't a config change the opposite of routine?
Not really. Sometimes there are processes designed where engineers will make a change as a reaction or in preparation for something. They could have easily made a mistake when making a change like that.
E.g.: companies that advertise on a large sporting event might preemptively scale up (maybe warm up depending on language) their servers in preparation for a large load increase following some ad or mention of a coupon or promo code. Failure to capture the market it could generate would be seen as wasted $$$
Edit: auto-scale can't be counted on for non-essential products; people would not come back if the website failed to load on the first attempt.
I don't think the bug was in making the configuration change; I think there was a bug as a result of that change.
That specific combination of changes may not have been tested, or may not have been applied in production for months, and it just happened to be needed today for the first time since an update some time ago, hence the "latent" part.
But they do changes like that routinely.
Yeah, I just read the postmortem. My response was more about the confusion that any configuration change is inherently non-routine.
No, in "DevOps" environments "configuration changes" are most of what you do every day.
They probably mean that they changed a config file that gets uploaded in their weekly or bi-weekly change window, and that the file was malformed for whatever reason, which made the process that reads it crash. The main process depends on said process, and the whole chain failed.
Things to improve:
I'm sure they have several of these and sometimes shit happens, but for something as critical as CloudFlare to not have automated integration tests in a testing environment before anything touches prod is pretty bad.
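To illustrate the point about a malformed file crashing the process that reads it, here is a minimal sketch of validating a config before swapping it in and keeping the last known good version otherwise. The JSON layout, the `features` key, and the `MAX_FEATURES` limit are all hypothetical; nothing here is taken from Cloudflare's actual format or code.

```python
import json

# Hypothetical limit, for illustration only.
MAX_FEATURES = 200


class ConfigError(Exception):
    """Raised when a candidate config file fails validation."""


def parse_config(raw: str) -> dict:
    """Parse and validate a candidate config; raise ConfigError if it is bad."""
    try:
        cfg = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ConfigError(f"not valid JSON: {exc}") from exc
    if not isinstance(cfg, dict):
        raise ConfigError("top level must be an object")
    features = cfg.get("features")
    if not isinstance(features, list):
        raise ConfigError("'features' must be a list")
    if len(features) > MAX_FEATURES:
        raise ConfigError(f"too many features: {len(features)} > {MAX_FEATURES}")
    return cfg


def reload_config(raw: str, last_known_good: dict):
    """Return (config, error). A bad file never replaces the running config."""
    try:
        return parse_config(raw), None
    except ConfigError as exc:
        # Keep serving with the old config and surface the error to operators.
        return last_known_good, str(exc)
```

The same `parse_config` check could run as an automated test in a staging environment before the file is pushed anywhere, so a bad file is rejected twice: once before rollout and once by the reader itself.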
Fail open vs fail closed. Bot detection is a security feature. If the security feature fails, do you disable it and allow unchecked access to the client data? Or do you value Integrity over Availability?
Imagine the opposite: they disable the feature and during that timeframe some customers get hacked. The hacks could have been prevented by the Bot detection (that the customer is paying for).
Yes, bot detection is not the most critical security feature and probably not the reason someone gets hacked, but having "fail closed" as the default for all security features is absolutely a valid policy. Changing this policy should not be the lesson from this disaster.
You don't get hacking protection from bots, you get protection from DDoS attacks. Yeah some customers would have gone down, instead everyone went down... I said that instead of crashing the system they should have something that takes an intentional decision and informs properly about what's happening. That decision might have been to clo
You can keep the policy and inform everyone much better about what's happening. Half a day is a wild amount of downtime if it were properly managed.
So you agree that if this were controlled instead of just crashing everything, letting them make an informed decision about opening or closing things (with the suggestion of opening when bot detection is down) would be the correct approach? What's the point of your complaint if you do agree? C'mon.
I disagree. I don't know the details of Cloudflare's bot detection, but there are many automated vulnerability scanners that this could protect against.
I agree. Every crash is a failure by the designers. Instead it should be caught by the program and result in a useful error state. They probably have something like that but it didn't work because the crash was too severe.
I am not complaining. I am informing you that you are missing an angle in your consideration. You can never prevent every crash ever. So when designing your product you have to consider what should happen if every safeguard fails and you get an uncontrolled crash. In that case you have to design for "fail open" or "fail closed". Cloudflare fucked up. The crash should not have happened and if it did it should have been caught. They didn't. They fucked up. But I agree with the result of the fuck-up causing a fail closed state.
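For anyone unsure what the two options mean in practice, here is a rough sketch of a request gate whose behaviour on a failed security check is an explicit policy. The names `FailureMode`, `allow_request`, and the `check` callback are made up for illustration and are not how Cloudflare implements anything.

```python
from enum import Enum
from typing import Callable


class FailureMode(Enum):
    FAIL_OPEN = "open"      # check is broken: let traffic through (availability first)
    FAIL_CLOSED = "closed"  # check is broken: block traffic (integrity first)


def allow_request(request: dict,
                  check: Callable[[dict], bool],
                  mode: FailureMode) -> bool:
    """Return True if the request may proceed.

    `check` stands in for any security feature (e.g. bot detection).
    If the check itself fails, the configured mode decides the outcome.
    """
    try:
        return check(request)
    except Exception:
        # The security feature is down; fall back to the configured policy.
        return mode is FailureMode.FAIL_OPEN


def broken_check(request: dict) -> bool:
    raise RuntimeError("bot detection unavailable")


# Same failure, opposite outcomes depending on the policy.
served_when_open = allow_request({"path": "/"}, broken_check, FailureMode.FAIL_OPEN)
served_when_closed = allow_request({"path": "/"}, broken_check, FailureMode.FAIL_CLOSED)
```

Note this only covers the controlled case where the failure is caught. The argument above is about the uncontrolled case, where nothing catches the crash and the effective result is fail closed whether anyone chose it or not.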