
The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions, which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.

The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever-changing threats. The software enforced a limit on the size of the feature file, and the doubled file exceeded it. That caused the software to fail.
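
Illustratively, the failure mode looks something like the sketch below: a loader that preallocates for a fixed number of features and fails hard when the file outgrows that limit. The file name, limit value, format, and function names here are assumptions for illustration, not Cloudflare's actual code.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical hard cap: memory for the feature table is
 * preallocated against this limit. */
#define MAX_FEATURES 200

/* Load one feature per line; fail hard if the file has grown
 * past the preallocated limit. */
int load_feature_file(const char *path) {
    FILE *fp = fopen(path, "r");
    if (!fp) {
        perror("fopen");
        return -1;
    }

    char line[256];
    int count = 0;
    while (fgets(line, sizeof line, fp)) {
        if (count >= MAX_FEATURES) {
            /* A doubled feature file lands here on every machine
             * it is propagated to, taking the proxy down fleet-wide. */
            fprintf(stderr, "feature file exceeds %d entries\n",
                    MAX_FEATURES);
            fclose(fp);
            abort();
        }
        /* ... parse the feature definition in `line` ... */
        count++;
    }

    fclose(fp);
    return count;
}

int main(void) {
    int n = load_feature_file("features.conf"); /* hypothetical path */
    if (n >= 0)
        printf("loaded %d features\n", n);
    return 0;
}
```

Because every machine consumes the same propagated file, a single bad publish turns one hard failure into a network-wide one.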

IphtashuFitz@lemmy.world 11 points 1 day ago

You would do well to read up on the 1990 AT&T long-distance network collapse. A single line of changed code, rolled out months earlier, ultimately triggered what you might these days call a DDoS attack, one that took down all 114 switches in AT&T's long-distance network. Over 50 million long-distance calls were blocked in the 9 hours it took to identify the cause and roll out a fix.

AT&T prided itself on the thoroughness of its testing and rollout strategy for code changes. The bug that took the network down was both timing-dependent and load-dependent, making it extremely difficult to test for; it required fairly specific real-world conditions to trigger, which is how it went unnoticed for months.
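
The single changed line is widely reported to have been a misplaced C `break`: inside a `switch`, a `break` written in a nested `if`/`else` exits the whole `case`, not just the conditional. A minimal sketch of that bug class follows; the names, structure, and stub behavior are illustrative assumptions, not the actual 4ESS source.

```c
#include <stdbool.h>
#include <stdio.h>

enum { INCOMING_MESSAGE, OTHER_MESSAGE };
enum { IN_SERVICE, OUT_OF_SERVICE };

static bool ring_write_buffer_empty(void) { return false; } /* switch under load */
static void send_in_service_notice(void)  { puts("notice sent"); }
static void finish_processing(void)       { puts("message fully processed"); }

static void handle(int msg_type, int peer_status) {
    switch (msg_type) {
    case INCOMING_MESSAGE:
        if (peer_status == OUT_OF_SERVICE) {
            if (ring_write_buffer_empty()) {
                send_in_service_notice();
            } else {
                /* Intended to skip only the notice, but in C this
                 * `break` exits the whole switch case, so
                 * finish_processing() below never runs and shared
                 * state is left half-updated. */
                break;
            }
        }
        finish_processing();
        break;
    default:
        break;
    }
}

int main(void) {
    /* Under load (buffer not empty), the buggy path is taken and
     * processing is silently cut short. */
    handle(INCOMING_MESSAGE, OUT_OF_SERVICE);
    return 0;
}
```

On the buggy path the `case` exits early and the message is never fully processed, which is exactly the kind of latent, timing-and-load-sensitive flaw that only surfaces under specific real-world traffic.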