this post was submitted on 14 Dec 2023
180 points (97.9% liked)

Asklemmy


A loosely moderated place to ask open-ended questions


If your post meets the following criteria, it's welcome here!

  1. Open-ended question
  2. Not offensive: at this point, we do not have the bandwidth to moderate overtly political discussions. Assume best intent and be excellent to each other.
  3. Not about using or getting support for Lemmy: for that, use the dedicated support communities and community-finding tools instead
  4. Not ad nauseam inducing: please make sure it is a question that would be new to most members
  5. An actual topic of discussion

dan@upvote.au 16 points 2 years ago (last edited 2 years ago)

I broke the home page of a big tech (FAANG) company.

I added a call to an API created by another team. I did an initial test with 2% of production traffic plus 50% of employee traffic, and it worked fine. After a day or two, I rolled it out to 100% of users, and it broke the home page. It was broken for around 3 minutes until the deployment oncall found the killswitch I'd put in the code and turned it off. They noticed the issue quicker than I did.
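
A staged rollout like this one (2% of production traffic plus 50% of employees, then 100%) is often done with deterministic percentage bucketing. Here's a minimal sketch in Python, assuming each user has a stable string ID and you know whether they're an employee; `bucket`, `is_enabled` and the SHA-256 hashing scheme are my own illustrative choices, not that company's actual gating system:

```python
import hashlib


def bucket(user_id: str, flag_name: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    # Hashing flag name + user ID gives each flag its own independent, stable split.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, is_employee: bool, flag_name: str,
               public_pct: int, employee_pct: int) -> bool:
    """True if this user falls inside the rollout percentage for their group."""
    pct = employee_pct if is_employee else public_pct
    return bucket(user_id, flag_name) < pct


# Initial test: 2% of public traffic, 50% of employees.
initial = is_enabled("user-123", is_employee=False, flag_name="new_api",
                     public_pct=2, employee_pct=50)
```

Because the bucket comes from a hash rather than a per-request random number, each user gets a consistent experience, and raising `public_pct` from 2 to 100 only adds users without flipping anyone back. One caveat the story shows: a 2% slice only produces a fraction of the final load, so a backend that copes at 2% can still fall over at 100%.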

What I didn't realise was that only some of the methods of this class had Memcache caching, and the method I was calling did not. It turned out to be running a database query against a DB with a single shard and only 4 replicas, which wasn't designed for production traffic. As soon as my code rolled out to 100% of users, the DBs immediately fell over from tens of thousands of simultaneous connections.

Always use feature flags for risky work! It would have been broken for a lot longer if I hadn't added one and they'd had to redeploy the site. The site was continuously pushed all day, but building and deploying could take 45+ minutes.

jjjalljs@ttrpg.network 13 points 2 years ago

> Always use feature flags for risky work! It would have been broken for a lot longer if I didn't add one and they had to re-deploy the site. The site was continuously pushed all day, but building and deploying could take 45+ mins

This reminds me of the old saying: everyone has a test environment. Some people are lucky enough to have a separate production environment, too.

What language? PHP, Python?

I work on a SOC team and we're really trying to hammer the use of feature flags into our devs.