gepandz

joined 2 years ago
MODERATOR OF
sre
6
A dirty, dirty pun (icosahedron.website)
submitted 2 years ago by gepandz to c/sre
[–] gepandz 4 points 2 years ago

I'm in PDX (Portland, Oregon), and I love quite a few things about living here. I grew up in Texas, and the willingness to engage with the conservative BS is one of the things I value most. We have the same attitudes here that we did in Texas, but at least we deal with them in the open. Far better and healthier than the polite silence that smothers.

The fact that today's high is in the low 70s is also a strong selling-point, and the mix of river, mountains (what a friend calls "land-bumps"), and the fact that I'm an hour from the ocean, all tell me I'm in the right place.

I've visited Seattle a few times, including to interview, and I like Portland, which has a vibe of "Seattle's cooler younger brother." 😂

[–] gepandz 2 points 2 years ago (1 children)

At what scope? Like, to post within a community, at the comment-level, or is this to screen registrations?

[–] gepandz 2 points 2 years ago
  1. Build your coding chops, especially for scripting-friendly languages like the Bourne-style shells (sh, bash, zsh, etc. -- I use zsh almost exclusively, but that's a holy war I'd like to sidestep for now 😉), PowerShell (if you can't avoid Windows systems, it's a decent tool), Python, Ruby, etc. It helps to know a little about a lot, so if you can at least do some online tutorials for whatever language your developers are writing in (Java, C++, C#, PHP, JavaScript, TypeScript, etc.), that'll come in handy if you have to dive in and figure out what that log error message is trying to tell you. Uh, I learned most of my programming from a combination of a bachelor's in CS and a master's in CS, plus a LOT of cursing... My dad let me have my first computer in my room when I turned 12, a ZEOS running MS-DOS 6.22 and Windows for Workgroups 3.11. He handed me an inch-thick book of printed source code for QBASIC games along with it and told me I could play any game I could write. 😅 We've come a long way since the mid-90s, thankfully! There are some great tutorials online that I recommend running through, then look to sites like Advent of Code, a code Advent Calendar that comes out every Christmas season with 50 coding prompts that you can try to solve in any language (it's how I taught myself Rust). You only get better by cursing at something, in my experience.

Once you've got some basic coding skills (or as a way to build those skills 😉), work with your devs on their testing suites. I've never worked in a shop with too many tests, and knowing what your code should do helps a lot when you're trying to figure out what it's actually doing; tests are how you, well, test that your code is working as expected. Understand how their tests get run, see if they have a test-runner like CircleCI, Jenkins, Travis CI, etc., and if they don't, see if you can help them build some kind of basic pipeline so once they commit their code to their repository (assuming they're using Git or another VCS (version-control system) -- and if they're not, they're wrong 😜), a job will kick off to run their entire test suite for them. There are TONS of guides for how to set that kinda thing up, and it depends on the language they're using, whether you're using a cloud, how your repositories are set up and where they're hosted, etc. I once set up a basic but functional CI pipeline using an in-house server running nothing but Git and some Git Hooks (scripts that run when certain events occur; they're built into Git itself and provide a LOT of power that few people, if any, even know exists). The trick is to get the devs to teach you how their stuff works, what they find painful and annoying, and then help them out by making it suck less.
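A minimal sketch of the Git-hook idea, not the pipeline described above. It assumes a client-side hook saved as `.git/hooks/pre-push` and a project whose tests run under Python's `unittest`; the `TEST_COMMAND` value and the function names are hypothetical, so swap in whatever runner your devs actually use.

```python
#!/usr/bin/env python3
"""Hypothetical Git hook: run the test suite and block the push if it fails."""
import subprocess
import sys

# Assumption: the project's tests run with this command; replace it with your own runner.
TEST_COMMAND = [sys.executable, "-m", "unittest", "discover"]


def run_test_suite(command):
    """Run the test command and return its exit code (0 means the run succeeded)."""
    result = subprocess.run(command)
    return result.returncode


def main(command=TEST_COMMAND):
    """Run the tests, complain on stderr if they fail, and return the exit code."""
    exit_code = run_test_suite(command)
    if exit_code != 0:
        # Git aborts the push when a pre-push hook exits with a non-zero status.
        print("Tests failed; refusing to push.", file=sys.stderr)
    return exit_code


# In the real hook file, finish with: sys.exit(main())
```

The hook file has to be executable for Git to run it. A server-side hook could follow the same shape, running the suite and reporting the result whenever new commits arrive.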

This is a hard question to answer, since SREs and DevOps Engineers have to have a broad background. Everything could affect site reliability, so you have to know a little bit about a LOT, usually enough to be able to find an error message, figure out if it means anything, and either fix it or identify who you need to wake up next to fix it, since you're usually doing this at 3AM. Oh, and the error messages could be spread across a dozen or thousands of servers, each running some part of your multi-tier, microservices application, so I hope you can read logs for your load balancer, web server, application server, database, and whatever storage system your DB is using... 😂😭 ... While trying to get your kid up for school and finding someone who can pick up the threads for you for an hour or so while you drive your kid to school or explain to the incident commander and your manager that they're gonna have to hold their horses until you get back online (yes, I've had this EXACT experience... 😬).

The real force-multipliers for any SRE are two things: their Rolodex (who they can call in when things go beyond their experience level) and a shared library of bitter experiences (usually a wiki, though I've used a shared text-file back in my mainframe days; at one shop I set a wiki up and dubbed it the Bitter Experience Document, or BED, "Since the incident isn't over until it's been put to BED." 😂 To quote a former, now-retired, mentor, "I'm only laughing to keep from crying."). Knowing who to call in when for what helps you in the immediate term and gets the site back up faster. Documenting the root cause in a blameless way, recording the error that you (the collective "you") observed as the trigger, the chain of events that led to the failure (especially if you can build a timeline), points where you could have avoided the situation with better automation or alerting, and action items for future-you to take if it crops up again, all help improve reliability in the long-term, and it CANNOT be done in one person's notes or kept between one person's ears; that's not scalable, and it falls apart if that person does a person-thing, like gets sick, takes a vacation, has a kid, retires, or drops dead.
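One way to keep those write-ups consistent is to give every incident the same fields. The sketch below is only an illustration: the `IncidentRecord` class, its field names, and the plain-text layout are all assumptions, not a format from any particular wiki or tool.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class IncidentRecord:
    """One blameless entry in a shared incident log (field names are illustrative)."""
    title: str
    trigger: str  # the error that was observed as the trigger
    timeline: list = field(default_factory=list)        # (timestamp, event) pairs
    missed_chances: list = field(default_factory=list)  # where automation or alerting could have helped
    action_items: list = field(default_factory=list)    # what future-you should do next time

    def add_event(self, when, what):
        """Append one timeline entry; entries can arrive in any order."""
        self.timeline.append((when, what))

    def to_wiki_text(self):
        """Render the record as plain text with the timeline sorted by time."""
        lines = [f"Incident: {self.title}", f"Trigger: {self.trigger}", "Timeline:"]
        for when, what in sorted(self.timeline, key=lambda entry: entry[0]):
            lines.append(f"  {when:%Y-%m-%d %H:%M} {what}")
        lines.append("Could have been avoided by:")
        lines.extend(f"  - {item}" for item in self.missed_chances)
        lines.append("Action items:")
        lines.extend(f"  - {item}" for item in self.action_items)
        return "\n".join(lines)
```

Note that there is no "who caused it" field. The record describes what the system did and what to change, which keeps it blameless by construction.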

Tl;dr: improve your coding skills, make friends with your coworkers and know who to lean on for what, and document the daylights out of every event that occurs, since you WILL forget it, especially at 2AM.

  1. Not to be flippant, but the best tool is one you use that fits its purpose. I've used Puppet, Ansible, Jenkins, Pivotal CloudFoundry, debugged another team's CFEngine config tool... Too many to remember, let alone list. The thing to keep in mind is that, as long as the tool works for what you need it to do and is something that you can teach your coworkers to use, it's meeting your needs and is fine. If you find a tool that works faster, is easier, or that more of your coworkers will use (so you're not perpetually on-call), switch to that one. 🤷‍♂️ Tools are far less important than people, so use the tool that helps people the most.

  2. (I'm answering this in the context of what makes an SRE's life easier...) Write good tests and error messages, for the love of all that's holy, above or below. A good test is quick, verifies what it needs to verify, and either passes quietly or fails fast and loud. One of the most unforgivable things I've seen was a developer choice in a Spring Boot app that every response should always come back with an HTTP 200 (https://http.cat/200) "All is well" code, even when it REALLY threw an HTTP 500 (https://http.cat/500) or 503 (https://http.cat/503) error, and the only way to spot it was to parse the body of the response message. 🤬 I couldn't set up monitors and traps on actual error codes, because they were hiding them as clean responses, which made understanding the health of the application significantly more difficult. We had to depend on app logs going into an ELK stack, plus a couple of the couple-dozen microservices that also logged their HTTP responses to a different ELK stack, in a way that let me scan the message bodies for error messages. 😭
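To make the status-code point concrete, here is a framework-free sketch (plain Python, not the Spring Boot app in question). `build_response` and `is_healthy` are hypothetical names, and treating `TimeoutError` as "dependency unavailable" is an assumption made for the example.

```python
import json


def build_response(handler):
    """Call a handler and return (status_code, body) with a status that matches the outcome."""
    try:
        payload = handler()
    except TimeoutError as exc:
        # A dependency did not answer: say so with a 503 instead of a 200 with an error body.
        return 503, json.dumps({"error": str(exc)})
    except Exception as exc:
        # Anything else unexpected is a server error and should look like one.
        return 500, json.dumps({"error": str(exc)})
    return 200, json.dumps(payload)


def is_healthy(status_code):
    """A monitor can decide from the status alone, with no body parsing."""
    return 200 <= status_code < 400
```

With honest status codes, an alert rule only has to count 5xx responses; nobody has to scan message bodies to learn that the app is on fire.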

I often get asked, "What should I test?" My simple answer is, "Only test the things you care about." If your app doesn't depend on running on a specific OS, don't test which OS you're running under (and you're hopefully writing code that can be as OS-agnostic as possible). If your code could fail because some user (NB: there's no "L" in "user" 😜) sends in a bunch of garbage in your input field, write a test for garbage in your input field, but don't test every possible combination of characters from 'a' to "ZZZZZZZZZZZZZZZZZZZZZ" -- cover your bases, especially the bizarre edge-cases, and call it good. Every failure, every page, every incident, should result in at least one new test in each affected system, since that happened because your group missed a real event that could occur and didn't test for it. You're smarter, now, if less-well-rested... 😂
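As a rough illustration of "cover your bases and call it good", the sketch below tests a made-up input handler against a handful of representative bad inputs. `parse_quantity` and its 1 to 999 range are invented for the example.

```python
import unittest


def parse_quantity(text):
    """Hypothetical input handler: accept a whole number from 1 to 999, reject everything else."""
    cleaned = text.strip()
    # isascii() guards against non-ASCII digits that isdigit() and int() would otherwise accept.
    if not cleaned.isascii() or not cleaned.isdigit():
        raise ValueError(f"not a whole number: {text!r}")
    value = int(cleaned)
    if not 1 <= value <= 999:
        raise ValueError(f"out of range: {value}")
    return value


class ParseQuantityTests(unittest.TestCase):
    def test_accepts_ordinary_input(self):
        self.assertEqual(parse_quantity(" 42 "), 42)

    def test_rejects_garbage_and_edge_cases(self):
        # A few representative bad inputs, not every possible string.
        bad_inputs = ["", "   ", "abc", "-1", "0", "1000", "4.2", "１２", "1; DROP TABLE users"]
        for bad in bad_inputs:
            with self.subTest(bad=bad):
                with self.assertRaises(ValueError):
                    parse_quantity(bad)
```

When an incident turns up a new kind of garbage, it becomes one more entry in `bad_inputs`, which is how the suite grows from real events instead of guesses.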

It's really nice if you can work with your observability folks (also usually SREs) to add event-hooks to your code with performance and health-check data to feed into their monitoring system, like Grafana, New Relic, DataDog, etc., since more data usually correlates to more information, which we can mine for better intelligence, and, ideally, take more effective and earlier action (a trick I learned from a VP a LOOONG time ago was the Data -> Information -> Intelligence -> Action pipeline; I, an engineer, was talking Data at a VP, who was looking for the Action he needed to take, and it didn't go well).
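A small sketch of what such an event-hook might look like, assuming nothing about the real monitoring system. `emit` is a stand-in that just collects events in a list, and `observed` and `lookup_price` are hypothetical names; a real setup would hand the events to whatever client your observability folks provide.

```python
import functools
import time

# Stand-in for a real metrics client; events are only collected in memory here.
EMITTED_EVENTS = []


def emit(event):
    """Record one event (a real hook would ship it to the monitoring system)."""
    EMITTED_EVENTS.append(event)


def observed(name):
    """Decorator that records duration and outcome for each call to the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            outcome = "error"  # assume failure until the call returns normally
            try:
                result = func(*args, **kwargs)
                outcome = "ok"
                return result
            finally:
                emit({
                    "name": name,
                    "outcome": outcome,
                    "duration_ms": (time.perf_counter() - start) * 1000,
                })
        return wrapper
    return decorator


@observed("checkout.lookup_price")
def lookup_price(sku):
    """Toy function to instrument; raises KeyError for an unknown SKU."""
    prices = {"widget": 250}
    return prices[sku]
```

Each call yields one piece of Data (name, outcome, duration); turning that into Information, Intelligence, and Action is the monitoring side's job.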

Tl;dr: Write tests that test just enough to be trustworthy and that grow over time, write error messages that are meaningful and that point you to what is likely wrong (especially if you link to a wiki page or something about how to fix the error state 🙌), and coordinate with the monitoring and observability folks to get them as many useful metrics as you can.


Sorry to wall-of-text at you, but these are questions I've gotten a LOT, and I think they're important to ask and to answer. I usually answer them over beer or when teaching an intern or something, and verbal communication is so much faster than written comms in some ways...

Hope this helps and that I didn't scare you off from being an SRE! It's ... never dull, which is what I like about it, though the constant interrupt-driven nature of it can be wearing at times; that's another problem that's really a culture-problem, not an engineering one. 😅

[–] gepandz 1 points 2 years ago

As someone who's done both jobs, there is a difference between Site Reliability Engineering and DevOps, though it's mainly a distinction without a difference. Site Reliability Engineering focuses on the health of a website, monitoring it, improving uptime and resilience to the incidents that come at it. DevOps is a cultural movement that focuses on how the code gets up and online, and how the site or app goes through its entire lifecycle.

The net effect is that you basically do the same actions as both a Site Reliability Engineer and a DevOps Engineer, but you focus on different aspects of the problem. For DevOps, you're focusing on the overall efficiency of the deployment pipeline, ensuring that the whole thing has healthchecks, consistent monitoring and control mechanisms, version-controlled tracking of code, and automation facilitating all of that throughout. There should be cooperation between Dev and Ops (and Security in DevSecOps) throughout the entire lifecycle of the application from end to end, and the "success metric" for DevOps is how seamless that cooperation is -- it's a cultural shift, not a directly mechanical one.

SRE takes the approach that site reliability is paramount, and the best way to achieve site reliability is to implement the same practices described above, but only those parts that make sense as the need arises. If poor deployment processes impact site reliability, the SRE tackles the site deployment processes. If the site isn't able to roll back a change because there's no source-control, then it's time to implement version-control on your source files. You only really implement what you need as it comes up or you foresee it coming up and causing problems in the near-ish future.

SRE is, in my experience, a much more engineering-centric approach to solving the same problems that DevOps arose to address, but both wind up making very similar design and tooling choices. I would not say either is necessarily "wrong" beyond whether they work in the shop's culture or not. Managed poorly, SRE can become a cowboy fun-park, where only the cool and shiny things get worked on as things like safety get deprioritized. DevOps can become an ivory tower, devoid of connection to the real world, assuming the shop isn't just offloading all of the app maintenance from the Devs onto the Ops or SRE folks in a "Your problem now" situation. Both are bad, but both can be recovered from, and neither is an engineering problem that engineers can fix with tooling. I've said this for years, but in my 16 years of doing IT professionally, I've never encountered an actual engineering problem; every problem I've ever encountered has been a cultural or organizational problem presenting itself as an engineering problem, and fixing cultural problems with engineering solutions is like trying to solve hardware problems with software: expensive, and prone to weird failure-states.

Tl;dr: Both start from different places, but they end up in pretty much the same healthy design patterns. Close enough that you should be able to swap from one to another fairly readily.

 

Google published the original book on Site Reliability Engineering, and it's still a strong resource.

Time for the age-old questions that I've asked every interviewer for the past almost-decade, "What does SRE mean to you? Does it differ from DevOps?"

[–] gepandz 6 points 2 years ago

In professional IT since '06, but I've been playing with (and fixing) computers since playing with my dad's Timex Sinclair at the age of two. I'm a generalist who focuses mainly on OSes and automation, but I've had experience working in databases, mainframe performance tuning, security, cloud infrastructure, pretty much anything but web front-ends outside of tweaking a really basic LiveJournal page's HTML. 😂

I've gotten as high as a domain and solutions architect; my three domains were Servers (Linux), Security, and Automation/SOA/DevOps (yes, I was a DevOps Architect, as much of an antipattern as that is... I didn't make the titles! 😅). Currently looking for a new SRE/DevOps/DevSecOps gig after getting let go during some layoffs recently.

As far as InfoSec is concerned, I take a very strong "Security is everyone's responsibility" approach, since it's us humans against the machines and "bad guys." Even a newbie can pull a Cliff Stoll and say, "Huh, that's funny." We need to band together in community, so I'm hopeful this will be a net-positive.