DevOps for Developers: Building an Effective Ops Org

Charity Majors @mipsytipsy DevOps for Developers Hi, Craftconf!! Thank you!!
I’m SO happy to be here. I was so jealous of all the live tweets coming last year, and this year has been just as amazing as I had hoped.

@mipsytipsy engineer, cofounder, CTO My name is Charity Majors. I’ve
been working on computers as an engineer or engineering manager since I was what, 17 years old? I was a sysadmin back when they were still called that. I’ve done a little bit of everything, from ops to data to software engineering to management. My software engineering stints were not particularly glorious. :) When I was like 19 I was responsible for maintaining a qmail fork and adding the very first spam filters / sieve implementation, I’ve written distributed load testing frameworks for databases, etc. I’ve been a DBA. I have the dubious honor of sorta being one of the top MongoDB experts in the world. Seriously, come at me, see if you can stump me on this one. :) I’ve also been a manager off and on, I’ve spent a lot of time building teams, lately building a new company. I’m mentioning all of this, not because I think you really give a shit about my biography, but because I want you to hear where I’m coming from. I’ve been all over the stack and the org chart, but my heart belongs to operations. I care really deeply about doing operations well. I care about it as a discipline, about building services that people love and rely on. I care about the people who do it, and making sure they love what they do. There’s this perception, I think, that ops is a horrible profession that’s abusive to its practitioners and they take it out on everyone else. So I originally planned to come here and talk to you guys about how to build really great ops teams. But then I realized that I don’t think that’s what really most needs to be said right now. So instead, I’m talking about:

Why your software engineers need to get better at operations,
and how to do it. DevOps for Developers: Software engineers need to give a shit about operations. And I don’t just mean the process parts and the devopsy parts about how to break down silos between teams. I mean very specifically that most of your software engineers suck at operations engineering. Some software engineers are brilliant and passionate about operations, and mad respect and love to all of you out there who already get this. Lots of other engineers honestly don’t understand that this is a problem; others do but aren’t really sure what to do about it. A couple months ago a friend of mine pinged me and he was like “hey, I need help. I have a small team of software engineers, and we SUCK at ops. It’s taking up all of our time to do really stupid trivial things just to deploy software. Help! How can we get better? Are there any books or resources you would recommend?” And I looked around a bit, and realized … there really *isn’t* much out there that’s targeted at helping software engineers level up at operations! So, devops. Devops is this … thing.

“Dear operations people, learn to be more like software engineers.”
Love, DevOps (2009-2016) And we’ve been “devopsing” for a few years now, but it feels like most of it is either communication and empathy stuff (breaking down silos etc.) or it’s technical content aimed at operations engineers. We’ve had years now of people lecturing the sysadmins and the ops kids about how they really needed to level up when it comes to writing code and tests and managing infrastructure with more traditional software engineering techniques. And this is all great! Like, on the ops side, we have received this message so hard. Operations, site reliability, whatever you want to call it: we’ve really gone from being the assholes who typically block change to the ones who are really driving change and pushing forward, and pushing the envelope when it comes to really hard problems of scale, reliability, and process and continuous improvement. So I feel like it’s about time, maybe past time, to turn the lens around and talk about the other side.

“Dear software engineers: your turn. Time to get better at
ops.” Love, Everyone in Tech Which is: it’s time for software engineers, across the board, to get better at operating systems, and owning their own services. And feel some urgency about it. The complexity of these systems we are all expected to support is exploding. It’s not enough to write beautiful code and understand your data structures and algorithms and the code you wrote, you need to understand *systems*. You need to understand how your code and your services interact with other services and storage layers, and instrumentation, and how to make good technical decisions, and own your systems and debug them from end to end. I’m not saying that no SWEs do this. There are plenty of software engineers who are really grounded in the consequences of operational impact, but it’s still seen in our industry as kind of “optional”, or nice-to-have.

This is not optional, this is not “nice-to-have” This is
table stakes. This is not a “nice-to-have”. It should not be optional. It’s 2016. You wouldn’t hire a sysadmin who literally can’t write a line of code. Not optional. Likewise, you shouldn’t hire a SWE who can’t or won’t be on call and own/debug their own services. This should no longer be considered optional! Even for FE and mobile devs. Lots of places will claim or even brag that they don’t do ops or don’t need ops, so let’s quickly dispense with that nonsense. What is operations?

What is operations? Operations is the constellation of your org’s
technical skills, practices, and cultural values around designing, building and maintaining systems, shipping software, and solving problems with technology. Operations, in the way I think we should be thinking about it, is the constellation of all of the technical knowledge, skills, practices, and cultural values that your company has built up around shipping software and building systems. This includes all of your informal habits, tribal knowledge, reward systems. It probably includes a lot of things that you and your leaders don’t even know exist. I don’t really think about operations as a *role*, even though you probably do have roles or individuals who are more dedicated to operability problems than others.

Operations is a social contract. But I think it’s more
helpful to think of operations as a social contract that everyone participates in, from the CEO all the way down to tech support. You could also say: operations is effectively an emergent property of how your organization executes on its technical mission. And I don’t want to be the language police here, but IF you choose to brag about how you and your badass org “don’t do ops” or don’t need ops, you are in serious danger of devaluing the entire range of skills and processes that actually produce high quality software and systems. So many of the companies out there who have been the most vocal in their disdain for operations as a discipline actually then end up spending an incredible amount of their precious software engineering cycles doing stupid, low hanging crap work. They’re so proud of themselves for not “doing ops” but mostly what they’ve succeeded in doing is devaluing the skill set and explicitly establishing the precedent that *this organization does not value operational excellence*. In general, you don’t make a thing better by ignoring that it exists. You make a thing better by naming it, claiming it, and consciously working to improve it. I get where these people are coming from, when they use ops as a synonym for “nothing is automated”, but I think that they’re discarding way too much value when they do that.

Do you need an “ops team”? Do you need quality
operations engineering skills and culture? ¯\_(ツ)_/¯ YES. This doesn’t mean you need a dedicated ops team! Do you? I don’t know, it depends on a lot of things. But do you need engineers who care about building systems that are maintainable and reliable, sustainable, systems that people can understand and debug and explore? Yes, you do. You need to have these people, and a culture that values these things, or the systems that you build will be terrible.

So you have an Ops Org … So by this
definition, you have this thing that we’re gonna call an “ops org”, and you probably have some ops problems, because we all do. And usually when orgs start to have ops problems, they either try to hire people to fix those problems, or they drill down on the teams who are responsible for “reliability”, etc.

Your Mission 1. Support your people in developing new skill
sets 2. Express institutional value (and mean it) There are two sides to this story. First, how do you support your engineers in developing and building these skill sets? I wrote like a 3 hour long talk and just chopped a shit ton of it oﬀ in the past 60 minutes — blergh. Second, how do you express and sustain to your engineers that these skills are *valued*? That they are not optional, that they are visible, that they are seen, just as much as the engineers who ship shiny features and products. This starts with the interview process, the promotion and review process, the full lifecycle of that engineer’s development at your company. These are some of the tools in your toolbox when it comes to helping your engineers level up technically. Your ﬁrst and most important tool is about recognizing the power of the feedback loop. Interviewing and hiring Performance reviews, feedback and promotions Cultural values and reward cycles

Software engineers need to get better at ops. (And they
should WANT TO!! Ops is like a superpower!!!) So first: you need to support your engineers in developing new skills. First of all, you need to convince them that they should WANT it. And they SHOULD. Here’s the weird key secret about building a really powerful, effective ops org: it's mostly not about hiring more operations engineers or SREs or whatever. It's about helping everyone level up their game at ops. It’s about making this table stakes, not “extra”. And since most ops engineers are already pretty good at this, honestly, it’s often (not always) the case that you can have more impact on the quality of your reliability or whatever ops problems you have, by focusing less on them and more on helping your software engineers understand the operational impact of their code. This is about baking operational excellence in from the start; making it a first class citizen of your processes and your values, instead of trying to tack it on afterwards by hiring more ops bodies. And if you’re a software engineer who wants to be a fucking world class badass engineer/tech lead/system architect, you should CRAVE these skills. A solid grounding in operations is often what separates the “ok” software engineers from the kind of engineers you can build a company or a team around. So: HOW?

Developing new skill sets Let’s start with feedback loops, and
how to instill a feeling of ownership. This is about ownership, and understanding the lifecycle of your code in terms of months or years instead of hours or days. (Like Bridget said yesterday, software is never really “done” until it’s been decommissioned.) There are two quick and easy things you can do that will tighten this loop and, I guarantee, increase their ownership and investment in quality: software engineers should always be on call for their own services, and always deploy their own code. Create feedback loops: deploying code, putting people on call for their own services. Make it not horrible and miserable.

Engineers should be on call for their own services. The
on call rotation is not only your first and best tool for building a healthy ops culture, it’s also your most effective tool for *helping them become better software engineers*. They get better at debugging. They get better at creating debuggable systems. They learn to think in terms of interdependencies instead of abstractions. They very quickly learn that things are going to fail a lot more than they expected. If your SWEs aren’t used to being on call, they may resist this. Which is why it is 100% on you as a technical leader to do two things: help them understand why this is a GOOD THING, not a punishment, and then make sure that oncall doesn’t fucking suck.

Common protests: * learned helplessness * fear of breaking things
* strategic incompetence * “my time is too valuable!” If your SWEs aren’t used to being on call, they may push back for one of these reasons. Because being on call — it’s never gonna be glamorous, but it also has a really bad rep because, frankly, ops people have a history of martyrdom and self-abuse. Operations teams have a long and sordid history of developing martyr/hero complexes. That is what gives oncall duty such a bad rep. Do not let them get away with this.

• Guard your people’s time and sleep • No hero
complexes. No martyrs. • Don’t over-page. Align engineering pain with customer pain • Roll up non-urgent alerts for daytime hours • Your most valuable paging alerts are end-to-end checks on critical code paths. Corollary: on-call must not be hell. We are not here to be martyrs. Suffering is not a badge of honor. Being on call should not regularly diminish your quality of life. If you’re a leader, it is YOUR JOB to monitor how often people are getting woken up or interrupted out of work hours, and it is on you to *fix that* anytime it gets out of hand. Make it culturally ok for someone to stay home / sleep in / hand over pager if they’ve had a rough night, without having to ask permission. Not just culturally ok — culturally *encouraged*. Your most valuable checks are end-to-end checks that traverse the most important code paths that correspond to your KPIs. Page only on the health of the service, not individual metrics, especially after hours. Have two different categories of paging alerts. Problems that are customer-impacting are worth waking someone up for. If it’s not hurting your customers, it can wait until morning. Make it your goal to have as few 24x7 paging alerts as possible. Make this a key part of your design phase. Revisit the list of paging alerts regularly and audit them.
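
To make the two categories of paging alerts concrete, here is a minimal sketch of the routing rule described above. The `Alert` fields, the alert names, and the function names are all hypothetical; a real setup would live in whatever alerting tool you already use.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE_NOW = "page_now"        # 24x7: worth waking someone up for
    DAYTIME_ROLLUP = "daytime"   # batched and looked at during work hours


@dataclass(frozen=True)
class Alert:
    name: str                 # hypothetical alert name
    customer_impacting: bool  # is the service unhealthy from the user's side?


def route(alert: Alert) -> Route:
    """Page only on customer pain; everything else waits until morning."""
    if alert.customer_impacting:
        return Route.PAGE_NOW
    return Route.DAYTIME_ROLLUP


def paging_alert_names(alerts: list[Alert]) -> list[str]:
    """The list to revisit in a regular audit: every alert that can page 24x7."""
    return [a.name for a in alerts if route(a) is Route.PAGE_NOW]
```

The point of `paging_alert_names` is only that the 24x7 list is small enough to print and argue about; if it keeps growing, that is the signal to audit it.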

Software engineers should deploy their own code. On call duty
and deploys are deeply interconnected. Unless you have a robot that auto-deploys from master on each commit (in which case you’re already pretty advanced along this path …), your software engineers should always deploy their own code. It’s worth investing into instrumentation here, e.g. canarying, blue-green etc.
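
As one illustration of the kind of deploy instrumentation mentioned above, here is a rough sketch of a canary check: compare the canary's error rate to the baseline fleet before promoting a build. The function name, thresholds, and the idea of using error rate alone are assumptions for the example, not a prescribed method.

```python
from enum import Enum


class Verdict(Enum):
    WAIT = "wait"          # not enough canary traffic yet
    PROMOTE = "promote"    # canary looks no worse than baseline
    ROLLBACK = "rollback"  # canary is clearly worse


def judge_canary(
    baseline_errors: int,
    baseline_requests: int,
    canary_errors: int,
    canary_requests: int,
    *,
    max_ratio: float = 2.0,      # assumed tolerance: canary may be up to 2x baseline
    error_floor: float = 0.001,  # assumed floor so a zero-error baseline is not impossible to match
    min_requests: int = 100,     # assumed minimum sample before judging
) -> Verdict:
    if canary_requests < min_requests:
        return Verdict.WAIT
    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    canary_rate = canary_errors / canary_requests
    allowed = max(baseline_rate * max_ratio, error_floor)
    return Verdict.PROMOTE if canary_rate <= allowed else Verdict.ROLLBACK
```

A check like this is a guard rail for the engineer doing the deploy: it gives them a fast, automatic answer rather than a gatekeeper to ask.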

Build guard-rails, not walls Feedback needs to be fast to
be effective This is one of those catchphrases of devops: build guard rails, not walls. It’s a good catchphrase! This is how you empower developers. Give the developer enough feedback that they can have confidence in what they’ve just done. Alert them directly if something went wrong. Feedback needs to be quick in order to be maximally effective. If an engineer broke something and finds out a few days later in the post mortem, that’s not nearly as visceral and educational as if he or she got paged two minutes after they did it.
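
One hedged sketch of "alert them directly": when an alert fires, look up the most recent deploy of that service within a short window and notify that engineer first. The `Deploy` record, the 30-minute window, and the function name are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class Deploy:
    service: str
    deployer: str          # hypothetical: whoever pushed the button
    finished_at: datetime


def who_to_notify(
    deploys: Iterable[Deploy],
    service: str,
    alert_time: datetime,
    window: timedelta = timedelta(minutes=30),  # assumed "recent enough" window
) -> Optional[str]:
    """Return the deployer of the latest deploy of `service` shortly before the alert, if any."""
    recent = [
        d for d in deploys
        if d.service == service
        and timedelta(0) <= alert_time - d.finished_at <= window
    ]
    if not recent:
        return None  # no recent deploy: fall back to the normal on-call rotation
    return max(recent, key=lambda d: d.finished_at).deployer
```

This does not prove the deploy caused the problem; it only shortens the loop between cause and effect so the person with the most context hears about it in minutes instead of days.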

The most powerful weapon in your arsenal is always cause
and effect. So … use cause and effect as creatively and as often as you can. People generally want to do well, but they can’t care about what they don’t know about. Tightening up that loop creates empowerment and excitement and accountability, which is how you get great engineers and great engineering teams. The next tool we’re gonna talk about is … knowledge transfer and education.

Pair your SWEs with ops/DBA for debugging, oncall “cool! let’s
sit down and ﬁgure this out together, and I’ll show you how to do it next time!” Oncall, obviously. If you have ops engineers, pair ops and SWEs together as primary/secondary on call buddies to encourage collaboration. Get code reviews from your SWEs on major diﬀs so your ops team levels up on code quality and test coverage. Pair on major production pushes or migrations if they’re scary. There are a few technical and social keys to getting this right. First of all, you should all be using the same tools. I’m not saying that everybody needs to be uniformly expert at all aspects of software development and infrastructure automation. There’s plenty of value in specialization or domain knowledge between teams. But common, ordinary tasks should be completely fungible. If a SWE is coming to an ops engineer several times a week just to get a variable changed and deployed, it’s way past time for them to learn how to do it themselves.

Your eng teams should share the same review processes, tasks
and tools. The more your processes, tools, test pipelines, and workflows match those of your other teams, the less impedance mismatch there will be for collaboration. The less it will feel like you’re hanging out on someone else’s turf, and the more you’ll feel like complementary limbs of the same org, which you ARE, right? Another really key point? Get your operational feedback *early* and *often*, from the very first design phase.

Emphasize ops feedback in early design phase. What are the
reliability requirements? How do we distribute load or degrade gracefully? Are we reusing components that are already known & supported as much as possible? Who supports this service, how is it going to fail, what are the ripple eﬀects when it does? What instrumentation and metrics will we need? A lot of teams end up wasting a ton of engineering time because they don’t ask operationalized questions until it’s “close to launch”. And then you end up shipping services that are shitty or fragile, or they just get nixed because they were a bad idea. You need operational buy-in from the beginning, you need hard questions from the beginning. If someone is trying to add a new persistent store, or a new language, or the architecture doesn’t make sense or doesn’t leverage existing components, it’s better for everyone to ﬁnd out *early* before eng time has been spent on it.

Like Caitie says, … you’re fucked down the line, if
you don’t think about these things early on. You will not be able to recruit and hire as many engineers as you need to keep a growing product running if you can’t bend the operational cost curve down as your service is scaling up.

The cost and pain of developing software is approximately zero
compared to the operational cost of maintaining it over time. h/t @mcfunley, “choose boring technology” The most important concept for your engineers to internalize is this: if you aren’t literally a startup starting from scratch, the development time and pain are approximately zero when compared to the amortized cost of maintaining and scaling and operating this beast over time. And so, the rules are:
- The best code is no code.
- The second-best code is code someone else wrote and maintained and battle-tested for your use case.
- If you must have code you write yourselves, the best code is the simplest.
Save your innovation tokens for core business differentiators.

Dear fellow ops/DBAs: BE NICE The grumpy ops roadblock stereotype
isn’t helpful. Would you like to get paged less and work with a higher caliber of engineer? You have a specialized skill set, it’s on you to help them get there. It’s tempting to be a hero and a gatekeeper. I know! It feels really good to be needed! Don’t do it. Model blameless post mortem and pairing Let’s zoom out a bit now, from the individual level to the team and org level. How do you build *teams* of engineers who value operational excellence? How do you interview and hire these people, and how do you cultivate an environment where operational skills are highly valued?

Creating Institutional Value Let’s talk about interviewing software engineers. You
probably construct a loop of interviewers and have a normal set of questions you ask. Are any of those questions about operations? The way you handle interviewing, leveling, performance reviews, pay scales, promotions etc will convey more about how much you actually value ops than anything you can say.

• Interviewing • Promoting • Performance Reviews • Compensation How?

Probe every software engineering candidate for their ops experience &
attitude. … yep, even FE/mobile devs! If you care about operational quality, you will ask every prospective software engineering hire some of these questions. (samples in a sec) It’s common practice at lots of companies now to have a software engineer in the loop for hiring site reliability engineers to evaluate their coding abilities. It should be just as common to have an ops engineer in the loop for a SWE hire, especially for any SWE who is being considered for a key senior position. And yes, I mean *all* engineers! Even your ios/android engineers and website developers should be SOMEWHAT interested in what happens to their code after they hit deploy. They should know things about instrumentation and debugging.

• “Tell me about the last time you caused a
production outage.” • “What are your favorite tools for visibility, instrumentation, and debugging?” • “How would you design a deploy process?” • “You developed service $x, and latency is 5x higher today than yesterday. How do you start debugging the problem?” • “What happens when you type ‘google.com’ into a browser?” Good operational questions for SWEs I have some sample questions here, I’m not going to go through them because time, but the slides will be on the web. Good questions are simple, leading, and have lots of reasonable answers. And stress up front, *it’s okay not to know*. This is not a pass-fail quiz. But it’s important to ask. Because it sets the tone.

Good engineers should be able to communicate in great detail
everything that SUCKS about their favorite technologies. Another question I really like is: “what’s your favorite API (or database, or language) and why?” Followed up shortly by “… and what do you really hate about it?” Specific technologies and techniques really don’t matter. There are a million ways to write and instrument a web app. I would rather hire someone who has built things on a few different languages or platforms and can identify their flaws and tradeoffs, than a fanboy who actually believes whatever they’re using is flawless. You’re also evaluating them here on communication skills, which are severely underrated by most people but are actually a key technical skill.

Do they expect the network to be reliable, disks to
be fast, databases to respond, retries to succeed … Signals … How do they react to the idea of being on call for their own services? Are they overly clever? Ugh. When you’re asking them questions, make it clear up front that you aren’t going to fail them for not being an ops expert. It’s ok not to know things. You *are* teasing out signals for how they will perform on a team where software engineers are expected to own their shit. How much do they know about the world outside of their own code? How much are they *willing* to know? Are they overly clever? God, I hate clever software engineers. The best engineers try to be as simple as possible. A key instinct for architectural design is having to understand as few things as possible. Talk to them about what it means to be responsible for a service. Are they oﬀended at the idea of being on call for their own software? Well, don’t hire that person.

“Operations is valued here.” you are signaling … What it
says is, you’re establishing expectations from the start that you run an org where OPERATIONS IS VALUED. This is not a shop where you push to master and go home for the day and somebody else gets paged for your shit. Some people won’t want to work that way! Better to find that out now, before you hire them.

• Solicit regular feedback from peers, ops, support teams •
Ask questions about relevant operational skills: • “Who would you most like to be paired with on call? Least?” • “Who do you ask for help when you’re completely stumped?” • “Whose code would you be least willing to maintain?” • Include this feedback every cycle, it should not be a surprise. Performance reviews If you have a performance review cycle system at your company, this should be a component of every feedback cycle for software engineers. I don’t mean “score them based on how many times they broke something.” Remember, our goal is not to punish people for mistakes or make them too paranoid to touch production. It does mean you can evaluate them on how well they perform their on call duties — do they dig in deeply when a problem is reported, or do they brush it off and close the ticket if they don’t know how to fix it? Do they ask for help? Do they share knowledge, participate in post mortems and close out their followup tasks? The most valuable signal here usually comes from their peers. Ask specific questions like these in your 1x1s. Ask other SWEs which of their peers are the most diligent and impressive engineers when it comes to on-call work. Ask your ops engineers or SREs questions like, “who are the top 2-3 engineers that you would most trust to deploy some random code at 11 pm on a saturday night given absolutely zero context?” and “which engineers would make you roll out of bed and scramble to your laptop to make sure they weren’t doing something stupid?” Ask your support team which engineers they trust and value the most, and who is the most responsive to user reports. It’s important to ask these questions specifically, because you will get very different answers than if you just ask things like “who writes the best code” or “who do you like working with the most?” Those are also interesting questions, but they surface very different performance characteristics.

Senior software engineers should be reasonably good at these things.
So if they are not, don’t promote them. Operations engineering is about making systems maintainable, reliable, and comprehensible. Senior engineers understand the lifecycle of their code, and the impact of their technical decisions over time. Senior engineers are capable of logging in to a server or inspecting their own metrics and debugging what the hell just happened. This is table stakes. Senior engineers set a good example for junior engineers, give sound advice and have good technical judgment. That’s what being senior *means*. These are the role models you are creating for your team. So if you value operations, factor basic ops hygiene into the expectations you set for promotions and leveling.

You need to actively solicit this feedback by asking different
questions. It is much, much harder to recognize and reward operational excellence than shipping shiny features. To tease out this signal, you have to ask the right questions. And then you have to act on this information in a way that demonstrates that you value it, as much as you value engineers who ship shiny features (if you in fact do). EVERYBODY publicly cheers on those engineers. It’s much harder to identify and celebrate the engineers whose services ship cleanly and don't break. Which brings us to the last part, on culture and recognition.

Your operational priorities must be clearly communicated by management, details
left up to the engineers/teams. In a company with a strong, effective ops culture, your entire management chain values it and clearly communicates their values and priorities. Your leadership needs to clearly set priorities, establish which metrics they care about, and then leave implementation details up to the teams who are responsible for hitting them. And hold them accountable for doing so.

The patterns you call out and celebrate in your culture
will get repeated. What do you valorize? What do you celebrate as a culture and as a company? People on your teams are absolutely going to internalize the kind of behavior and technical prowess that gets called out and glorified, and be motivated to do more of that. Where does your leadership lavish their praise? Are you praising people for shipping features, or performing unsustainable heroic moves? Thank them for practicing good self care, not for burning themselves out. It’s ok to pull people aside and thank them for pulling a hero move, but you should deliver that message privately and pair it with an apology that the organization placed them in a position where heroics were necessary. And then post mortem how you can prevent it from happening again. Hero/martyr complexes are one of the unhealthiest patterns in ops culture and very hard to vanquish. It’s like the flu, it just keeps coming back.

In conclusion … So, in conclusion: do you even need
operations engineers as a dedicated role? (whether SRE, DBA, operations, etc)

Yes, you need an ops team, IF you have hard
operational problems. You should try to not have hard operational problems. Hard ops problems are things like extremely rapid growth, or very high reliability requirements, or high security demands, or you’re trying to solve an infrastructure problem for the entire internet as a service. Can you run your company on Heroku? Can you run your company on AWS Lambda and Dynamo and Travis-CI? Then you should probably do that, and not hire an ops team. Operations engineers are very expensive, and good ones won’t want to stick around if you don’t have hard problems for them.

Needing a dedicated operations engineering team is a sign of
success. Good job! The more mature your company gets, the more operational impact trumps every other technical decision you make. This is a sign of success, so enjoy it.

Useful links:
• Bootstrapping a world-class ops team: www.heavybit.com/library/video/2015-02-24-charity-majors
• Allspaw on blameless post mortems: https://codeascraft.com/2012/05/22/blameless-postmortems/
• Choose boring technology: http://mcfunley.com/choose-boring-technology
• DevOps Weekly: devopsweekly.com
• SRE Weekly: sreweekly.com

with special thanks to: Caitie McCaffrey, Mark Ferlatte, Mihasya (Pancakes), Bridget Kromhout, Dan McKinley

Charity Majors @mipsytipsy
