You Should Be On Call, Too (MWRC)

Slide 1

Slide 1 text

You Should Be On Call, too
Joshua Timberman @jtimberman [email protected]
1 Friday, April 12, 13

Earlier Mike said that speakers spend 6-8 weeks working on their presentations. Unfortunately most speakers only knew their talk was accepted 3 weeks ago, so they wrote it on the plane to the conference, which is why they're nervous. The timing for this talk in the schedule is fortunate, as I'm sure everyone really just wants to hear Jesse talk about ChatOps at GitHub. I do.

Slide 2

Slide 2 text

% whoami
• I work for Opscode
• Community Manager
• System Administrator
• Father, Gamer, CrossFitter
totally legit mustache!

Who am I, and how am I qualified to talk to you about this? I work for Opscode, a company that makes automation software for operations teams and developers that you might have heard of. In my role, I am a technical community manager. Basically, I write cookbooks and help others do so. I'm also a system administrator. While I'm not on call at Opscode, I have been on call for the majority of my career. And in a way, I am on call for everyone's infrastructure that uses Chef and Opscode cookbooks, since I participate in front-line community support via mailing lists, IRC, and Twitter. You just don't get my phone number. Though, it's on my business card. I'm also a man of many interests: I like video and tabletop games, brewing my own beer, and I'm a husband and father. My career as a system administrator has caused many interruptions in these areas, of course.

Slide 3

Slide 3 text

Who are you? (Show of hands)
• Sysadmins?
• Developers?
• Business people? (Consultants?)
• On call (for production)?
http://www.flickr.com/photos/timyates/2854357446/

One thing I like to do is get an idea of whom I'm talking to.

Slide 4

Slide 4 text

Let's talk about ...
http://www.flickr.com/photos/huffstutterrobertl/7195106982/

Okay, now that the introduction is out of the way, let's talk about...

Slide 5

Slide 5 text

Chef! Amusingly enough, I submitted a few talks about Chef for this conference. But the organizers wanted to hear what I have to say about...

Slide 6

Slide 6 text

How software developers should be on call. After all, one of the essential parts of DevOps is that developers should carry pagers, right? That's culture?

Slide 7

Slide 7 text

http://www.flickr.com/photos/xenithorg/81959734/

Of course, that's not culture.

Slide 8

Slide 8 text

http://www.flickr.com/photos/gottgraphicsdesign/5863884809/

See, operations people need your help. I need your help.

Slide 9

Slide 9 text

That's not a "we're hiring"...
http://www.flickr.com/photos/mag3737/2742681177/

That's not to say we're hiring...

Slide 10

Slide 10 text

Although, we are! :-)
http://www.opscode.com/careers

Though, who isn't hiring these days?

Slide 11

Slide 11 text

http://www.flickr.com/photos/gottgraphicsdesign/5863884809/

No, the help is at the behest of the business.

Slide 12

Slide 12 text

http://devopsdotcom.files.wordpress.com/2012/11/screen-shot-2012-11-11-at-10-31-19-am.png

Because this doesn't work, and you know it. "Works on my machine" is just plain irresponsible.

Slide 13

Slide 13 text

No deploys past this point

Operations people go into roadblock mode when all they get is code that brings the site down and gets them paged at 2am. Whether that is reality or survivorship bias doesn't matter. This doesn't work either.

Slide 14

Slide 14 text

♥ Dev+Ops Culture
Slide Courtesy of John Allspaw - http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr

What we need is collaboration, and sharing. Not sharing "on call" but sharing the responsibility for the applications "you" write and "we" run.

Slide 15

Slide 15 text

http://www.flickr.com/photos/inazakira/4532418098/

That's a nice segue into a story...

Slide 16

Slide 16 text

About silos, and separation of duties
http://www.flickr.com/photos/lostvegas/373800727/
Jez Humble: http://bit.ly/devopsteam

Let me tell you about silos, and the separation of duties. I used to work for a large enterprise IT services company.

Slide 17

Slide 17 text

http://www.flickr.com/photos/cdevers/5777944933/
IT Services

We provided system administration services to other large enterprise corporations in our little slice of hosting. This was supposedly for "separation of duties," or for change management and control (ITIL, COBIT), or for some other business reason, but what it meant was siloization, a practice unheard of in startups and other small companies, right?

Slide 18

Slide 18 text

The team I was on had an on-call "hotpager" rotation. Each person had their own primary accounts they worked on, but one person per week was the first contact. We were all system administrators responsible only for the OS: filesystems, network services, host-based firewall rules, security policy, user management, that kind of thing. Everyone had one of these pagers, because a) it was the early 2000s, so cell phones weren't as widespread yet, and b) cell phones didn't get reception in the data centers (or weren't allowed, but that's another story).

Slide 19

Slide 19 text

A typical problem...
• Help desk receives "disk full" alert
• Help desk pages "on call"
• On call looks at system
• On call determines alert is a full filesystem for customer data
• On call pages primary sysadmin
• Primary sysadmin looks at system
• Primary sysadmin doesn't know which of the giant log files is safe to delete (oldest, right? Maybe!)
• Primary sysadmin pages application support (sometimes the customer)

This is probably the most common example.

Slide 20

Slide 20 text

Repeat for "CPU Utilization"

Slide 21

Slide 21 text

Metrics are awesome!
https://github.com/obfuscurity/tasseo
https://github.com/obfuscurity/descartes

Don't get me wrong, metrics are awesome.

Slide 22

Slide 22 text

But...
• Metric-based alerts generally suck
• Gross generalization...
• But most of the time, this alerting is non-actionable
• It isn't necessarily indicative of the problem
• If it is, it's not clear why, or necessarily where to look from the outside.

Everyone loves metrics and graphs. #monitoringlove is here to stay. But they're not super helpful without context. Sometimes developers know the context best, when it wouldn't be obvious to anyone else. "Oh, yeah, sometimes the CPU usage goes up. There's an I/O deadlock due to a janson rod misalignment."

Slide 23

Slide 23 text

If it's an application outage?
http://www.flickr.com/photos/eschipul/4610999148/

What if the problem is that the application isn't starting up? Or it's starting, but it's not connecting to the database? I wasn't the DBA; I didn't know we had a schema update, or how to recover a bad partition tablespace tuple.

Slide 24

Slide 24 text

I can watch dashboards and trends, but I don't necessarily know what is "normal" for an app, or what nuances to look for. Developers do. This is exacerbated by the high turnover rate seen in operations positions. That is, I know a lot of sysadmins who don't stay at a company more than a year or so. That means retraining on new applications all the time.

Slide 25

Slide 25 text

Case Studies
http://www.flickr.com/photos/tk-link/2575598759/

Stand back, I'm about to try science.

Slide 26

Slide 26 text

Companies Studied
• Opscode
• Heroku
• Etsy
• SourceFire
• CustomInk
http://www.flickr.com/photos/g_kat26/4060301657/
"You build it, you run it." - Werner Vogels, Amazon

I talked to operations managers or team leads at some companies that have "developers on call" policies. Or did at the time :-). Except Amazon. That is a quote from Werner Vogels, Amazon CTO, but it's confirmed by former Amazon employees who work at, or worked at, Opscode, like our own CTO Chris Brown, or founding CEO Jesse Robbins.

Slide 27

Slide 27 text

Overall Results
• More developers than sysadmins
• Developers have application domain knowledge
• People like accountability, responsibility!
• People don't like stress
"We found that when we woke up developers at 2am, defects got fixed faster than ever" - Patrick Lightbody, CEO, BrowserMob

I think it is safe to say that in most companies doing web operations, there are more developers than sysadmins. In most companies I've worked in, or with, there are more developers than sysadmins across the board. These developers bring valuable application domain knowledge to the table, since generally speaking, they *wrote* the application. Also, as it turns out, at least in the companies I spoke to, people actually like the accountability and responsibility. It's empowering. What I mean by "people don't like stress" is twofold. First, with more developers, more people share the load of on-call rotations. This helps the operations people not be stressed out, which will improve team morale, and improve the culture.

Slide 28

Slide 28 text

Overall Results
• Nagios + PagerDuty are popular
• Operations/sysadmin teams are escalation
• Greater collaboration
• Team building, learning/knowledge transfer

Unsurprisingly, Nagios is the most popular tool for alerting. Knowing how it works is useful, and it's a great help in operating the application if those who write it also write the checks for it. PagerDuty is likewise popular for actually managing the alerts and escalation. In all the companies I talked to, the operations team is the escalation point for issues and outages that are beyond the application (network issues, firewall configuration, third-party services). Collaboration between team members increased because developers worked more closely with operations, both to make the application easier to manage and in the event of an outage. This naturally leads to team building and learning/knowledge transfer.
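To make "those who write it also write the checks for it" a little more concrete, here is a minimal sketch of an application-level check in the Nagios plugin style. The one thing taken from Nagios itself is the plugin convention: exit status 0, 1, 2, or 3 for OK, WARNING, CRITICAL, or UNKNOWN, plus a single status line, optionally followed by performance data after a `|`. Everything else is an assumption for illustration: the job queue, the `JOBQUEUE` label, the thresholds, and the function names are all hypothetical.

```python
# Sketch of a Nagios-style check for a hypothetical application job queue.
# Plugin convention: exit status 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN,
# with a one-line status message on stdout.

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_queue_depth(depth, warn=100, crit=500):
    """Map a queue depth to (exit_code, message). Thresholds are made-up defaults."""
    if depth is None or depth < 0:
        return UNKNOWN, "could not read queue depth"
    if depth >= crit:
        return CRITICAL, f"{depth} jobs waiting; workers are probably stuck"
    if depth >= warn:
        return WARNING, f"{depth} jobs waiting; backlog is growing"
    return OK, f"{depth} jobs waiting"


def format_status(code, message, depth=None):
    """Build the single output line; the part after '|' is optional performance data."""
    line = f"JOBQUEUE {STATUS_NAMES[code]} - {message}"
    if depth is not None and code != UNKNOWN:
        line += f" | depth={depth}"
    return line


def run_check(read_depth, warn=100, crit=500):
    """Run the check with a caller-supplied reader; return (exit_code, output_line)."""
    try:
        depth = read_depth()
    except Exception as exc:  # a failure to measure is UNKNOWN, never OK
        return UNKNOWN, format_status(UNKNOWN, f"check failed: {exc}")
    code, message = evaluate_queue_depth(depth, warn, crit)
    return code, format_status(code, message, depth)
```

To use something like this as an actual plugin, a small script would call `run_check` with a function that reads the real queue depth, print the returned line, and pass the returned code to `sys.exit`. The point of the sketch is the message text: the developer who knows what a deep queue means can say so in the alert.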

Slide 29

Slide 29 text

You Heard Gene, right?
The Second Way
www.realgenekim.me

Slide 30

Slide 30 text

Quiz Time!

Audience participation time. This is for science.

Slide 31

Slide 31 text

Show of hands, do you...
• Write application code?
• Write application code that runs in production?
• Write application code for clients? (consultants)
• Are the first to get paged/called if there's a problem?

Slide 32

Slide 32 text

Instrument your application
http://www.flickr.com/photos/benbunch/6081948074/

Do yourself, your operations team, and your clients a favor, and instrument your application.

Slide 33

Slide 33 text

Instrumentation
• Monitoring tools/services
• Heaps of blog posts
• Other talks?
• Coda Hale's metrics library?
https://github.com/codahale/metrics
https://github.com/johnewart/ruby-metrics

I don't have specific advice on instrumentation. There's a lot of material about this, including talks at other Ruby conferences. Coda Hale's metrics library for Java has inspired a lot of people to build similar libraries for other languages.
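For anyone who has not seen instrumentation in code, here is a rough sketch of the idea: count things and time things inside the application, then expose the numbers. This is not the API of Coda Hale's library or of ruby-metrics. The `Metrics` class, the metric names, and the `handle_request` function are all hypothetical, and a real library would add histograms, rates, and reporters.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Tiny in-process registry: named counters plus timing samples."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def incr(self, name, amount=1):
        """Add to a named counter."""
        self.counters[name] += amount

    @contextmanager
    def timer(self, name):
        """Record how long the wrapped block takes, even if it raises."""
        start = self._clock()
        try:
            yield
        finally:
            self.timings[name].append(self._clock() - start)

    def snapshot(self):
        """Flatten counters and timing summaries into one dict for reporting."""
        report = dict(self.counters)
        for name, samples in self.timings.items():
            report[f"{name}.count"] = len(samples)
            report[f"{name}.mean_seconds"] = sum(samples) / len(samples)
        return report


metrics = Metrics()


def handle_request(path):
    """Hypothetical request handler instrumented with a timer and two counters."""
    with metrics.timer("requests.duration"):
        metrics.incr("requests.total")
        if path == "/missing":
            metrics.incr("requests.not_found")
            return 404
        return 200
```

After a few calls to `handle_request`, `metrics.snapshot()` returns a plain dict that could be logged or shipped to whatever monitoring service is in use. Keeping the registry separate from the reporting is a deliberate choice in this sketch, so the application code does not need to know where the numbers end up.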

Slide 34

Slide 34 text

http://www.flickr.com/photos/peteredin/3174490145/
Make your application operable

Again, this is for you, your operations team, and your clients. Make the application operable. Instrumentation is a good step, but so are good tools.

Slide 35

Slide 35 text

Opscode manages a production application, a Chef Server as a Service, called Opscode Hosted Chef.

Slide 36

Slide 36 text

Opscode builds tools
• Hosted Chef is an SOA
• Private Chef is built on Hosted Chef, basically
• There are a lot of moving parts
• Enter `private-chef-ctl`

We have Chef, and we leverage that for managing the Chef Servers.

Slide 37

Slide 37 text

private-chef-ctl
sudo private-chef-ctl service-list
sudo private-chef-ctl tail
sudo private-chef-ctl reconfigure
sudo private-chef-ctl restart nagios

The private-chef-ctl command makes Private Chef operable.
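The same pattern is easy to borrow for your own application: one command with a handful of subcommands that wrap the moving parts. The following is a sketch of that shape only, not how private-chef-ctl is implemented. The `myapp-ctl` name, the service list, and the `service myapp-<name> restart` commands are assumptions made up for the example, and the sketch prints what it would run instead of running it.

```python
import argparse

# Hypothetical services behind the application; not Opscode's real list.
SERVICES = ["web", "worker", "queue"]


def restart_commands(service=None):
    """Return the system commands a restart would run (all services if none is named)."""
    if service is not None and service not in SERVICES:
        raise ValueError(f"unknown service: {service}")
    targets = [service] if service else SERVICES
    # Assumes each service is registered with the init system as "myapp-<name>".
    return [["service", f"myapp-{name}", "restart"] for name in targets]


def build_parser():
    """Define the subcommands; only two are sketched here."""
    parser = argparse.ArgumentParser(prog="myapp-ctl")
    commands = parser.add_subparsers(dest="command", required=True)
    commands.add_parser("service-list", help="list the services this tool manages")
    restart = commands.add_parser("restart", help="restart one service, or all of them")
    restart.add_argument("service", nargs="?", help="service name (default: all)")
    return parser


def main(argv):
    """Dispatch one subcommand; this sketch only prints what it would do."""
    args = build_parser().parse_args(argv)
    if args.command == "service-list":
        for name in SERVICES:
            print(name)
        return 0
    for command in restart_commands(args.service):
        # Dry run: swap print for subprocess.run(command, check=True) to execute.
        print(" ".join(command))
    return 0
```

Calling `main(["restart", "web"])` prints `service myapp-web restart`, and `main(["service-list"])` prints the three service names. The value is less in the code than in the single entry point: whoever is awake at 3AM has one command to remember.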

Slide 38

Slide 38 text

Benefits
• Hosted Chef is easier for Opscode's developers and operations team to manage.
• Private Chef is easier for Opscode's customers and our support team to manage
• Adapted tools for Open Source were released w/ Chef 11's Erlang port, too

We've adapted the '-ctl' command to other products, too.

Slide 39

Slide 39 text

Building tools
• Make applications operable
• This isn't just for operations
• It's for you.
• Future you.
• 3AM you.
http://www.flickr.com/photos/robotson/236366629/

Future you will thank you.

Slide 40

Slide 40 text

http://www.flickr.com/photos/mararie/2904598732/
If you're a consultant...
• Write tools that make it easier for your clients to operate the applications you've delivered to them.
• This is a HUGE value add. If you have a shareable toolbox you can re-deliver, all the better.
• Everyone knows this, right, but how many do it in practice?

If you're a consultant, you may not be on call. But you can deliver more value for your customers by making it easier for them to manage and operate the app that you've written for them. Partnering with an operations-focused consultant or firm can be mutually beneficial, too.

Slide 41

Slide 41 text

Another way you can make your applications operable is to write the automation code for managing and deploying them. Whether that is Chef or another tool matters less than working with the operations team to automate consistently.

Slide 42

Slide 42 text

Take aways
• Collaboration
• Responsibility
• Operability
http://www.flickr.com/photos/hades2k/5880470447/

DevOps is a professional and cultural movement. Its practices focus on the business benefits of collaboration, culture, and sharing. Follow the lead of other companies and have developers be on call for production applications. This will increase their responsibility, and their vested interest in building more robust, resilient applications. By having developers participate more actively in the operation of the site through response to outages, they will naturally help build better tools to operate the application.

Slide 43

Slide 43 text

Thank you!
Joshua Timberman @jtimberman [email protected]
Jez Humble on "DevOps Teams": http://bit.ly/devopsteam
Interview with Werner Vogels: http://queue.acm.org/detail.cfm?id=1142065