You Should Be On Call, Too (MWRC)

Slides for the talk I gave at MountainWest RubyConf 2013.

Joshua Timberman

April 12, 2013

Transcript

  1. You Should Be On Call, too Joshua Timberman @jtimberman [email protected]

    1 Friday, April 12, 13 Earlier Mike said that speakers spend 6-8 weeks working on their presentations. Unfortunately most speakers only knew their talk was accepted 3 weeks ago, so they wrote it on the plane to the conference, which is why they're nervous. The timing for this talk in the schedule is fortunate, as I'm sure everyone really just wants to hear Jesse talk about ChatOps at GitHub. I do.
  2. % whoami • I work for Opscode • Community Manager

    • System Administrator • Father, Gamer, CrossFitter totally legit mustache! Who am I, and how am I qualified to talk about this to you? I work for Opscode, a company that makes some automation software for operations teams and developers you might have heard of. In my role, I am a technical community manager. Basically, I write cookbooks and help others do so. I'm also a system administrator. While I'm not on call at Opscode, I have been on call for the majority of my career. And in a way, I kind of am on call for everyone's infrastructure that uses Chef and Opscode cookbooks, since I participate in front-line community support via mailing lists, IRC, and Twitter. You just don't get my phone number. Though, it's on my business card. I'm also a man of many interests: I like video and tabletop games, brewing my own beer, and I'm a husband and father. My career as a system administrator has caused many interruptions in these areas, of course.
  3. Who are you? (Show of hands) • Sysadmins? • Developers?

    • Business people? (Consultants?) • On call (for production)? http://www.flickr.com/photos/timyates/2854357446/ One thing I like to do is get an idea of whom I'm talking to.
  4. Let's talk about ... http://www.flickr.com/photos/huffstutterrobertl/7195106982/

    Okay, now that the introduction is out of the way, let's talk about...
  5. Chef! Amusingly enough, I submitted

    a few talks about Chef for this conference. But the organizers wanted to hear what I have to say about....
  6. How software developers should be

    on call. After all, one of the essential parts of DevOps is that Developers should carry pagers, right? That's culture?
  7. No deploys past this point

    Operations people get into roadblock mode when all they get is code that brings the site down and gets them paged at 2am. Whether that is reality or survivorship bias doesn't matter. This doesn't work either.
  8. ♥ Dev+Ops Culture Slide Courtesy of John Allspaw - http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr

    What we need is collaboration, and sharing. Not sharing "on call" but sharing the responsibility for the applications "you" write and "we" run.
  9. About silos, and separation of duties http://www.flickr.com/photos/lostvegas/373800727/ Jez Humble: http://bit.ly/devopsteam

    Let me tell you about silos, and the separation of duties. I used to work for a large enterprise IT services company.
  10. http://www.flickr.com/photos/cdevers/5777944933/ IT Services We provided

    system administration services to other large enterprise corporations in our little slice of hosting. This was supposedly for "separation of duties", change management/control (ITIL, COBIT), or some other business reason, but what it meant was siloization, a practice unheard of in startups and other small companies, right?
  11. The team I was on

    had an on call "hotpager" rotation. Each person had their own primary accounts they worked on, but one person per week was the first contact. We were all system administrators responsible only for the OS: filesystems, network services, host-based firewall rules, security policy, user management, that kind of thing. Everyone had one of these pagers, because a) it was the early 2000s, so cell phones weren't as widespread yet, and b) cell phones didn't get reception in the data centers (or weren't allowed, but that's another story).
  12. A typical problem... • Help desk receives "disk full" alert

    • Help desk pages "on call" • On call looks at system • On call determines alert is a full filesystem for customer data • On call pages primary sysadmin • Primary sysadmin looks at system • Primary sysadmin doesn't know which of the giant log files is safe to delete (oldest, right? Maybe!) • Primary sysadmin pages application support (sometimes the customer) This is probably the most common example.
  13. Metrics are awesome! https://github.com/obfuscurity/tasseo https://github.com/obfuscurity/descartes

    Don't get me wrong, metrics are awesome.
  14. But... • Metric-based alerts generally suck • Gross generalization...

    • But most of the time, this alerting is non-actionable • It isn't necessarily indicative of the problem • If it is, it's not clear why, or necessarily where to look from the outside. Everyone loves metrics and graphs. #monitoringlove is here to stay. But they're not super helpful without context. Sometimes, developers know the context best, when it wouldn't be obvious to anyone else. "Oh, yeah, sometimes the CPU usage goes up. There's an I/O deadlock due to a janson rod misalignment."
  15. If it's an application outage? http://www.flickr.com/photos/eschipul/4610999148/

    What if the problem is that the application isn't starting up? Or if it's starting, it's not connecting to the database? I wasn't the DBA; I didn't know we had a schema update, or how to recover a bad partition tablespace tuple.
  16. I can watch dashboards and

    trends, but I don't necessarily know what is "normal" for an app, or what nuances to look for. Developers do. This is exacerbated by the high turnover rate seen in operations positions. That is, I know a lot of sysadmins who don't stay at a company more than a year or so. That means retraining on new applications all the time.
  17. Companies Studied • Opscode • Heroku • Etsy • SourceFire

    • CustomInk http://www.flickr.com/photos/g_kat26/4060301657/ "You build it, you run it." - Werner Vogels, Amazon I talked to operations managers or team leads at some companies that have "Developers on-call" policies. Or did at the time :-). Except Amazon. That is a quote from Werner Vogels, Amazon CTO, but it has been confirmed by former Amazon employees who work, or worked, at Opscode, like our own CTO Chris Brown, or founding CEO Jesse Robbins.
  18. Overall Results • More developers than sysadmins • Developers have

    application domain knowledge • People like accountability, responsibility! • People don't like stress "We found that when we woke up developers at 2am, defects got fixed faster than ever" - Patrick Lightbody, CEO, BrowserMob I think it is safe to say that in most companies doing web operations, there are more developers than sysadmins. In most companies I've worked in, or with, there are more developers than sysadmins across the board. These developers bring valuable application domain knowledge to the table, since generally speaking, they *wrote* the application. Also, as it turns out, at least in the companies I spoke to, people actually like the accountability and responsibility. It's empowering. What I mean by people don't like stress is twofold. First, with more developers, more people share the load of on-call rotations. This helps the operations people not be stressed out, which will improve team morale, and improve the culture.
  19. Overall Results • Nagios + PagerDuty are popular • Operations/sysadmin

    teams are escalation • Greater collaboration • Team building, learning/knowledge transfer Unsurprisingly, Nagios is the most popular tool for alerting. Knowing how it works is useful, and it's a great help to operating the application if those who write it also write the checks for it. PagerDuty is likewise popular for actually managing the alerts and escalation. In all the companies I talked to, the operations team gets escalated to for resolving issues/outages that are beyond the application (network issues, firewall configuration, third party services). Collaboration between team members increased because developers worked more closely with operations, both in making the application easier to manage and in the event of an outage. This naturally leads to team building and learning/knowledge transfer.
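
As an illustration of developers writing checks for their own application, here is a minimal sketch of a check that follows the Nagios plugin convention of mapping a result to an exit status (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The queue-depth metric, the method names, and the thresholds are all assumptions made up for the example; they do not come from the talk.

```ruby
# Nagios plugin convention: the exit status tells the monitoring system how
# serious the result is, and the first line of output is shown in the alert.
EXIT_CODES = { ok: 0, warning: 1, critical: 2, unknown: 3 }.freeze

# Hypothetical check: classify the depth of an application work queue.
# The thresholds are made-up defaults; the developers who know what is
# "normal" for the app are the right people to choose them.
def check_queue_depth(depth, warn_at: 100, crit_at: 500)
  return [:unknown, "UNKNOWN - queue depth unavailable"] if depth.nil?
  return [:critical, "CRITICAL - #{depth} jobs queued (limit #{crit_at})"] if depth >= crit_at
  return [:warning, "WARNING - #{depth} jobs queued (limit #{warn_at})"] if depth >= warn_at

  [:ok, "OK - #{depth} jobs queued"]
end

# Translate the symbolic status into the numeric exit code plus message.
def run_check(depth)
  status, message = check_queue_depth(depth)
  [EXIT_CODES.fetch(status), message]
end
```

A wrapper script would print the message and call `exit` with the returned code. Keeping the classification in a plain method makes the thresholds easy to unit test alongside the application code.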
  20. Show of hands, do you... • Write application code? •

    Write application code that runs in production? • Write application code for clients? (consultants) • Are the first to get paged/called if there's a problem?
  21. Instrument your application http://www.flickr.com/photos/benbunch/6081948074/ Do

    yourself, your operations team, and your clients a favor, and instrument your application.
  22. Instrumentation • Monitoring tools/services • Heaps of blog posts •

    Other talks? • Coda Hale's metrics library? https://github.com/codahale/metrics https://github.com/johnewart/ruby-metrics I don't have specific advice on instrumentation. There's a lot of material about this, including talks at other Ruby conferences. Coda Hale's metrics library for Java has inspired a lot of people to build similar libraries for other languages.
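
To make "instrument your application" a little more concrete, here is a minimal sketch of in-process counters and timers using only the Ruby standard library. The `Metrics` class and the metric names are hypothetical; this is not the API of Coda Hale's library or of ruby-metrics, just the general shape such libraries tend to share.

```ruby
# A tiny in-process metrics registry (hypothetical; not the API of any
# particular metrics library).
class Metrics
  def initialize
    @counters = Hash.new(0)
    @timings = Hash.new { |hash, key| hash[key] = [] }
    @lock = Mutex.new
  end

  # Count an event, such as a request or an error.
  def increment(name, by = 1)
    @lock.synchronize { @counters[name] += by }
  end

  # Run the block, record how long it took in seconds, and return its result.
  def time(name)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    @lock.synchronize { @timings[name] << elapsed }
  end

  # Snapshot suitable for logging or exposing on a status endpoint.
  def snapshot
    @lock.synchronize do
      timings = @timings.map do |name, samples|
        [name, { count: samples.size, max: samples.max, mean: samples.sum / samples.size }]
      end
      { counters: @counters.dup, timings: timings.to_h }
    end
  end
end

# Example use inside application code (metric names are made up).
metrics = Metrics.new
metrics.increment("orders.created")
metrics.time("orders.db_write") { :saved }
```

The snapshot is what an operations team would scrape or graph; the developers decide which events and code paths are worth naming.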
  23. http://www.flickr.com/photos/peteredin/3174490145/ Make your application operable

    Again, this is for you, your operations team, and your clients. Make the application operable. Instrumentation is a good step, but so are good tools.
  24. Opscode manages a production application,

    a Chef Server as a Service, called Opscode Hosted Chef.
  25. Opscode builds tools • Hosted Chef is an SOA •

    Private Chef is built on Hosted Chef, basically • There are a lot of moving parts • Enter `private-chef-ctl` We have Chef, and we leverage that for managing the Chef Servers.
  26. private-chef-ctl sudo private-chef-ctl service-list sudo private-chef-ctl tail sudo private-chef-ctl reconfigure

    sudo private-chef-ctl restart nagios The private-chef-ctl command makes Private Chef operable.
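
The same idea can be applied to your own application. Below is a rough sketch of a `-ctl` style dispatcher in Ruby. It is not how `private-chef-ctl` is implemented: the `Ctl` class, the `myapp` name, the service list, the log path, and the underlying commands are all placeholders chosen for illustration.

```ruby
# Hypothetical "myapp-ctl" dispatcher: one entry point for common operational
# tasks. Service names and the commands behind each task are placeholders.
class Ctl
  SERVICES = %w[web worker].freeze

  # The runner is injectable so the dispatcher can be tested without
  # actually running system commands.
  def initialize(runner: ->(cmd) { system(*cmd) })
    @runner = runner
  end

  # Dispatch on the first argument; returns true on success, false otherwise.
  def call(argv)
    command, service = argv
    case command
    when "service-list"
      SERVICES.each { |name| puts name }
      true
    when "restart"
      return usage("unknown service: #{service.inspect}") unless SERVICES.include?(service)

      @runner.call(["service", "myapp-#{service}", "restart"])
    when "tail"
      @runner.call(["tail", "-f", "/var/log/myapp/current"])
    else
      usage("unknown command: #{command.inspect}")
    end
  end

  private

  # Report the problem on stderr and signal failure to the caller.
  def usage(problem)
    warn "#{problem}\nusage: myapp-ctl service-list | restart SERVICE | tail"
    false
  end
end
```

An executable script would call `Ctl.new.call(ARGV)` and exit non-zero when it returns false. The point is less the code than the single, memorable entry point for whoever is holding the pager.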
  27. Benefits • Hosted Chef is easier for Opscode's developers and

    operations team to manage. • Private Chef is easier for Opscode's customers and our support team to manage • Adapted tools for Open Source were released w/ Chef 11's Erlang port, too We've adapted the '-ctl' command to other products, too.
  28. Building tools • Make applications operable • This isn't just

    for operations • It's for you. • Future you. • 3AM you. http://www.flickr.com/photos/robotson/236366629/ Future you will thank you.
  29. http://www.flickr.com/photos/mararie/2904598732/ If you're a consultant... • Write tools that make

    it easier for your clients to operate the applications you've delivered to them. • This is a HUGE value add. If you have a shareable toolbox you can re-deliver, all the better. • Everyone knows this, right, but how many do it in practice? If you're a consultant, you may not be on call. But you can deliver more value for your customers by making it easier for them to manage and operate the app that you've written for them. Partnering with an operations-focused consultant/firm can be mutually beneficial, too.
  30. Another way you can make

    your applications operable is to write the automation code for managing/deploying them. Whether that is Chef or another tool matters less than working with the operations team to automate consistently.
  31. Takeaways • Collaboration • Responsibility • Operability http://www.flickr.com/photos/hades2k/5880470447/

    DevOps is a professional and cultural movement. Its practices focus on the business benefits of collaboration, culture, and sharing. Follow the lead from other companies and have developers be on call for production applications. This will increase their responsibility, and vested interest in building more robust, resilient applications. By having developers participate more actively in the operation of the site through response to outages, they will naturally help build better tools to operate the application.
  32. Thank you! Joshua Timberman @jtimberman [email protected] Jez Humble on "DevOps

    Teams": http://bit.ly/devopsteam Interview with Werner Vogels: http://queue.acm.org/detail.cfm?id=1142065