

Our On-Call Journey - From Nothing To ~Full Ownership

Follow me on a whirlwind tour of Capital One Canada's on-call journey: where we started, how we got to where we are, and where we're going. You'll see the struggles and triumphs of bringing a full ownership culture to a large FinTech organization. Presented at PagerDuty Connect Toronto 2018.

Arthur Maltson

May 24, 2018

Transcript

  1. From Nothing To Full Ownership Our On-Call Journey Arthur Maltson

@amaltson To almost full ownership, still a work in progress. While the view looks good from here…
  2. @amaltson But eventually we got everyone aligned, got support from

the top, and we all started marching towards a full ownership goal. But I'm getting ahead of myself; let's start from the beginning.
  3–4. February 2015 @amaltson The Canada Software Studio is actually very young. We started in February 2015 to satisfy the technology hunger of the Canada business.
  5. @amaltson Being new, there really wasn’t anything we were managing

    in production. It was pretty barren. But we knew out in the distance we were going to be managing some pretty big products in production. But in the meantime…
  6. @amaltson We duct-taped together some web application and API. It seemed to work, so we duct-taped together some deployment scripts and shipped it…
  7. @amaltson But these applications didn't really have a lot of usage, so when they were down it really wasn't that big a deal, and we ignored it until the business complained. Probably wasn't a great idea though…
  8. @amaltson But as time passed and those much larger applications with more users were looming, we recognized it was going to be a tough climb and we needed to mature and suit up…
  9–13. APM Monitoring · Log Aggregation · Incident Response/On-call @amaltson Suit up with the proper gear. We started to make better use of our APM solution to do monitoring. We started to fully leverage the log aggregation system, which developers and the business loved using, since they could log any business metric and graph/alert on it. But ultimately we needed that safety hook, something that was going to wake us up when an incident actually happened. We needed incident response and on-call management.
  14. But when we looked around, we found the existing tool. While it worked, it was old and clunky. We didn't want to touch it with a ten-foot pole. Fortunately, we work in a large and dynamic tech environment where different teams are trying different things. And sometimes those teams will…
  15. @amaltson … lend a helping hand. In this case, another team was presenting on their use of a (competing) on-call management system. We quickly reached out to them and found that they were fairly far along. They lent us a helping hand, and we were able to set up the tool for the Canada Software Studio to use.
  16–17. APM Monitoring · Log Aggregation · Incident Response/On-call @amaltson But after we did the integrations with our tools, since it wasn't an Enterprise-blessed tool, not all integrations worked. We fought the good fight, but we weren't able to get all our tools working, specifically log aggregation, which the developers depended on.
  18. @amaltson But it was better than nothing; we at least had some tool in hand. At the same time, as a DevOps team, we were also maturing.
  19. @amaltson We had built a mature and usable CI/CD tool that was no longer duct tape and bubble gum. This tool got dev teams part of the way up the mountain. It wasn't a heated high-speed gondola, but it got their applications to the Cloud.
  20. @amaltson With this confluence of tools in place, we were

    able to start presenting and advocating this idea of full ownership and maturing our production support practice. We introduced the concept of Blameless Postmortems and got feature teams involved.
  21. @amaltson But the initial reception was definitely skeptical. I mean,

    as a traditional developer you’ve never really dealt with production support. Wasn’t this Ops’ problem anyway? But our team had a secret weapon…
  22. @amaltson We were really small, and we were supporting dozens

of engineers in the Studio. How could we possibly own and operate all the applications being produced? Fortunately, we also had leadership buy-in, and selling the idea was a bit easier because…
  23. @amaltson We would still be there, on-call with all the

other teams to help them when they ran into infrastructure problems. The feature teams had to own the application itself, but our team would help them get comfortable with on-call.
  24. @amaltson Being on-call for the whole Studio takes its toll

too. We were paged for _every_ team's incident. And without formal on-call, we were often the only ones picking up the call. To avoid burnout, we needed to rotate regularly.
  25. So we worked on-call into our Foreground/Background team configuration. Part of the team was on Foreground to answer ad-hoc requests but also to be on-call. This left the rest of the team on Background to continue improving their tools and automation.
  26. @amaltson But we still had to work through the skepticism

among feature teams about being on-call.
  27. @amaltson And who could blame them? It was painful to be on-call when not all your tools were supported. We were still missing the log aggregation setup.
  28. @amaltson Fortunately, late last year the company bought into PagerDuty

as an Enterprise-wide solution. With that also came a large initiative to move feature teams to full owners, i.e. “you build it, you own it”. This meant we had to migrate, but it came with Enterprise backing.
  29. @amaltson As we started to migrate, though, we realized we had gotten pretty used to the other incident management system we were using. PagerDuty looked similar, but it was subtly different.
  30. @amaltson And the various components of PagerDuty were pretty confusing until we saw the flow of data. A monitoring system connects to a PagerDuty Service, which unfortunately has a 1:1 mapping with an Escalation Policy, which in turn finds a Schedule and the person currently on-call. I say unfortunately because we wanted to keep our team on-call to help the feature teams.
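To make that flow concrete, here is a minimal sketch (assuming PagerDuty's REST API v2 and the Python requests library; the API key and Service ID are placeholders, not anything shown in the deck) that walks the same chain: Service → Escalation Policy → whoever is on-call right now.

```python
# Sketch: trace a PagerDuty Service to the person currently on-call.
import os
import requests

PD_API_KEY = os.environ["PD_API_KEY"]  # placeholder: read-only REST API key
SERVICE_ID = "PXXXXXX"                 # hypothetical Service ID
HEADERS = {
    "Authorization": f"Token token={PD_API_KEY}",
    "Accept": "application/vnd.pagerduty+json;version=2",
}

# A Service points at exactly one Escalation Policy (the 1:1 mapping above).
service = requests.get(
    f"https://api.pagerduty.com/services/{SERVICE_ID}", headers=HEADERS
).json()["service"]
policy_id = service["escalation_policy"]["id"]

# The Escalation Policy resolves, through its schedules, to whoever is on-call.
oncalls = requests.get(
    "https://api.pagerduty.com/oncalls",
    headers=HEADERS,
    params={"escalation_policy_ids[]": policy_id},
).json()["oncalls"]

for oncall in oncalls:
    print(oncall["escalation_level"], oncall["user"]["summary"])
```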
  31. Fortunately, after working with PagerDuty support, we started using a

    new feature in PagerDuty called Response Plays. Response Plays can have escalation policies…
  32. And within a PagerDuty Service, you can automatically trigger a Response Play, which helped us achieve the feature parity we were looking for. However, when I complained about some of the downsides of Response Plays (they page every contact method at once) to Mark Matta at PagerDuty, he asked the excellent question: “why are you on-call for everyone?”. Good question.
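Response Plays can also be run on demand against an incident through the REST API. A hedged sketch, assuming the `/response_plays/{id}/run` endpoint, a requester email in the `From` header, and placeholder IDs (none of this is from the deck):

```python
# Sketch: manually run a Response Play against an existing incident.
import os
import requests

HEADERS = {
    "Authorization": f"Token token={os.environ['PD_API_KEY']}",
    "Accept": "application/vnd.pagerduty+json;version=2",
    "From": "oncall-lead@example.com",  # requester email (assumed requirement)
}

RESPONSE_PLAY_ID = "PPLAY01"  # hypothetical Response Play ID
INCIDENT_ID = "PINC123"       # hypothetical incident ID

# Run the Response Play; endpoint and payload shape are assumptions.
resp = requests.post(
    f"https://api.pagerduty.com/response_plays/{RESPONSE_PLAY_ID}/run",
    headers=HEADERS,
    json={"incident": {"id": INCIDENT_ID, "type": "incident_reference"}},
    timeout=10,
)
resp.raise_for_status()
```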
  33–35. APM Monitoring · Log Aggregation · Incident Response/On-call · Chat Platform @amaltson Being an Enterprise-wide solution, we got integrations with all our monitoring tools for free. Now we had log aggregation integrated, so feature teams got the tool they were comfortable with. In addition, the chat integration was even better than before, which really helped uptake.
  36. @amaltson And the ability to use the API to easily create incidents meant that we could integrate PagerDuty into shell-based batch job systems, so older technologies that can't easily be integrated with existing monitoring tools could still benefit from PagerDuty.
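As a sketch of that pattern (the routing key, hostname, and job path below are placeholders, not ours), a batch-job wrapper can post to PagerDuty's Events API v2 whenever a job exits non-zero:

```python
# Sketch: wrap a legacy batch job and page on-call when it fails.
import subprocess
import requests

ROUTING_KEY = "YOUR_EVENTS_V2_INTEGRATION_KEY"  # placeholder integration key

def run_batch_job(cmd):
    """Run a legacy batch job and trigger a PagerDuty incident on failure."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return
    # Non-zero exit: trigger an incident via the Events API v2.
    requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": ROUTING_KEY,
            "event_action": "trigger",
            "payload": {
                "summary": f"Batch job failed: {' '.join(cmd)}",
                "source": "legacy-batch-host",  # placeholder hostname
                "severity": "error",
            },
        },
        timeout=10,
    )

run_batch_job(["/opt/batch/nightly-report.sh"])  # hypothetical job
```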
  37. @amaltson With marching orders from the top to take on

full ownership, we're definitely moving in the right direction. This sets us up to eventually let go.
  38–43. Better Troubleshooting Tools · Incident Response Checklist · Define Roles · Game Days · PagerDuty Automation @amaltson But before we let go we need to set up teams for success. We need to improve the troubleshooting tools and make it easier for feature teams to understand where the problem is. We need an incident response checklist, because during stressful events humans tend to forget steps in a response. You want to make sure the roles are defined, e.g. what an incident commander is and what they do. Ultimately, the best way to prepare for events is to practice; this is where Game Days provide a structured, regular exercise of incident response, and something we strive to do. The importance of practice was best illustrated when I was at a fire station with my daughters for a birthday, and during the tour they had a call. The firefighters were out in ~2 minutes, clearly well practiced. There's a lot we can learn from emergency response. After doing some initial configuration, we gave feature teams full access to manage their PagerDuty setup, but we have found that consistency breaks down. This is where you need automation to keep consistency.
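On the automation point, one way to keep things consistent is a small audit script over the REST API; the conventions checked below (an auto-resolve timeout and a "ca-" name prefix) are invented examples rather than our actual rules.

```python
# Sketch: audit every PagerDuty Service for a few studio-wide conventions.
import os
import requests

HEADERS = {
    "Authorization": f"Token token={os.environ['PD_API_KEY']}",
    "Accept": "application/vnd.pagerduty+json;version=2",
}

def list_services():
    """Page through every PagerDuty Service via the REST API."""
    offset, more = 0, True
    while more:
        page = requests.get(
            "https://api.pagerduty.com/services",
            headers=HEADERS,
            params={"limit": 100, "offset": offset},
        ).json()
        yield from page["services"]
        more, offset = page.get("more", False), offset + 100

for svc in list_services():
    problems = []
    # Invented convention 1: every service auto-resolves stale incidents.
    if not svc.get("auto_resolve_timeout"):
        problems.append("no auto-resolve timeout")
    # Invented convention 2: service names carry a studio prefix like "ca-".
    if not svc["name"].startswith("ca-"):
        problems.append("name missing 'ca-' prefix")
    if problems:
        print(f"{svc['name']}: {', '.join(problems)}")
```

Run on a schedule, a report like this catches configuration drift before it matters during an incident.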
  44. @amaltson With all those steps in place, we might be

able to ascend together to the peak. But it's important to keep in mind that we're continuously improving; there's always going to be another peak to ascend.
  45–49. Lessons Learnt @amaltson
    Proactively address production issues, don't ignore them
    Use the tools you have, especially (good) Enterprise standards
    Get leadership buy-in
    Advocate and level up others
    Fill in the gaps as you transition
    Let go
  50–53. What: Rethinking Code Reuse in Software Design - Leveraging Microservices for Experimentation · When: Wednesday, May 30th, 2018, 6:00 pm · Where: 2nd Floor Events, 461 King St W, Toronto, ON M5V 1K4 · How: $5 to Holland Bloorview · bit.ly/c1-tech-series @amaltson
  54. ARTHUR MALTSON Slides: https://speakerdeck.com/amaltson Stats: 70% Dev / 30% Ops,

    110% DadOps Work: Capital One Canada Loves: Automation, Ruby, Ansible, Terraform Hates: Manual processes @amaltson maltson.com
  55. CREDITS
    ▸ Slide 1, 2, 41 - twiga269 ॐ FEMEN, Let's go down, facing Condoriri, https://flic.kr/p/6ESGGq
    ▸ Slide 3, 37 - Icelandic Air Policing, USAFE AFAFRICA, https://flic.kr/p/nFNKSg
    ▸ Slide 4, 17, 38 - Mark Somerville, Ric Waterton abseiling off Freak Out, Glen Coe, https://flic.kr/p/b3KaC
    ▸ Slide 5, 39 - 180124-M-DV652-0008, Alaskan Command, https://flic.kr/p/H25ZXQ
    ▸ Slide 6 - rithban, Baby Swallows, https://flic.kr/p/8gvnmE
    ▸ Slide 7 - craig Cloutier, American Salt, https://flic.kr/p/8C2K6B
    ▸ Slide 8 - DevIQ, Duct Tape Coder, http://deviq.com/duct-tape-coder/
    ▸ Slide 9 - Know Your Meme, Old Man Yells At Cloud, http://knowyourmeme.com/photos/1044247-old-man-yells-at-cloud
    ▸ Slide 10 - Vox, 2016 update from “this is fine” dog: things are not, in fact, fine, https://www.vox.com/2016/8/3/12368874/this-is-fine-dog-meme-update
    ▸ Slide 11 - Frans de Wit, Tour de Suisse par l’Extérieur, https://flic.kr/p/oz6ZQq
    ▸ Slide 12 - Chris Hunter, Pinterest, https://www.pinterest.com/pin/530017449888367258/
    ▸ Slide 13, 16, 33 - john skewes, KXJS2388 - Aitkenvale 2011 Climbing gear, https://flic.kr/p/jxhbuj
    ▸ Slide 14 - stanze, rusty, https://flic.kr/p/HegXWA
    ▸ Slide 15 - We Love Cats and Kittens, Black Cats Are Awesome – 31 October 2016, https://welovecatsandkittens.com/cat-pictures/black-cats-are-awesome-31-october-2016/
    ▸ Slide 18 - Nic Redhead, ski lift to heaven, https://flic.kr/p/9gt36M
    ▸ Slide 20, 25, 29 - imgflip, Futurama Fry Meme Generator, https://imgflip.com/memegenerator/Futurama-Fry
    ▸ Slide 26 - whizchickenonabun, hurt, https://flic.kr/p/iEYok
    ▸ Slide 30 - PagerDuty, PagerDuty Concepts Visualized, https://community.pagerduty.com/t/pagerduty-concepts-visualized/215
    ▸ Slide 35 - David Fisher, Old servers, https://flic.kr/p/9FacEL
    ▸ Slide 36 - Moyan Brenn, Happiness, https://flic.kr/p/nMmBGs
    ▸ Slide 40 - cabodevassoura, Guard Rail, https://flic.kr/p/q8eHhd
    ▸ Slide 46 - Nate Grigg, Thank You, https://flic.kr/p/6K41qv
    ▸ Others: DepositPhotos
    @amaltson