
Our On-Call Journey - From Nothing To ~Full Ownership


Follow me on a whirlwind tour of Capital One Canada's on-call journey: where we started, how we got where we are, and where we're going. You'll get to see the struggles and triumphs of bringing a full ownership culture to a large FinTech organization. Presented at PagerDuty Connect Toronto 2018.

Arthur Maltson

May 24, 2018

Transcript

  1. From Nothing To Full Ownership Our On-Call Journey Arthur Maltson

    @amaltson To almost full ownership, still a work in progress. While the view looks good from here…
  2. @amaltson But eventually we got everyone aligned, got support from

    the top and we all started marching towards a full ownership goal. But I’m getting ahead of myself, let’s start from the beginning.
  3. @amaltson The Canada Software Studio is actually very young. We

    started in February 2015 to satisfy the technology hunger of the Canada business.
  4. February 2015 @amaltson
  5. @amaltson Being new, there really wasn’t anything we were managing

    in production. It was pretty barren. But we knew out in the distance we were going to be managing some pretty big products in production. But in the meantime…
  6. @amaltson We duct-taped together a web application and an API.

    It seemed to work, so we duct-taped some deployment scripts together and shipped it…
  7. @amaltson But these applications didn’t really have a lot of

    usage, so when they were down it really wasn’t that big a deal, and we ignored them until the business complained. Probably wasn’t a great idea though…
  8. @amaltson But as time passed and those much larger applications

    with more users were looming, we recognized it was going to be a tough climb and we needed to mature and suit up…
  9. @amaltson Suit up with the proper gear. We started to

    make better use of our APM solution to do monitoring. Started to fully leverage the log aggregation system, which developers and business loved using since they can log any business metric and graph/alert on it. But ultimately we needed that safety hook, something that was going to wake us up when an incident actually happened. We needed incident response and on-call management.
  10. APM Monitoring @amaltson
  11. APM Monitoring Log Aggregation @amaltson
  12. APM Monitoring Log Aggregation @amaltson
  13. APM Monitoring Log Aggregation Incident Response/On-call @amaltson
  14. But when we looked around, we found the existing tool.

    While it worked, it was old and clunky. We didn’t want to touch it with a 10-foot pole. Fortunately, we work in a large and dynamic tech environment where different teams are trying different things. And sometimes those teams will…
  15. @amaltson … lend a helping hand. In this case, another

    team was presenting on their use of a (competing) on-call management system. We quickly reached out to them and found that they were fairly far along. They lent us a helping hand and we were able to set up the tool for the Canada Software Studio to use.
  16. APM Monitoring Log Aggregation @amaltson Incident Response/On-call But after we

    did the integrations with our tools, since it wasn’t an Enterprise-blessed tool, not all integrations worked. We fought the good fight, but we weren’t able to get all our tools working, specifically log aggregation, which the developers depended on.
  17. APM Monitoring @amaltson Incident Response/On-call
  18. @amaltson But it was better than nothing; we at least

    had some tool in hand. At the same time, as a DevOps team, we were also maturing.
  19. @amaltson We had built a mature and usable CI/CD

    tool that was no longer duct tape and bubble gum. This tool got dev teams part of the way up the mountain. It wasn’t a heated high speed gondola, but it got their applications to the Cloud.
  20. @amaltson With this confluence of tools in place, we were

    able to start presenting and advocating this idea of full ownership and maturing our production support practice. We introduced the concept of Blameless Postmortems and got feature teams involved.
  21. @amaltson But the initial reception was definitely skeptical. I mean,

    as a traditional developer you’ve never really dealt with production support. Wasn’t this Ops’ problem anyway? But our team had a secret weapon…
  22. @amaltson We were really small, and we were supporting dozens

    of engineers in the Studio. How could we possibly own and operate all the applications being produced? Fortunately, we also had leadership buy-in, and selling the idea was a bit easier because…
  23. @amaltson We would still be there, on-call with all the

    other teams to help them when they ran into infrastructure problems. The feature teams had to own the application itself, but our team would help them get comfortable with on-call.
  24. @amaltson Being on-call for the whole Studio took its toll

    too. We were paged for _every_ team’s incident. And without formal on-call, we were often the only ones picking up the call. To avoid burnout, we needed to rotate regularly.
  25. So we introduced on-call and worked it into our Foreground/Background team

    configuration. Part of the team was on Foreground to answer ad-hoc requests but also to be on-call. This left the rest of the team on Background to continue improving their tools and automation.
  26. @amaltson But we still had to work through the skepticism

    on feature teams about being on-call.
  27. @amaltson And who could blame them? It was painful to

    be on-call when not all of your tools were supported. We were still missing the log aggregation setup.
  28. @amaltson Fortunately, late last year the company bought into PagerDuty

    as an Enterprise-wide solution. With that also came a large initiative to move feature teams to full owners, i.e. “you build it, you own it”. This meant we had to migrate, but it came with Enterprise backing.
  29. @amaltson As we started to migrate though, we realized we

    got pretty used to the other Incident Management system we were using. PagerDuty looked similar, but it was subtly different.
  30. @amaltson And the various components of PagerDuty were pretty confusing

    until we saw the flow of data. A monitoring system connects to a PagerDuty Service, which unfortunately has a 1:1 mapping with an Escalation Policy, which then finds a Schedule and the person currently on-call. I say unfortunately because we wanted to keep our team on-call to help the feature teams.
  31. Fortunately, after working with PagerDuty support, we started using a

    new feature in PagerDuty called Response Plays. Response Plays can have escalation policies…
  32. And within a PagerDuty Service, you can automatically trigger a

    Response Play, which helped us achieve the feature parity we were looking for. However, after complaining to Mark Matta at PagerDuty about some of the downsides of Response Plays (they page every contact method at once), he asked the excellent question of “why are you on-call for everyone?”. Good question.
  33. APM Monitoring @amaltson Incident Response/On-call Being an Enterprise-wide solution,

    we got integrations with ALL our monitoring tools for free. Now we had log aggregation integrated, so feature teams got the tool they were comfortable with. In addition, the chat integration was even better than before, which really helped uptake.
  34. APM Monitoring Log Aggregation @amaltson Incident Response/On-call
  35. APM Monitoring Log Aggregation @amaltson Incident Response/On-call Chat Platform
  36. @amaltson And the ability to use the API to easily

    create incidents meant that we could integrate PagerDuty into shell-based batch job systems, so older technologies that can’t easily be integrated with existing monitoring tools could still benefit from PagerDuty.
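
    As an illustration (not our actual script), here is a minimal sketch of that kind of shell-based integration: a batch job's wrapper calls a small Python script on failure, and the script posts a trigger event to PagerDuty's Events API v2. The PAGERDUTY_ROUTING_KEY variable, the script name, and the job names are assumptions made up for the example.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: page on batch job failure via PagerDuty's Events API v2."""
import json
import os
import sys
import urllib.request

EVENTS_API = "https://events.pagerduty.com/v2/enqueue"


def trigger_incident(summary: str, source: str, severity: str = "error") -> None:
    # The routing key comes from a Service's Events API v2 integration;
    # PAGERDUTY_ROUTING_KEY is a made-up variable name for this sketch.
    body = {
        "routing_key": os.environ["PAGERDUTY_ROUTING_KEY"],
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    req = urllib.request.Request(
        EVENTS_API,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The Events API answers 202 Accepted on success.
        print(resp.status, resp.read().decode("utf-8"))


if __name__ == "__main__":
    # e.g. from a cron/shell wrapper:
    #   ./nightly_batch.sh || python3 page_on_failure.py "nightly batch failed"
    summary = sys.argv[1] if len(sys.argv) > 1 else "Batch job failed"
    trigger_incident(summary, source=os.uname().nodename)
```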
  37. @amaltson With marching orders from the top to take on

    full ownership, we’re definitely moving in the right direction. This sets us up to eventually let go.
  38. @amaltson But before we let go we need to set up

    teams for success. We need to improve the troubleshooting tools and make it easier for feature teams to understand where the problem is. We need an incident response checklist, because during stressful events humans tend to forget steps in a response. You want to make sure the roles are defined, i.e. what an incident commander is and what they do, etc. Ultimately, the best way to prepare for events is to practice. This is where Game Days provide a structured, regular exercise of incident response, and something we strive to do. The importance of practice was best illustrated when I was at a fire station with my daughters for a birthday, and during the tour they had a call. The firefighters were out in ~2 minutes, clearly well practiced. There’s a lot we can learn from emergency response. After doing some initial configuration, we gave feature teams full access to manage their PagerDuty setup, but we have found that consistency breaks down. This is where you need automation to keep consistency; a small sketch of such an automation follows the slide build-up below.
  39. Better Troubleshooting Tools @amaltson
  40. Better Troubleshooting Tools Incident Response Checklist @amaltson
  41. Better Troubleshooting Tools Incident Response Checklist Define Roles @amaltson
  42. Better Troubleshooting Tools Incident Response Checklist Define Roles Game Days @amaltson
  43. Better Troubleshooting Tools Incident Response Checklist Define Roles Game Days PagerDuty Automation @amaltson
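
    To make the “PagerDuty Automation” point concrete, here is a minimal audit sketch (an assumption for illustration, not our production tooling): it pages through Services with PagerDuty’s REST API and flags drift from a team convention. The “ca-” naming rule and the PAGERDUTY_API_KEY variable are invented for the example; in practice this kind of check, or full configuration management with something like the Terraform PagerDuty provider, is what keeps dozens of team-managed setups consistent.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: audit PagerDuty Services for drift from team conventions."""
import json
import os
import urllib.request

API = "https://api.pagerduty.com"
HEADERS = {
    # PAGERDUTY_API_KEY is a made-up variable name; a read-only REST API key works here.
    "Authorization": f"Token token={os.environ['PAGERDUTY_API_KEY']}",
    "Accept": "application/vnd.pagerduty+json;version=2",
}


def list_services():
    """Page through /services and yield each service object."""
    offset = 0
    while True:
        url = f"{API}/services?limit=100&offset={offset}"
        with urllib.request.urlopen(urllib.request.Request(url, headers=HEADERS)) as resp:
            page = json.load(resp)
        yield from page["services"]
        if not page.get("more"):
            return
        offset += len(page["services"])


def audit() -> None:
    """Flag services that drift from our (made-up) conventions."""
    for svc in list_services():
        problems = []
        if not svc["name"].startswith("ca-"):  # assumed studio-wide naming convention
            problems.append("name missing 'ca-' prefix")
        if not svc.get("escalation_policy"):
            problems.append("no escalation policy attached")
        if problems:
            print(f"{svc['name']}: {', '.join(problems)}")


if __name__ == "__main__":
    audit()
```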
  44. @amaltson With all those steps in place, we might be

    able to ascend together to the peak. But it’s important to keep in mind that we’re continuously improving; there’s always going to be another peak to ascend.
  45. Lessons Learnt Proactively address production issues, don’t ignore them Use

    the tools you have, especially (good) Enterprise standards @amaltson
  46. Lessons Learnt Proactively address production issues, don’t ignore them Use

    the tools you have, especially (good) Enterprise standards Get leadership buy-in @amaltson
  47. Lessons Learnt Proactively address production issues, don’t ignore them Use

    the tools you have, especially (good) Enterprise standards Get leadership buy-in Advocate and level up others @amaltson
  48. Lessons Learnt Proactively address production issues, don’t ignore them Use

    the tools you have, especially (good) Enterprise standards Get leadership buy-in Advocate and level up others Fill in the gaps as you transition @amaltson
  49. Lessons Learnt Proactively address production issues, don’t ignore them Use

    the tools you have, especially (good) Enterprise standards Get leadership buy-in Advocate and level up others Fill in the gaps as you transition Let go @amaltson
  50. What Rethinking Code Reuse in Software Design - Leveraging Microservices

    for Experimentation When Wednesday, May 30th, 2018 6:00 pm @amaltson
  51. What Rethinking Code Reuse in Software Design - Leveraging Microservices

    for Experimentation When Wednesday, May 30th, 2018 6:00 pm Where 2nd Floor Events 461 King St W Toronto, ON M5V 1K4 @amaltson
  52. What Rethinking Code Reuse in Software Design - Leveraging Microservices

    for Experimentation When Wednesday, May 30th, 2018 6:00 pm Where 2nd Floor Events 461 King St W Toronto, ON M5V 1K4 How $5 to Holland Bloorview @amaltson
  53. What Rethinking Code Reuse in Software Design - Leveraging Microservices

    for Experimentation When Wednesday, May 30th, 2018 6:00 pm Where 2nd Floor Events 461 King St W Toronto, ON M5V 1K4 How $5 to Holland Bloorview bit.ly/c1-tech-series @amaltson
  54. ARTHUR MALTSON Slides: https://speakerdeck.com/amaltson Stats: 70% Dev / 30% Ops,

    110% DadOps Work: Capital One Canada Loves: Automation, Ruby, Ansible, Terraform Hates: Manual processes @amaltson maltson.com
  55. CREDITS ▸ Slide 1, 2, 41 - twiga269 ॐ FEMEN,

    Let's go down, facing Condoriri, https://flic.kr/p/6ESGGq ▸ Slide 3, 37 - Icelandic Air Policing, USAFE AFAFRICA, https://flic.kr/p/nFNKSg ▸ Slide 4, 17, 38 - Mark Somerville, Ric Waterton abseiling off Freak Out, Glen Coe, https://flic.kr/p/b3KaC ▸ Slide 5, 39 - 180124-M-DV652-0008, Alaskan Command, https://flic.kr/p/H25ZXQ ▸ Slide 6 - rithban, Baby Swallows, https://flic.kr/p/8gvnmE ▸ Slide 7 - craig Cloutier, American Salt, https://flic.kr/p/8C2K6B ▸ Slide 8 - DevIQ, Duct Tape Coder, http://deviq.com/duct-tape-coder/ ▸ Slide 9 - Know Your Meme, Old Man Yells At Cloud, http://knowyourmeme.com/photos/1044247-old-man-yells-at-cloud ▸ Slide 10 - Vox, 2016 update from “this is fine” dog: things are not, in fact, fine, https://www.vox.com/2016/8/3/12368874/this-is-fine-dog-meme-update ▸ Slide 11 - Frans de Wit, Tour de Suisse par l’Extérieur, https://flic.kr/p/oz6ZQq ▸ Slide 12 - Chris Hunter, Pinterest, https://www.pinterest.com/pin/530017449888367258/ ▸ Slide 13, 16, 33 - john skewes, KXJS2388 - Aitkenvale 2011 Climbing gear, https://flic.kr/p/jxhbuj ▸ Slide 14 - stanze, rusty, https://flic.kr/p/HegXWA ▸ Slide 15 - We Love Cats and Kittens, Black Cats Are Awesome – 31 October 2016, https://welovecatsandkittens.com/cat-pictures/black-cats-are-awesome-31-october-2016/ ▸ Slide 18 - Nic Redhead, ski lift to heaven, https://flic.kr/p/9gt36M ▸ Slide 20, 25, 29 - imgflip, Futurama Fry Meme Generator, https://imgflip.com/memegenerator/Futurama-Fry ▸ Slide 26 - whizchickenonabun, hurt, https://flic.kr/p/iEYok ▸ Slide 30 - PagerDuty, PagerDuty Concepts Visualized, https://community.pagerduty.com/t/pagerduty-concepts-visualized/215 ▸ Slide 35 - David Fisher, Old servers, https://flic.kr/p/9FacEL ▸ Slide 36 - Moyan Brenn, Happiness, https://flic.kr/p/nMmBGs ▸ Slide 40 - cabodevassoura, Guard Rail, https://flic.kr/p/q8eHhd ▸ Slide 46 - Nate Grigg, Thank You, https://flic.kr/p/6K41qv ▸ Others: DepositPhotos @amaltson