A DevOps story - Turning our most significant production outage into a driver for positive lasting change.

A DevOps story
How we turned our most significant outage into lasting positive change!
@martincronje

💡 Cohesive collaborative teams with end-to-end ownership of their services
(build it, run it)

2000s: First online product launched | Second major iteration launched

Writing code during the early 2000s

❤ The heart of the product was built in the 2000s

💡 All companies have heritage systems that enabled them to
become successful businesses!

Question: What is the heritage in your environment?

2000s: First online product launched | Second major iteration launched | Starts building innovative user experience
2010s: Moved to the cloud | Disrupts market and expands globally

System Context (diagram): Original UI; Core Product (Monolith) 🥰

System Context (diagram): New API; Original UI; four UIs; three Services; Core Product (Monolith) 🥰

Environment Context (diagram): New API; Original UI; four UIs; three Services; Core Product (Monolith)
New Product
• Cloud-native
• Continuous deployment
• Some auto-infrastructure provisioning
Heritage Product
• Virtual servers
• Manual deployments
• Manual infrastructure provisioning

Environment Context (diagram): New API; Original UI; four UIs; three Services; Core Product (Monolith)
• Agile delivery
• Low test-automation
• Incidents highly disruptive
• Reliance on long-tenured engineers
• Traditional ops team (HIGH TOIL)

40+ major incidents per month

The perfect storm was brewing!

Question: Do you have a perfect storm brewing?

2000s: First online product launched | Second major iteration launched | Starts building innovative user experience
2010s: Moved to the cloud | Disrupts market and expands globally | 💥 Super outage

The super outage
> Regular deployment with hundreds of unique changes
> Unusable for numerous weeks
> Busy time of the year
> Most customers affected
> System performance continued to degrade

Unable to isolate the problem

Unable to restore a previous version

Unable to failover

Unable to troubleshoot

Unable to create a new environment

🤕 😴 🤬

Opportunity Strikes!
> Reduce the size of changes
> Make rollback effortless
> Reduce configuration change risk
> Make it easy to set up a new environment
> Make it easy to analyse and troubleshoot problems
> Give everyone the skills to fix problems
> Handle production outages cool, calm and controlled

With 40+ incidents/month a catastrophic outage was only a matter
of time

Question: What would you do to address this risk?

😱 We needed to remove fear

Technology Improvement Plan
> Incident Response
> Deep Ownership
> Production Insights
> Deploy Monolith Safely
> Decouple Monolith
(Diagram: New API; Original UI; four UIs; three Services; Core Product (Monolith))

Pitching our plan…
> Economic argument on supporting scale
> Proposed a stop to delivery to focus on resilience
> Ended up securing about 1/5 of engineering capacity
https://www.wired.com/2013/04/linkedin-software-revolution/

💡 Initial objective was to reduce the impact of failure
on customers

Technology Improvement Plan
> Incident Response
> Deep Ownership
> Production Insights
> Deploy Monolith Safely
> Decouple Monolith
(Diagram: New API; Original UI; four UIs; three Services; Core Product (Monolith))

Planning approach
• Outcomes focus charting (Auftragsklärung)
• 1-pager plan outlining
  - Context
  - Problem
  - Objective
  - Outputs
  - Success measures
• Incremental approach where possible
https://auftragsklaerung.com/

Enabling us to break-up the monolith

Incident Response
> Primary Objective: Shorten repair and recovery times
> Secondary Objective: Make sure incidents don’t happen again

Timeline: Y1Q1 | Y1Q2 | Y1Q3 | Y1Q4 | Y2Q1
• New incident manager (ITIL)
• Incident response process and prioritisation matrix
• Engineers on-call
• PagerDuty
• 🤯 🤑 Marked reduction in incidents
• 💥 Incident Command
• Weekly Ops Review
• 💥 Disestablish tech-support
• 🎯 Incident recovery up to 95% SLA
• Incident recovery at <80% SLA
• Engineering doing post-mortems for Sev2s
• 💥 Engineering doing post-mortems for Sev0 and Sev1 incidents
• 💡 Make sure incidents don’t happen again
• 💡 Time to acknowledge incidents
• 💡 Awareness of sensitive customers
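
The "incident recovery at <80% SLA" and "up to 95% SLA" figures read as the share of incidents recovered within a target time. A minimal sketch of how such a figure could be computed is below. The per-severity targets, the incident tuples and the function names are all hypothetical; the talk does not state its actual SLA definitions.

```python
from datetime import datetime, timedelta

# Hypothetical recovery targets per severity; the talk does not state them.
RECOVERY_SLA = {
    "Sev0": timedelta(hours=1),
    "Sev1": timedelta(hours=4),
    "Sev2": timedelta(hours=24),
}


def within_sla(severity, opened, recovered):
    """True if the incident was recovered within its severity's target."""
    return recovered - opened <= RECOVERY_SLA[severity]


def sla_compliance(incidents):
    """Fraction of incidents recovered within their severity's target."""
    if not incidents:
        return 1.0  # no incidents: treated as fully compliant (a design choice)
    met = sum(1 for incident in incidents if within_sla(*incident))
    return met / len(incidents)


start = datetime(2024, 1, 1, 9, 0)  # made-up data for illustration only
incidents = [
    ("Sev0", start, start + timedelta(minutes=45)),  # met
    ("Sev1", start, start + timedelta(hours=6)),     # missed
    ("Sev2", start, start + timedelta(hours=20)),    # met
    ("Sev1", start, start + timedelta(hours=3)),     # met
]
print(f"{sla_compliance(incidents):.0%}")  # prints "75%"
```

Tracking this one number per week is one way a Weekly Ops Review could see whether recovery is trending towards the target.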

💡 Started with old-school incident management, because it was the
right thing to do.

Production Insights
> Primary Objective: Standard measurement of health
> Secondary Objective: Don’t call us, we’ll call you

Service Level Objectives
Availability = Successful requests / Total requests

Implementing SLOs
> Top down approach
> Key interactions workflow
> Failed requests included slow requests and errors
> Differentiated between desired, current and failing
> Teams implemented SLI dashboards to back these up
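
The availability ratio, with slow requests counted as failures, can be sketched as follows. This is an illustration under stated assumptions, not the team's implementation: the `Request` type, the latency threshold and both target values are hypothetical, and mapping "desired, current and failing" onto two thresholds is only one possible reading of the slide.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status_code: int    # HTTP-style status code (assumed)
    duration_ms: float  # time taken to serve the request


# Hypothetical thresholds, not taken from the talk.
SLOW_THRESHOLD_MS = 1000.0
DESIRED_TARGET = 0.999
CURRENT_TARGET = 0.99


def is_successful(request: Request) -> bool:
    """A request counts as failed if it errors or if it is too slow."""
    errored = request.status_code >= 500
    slow = request.duration_ms > SLOW_THRESHOLD_MS
    return not (errored or slow)


def availability(requests: list[Request]) -> float:
    """Availability = successful requests / total requests."""
    if not requests:
        return 1.0  # no traffic: treated as fully available (a design choice)
    successful = sum(1 for r in requests if is_successful(r))
    return successful / len(requests)


def classify(value: float) -> str:
    """Place an availability value in one of three bands."""
    if value >= DESIRED_TARGET:
        return "desired"
    if value >= CURRENT_TARGET:
        return "current"
    return "failing"


sample = [
    Request(200, 120.0),
    Request(200, 1500.0),  # slow, so counted as failed
    Request(500, 80.0),    # error, so counted as failed
    Request(200, 300.0),
]
print(availability(sample), classify(availability(sample)))  # 0.5 failing
```

Counting slow responses as failures keeps the measure close to what a customer experiences, since a request that eventually succeeds after a long wait is still a bad interaction.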

Timeline: Y1Q1 | Y1Q2 | Y1Q3 | Y1Q4 | Y2Q1
• 🤯 Manual SLO recording in Excel!
• 🎯 First basic dashboards internally (incl. support)
• 💥 Dashboard released to some customers
• Prioritised improvement SLOs to 75% compliance
• 🤯 Engineering site-visits
• 🎯 Customers stopped complaining
• Dedicated work-streams to dramatically improve performance

💡 We created a standard definition of what good looks
like for the customer

Deployment Safety
> Primary Objective: Reduce the impact of bad deployments
> Secondary Objective: Reduce undifferentiated heavy-lifting

Canary deployments on monolith (diagram)
Active / Passive: C1 v2, C3 v2, C2 v2, C4 v2; C1 v1, C3 v1, C2 v1, C4 v1

Canary deployments (diagram): Easy rollback
Active / Passive: C1 v2, C3 v2, C2 v2, C4 v2; C1 v1, C3 v1, C2 v1, C4 v1

Deploy v3 (diagram)
Active / Passive: C1 v3, C3 v1, C2 v1, C4 v1

Make live (diagram)
Active / Passive: C1 v2, C3 v1, C2 v1, C4 v1

Deploy v3 (diagram)
Active / Passive: C1 v2, C3 v1, C2 v3, C4 v1

Make live (diagram)
Active / Passive: C1 v2, C3 v1, C2 v3, C4 v1
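
The deploy, make live and easy rollback steps in these slides can be sketched as a cluster-by-cluster rollout over active and passive slots. Everything in the sketch is hypothetical (the `Cluster` type, the `canary_rollout` function and the health check); it illustrates the general technique and does not describe the team's actual tooling.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    active: str   # version currently serving traffic
    passive: str  # version staged on the standby slot

    def deploy(self, version: str) -> None:
        """Stage a new version on the passive slot; live traffic is untouched."""
        self.passive = version

    def swap(self) -> None:
        """Make the passive slot live. Calling it again is the rollback."""
        self.active, self.passive = self.passive, self.active


def canary_rollout(clusters, version, is_healthy):
    """Roll out one cluster at a time, stopping at the first unhealthy one."""
    for cluster in clusters:
        cluster.deploy(version)
        cluster.swap()
        if not is_healthy(cluster):
            cluster.swap()  # roll back: the previous version is on the other slot
            return False
    return True


clusters = [Cluster(f"C{i}", active="v2", passive="v1") for i in range(1, 5)]

# Hypothetical health check that fails on C2 to show the rollback path.
ok = canary_rollout(clusters, "v3", lambda c: c.name != "C2")
print(ok, [(c.name, c.active) for c in clusters])
# False [('C1', 'v3'), ('C2', 'v2'), ('C3', 'v2'), ('C4', 'v2')]
```

Because the previous version stays on the other slot, rollback is a second swap rather than a redeploy, and a bad release only ever reaches the clusters already switched.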

Database Blue / Green (diagram): Schema v2 | Schema v1

Other things
• Moved SQL Server to managed instances
• Moved all config management into source-control
• Eliminated toil where possible

💡 Not only did we reduce the impact of issues but
we also reduced snowflakes and toil.

This was only the start…

Lessons
> Spend A LOT of time with stakeholders
> Need technical executive support
> Back out of failed experiments early
> Getting this right is hard

Achievements
• Single definition of good using Service Level Objectives
• We started detecting issues before our customers!
• Incidents resolved within SLA (improved from 80% to 95%+)
• Mostly automated deployment and provisioning on our monolith
• Reducing the impact of bad deployments
• Rollback within minutes instead of hours (or even days)
• Enabled modernisation roadmap
• Happy customers!!

Question: What is your key takeaway?

💡 There’s no perfect environment. Find the right DevOps topology
for you.

💡 Cohesive collaborative teams with end-to-end ownership of their services
(build it, run it)

💡 All companies have heritage systems that enabled them to
become successful businesses!

💡 Find incremental gains by looking at the biggest fires
in your environment

💡 Find your initial objective! We reduced failure impact

💡 Started with old-school processes because it was the right
thing to do. …and then we matured

@martincronje
