A DevOps story - Turning our most significant production outage into a driver for positive lasting change.

Martin Cronjé

June 10, 2020

Transcript

  1. A DevOps story: How we turned our most significant outage into lasting positive change! @martincronje
  2. 2000s: First online product launched | Second major iteration launched | Starts building innovative user experience
    2010s: Moved to the cloud | Disrupts market and expands globally
  3. System Context 🥰
    (Diagram: New API, Original UI, UIs, Services, Core Product (Monolith))
  4. Environment Context
    (Diagram: New API, Original UI, UIs, Services, Core Product (Monolith))
    • New Product
      • Cloud-native
      • Continuous deployment
      • Some auto-infrastructure provisioning
    • Heritage Product
      • Virtual servers
      • Manual deployments
      • Manual infrastructure provisioning
  5. Environment Context
    (Diagram: New API, Original UI, UIs, Services, Core Product (Monolith))
    • Agile delivery
    • Low test-automation
    • Incidents highly disruptive
    • Reliance on long-tenured engineers
    • Traditional ops team (HIGH TOIL)
  6. 2000s: First online product launched | Second major iteration launched | Starts building innovative user experience
    2010s: Moved to the cloud | Disrupts market and expands globally
    💥 Super outage
  7. The super outage
    > Regular deployment with hundreds of unique changes
    > Unusable for numerous weeks
    > Busy time of the year
    > Most customers affected
    > System performance continued to degrade
  8. Opportunity Strikes!
    > Reduce the size of changes
    > Make rollback effortless
    > Reduce configuration change risk
    > Make it easy to set up a new environment
    > Make it easy to analyse and troubleshoot problems
    > Give everyone the skills to fix problems
    > Handle production outages in a cool, calm and controlled way
  9. Technology Improvement Plan
    > Incident Response
    > Deep Ownership
    > Production Insights
    > Deploy Monolith Safely
    > Decouple Monolith
    (Diagram: New API, Original UI, UIs, Services, Core Product (Monolith))
  10. Pitching our plan…
    > Economic argument on supporting scale
    > Proposed a stop to delivery to focus on resilience
    > Ended up securing about 1/5 of engineering capacity
    https://www.wired.com/2013/04/linkedin-software-revolution/
  11. Technology Improvement Plan
    > Incident Response
    > Deep Ownership
    > Production Insights
    > Deploy Monolith Safely
    > Decouple Monolith
    (Diagram: New API, Original UI, UIs, Services, Core Product (Monolith))
  12. Planning approach
    • Outcomes focus charting (Auftragsklärung)
    • 1-pager plan outlining
      - Context
      - Problem
      - Objective
      - Outputs
      - Success measures
    • Incremental approach where possible
    https://auftragsklaerung.com/
  13. Incident Response
    > Primary Objective: Shorten repair and recovery times
    > Secondary Objective: Make sure they don’t happen again
  14. Incident Response timeline (Y1Q1 | Y1Q2 | Y1Q3 | Y1Q4 | Y2Q1)
    New incident manager (ITIL)
    Incident response process and prioritisation matrix
    Engineers on-call
    PagerDuty
    🤯 🤑 Marked reduction in incidents
    💥 Incident Command
    Weekly Ops Review
    💥 Disestablish tech-support
    🎯 Incident recovery up to 95% SLA
    Incident recovery at <80% SLA
    Engineering doing post-mortems for Sev2s
    💥 Engineering doing post-mortems for Sev0 and Sev1 incidents
    💡 Make sure incidents don’t happen again
    💡 Time to acknowledge incidents
    💡 Awareness of sensitive customers
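
The "Incident recovery ... SLA" figures on this timeline can be read as the share of incidents recovered within a target time. The following is a minimal sketch of that calculation, assuming per-severity recovery targets. The target values and the `sla_compliance` name are hypothetical illustrations and do not come from the talk.

```python
from datetime import timedelta

# Hypothetical recovery targets per severity (illustrative values only,
# not taken from the talk).
RECOVERY_TARGETS = {
    "Sev0": timedelta(hours=1),
    "Sev1": timedelta(hours=4),
    "Sev2": timedelta(hours=24),
}


def sla_compliance(incidents):
    """Return the fraction of incidents recovered within their target.

    `incidents` is a list of (severity, time_to_recover) tuples.
    """
    if not incidents:
        return 1.0  # no incidents means nothing breached the SLA
    within = sum(1 for sev, ttr in incidents if ttr <= RECOVERY_TARGETS[sev])
    return within / len(incidents)


incidents = [
    ("Sev0", timedelta(minutes=30)),
    ("Sev1", timedelta(hours=5)),  # breaches the assumed 4-hour target
    ("Sev2", timedelta(hours=2)),
    ("Sev1", timedelta(hours=1)),
]
print(f"{sla_compliance(incidents):.0%}")  # 3 of 4 incidents within target
```

Tracking this ratio per quarter would give a trend comparable to the move from below 80% to 95% shown on the slide.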
  15. Production Insights
    > Primary Objective: Standard measurement of health
    > Secondary Objective: Don’t call us, we’ll call you
  16. Implementing SLOs
    > Top down approach
    > Key interactions workflow
    > Failed requests included both slow responses and errors
    > Differentiated between desired, current and failing
    > Teams implemented SLI dashboards to back these up
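
As a rough illustration of the SLO slide above, the sketch below computes an SLI in which a request counts as failed if it errored or was too slow, then compares it against two thresholds. The `Request` type, the latency threshold, the threshold values and the status labels are all assumptions made for this example; the talk does not specify how "desired, current and failing" were defined.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    is_error: bool


def sli(requests, slow_threshold_ms):
    """Fraction of good requests; slow or errored requests count as failed."""
    if not requests:
        return 1.0
    failed = sum(
        1 for r in requests if r.is_error or r.latency_ms > slow_threshold_ms
    )
    return 1.0 - failed / len(requests)


def slo_status(value, desired, failing):
    """Classify an SLI value against two assumed thresholds.

    This is one possible reading of 'desired, current and failing':
    at or above `desired` is good, below `failing` is failing, and
    anything in between is tolerated but below the desired level.
    """
    if value >= desired:
        return "meets desired"
    if value < failing:
        return "failing"
    return "below desired"


sample = [
    Request(latency_ms=120, is_error=False),
    Request(latency_ms=3400, is_error=False),  # slow, so counted as failed
    Request(latency_ms=95, is_error=True),     # error, so counted as failed
    Request(latency_ms=210, is_error=False),
]
value = sli(sample, slow_threshold_ms=1000)
print(value, slo_status(value, desired=0.99, failing=0.95))
```

Counting slow responses as failures keeps a single number per key interaction, which is what makes a shared dashboard workable.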
  17. Production Insights timeline (Y1Q1 | Y1Q2 | Y1Q3 | Y1Q4 | Y2Q1)
    🤯 Manual SLO recording in Excel!
    🎯 First basic dashboards internally (incl. support)
    💥 Dashboard released to some customers
    Prioritised improvement SLOs to 75% compliance
    🤯 Engineering site-visits
    🎯 Customers stopped complaining
    Dedicated work-streams to dramatically improve performance
  18. Deployment Safety
    > Primary Objective: Reduce the impact of bad deployments
    > Secondary Objective: Reduce undifferentiated heavy-lifting
  19. Canary deployments on monolith
    Active:  C1 v2 | C2 v2 | C3 v2 | C4 v2
    Passive: C1 v1 | C2 v1 | C3 v1 | C4 v1
  20. Canary deployments: Easy rollback
    Active:  C1 v2 | C2 v2 | C3 v2 | C4 v2
    Passive: C1 v1 | C2 v1 | C3 v1 | C4 v1
  21. Canary deployments: Deploy v3
    Active:  C1 v2 | C2 v2 | C3 v2 | C4 v2
    Passive: C1 v3 | C2 v1 | C3 v1 | C4 v1
  22. Canary deployments: Make live
    Active:  C1 v3 | C2 v2 | C3 v2 | C4 v2
    Passive: C1 v2 | C2 v1 | C3 v1 | C4 v1
  23. Canary deployments: Deploy v3
    Active:  C1 v3 | C2 v2 | C3 v2 | C4 v2
    Passive: C1 v2 | C2 v3 | C3 v1 | C4 v1
  24. Canary deployments: Make live
    Active:  C1 v3 | C2 v2 | C3 v2 | C4 v2
    Passive: C1 v2 | C2 v3 | C3 v1 | C4 v1
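
The canary slides above walk through a per-cluster cycle: deploy the new version to the passive slot, make it live by swapping slots, and keep the previous version in the passive slot for rollback. The following is a small simulation of that cycle, not the team's actual tooling. The `Cluster` class, the `canary_rollout` function and the `is_healthy` check are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    active: str   # version currently serving traffic
    passive: str  # version on standby

    def deploy(self, version):
        """Install a version on the passive slot; live traffic is untouched."""
        self.passive = version

    def make_live(self):
        """Swap slots so the passive version starts serving traffic."""
        self.active, self.passive = self.passive, self.active

    def rollback(self):
        """Swap back; the previous version is still on the passive slot."""
        self.make_live()


def canary_rollout(clusters, version, is_healthy):
    """Roll a version out one cluster at a time, stopping on the first failure.

    Returns True if every cluster ended up live on `version`.
    """
    for cluster in clusters:
        cluster.deploy(version)
        cluster.make_live()
        if not is_healthy(cluster):  # assumed health check supplied by caller
            cluster.rollback()
            return False
    return True


clusters = [Cluster(f"C{i}", active="v2", passive="v1") for i in range(1, 5)]
ok = canary_rollout(clusters, "v3", is_healthy=lambda c: True)
print(ok, [(c.name, c.active, c.passive) for c in clusters])
```

In this sketch a failed health check rolls back only the cluster being changed and halts the rollout, so clusters that were already switched stay on the new version. A fuller implementation might roll those back as well.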
  25. Other things
    • Moved SQL Server to managed instances
    • Moved all config management into source-control
    • Eliminated toil where possible
  26. 💡 Not only did we reduce the impact of issues, but we also reduced snowflakes and toil.
  27. Lessons
    > Spend A LOT of time with stakeholders
    > Need technical executive support
    > Back out of failed experiments early
    > Getting this right is hard
  28. Achievements
    • Single definition of good using Service Level Objectives
    • We started detecting issues before our customers!
    • Incidents resolved within SLA (improved from 80% to 95%+)
    • Mostly automated deployment and provisioning on our monolith
    • Reduced the impact of bad deployments
    • Rollback within minutes instead of hours (or even days)
    • Enabled modernisation roadmap
    • Happy customers!!