A DevOps story - Turning our most significant production outage into a driver for positive lasting change.

Martin Cronjé

June 10, 2020

Transcript

  1. A DevOps story


    How we turned our most significant outage into
    lasting positive change!
    @martincronje


  2. 💡


    Cohesive collaborative teams
    with end-to-end ownership of
    their services (build it, run it)


  3. 2000’s


    First online product launched
    Second major iteration launched

  4. Writing code during
    the early 2000’s



  5. The heart of the product was built in the 2000’s


  6. 💡


    All companies have heritage
    systems that enabled them to
    become successful businesses!


  7. Question:


    What is the heritage in your
    environment?


  8. 2000’s


    First online product launched
    Second major iteration launched
    Starts building innovative user experience
    2010’s
    Moved to the cloud
    Disrupts market and expands globally

  9. System Context
    Original UI
    Core Product (Monolith) 🥰

  10. System Context
    New API
    Original UI
    UI UI UI
    Service Service Service
    Core Product (Monolith) 🥰

  11. Environment Context
    New API
    Original UI
    UI UI UI
    Service Service Service
    Core Product (Monolith)
    • New Product
      • Cloud-native
      • Continuous deployment
      • Some auto-infrastructure provisioning
    • Heritage Product
      • Virtual servers
      • Manual deployments
      • Manual infrastructure provisioning

  12. Environment Context
    New API
    Original UI
    UI UI UI
    Service Service Service
    Core Product (Monolith)
    • Agile delivery
    • Low test-automation
    • Incidents highly disruptive
    • Reliance on long-tenured engineers
    • Traditional ops team (HIGH TOIL)

  13. 40+
    major incidents per month


  14. The perfect storm was brewing!


  15. Question:


    Do you have a perfect storm
    brewing?


  16. 2000’s


    First online product launched
    Second major iteration launched
    Starts building innovative user experience
    2010’s
    Moved to the cloud
    Disrupts market and expands globally
    💥 Super outage

  17. The super outage
    > Regular deployment with hundreds of unique changes


    > Unusable for numerous weeks


    > Busy time of the year


    > Most customers affected


    > System performance continued to degrade


  18. Unable to isolate the problem


  19. Unable to restore a previous version


  20. Unable to failover


  21. Unable to troubleshoot


  22. Unable to create a new environment


  23. 🤕 😴 🤬


  24. Opportunity Strikes!
    > Reduce the size of changes


    > Make rollback effortless


    > Reduce configuration change risk


    > Make it easy to set up a new environment


    > Make it easy to analyse and troubleshoot problems


    > Give everyone the skills to fix problems


    > Handle production outages in a cool, calm and controlled manner


  25. With 40+ incidents/month
    a catastrophic outage was only a matter of time


  26. Question:


    What would you do to address
    this risk?


  27. 😱 We needed to remove fear


  28. Technology Improvement Plan
    > Incident Response


    > Deep Ownership


    > Production Insights


    > Deploy Monolith Safely


    > Decouple Monolith
    New API
    Original UI
    UI UI UI
    Service Service Service
    Core Product (Monolith)

  29. Pitching our plan…
    > Economic argument on supporting scale


    > Proposed a stop to delivery to focus on resilience


    > Ended up securing about 1/5 of engineering capacity
    https://www.wired.com/2013/04/linkedin-software-revolution/


  30. 💡


    Initial objective was to reduce
    the impact of failure on
    customers


  31. Technology Improvement Plan
    > Incident Response


    > Deep Ownership


    > Production Insights


    > Deploy Monolith Safely


    > Decouple Monolith
    New API
    Original UI
    UI UI UI
    Service Service Service
    Core Product (Monolith)

  32. Planning approach
    ● Outcomes focus charting (Auftragsklärung)
    ● 1-pager plan outlining
      - Context
      - Problem
      - Objective
      - Outputs
      - Success measures
    ● Incremental approach where possible
    https://auftragsklaerung.com/

  33. Enabling us to break up the
    monolith


  34. Incident Response


    > Primary Objective: Shorten repair and recovery times


    > Secondary Objective: Make sure they don’t happen again


  35. Y1Q1 Y1Q2 Y1Q3 Y1Q4 Y2Q1
    New incident manager (ITIL)
    Incident response process and prioritisation matrix
    Engineers on-call
    PagerDuty
    🤯 🤑 Marked reduction in incidents
    💥 Incident Command
    Weekly Ops Review
    💥 Disestablish tech-support
    🎯 Incident recovery up to 95% SLA
    Incident recovery at <80% SLA
    Engineering doing post-mortems for Sev2’s
    💥 Engineering doing post-mortems for Sev0 and Sev1 incidents
    💡 Make sure incidents don’t happen again
    💡 Time to acknowledge incidents
    💡 Awareness of sensitive customers
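To make the "incident recovery within SLA" figures on this slide concrete, here is a minimal sketch of how such a percentage could be computed. The severity names (Sev0, Sev1, Sev2) come from the slide; the recovery targets, the `Incident` type and the function names are assumptions made up for illustration and are not the team's actual tooling or numbers.

```python
from dataclasses import dataclass
from datetime import timedelta

# Hypothetical recovery targets per severity (not from the talk).
RECOVERY_TARGETS = {
    "Sev0": timedelta(hours=1),
    "Sev1": timedelta(hours=4),
    "Sev2": timedelta(hours=24),
}


@dataclass(frozen=True)
class Incident:
    severity: str
    time_to_recover: timedelta


def recovered_within_sla(incident: Incident) -> bool:
    """True when the incident was recovered within its severity's target."""
    return incident.time_to_recover <= RECOVERY_TARGETS[incident.severity]


def sla_compliance(incidents: list[Incident]) -> float:
    """Fraction of incidents recovered within their SLA target."""
    if not incidents:
        raise ValueError("need at least one incident")
    met = sum(1 for incident in incidents if recovered_within_sla(incident))
    return met / len(incidents)


# Illustrative data only.
sample = [
    Incident("Sev0", timedelta(minutes=45)),
    Incident("Sev1", timedelta(hours=6)),   # missed the 4-hour target
    Incident("Sev2", timedelta(hours=3)),
    Incident("Sev2", timedelta(hours=20)),
]
print(f"{sla_compliance(sample):.0%} of incidents recovered within SLA")
```

Tracking this number per quarter is one way to see a move such as the one on the slide, from under 80% to 95%.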

  36. 💡


    Started with old-school incident
    management, because it was
    the right thing to do.


  37. Production Insights


    > Primary Objective: Standard measurement of health


    > Secondary Objective: Don’t call us, we’ll call you


  38. Service Level Objectives
    Availability = Successful requests / Total requests
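The availability ratio on this slide can be written as a small function. This is only a sketch: the function name and the example numbers are illustrative and do not come from the talk.

```python
def availability(successful_requests: int, total_requests: int) -> float:
    """Availability = successful requests / total requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= successful_requests <= total_requests:
        raise ValueError("successful_requests must be between 0 and total_requests")
    return successful_requests / total_requests


# Illustrative numbers: 9,950 successful requests out of 10,000.
print(f"{availability(9_950, 10_000):.2%}")  # prints 99.50%
```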

  39. Implementing SLOs
    > Top down approach


    > Key interactions workflow


    > Failed requests included both slow responses and errors


    > Differentiated between desired, current and failing


    > Teams implemented SLI dashboards to back these up
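
The points above can be sketched as follows, under stated assumptions. A request counts as failed when it errors or when it is slow, and the measured value is compared against a desired level and a failing level. The `Request` and `SloTargets` types, the 1000 ms slowness threshold, the choice of HTTP 5xx as "error" and the status labels are all hypothetical; the slides do not say how the team implemented this.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Request:
    status_code: int
    latency_ms: float


@dataclass(frozen=True)
class SloTargets:
    desired: float  # level the team wants to reach
    failing: float  # level at or below which the SLO counts as failing


def is_good(request: Request, slow_threshold_ms: float = 1000.0) -> bool:
    """A request is good only if it neither errored nor was slow."""
    return request.status_code < 500 and request.latency_ms <= slow_threshold_ms


def measure_sli(requests: Iterable[Request], slow_threshold_ms: float = 1000.0) -> float:
    """Current level: good requests divided by total requests."""
    requests = list(requests)
    if not requests:
        raise ValueError("need at least one request")
    good = sum(1 for r in requests if is_good(r, slow_threshold_ms))
    return good / len(requests)


def classify(current: float, targets: SloTargets) -> str:
    """Compare the current level with the desired and failing levels."""
    if current >= targets.desired:
        return "meeting desired"
    if current > targets.failing:
        return "below desired"
    return "failing"


# Illustrative data: one server error and one slow response out of four.
observed = [
    Request(200, 120.0),
    Request(200, 300.0),
    Request(500, 80.0),
    Request(200, 2500.0),
]
targets = SloTargets(desired=0.999, failing=0.99)
current = measure_sli(observed)
print(current, classify(current, targets))
```

Counting slow responses as failures keeps the measurement close to what a customer experiences, which fits the top-down approach the slide describes.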


  40. Y1Q1 Y1Q2 Y1Q3 Y1Q4 Y2Q1
    🤯 Manual SLO recording in Excel!
    🎯 First basic dashboards internally (incl. support)
    💥 Dashboard released to some customers
    Prioritised improvement
    SLOs to 75% compliance
    🤯 Engineering site-visits
    🎯 Customers stopped complaining
    Dedicated work-streams to dramatically improve performance

  41. 💡


    We created a standard
    definition of what good looks
    like for the customer


  42. Deployment Safety


    > Primary Objective: Reduce the impact of bad deployments


    > Secondary Objective: Reduce undifferentiated heavy-lifting


  43. Canary deployments on monolith
    C1 v2
    C3 v2
    C2 v2
    C4 v2
    C1 v1
    C3 v1
    C2 v1
    C4 v1
    Active Passive


  44. Canary deployments
    C1 v2
    C3 v2
    C2 v2
    C1 v1
    C3 v1
    C2 v1
    C4 v1
    Active Passive
    Easy rollback
    C4 v2


  45. Canary deployments
    C1 v2
    C3 v2
    C2 v2
    C4 v2
    C1 v3
    C3 v1
    C2 v1
    C4 v1
    Active Passive
    Deploy v3


  46. Canary deployments
    C3 v2
    C2 v2
    C4 v2
    C1 v3
    C3 v1
    C2 v1
    C4 v1
    Active Passive
    C1 v2
    Make live


  47. Canary deployments
    C3 v2
    C2 v2
    C4 v2
    C1 v3
    C3 v1
    C2 v3
    C4 v1
    Active Passive
    C1 v2
    Deploy v3


  48. Canary deployments
    C3 v2
    C2 v2
    C4 v2
    C1 v3
    C3 v1
    C2 v3
    C4 v1
    Active Passive
    C1 v2
    Make live
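
The sequence on the canary slides (deploy to the passive side, make live, roll back easily) can be sketched as a pair of slots per unit. This is an illustration under assumptions: the slides do not say what C1 to C4 represent, so the `Unit` type and its method names are hypothetical, and the real mechanism behind "Make live" is not described in the deck.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """One of C1..C4, with an active (live) slot and a passive (standby) slot."""
    name: str
    active: str
    passive: str

    def deploy(self, version: str) -> None:
        """Install a new version on the passive slot; live traffic is untouched."""
        self.passive = version

    def make_live(self) -> None:
        """Swap slots so the passive version starts serving traffic."""
        self.active, self.passive = self.passive, self.active

    def rollback(self) -> None:
        """Swap back: the previous version is still installed on the other slot."""
        self.make_live()


# Starting point as on the slides: v2 active and v1 passive everywhere.
units = {name: Unit(name, active="v2", passive="v1") for name in ("C1", "C2", "C3", "C4")}

# Canary: only C1 receives v3 first, then it is made live.
units["C1"].deploy("v3")
units["C1"].make_live()

# Next step: C2 follows once C1 looks healthy.
units["C2"].deploy("v3")
units["C2"].make_live()

for unit in units.values():
    print(unit.name, "active:", unit.active, "passive:", unit.passive)
```

Because the previous version stays installed on the other slot, rolling back is a swap rather than a redeployment, which is consistent with the "Easy rollback" label on the slides.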


  49. Database Blue / Green
    Schema
    v2
    Schema
    v1


  50. Other things
    ● Moved SQL Server to managed instances


    ● Moved all config management into source-control


    ● Eliminated toil where possible


  51. 💡


    Not only did we reduce impact
    of issues but we also reduced
    snowflakes and toil.


  52. This was only the start…


  53. Lessons
    > Spend A LOT of time with stakeholders


    > Need technical executive support


    > Back out of failed experiments early


    > Getting this right is hard


  54. Achievements
    ● Single definition of good using Service Level Objectives


    ● We started detecting issues before our customers!


    ● Incidents resolved within SLA (improved from 80% to 95%+)


    ● Mostly automated deployment and provisioning on our monolith


    ● Reducing the impact of bad deployments


    ● Rollback within minutes instead of hours (or even days)


    ● Enabled modernisation roadmap


    ● Happy customers!!


  55. Question:


    What is your key takeaway?


  56. 💡


    There’s no perfect environment.
    Find the right DevOps
    topology for you.


  57. 💡


    Cohesive collaborative teams
    with end-to-end ownership of
    their services (build it, run it)


  58. 💡


    All companies have heritage
    systems that enabled them to
    become successful businesses!


  59. 💡


    Find incremental gains by
    looking at the biggest fires in
    your environment


  60. 💡


    Find your initial objective!


    We reduced failure impact


  61. 💡


    Started with old-school processes
    because it was the right thing to do.


    …and then we matured


  62. @martincronje
