A DevOps story - Turning our most significant production outage into a driver for positive lasting change.

Martin Cronjé

June 10, 2020

Transcript

  1. A DevOps story


    How we turned our most significant outage into
    lasting positive change!
    @martincronje

  3. 💡


    Cohesive collaborative teams
    with end-to-end ownership of
    their services (build it, run it)

  4. 2000’s

    > First online product launched
    > Second major iteration launched

  5. Writing code during
    the early 2000’s


  6. The heart of the product was built in the 2000’s

  7. 💡


    All companies have heritage
    systems that enabled them to
    become successful businesses!

  8. Question:


    What is the heritage in your
    environment?

  9. Timeline (2000’s, 2010’s)

    > First online product launched
    > Second major iteration launched
    > Starts building innovative user experience
    > Moved to the cloud
    > Disrupts market and expands globally

  10. System Context

    [Diagram: Original UI, Core Product (Monolith) 🥰]

  11. System Context

    [Diagram: New API, Original UI, three UIs, three Services, Core Product (Monolith) 🥰]

  12. Environment Context

    [Diagram: New API, Original UI, three UIs, three Services, Core Product (Monolith)]

    • New Product
      • Cloud-native
      • Continuous deployment
      • Some auto-infrastructure provisioning
    • Heritage Product
      • Virtual servers
      • Manual deployments
      • Manual infrastructure provisioning

  13. Environment Context

    [Diagram: New API, Original UI, three UIs, three Services, Core Product (Monolith)]

    • Agile delivery
    • Low test-automation
    • Incidents highly disruptive
    • Reliance on long-tenured engineers
    • Traditional ops team (HIGH TOIL)

  14. 40+
    major incidents per month

  15. The perfect storm was brewing!

  16. Question:


    Do you have a perfect storm
    brewing?

  17. Timeline (2000’s, 2010’s)

    > First online product launched
    > Second major iteration launched
    > Starts building innovative user experience
    > Moved to the cloud
    > Disrupts market and expands globally
    > 💥 Super outage

  18. The super outage
    > Regular deployment with hundreds of unique changes
    > Unusable for numerous weeks
    > Busy time of the year
    > Most customers affected
    > System performance continued to degrade

  19. Unable to isolate the problem

  20. Unable to restore a previous version

  21. Unable to failover

  22. Unable to troubleshoot

  23. Unable to create a new environment

  24. 🤕 😴 🤬

  25. Opportunity Strikes!
    > Reduce the size of changes
    > Make rollback effortless
    > Reduce configuration change risk
    > Make it easy to set up a new environment
    > Make it easy to analyse and troubleshoot problems
    > Give everyone the skills to fix problems
    > Handle production outages in a cool, calm and controlled manner

  26. With 40+ incidents/month
    a catastrophic outage was only a matter of time

  27. Question:


    What would you do to address
    this risk?

  28. 😱 We needed to remove fear

  30. Technology Improvement Plan
    > Incident Response
    > Deep Ownership
    > Production Insights
    > Deploy Monolith Safely
    > Decouple Monolith

    [Diagram: New API, Original UI, three UIs, three Services, Core Product (Monolith)]

  31. Pitching our plan…
    > Economic argument on supporting scale
    > Proposed a stop to delivery to focus on resilience
    > Ended up securing about 1/5 of engineering capacity

    https://www.wired.com/2013/04/linkedin-software-revolution/

  32. 💡


    Initial objective was to reduce
    the impact of failure on
    customers

  33. Technology Improvement Plan
    > Incident Response
    > Deep Ownership
    > Production Insights
    > Deploy Monolith Safely
    > Decouple Monolith

    [Diagram: New API, Original UI, three UIs, three Services, Core Product (Monolith)]

  34. Planning approach
    ● Outcomes focus charting (Auftragsklärung)
    ● 1-pager plan outlining
      - Context
      - Problem
      - Objective
      - Outputs
      - Success measures
    ● Incremental approach where possible

    https://auftragsklaerung.com/

  35. Enabling us to break up the
    monolith

  36. Incident Response


    > Primary Objective: Shorten repair and recovery times


    > Secondary Objective: Make sure they don’t happen again

  37. Timeline (Y1Q1, Y1Q2, Y1Q3, Y1Q4, Y2Q1)

    > New incident manager (ITIL)
    > Incident response process and prioritisation matrix
    > Engineers on-call
    > PagerDuty
    > 🤯 🤑 Marked reduction in incidents
    > 💥 Incident Command
    > Weekly Ops Review
    > 💥 Disestablish tech-support
    > 🎯 Incident recovery up to 95% SLA
    > Incident recovery at <80% SLA
    > Engineering doing post-mortems for Sev2s
    > 💥 Engineering doing post-mortems for Sev0 and Sev1 incidents
    > 💡 Make sure incidents don’t happen again
    > 💡 Time to acknowledge incidents
    > 💡 Awareness of sensitive customers

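The "incident recovery" SLA figures on this slide can be read as the share of incidents recovered within an agreed target time. A minimal Python sketch of that calculation follows. The per-severity targets, the `Incident` type and the sample data are hypothetical, since the deck does not state the real SLA values.

```python
from dataclasses import dataclass
from datetime import timedelta

# Hypothetical recovery targets per severity (assumption, not from the deck).
RECOVERY_TARGETS = {
    "Sev0": timedelta(hours=1),
    "Sev1": timedelta(hours=4),
    "Sev2": timedelta(hours=24),
}


@dataclass(frozen=True)
class Incident:
    severity: str
    time_to_recover: timedelta


def sla_compliance(incidents):
    """Return the fraction of incidents recovered within their severity's target."""
    if not incidents:
        return 1.0  # no incidents, so nothing breached the SLA
    within_target = sum(
        1 for i in incidents
        if i.time_to_recover <= RECOVERY_TARGETS[i.severity]
    )
    return within_target / len(incidents)


# Made-up sample data for illustration only.
incidents = [
    Incident("Sev0", timedelta(minutes=45)),
    Incident("Sev1", timedelta(hours=6)),   # misses the 4 hour target
    Incident("Sev2", timedelta(hours=3)),
    Incident("Sev2", timedelta(hours=20)),
]

print(f"Recovered within SLA: {sla_compliance(incidents):.0%}")
```

Tracking this one number per review period is enough to see a trend such as the move from below 80% towards 95%.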

  38. 💡


    Started with old-school incident
    management, because it was
    the right thing to do.

  39. Production Insights


    > Primary Objective: Standard measurement of health


    > Secondary Objective: Don’t call us, we’ll call you

  40. Service Level Objectives
    Availability = Successful requests / Total requests

  41. Implementing SLOs
    > Top-down approach
    > Key interactions workflow
    > Failed requests included both slow requests and errors
    > Differentiated between desired, current and failing
    > Teams implemented SLI dashboards to back these up

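A minimal Python sketch of the availability measure described on these slides, where a request counts as failed if it errors or is slow. The latency threshold, the two target levels and the `Request` type are hypothetical. The deck mentions desired, current and failing levels without defining them; here "current" is taken to be the measured value, compared against a desired target and a failing threshold, which is only one possible reading.

```python
from dataclasses import dataclass

# Hypothetical thresholds (assumptions, not from the deck).
SLOW_MS = 2000.0   # requests slower than this count as failed
DESIRED = 0.999    # target the team wants to reach
FAILING = 0.99     # below this the SLO is considered failing


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    latency_ms: float


def is_successful(request):
    """A request succeeds only if it neither errors (5xx) nor is slow."""
    return request.status < 500 and request.latency_ms <= SLOW_MS


def availability(requests):
    """Availability = successful requests / total requests."""
    if not requests:
        return None  # undefined when there was no traffic
    return sum(is_successful(r) for r in requests) / len(requests)


def classify(current):
    """Compare the measured (current) availability with the two thresholds."""
    if current >= DESIRED:
        return "desired"
    if current >= FAILING:
        return "below desired"
    return "failing"


# Made-up traffic: 995 good requests, 3 errors, 2 slow responses.
sample = (
    [Request(200, 120.0)] * 995
    + [Request(500, 80.0)] * 3
    + [Request(200, 5000.0)] * 2
)

current = availability(sample)
print(f"availability={current:.3f} status={classify(current)}")
```

Counting slow responses as failures keeps the measure aligned with what a customer experiences rather than with what the server reports.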

  42. Timeline (Y1Q1, Y1Q2, Y1Q3, Y1Q4, Y2Q1)

    > 🤯 Manual SLO recording in Excel!
    > 🎯 First basic dashboards internally (incl. support)
    > 💥 Dashboard released to some customers
    > Prioritised improvement SLOs to 75% compliance
    > 🤯 Engineering site-visits
    > 🎯 Customers stopped complaining
    > Dedicated work-streams to dramatically improve performance

  43. 💡


    We created a standard
    definition of what good looks
    like for the customer

  44. Deployment Safety


    > Primary Objective: Reduce the impact of bad deployments


    > Secondary Objective: Reduce undifferentiated heavy-lifting

  45. Canary deployments on monolith

    [Diagram: Active / Passive. Labels: C1 v2, C3 v2, C2 v2, C4 v2, C1 v1, C3 v1, C2 v1, C4 v1]

  46. Canary deployments
    Easy rollback

    [Diagram: Active / Passive. Labels: C1 v2, C3 v2, C2 v2, C1 v1, C3 v1, C2 v1, C4 v1, C4 v2]

  47. Canary deployments
    Deploy v3

    [Diagram: Active / Passive. Labels: C1 v2, C3 v2, C2 v2, C4 v2, C1 v3, C3 v1, C2 v1, C4 v1]

  48. Canary deployments
    Make live

    [Diagram: Active / Passive. Labels: C3 v2, C2 v2, C4 v2, C1 v3, C3 v1, C2 v1, C4 v1, C1 v2]

  49. Canary deployments
    Deploy v3

    [Diagram: Active / Passive. Labels: C3 v2, C2 v2, C4 v2, C1 v3, C3 v1, C2 v3, C4 v1, C1 v2]

  50. Canary deployments
    Make live

    [Diagram: Active / Passive. Labels: C3 v2, C2 v2, C4 v2, C1 v3, C3 v1, C2 v3, C4 v1, C1 v2]

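One way to read the canary slides is that each unit labelled C1 to C4 has an active and a passive side: a new version is deployed to the passive side, then the two sides are swapped to make it live, and rollback is the same swap in reverse. The Python sketch below models that reading only. The deck does not say what "C" stands for, so the `Cluster` name, the `canary_step` helper and the `healthy` callback are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """One unit from the slides (C1..C4) with an active and a passive side."""
    name: str
    active: str    # version currently serving traffic
    passive: str   # version kept on standby

    def deploy(self, version):
        """Install a version on the passive side; live traffic is untouched."""
        self.passive = version

    def make_live(self):
        """Swap sides: the passive version goes live, the old one goes on standby."""
        self.active, self.passive = self.passive, self.active

    def rollback(self):
        """Rollback is the same swap, because the previous version is still on standby."""
        self.make_live()


def canary_step(cluster, version, healthy):
    """Deploy to one cluster, make it live, and roll back if the check fails.

    `healthy` is a hypothetical callback that inspects the cluster after the swap.
    """
    cluster.deploy(version)
    cluster.make_live()
    if not healthy(cluster):
        cluster.rollback()
        return False
    return True


# Starting point as on the first canary slide: v2 active, v1 passive everywhere.
clusters = {n: Cluster(n, active="v2", passive="v1") for n in ("C1", "C2", "C3", "C4")}

# Roll v3 out to C1 only; the other clusters keep serving v2.
canary_step(clusters["C1"], "v3", healthy=lambda c: True)

for c in clusters.values():
    print(f"{c.name}: active={c.active} passive={c.passive}")
```

Because only one cluster changes at a time, a bad release affects a fraction of traffic, and the previous version stays one swap away.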

  51. Database Blue / Green

    [Diagram: Schema v2, Schema v1]

  52. Other things
    ● Moved SQL Server to management instances
    ● Moved all config management into source-control
    ● Eliminated toil where possible

  53. 💡


    Not only did we reduce impact
    of issues but we also reduced
    snowflakes and toil.

  54. This was only the start…

  55. Lessons
    > Spend A LOT of time with stakeholders


    > Need technical executive support


    > Back out of failed experiments early


    > Getting this right is hard

  56. Achievements
    ● Single definition of good using Service Level Objectives
    ● We started detecting issues before our customers!
    ● Incidents resolved within SLA (improved from 80% to 95%+)
    ● Mostly automated deployment and provisioning on our monolith
    ● Reducing the impact of bad deployments
    ● Rollback within minutes instead of hours (or even days)
    ● Enabled modernisation roadmap
    ● Happy customers!!

  57. Question:


    What is your key takeaway?

  58. 💡


    There’s no perfect environment.
    Find the right DevOps
    topology for you.

  59. 💡


    Cohesive collaborative teams
    with end-to-end ownership of
    their services (build it, run it)

  60. 💡


    All companies have heritage
    systems that enabled them to
    become successful businesses!

  61. 💡


    Find incremental gains by
    looking at the biggest fires in
    your environment

  62. 💡


    Find your initial objective!


    We reduced failure impact

  63. 💡


    Started with old-school processes
    because it was the right thing to do.


    …and then we matured

  64. @martincronje
