Paving the Road to Proactive Reliability

Nikos Katirtzis
August 07, 2024

Transcript

  1. Paving the Road for Proactive Reliability

    Kaushik Patel, Senior Engineering Manager
    Nikos Katirtzis, Senior Software Engineer
  2. Expedia Group

    Paving the Road for Proactive Reliability
  3. Our Scale

    • 35+ functional domains
    • 26k+ apps
    • 500+ Kubernetes clusters
    • 40k+ nodes
    • Multiple AWS regions
  4. Path to Paved Road

    Before: Runtime Platform 1, 2, 3; Observability Tool 1, 2, 3; Service Mesh 1, 2, 3
    After: Runtime Platform; Observability Tool; Service Mesh
  5. Incident Analysis (2019)

    Incidents grouped by theme: Change; Everything Else
    Contributing factor that caused incidents: Change; Config Drift; Unknown; Infrastructure Failure; Certificate Expiration; Other
    Source: https://www.subbu.org/articles/2019/incidents-trends-from-the-trenches/
  6. Release Safety Strategies

    Blue/Green: Router; Blue; Green
    Canary Flow: Initiate; Shift % traffic; Bake; Judge; Promote/Rollback
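The canary flow on this slide (initiate, shift a percentage of traffic, bake, judge, then promote or roll back) can be sketched as a small loop. This is only an illustration under stated assumptions, not Expedia's implementation: `run_canary`, its three hook parameters, and the default step percentages are all hypothetical.

```python
from typing import Callable, Sequence


def run_canary(
    shift_traffic: Callable[[int], None],
    bake: Callable[[], None],
    judge: Callable[[], bool],
    steps: Sequence[int] = (10, 25, 50, 100),  # hypothetical percentages
) -> str:
    """Walk a new version through increasing traffic percentages.

    All three callables are hypothetical hooks supplied by the caller.
    Returns "promoted" if every step passes, otherwise "rolled_back".
    """
    for percent in steps:
        shift_traffic(percent)  # Shift % traffic to the new version
        bake()                  # Bake: wait while metrics accumulate
        if not judge():         # Judge: is the new version healthy?
            shift_traffic(0)    # Rollback: send all traffic back
            return "rolled_back"
    return "promoted"           # Promote: every step passed
```

Passing the hooks in as callables keeps the loop independent of any particular router or metrics backend, which is one way a platform team might offer the same flow to many applications.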
  7. Progressive Deployment

    Initial state: Primary 100%, Canary 0%, Baseline 0%
    Traffic shifting: Primary 50%, Canary 25%, Baseline 25%
    Version labels shown: V1, V2, V2
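The two states on this slide are consistent with a split in which the baseline receives the same share as the canary and the primary keeps the remainder. A minimal sketch of that arithmetic follows; the function name and the equal-share rule are assumptions inferred from the percentages shown, not a documented rule.

```python
def traffic_split(canary_percent: int) -> dict[str, int]:
    """Return traffic weights for primary, canary and baseline.

    Assumes (hypothetically) that the baseline always mirrors the
    canary's share, so the two can be compared on equal traffic.
    """
    if not 0 <= canary_percent <= 50:
        raise ValueError("canary_percent must be between 0 and 50")
    return {
        "primary": 100 - 2 * canary_percent,
        "canary": canary_percent,
        "baseline": canary_percent,
    }


print(traffic_split(0))   # {'primary': 100, 'canary': 0, 'baseline': 0}
print(traffic_split(25))  # {'primary': 50, 'canary': 25, 'baseline': 25}
```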
  8. Progressive Deployment – User Interface
  9. Chaos Engineering – Goals

    • Verify Autoscaling
    • Understand Tolerated Disruptions
    • Practice past incidents
    • Prevent future incidents
  10. Chaos Engineering – Architecture

    Diagram labels: Continuous Delivery Control Plane; chaos-controller; Cluster Z; Pod X; Pod Y
    Request: Block connectivity between Pod X and Pod Y running in Cluster Z
    Routing: Target Pod X in Cluster Z
    Experiment: Block connectivity; Target Pod X
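The slide suggests that a request such as "block connectivity between Pod X and Pod Y in Cluster Z" is routed to the chaos-controller of the target cluster. The sketch below illustrates that routing step only. Every class, field, and function name here is hypothetical, and the controller merely records the experiment instead of injecting a fault.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    action: str      # e.g. "block-connectivity"
    cluster: str     # cluster that hosts the target pod
    target_pod: str  # pod the fault is applied to
    peer_pod: str    # pod whose traffic should be blocked


class ChaosController:
    """Stand-in for a per-cluster chaos-controller (hypothetical API)."""

    def __init__(self, cluster: str) -> None:
        self.cluster = cluster
        self.received: list[Experiment] = []

    def apply(self, experiment: Experiment) -> None:
        # A real controller would inject the fault; this one only records it.
        self.received.append(experiment)


def dispatch(experiment: Experiment, controllers: dict[str, ChaosController]) -> None:
    """Control-plane step: hand the experiment to the target cluster's controller."""
    controller = controllers.get(experiment.cluster)
    if controller is None:
        raise KeyError(f"no chaos-controller registered for {experiment.cluster}")
    controller.apply(experiment)


controllers = {"cluster-z": ChaosController("cluster-z")}
dispatch(Experiment("block-connectivity", "cluster-z", "pod-x", "pod-y"), controllers)
```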
  11. Chaos Engineering – User Interface
  12. Region Failover as a Service – Use Cases

    Use case 1 – Cluster local (new platform)
    Diagram labels: Region 1, Cluster X; Region 2, Cluster X; app-A; app-B (shown twice); egress; ingress
  13. Region Failover as a Service – Use Cases

    Use case 2 – Cross Cluster (new platform)
    Diagram labels: Region 1 Cluster Y; Region 1 Cluster X; Region 2 Cluster X; app-A (shown twice); app-C; egress; ingress
  14. Region Failover as a Service – Use Cases

    Use case 3 – Legacy platform to cluster in new platform
    Diagram labels: Region 1, Cluster X; Region 2, Cluster X; Legacy Platform; app-A (shown twice); app-D; ingress
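The three use cases share one decision: which region should serve a call once a region has been failed over. A minimal sketch of that decision follows, assuming a simple ordered preference list. The function name, its parameters, and the region identifiers are hypothetical and are not taken from the service described in the slides.

```python
def pick_region(preferred: str, regions: list[str], failed_over: set[str]) -> str:
    """Return the region that should serve a call.

    Hypothetical rule: use the preferred region unless it has been
    failed over, otherwise fall back to the first remaining region.
    """
    if preferred in regions and preferred not in failed_over:
        return preferred
    for region in regions:
        if region not in failed_over:
            return region
    raise RuntimeError("no region available")


regions = ["region-1", "region-2"]
print(pick_region("region-1", regions, set()))         # region-1
print(pick_region("region-1", regions, {"region-1"}))  # region-2
```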
  15. Region Failover as a Service – User Interface
  16. Tracking Adoption and Usage

    • Chaos Engineering
    • Progressive Deployment
  17. Promotion and Advocacy

    • Byte Size Videos
    • Public Blogposts (https://medium.com/expedia-group-tech)
    • Internal Success Stories
    • GameDays
  18. Reliability Hub

    The internal Reliability Hub aims to:
    ✓ Provide best practices for engineers who want to design highly available and resilient systems
    ✓ Raise awareness of the Reliability Engineering solutions that are available
    ✓ Provide practical guidelines on how to verify resiliency using Chaos Engineering
  19. Our Journey

    Timeline: 2021; 2022; 2023
    Phases: Systems Design and Feedback; Implementation; Integrations; Adoption
  20. Key Takeaways

    01 | Drive reliability through tools and data
    • Analyse incident data to identify themes
    • Use tools to prevent and prepare for incidents
    • Produce data when building tools to measure adoption and drive investment decisions

    02 | Focus on your Customers
    • Create a paved road and provide great developer experience
    • Integrate with existing platforms and tools
    • Bridge the gaps between people, processes, and tools