Paving the Road to Proactive Reliability

Nikos Katirtzis
August 07, 2024

Transcript

  1. Paving the Road for Proactive Reliability

    Kaushik Patel, Senior Engineering Manager
    Nikos Katirtzis, Senior Software Engineer
  2. Expedia Group

    Paving the Road for Proactive Reliability
  3. Our Scale

    35+ functional domains
    26k+ apps
    500+ Kubernetes clusters
    40k+ nodes
    Multiple AWS regions
  4. Path to Paved Road

    Before: Runtime Platform 1, Observability Tool 1, Service Mesh 1; Runtime Platform 2, Observability Tool 2, Service Mesh 2; Runtime Platform 3, Observability Tool 3, Service Mesh 3
    After: Runtime Platform, Observability Tool, Service Mesh
  5. Incident Analysis (2019)

    Charts: Incidents grouped by theme; Contributing factor that caused incidents
    Chart labels: Change, Everything Else; Change, Config Drift, Unknown, Infrastructure Failure, Certificate Expiration, Other
    Source: https://www.subbu.org/articles/2019/incidents-trends-from-the-trenches/
  6. Release Safety Strategies

    Blue/Green: Router, Blue, Green
    Canary Flow: Initiate, Shift % traffic, Bake, Judge, Promote/Rollback
  7. Progressive Deployment

    Initial State: Primary 100%, Canary 0%, Baseline 0%
    Traffic shifting: Primary 50%, Canary 25%, Baseline 25%
    Version labels in diagram: V1, V2, V2
  8. Progressive Deployment – User Interface
  9. Chaos Engineering – Goals

    Verify Autoscaling
    Understand Tolerated Disruptions
    Practice past incidents
    Prevent future incidents
  10. Chaos Engineering – Architecture

    Components: Continuous Delivery Control Plane, chaos-controller, Cluster Z (Pod X, Pod Y)
    Request: Block connectivity between Pod X and Pod Y running in Cluster Z
    Target: Pod X in Cluster Z
    Experiment: Block connectivity (Target: Pod X)
  11. Chaos Engineering – User Interface
  12. Region Failover as a Service – Use Cases

    Use case 1 – Cluster local (new platform)
    Diagram labels: Region 1 Cluster X, Region 2 Cluster X; app-A, app-B, app-B; egress, ingress
  13. Region Failover as a Service – Use Cases

    Use case 2 – Cross Cluster (new platform)
    Diagram labels: Region 1 Cluster Y, Region 1 Cluster X, Region 2 Cluster X; app-A, app-A, app-C; ingress, egress, ingress
  14. Region Failover as a Service – Use Cases

    Use case 3 – Legacy platform to cluster in new platform
    Diagram labels: Region 1 Cluster X, Region 2 Cluster X, Legacy Platform; app-A, app-A, app-D; ingress, ingress
  15. Region Failover as a Service – User Interface
  16. Tracking Adoption and Usage

    Chaos Engineering
    Progressive Deployment
  17. Promotion and Advocacy

    Byte Size Videos
    Public Blogposts (https://medium.com/expedia-group-tech)
    Internal Success Stories
    GameDays
  18. Reliability Hub

    The internal Reliability Hub aims to:
    ✓ Provide best practices for engineers who want to design highly available and resilient systems
    ✓ Raise awareness of the Reliability Engineering solutions which are available
    ✓ Provide practical guidelines on how to verify resiliency using Chaos Engineering
  19. Our Journey

    Timeline: 2021, 2022, 2023
    Phases: Systems Design and Feedback, Implementation, Integrations, Adoption
  20. Key Takeaways

    01 | Drive reliability through tools and data
    • Analyse incident data to identify themes
    • Use tools to prevent and prepare for incidents
    • Produce data when building tools to measure adoption and drive investment decisions

    02 | Focus on your Customers
    • Create a paved road and provide great developer experience
    • Integrate with existing platforms and tools
    • Bridge the gaps between people, processes, and tools
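
The Canary Flow stages on slide 6 (Initiate, Shift % traffic, Bake, Judge, Promote/Rollback) and the traffic percentages on slide 7 can be sketched roughly as follows. This is a minimal illustration, not Expedia Group's implementation: `TrafficSplit`, `run_canary`, `judge`, the error-rate metric, and the tolerance value are all hypothetical names and assumptions introduced here, and the slides do not say how the real judgement is made.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class TrafficSplit:
    """Percentage of traffic sent to each deployment group."""
    primary: int
    canary: int
    baseline: int

    def __post_init__(self) -> None:
        if self.primary + self.canary + self.baseline != 100:
            raise ValueError("traffic percentages must sum to 100")


# Values shown on the Progressive Deployment slide.
INITIAL = TrafficSplit(primary=100, canary=0, baseline=0)
SHIFTED = TrafficSplit(primary=50, canary=25, baseline=25)


def judge(error_rates: Dict[str, float], tolerance: float = 0.01) -> bool:
    """Hypothetical rule: pass if the canary's error rate is no worse than
    the baseline's plus a tolerance. The real criteria are not in the slides."""
    return error_rates["canary"] <= error_rates["baseline"] + tolerance


def run_canary(
    collect_error_rates: Callable[[TrafficSplit], Dict[str, float]],
    shifted: TrafficSplit = SHIFTED,
) -> Tuple[str, List[Tuple[str, TrafficSplit]]]:
    """Walk through the five stages and return the outcome plus a stage log."""
    history: List[Tuple[str, TrafficSplit]] = [("initiate", INITIAL)]
    history.append(("shift", shifted))

    # Bake: let the shifted split serve traffic while metrics are gathered.
    error_rates = collect_error_rates(shifted)
    history.append(("bake", shifted))

    passed = judge(error_rates)
    history.append(("judge", shifted))

    # Assumption: in both outcomes all traffic returns to the primary; on
    # promote the primary would be updated to the new version first.
    outcome = "promote" if passed else "rollback"
    history.append((outcome, INITIAL))
    return outcome, history


if __name__ == "__main__":
    # Stand-in for a metrics source; a real one would query an observability tool.
    outcome, history = run_canary(lambda split: {"canary": 0.02, "baseline": 0.02})
    print(outcome)
    for stage, split in history:
        print(stage, split)
```

Passing the metrics source in as a callable keeps the stage sequencing separate from whichever observability tool supplies the data, which makes the flow easy to exercise with canned values.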