Chaos Engineering at Scale - Systems at Scale 2022

Nikos Katirtzis
June 25, 2024

Transcript

  1. EXPEDIA GROUP

    Chaos Engineering at Scale. Nikos Katirtzis, Expedia Group
  2. CHAOS ENGINEERING AT SCALE, EXPEDIA GROUP: Intro - Expedia Group’s Scale

    35+ functional domains, >15k+ apps, 500+ Kubernetes clusters, 40k nodes, 4 AWS regions
  3. Intro - On-road Platform Experience

    Before: Runtime Platform 1, 2, 3; Observability Tool 1, 2, 3; Service Mesh 1, 2, 3
    After: Runtime Platform, Observability Tool, Service Mesh
  4. Intro - Benefits of Common Tools

    Diagram labels: Tool 1, Tool 2, Common Tool
  5. Chaos Engineering - How it Started

    Before: Chaos Engineering Platform 1, 2, 3 on Runtime Platform 1, 2, 3
    After: Chaos Engineering Platform on Runtime Platform
  6. Chaos Engineering Platform - 30,000 Foot View
  7. Chaos Engineering Platform - Chaos Controller

    • Chaos Engineering Framework for Kubernetes
    • Execution of Chaos Experiments using labels to reduce the blast radius
    • Support for a wide range of fault injection types at Container, Pod, and Node level
    • Observability through metrics, logs, and Kubernetes events
    • Safety nets

    https://github.com/DataDog/chaos-controller
  8. Chaos Engineering Platform - Chaos Controller
  9. Chaos Engineering Platform - Fault Injection Types

    • State: Node shutdown, Container termination
    • Resource: CPU pressure, Disk pressure
    • Network: Packet drop, Packet corruption, Delay with jitter, Bandwidth struggle
    • Host-level: Availability Zone failure, DNS spoofing
    • Request: gRPC status code manipulation

    https://github.com/DataDog/chaos-controller
  10. Chaos Engineering Platform - Architecture

    Diagram label: Custom Spinnaker plugin
  11. Chaos Engineering Platform - CI/CD Integration
  12. Chaos Engineering Platform - Testing and Debugging

    • A simple REST/gRPC application
    • A curl Pod calling a Web Server

    https://github.com/DataDog/chaos-controller
  13. Chaos Engineering - Scaling with Visibility
  14. Chaos Engineering - Promotion

    • Byte Size Videos
    • Public Blogposts: https://medium.com/expedia-group-tech
    • Internal Success Stories
    • Architecture-driven GameDays
  15. Chaos Engineering - Closing the Feedback Loop

    The internal Reliability Hub aims to:
    ✓ Provide guidelines for engineers who want to design highly available and resilient systems
    ✓ Raise awareness of the Reliability Engineering Products which are available

    Reliability practices come with guidelines on how to verify them using Chaos Experiments
  16. Thank You

    Chaos Engineering at Expedia Group: https://medium.com/expedia-group-tech
    © 2022 Expedia Group. All rights reserved
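
The Chaos Controller slides mention running experiments "using labels to reduce the blast radius" and list "Delay with jitter" among the network faults. The following is a minimal in-memory sketch of those two ideas, not the chaos-controller API: `Pod`, `select_targets`, and `delay_with_jitter` are hypothetical names introduced for illustration, and the uniform jitter distribution is an assumption.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Pod:
    """Hypothetical stand-in for a Kubernetes Pod: a name plus its labels."""
    name: str
    labels: Dict[str, str] = field(default_factory=dict)


def select_targets(
    pods: List[Pod],
    selector: Dict[str, str],
    count: int,
    rng: Optional[random.Random] = None,
) -> List[Pod]:
    """Pick at most `count` pods whose labels match every selector pair."""
    if rng is None:
        rng = random.Random()
    matching = [
        pod for pod in pods
        if all(pod.labels.get(key) == value for key, value in selector.items())
    ]
    # Cap the blast radius: never disrupt more than `count` matching pods.
    return rng.sample(matching, min(count, len(matching)))


def delay_with_jitter(
    base_ms: float,
    jitter_ms: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Return an injected latency in ms, drawn uniformly from base +/- jitter."""
    if rng is None:
        rng = random.Random()
    # Clamp at zero so a large jitter never yields a negative delay.
    return max(0.0, base_ms + rng.uniform(-jitter_ms, jitter_ms))


if __name__ == "__main__":
    cluster = [
        Pod("web-1", {"app": "web", "env": "test"}),
        Pod("web-2", {"app": "web", "env": "test"}),
        Pod("db-1", {"app": "db", "env": "test"}),
    ]
    # Only one of the two "web" pods is targeted; "db-1" is never eligible.
    for pod in select_targets(cluster, {"app": "web"}, count=1):
        print(pod.name, "gets", round(delay_with_jitter(100, 20), 1), "ms of delay")
```

Keeping the selector and the count as separate inputs mirrors the intent on the slide: labels decide which workloads are eligible, and the count bounds how many of them a single experiment can touch.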