Scaling Reliability Engineering with Tools

Expedia Group
Scaling Reliability Engineering with Tools
Nikos Katirtzis, Daniel Albuquerque

Agenda
01 | Intro: Reliability Engineering, On-road Platforms
02 | Chaos Engineering: Concepts, Chaos Engineering Framework
03 | Failover-as-a-Service: Concepts, Failover-as-a-Service Framework
04 | Key Takeaways: Lessons Learned, Key Takeaways

01 Intro
• Reliability Engineering
• On-road Platforms

Intro - Reliability Engineering Models
Single Reliability Engineering team
Reactive Reliability Engineering

Intro - Reliability Engineering Models
Reactive Reliability Engineering
Domain 1, Domain 2, Domain 3

Intro - Reliability Engineering Models
Reactive Reliability Engineering
Proactive Reliability Engineering
Centre of Excellence
Domain 1, Domain 2, Domain 3

Intro - Reliability Engineering Models
Reactive Reliability Engineering
Proactive Reliability Engineering
Reliability Tooling
Centre of Excellence
Domain 1, Domain 2, Domain 3

Intro - Programming Models
Scripting
• Tactical
• Limited up-front design
• Requires limited time and investment
• Limited focus on good practices
• Limited focus on testing
• Limited documentation
• Hard to maintain
• Act of an individual
Software Engineering
• Strategic
• Design-first
• Requires significant time and investment
• Strong focus on good practices
• Strong focus on testing
• Extensive documentation
• Easy to maintain
• Team effort

Intro - On-road Platform Experience
Before
• Runtime Platform 1, Runtime Platform 2, Runtime Platform 3
• Observability Tool 1, Observability Tool 2, Observability Tool 3
• Service Mesh 1, Service Mesh 2, Service Mesh 3
After
• Runtime Platform
• Observability Tool
• Service Mesh

Intro - Benefits of Common Tools
Tool 1, Tool 2
Common Tool

02 Chaos Engineering
• Concepts
• Chaos Engineering Framework

Chaos Engineering - Concepts
The discipline of experimenting to ensure that the impact of failure is mitigated
• Steady State: Service X is healthy
• Hypothesis: When terminating 1 instance of the service, the service remains healthy
• Fault Injection: Terminate 1 instance of the service
• Verification (Observation, Action):
  Alerts didn’t trigger, Fix the alerts
  Not enough capacity, Enable autoscaling
• Repeat
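The experiment loop on this slide (steady state, hypothesis, fault injection, verification) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Experiment` class, the `run` function, and the in-memory `instances` list are hypothetical stand-ins, not part of any tool named in the deck.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    """Hypothetical container for one chaos experiment."""
    steady_state: Callable[[], bool]    # e.g. "Service X is healthy"
    inject_fault: Callable[[], object]  # e.g. terminate 1 instance
    hypothesis: Callable[[], bool]      # e.g. "the service remains healthy"


def run(experiment: Experiment) -> str:
    # Steady state: never inject faults into a system that is already unhealthy.
    if not experiment.steady_state():
        return "aborted: steady state not met"
    # Fault injection.
    experiment.inject_fault()
    # Verification: check whether the hypothesis still holds after the fault.
    if experiment.hypothesis():
        return "hypothesis held"
    return "hypothesis failed: record observation and action, then repeat"


# Toy in-memory "service" with three instances, for illustration only.
instances: List[str] = ["a", "b", "c"]

demo = Experiment(
    steady_state=lambda: len(instances) >= 2,
    inject_fault=lambda: instances.pop(),
    hypothesis=lambda: len(instances) >= 2,
)
result = run(demo)
print(result)
```

In a real experiment the three callables would query monitoring and trigger the fault through a chaos tool; a failed hypothesis feeds the Observation and Action pairs listed above.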

Chaos Engineering - How it Started
Before
• Chaos Engineering Platform 1, Chaos Engineering Platform 2, Chaos Engineering Platform 3
• Runtime Platform 1, Runtime Platform 2, Runtime Platform 3
After
• Chaos Engineering Platform
• Runtime Platform
Chaos Engineering at Expedia Group: https://medium.com/expedia-group-tech

Chaos Engineering - Product Offering
Chaos Engineering Framework
• Custom Kubernetes Controller
• Spinnaker Plugin
• Demo Resources
Chaos Engineering Space
• Chaos Engineering Concepts
• Internal Tools
• Past Experiments from Teams

Chaos Engineering - 30,000 Foot View

Chaos Engineering - Chaos Controller
• Chaos Engineering Framework for Kubernetes
• Execution of Chaos Experiments using labels to reduce the blast radius
• Support for a wide range of fault injection types at Container, Pod, and Node level
• Observability through metrics, logs, and Kubernetes events
• Safety nets
https://github.com/DataDog/chaos-controller
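The idea of using labels to reduce the blast radius can be illustrated with a small Python sketch. It is not the chaos-controller API: the `Pod` type, the `select_targets` function, and the `max_count` cap are assumptions made for this example, showing only how a label selector plus a target limit narrows which Pods a fault may touch.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Pod:
    """Hypothetical, simplified view of a Pod: a name and its labels."""
    name: str
    labels: Dict[str, str] = field(default_factory=dict)


def select_targets(
    pods: List[Pod],
    selector: Dict[str, str],
    max_count: int,
    seed: Optional[int] = None,
) -> List[Pod]:
    """Return at most max_count Pods whose labels match every selector entry."""
    matching = [
        pod for pod in pods
        if all(pod.labels.get(key) == value for key, value in selector.items())
    ]
    # Pick targets at random so repeated experiments do not always hit the same Pod.
    rng = random.Random(seed)
    rng.shuffle(matching)
    return matching[:max_count]


pods = [
    Pod("checkout-1", {"app": "checkout", "env": "test"}),
    Pod("checkout-2", {"app": "checkout", "env": "test"}),
    Pod("search-1", {"app": "search", "env": "test"}),
]
targets = select_targets(pods, {"app": "checkout"}, max_count=1, seed=7)
print([pod.name for pod in targets])
```

Only Pods carrying the selected labels are eligible, and the cap keeps the fault to a subset of them, which is the blast-radius reduction the slide refers to.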

Chaos Engineering - Chaos Controller

Chaos Engineering - Demo Resources
• A simple REST/gRPC application
• A curl Pod calling a Web Server
https://github.com/DataDog/chaos-controller

Chaos Engineering - Fault Injection Types
• State: Node shutdown, Container termination
• Resource: CPU pressure, Disk pressure
• Network: Packet drop, Packet corruption, Delay with jitter, Bandwidth struggle
• Host-level: Availability Zone failure, DNS spoofing
• Request: gRPC status code manipulation
https://github.com/DataDog/chaos-controller

Chaos Engineering - Architecture
A Hitchhiker’s Guide to Spinnaker Plugins: https://nkatirtzis.medium.com

Chaos Engineering - Scaling with Visibility

Chaos Engineering - Scaling with Community
Past experiments documented internally and in blog posts: https://medium.com/expedia-group-tech

03 Failover-as-a-Service
• Concepts
• Failover-as-a-Service Framework

Failover-as-a-Service - Recovering from Failures
Source: https://commons.wikimedia.org

Failover-as-a-Service - Multi-Region

Failover-as-a-Service - Multi-Region
Blue region, Green region

Failover-as-a-Service - Fault Domains
Source: Getty Images

Failover-as-a-Service - Fault Domains
Blue region, Green region

Failover-as-a-Service - Fault Domains
🔥 🔥 🔥
Blue region, Green region

Failover-as-a-Service - Regional Evacuation
Source: https://www.theintelligencer.net

Failover-as-a-Service - Small Isolated Failures
Source: https://www.geograph.ie

Failover-as-a-Service - Small Diversion of Traffic?
🔥
Blue region, Green region

Failover-as-a-Service - How?

Failover-as-a-Service - DNS?

Failover-as-a-Service - Service Mesh
Source: Getty Images

Failover-as-a-Service - Service Mesh

Failover-as-a-Service - Capabilities
• Scaling up infra
• Scaling up workloads
• Custom actions
• Gradual traffic shifting
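Gradual traffic shifting between two regions can be sketched as a loop that moves weight in fixed steps and rolls back one step if a health check fails. The deck does not describe the framework's implementation, so everything here is an assumption for illustration: the `shift_traffic` function, the `step_percent` parameter, and the `is_healthy` callback are hypothetical, and the region names follow the Blue and Green labels used on earlier slides.

```python
from typing import Callable, Dict, List

Weights = Dict[str, int]  # percentage of traffic per region, summing to 100


def shift_traffic(
    step_percent: int,
    is_healthy: Callable[[Weights], bool],
) -> List[Weights]:
    """Shift traffic from blue to green in steps; return every split applied."""
    if not 0 < step_percent <= 100:
        raise ValueError("step_percent must be in the range 1..100")

    applied: List[Weights] = [{"blue": 100, "green": 0}]
    green = 0
    while green < 100:
        green = min(100, green + step_percent)
        applied.append({"blue": 100 - green, "green": green})
        # After each step, confirm the receiving region copes with the new load.
        if not is_healthy(applied[-1]):
            # Roll back to the previous split and stop shifting.
            applied.append(dict(applied[-2]))
            break
    return applied


# Full evacuation of blue in two steps when every check passes.
full = shift_traffic(50, lambda weights: True)
# Green is assumed to cope with at most 50% of traffic, so the shift stops there.
partial = shift_traffic(25, lambda weights: weights["green"] <= 50)
print(full[-1], partial[-1])
```

Shifting in steps gives the other capabilities on this slide (scaling up infra and workloads) time to take effect before the receiving region carries the full load.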

04 Key Takeaways
Lessons Learned & Key Takeaways

Key Takeaways
Invest in Proactive Reliability
• Reactive reliability should not be your only strategy
• Use tools to facilitate prevention of and recovery from incidents
• Identify and verify failure modes of your systems
• Practice incidents
• Automate recovery
Focus on Customer
• Know your customers
• Get early and constant feedback
• Automate as much as you can
• Provide great developer experience
• Have ways to measure adoption
• Users will not come to you, you should go to them
Consolidate tooling
• Reduce fragmentation
• Create an on-road platform experience
• Integrate with existing platforms and tools
• De-prioritise legacy systems
• Focus on adoption
