chaos with computers since the 80’s. • Background in development and operations. • My own three chaos monkeys at home. • Skateboarder at heart. • Serverless aficionado. ABOUT ME AWS Pop-up Loft Stockholm @gunnargrosch
private and hybrid cloud platforms. • Expertise, quality, cost-effectiveness and automation pervade our delivery of critical business IT operations. • Delivering everything from magical customer support to fully managed cloud services. • Implementing chaos engineering for customers. ABOUT OPSIO
system in order to build confidence in the system’s capability to withstand turbulent conditions in production. principlesofchaos.org WHAT IS CHAOS ENGINEERING?
if, but when. • Building resilient systems requires experience with failure. • By simulating potential errors in advance we can verify that our systems behave as we expect, or fix them if they don’t. • Chaos Engineering is an emerging discipline but the underlying concepts are not. WHAT IS CHAOS ENGINEERING?
our systems. • Inject something harmful in order to build an immunity. • We are “breaking things on purpose” to learn new information about our systems through experimentation. • By triggering incidents intentionally in a controlled way we gain confidence that our systems can deal with failure. • Chaos Engineering requires a base level of resilience! WHAT IS CHAOS ENGINEERING?
distributed computing. WHAT IS CHAOS ENGINEERING?
Fallacy | Effect
The network is reliable | Error handling/retries needed
Latency is zero | Minimize number of requests
Bandwidth is infinite | Send small payloads
The network is secure | Secure data/Authenticate requests
Topology doesn't change | Changes affect latency, bandwidth and endpoints
There is one administrator | Changes affect ability to reach destination
Transport cost is zero | Costs must be budgeted
The network is homogeneous | Affects reliability, latency and bandwidth
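The first row of the table ("The network is reliable" leads to "Error handling/retries needed") can be illustrated with a small retry helper. This is only a sketch: the function name `call_with_retries`, its parameters and the choice of exponential backoff with jitter are assumptions made for illustration, not something taken from the slides.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with exponential backoff.

    `operation`, `max_attempts`, `base_delay` and `sleep` are hypothetical
    names chosen for this sketch.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # The base delay doubles on every attempt; random jitter spreads
            # out retries coming from many clients at once.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))
```

Passing `sleep` as a parameter keeps the helper easy to exercise in a test without actually waiting.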
Acceptance testing ADVANCE FROM TESTING TO EXPERIMENTING [Diagrams: Input, Component X, Output; Input, Component X, Component Y, Output; Service X, Service Y]
DNS outage ADVANCE FROM TESTING TO EXPERIMENTING [Diagrams: Service X, Service Y, ⏱; Service X, Service Y, ❌; Service X, Service Y]
to identify bugs as far away from production as possible. • Reverse the strategy: Run your chaos experiments as close to production as possible. WHERE TO USE CHAOS ENGINEERING?
move from physical infrastructure to AWS. • 2011 The Simian Army was born and added more failure injection monkeys: Conformity Monkey, Doctor Monkey, Janitor Monkey, Security Monkey, 10-18 Monkey, Chaos Gorilla and Latency Monkey. • 2012 Netflix shared the source code for Chaos Monkey on GitHub. THE HISTORY OF CHAOS ENGINEERING
Gives more granular control over the “blast radius”. Failure-as-a-Service. • 2017 Chaos Engineering book by members of the Netflix team released by O’Reilly Media. THE HISTORY OF CHAOS ENGINEERING
your hypothesis 4. Contain the blast radius 5. Notify the organization 6. Run your chaos experiment 7. Measure the results 8. Scale up or abort and fix HOW TO PERFORM A CHAOS EXPERIMENT?
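The later steps in this list (notify, run, measure, then scale up or abort and fix) can be sketched as a small harness. Everything here is an assumption for illustration: the `Experiment` class, its fields and `run_experiment` are hypothetical names, and a real experiment would use real metrics and a real failure injection instead of plain callables.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    hypothesis: str                    # what we expect to stay true
    steady_state: Callable[[], bool]   # True while the system looks healthy
    inject: Callable[[], None]         # starts the failure
    rollback: Callable[[], None]       # undoes the failure (the "stop" button)


def run_experiment(exp, notify=print):
    """Run one experiment and return 'aborted', 'scale up' or 'abort and fix'."""
    notify(f"Starting chaos experiment: {exp.hypothesis}")
    if not exp.steady_state():
        # Never inject failure into a system that is already unhealthy.
        notify("Steady state not met before injection; aborting.")
        return "aborted"
    try:
        exp.inject()
        held = exp.steady_state()      # measure while the failure is active
    finally:
        exp.rollback()                 # always undo the injection
    outcome = "scale up" if held else "abort and fix"
    notify(f"Hypothesis {'held' if held else 'failed'}: {outcome}")
    return outcome
```

The `finally` block is the important design choice in this sketch: the rollback runs even if the injection or the measurement raises.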
Vital business metrics • Steady state is not necessarily continuous • Business metrics are more useful than system metrics for chaos experiments DEFINE STEADY STATE
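Because steady state is not necessarily continuous, one possible way to check it is to compare a business metric against a baseline for the same time of day. The sketch below assumes a made-up baseline table of orders per minute and a 10% tolerance; both the numbers and the function names are hypothetical.

```python
def within_steady_state(observed, baseline, tolerance=0.10):
    """Return True if observed is within +/- tolerance (a fraction) of baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return abs(observed - baseline) / baseline <= tolerance


# Hypothetical baseline: typical orders per minute for some hours of the day.
ORDERS_PER_MINUTE_BY_HOUR = {3: 12.0, 12: 140.0, 20: 95.0}


def orders_in_steady_state(hour, observed_orders_per_minute):
    """Compare the observed order rate with the baseline for that hour."""
    baseline = ORDERS_PER_MINUTE_BY_HOUR[hour]
    return within_steady_state(observed_orders_per_minute, baseline)
```

With this baseline, 130 orders per minute counts as steady state at noon but not at 03:00, which is the point of a non-continuous steady state.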
IO • Network: Latency, DNS, Packet loss • State: Processes, Shutdown, Time • Alerting metrics • Total alerts • Time to resolution • Self-resolving alerts • Most frequent alerts DEFINE STEADY STATE
Total SEVs • MTTD, MTTR and MTBF for SEVs • Application metrics • Events • Error rates • Performance counters • Business metrics • Orders per time unit • Messages per time unit • Number of items placed in cart DEFINE STEADY STATE
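As a rough illustration of the SEV metrics above, the sketch below computes MTTD, MTTR and MTBF from a list of incident timestamps. The definitions used are assumptions, since teams define these differently: here MTTD is start to detection, MTTR is start to resolution, and MTBF is the gap from one incident's resolution to the next incident's start. The function name `sev_metrics` is hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def sev_metrics(incidents):
    """Compute MTTD, MTTR and MTBF.

    incidents: list of (started, detected, resolved) datetimes, sorted by start.
    """
    mttd = mean((d - s).total_seconds() for s, d, r in incidents)
    mttr = mean((r - s).total_seconds() for s, d, r in incidents)
    # Time between failures: resolution of one incident to start of the next.
    gaps = [
        (nxt[0] - cur[2]).total_seconds()
        for cur, nxt in zip(incidents, incidents[1:])
    ]
    return {
        "mttd": timedelta(seconds=mttd),
        "mttr": timedelta(seconds=mttr),
        "mtbf": timedelta(seconds=mean(gaps)) if gaps else None,
    }


# Made-up incident data for illustration only.
example = [
    (datetime(2019, 1, 7, 10, 0), datetime(2019, 1, 7, 10, 5), datetime(2019, 1, 7, 10, 30)),
    (datetime(2019, 1, 7, 12, 30), datetime(2019, 1, 7, 12, 45), datetime(2019, 1, 7, 13, 0)),
]
```

Tracking these numbers before and after a series of experiments gives one way to see whether resilience is actually improving.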
included. • What could go wrong? • Whiteboard the system, services and dependencies. • Make sure to have a “stop” button. PLAN THE CHAOS EXPERIMENT
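A “stop” button can be as simple as a flag that every injection step checks before it runs. The following is a minimal sketch using `threading.Event` so the button can be pressed from another thread; `StopButton` and `run_steps` are hypothetical names, not part of any tool mentioned in this deck.

```python
import threading


class StopButton:
    """A thread-safe flag that halts an experiment once pressed."""

    def __init__(self):
        self._event = threading.Event()

    def press(self):
        self._event.set()

    def pressed(self):
        return self._event.is_set()


def run_steps(steps, stop):
    """Run (name, action) steps in order until done or the stop button is pressed.

    Returns the names of the steps that actually ran.
    """
    completed = []
    for name, action in steps:
        if stop.pressed():
            break  # someone hit the button: do not inject anything further
        action()
        completed.append(name)
    return completed
```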
stack to increase system resilience. FORM YOUR HYPOTHESIS Stack layers: API • APPLICATION • CACHES • DATABASE • OPERATING SYSTEM • HARDWARE • NETWORK • DNS • AVAILABILITY ZONE • REGION • PEOPLE
if the database becomes read-only? • What if availability zone B is down? • What if Slack is down? • What if an instance terminates? • What if microservice X isn’t responding? Don’t perform chaos experiments in production if you know that they will cause damage. Always fix known problems first! FORM YOUR HYPOTHESIS
small, a failure will most likely not cause an outage or customer pain. • Remember “run experiments in production”? The closer to production, the more you will learn. CONTAIN THE BLAST RADIUS
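One way to keep the blast radius small is to limit an experiment to a small random fraction of hosts. This is a sketch under assumptions: `pick_targets`, the host names and the "at least one host" rule are all illustrative choices.

```python
import math
import random


def pick_targets(hosts, fraction, rng=None):
    """Pick a random subset covering `fraction` of hosts, but at least one host."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not hosts:
        return []
    rng = rng or random.Random()
    # Round down so the blast radius never exceeds the requested fraction,
    # except for the minimum of one host needed to run anything at all.
    count = max(1, math.floor(len(hosts) * fraction))
    return rng.sample(hosts, count)


# Hypothetical fleet of ten web servers.
fleet = [f"web-{i}" for i in range(10)]
```

Calling `pick_targets(fleet, 0.5)` returns five distinct hosts from the fleet, while a very small fraction still returns a single host.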
Inform your organization about what you’re doing, why you’re doing it, and when you’re doing it. • As confidence grows, you can skip “when you’re doing it”. NOTIFY THE ORGANIZATION
correct. • Is the system resilient to what you injected? • Did anything unexpected happen? • Did you have the proper metrics? You should have. • Share your progress and success! MEASURE THE RESULTS
scale up. • Increased scope can reveal effects that aren’t noticeable at a smaller scale. SCALE UP OR ABORT AND FIX
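The scale-up-or-abort decision can be sketched as a loop that widens the scope only while the hypothesis keeps holding. The scope values and the `run_at_scope` callback are assumptions for illustration; in practice the callback would run a full experiment at the given scope.

```python
def scale_up(run_at_scope, scopes=(0.01, 0.05, 0.25, 1.0)):
    """Run the experiment at increasing scope until the hypothesis fails.

    run_at_scope(scope) returns True if the hypothesis held at that scope.
    Returns (largest passing scope or None, failing scope or None).
    """
    passed = None
    for scope in scopes:
        if not run_at_scope(scope):
            return passed, scope   # abort here and fix before going wider
        passed = scope
    return passed, None
```

For example, a system that copes with failures on up to 5% of traffic but not 25% yields `(0.05, 0.25)`: the last scope that passed and the scope to fix before trying again.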
a free, open and community-driven toolkit and API to all the various forms of chaos engineering tools that the community needs. The Chaos Toolkit aims to be the simplest and easiest way to explore building, and automating, your own Chaos Engineering Experiments. Chaos Hub stands on the shoulders of the Chaos Toolkit to provide a complete, user-friendly platform to automate and collaborate on your Chaos Engineering and Resiliency efforts. https://chaostoolkit.org/ https://chaoshub.org/ https://github.com/chaostoolkit TOOLS FOR CHAOS ENGINEERING – CHAOS TOOLKIT & CHAOS HUB
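Chaos Toolkit experiments are declared as JSON (or YAML) documents with a steady-state hypothesis, a method and rollbacks. The sketch below builds such a document in Python. The field layout follows the Chaos Toolkit experiment format as best understood here and should be checked against the official documentation; the URL, the script paths and the function name `build_experiment` are hypothetical placeholders.

```python
import json


def build_experiment(health_url):
    """Build a Chaos Toolkit style experiment as a plain dict (illustrative only)."""
    return {
        "version": "1.0.0",
        "title": "Web tier survives the loss of one node",
        "description": "Hypothetical example experiment.",
        "steady-state-hypothesis": {
            "title": "Site responds with HTTP 200",
            "probes": [
                {
                    "type": "probe",
                    "name": "site-is-up",
                    "tolerance": 200,  # expected HTTP status code
                    "provider": {"type": "http", "url": health_url, "timeout": 3},
                }
            ],
        },
        "method": [
            {
                "type": "action",
                "name": "stop-one-web-node",
                # Hypothetical script; replace with a real action for your system.
                "provider": {"type": "process", "path": "./stop-one-web-node.sh"},
            }
        ],
        "rollbacks": [
            {
                "type": "action",
                "name": "start-web-node",
                "provider": {"type": "process", "path": "./start-web-node.sh"},
            }
        ],
    }


experiment_json = json.dumps(build_experiment("https://example.com/health"), indent=2)
```

Saved to a file such as `experiment.json`, a document like this would typically be run with the toolkit's `chaos run experiment.json` command.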
Toolkit and Chaos Hub, the Chaos Platform provides an HA, secure, collaborative, observable and customizable production-ready platform for your own chaos engineering experiments. https://chaosplatform.com/ https://chaosiq.io/ TOOLS FOR CHAOS ENGINEERING – CHAOSPLATFORM
simulate real outages with an ever-growing library of attacks. Downtime is expensive and damages customer trust. Gremlin's Failure as a Service finds weaknesses in your system before they cause problems. https://www.gremlin.com/ TOOLS FOR CHAOS ENGINEERING – GREMLIN
run inside of your production environment. Exposing engineers to failures more frequently incentivizes them to build resilient services. https://github.com/Netflix/chaosmonkey TOOLS FOR CHAOS ENGINEERING – CHAOS MONKEY
the experimentation on AWS Infrastructure to bring system security weaknesses to the forefront. Security is chaotic and the industry has traditionally put emphasis on the importance of preventative security control measures and defense-in-depth, whereas our mission is to drive new knowledge and perspective into the attack surface by delivering proactively through detective experimentation. https://github.com/Optum/ChaoSlingr TOOLS FOR CHAOS ENGINEERING – CHAOSSLINGR
with networking will cause errors, delays and other issues that affect the service. • Target: AWS Lightsail three-tier web cluster (LB, web fronts, DB) • Attack: Network corruption • Scope: Single instance • Expected results: LET’S BREAK SOMETHING – NETWORK CORRUPTION
instance. How well does the architecture handle the loss or termination of an instance? • Target: AWS WordPress reference architecture • Attack: Instance termination • Scope: Single instance • Expected results: LET’S BREAK SOMETHING – INSTANCE TERMINATION
ask what happens when it fails. • Prior to starting your chaos experiments it is vital to collect metrics. • Start with the smallest possible experiment that can cause impact. • Share your progress and success! TAKEAWAYS