Chaos Engineering - AWS Pop-up Loft Stockholm Oct 31 2018

Chaos Engineering – Breaking stuff on purpose
AWS Pop-up Loft Stockholm
Gunnar Grosch @gunnargrosch

ABOUT ME
• Cloud Evangelist at Opsio (www.opsio.se).
• Have been creating chaos with computers since the 80’s.
• Background in development and operations.
• My own three chaos monkeys at home.
• Skateboarder at heart.
• Serverless aficionado.

ABOUT OPSIO
• Opsio specializes in the operation and support of public, private and hybrid cloud platforms.
• Expertise, quality, cost-effectiveness and automation pervade our delivery of critical business IT operations.
• Delivering everything from magical customer support to fully managed cloud services.
• Implementing chaos engineering for customers.

Don’t ask what happens if a system fails, but ask what happens when it fails.

What is Chaos Engineering?

WHAT IS CHAOS ENGINEERING?
Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
principlesofchaos.org

WHAT IS CHAOS ENGINEERING?
• Sooner or later all complex systems will fail. Not if, but when.
• Building resilient systems requires experience with failure.
• By simulating potential errors in advance we can verify that our systems behave as we expect, or fix them if they don’t.
• Chaos Engineering is an emerging discipline, but the underlying concepts are not new.

WHAT IS CHAOS ENGINEERING?
• Thoughtful, planned experiments designed to reveal the weaknesses in our systems.
• Inject something harmful in order to build an immunity.
• We are “breaking things on purpose” to learn new information about our systems through experimentation.
• By triggering incidents intentionally in a controlled way we gain confidence that our systems can deal with failure.
• Chaos Engineering requires a base level of resilience!

WHAT IS CHAOS ENGINEERING?
• Many chaos experiments originate in the eight fallacies of distributed computing.

Fallacy                       Effect
The network is reliable       Error handling/retries needed
Latency is zero               Minimize number of requests
Bandwidth is infinite         Send small payloads
The network is secure         Secure data/Authenticate requests
Topology doesn't change       Changes affect latency, bandwidth and endpoints
There is one administrator    Changes affect ability to reach destination
Transport cost is zero        Costs must be budgeted
The network is homogeneous    Affects reliability, latency and bandwidth
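
The first row of the table ("The network is reliable" leads to "Error handling/retries needed") can be illustrated with a small sketch. It assumes the caller signals network trouble with `ConnectionError`; `call_with_retries` and `make_flaky` are hypothetical names invented for this example, not part of any tool mentioned in the talk.

```python
import random
import time


def call_with_retries(operation, attempts=3, base_delay_s=0.1, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Wait base, 2x base, 4x base, ... plus a little random jitter.
            sleep(base_delay_s * (2 ** attempt) + random.uniform(0, base_delay_s))


def make_flaky(failures):
    """Return a simulated network call that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def operation():
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("simulated network failure")
        return "ok"

    return operation
```

A chaos experiment that drops or corrupts packets is one way to check that retry logic like this actually exists and behaves as intended.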

Advance from testing to experimenting

ADVANCE FROM TESTING TO EXPERIMENTING
• Unit testing
• Integration testing
• System testing
• Acceptance testing
[Diagrams: Input, Component X, Output; Input, Component X, Component Y, Output; Service X, Service Y]

ADVANCE FROM TESTING TO EXPERIMENTING
• Add latency
• Service failures
• Exhaust resources
• DNS outage
[Diagrams: Service X and Service Y, shown with ⏱ and ❌ markers]
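
As a rough illustration of "add latency" and "service failures" between Service X and Service Y, the sketch below wraps a call in a fault injector. Everything here is an assumption made for the example: `service_y`, `with_latency`, `with_failures` and `InjectedFailure` are hypothetical names, and real tools usually inject faults at the network or infrastructure level rather than in application code.

```python
import random
import time


class InjectedFailure(Exception):
    """Raised instead of calling the real service."""


def with_latency(call, delay_s, sleep=time.sleep):
    """Wrap `call` so every invocation waits `delay_s` seconds first."""
    def wrapped(*args, **kwargs):
        sleep(delay_s)
        return call(*args, **kwargs)
    return wrapped


def with_failures(call, failure_rate, rng=None):
    """Wrap `call` so roughly a fraction `failure_rate` of invocations fail."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated failure of Service Y")
        return call(*args, **kwargs)
    return wrapped


def service_y(order_id):
    """Hypothetical downstream service used by Service X."""
    return {"order_id": order_id, "status": "confirmed"}
```

For example, `with_latency(service_y, 0.3)` delays every call by 300 ms, and `with_failures(service_y, 0.1)` makes about one call in ten fail.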

Where to use Chaos Engineering?

WHERE TO USE CHAOS ENGINEERING?
• Tier 0: critical services.
• Services with critical functionality.
• Services with critical data.

WHERE TO USE CHAOS ENGINEERING?
• “We’ll test it in prod!”
• Classical testing wants to identify bugs as far away from production as possible.
• Reverse the strategy: Run your chaos experiments as close to production as possible.

The history of Chaos Engineering

THE HISTORY OF CHAOS ENGINEERING
• 2010: Netflix created Chaos Monkey in response to the move from physical infrastructure to AWS.
• 2011: The Simian Army was born and added further monkeys: Conformity Monkey, Doctor Monkey, Janitor Monkey, Security Monkey, 10-18 Monkey, Chaos Gorilla and Latency Monkey.
• 2012: Netflix shared the source code for Chaos Monkey on GitHub.

THE HISTORY OF CHAOS ENGINEERING
• 2014: Failure Injection Testing (FIT) is announced. It gives more granular control over the “blast radius”. Failure-as-a-Service.
• 2017: The Chaos Engineering book is released by O’Reilly Media, written by members of the Netflix team.

What is needed to perform a chaos experiment?

WHAT IS NEEDED TO PERFORM A CHAOS EXPERIMENT?
• Monitoring and observability
• Incident management
• Impact of downtime
• A base level of resilience (worth repeating)

How to perform a chaos experiment?

HOW TO PERFORM A CHAOS EXPERIMENT?
1. Define steady state
2. Plan the experiment
3. Form your hypothesis
4. Contain the blast radius
5. Notify the organization
6. Run your chaos experiment
7. Measure the results
8. Scale up or abort and fix
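
The eight steps can be sketched as a tiny in-memory runner. This is only an illustration under stated assumptions: `Experiment` and `run_experiment` are hypothetical names, the steady-state check stands in for real monitoring, and planning and blast-radius decisions (steps 2 to 4) are assumed to be baked into the `inject` callable.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    hypothesis: str                      # step 3: what we expect to happen
    steady_state_ok: Callable[[], bool]  # step 1: True while metrics look normal
    inject: Callable[[], None]           # step 6: start the failure
    rollback: Callable[[], None]         # the "stop" button: undo the failure


def run_experiment(exp, notify=print):
    """Return 'skipped', 'scale up' or 'abort and fix'."""
    if not exp.steady_state_ok():
        return "skipped"  # never inject into a system that is already unhealthy
    notify(f"Starting chaos experiment: {exp.hypothesis}")  # step 5
    try:
        exp.inject()                  # step 6
        held = exp.steady_state_ok()  # step 7
    finally:
        exp.rollback()                # always undo the injection
    return "scale up" if held else "abort and fix"  # step 8
```

The `finally` block is the important design choice: whatever the measurement shows, the injected failure is removed before the function returns.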

Define steady state

DEFINE STEADY STATE
• The normal behavior of a system over time
• Vital business metrics
• Steady state is not necessarily continuous
• Business metrics are more useful than system metrics for chaos experiments

DEFINE STEADY STATE
• Infrastructure monitoring metrics
  • Resource: CPU, Disk, Memory and IO
  • Network: Latency, DNS, Packet loss
  • State: Processes, Shutdown, Time
• Alerting metrics
  • Total alerts
  • Time to resolution
  • Self-resolving alerts
  • Most frequent alerts

DEFINE STEADY STATE
• SEV metrics
  • Total incidents by SEV level
  • Total SEVs
  • MTTD, MTTR and MTBF for SEVs
• Application metrics
  • Events
  • Error rates
  • Performance counters
• Business metrics
  • Orders per time unit
  • Messages per time unit
  • Number of items placed in cart
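
One possible way to turn a business metric such as "orders per time unit" into a checkable steady state is a tolerance band around historical values. The sketch below is an assumption, not the method from the talk: the three-sigma band, the function names and the sample numbers are all invented for illustration. Because steady state is not necessarily continuous, the history is assumed to come from comparable periods, such as the same hour on earlier days.

```python
from statistics import mean, stdev


def steady_state_band(samples, tolerance_sigmas=3.0):
    """Return (low, high) bounds around the mean of historical samples."""
    centre = mean(samples)
    spread = stdev(samples) * tolerance_sigmas
    return centre - spread, centre + spread


def in_steady_state(value, band):
    """True if the current metric value lies inside the band."""
    low, high = band
    return low <= value <= high


# Made-up orders per minute, taken at the same hour on five earlier days.
history = [118, 121, 119, 122, 120]
band = steady_state_band(history)
```

During an experiment, `in_steady_state(current_orders_per_minute, band)` gives a simple yes or no answer to "are we still in steady state?".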

Plan the experiment

PLAN THE CHAOS EXPERIMENT
• Only include teams and services that want to be included.
• What could go wrong?
• Whiteboard the system, services and dependencies.
• Make sure to have a “stop” button.

Form your hypothesis

FORM YOUR HYPOTHESIS
• Chaos can be injected at any layer of your stack to increase system resilience.
[Diagram of stack layers: API, Application, Caches, Database, Operating system, Hardware, Network, DNS, Availability zone, Region, People]

FORM YOUR HYPOTHESIS
• What if latency increases by 300 ms?
• What if the database becomes read-only?
• What if availability zone B is down?
• What if Slack is down?
• What if an instance terminates?
• What if microservice X isn’t responding?
Don’t perform chaos experiments in production if you know that it will cause damage. Always fix known problems first!
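
A hypothesis is most useful when it predicts something measurable. Taking "What if latency increases by 300 ms?" as the example, the sketch below works through the arithmetic for a request that passes through several hops and has to answer within a client timeout. The hop latencies and the 500 ms timeout are hypothetical numbers chosen for illustration.

```python
def request_succeeds(hop_latencies_ms, extra_ms, timeout_ms):
    """True if the whole call chain still answers within the caller's timeout."""
    return sum(hop_latencies_ms) + extra_ms <= timeout_ms


# Hypothetical call chain: three hops and a 500 ms client timeout.
hops = [40, 120, 90]  # 250 ms in total under normal conditions

# Prediction for an extra 300 ms: 250 + 300 = 550 ms, which exceeds 500 ms.
survives_300_ms = request_succeeds(hops, extra_ms=300, timeout_ms=500)

# Prediction for an extra 200 ms: 250 + 200 = 450 ms, which fits.
survives_200_ms = request_succeeds(hops, extra_ms=200, timeout_ms=500)
```

With these numbers the prediction is that requests time out, which is a known problem. In line with the advice above, that would be fixed first rather than demonstrated in production.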

Contain the blast radius

CONTAIN THE BLAST RADIUS
• Always design the smallest possible experiment.
• By starting small, a failure will most likely not cause an outage or customer pain.
• Remember “run experiments in production”? The closer to production, the more you will learn.
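
One simple way to keep the blast radius small is to target only a small random fraction of hosts, and never fewer than one. The sketch below assumes a plain list of host names; `select_targets` is a hypothetical helper written for this example.

```python
import math
import random


def select_targets(hosts, fraction, rng=None):
    """Pick a random subset of hosts: about `fraction` of them, but at least one."""
    if not hosts:
        return []
    rng = rng or random.Random()
    count = max(1, math.floor(len(hosts) * fraction))
    return rng.sample(hosts, count)


# Hypothetical fleet of 20 web servers.
fleet = [f"web-{n}" for n in range(1, 21)]
targets = select_targets(fleet, fraction=0.05)
```

With 20 hosts and a fraction of 0.05, exactly one host is selected, which matches the "smallest possible experiment" advice.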

Notify the organization

NOTIFY THE ORGANIZATION
• There can never be too much information initially.
• Inform your organization about what you’re doing, why you’re doing it, and when you’re doing it.
• Once confidence has grown, skip “when you’re doing it”.

Run your chaos experiment

RUN YOUR CHAOS EXPERIMENT
• Dare to press the button.
• Watch the metrics!
• Abort if needed.
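
"Watch the metrics" and "abort if needed" can be combined into a small watch loop. The sketch below is a simplified assumption: `read_error_rate` stands in for whatever monitoring system is in use, and all names and thresholds are hypothetical.

```python
def run_with_abort(inject, rollback, read_error_rate, max_error_rate, checks,
                   wait=lambda: None):
    """Inject a failure, poll a metric, and stop early if it crosses the limit.

    Returns True if every check passed, False if the experiment was aborted.
    The failure is rolled back in both cases.
    """
    inject()
    try:
        for _ in range(checks):
            if read_error_rate() > max_error_rate:
                return False  # abort: the finally block still rolls back
            wait()  # e.g. sleep between polls in a real run
        return True
    finally:
        rollback()
```

In a real run, `wait` might be something like `lambda: time.sleep(10)` and `read_error_rate` a query against the monitoring system.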

Measure the results

MEASURE THE RESULTS
• Use the metrics to verify whether your hypothesis is correct.
• Is the system resilient to what you injected?
• Did anything unexpected happen?
• Did you have the proper metrics? You should have.
• Share your progress and success!

Scale up or abort and fix

SCALE UP OR ABORT AND FIX
• With the confidence from smaller-scale experiments, the scope can scale up.
• Increased scope can reveal effects that aren’t noticeable at a smaller scale.
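
Scaling up can be pictured as a ramp: run the same experiment at growing scopes and stop at the first one that fails. The scope values below (1 %, 5 %, 25 %, 100 %) and the `ramp_up` helper are assumptions made for illustration.

```python
def ramp_up(run_at_scope, scopes=(0.01, 0.05, 0.25, 1.0)):
    """Run the experiment at growing scopes and stop at the first failure.

    `run_at_scope(scope)` must return True if the hypothesis held.
    Returns (largest_passed_scope, failed_scope); either may be None.
    """
    passed = None
    for scope in scopes:
        if not run_at_scope(scope):
            return passed, scope  # abort and fix before going further
        passed = scope
    return passed, None
```

A result such as `(0.05, 0.25)` would mean that the system coped with 5 % of targets affected but not with 25 %, an effect that only becomes visible at the larger scope.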

Tools for chaos engineering

TOOLS FOR CHAOS ENGINEERING – CHAOS TOOLKIT & CHAOS HUB
Chaos Toolkit is a project whose mission is to provide a free, open and community-driven toolkit and API to all the various forms of chaos engineering tools that the community needs. The Chaos Toolkit aims to be the simplest and easiest way to explore building, and automating, your own Chaos Engineering Experiments.
Chaos Hub stands on the shoulders of the Chaos Toolkit to provide a complete, user-friendly platform to automate and collaborate on your Chaos Engineering and Resiliency efforts.
• https://chaostoolkit.org/
• https://chaoshub.org/
• https://github.com/chaostoolkit
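
Chaos Toolkit experiments are declared as JSON (or YAML) documents. The sketch below builds a minimal one as a Python dictionary. The top-level keys follow the Chaos Toolkit experiment format as commonly documented, but they should be checked against the current documentation; the title, URL and script path are hypothetical placeholders.

```python
import json

experiment = {
    "version": "1.0.0",
    "title": "Web front end stays available when one instance is stopped",
    "description": "Hypothetical example; URL and script path are placeholders.",
    # Steady state: the front page must answer with HTTP 200.
    "steady-state-hypothesis": {
        "title": "Front page responds with HTTP 200",
        "probes": [
            {
                "type": "probe",
                "name": "front-page-is-up",
                "tolerance": 200,
                "provider": {"type": "http", "url": "http://localhost:8080/"},
            }
        ],
    },
    # Method: the failure to inject, here a placeholder shell script.
    "method": [
        {
            "type": "action",
            "name": "stop-one-web-instance",
            "provider": {"type": "process", "path": "./stop-instance.sh"},
        }
    ],
    "rollbacks": [],
}

experiment_json = json.dumps(experiment, indent=2)
```

Saved to a file such as `experiment.json`, a document like this would typically be run with `chaos run experiment.json`.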

TOOLS FOR CHAOS ENGINEERING – CHAOSPLATFORM
Built on top of the free and open source Chaos Toolkit and Chaos Hub, the Chaos Platform provides an HA, secure, collaborative, observable and customizable production-ready platform for your own chaos engineering experiments.
• https://chaosplatform.com/
• https://chaosiq.io/

TOOLS FOR CHAOS ENGINEERING – GREMLIN
Gremlin provides you the framework to safely, securely, and easily simulate real outages with an ever-growing library of attacks. Downtime is expensive and damages customer trust. Gremlin's Failure as a Service finds weaknesses in your system before they cause problems.
• https://www.gremlin.com/

TOOLS FOR CHAOS ENGINEERING – CHAOS MONKEY
Chaos Monkey randomly terminates virtual machine instances and containers that run inside of your production environment. Exposing engineers to failures more frequently incentivizes them to build resilient services.
• https://github.com/Netflix/chaosmonkey
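
The idea of random termination can be imitated in a few lines of in-memory Python. This is a toy model only, not Chaos Monkey's code or API: `terminate_random_instance` is a hypothetical function, the fleet is just a set of names, and the `min_remaining` guard is an added assumption reflecting the "base level of resilience" requirement.

```python
import random


def terminate_random_instance(fleet, min_remaining=1, rng=None):
    """Remove one random instance from `fleet` (a set of instance names).

    Returns the terminated name, or None if terminating would leave fewer
    than `min_remaining` instances.
    """
    if len(fleet) - 1 < min_remaining:
        return None
    rng = rng or random.Random()
    victim = rng.choice(sorted(fleet))  # sort so a seeded rng is reproducible
    fleet.remove(victim)
    return victim
```

In a real environment the interesting part comes afterwards: observing whether the load balancer and auto scaling replace the lost instance without customers noticing.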

TOOLS FOR CHAOS ENGINEERING – CHAOSLINGR
ChaoSlingr is a Security Chaos Engineering Tool focused primarily on the experimentation on AWS Infrastructure to bring system security weaknesses to the forefront. Security is chaotic and the industry has traditionally put emphasis on the importance of preventative security control measures and defense-in-depth, whereas our mission is to drive new knowledge and perspective into the attack surface by delivering proactively through detective experimentation.
• https://github.com/Optum/ChaoSlingr

TOOLS FOR CHAOS ENGINEERING
Remember that proper implementation of chaos engineering is more important than what tools you use.

Let’s break something!

LET’S BREAK SOMETHING – NETWORK CORRUPTION
A reliable network is vital for most (all) applications. Networking issues cause errors, delays and other problems that affect the service.
• Target: AWS Lightsail three-tier web cluster (LB, web fronts, DB)
• Attack: Network corruption
• Scope: Single instance
• Expected results:

LET’S BREAK SOMETHING – INSTANCE TERMINATION
Instance-based architectures rely, to a greater or lesser degree, on single instances. How well does the architecture handle the loss or termination of an instance?
• Target: AWS WordPress reference architecture
• Attack: Instance termination
• Scope: Single instance
• Expected results:

LET’S BREAK SOMETHING – CACHE UNAVAILABILITY
Using cache services like Memcached and Redis can greatly improve the performance of your application. What happens if the cache is unavailable?
• Target: AWS WordPress reference architecture
• Attack: Cache cluster reboot
• Scope: ElastiCache cluster
• Expected results:

Advanced usage of chaos

ADVANCED USAGE OF CHAOS
• Make chaos experiments a part of your CI/CD.
• Use chaos in blue/green and canary releases.
• Automate!

TAKEAWAYS
• Don’t ask what happens if a system fails, but ask what happens when it fails.
• Prior to starting your chaos experiments it is vital to collect metrics.
• Start with the smallest possible experiment that can cause impact.
• Share your progress and success!

Thanks to Tammy Bütow, Ana Medina, Kolton Andrus, Nora Jones, Adrian Hornsby and others for the inspiration.

RESOURCES
• https://principlesofchaos.org/
• https://www.oreilly.com/ideas/chaos-engineering
• http://chaos.community/
• https://www.gremlin.com/community/tutorials/
• https://github.com/dastergon/awesome-chaos-engineering
• https://www.gremlin.com/blog/

Tynäsgatan 12, 65216 Karlstad
+46 10 252 55 01
[email protected]
www.opsio.se
Gunnar Grosch @gunnargrosch
