Building Resilient Applications using Chaos Engineering on AWS

Architectures have grown increasingly complex and hard to understand. Worse, the software systems running on them have become extremely difficult to debug and test, which increases the risk of outages. These new challenges call for new tools, and because failures have become more and more chaotic in nature, we must turn to chaos engineering to reveal failures before they become outages. In this talk, I will cover chaos engineering, a discipline that promotes breaking things on purpose in order to learn how to build more robust systems. I will demo the tools and methods used to inject failures in order to make systems more resilient.

Adrian Hornsby

January 30, 2020

Transcript

  1. Building Resilient Applications using Chaos Engineering on AWS. Adrian Hornsby, Principal Technical Evangelist, Amazon Web Services. © 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. Can you afford to lose $100,000?

  3. Because that is the average cost of one hour of downtime reported by an ITIC study this year. Source: Information Technology Intelligence Consulting Research
  4. None
  5. Jesse Robbins, “Master of Disaster”: GameDay at Amazon
    • A volunteer firefighter
    • Created GameDay in 2006 to purposefully create regular major failures
    • Founded Chef and the Velocity Web Performance & Operations Conference
  6. Rise of the monkeys
    “Simian Army to keep our cloud safe, secure, and highly available.” - 2011 Netflix blog
    Set of scheduled agents that:
    • shut down services randomly
    • slow down performance
    • check conformity
    • break an entire region
    • integrate with Spinnaker (CI/CD)
    https://github.com/Netflix/SimianArmy
  7. Chaos Engineering formalized by Netflix (mid-2015): principlesofchaos.org

  8. Chaos engineering is NOT about breaking things randomly without a purpose. Chaos engineering is about breaking things in a controlled environment, through well-planned experiments, in order to build confidence in your application's ability to withstand turbulent conditions.
  9. Chaos Engineering, a scientific method: steady state, hypothesis, run experiment, verify, improve
  10. Break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.
  11. Chaos Engineering
  12. Operations Infrastructure Application Software

  13. https://aws.amazon.com/wellarchitected

  14. More Information https://aws.amazon.com/builders-library

  15. Chaos Engineering
  16. Phases of Chaos Engineering: steady state, hypothesis, run experiment, verify, improve
  17. What is steady state? • “Normal” behavior of your system
    https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
  18. What is steady state? • Business + Ops Metric https://medium.com/netflix-techblog/
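
A steady state is easier to test against once it is written down as a tolerance band around a metric. The sketch below shows one possible way to do that. The function name, the 10% tolerance, and the idea of comparing sample means are assumptions made for illustration, not something taken from the talk.

```python
from statistics import mean


def within_steady_state(baseline, observed, tolerance=0.10):
    """Return True if the observed mean stays within +/- tolerance of the baseline mean.

    baseline and observed are sequences of samples of one metric
    (hypothetical example: orders per minute).
    """
    if not baseline or not observed:
        raise ValueError("both baseline and observed need at least one sample")
    expected = mean(baseline)
    if expected == 0:
        raise ValueError("baseline mean is zero; a relative tolerance is undefined")
    # Relative deviation of the observed mean from the baseline mean.
    deviation = abs(mean(observed) - expected) / abs(expected)
    return deviation <= tolerance
```

In practice the samples would come from the monitoring system, and a business metric is usually paired with an operational one, as the slide suggests.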

  19. Phases of Chaos Engineering: steady state, hypothesis, run experiment, verify, improve
  20. What if…? Make it everyone’s problem!
    • “What if this load balancer breaks?”
    • “What if Redis becomes slow?”
    • “What if a host on Cassandra goes away?”
    • “What if latency increases by 300ms?”
    • “What if the database stops?”
  21. None
  22. Phases of Chaos Engineering: steady state, hypothesis, run experiment, verify, improve
  23. Failure injection
    • Start small & build confidence
    • Application level (exceptions, errors, …)
    • Host level (services, processes, …)
    • Resource attacks (CPU, memory, IO, …)
    • Network attacks (dependencies, latency, packet loss, …)
    • AZ attack
    • Region attack
    • People attack
  24. Rules of thumb
    • Start very small
    • Run as close as possible to production
    • Minimize the blast radius
    • Have an emergency STOP!
    • Be careful with state that can’t be rolled back (corrupt or incorrect data)
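
The emergency stop can be sketched as a small experiment runner that aborts as soon as the steady state is violated and always undoes the fault. This is an illustrative sketch only: `run_experiment` and its three callables are hypothetical names, and `rollback` is assumed to be safe to call even if `inject` failed part-way.

```python
import time


def run_experiment(steady_state_ok, inject, rollback, observe_rounds=5, interval=0.0):
    """Run one experiment and always roll the fault back.

    steady_state_ok: callable returning True while the system looks healthy.
    inject / rollback: callables that start and undo the fault.
    Returns "skipped", "aborted", or "passed".
    """
    # Never start an experiment on a system that is already unhealthy.
    if not steady_state_ok():
        return "skipped"
    try:
        inject()
        for _ in range(observe_rounds):
            time.sleep(interval)
            # Emergency stop: abort as soon as the steady state is violated.
            if not steady_state_ok():
                return "aborted"
        return "passed"
    finally:
        # Runs on every exit path after injection started, including exceptions.
        rollback()
```

Putting the rollback in `finally` means an exception in the observation loop cannot leave the fault in place.
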
  25. Blast radius
    • How many customers?
    • What functionality?
    • How many locations?
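
One simple way to bound the blast radius is to target only a fraction of the fleet, with a hard cap on top. The sketch below assumes a plain list of host names; `pick_targets` and its parameters are hypothetical and only illustrate the idea.

```python
import random


def pick_targets(hosts, fraction, max_targets=None, seed=None):
    """Pick a random subset of hosts to limit the blast radius.

    fraction: share of hosts to target, between 0 and 1.
    max_targets: optional hard cap applied after the fraction.
    At least one host is chosen when hosts is non-empty and fraction > 0,
    unless max_targets is 0.
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    if not hosts or fraction == 0:
        return []
    count = max(1, int(len(hosts) * fraction))
    if max_targets is not None:
        count = min(count, max_targets)
    # A seeded generator makes the selection reproducible for a postmortem.
    return random.Random(seed).sample(list(hosts), count)
```

Starting with a small fraction and raising it across runs matches the "start very small" rule of thumb.
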
  26. Phases of Chaos Engineering: steady state, hypothesis, run experiment, verify, improve
  27. Quantifying the result of the experiment
    • Time to detect?
    • Time for notification? And escalation?
    • Time to public notification?
    • Time for graceful degradation to kick in?
    • Time for self-healing to happen?
    • Time to recovery – partial and full?
    • Time to all-clear and stable?
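
These questions become numbers once each event in the experiment is timestamped. The sketch below assumes a hypothetical timeline dictionary with ISO 8601 timestamps, where `injected` marks the start of the fault; the event names are illustrative.

```python
from datetime import datetime


def experiment_timings(timeline):
    """Turn a timeline of ISO 8601 timestamps into durations in seconds.

    timeline maps event names to timestamps; 'injected' is required,
    and every other event is measured relative to it.
    """
    start = datetime.fromisoformat(timeline["injected"])
    return {
        f"time_to_{event}": (datetime.fromisoformat(stamp) - start).total_seconds()
        for event, stamp in timeline.items()
        if event != "injected"
    }
```

Tracking the same durations across repeated experiments shows whether detection and recovery are actually improving.
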
  28. Postmortems – COE (Correction of Errors)
    • What happened?
    • What was the impact on customers and your business?
    • What were the contributing factors?
    • What data do you have to support this?
      • Especially metrics and graphs
    • What lessons did you learn?
    • What corrective actions are you taking?
      • Action items
      • Related items (trouble tickets etc.)
  29. Tools Processes Culture Technology

  30. Two rules to remember ALWAYS!

  31. DON’T blame that one person …

  32. There is no isolated ‘cause’ of an accident.

  33. Phases of Chaos Engineering: steady state, hypothesis, run experiment, verify, improve
  34. Fix it!

  35. Monkeys
  36. Start simple and local!! $ docker stop 94a214bbeebd

  37. DDoS yourself $ wrk -t12 -c400 -d30s http://127.0.0.1/api/health

  38. Burn CPU with stress(-ng)
    $ stress-ng --cpu 0 --cpu-method matrixprod -t 60s
    https://kernel.ubuntu.com/~cking/stress-ng/
  39. Adding latency to the network
    $ tc qdisc add dev eth0 root netem delay 300ms
  40. Block DNS resolution
    $ iptables -A INPUT -p tcp -m tcp --dport 53 -j DROP
  41. Other fun things to do
    • Fill up disk
    • Network packet loss (using traffic-shaping)
    • Network packet corruption (using traffic-shaping)
    • Kill random processes
    • Detach (force) all EBS volumes
    • Mess with /etc/hosts
  42. Simian Army
    Set of scheduled agents that:
    • shut down services randomly
    • slow down performance
    • check conformity
    • break an entire region
    • integrate with Spinnaker (CI/CD)
    https://github.com/Netflix/SimianArmy
  43. The Chaos Toolkit
    • Simplifying Adoption of Chaos Engineering
    • An Open API to Chaos Engineering
    • Open source extensions for
      • Infrastructure/Platform Fault Injections
      • Application Fault Injections
      • Observability
    • Integrates easily into CI/CD pipelines
  44. None
  45. None
  46. None
  47. None
  48. None
  49. ToxiProxy
    • HTTP API
    • Built with automated testing in mind
    • Not for production environments
    • Fast
    • Toxics for timeouts, latency, connections, bandwidth limitation, etc.
    • CLI
    • Stable and well tested (used for 3 years at Shopify)
    • Open source: https://github.com/Shopify/toxiproxy
  50. https://atscaleconference.com/videos/resiliency-testing-with-toxiproxy/

  51. Pumba https://github.com/alexei-led/pumba/

  52. Fault Injection Queries for Amazon Aurora
    SQL commands issued to simulate:
    • A crash of the master instance or an Aurora Replica
    • A failure of an Aurora Replica
    • A disk failure
    • Disk congestion
    https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Managing.FaultInjectionQueries.html
  53. Fault Injection Queries for Amazon Aurora
    SQL commands issued to simulate:
    • A crash of the master instance or an Aurora Replica
    • A failure of an Aurora Replica
    • A disk failure
    • Disk congestion
    ALTER SYSTEM SIMULATE percentage_of_failure PERCENT DISK FAILURE [ IN DISK index | NODE index ] FOR INTERVAL quantity { YEAR | QUARTER | MONTH | WEEK | DAY | HOUR | MINUTE | SECOND };
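
To make the syntax on the slide concrete, the sketch below fills in the placeholders and validates them before producing a statement. It is an illustration only: `disk_failure_query` is a hypothetical helper, it covers just the `IN DISK index` option and the unscoped form, and it only builds a string rather than talking to a database.

```python
UNITS = {"YEAR", "QUARTER", "MONTH", "WEEK", "DAY", "HOUR", "MINUTE", "SECOND"}


def disk_failure_query(percent, quantity, unit, disk=None):
    """Build an ALTER SYSTEM SIMULATE ... DISK FAILURE statement.

    percent: integer percentage of failure, 0 to 100.
    quantity / unit: length of the simulated failure, e.g. 1 and "MINUTE".
    disk: optional disk index for the IN DISK clause.
    """
    if not isinstance(percent, int) or not 0 <= percent <= 100:
        raise ValueError("percent must be an integer between 0 and 100")
    if not isinstance(quantity, int) or quantity <= 0:
        raise ValueError("quantity must be a positive integer")
    unit = unit.upper()
    if unit not in UNITS:
        raise ValueError(f"unit must be one of {sorted(UNITS)}")
    # int() keeps arbitrary text out of the generated statement.
    scope = f" IN DISK {int(disk)}" if disk is not None else ""
    return (
        f"ALTER SYSTEM SIMULATE {percent} PERCENT DISK FAILURE"
        f"{scope} FOR INTERVAL {quantity} {unit};"
    )
```

A short interval and a low percentage keep the first run of such a query small, in line with the earlier rules of thumb.
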
  54. APIs :-)

  55. $ aws lambda put-function-concurrency --function-name <value> --reserved-concurrent-executions 0
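
Setting reserved concurrency to 0 throttles every invocation, so it helps to restore the previous setting automatically. The sketch below wraps that in a context manager. It assumes a client object that exposes `get_function_concurrency`, `put_function_concurrency` and `delete_function_concurrency` in the way boto3's Lambda client does; `throttled` itself is a hypothetical helper name.

```python
from contextlib import contextmanager


@contextmanager
def throttled(lambda_client, function_name):
    """Set reserved concurrency to 0 inside the block, then restore the old setting.

    lambda_client is assumed to behave like boto3.client("lambda").
    """
    # The key is absent from the response when no reserved concurrency is set.
    previous = lambda_client.get_function_concurrency(
        FunctionName=function_name
    ).get("ReservedConcurrentExecutions")
    lambda_client.put_function_concurrency(
        FunctionName=function_name, ReservedConcurrentExecutions=0
    )
    try:
        yield
    finally:
        if previous is None:
            lambda_client.delete_function_concurrency(FunctionName=function_name)
        else:
            lambda_client.put_function_concurrency(
                FunctionName=function_name, ReservedConcurrentExecutions=previous
            )
```

Usage would look like `with throttled(client, "my-function"): ...`, where the function name is a placeholder and the observation of the steady state happens inside the block.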

  56. Injecting Chaos to AWS Lambda $ pip install chaos-lambda
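
The general idea behind injecting chaos into a function can be sketched with a plain decorator that delays the handler. This is not the chaos-lambda API: the decorator name and the `CHAOS_DELAY_MS` and `CHAOS_RATE` environment variables are hypothetical, chosen only to show the mechanism.

```python
import functools
import os
import random
import time


def inject_latency(handler):
    """Delay a handler by CHAOS_DELAY_MS milliseconds at a rate of CHAOS_RATE (0 to 1).

    Both environment variables are hypothetical names for this sketch;
    with neither set, the handler runs untouched.
    """
    @functools.wraps(handler)
    def wrapper(event, context):
        # Read the configuration on every call so it can change without a redeploy.
        delay_ms = float(os.environ.get("CHAOS_DELAY_MS", "0"))
        rate = float(os.environ.get("CHAOS_RATE", "0"))
        if delay_ms > 0 and random.random() < rate:
            time.sleep(delay_ms / 1000)
        return handler(event, context)
    return wrapper
```

Because the rate defaults to 0, the fault stays off until someone deliberately turns it on, which doubles as an emergency stop.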

  57. https://medium.com/@adhorn/injecting-chaos-to-aws-lambda-functions-using-lambda-layers-2963f996e0ba Injecting Chaos to AWS Lambda functions using Lambda Layers

  58. https://github.com/adhorn/aws-lambda-chaos-injection

  59. https://github.com/gunnargrosch/failure-lambda

  60. Injecting Chaos to Amazon EC2 using AWS Systems Manager https://medium.com/@adhorn/injecting-chaos-to-amazon-ec2-using-amazon-system-manager-ca95ee7878f5

  61. SSM Run (send) Command
    $ aws ssm send-command --document-name "cpu-stress" --document-version "1" --targets '[{"Key":"InstanceIds","Values":["i-094c8367024633d96","i-04d0976f9fb658c23"]}]' --parameters '{"duration":["60"],"cpu":["0"]}' --timeout-seconds 600 --max-concurrency "50" --max-errors "0" --output-s3-bucket-name "adhorn-chaos-ssm-output" --region eu-west-1
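
The JSON arguments in this command are easy to break with a stray space or quote. One way around that is to build the argument list in code and let `json.dumps` produce the `--targets` and `--parameters` values. The helper below is a hypothetical sketch that only assembles the arguments, using a subset of the flags from the slide; it does not run the command.

```python
import json


def ssm_send_command_args(document, instance_ids, parameters, region, timeout_seconds=600):
    """Build an argument list for `aws ssm send-command`.

    The list form can be handed to subprocess.run as-is, which avoids
    shell quoting entirely.
    """
    targets = [{"Key": "InstanceIds", "Values": list(instance_ids)}]
    return [
        "aws", "ssm", "send-command",
        "--document-name", document,
        "--targets", json.dumps(targets),
        "--parameters", json.dumps(parameters),
        "--timeout-seconds", str(timeout_seconds),
        "--region", region,
    ]
```

Passing the result to `subprocess.run(args, check=True)` would execute it, assuming the AWS CLI is installed and configured.
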
  62. https://github.com/adhorn/chaos-ssm-documents

  63. https://github.com/adhorn/aws-chaos-scripts

  64. Chaos Engineering
  65. Big challenges to chaos engineering
    • Chaos Engineering won’t make your system more robust, people will.
    • Chaos Engineering won’t replace all the rest (test, quality, …)
    • Chaos Engineering is NOT the only way to learn from failure
    • Rollbacks are HARD because of state.
    • Your systems will continue to fail, sorry.
    • Starting is perceived as hard!
  66. Big challenges to chaos engineering: mostly cultural
    • No time or flexibility to simulate disasters.
    • Teams already spending all of their time fixing things.
    • Can be very political.
    • Might force deep conversations.
    • Deeply invested in a specific technical roadmap (micro-services) that chaos engineering tests show is not as resilient to failures as originally predicted.
  67. Tools Processes Culture Technology

  68. Thank you! Adrian Hornsby https://medium.com/@adhorn adhorn