

Building on System Resilience with Chaos Engineering for Serverless Applications on AWS

Slide deck for Chaos Carnival 2021 on 10 February, 2021

Renaldi Gondosubroto

February 10, 2021

Transcript

  1. THE AGENDA: (1) Chaos on the Cloud, (2) Endeavours in the Cloud, (3) The Architecture, (4) Applying Chaos, (5) What We Got Out of It
  2. A Bit About Myself
    • 12x AWS, 3x Azure and 2x Google Cloud Certified
    • Personal fields of interest are penetration testing and accessibility practices
    • On the side: running meetups and hackathons, giving tech talks, and VR tech enthusiast
    Renaldi Gondosubroto, Founder and Developer Advocate @ GReS Studio (@Renaldig, @renaldigondosubroto)
  3. CHAOS ON THE CLOUD
    The way we work is changing due to Covid-19, including a broader shift towards serverless applications.
  4. THE FIRST ENDEAVOR: DEFINING STABILITY
    • What is considered normal behavior?
    • How do you want to measure success, and which metrics would you like to use?
    • How does the normal behavior align with business objectives?
  5. THE SECOND ENDEAVOR: THE ESTIMATION PROCESS
    • Use experience to consider how to solve the problem
    • Based on that, think about where chaos can be injected
  6. THE THIRD ENDEAVOR: EXECUTION
    • Create a flow diagram of how you plan to execute
    • Prepare containment methods
    • Notification
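
One way to picture the containment and notification steps on slide 6 is an experiment runner with a stop condition. This is only a sketch under assumptions: the `Experiment` class, the `run` helper, and the error-rate threshold are hypothetical names chosen for illustration, not something shown in the talk.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Experiment:
    """A chaos experiment with a stop condition (hypothetical structure)."""
    max_error_rate: float            # abort threshold, e.g. 0.5 means 50 %
    notify: Callable[[str], None]    # assumed hook, e.g. a chat or pager message
    results: List[bool] = field(default_factory=list)
    aborted: bool = False

    def error_rate(self) -> float:
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def record(self, success: bool) -> None:
        """Store one outcome; abort and notify once the threshold is crossed."""
        self.results.append(success)
        if not self.aborted and self.error_rate() > self.max_error_rate:
            self.aborted = True
            self.notify(f"Experiment aborted: error rate {self.error_rate():.0%}")


def run(experiment: Experiment, steps: List[Callable[[], None]]) -> Experiment:
    """Run steps in order, stopping early once the experiment is aborted."""
    for step in steps:
        if experiment.aborted:
            break  # containment: no further chaos after the stop condition
        try:
            step()
            experiment.record(True)
        except Exception:
            experiment.record(False)
    return experiment
```

Passing a list's `append` method as `notify` is enough to try this locally; in a real setup the hook would forward the message to whichever alerting channel the team already uses.
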
  7. THE FOURTH ENDEAVOR: EVALUATE
    • Define metrics that measure success
    • How do you measure the resiliency needed?
    • Continuous evaluation
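
The evaluation step on slide 7 could be sketched as a check of measured results against thresholds agreed beforehand. The choice of p95 latency and error rate as metrics, the threshold values, and the function names are assumptions made for illustration; the talk does not name specific metrics.

```python
import math
from typing import Dict, List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list, with pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate(latencies_ms: List[float], errors: int, total: int,
             max_p95_ms: float, max_error_rate: float) -> Dict[str, object]:
    """Compare one experiment run against the agreed steady-state thresholds."""
    p95 = percentile(latencies_ms, 95)
    error_rate = errors / total if total else 0.0
    return {
        "p95_ms": p95,
        "error_rate": error_rate,
        "ok": p95 <= max_p95_ms and error_rate <= max_error_rate,
    }
```

Running the same check after every experiment, with the same thresholds, is one simple way to make the evaluation continuous rather than a one-off.
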
  8. CREATING YOUR OWN PLAYGROUND TO EXPERIMENT
    • Create a test account
    • Experiment with varying parameters
  9. LATENCY INJECTION IN AWS
    • Latency spikes are something one should consider early
    • API Gateway’s hard limit of 29 s for timeouts
  10. LATENCY INJECTION IN AWS CONT.
    • Inject within the HTTP client library
    • Test the function’s timeout and that it can degrade accordingly within the request’s timeout
    • Depends on the other services in use
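
The idea on slides 9 and 10, delaying calls at the client and checking that the caller degrades within its timeout, can be sketched with the standard library alone. Everything named here is an assumption for illustration: `inject_latency`, `call_with_fallback`, and `fetch_reading` are hypothetical, and `fetch_reading` merely stands in for a real HTTP request.

```python
import concurrent.futures
import random
import time


def inject_latency(call, delay_s, rate=1.0, rng=random.random, sleep=time.sleep):
    """Wrap `call` so that a fraction `rate` of invocations is delayed by `delay_s`."""
    def wrapped(*args, **kwargs):
        if rng() < rate:
            sleep(delay_s)  # the injected latency
        return call(*args, **kwargs)
    return wrapped


def call_with_fallback(call, timeout_s, fallback):
    """Run `call`; return `fallback` if it does not finish within `timeout_s`."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(call)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback  # degraded answer instead of a hung request
    finally:
        pool.shutdown(wait=False)  # do not block on the slow call


def fetch_reading():
    """Hypothetical stand-in for a downstream HTTP request."""
    return {"status": "ok"}
```

For example, `call_with_fallback(inject_latency(fetch_reading, delay_s=0.3), 0.05, {"status": "degraded"})` exercises the degraded path. In a real function the timeout would be chosen so that the whole request still fits under the API Gateway limit mentioned on slide 9.
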
  11. TWO SAMPLES OF APPLYING CHAOS
    • DynamoDB
    • Lambda
    • Can be done anywhere in AWS, not only in those two
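
A failure-injection experiment for the Lambda and DynamoDB samples on slide 11 might look roughly like the following. This is a hedged sketch, not the code from the talk: the `CHAOS_FAILURE_RATE` environment variable, the `inject_fault` decorator, and `read_device` are hypothetical, and `read_device` only simulates a table read so the example runs without AWS access.

```python
import os
import random
from functools import wraps


class InjectedFault(Exception):
    """Raised on purpose by the chaos wrapper."""


def inject_fault(rate_env="CHAOS_FAILURE_RATE", rng=random.random):
    """Fail a fraction of calls; the rate is read from the environment on each call."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            rate = float(os.environ.get(rate_env, "0"))  # default: no chaos
            if rng() < rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_fault()
def read_device(device_id):
    """Hypothetical stand-in for a DynamoDB read."""
    return {"device_id": device_id, "state": "online"}


def handler(event, context):
    """Lambda-style handler that degrades to a default answer on failure."""
    try:
        return {"statusCode": 200, "body": read_device(event["device_id"])}
    except InjectedFault:
        fallback = {"device_id": event["device_id"], "state": "unknown"}
        return {"statusCode": 503, "body": fallback}
```

Reading the rate from an environment variable keeps the experiment switchable without a redeploy of the code itself, which also makes it easy to turn the chaos off as a containment measure.
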
  12. CREATING A POST-INCIDENT STRATEGY
    • What happened?
    • How did this affect your service?
    • How can you control the chaos better next time?
    • What metrics need to be reevaluated?
    • What are you already doing now to fix it?
  13. CREATING A POST-INCIDENT STRATEGY CONT.
    • Create a weekly inspection schedule for the chaos engineering flow
    • Focus on verifying that every item is still working well
    • Discussions in an agile-scrum culture
  14. A CASE STUDY
    • Building a resilient system for an IoT network
    • Considering client constraints of low bandwidth in some areas
    • How to best utilize the cloud platform to facilitate it
  15. A CASE STUDY: WHAT WE DID
    • Selection of testing within Lambda
    • Testing that puts high pressure on bandwidth in locations
    • Schedule testing accordingly
  16. THE NEED FOR AGILE IN CHAOS ENGINEERING
    • Agile is all about iterative development
    • Chaos engineering needs fast reactions
    • Prioritizing disaster recovery