Building on System Resilience with Chaos Engineering for Serverless Applications on AWS

RENALDI GONDOSUBROTO

THE AGENDA
1. Chaos on the Cloud
2. Endeavors in the Cloud
3. The Architecture
4. Applying Chaos
5. What We Got Out of It

A Bit About Myself
Renaldi Gondosubroto
Founder and Developer Advocate @ GReS Studio
@Renaldig @renaldigondosubroto
• 12x AWS, 3x Azure, and 2x Google Cloud certified
• Personal fields of interest: penetration testing and accessibility practices
• On the side: running meetups and hackathons, giving tech talks, and VR tech enthusiast

CHAOS ON THE CLOUD
The way we work is changing due to Covid-19, and that includes a broader transition towards serverless applications.

THE MAIN PRINCIPLES OF THE CLOUD
• Resiliency
• Availability
• Adaptability

THE FIRST ENDEAVOR: DEFINING STABILITY
• What is considered normal behavior?
• How do you want to measure success, and which metrics would you like to use?
• How does the normal behavior align with business objectives?
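One way to make "normal behavior" measurable is to write the steady state down as explicit thresholds and check observed metrics against them. The sketch below is only an illustration: the `SteadyState` type, the choice of p99 latency and error rate as metrics, and the threshold values are assumptions, not something taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical thresholds that define 'normal' for one service."""
    max_p99_latency_ms: float
    max_error_rate: float  # fraction of failed requests, 0.0 to 1.0


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def is_steady(latencies_ms, errors, total, steady):
    """Return True when observed metrics stay within the thresholds."""
    error_rate = errors / total if total else 0.0
    return (
        p99(latencies_ms) <= steady.max_p99_latency_ms
        and error_rate <= steady.max_error_rate
    )


# Example thresholds (assumed values, tune them to your business objectives).
baseline = SteadyState(max_p99_latency_ms=500.0, max_error_rate=0.01)
```

Running the same check before, during, and after an experiment gives a simple yes/no answer to whether the system stayed within its definition of normal.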

THE SECOND ENDEAVOR: THE ESTIMATION PROCESS
• Use experience to consider how to solve the problem
• Based on that, think about where chaos can be injected

THE THIRD ENDEAVOR: EXECUTION
• Create a flow diagram of how you plan to execute
• Prepare containment methods
• Notification

THE FOURTH ENDEAVOR: EVALUATE
• Define metrics that measure success
• How do you measure the resiliency needed?
• Continuous evaluation

THE FIFTH ENDEAVOR: REITERATION
• Improve
• Understand the effects

WHAT TO WATCH OUT FOR
• Handling various responses
• Keeping chaos at an acceptable level
• Failover plans

A SIMPLE SAMPLE ARCHITECTURE

CREATING YOUR OWN PLAYGROUND TO EXPERIMENT
• Create a test account
• Experiment with varying parameters

LATENCY INJECTION IN AWS
• Latency spikes are something one should consider early
• API Gateway's hard limit of 29 s for timeouts
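The 29 s figure above can be treated as a budget that every downstream call has to fit into. The following is a minimal sketch under that assumption; the function name `downstream_timeout` and the one-second reserve for building a fallback response are illustrative choices, not part of the talk.

```python
API_GATEWAY_TIMEOUT_S = 29.0  # limit quoted on the slide


def downstream_timeout(elapsed_s, reserve_s=1.0,
                       gateway_timeout_s=API_GATEWAY_TIMEOUT_S):
    """Seconds a downstream call may still take before the gateway gives up.

    elapsed_s: time already spent handling the request.
    reserve_s: assumed headroom kept back for returning a degraded response.
    """
    remaining = gateway_timeout_s - elapsed_s - reserve_s
    return max(0.0, remaining)
```

For example, a function that has already spent 3 s can give its next call at most `downstream_timeout(3.0)` seconds; any injected latency larger than that should trigger the fallback path rather than a gateway timeout.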

LATENCY INJECTION IN AWS CONT.
• Inject within the HTTP client library
• Test the function's timeout and confirm that it degrades gracefully within the request's timeout
• Results depend on the other services in use
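Injecting latency at the HTTP client can be sketched as a decorator around whatever function issues the request. This is an assumed illustration using only the Python standard library; `inject_latency`, its parameters, and the `fetch` helper are hypothetical names, and the slides do not say which client or tool was used in the talk.

```python
import functools
import random
import time
import urllib.request


def inject_latency(delay_s, rate=1.0, sleep=time.sleep, rng=random.random):
    """Delay a fraction `rate` of calls by `delay_s` seconds before they run.

    `sleep` and `rng` are injectable so experiments can be tested quickly.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng() < rate:
                sleep(delay_s)  # simulated network latency spike
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(delay_s=2.0, rate=0.5)
def fetch(url, timeout_s=3.0):
    """Hypothetical HTTP call made by the function under test."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.read()
```

With half of the calls delayed by 2 s, the experiment shows whether the caller's own timeout and fallback logic behave as expected when a dependency slows down.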

TWO SAMPLES OF APPLYING CHAOS
• DynamoDB
• Lambda
• Can be done anywhere in AWS, not only in those two

APPLYING CHAOS IN DYNAMODB
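The slide content for this step is not in the transcript, so the following is only a guess at what such an experiment could look like: a thin wrapper that makes a share of table reads and writes fail before they reach DynamoDB. `ChaosTable`, `InjectedThrottle`, and `InMemoryTable` are hypothetical names; the in-memory table is a stand-in so the sketch runs without AWS credentials, and in a real test account a boto3 `Table` resource could be wrapped in its place, since it exposes `get_item(Key=...)` and `put_item(Item=...)`.

```python
import random


class InjectedThrottle(Exception):
    """Stand-in for a throttling error; not a real botocore exception."""


class ChaosTable:
    """Wraps a table-like object and fails a fraction of its calls."""

    def __init__(self, table, failure_rate, rng=random.random):
        self._table = table
        self._failure_rate = failure_rate
        self._rng = rng

    def _maybe_fail(self, operation):
        if self._rng() < self._failure_rate:
            raise InjectedThrottle(f"injected failure on {operation}")

    def get_item(self, **kwargs):
        self._maybe_fail("get_item")
        return self._table.get_item(**kwargs)

    def put_item(self, **kwargs):
        self._maybe_fail("put_item")
        return self._table.put_item(**kwargs)


class InMemoryTable:
    """Tiny fake table keyed on an assumed 'id' attribute."""

    def __init__(self):
        self._items = {}

    def put_item(self, Item):
        self._items[Item["id"]] = Item
        return {}

    def get_item(self, Key):
        item = self._items.get(Key["id"])
        return {"Item": item} if item is not None else {}
```

The code under test then receives a `ChaosTable` instead of the plain table, which reveals whether its retry and fallback handling copes with intermittent database errors.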

APPLYING CHAOS IN LAMBDA
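The slide content for this step is also missing from the transcript. As an assumed illustration, failures can be injected into a Lambda function by wrapping its handler, with environment variables acting as a kill switch so the experiment stays contained. The decorator `chaos_handler` and the variable names `CHAOS_ENABLED` and `CHAOS_ERROR_RATE` are hypothetical.

```python
import functools
import os
import random


def chaos_handler(rng=random.random):
    """Return a decorator that makes a share of invocations fail."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            # Read the settings on every call so the experiment can be
            # switched off without redeploying the function code.
            enabled = os.environ.get("CHAOS_ENABLED", "false").lower() == "true"
            rate = float(os.environ.get("CHAOS_ERROR_RATE", "0"))
            if enabled and rng() < rate:
                return {"statusCode": 500, "body": "injected failure"}
            return handler(event, context)
        return wrapper
    return decorator


@chaos_handler()
def handler(event, context):
    """Example handler returning an API Gateway style response."""
    return {"statusCode": 200, "body": "ok"}
```

Setting `CHAOS_ENABLED` back to `false` stops the experiment immediately, which is one simple form of the containment method mentioned in the execution step.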

CREATING A POST-INCIDENT STRATEGY
• What happened?
• How did this affect your service?
• How can you control the chaos better next time?
• What metrics need to be reevaluated?
• What are you already doing now to fix it?

CREATING A POST-INCIDENT STRATEGY CONT.
• Create a weekly inspection schedule for the chaos engineering flow
• Focus on verifying that every item is still working well
• Discussions in an agile-scrum culture

A CASE STUDY
• Building a resilient system for an IoT network
• Considering client constraints of low bandwidth in some areas
• How to best utilize the cloud platform to facilitate it

A CASE STUDY: WHAT WE DID
• Selected tests to run within Lambda
• Tested by putting high pressure on bandwidth in specific locations
• Scheduled testing accordingly

THE NEED FOR AGILE IN CHAOS ENGINEERING
• Agile is all about iterative development
• Chaos engineering needs fast reactions
• Prioritizing disaster recovery

WHAT WE GOT OUT OF IT
• Agility
• Satisfaction
• Culture

THANK YOU
RENALDI GONDOSUBROTO
