Agenda
What is chaos engineering?
Motivations behind chaos engineering
Running chaos experiments
Challenges with serverless
Serverless chaos experiments
About me
Evangelist and cofounder at Opsio
Background in development, operations, and management
Organizer of user groups and conferences
Advocate for serverless and chaos engineering
Father of three
Motivations behind chaos engineering
Are your customers getting the experience they should?
Are downtime or issues costing you money?
Are you confident in your monitoring and alerting?
Is your organization ready to handle outages?
Are you learning from incidents?
Step 1: Define steady state
The normal behavior of a system over time
Business metrics are usually more useful
Steady state is not necessarily continuous
System metrics and business metrics
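A minimal sketch of a steady-state check, assuming the steady state is described as a tolerance band around historical samples of one business metric. The metric name, the sample values, and the two-standard-deviation tolerance are illustrative assumptions, not part of the talk.

```python
from statistics import mean, pstdev


def steady_state_band(samples, tolerance=2.0):
    """Return (low, high) bounds around the mean of historical samples."""
    centre = mean(samples)
    spread = pstdev(samples)
    return centre - tolerance * spread, centre + tolerance * spread


def within_steady_state(value, band):
    """True if the observed value falls inside the steady-state band."""
    low, high = band
    return low <= value <= high


# Hypothetical business metric: orders per minute, sampled at the same hour
# on previous days, because steady state is not necessarily continuous.
orders_per_minute = [118, 121, 125, 119, 122]
band = steady_state_band(orders_per_minute)
```

Comparing like with like (same hour, same weekday) keeps normal daily variation from being mistaken for a deviation.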
Step 2: Form your hypothesis
Use what ifs to find it
Scientific “If… then…” method
Always fix known problems first
Chaos can be injected at any layer of the stack
Step 3: Plan and run your experiment
Whiteboard the experiment in detail
Notify the organization
Have a “stop” button ready
Contain the blast radius
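One way the “stop” button and the blast radius could be expressed in code is a small experiment configuration that every invocation consults before injecting anything. This is a sketch under assumptions: `ExperimentConfig`, `should_inject`, and the 10% default rate are hypothetical names and values chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class ExperimentConfig:
    enabled: bool = False  # the "stop" button: set to False to halt injection
    rate: float = 0.1      # blast radius: fraction of invocations affected


def should_inject(config, rng=random.random):
    """Decide whether this invocation takes part in the experiment."""
    if not config.enabled:
        return False
    # rng() returns a float in [0.0, 1.0), so rate=1.0 affects every invocation
    return rng() < config.rate
```

Keeping the configuration outside the deployed code (for example in a parameter store) means the experiment can be stopped without a redeploy.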
Step 4: Measure and learn
Use metrics to prove or disprove the hypothesis
Did anything unexpected happen?
Share your progress and success
Was the system resilient to the injected failure?
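A rough sketch of how metrics could be used to prove or disprove a hypothesis: compare what was observed during the experiment with the baseline and list every metric that moved more than an allowed fraction. The metric names, the values, and the 5% threshold are assumptions made for the example.

```python
def deviating_metrics(baseline, observed, max_relative_change=0.05):
    """Return the names of metrics that moved more than the allowed fraction."""
    return [
        name
        for name, expected in baseline.items()
        if abs(observed[name] - expected) > max_relative_change * abs(expected)
    ]


# Hypothetical measurements before and during an experiment.
baseline = {"orders_per_minute": 120.0, "p95_latency_ms": 250.0}
during_experiment = {"orders_per_minute": 118.0, "p95_latency_ms": 410.0}

deviations = deviating_metrics(baseline, during_experiment)
```

An empty list supports the hypothesis; any name in the list is a lead worth investigating.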
Step 5: Scale up or abort and fix
Use the learnings to improve
Increased scope can reveal new effects
With confidence you can scale up
Challenges with serverless
“Serverless allows you to build and run applications and services without thinking about servers”
Challenges with serverless
“without thinking about servers”
Challenges with serverless
No servers to manage
Less heavy lifting
Lots of services to choose from
Per function and service configuration
More granular architectures
Serverless chaos experiments
Inject errors into your code
Remove downstream services
Alter the concurrency of functions
Restrict the capacity of tables
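Two of the experiments above, altering function concurrency and restricting table capacity, could be sketched with boto3-style clients passed in as arguments. The helper names are hypothetical. The sketch assumes `lambda_client` is `boto3.client("lambda")`, `dynamodb_client` is `boto3.client("dynamodb")`, and the table uses provisioned capacity mode.

```python
def throttle_function(lambda_client, function_name, concurrency=1):
    """Limit a function's reserved concurrency for the duration of an experiment."""
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=concurrency,
    )


def restore_function(lambda_client, function_name):
    """Remove the reserved concurrency limit again.

    Assumes the function had no reserved concurrency before the experiment;
    otherwise put the previous value back instead of deleting the setting.
    """
    lambda_client.delete_function_concurrency(FunctionName=function_name)


def restrict_table(dynamodb_client, table_name, read_units=1, write_units=1):
    """Lower a provisioned table's capacity to provoke throttling."""
    dynamodb_client.update_table(
        TableName=table_name,
        ProvisionedThroughput={
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    )
```

Passing the clients in keeps the helpers easy to exercise against a stub before pointing them at a real account, and pairing every change with a restore step helps contain the blast radius.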
[Diagram: Client, Amazon Simple Storage Service (Amazon S3), Amazon API Gateway, AWS Lambda, Amazon DynamoDB]
Serverless chaos experiments
Security policy errors
CORS configuration errors
Service configuration errors
Function disk space failure
[Diagram: Client, Amazon S3, Amazon API Gateway, AWS Lambda, Amazon DynamoDB]
Serverless chaos experiments
Add latency to your functions
• Cold starts
• Cloud provider issues
• Runtime or code issues
• Integration issues
• Timeouts
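Adding latency to a function could look roughly like the decorator below, which delays a Lambda-style handler before it runs. The decorator name, its parameters, and the 200 ms value are illustrative assumptions, not a specific library's API.

```python
import functools
import time


def inject_latency(delay_ms, enabled=True):
    """Wrap a handler so each invocation is delayed by delay_ms milliseconds."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            if enabled:
                time.sleep(delay_ms / 1000)  # simulate a slow dependency or cold start
            return handler(event, context)
        return wrapper
    return decorator


@inject_latency(delay_ms=200)
def handler(event, context):
    # Hypothetical handler; the real function body is unchanged by the experiment.
    return {"statusCode": 200, "body": "ok"}
```

If the injected delay pushes the function past its configured timeout or an upstream timeout, the experiment shows how the callers react.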
[Diagram: Client, Amazon S3, Amazon API Gateway, AWS Lambda, Amazon DynamoDB]
Serverless chaos demo
What if my function takes an extra 300 ms for each invocation?
What if my function returns an error code?
What if there is an exception in the code?
Hypothesis: If we inject failure into functions, then my application will degrade gracefully.
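The error-code and exception questions, together with the graceful degradation hypothesis, can be sketched as follows. Everything here is hypothetical: `inject_failure`, `recommendations`, and `fetch_with_fallback` are names invented for the example and are not taken from the demo app.

```python
import functools


def inject_failure(mode):
    """mode 'statuscode' returns an error response, 'exception' raises, None does nothing."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            if mode == "statuscode":
                return {"statusCode": 500, "body": "injected error"}
            if mode == "exception":
                raise RuntimeError("injected exception")
            return handler(event, context)
        return wrapper
    return decorator


def recommendations(event, context):
    # Hypothetical downstream function that the application depends on.
    return {"statusCode": 200, "body": ["a", "b"]}


def fetch_with_fallback(call, fallback):
    """Graceful degradation: use a default when the call fails."""
    try:
        response = call({}, None)
    except Exception:
        return fallback
    if response.get("statusCode") != 200:
        return fallback
    return response["body"]
```

With this caller, both failure modes produce the fallback value rather than an error for the end user, which is what the hypothesis predicts.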
Summary
Everything fails, all the time
Serverless doesn’t make your application resilient
Chaos engineering helps us find weaknesses and fix them
Chaos engineering is about building confidence
It’s not rocket science; you can do it!
Do you want more?
Follow @gunnargrosch and @serverlesschaos on Twitter
Try the Serverless Chaos Demo app:
https://demo.serverlesschaos.com
YouTube videos and repositories:
https://grosch.se
Join the chaos engineering Slack:
http://bit.ly/chaos-eng-slack
Visit chaos engineering meetups