@gunnargrosch
Performing chaos in a
serverless world
Gunnar Grosch
ServerlessDays Milano 2019
Slide 2
Chaos Engineering has been battle-tested for years
using traditional infrastructure and containerized
microservices, but how does it work with serverless
functions and managed services?
Slide 3
What we’ll cover
• What is Chaos Engineering?
• Running chaos experiments
• Challenges when using Chaos Engineering for serverless
• Serverless chaos experiments
Slide 4
A resilient system is a highly available and durable
system that can maintain an acceptable level of
service in the face of failure.
Slide 5
About me
• Evangelist and co-founder at Opsio
• Background in development and operations
• Organizer of AWS User Groups and Serverless Meetups
• ServerlessDays Stockholm and AWS Community Day Nordics organizer
• Father of three chaos monkeys
Slide 6
What is Chaos Engineering?
Slide 7
Chaos Engineering is the discipline of
experimenting on a system in order to build
confidence in the system’s capability to
withstand turbulent conditions in production.
principlesofchaos.org
Slide 8
Chaos Engineering is not about
breaking things
Slide 9
Chaos Engineering is about performing
controlled experiments to inject failures
Slide 10
Chaos Engineering is about finding the
weaknesses in a system and fixing them
before they break
Slide 11
Chaos Engineering is about building
confidence in your system and in your
organization
Slide 12
[Image: a screen showing “Source: HDMI” and “No Signal”]
Slide 13
“Everything fails, all the time!”
Werner Vogels, CTO Amazon
Slide 14
Don’t ask what happens if a system fails, but
ask what happens when it fails.
Slide 15
Running chaos experiments
Slide 16
Why run experiments?
• Are your customers getting the experience they should?
• Are downtime or issues costing you money?
• Are you confident in your monitoring and alerting?
• Is your organization ready to handle outages?
Slide 17
Step 1: Define steady state
• The normal behavior of a system over time
• System metrics and business metrics
• Steady state is not necessarily continuous
• Business metrics are usually more useful
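One way to make a steady state concrete is to derive a tolerance band from historical samples of a business metric. The sketch below is only an illustration: the metric (orders per minute), the sample values, and the choice of mean plus or minus three standard deviations are all assumptions, not something the talk prescribes. Because steady state is not necessarily continuous, a real baseline would likely be computed separately per time window (for example per hour of day).

```python
from statistics import mean, stdev


def steady_state_band(samples, tolerance=3.0):
    """Return (low, high) as mean +/- tolerance * sample standard deviation."""
    centre = mean(samples)
    spread = stdev(samples)
    return centre - tolerance * spread, centre + tolerance * spread


def in_steady_state(value, band):
    """True if the observed value lies inside the band (inclusive)."""
    low, high = band
    return low <= value <= high


# Hypothetical business metric: orders per minute during a normal period.
orders_per_minute = [98, 102, 100, 101, 99]
band = steady_state_band(orders_per_minute)
```

A reading of 103 orders per minute falls inside this band, while a reading of 90 falls outside it and would count as a deviation from steady state.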
Slide 18
Step 2: Form your hypothesis
• Chaos can be injected at any layer in the stack
• Use “what if” questions
• Always fix known problems first!
Slide 19
Step 3: Plan and run your experiment
• Whiteboard the experiment in detail
• Contain the blast radius
• Notify the organization
• Make sure to have a “stop” button
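A contained blast radius and a “stop” button can be sketched as a small runner that widens the experiment step by step and always rolls back. Everything here is hypothetical: `set_blast_radius`, `probe`, and `stop_requested` stand in for whatever mechanism actually toggles the injection, reads a guard metric, and signals a manual abort.

```python
from typing import Callable, Iterable


def run_experiment(
    set_blast_radius: Callable[[float], None],
    probe: Callable[[], float],
    max_error_rate: float,
    radii: Iterable[float] = (0.01, 0.05, 0.25),
    stop_requested: Callable[[], bool] = lambda: False,
) -> str:
    """Widen the blast radius stepwise; abort on a bad probe or a manual stop."""
    try:
        for radius in radii:
            if stop_requested():          # the "stop" button
                return "stopped"
            set_blast_radius(radius)      # fraction of traffic affected
            if probe() > max_error_rate:  # guard metric breached
                return "aborted"
        return "completed"
    finally:
        set_blast_radius(0.0)             # always roll the injection back
```

The `finally` clause is the important design choice: whichever way the run ends, the injection is switched off.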
Slide 20
Step 4: Measure and learn
• Use metrics to prove or disprove the hypothesis
• Was the system resilient to the injected failure?
• Did anything unexpected happen?
• Share your progress and success!
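Proving or disproving a hypothesis with metrics can be as simple as comparing a percentile before and during the experiment. This is a minimal sketch under assumed numbers: the 95th percentile, the 1.5x slowdown allowance, and the latency samples are illustrative choices, not values from the talk.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def hypothesis_holds(baseline_ms, experiment_ms, max_slowdown=1.5, pct=95):
    """True if the experiment percentile stays within max_slowdown x baseline."""
    limit = max_slowdown * percentile(baseline_ms, pct)
    return percentile(experiment_ms, pct) <= limit


# Hypothetical response times in milliseconds.
baseline = [100] * 19 + [200]
resilient_run = [120] * 19 + [900]
degraded_run = [120] * 18 + [400, 900]
```

With these samples the hypothesis holds for `resilient_run` and is disproved for `degraded_run`, which is the kind of result worth sharing either way.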
Slide 21
Step 5: Scale up or abort and fix
• With confidence you can scale up
• Increased scope can reveal new effects
Slide 22
When do we get to the serverless part?
Slide 23
Serverless means new challenges
• No servers to manage
• Less heavy lifting
• Lots of services
• Per-function configuration
• More granular architectures
Slide 24
Common serverless weaknesses
• Missing error handling
• Wrong timeout values
• Missing fallback
• Missing regional failover
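Three of these weaknesses (error handling, timeouts, fallback) can be illustrated in one helper that guards a downstream call. This is a sketch using only the Python standard library; the helper name `call_with_fallback` and the idea of returning a cached value are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool for guarded downstream calls.
_executor = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(func, fallback, timeout_s):
    """Run func; return fallback if it raises or exceeds timeout_s seconds."""
    future = _executor.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # best effort: a call that is already running continues
        return fallback
    except Exception:
        return fallback  # explicit error handling instead of a crash
```

A chaos experiment that injects latency or errors into `func` is a direct way to check that the timeout value and the fallback actually behave as intended.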
Slide 25
Serverless chaos experiments
Slide 26
Serverless Chaos Demo app
Slide 27
Error injection
• Inject errors in your code
  • One in X requests throws an error
  • Turn on and off using parameter or variable
• Alter the concurrency of your functions
• Restrict the capacity of your DynamoDB table
• Add configuration errors
  • Security policies
  • CORS configuration
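Injecting errors in code, with one in X requests failing and a switch to turn it on and off, might look like the decorator below. It is a sketch only: the environment variable names `CHAOS_ENABLED` and `CHAOS_ERROR_ONE_IN` are hypothetical, and a counter is used instead of random sampling so the behavior is predictable.

```python
import functools
import itertools
import os


def inject_errors(handler):
    """Raise on every Nth invocation while chaos is switched on."""
    counter = itertools.count(1)

    @functools.wraps(handler)
    def wrapper(event, context):
        # Hypothetical switches, read on each call so they can be flipped live.
        if os.environ.get("CHAOS_ENABLED") == "true":
            one_in = max(1, int(os.environ.get("CHAOS_ERROR_ONE_IN", "10")))
            if next(counter) % one_in == 0:
                raise RuntimeError("Injected failure (chaos experiment)")
        return handler(event, context)

    return wrapper


@inject_errors
def handler(event, context):
    """Stand-in for a Lambda-style function handler."""
    return {"statusCode": 200, "body": "ok"}
```

Reading the switch from an environment variable keeps the example self-contained; a parameter store would allow toggling without redeploying, at the cost of an extra lookup.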
Slide 28
Latency injection
• Add latency to your functions
  • Cold starts
  • Cloud provider issues
  • Runtime or code issues
  • Integration issues
  • Timeouts
• Yan Cui wrote an article and published sample code.
• Adrian Hornsby built a Lambda Layer around these ideas.
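The basic idea of adding latency to a function can be sketched as a decorator that sleeps before the handler runs. This is not the code from Yan Cui's demo or Adrian Hornsby's layer; it is a simplified illustration, and the environment variable name `CHAOS_LATENCY_MS` is a hypothetical choice.

```python
import functools
import os
import time


def inject_latency(handler):
    """Delay each invocation by CHAOS_LATENCY_MS milliseconds (0 disables it)."""

    @functools.wraps(handler)
    def wrapper(event, context):
        delay_ms = int(os.environ.get("CHAOS_LATENCY_MS", "0"))  # hypothetical name
        if delay_ms > 0:
            time.sleep(delay_ms / 1000)
        return handler(event, context)

    return wrapper


@inject_latency
def handler(event, context):
    """Stand-in for a Lambda-style function handler."""
    return {"statusCode": 200, "body": "ok"}
```

Setting the delay above the function's configured timeout is one way to provoke the timeout case on purpose.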
Slide 29
Latency injection
• What if my functions take X ms extra for each invocation?
• What if timeouts occur?
• Hypothesis: my app can handle latency injected at the function level.
• Let’s do it!
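Before running the experiment, the “X ms extra” question can be worked through on paper by adding the injected latency to each function's typical duration and comparing the sum with its configured timeout. The function names, durations, and timeouts below are made up for illustration.

```python
def functions_at_risk(functions, injected_ms):
    """Names of functions whose duration plus injected latency reaches the timeout."""
    return [
        name
        for name, (duration_ms, timeout_ms) in functions.items()
        if duration_ms + injected_ms >= timeout_ms
    ]


# Hypothetical functions: name -> (typical duration in ms, configured timeout in ms)
functions = {
    "get-item": (120, 3000),
    "resize-image": (2600, 3000),
    "send-email": (800, 1000),
}

at_risk_300 = functions_at_risk(functions, 300)
at_risk_500 = functions_at_risk(functions, 500)
```

With 300 ms injected, only `send-email` reaches its timeout (800 + 300 is at least 1000); with 500 ms, `resize-image` does as well (2600 + 500 is at least 3000). Those are the functions where the hypothesis is most likely to be disproved.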
Slide 30
Sample tools
• Gremlin: gremlin.com
• Chaos Toolkit: chaostoolkit.org
• Thundra: thundra.io
• Build Your Own: 127.0.0.1
Slide 31
Summary
• Chaos Engineering is not about breaking things.
• Chaos Engineering is about building confidence in your system and your organization.
• Serverless introduces new challenges for Chaos Engineering.
• You can do it!
Slide 32
Do you want more?
• Follow @serverlesschaos on Twitter
• Chaos Engineering Slack Community:
bit.ly/chaos-eng-slack
• Chaos Engineering Google Group:
https://groups.google.com/forum/#!forum/chaos-community
• List of awesome Chaos Engineering resources:
https://github.com/dastergon/awesome-chaos-engineering/
• Yan Cui’s latency injection demo:
https://github.com/theburningmonk/lambda-latency-injection-demo
• Adrian Hornsby’s latency injection layer:
https://github.com/adhorn/LatencyInjectionLayer
Slide 33
Inspiration