Applying Chaos Engineering to build resilient serverless applications

@emrahsamdan Applying Chaos Engineering to build resilient serverless applications
Emrah Şamdan (@emrahsamdan) 11/6/2019

Who am I?
• Developer for 6+ years
• Product guy for 2 years
• VP of Product for Thundra
• Organizing committee
• Father of a chaos monkey

@emrahsamdan Why chaos engineering?
Unit Tests • My function is running properly and meets the expectations.
Integration Tests • My system is running properly and meets the expectations.
UI/UX Tests • It works like a charm!

@emrahsamdan Your third-party API slows down badly.

@emrahsamdan Some part of your system becomes unreachable.

@emrahsamdan Your cache/DB is down so you can’t load your
data.

@emrahsamdan Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
http://principlesofchaos.org/

@emrahsamdan Breaking things on purpose in production
To make them more resilient
Well, maybe in staging?

@emrahsamdan Chaos Engineering is a vaccine for software
• For resiliency
• To prevent outages
• To define steady state

@emrahsamdan Chaos Engineering is not
• For breaking down
• For bad surprises
• For blaming
• For causing outages

@emrahsamdan History of chaos engineering? 2010 2011 2014 2019

@emrahsamdan Companies applying Chaos Engineering

@emrahsamdan Ups and downs of the system.
If I take this server down, maybe everything will still run smoothly. Maybe?
Let me attack my system!
Cute! Let me break something else.
Oh! I should fix this before it actually happens and then break something else.

@emrahsamdan Chaos experiments will (should) never end!

@emrahsamdan Don’t break on purpose!
• Start experimenting with the first row, the leftmost cell: Known-knowns.
• Blast radius: keep the effect as small as possible.
• Put a stop button somewhere!
• Plan how you learn.
• You don’t need to do it on production the first time.
• Most important: let other people know! Surprise chaos is not funny. Not at all!

@emrahsamdan Chaos examples
• Your system keeps records on the DB.
• DB is responding too slowly for 1% of your customers.
Hypothesis: The system won’t experience an outage when the DB is barely accessible.
Result: People experience timeouts while waiting for results.
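
One way to set up an experiment like this one is to delay DB calls for a small, stable slice of customers while keeping a stop button within reach. The following is a minimal sketch, not the speaker's implementation: the class name `LatencyExperiment`, its parameters, and the hash-based bucketing are all assumptions made for illustration.

```python
import hashlib
import time


class LatencyExperiment:
    """Hypothetical helper: delays DB calls for a small, stable slice of customers."""

    def __init__(self, blast_radius_percent, delay_seconds):
        self.blast_radius_percent = blast_radius_percent
        self.delay_seconds = delay_seconds
        self.enabled = True  # the "stop button": stop() flips this off

    def stop(self):
        """End the experiment immediately; no customer is delayed afterwards."""
        self.enabled = False

    def targets(self, customer_id):
        """Return True if this customer falls inside the blast radius."""
        if not self.enabled:
            return False
        digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
        return bucket < self.blast_radius_percent

    def query_db(self, customer_id, real_query, sleep=time.sleep):
        """Run real_query, sleeping first if the customer is targeted."""
        if self.targets(customer_id):
            sleep(self.delay_seconds)
        return real_query(customer_id)
```

Hashing the customer ID keeps the same customers in the experiment across requests, so the blast radius stays at roughly the chosen percentage instead of touching a different random 1% on every call. Calling `stop()` ends the injection without a redeploy.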

@emrahsamdan You never fail!

@emrahsamdan Chaos when everything is more granular. SERVERLESS

@emrahsamdan More Granular Functions

@emrahsamdan Every service has its own failure mode
Lots of managed intermediate services, each with its own bad-day characteristics. Different throttling, different retry mechanisms for different services.

@emrahsamdan Every function has its own configuration
• Timeouts
• IAM Roles

@emrahsamdan Common weaknesses in serverless • Nested functions with improper
timeouts
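
To make "improper timeouts" concrete: when functions call each other synchronously, a caller whose timeout is shorter than its callee's can give up while the callee is still working. The sketch below checks a call chain for that pattern. The function names, timeout values, and the one-second safety margin are hypothetical and chosen only for illustration.

```python
def find_timeout_violations(call_chain, safety_margin_seconds=1.0):
    """Find caller/callee pairs where the caller may time out first.

    call_chain is a list of (function_name, timeout_seconds) tuples,
    ordered from the outermost caller to the innermost callee.
    """
    violations = []
    for (caller, caller_timeout), (callee, callee_timeout) in zip(
        call_chain, call_chain[1:]
    ):
        # The caller should be able to wait for the callee's full timeout
        # plus a margin for its own work.
        if caller_timeout < callee_timeout + safety_margin_seconds:
            violations.append((caller, callee))
    return violations


# Hypothetical chain: the entry function waits 3 s for a callee allowed 10 s.
chain = [("api-handler", 3.0), ("order-service", 10.0), ("inventory-lookup", 5.0)]
print(find_timeout_violations(chain))  # [('api-handler', 'order-service')]
```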

@emrahsamdan Common weaknesses in serverless • Unhandled errors from upstream
services

@emrahsamdan Common weaknesses in serverless • Failures in resources

@emrahsamdan Chaos experiments in serverless
• Inject latency into downstream services
• Inject failure into resources

@emrahsamdan Injecting latency
• Don’t attack your system.
• You don’t need to do it on prod first.
• There is no point in injecting latency into async calls.
Hypothesis: Entry point Lambda will degrade gracefully when the downstream Lambda times out or returns really late.
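
What "degrade gracefully" could look like in the entry point is sketched below: wait for the downstream call only up to a deadline, then answer with a reduced response instead of failing. This is a sketch under assumptions, not code from the talk. `entry_handler`, `fetch_details`, and the response shape are hypothetical, and a real function would more likely rely on a client-side read timeout than on a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared pool so a slow downstream call does not block the handler thread.
_executor = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(downstream, payload, timeout_seconds):
    """Run downstream(payload) and wait at most timeout_seconds for the result."""
    future = _executor.submit(downstream, payload)
    # Raises FutureTimeout if no result arrives in time. The downstream call
    # itself keeps running in the background; we only stop waiting for it.
    return future.result(timeout=timeout_seconds)


def entry_handler(event, fetch_details, timeout_seconds=1.0):
    """Hypothetical entry point: returns a degraded response on timeout."""
    try:
        details = call_with_timeout(fetch_details, event, timeout_seconds)
        return {"statusCode": 200, "details": details, "degraded": False}
    except FutureTimeout:
        # Graceful degradation: answer without the optional details.
        return {"statusCode": 200, "details": None, "degraded": True}
```

With latency injected into the downstream call, the hypothesis holds if callers receive the degraded response rather than an error or a timeout of the entry point itself.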

@emrahsamdan Where else to inject? Inject latency to resources, too.

@emrahsamdan How to inject latency

@emrahsamdan How to inject latency with Thundra
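
The configuration shown on these slides is not captured in the transcript. As a tool-agnostic illustration (this is not Thundra's API), latency can be injected with a small decorator that sleeps before the wrapped call whenever an environment variable switches the experiment on. The decorator name and the variable `CHAOS_LATENCY_ENABLED` are assumptions.

```python
import functools
import os
import time


def inject_latency(delay_seconds, enabled_env_var="CHAOS_LATENCY_ENABLED",
                   sleep=time.sleep):
    """Hypothetical decorator: delay the wrapped call while the env var is "true"."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Checked on every call, so unsetting the variable acts as the
            # stop button without touching the code.
            if os.environ.get(enabled_env_var) == "true":
                sleep(delay_seconds)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(delay_seconds=2.0)
def get_order(order_id):
    """Stand-in for a call to a downstream service."""
    return {"order_id": order_id, "status": "shipped"}
```

Reading the switch at call time rather than at import time means the experiment can be turned off by changing the function's environment, with no change to the handler code.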

@emrahsamdan Injecting Error
• Connection errors with third-party services
• Cache down
• AWS Resource is unreachable

@emrahsamdan What if we lose the connection to Redis?

@emrahsamdan Let’s inject an error into Redis with Thundra
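
Independent of any particular tool, the lost-connection scenario can be rehearsed with a wrapper that makes every cache read fail, combined with a read path that falls back to the database. In this sketch a plain dict stands in for the cache, and `FlakyCache`, `load_profile`, and `load_from_db` are hypothetical names. A real Redis client may raise its own exception type, which the handler would then need to catch instead of the built-in `ConnectionError`.

```python
class FlakyCache:
    """Hypothetical wrapper: raises ConnectionError on reads while failing is True."""

    def __init__(self, cache):
        self._cache = cache  # anything with a dict-like get(key)
        self.failing = False

    def get(self, key):
        if self.failing:
            raise ConnectionError("injected: cache unreachable")
        return self._cache.get(key)


def load_profile(user_id, cache, load_from_db):
    """Read through the cache, falling back to the DB if the cache is down."""
    try:
        cached = cache.get(user_id)
        if cached is not None:
            return cached
    except ConnectionError:
        # Cache is unreachable: degrade to the slower source instead of failing.
        pass
    return load_from_db(user_id)


cache = FlakyCache({"u1": {"name": "from cache"}})
cache.failing = True  # start the experiment
print(load_profile("u1", cache, lambda uid: {"name": "from db"}))  # {'name': 'from db'}
```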

@emrahsamdan Common fixes
• Exponential backoff
• Properly tuned timeouts
• Circuit breakers
• Use async communication when possible
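
Two of these fixes are easy to show in a few lines. The sketch below is illustrative only: the names, thresholds, and the jitter strategy are assumptions, not taken from the talk. It retries an operation with exponential backoff and wraps calls in a minimal circuit breaker that fails fast after repeated errors.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    """Retry operation, doubling the delay cap after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))  # random jitter spreads out retries


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_seconds:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open the circuit
            raise
        self._failures = 0  # a success closes the circuit again
        return result
```

Backoff protects a struggling dependency from a burst of retries, while the breaker stops callers from waiting on a dependency that is known to be down. Rerunning the same chaos experiment after applying either one shows whether the fix actually helped.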

@emrahsamdan Don’t forget! Aim is
• Not to break but to improve
• Not to blame people but to give them room to fix
• Not to surprise your colleagues but to make your system resilient

@emrahsamdan Thank you!
