Applying Chaos Engineering to build resilient serverless applications

@emrahsamdan Applying Chaos Engineering to build resilient serverless applications Emrah
Şamdan (@emrahsamdan) 10/8/2019

Who am I? • Developer for 6+ years • Product
guy for 2 years • VP of Product for Thundra • Organizing committee

@emrahsamdan Agenda • What’s chaos engineering? • Why chaos testing
on serverless? • Best practices on chaos testing for serverless • How to apply chaos testing on AWS Lambda • How to apply silence in a world of chaos

@emrahsamdan Why chaos engineering? Unit Tests • My function is running properly and meets the expectations. Integration Tests • My system is running properly and meets the expectations. UI/UX Tests • It works like a charm!

@emrahsamdan Your third-party API slows down so badly...

@emrahsamdan Some part of your system becomes unreachable.

@emrahsamdan Your cache/DB is down so you can’t load your
data.

@emrahsamdan Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. http://principlesofchaos.org/

@emrahsamdan Chaos Engineering is a vaccine for software • For resiliency • To prevent outages • To define steady state

@emrahsamdan Chaos Engineering is not • For breaking things • For bad surprises • For blaming • For causing outages

@emrahsamdan History of chaos engineering? 2010 2011 2014 2019

@emrahsamdan Companies applying Chaos Engineering

@emrahsamdan States of chaos engineering
• Define the steady state.
• Form a hypothesis about how the steady state of the system will hold under the designed failure.
• Run your experiment:
  ◦ Define the blast radius.
  ◦ Define a halting condition.
  ◦ Have a rollback plan!
• Verify and learn:
  ◦ If your system breaks, you have found an issue before it causes an outage. Go fix it!
  ◦ If it is resilient, congrats! Now inject some other failure!
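The experiment loop on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the Experiment class, its field names, and the run function are hypothetical and do not come from the talk or from any chaos tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """One chaos experiment; every field name here is hypothetical."""
    name: str
    steady_state: Callable[[], bool]   # True while the system looks healthy
    inject: Callable[[], None]         # start the designed failure
    rollback: Callable[[], None]       # undo the failure
    should_halt: Callable[[], bool]    # halting condition, checked every step
    steps: int = 5                     # how many observations to take


def run(exp: Experiment) -> str:
    # 1. Refuse to start if the steady state does not hold to begin with.
    if not exp.steady_state():
        return "aborted: system not in steady state"
    exp.inject()
    try:
        for _ in range(exp.steps):
            # 2. The stop button: leave as soon as the halting condition trips.
            if exp.should_halt():
                return "halted: halting condition reached"
            # 3. The hypothesis: the steady state survives the failure.
            if not exp.steady_state():
                return "hypothesis rejected: weakness found"
        return "hypothesis held: system is resilient"
    finally:
        # 4. The rollback plan runs no matter how the experiment ended.
        exp.rollback()
```

Putting the rollback in a finally block means the injected failure is removed whether the hypothesis held, was rejected, or the experiment was halted.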

@emrahsamdan Don’t break on purpose! • Start experimenting with the first row, the leftmost cell: known-knowns. • Blast radius: keep the effect as small as possible. • Put a stop button somewhere! • Plan how you will learn. • You don’t need to do it in production the first time. • Most important: let other people know! Surprise chaos is not funny. Not at all!

@emrahsamdan Chaos examples • Your system keeps records in the DB. • The DB responds too slowly for 1% of your customers. Hypothesis: The system won’t experience an outage when the DB is barely accessible. Result: People experience timeouts while waiting for results.
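A small simulation of this example, as a sketch only. The latency numbers, the client timeout, and all function names are assumptions made for illustration; the slide only states the 1% blast radius. Customers are placed in the affected group by hashing their ID, so the same customers are hit on every run.

```python
import hashlib


def in_blast_radius(customer_id: str, percent: float) -> bool:
    """Deterministically place a customer in the affected group."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000   # 0..9999
    return bucket < percent * 100                          # 1% -> buckets 0..99


def db_latency_ms(customer_id: str, normal_ms: int = 20, slow_ms: int = 5000) -> int:
    """Simulated DB response time; only the blast radius sees the slow path."""
    return slow_ms if in_blast_radius(customer_id, 1.0) else normal_ms


def timed_out(customer_id: str, client_timeout_ms: int = 3000) -> bool:
    """True when the simulated DB answers later than the client is willing to wait."""
    return db_latency_ms(customer_id) > client_timeout_ms


customers = [f"customer-{i}" for i in range(10_000)]
affected = [c for c in customers if timed_out(c)]
print(f"{len(affected)} of {len(customers)} simulated customers timed out")
```

There is no outage for most customers, yet roughly one in a hundred waits until the timeout, which is the kind of result the slide describes.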

@emrahsamdan You never fail!

@emrahsamdan Chaos when everything is more granular. SERVERLESS

@emrahsamdan More Granular Functions

@emrahsamdan Every service has its own failure mode. Lots of managed intermediate services, each with its own bad-day characteristics. Different throttling, different retry mechanisms for different services.

@emrahsamdan Every function has its own configuration • Timeouts • IAM Roles

@emrahsamdan Common weaknesses in serverless • Nested functions with improper
timeouts
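One way to spot this weakness before running any experiment is to compare the timeouts along each synchronous call chain. The sketch below assumes a hand-written call graph and timeout table; the function names, values, and the one-second margin are made up for illustration.

```python
from typing import Dict, List

# Hypothetical call graph: caller -> functions it invokes synchronously.
CALLS: Dict[str, List[str]] = {
    "api-handler": ["order-service"],
    "order-service": ["payment-service"],
}

# Hypothetical configured timeouts, in seconds.
TIMEOUTS_S: Dict[str, float] = {
    "api-handler": 3.0,
    "order-service": 10.0,
    "payment-service": 5.0,
}


def improper_timeouts(calls: Dict[str, List[str]],
                      timeouts: Dict[str, float],
                      margin_s: float = 1.0) -> List[str]:
    """List callers that can time out before the function they wait on."""
    problems = []
    for caller, callees in calls.items():
        for callee in callees:
            # The caller needs the callee's full timeout plus its own overhead.
            if timeouts[caller] < timeouts[callee] + margin_s:
                problems.append(
                    f"{caller} ({timeouts[caller]}s) waits on "
                    f"{callee} ({timeouts[callee]}s)"
                )
    return problems


for problem in improper_timeouts(CALLS, TIMEOUTS_S):
    print("improper timeout:", problem)
```

Here api-handler gives up after 3 seconds while order-service may legitimately run for 10, so the caller fails while the nested function keeps working.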

@emrahsamdan Common weaknesses in serverless • Unhandled errors from upstream
services

@emrahsamdan Common weaknesses in serverless • Failures in resources

@emrahsamdan Chaos experiments in serverless • Inject latency to downstream
services • Inject failure to resources

@emrahsamdan Injecting latency • Don’t attack your system. • You don’t need to do it on prod first. • There is no point in injecting latency into async calls. Hypothesis: The entry point Lambda will degrade gracefully when the downstream Lambda times out or returns really late.
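A minimal sketch of latency injection around a synchronous downstream call, written in Python. It is not the approach shown in the talk: the CHAOS_* environment variable names, the decorator, the stand-in call_downstream function, and the time budget are all assumptions for illustration.

```python
import functools
import os
import random
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def inject_latency(func):
    """Delay a call according to environment variables (hypothetical names)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        enabled = os.environ.get("CHAOS_ENABLED", "false") == "true"
        rate = float(os.environ.get("CHAOS_RATE", "0"))        # 0.0 .. 1.0
        delay_ms = int(os.environ.get("CHAOS_DELAY_MS", "0"))
        if enabled and random.random() < rate:
            time.sleep(delay_ms / 1000)
        return func(*args, **kwargs)
    return wrapper


@inject_latency
def call_downstream(payload):
    # Stand-in for a synchronous call to a downstream Lambda or HTTP API.
    return {"echo": payload}


_pool = ThreadPoolExecutor(max_workers=4)


def handler(event, context=None, budget_s=0.2):
    """Entry point that degrades gracefully when the downstream call is late."""
    future = _pool.submit(call_downstream, event)
    try:
        return {"status": "ok", "data": future.result(timeout=budget_s)}
    except FutureTimeout:
        # Give up waiting and answer with a degraded response instead.
        return {"status": "degraded", "data": None}
```

With CHAOS_ENABLED=true, CHAOS_RATE=1 and CHAOS_DELAY_MS=500, every call exceeds the 0.2 second budget and the handler returns the degraded response; with injection disabled it returns the normal one. Reading the settings on every call keeps the stop button one configuration change away.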

@emrahsamdan Where else to inject? Inject latency to resources, too.

@emrahsamdan How to inject latency

@emrahsamdan Injecting Latency to resources by Yan Cui

@emrahsamdan How to inject latency with Thundra

@emrahsamdan Injecting Error • Connection errors with third party services
• Cache down • AWS Resource is unreachable

@emrahsamdan What if we lose the connection to Redis?
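The question can be explored with a small sketch that runs without a Redis server. FakeRedis, FailingCache, load_from_db, and get_user are hypothetical names, and the built-in ConnectionError stands in for whatever exception a real client library raises, so the except clauses would need adjusting for a real client.

```python
import random


class FakeRedis:
    """In-memory stand-in so the sketch runs without a Redis server."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class FailingCache:
    """Wraps a cache client and raises ConnectionError at a given rate."""
    def __init__(self, client, failure_rate: float):
        self._client = client
        self._failure_rate = failure_rate

    def _maybe_fail(self):
        if random.random() < self._failure_rate:
            raise ConnectionError("injected: cache unreachable")

    def get(self, key):
        self._maybe_fail()
        return self._client.get(key)

    def set(self, key, value):
        self._maybe_fail()
        self._client.set(key, value)


def load_from_db(user_id):
    # Stand-in for the slower source of truth.
    return {"id": user_id, "source": "db"}


def get_user(cache, user_id):
    """Read through the cache, falling back to the DB when the cache is down."""
    try:
        cached = cache.get(user_id)
        if cached is not None:
            return cached
    except ConnectionError:
        pass  # cache unreachable: fall through to the DB
    user = load_from_db(user_id)
    try:
        cache.set(user_id, user)
    except ConnectionError:
        pass  # failing to repopulate the cache must not fail the request
    return user
```

With failure_rate=1.0 every cache call fails, and get_user still answers from the DB. A handler that lets the ConnectionError escape would instead turn a cache outage into a user-facing error.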

@emrahsamdan Let’s inject an error into Redis with Thundra

@emrahsamdan Common fixes • Exponential backoff • Properly tuned timeouts • Circuit breakers • Use async communication when possible
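Two of these fixes, exponential backoff and a circuit breaker, can be sketched as follows. This is one simple variant of each, not the implementation from the talk; the names, default thresholds, and the choice of full jitter are assumptions.

```python
import random
import time


def retry_with_backoff(func, attempts=4, base_s=0.1, cap_s=2.0, sleep=time.sleep):
    """Retry func, doubling the maximum wait after each failure."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise                               # out of attempts
            delay = min(cap_s, base_s * 2 ** attempt)
            sleep(random.uniform(0, delay))         # full jitter


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the failing dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None                      # None means closed

    def call(self, func):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open: failing fast")
            # Half-open: let one trial call through; one failure reopens.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result
```

Backoff spreads retries out so a struggling dependency is not hammered, while the breaker stops calling it altogether for a while and fails fast. The sleep and clock parameters exist only so the behaviour can be tested without waiting.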

@emrahsamdan Don’t forget! The aim is • Not to break but to improve • Not to blame people but to give them room to fix • Not to surprise your colleagues but to make your system resilient

@emrahsamdan Thank you!
