Applying Chaos Engineering to build resilient serverless applications

Emrah Şamdan (@emrahsamdan), 4/25/2019

Who am I?
• Developer for 6+ years
• Product guy for 2 years
• VP of Product at Thundra
• Organizing committee
  ◦ ServerlessDays İstanbul, on October 11th!

Agenda
• What’s chaos engineering?
• Why chaos testing on serverless?
• Best practices on chaos testing for serverless
• How to apply chaos testing on AWS Lambda
• How to apply silence in a world of chaos

Why chaos engineering?
Unit Tests
• My function is running properly and meets the expectations.
Integration Tests
• My system is running properly and meets the expectations.
UI/UX Tests
• It works like a charm!

Your third-party API slows down badly.

Some part of your system becomes unreachable.

Your cache/DB is down so you can’t load your data.

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
http://principlesofchaos.org/

Chaos Engineering is
• Like injecting a vaccine into your system to make it more immune.
• Improving your system’s resilience by uncovering weaknesses.
• Identifying failures before they become outages.
• Understanding the steady state of your system and challenging it.

Chaos Engineering is not
• Breaking production on purpose.
• Blaming a group of people.
• Surprising your colleagues with partial outages.
• Taking down the whole system at once.

History of chaos engineering
[Timeline slide: 2010, 2011, 2014, 2019]

Companies applying Chaos Engineering

Stages of chaos engineering
• Define the steady state
• Form a hypothesis about the steady state of the system under the designed failure
• Run your experiment
  ◦ Define the blast radius
  ◦ Define a halting condition
  ◦ Have a rollback plan!
• Verify & learn
  ◦ If your system breaks, you have found an issue before it causes an outage. Go fix it!
  ◦ If it is resilient, congrats! Now inject some other failure!
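
The loop above can be sketched in a few lines of Python. This is only an illustration, not a tool from the talk: the `Experiment` class, its field names, and the result strings are all hypothetical. The steady-state check, the failure injection, the halting condition, and the rollback are plain callables that you would supply for your own system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """Hypothetical container for one chaos experiment."""
    name: str
    steady_state: Callable[[], bool]  # True while the system looks healthy
    inject: Callable[[], None]        # starts the designed failure
    rollback: Callable[[], None]      # removes the failure again
    should_halt: Callable[[], bool]   # halting condition (the "stop button")
    checks: int = 5                   # number of probes while the failure is active


def run(exp: Experiment) -> str:
    # Never start an experiment on a system that is already unhealthy.
    if not exp.steady_state():
        return "skipped: system not in steady state"

    exp.inject()
    try:
        for _ in range(exp.checks):
            # The halting condition is checked before every probe.
            if exp.should_halt():
                return "halted: halting condition reached"
            # Hypothesis: the steady state survives the injected failure.
            if not exp.steady_state():
                return "weakness found: hypothesis rejected"
        return "resilient: hypothesis held"
    finally:
        # The rollback plan runs no matter how the experiment ends.
        exp.rollback()
```

Putting the rollback in a `finally` block is the design choice worth copying: the failure is removed whether the hypothesis holds, the experiment is halted, or the probe itself raises.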

Don’t break on purpose!
• Start experimenting with the first row, the leftmost cell: known-knowns.
• Blast radius: keep the effect as small as possible.
• Put a stop button somewhere!
• Plan how you will learn.
• You don’t need to do it in production the first time.
• Most important: let other people know! Surprise chaos is not funny. Not at all!

Chaos examples
• Your system keeps records in the DB.
• The DB responds too slowly for 1% of your customers.
Hypothesis: The system won’t experience an outage when the DB is barely reachable.

Result: People experience timeouts while waiting for results.
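
One way to keep the blast radius at 1% of customers is to pick the affected customers deterministically from a hash of their ID, so the same customers stay in the experiment group for the whole run. The sketch below simulates that. The function names, the latency numbers, and the 1-second timeout are assumptions made up for illustration, and the DB is only simulated.

```python
import hashlib

BASE_LATENCY_S = 0.05  # assumed normal DB latency
SLOW_LATENCY_S = 5.0   # assumed latency injected for the experiment group


def in_blast_radius(customer_id: str, percent: float) -> bool:
    """Deterministically place `percent` % of customers in the experiment group."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # 1.0 percent -> buckets 0..99


def simulated_db_latency(customer_id: str, slow_percent: float = 1.0) -> float:
    """Latency the (simulated) DB would show for this customer."""
    if in_blast_radius(customer_id, slow_percent):
        return SLOW_LATENCY_S
    return BASE_LATENCY_S


def handle_request(customer_id: str, timeout_s: float = 1.0,
                   slow_percent: float = 1.0) -> str:
    """The caller gives up when the DB takes longer than its timeout."""
    if simulated_db_latency(customer_id, slow_percent) > timeout_s:
        return "timeout"
    return "ok"
```

Running `handle_request` over a large list of customer IDs shows roughly 1% of them ending in `"timeout"`, which is the result on the slide: no outage, but a small group of customers waits and then fails.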

You never fail!

Chaos when everything is more granular. SERVERLESS

More Granular Functions

Every service has its own failure mode
Lots of managed intermediate services, each with its own bad-day characteristics.
Different throttling and different retry mechanisms for different services.

Every function has its own configuration
• Timeouts
• IAM Roles

What would you do when your region is down?

Common weaknesses in serverless • Nested functions with improper timeouts
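
A quick way to spot this weakness before running any experiment is to compare the configured timeouts along a synchronous call chain. The check below is a hypothetical helper, not part of any AWS tooling. It assumes you already have the timeouts in seconds, listed from the entry point down, and that a caller should outlive its callee by some margin.

```python
from typing import List, Tuple


def find_timeout_violations(chain: List[Tuple[str, float]],
                            margin_s: float = 1.0) -> List[str]:
    """Report callers whose timeout is too short for the function they call.

    `chain` lists (function_name, timeout_seconds) from the entry point down.
    A caller should wait at least callee timeout + margin, otherwise it can
    give up while the callee is still working.
    """
    violations = []
    for (caller, caller_t), (callee, callee_t) in zip(chain, chain[1:]):
        if caller_t < callee_t + margin_s:
            violations.append(
                f"{caller} ({caller_t}s) may time out before "
                f"{callee} ({callee_t}s) returns"
            )
    return violations


# Example with made-up function names and timeouts.
chain = [("api", 3), ("orders", 6), ("payments", 2)]
problems = find_timeout_violations(chain)
```

Here `api` waits only 3 seconds for `orders`, which is allowed to run for 6, so `problems` contains one entry for that pair.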

Common weaknesses in serverless • Unhandled errors from upstream services

Common weaknesses in serverless • Failures in resources

Chaos experiments in serverless
• Inject latency into downstream services
• Inject failures into resources

Injecting latency
• Don’t attack your system.
• You don’t need to do it on prod first.
• There is no point in injecting latency into async calls.
Hypothesis: The entry point Lambda will degrade gracefully when the downstream Lambda times out or responds very late.
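
The hypothesis can be tried locally before touching any real Lambda. In the sketch below, plain Python functions stand in for the two Lambdas. `with_latency`, `downstream`, `entry_point`, and the response shape are all hypothetical names chosen for this example, and the timeout values are arbitrary.

```python
import functools
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def with_latency(func, delay_s):
    """Chaos wrapper: add a fixed delay in front of every call to `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        time.sleep(delay_s)
        return func(*args, **kwargs)
    return wrapper


def downstream(order_id):
    """Stand-in for the downstream Lambda."""
    return {"order_id": order_id, "recommendations": ["a", "b"]}


def entry_point(order_id, call=downstream, timeout_s=0.2):
    """Stand-in for the entry point Lambda.

    It waits at most `timeout_s` for the downstream call, then degrades
    gracefully by returning a reduced response instead of failing.
    """
    future = _pool.submit(call, order_id)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The slow call keeps running in its thread; we just stop waiting.
        return {"order_id": order_id, "recommendations": [], "degraded": True}
```

Calling `entry_point(1, call=with_latency(downstream, 0.5), timeout_s=0.1)` returns the degraded response, while the same call without injected latency returns the full one. If the entry point had no timeout or no fallback, this is where the experiment would expose it.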

Where else to inject? Inject latency to resources, too.

How to inject latency

Injecting Latency to resources by Yan Cui

How to inject latency with Thundra

Injecting errors
• Connection errors with third-party services
• Cache down
• AWS resource is unreachable

What if we lose the connection to Redis?
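
A small local sketch of that question: wrap the cache client in something that fails every call, then check that reads still work by falling back to the database. Nothing here uses a real Redis client or Thundra. `FakeCache`, `FailingCache`, and `load_profile` are hypothetical, and the built-in `ConnectionError` stands in for whatever connection exception your client library raises.

```python
class FakeCache:
    """In-memory stand-in for a cache client with get/set."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class FailingCache:
    """Chaos wrapper: while enabled, every call fails as if the cache were unreachable."""

    def __init__(self, inner, enabled=True):
        self.inner = inner
        self.enabled = enabled

    def get(self, key):
        if self.enabled:
            raise ConnectionError("injected: cache unreachable")
        return self.inner.get(key)

    def set(self, key, value):
        if self.enabled:
            raise ConnectionError("injected: cache unreachable")
        self.inner.set(key, value)


def load_profile(user_id, cache, db):
    """Read through the cache, but survive a cache outage by going to the DB."""
    key = f"profile:{user_id}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except ConnectionError:
        pass  # cache is down: fall through to the DB
    value = db[user_id]
    try:
        cache.set(key, value)
    except ConnectionError:
        pass  # could not repopulate the cache; the request still succeeds
    return value
```

With `FailingCache(FakeCache())` in place of the normal cache, `load_profile` still returns the profile from `db`. A version without the `try`/`except` blocks would turn a cache outage into a failed request, which is exactly what the experiment is meant to reveal.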

Let’s inject an error into Redis with Thundra

Common fixes
• Exponential backoff
• Properly tuned timeouts
• Circuit breakers
• Use async communication when possible
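
Two of these fixes are easy to show in miniature. The sketch below is one possible shape for retries with exponential backoff (with random jitter added, a common variation) and for a very simple circuit breaker. All names, defaults, and thresholds are assumptions for illustration; in a real project you would more likely reach for an existing library.

```python
import random
import time


def retry(func, attempts=5, base_s=0.1, cap_s=5.0, sleep=time.sleep):
    """Call `func`, retrying on any exception with exponential backoff and jitter."""
    for n in range(attempts):
        try:
            return func()
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait a random time up to base * 2**n, capped at cap_s.
            sleep(random.uniform(0, min(cap_s, base_s * 2 ** n)))


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the protected function."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

Backoff protects a struggling dependency from a wave of immediate retries, and the breaker stops calling it altogether for a while once failures pile up. Both are worth re-testing with the latency and error injections from the earlier slides.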

Don’t forget! The aim is
• Not to break but to improve
• Not to blame people but to give them room to fix
• Not to surprise your colleagues but to make your system resilient

Thank you!
