From Development to Production: Many Uses of Serverless Observability

Emrah Şamdan, 10/17/2019

@emrahsamdan
Who am I?
• Developer for 6+ years
• Product manager for 2 years
• VP of Product for Thundra
• Organizing committee
• ServerlessDays İstanbul organizer


Welcome to my presentation on serverless monitoring!

Agenda
• Serverless as a shared responsibility model
• Is observability a buzzword or a real thing?
• Observability challenges in serverless
• Observability-Driven Development
• How/when to test serverless applications
• What to check to monitor the serverless stack
• Troubleshooting serverless applications

What’s serverless?
"Serverless computing is a cloud-computing execution model in which the cloud provider runs the server, and dynamically manages the allocation of machine resources. Pricing is based on the actual amount of resources consumed by an application, rather than on pre-purchased units of capacity."
Wikipedia: https://en.wikipedia.org/wiki/Serverless_computing

What’s serverless?
A doctrine, a thought model helping you deliver faster and put your focus on the value you provide to your customers.

It is serverless not because there are no servers but because you think about servers less. SERVERLESS IS MORE!

WAIT?!

Shared Responsibility Model
The cloud vendor will handle scalability and reliability. But performance and security ARE still ON US.

Serverless Observability
• Serverless is full of hidden traps that can harm its promise.
• Can be very costly.
• Can perform really poorly.
• You need to check what’s going on.

What’s observability?
https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c

Are we ready for unknown unknowns?
• Known knowns: Things that we understand and are aware of. "Follow the metric charts."
• Known unknowns: Things that we are aware of but don’t understand at a glance. "Yes, there is a peak over there. Let me dig into the traces and logs."
• Unknown knowns: Things that we can understand but are not aware of. "I would have fixed this if I had that metric chart :("
• Unknown unknowns: Things we neither understand nor are aware of. "Things are broken and I have no freaking idea with what I have."

The pillars
• Traces
• Metrics
• Logs
• Visualization
• Machine Learning and Insights
• Alerts

Observability Challenges in Serverless
• No access to the underlying infrastructure
• You either take whatever the cloud vendor provides or accept that there will be an overhead.
• Overhead?
  ◦ Gathering intelligence (should be acceptable)
  ◦ Transporting it to where necessary (you only have the invocation lifetime to get the data out)
• Everything is event-driven and distributed.

Fix the Charizard. If you can!

Observability-Driven Development

Observability-Driven Development
• Recall the unknown unknowns.
• What do you need when you have to troubleshoot an unknown unknown?
• Can an observability tool know your unknowns?
• If you don’t know what to know, what can you do?

ASK
Your observability tool should give you auto-replies. But it should also let you ask wise questions.

Observability-Driven Development
• Not a replacement for test-driven development.
• Think of the answers that you can give for any type of question.
• If you are thinking about questions, request that feature from your tool.
• Structured logging and manual instrumentation are key.
Retrieved from: https://dzone.com/articles/what-is-structured-logging
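A minimal sketch of what structured logging could look like in a Python function handler, assuming logs written to stdout are collected by the platform. The `log_event` helper and the field names (`order_id`, `item_count`) are hypothetical and only illustrate emitting queryable key-value fields instead of free text.

```python
import json
import time


def log_event(level, message, **fields):
    """Build one JSON log line; extra keyword arguments become queryable fields."""
    record = {"timestamp": time.time(), "level": level, "message": message}
    record.update(fields)
    line = json.dumps(record, sort_keys=True)
    print(line)  # assumption: stdout is shipped to the log backend
    return line


def handler(event, context=None):
    """Hypothetical handler that records the context a later question may need."""
    order_id = event.get("order_id")
    items = event.get("items", [])
    log_event("INFO", "order received", order_id=order_id, item_count=len(items))
    return {"status": "ok", "order_id": order_id}
```

Because every line is a JSON object, a question such as "which orders had more than ten items?" can be asked later without changing the code.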

Observability-Driven Development (Cons)
• Observability coverage?
• Hard to get accustomed to.
• You can’t sample a thing.

TESTING

Testing challenges in Serverless
• Local testing is a pain.
  ◦ How to mock the cloud resources? Is it actually correct to mock them?
  ◦ How should you test the chain of many invocations?
  ◦ How to integrate it with CI/CD tools?
• Integration testing with real resources is still the best approach, but again, how?
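One way to approach the mocking question is to inject the resource into the business logic so that a local test can pass a stand-in. The sketch below assumes a key-value table; `InMemoryTable` and `save_order` are hypothetical names, not part of any cloud SDK. It shows only that the logic can be exercised locally, and it does not prove the real resource behaves the same way.

```python
class InMemoryTable:
    """Hypothetical stand-in for a cloud key-value table; not a real SDK client."""

    def __init__(self):
        self.items = {}

    def put_item(self, key, value):
        self.items[key] = value

    def get_item(self, key):
        return self.items.get(key)


def save_order(table, order):
    """Business logic under test: validate, then persist through the injected table."""
    if not order.get("order_id"):
        raise ValueError("order_id is required")
    table.put_item(order["order_id"], order)
    return order["order_id"]


# Local check against the stand-in.
table = InMemoryTable()
saved_id = save_order(table, {"order_id": "o-1", "total": 5})
```

The same `save_order` can later be pointed at a real table in an integration test, which is where throttling, permissions, and latency would show up.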

Integration testing
• Serverless != Functions
• Test your business logic against the resources.
• See how your messages are transformed in the flow.
• Async events can cause problems that you can never guess.

Integration Testing (Cons)
• You’re still dealing only with known knowns.
• Resources that are not pay-per-use.
  ◦ Setting up a test environment. Still?
• Not with the production data.

Chaos Testing Serverless Applications
• Serverless fits chaos engineering well because:
  ◦ It is distributed.
  ◦ There are lots of possible failures in an async environment.
  ◦ It is event-driven (so, poisonous events).
  ◦ Roles and permissions are so granular that access can slip away.

Chaos testing on serverless: what?
• What would happen if an inner Lambda starts to respond slowly?
• Are you sure that you have tuned timeouts properly?
• Test by injecting latency.
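Latency injection can be sketched as a decorator that delays a function before it runs. This is an illustrative assumption, not the API of any chaos library; `inject_latency`, `inner_function`, and `call_with_timeout_budget` are hypothetical names, and the delays are kept tiny so the example runs quickly.

```python
import functools
import time


def inject_latency(delay_seconds, enabled=True):
    """Decorator that sleeps before calling the wrapped function when enabled."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled:
                time.sleep(delay_seconds)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(delay_seconds=0.05)
def inner_function(payload):
    """Hypothetical downstream function whose slowness is being simulated."""
    return {"echo": payload}


def call_with_timeout_budget(payload, budget_seconds):
    """Call the inner function and report whether it stayed within the caller's budget."""
    started = time.monotonic()
    result = inner_function(payload)
    elapsed = time.monotonic() - started
    return result, elapsed <= budget_seconds
```

Running the caller with a budget shorter than the injected delay shows whether the configured timeout would have been exceeded, which is the question the slide asks.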

Chaos testing on serverless: what?
• What if we lose the connection to Redis?
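A chaos experiment for a lost cache connection needs a code path to verify. The sketch below assumes a read-through cache with a fallback to the source of truth; `FlakyCache` is a hypothetical test double that simulates the outage and is not a Redis client.

```python
class FlakyCache:
    """Hypothetical cache double that can be switched into a failing state."""

    def __init__(self):
        self.data = {}
        self.connected = True

    def get(self, key):
        if not self.connected:
            raise ConnectionError("cache unreachable")
        return self.data.get(key)


def get_price(cache, load_from_source, product_id):
    """Read through the cache; fall back to the source if the cache misses or is down."""
    try:
        cached = cache.get(product_id)
        if cached is not None:
            return cached, "cache"
    except ConnectionError:
        pass  # the experiment checks that this path still returns a value
    return load_from_source(product_id), "source"
```

Flipping `connected` to `False` during a test reveals whether the function degrades gracefully or fails the whole invocation.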

MONITORING
How large should my screen be to see the charts for thousands of functions?

How to discover an anomaly in serverless?


Serverless is more than functions, so is monitoring.
• Issues can stay local before you notice them.
• It is slow. Why?
  ◦ API slowdown?
  ◦ Throttle on any resource?
  ◦ Bad coding practice?
• Invocation counts go crazy.
  ◦ Seasonal peak?
  ◦ Successful product?
  ◦ Retry storms?


Monitoring Latency
• Abnormal latency is mostly not related to the function code.
  ◦ Idly waiting for a third-party API
  ◦ Throttled resource
• Set aggregated alerts
  ◦ Alert on transaction duration
  ◦ Alert on function duration
  ◦ Alert on operation duration
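An aggregated alert can be sketched as a check on a percentile of durations over a window rather than on single invocations. The nearest-rank percentile, the 99th-percentile default, and the function names below are illustrative assumptions; the same check could be applied to transaction, function, or operation durations.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_alert(durations_ms, threshold_ms, pct=99):
    """Alert when the chosen percentile of the window exceeds the threshold."""
    if not durations_ms:
        return False
    return percentile(durations_ms, pct) > threshold_ms


# Hypothetical window: mostly fast invocations with two slow outliers.
window = [100] * 98 + [900, 950]
```

With this window, a threshold of 500 ms triggers at the 99th percentile but not at the median, which is why the choice of aggregate matters as much as the threshold.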

Storm of retries and errors
• When your code fails for some reason, your function will retry several times.
  ◦ Sync events: You should control it.
  ◦ Async events: Different retry mechanisms.
  ◦ Stream-based events: Risk of losing data.
• Does this solve the problem?
• Check
  ◦ Iterator age
  ◦ Number of retries
  ◦ Number of errors
  ◦ Memory usage
  ◦ Cold starts

TROUBLESHOOTING
Bad things happen in serverless, too. Now, it’s time to battle!

Failure modes of serverless
• https://github.com/adhorn/aws-lambda-chaos-injection
• https://github.com/gunnargrosch/

Failure modes of serverless
• Badly tuned memory
• Timeout
• Error in code or in a managed resource

Challenges of Troubleshooting
• How to trace the distributed async events with non-aggregated traces, metrics and logs?
• How to trace requests to external resources?
• How to trace the async distributed events?

Distributed Tracing
• Trace the distributed transactions: chain of multiple invocations
• Understand what is wrong at a glance
• But?! What if I have a bad coding practice in the code?
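The idea of tracing a chain of invocations can be sketched by passing a trace ID and a parent span ID along with each message. The field names (`trace_id`, `parent_span_id`) and the two functions are hypothetical; real tracing tools define their own context formats.

```python
import uuid


def start_trace(event):
    """Reuse the incoming trace ID, or create one at the entry point of the chain."""
    return event.get("trace_id") or uuid.uuid4().hex


def make_span(trace_id, name, parent_id=None):
    """Record one unit of work that belongs to a trace."""
    return {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,
        "name": name,
    }


def first_function(event, spans):
    """Entry point: opens the trace and forwards the context with the message."""
    trace_id = start_trace(event)
    span = make_span(trace_id, "first_function")
    spans.append(span)
    return {
        "payload": event.get("payload"),
        "trace_id": trace_id,
        "parent_span_id": span["span_id"],
    }


def second_function(event, spans):
    """Downstream function: joins the same trace using the forwarded context."""
    span = make_span(start_trace(event), "second_function", event.get("parent_span_id"))
    spans.append(span)
    return span
```

Because both spans share one trace ID and the second points at the first as its parent, a backend can stitch the invocations into a single transaction view.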

Local Tracing
• Instrument the code itself and check against code quality.
• Good for
  ◦ Discovering bad coding practices
  ◦ Inspecting the values of local variables in the code
  ◦ Debugging the code without breakpoints

Actionable Alerts in Serverless
• Alert on code errors
  ◦ Stack trace
  ◦ Code line that caused it
  ◦ Values of local variables
• Alert on latencies and timeout errors
  ◦ Slow API communications
  ◦ Slow DB interactions caused by bad queries
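As a rough illustration of what an actionable code-error alert could carry, the sketch below builds a payload from a caught exception using Python's standard `traceback` module and frame objects. The payload shape, `build_error_alert`, and `apply_discount` are hypothetical; a real tool would also redact sensitive values before sending locals anywhere.

```python
import traceback


def build_error_alert(error):
    """Collect stack trace, failing line, and local variables from a caught exception."""
    tb = error.__traceback__
    while tb.tb_next is not None:  # walk to the innermost frame, where the error was raised
        tb = tb.tb_next
    frame = tb.tb_frame
    return {
        "error_type": type(error).__name__,
        "message": str(error),
        "function": frame.f_code.co_name,
        "line": tb.tb_lineno,
        "locals": {name: repr(value) for name, value in frame.f_locals.items()},
        "stacktrace": "".join(
            traceback.format_exception(type(error), error, error.__traceback__)
        ),
    }


def apply_discount(price, percent):
    """Hypothetical buggy function: fails when percent is 0."""
    factor = 100 / percent
    return price / factor


try:
    apply_discount(50, 0)
except ZeroDivisionError as caught:
    alert = build_error_alert(caught)
```

The resulting `alert` names the failing function and line and includes the local values at the moment of failure, so the on-call engineer does not have to reproduce the input first.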

How to respond to the issues on serverless
• The issue may not be your code.
• Check the third parties.
• Check other metrics
  ◦ Iterator age of streams
  ◦ Throttles on resources
• Have some runbooks
  ◦ Exponential backoffs to APIs, alternative APIs
  ◦ Healthy on-call structures
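The runbook item on exponential backoff with an alternative API can be sketched as follows. The helper name, the attempt count, and the base delay are assumptions for illustration; `primary` and `fallback` stand for any two callables that reach the main and the alternative API.

```python
import time


def call_with_backoff(primary, fallback=None, max_attempts=3, base_delay=0.01,
                      sleep=time.sleep):
    """Retry `primary` with exponentially growing delays, then try `fallback` once."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return primary()
        except Exception as error:  # a real runbook would narrow this to retryable errors
            last_error = error
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))  # delay doubles after each failure
    if fallback is not None:
        return fallback()
    raise last_error
```

The `sleep` parameter is injectable so that a test can record the delays instead of waiting. Keeping `max_attempts` small matters in a function runtime, since every retry consumes part of the invocation's time budget.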

Key Takeaways
• Serverless observability is not an after-production issue.
• Observability with all three pillars aggregated is life-saving.
• Automation is king! But get yourself ready to ask questions with ODD.
• Sadly, no testing scenario is sufficient in serverless. Step into chaos engineering before your engineering runs into chaos!
• Change the way you monitor your system. Look beyond functions and discover local bottlenecks with an architectural view!
• Serverless transaction = a chain of invocations commuting between resources and APIs. Full tracing required!
• Make your alerts actionable and start keeping runbooks for the issues that you can predict.

Thank you! Dziękuję!
