From Development to Production: Many Uses of Serverless Observability

Emrah Samdan
September 09, 2019

Observability is different from monitoring because it is a state of your software system rather than an action taken to keep it up. You can resolve the issues in your architecture only if your system is truly observable. Observability is particularly important in serverless because serverless embraces asynchronous communication between tiny, distributed resources. In such a system, it is very hard to track issues with a classical monitoring approach. In this presentation, we show how observability can be critical for serverless applications during the different phases of the software development lifecycle.

Transcript

  1. From Development to Production: Many Uses of Serverless Observability
    EMRAH SAMDAN | SEPTEMBER 9, 2019 | Community Day 2019
  2. @emrahsamdan Who am I?
    • Developer for 6+ years
    • Product manager for 2 years
    • VP of Product for Thundra
    • Organizing committee
    • Serverlessdays İstanbul on October 3rd!
  3. Agenda
    • Let’s define serverless (yes, once again!)
    • Is observability a buzzword or a real thing?
    • Observability challenges in serverless
    • Observability-Driven Development
    • How/when to test serverless applications
    • What to check to monitor the serverless stack
    • Troubleshooting serverless applications
  4. What’s serverless?
    Serverless computing is a cloud-computing execution model in which the cloud provider runs the server, and dynamically manages the allocation of machine resources. Pricing is based on the actual amount of resources consumed by an application, rather than on pre-purchased units of capacity.
    Wikipedia: https://en.wikipedia.org/wiki/Serverless_computing
  5. What is serverless?
    Operational construct? Things that run perfectly and that I don’t need to manage. Are Stripe and Auth0 serverless?
  6. What’s serverless?
    A doctrine, a thought model helping you deliver faster and put your focus on the value you provide to your customers.
    Ben Kehoe, Paul Johnston
  7. I agree with you all. But! All the ups can go down when you don’t pay attention to what’s really happening with serverless.
  8. Serverless Observability
    • Serverless is full of hidden traps that can harm its promise.
    • It can be very costly.
    • It can perform really poorly.
    • You need to check what’s going on.
  9. Are we ready for unknown unknowns?
    • Known knowns: things that we understand and are aware of. Follow the metric charts.
    • Known unknowns: things that we are aware of but don’t understand at a glance. Yes, there is a peak over there. Let me dig into the traces and logs.
    • Unknown knowns: things that we can understand but are not aware of. I would have fixed this if I had that metric chart :(
    • Unknown unknowns: things we neither understand nor are aware of. Things are kaputt and I have no freaking idea with what I have.
  10. Observability Challenges in Serverless
    • No access to the underlying infrastructure.
    • You either take whatever the cloud vendor provides or accept that there will be an overhead.
    • Overhead?
      ◦ Gathering intelligence (should be acceptable)
      ◦ Transporting it to where necessary (you only have the invocation lifetime to take the data out)
    • Everything is event-driven and distributed.
  11. Observability-Driven Development
    • Recall the unknown unknowns.
    • What do you need when you have to troubleshoot an unknown unknown?
    • Can an observability tool know your unknowns?
    • If you don’t know what to know, what can you do?
    • Known knowns: things that we understand and are aware of. Follow the metric charts.
    • Known unknowns: things that we are aware of but don’t understand at a glance. Yes, there is a peak over there. Let me dig into the traces and logs.
    • Unknown knowns: things that we can understand but are not aware of. I would have fixed this if I had that metric chart :(
    • Unknown unknowns: things we neither understand nor are aware of. Things are kaputt and I have no freaking idea with what I have.
  12. Observability-Driven Development
    • Not a replacement for test-driven development.
    • Think of the answers that you can give for any type of question.
    • If you are thinking about questions, request that feature from your tool.
    • Structured logging and manual instrumentation are the key.
    Retrieved from: https://dzone.com/articles/what-is-structured-logging
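
A minimal sketch of what structured logging with manual instrumentation could look like in Python. The `log_event` helper, the `handler` function and the field names (`request_id`, `order_id`) are hypothetical and not from the talk; the point is only that each log line is a JSON object that can be queried later.

```python
import json
import time


def log_event(event, **fields):
    """Emit one JSON log line; every field stays machine-searchable."""
    record = {"timestamp": time.time(), "event": event, **fields}
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line


def handler(order, request_id):
    """Hypothetical function body, manually instrumented at each step."""
    log_event("order_received", request_id=request_id, order_id=order["id"])
    total = sum(item["price"] * item["qty"] for item in order["items"])
    log_event("order_priced", request_id=request_id, order_id=order["id"], total=total)
    return total
```

Because every line carries `request_id`, the lines of one invocation can be filtered and joined later, including for questions nobody had in mind when the code was written.
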
  13. Testing Challenges in Serverless
    • Local testing is a pain.
      ◦ How do you mock the cloud resources? Is it actually correct to mock them?
      ◦ How should you test a chain of many invocations?
      ◦ How do you integrate it with CI/CD tools?
    • Integration testing with real resources is still the best option, but again, how?
  14. Integration Testing
    • Serverless != Functions
    • Test your business logic against the resources.
    • See how your messages are transformed in the flow.
    • Async events can cause problems that you can never guess.
  15. Integration Testing (Cons)
    • You’re still dealing only with known knowns.
    • Resources that are not pay-per-use.
      ◦ Setting up a test environment. Still?
    • Not with the production data.
  16. Chaos Testing Serverless Applications
    • Serverless is a great fit for chaos engineering because:
      ◦ It is distributed.
      ◦ There are lots of possible failures in an async environment.
      ◦ It is event-driven (so, poisonous events).
      ◦ Roles and permissions are so granular that access can slip away.
  17. Chaos testing on serverless: what?
    • What would happen if an inner Lambda starts to respond slowly?
    • Are you sure that you tuned the timeouts properly?
    • Test by injecting latency.
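
One way to test with injected latency, sketched under assumptions: `inner_function` stands in for the inner Lambda, and both `with_injected_latency` and `call_with_timeout` are illustrative helpers, not part of any chaos engineering tool mentioned in the talk.

```python
import concurrent.futures
import time


def with_injected_latency(func, delay_seconds):
    """Wrap func so every call is delayed, simulating a slow downstream Lambda."""
    def slow(*args, **kwargs):
        time.sleep(delay_seconds)
        return func(*args, **kwargs)
    return slow


def call_with_timeout(func, timeout_seconds, *args):
    """Run func in a worker thread; return (ok, result), giving up after the timeout."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args)
    try:
        return True, future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        return False, None
    finally:
        pool.shutdown(wait=False)  # do not block the caller on the slow call


def inner_function(x):
    """Stand-in for the inner Lambda."""
    return x * 2
```

Calling `call_with_timeout(with_injected_latency(inner_function, 0.3), 0.05, 21)` returns `(False, None)`, which is the situation the caller's timeout and fallback logic has to survive.
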
  18. MONITORING
    How large should my screen be to see the charts for thousands of functions?
  19. Serverless is more than functions, so is monitoring.
    • Issues can stay local before you notice them.
    • It is slow. Why?
      ◦ API slowdown?
      ◦ Throttle on any resource?
      ◦ Bad coding practice?
    • Invocation counts go crazy.
      ◦ Seasonal peak?
      ◦ Successful product?
      ◦ Retry storms?
  20. Monitoring Latency
    • Abnormal latency is mostly not related to the function code.
      ◦ Idly waiting for a third-party API
      ◦ Throttled resource
    • Set aggregated alerts.
      ◦ Alert on transaction duration
      ◦ Alert on function duration
      ◦ Alert on operation duration
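
As an illustration of an aggregated alert on duration, the sketch below checks a percentile across many samples instead of alerting on single invocations. The helper names, the nearest-rank percentile and the thresholds are assumptions chosen for the example.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def duration_alerts(durations_ms, thresholds_ms, pct=99):
    """Return the names whose pct-th percentile duration exceeds its threshold."""
    alerts = []
    for name, samples in durations_ms.items():
        if samples and percentile(samples, pct) > thresholds_ms[name]:
            alerts.append(name)
    return alerts
```

The same check can be applied at each level named above (transaction, function, operation) by keying `durations_ms` accordingly.
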
  21. Storm of retries and errors
    • When your code fails for some reason, your function will retry several times.
      ◦ Sync events: you should control it.
      ◦ Async events: different retry mechanisms.
      ◦ Stream-based events: risk of losing data.
    • Does this solve the problem?
    • Check:
      ◦ Iterator age
      ◦ Number of retries
      ◦ Number of errors
      ◦ Memory usage
      ◦ Cold starts
  22. Challenges of Troubleshooting
    • How to trace the distributed async events with non-aggregated traces, metrics and logs?
    • How to trace requests to external resources?
    • How to trace the async distributed events?
  23. Distributed Tracing
    • Trace the distributed transactions: a chain of multiple invocations.
    • Understand what is wrong at a glance.
    • But?! What if I have a bad coding practice in the code?
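
A rough sketch of how a chain of invocations can be tied into one trace: each function passes a trace context along inside the event it emits. The `_trace` field and the helper names are hypothetical; real tracing tools use their own headers and formats.

```python
import uuid


def start_trace():
    """Create the context for the first invocation in a transaction."""
    return {"trace_id": uuid.uuid4().hex, "span_id": uuid.uuid4().hex[:16], "parent_id": None}


def child_context(parent):
    """Context for the next invocation: same trace, new span, parent linked."""
    return {
        "trace_id": parent["trace_id"],
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"],
    }


def attach(message, context):
    """Carry the context inside the event so async consumers can continue the trace."""
    return {**message, "_trace": context}


def extract(event):
    """Continue an incoming trace, or start a new one if the event has none."""
    incoming = event.get("_trace")
    return child_context(incoming) if incoming else start_trace()
```

Every invocation in the chain then shares one `trace_id`, and the `parent_id` links are enough to rebuild the order of the hops afterwards.
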
  24. Local Tracing
    • Instrument the code itself and check the code quality.
    • Good for:
      ◦ Discovering bad coding practices
      ◦ Seeing the values of local variables in the code
      ◦ Debugging the code without breakpoints
  25. Actionable Alerts in Serverless
    • Alert on code errors
      ◦ Stack trace
      ◦ Line of code that caused it
      ◦ Values of local variables
    • Alert on latencies and timeout errors
      ◦ Slow API communications
      ◦ Slow DB interactions due to bad queries
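
A sketch of an actionable alert payload for a code error, using only the Python standard library. `build_alert` and `unit_price` are hypothetical names; the payload carries the three items listed above: stack trace, failing line and local variable values.

```python
import traceback


def build_alert(exc):
    """Turn a caught exception into an alert with stack trace, line and locals."""
    tb = exc.__traceback__
    while tb.tb_next is not None:  # walk to the frame that actually raised
        tb = tb.tb_next
    frame = tb.tb_frame
    return {
        "error": type(exc).__name__,
        "message": str(exc),
        "function": frame.f_code.co_name,
        "line": tb.tb_lineno,
        "locals": {name: repr(value) for name, value in frame.f_locals.items()},
        "stacktrace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }


def unit_price(total, quantity):
    """Hypothetical business logic that fails when quantity is 0."""
    rounded_total = round(total, 2)
    return rounded_total / quantity
```

Local values can contain sensitive data, so a real setup would likely mask or filter them before the alert leaves the function.
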
  26. How to respond to issues on serverless
    • The issue may not be in your code.
    • Check the third parties.
    • Check other metrics.
      ◦ Iterator age of streams
      ◦ Throttles on resources
    • Have some runbooks.
      ◦ Exponential backoffs to APIs, alternative APIs
      ◦ Healthy on-call structures
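
A runbook entry such as "exponential backoffs to APIs, alternative APIs" might translate into code like the sketch below. `call_with_backoff`, its parameters and the default delays are assumptions for illustration.

```python
import random
import time


def call_with_backoff(primary, fallback=None, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Retry primary with exponential backoff and jitter, then try fallback once."""
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:
            if attempt < attempts - 1:
                # Delay doubles each attempt (base, 2x, 4x, ...) plus random jitter.
                sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    if fallback is not None:
        return fallback()
    raise RuntimeError("primary failed after %d attempts and no fallback was given" % attempts)
```

The `sleep` parameter exists only so the delays can be observed or skipped in tests; the jitter keeps many retrying functions from hitting the API at the same instant.
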
  27. Key Takeaways
    • Serverless observability is not an after-production issue.
    • Observability with all three pillars aggregated is life-saving.
    • Automation is king! But get yourself ready to ask questions with ODD.
    • Sadly, no testing scenario is sufficient in serverless. Step into chaos engineering before your engineering runs into chaos!
    • Change the way you monitor your system. Look beyond functions and discover local bottlenecks with an architectural view!
    • Serverless transaction = a chain of invocations commuting between resources and APIs. Full tracing required!
    • Make your alerts actionable and start keeping runbooks for the issues that you can predict.