
1,001 postmortems: lessons learned from running complex systems at scale

Alexis Lê-Quôc

September 04, 2018

Transcript

  1. 1,001 postmortems: Lessons learned from running complex systems at scale (Alexis Lê-Quôc, @alq)

  2. Why this talk?

  3. Learn from past mistakes. Feel kinship of spirit in the face of system failure. Share what you learned with others.

  4. Metrics, traces, logs: Datadog.

  5. “Complex systems at scale”: lots of data, all day long, every day, under tight processing deadlines.

  6. (image-only slide)

  7. – Complex systems almost always run in a stable yet degraded mode. – Each incident records a noteworthy deviation from what we expect. – Luckily, only a fraction of incidents have impacted customers. (Not literally 1,001 incidents, but enough in 6 years.)

  8. 1,001 opportunities to learn: most incidents deserve a postmortem. At that rate, we got good at postmortems. Watch @bisect's talk about our process: https://dtdg.co/monitorama-eu-bisect

  9. Two perspectives: very granular (detailed narrative, detailed “why”, detailed fixes to implement) vs. very high-level (aggregates, trends).

  10. Two perspectives

  11. Missing something until... a solo effort to formalize anti-patterns into a living document. An invaluable tool for new and old hands alike.

  12. Talking about failure

  13. Talking about failure is hard

  14. Internal postmortems are hard: facing and acknowledging failure in front of (friendly) others is hard. A skill learned in a safe environment. An altruistic effort in service of collective learning.

  15. Talking publicly about failure is harder

  16. Public postmortems are harder and, imho, less satisfactory: – They serve both as a heartfelt apology and a path forward. – They capture the essence of an incident but... – They can provide too much or too little detail.

  17. Distill failures to their finest essence

  18. – Look for recurring themes in the “why” section. – Simplify: just enough context to make sense. – Forget the specifics; focus on patterns.

  19. An anthology of anti-patterns

  20. Categories: Configuration, Dependencies, Deployment, Development, Observability, Operations, Performance, Routing.

  21. Categories: Configuration, Dependencies, Deployment, Development, Observability, Operations, Performance, Routing.

  22. “Config is picked up”: we use git-based configuration and assumed that a git merge meant the configuration was live. The incident happened at the next reload, 3 weeks later. Expose a configuration identifier or hash at runtime, and assert that what runs is what was intended. (Sketch after the transcript.)

  23. Categories: Configuration, Dependencies, Deployment, Development, Observability, Operations, Performance, Routing.

  24. “Timeouts”: the load balancer had a 10 s timeout but downstream services had 60 s timeouts, so expensive requests end up dominating runtime and eating up all resources. If upstream service A has a timeout of T, its downstream services should have timeouts shorter than T. (Sketch after the transcript.)

  25. Categories: Configuration, Dependencies, Deployment, Development, Observability, Operations, Performance, Routing.

  26. “Uptime means victory”: “but it ran without issues for 3 days”, after an updated service crashed. We are wired to think that past results are an indicator of future performance. Explicit sanity checks, manual or automated, mean victory. (Sketch after the transcript.)

  27. Categories: Configuration, Dependencies, Deployment, Development, Observability, Operations, Performance, Routing.

  28. “Time”: buggy agent queuing used DST-sensitive timestamps, meaning no data was transmitted for 1 hour when we “lost” an hour. Daylight saving time, leap years, and time zones are surprisingly difficult to test thoroughly. Default to UTC everywhere but the UI. (Sketch after the transcript.)

  29. Categories: Configuration, Dependencies, Deployment, Development, Observability, Operations, Performance, Routing.

  30. “Tail of the distribution”: aggregates and percentiles mask the reality of user experience; the 0.01% slowest requests could be the most important ones. Look at the distribution as a whole, but pay special attention to the tail, and check the worst requests individually. (Sketch after the transcript.)

  31. Categories: Configuration, Dependencies, Deployment, Development, Observability, Operations, Performance, Routing.

  32. “Expensive roll-back/fwd”: there is a crash loop in the code, or a bad configuration was deployed. Do we debate what the fastest rollback/roll-forward strategy is, or do we just do it? Pick a rollback/roll-forward strategy (e.g. blue/green), invest heavily in it, and make it really good. (Sketch after the transcript.)

  33. Categories: Configuration, Dependencies, Deployment, Development, Observability, Operations, Performance, Routing.

  34. “Arbitrary resource limits”: “65,536 connections sounded reasonable”, after a load balancer brownout due to a spike and back-pressure. Or, “we set the limit to 1,024 handles 4 years ago”. Set reasonable soft limits and ideally no (or really high) hard limits; let the system hit actual resource limits (CPU, memory, network, I/O), or rigorously model and empirically validate the system to set them. (Sketch after the transcript.)

  35. Categories: Configuration, Dependencies, Deployment, Development, Observability, Operations, Performance, Routing.

  36. “Retrying too often”: 3 retries with a 5 s timeout on loss-induced connection issues meant 2-3x the load on downstream services until connectivity was fully restored. Use adaptive retries (e.g. CoDel). (Sketch after the transcript.)

  37. Meta

  38. Breathe. Stay positive. Write it down.

  39. Learn from past mistakes. Feel kinship of spirit in the face of system failure. Share what you learned with others.

  40. Thank you. We are hiring (duh!) in Paris and New York.
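
The anti-pattern slides above lend themselves to small code sketches. The sketches that follow are illustrative Python written under stated assumptions; they are not the implementations described in the talk.

Slide 22, “config is picked up”: a minimal sketch of exposing a configuration hash at runtime so a monitor or deploy pipeline can assert that what runs is what was merged. The file name, layout, and status function are assumptions for illustration.

    import hashlib
    import json
    from pathlib import Path

    CONFIG_PATH = Path("config.yaml")  # hypothetical location
    _loaded_config_hash = None         # hash of the config this process actually loaded


    def load_config() -> str:
        """Read the config and remember the hash of what was actually read."""
        global _loaded_config_hash
        raw = CONFIG_PATH.read_bytes()
        _loaded_config_hash = hashlib.sha256(raw).hexdigest()
        # ... parse and apply `raw` here ...
        return _loaded_config_hash


    def config_status() -> dict:
        """What is running vs. what is on disk; expose via a health endpoint or log line."""
        on_disk = hashlib.sha256(CONFIG_PATH.read_bytes()).hexdigest()
        return {
            "loaded_hash": _loaded_config_hash,
            "on_disk_hash": on_disk,
            "in_sync": _loaded_config_hash == on_disk,
        }


    if __name__ == "__main__":
        if not CONFIG_PATH.exists():
            CONFIG_PATH.write_text("setting: value\n")  # demo config so the sketch runs
        load_config()
        # A deploy pipeline can assert that in_sync is True after a merge, instead of
        # assuming that "git merge" means "configuration online".
        print(json.dumps(config_status(), indent=2))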
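
Slide 24, “timeouts”: a sketch of propagating a deadline so downstream calls never outlive the upstream timeout. The Deadline helper and the service URL are assumptions; the point is that each call's timeout is derived from what remains of the caller's budget.

    import time

    import requests  # any HTTP client with a timeout parameter works


    class Deadline:
        """Tracks how much of an upstream time budget remains."""

        def __init__(self, budget_seconds: float):
            self.expires_at = time.monotonic() + budget_seconds

        def remaining(self, safety_margin: float = 0.5) -> float:
            """Seconds left, minus a margin so we answer before the caller gives up."""
            left = self.expires_at - time.monotonic() - safety_margin
            if left <= 0:
                raise TimeoutError("upstream budget already exhausted")
            return left


    def fetch_profile(user_id: str, deadline: Deadline) -> dict:
        # The downstream call gets at most what is left of the upstream budget,
        # never a fixed 60 s while the load balancer gives up after 10 s.
        resp = requests.get(
            f"https://profiles.internal/users/{user_id}",  # hypothetical service
            timeout=deadline.remaining(),
        )
        resp.raise_for_status()
        return resp.json()


    # Usage: the edge handler creates Deadline(10.0) to mirror the load balancer's
    # 10 s timeout and passes it down, so nested work is bounded by the remainder.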
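
Slide 26, “uptime means victory”: a sketch of explicit, automated post-deploy sanity checks, so “it ran for 3 days” is never the evidence. The URLs, endpoints, and version string are placeholders.

    import sys

    import requests

    SERVICE_URL = "https://myservice.internal"  # hypothetical
    EXPECTED_VERSION = "2018.09.04-abc123"      # the version the deploy intended to ship


    def check(name: str, ok: bool) -> bool:
        print(f"[{'OK' if ok else 'FAIL'}] {name}")
        return ok


    def post_deploy_sanity_checks() -> bool:
        """Explicit checks that the deploy did what we think it did."""
        results = []

        # 1. The version actually running is the version we meant to deploy.
        version = requests.get(f"{SERVICE_URL}/version", timeout=5).text.strip()
        results.append(check("expected version is running", version == EXPECTED_VERSION))

        # 2. A representative end-to-end request still succeeds.
        resp = requests.get(f"{SERVICE_URL}/api/v1/ping", timeout=5)
        results.append(check("end-to-end ping succeeds", resp.status_code == 200))

        return all(results)


    if __name__ == "__main__":
        sys.exit(0 if post_deploy_sanity_checks() else 1)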
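
Slide 28, “time”: a sketch of defaulting to UTC everywhere but the UI, since a queue keyed on DST-sensitive local timestamps can silently stall or reorder data twice a year.

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo  # Python 3.9+

    # Store, queue, and order events in UTC: it is monotonic across DST transitions.
    event_time = datetime.now(timezone.utc)
    print(event_time.isoformat())

    # Convert to local time only at the display edge (the UI).
    print(event_time.astimezone(ZoneInfo("America/New_York")).isoformat())

    # A naive local timestamp is ambiguous or skips an hour when clocks change,
    # which is how queued data can quietly go missing for an hour.
    naive_local = datetime.now()  # no tzinfo: avoid for queuing and ordering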
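
Slide 30, “tail of the distribution”: a sketch of looking past the median at the extreme percentiles and at the worst individual requests. The latencies here are synthetic; in practice they would come from traces.

    import random

    # Synthetic per-request latencies in ms, plus a tiny, painful tail.
    latencies = [random.lognormvariate(3.0, 0.8) for _ in range(100_000)]
    latencies[-5:] = [4_000, 6_500, 9_000, 12_000, 30_000]
    ranked = sorted(latencies)


    def percentile(sorted_values, p):
        """Nearest-rank percentile, p in [0, 100]."""
        idx = min(len(sorted_values) - 1, round(p / 100 * (len(sorted_values) - 1)))
        return sorted_values[idx]


    # The median looks fine; the tail tells the real story.
    print("p50  :", round(percentile(ranked, 50)))
    print("p99  :", round(percentile(ranked, 99)))
    print("p99.9:", round(percentile(ranked, 99.9)))

    # And inspect the worst requests individually, not just in aggregate.
    print("worst 5 requests (ms):", [round(x) for x in ranked[-5:]])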
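
Slide 32, “expensive roll-back/fwd”: one possible shape of a cheap, rehearsed switch, here a blue/green release selected by atomically repointing a symlink. The directory layout is an assumption; the point is that rolling back is the same one-liner as rolling forward, so there is nothing to debate mid-incident.

    import os
    from pathlib import Path

    RELEASES = Path("releases")  # releases/blue and releases/green, hypothetical layout
    CURRENT = Path("current")    # the symlink that serving processes follow


    def switch_to(color: str) -> None:
        """Atomically repoint 'current' at the chosen release."""
        target = RELEASES / color
        tmp = Path("current.tmp")
        if tmp.is_symlink() or tmp.exists():
            tmp.unlink()
        tmp.symlink_to(target)
        os.replace(tmp, CURRENT)  # atomic rename on POSIX
        print(f"now serving from {target}")


    # switch_to("green") rolls forward; switch_to("blue") rolls back.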
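
Slide 34, “arbitrary resource limits”: a Unix-only sketch using Python's resource module to raise the soft file-descriptor limit to whatever the platform allows, instead of freezing an arbitrary number into the configuration and rediscovering it during a brownout years later.

    import resource

    # Inspect the current file-descriptor limits.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open files: soft={soft}, hard={hard}")

    # Rather than baking in an arbitrary number (1,024 handles, 65,536 connections),
    # raise the soft limit to the hard limit and let actual resources
    # (CPU, memory, network, I/O) be the constraint.
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    print("soft limit now", resource.getrlimit(resource.RLIMIT_NOFILE)[0])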
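
Slide 36, “retrying too often”: the talk points at adaptive retries (e.g. CoDel); the sketch below is a simpler stand-in, a retry budget plus jittered exponential backoff, which caps the extra load retries can add to an already struggling downstream service.

    import random
    import time


    class RetryBudget:
        """Allow retries only in proportion to recent requests (about 10% here),
        so a connectivity blip cannot multiply load on downstream services."""

        def __init__(self, ratio: float = 0.1):
            self.ratio = ratio
            self.tokens = 1.0

        def on_request(self) -> None:
            self.tokens = min(100.0, self.tokens + self.ratio)

        def can_retry(self) -> bool:
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False


    def call_with_retries(do_call, budget: RetryBudget, attempts: int = 3):
        budget.on_request()
        for attempt in range(attempts):
            try:
                return do_call()
            except ConnectionError:
                if attempt == attempts - 1 or not budget.can_retry():
                    raise
                # Jittered exponential backoff instead of immediate, fixed retries.
                time.sleep(random.uniform(0, 0.2 * 2 ** attempt))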