Tracing Production Services at Stripe

If a microservice falls down in the middle of a server farm, does my pager make a sound?

If your service is automatically monitored, then the answer is “yes!”. But now that you’ve been paged and roused from your slumber… what happens next? Do you stumble to your computer, bleary-eyed, trying to find the elusive problem by cross-referencing dashboards and server logs across eleven different browser tabs? Or do you have better tools that you can use?

Fortunately, there’s a quick and easy way to get high-resolution, request-level traces for inspecting your services. At Stripe, we built an open-source distributed tracing and monitoring pipeline that allows us to inspect each step of an HTTP request and diagnose the root causes of errors, no matter how obscure they may be. And with a monitoring pipeline that unifies metrics, logs, and traces, you can live the observability dream: the right data, in the right form, right when you need it.
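To make the idea of unifying metrics, logs, and traces concrete, here is a minimal sketch in Python. It is not the pipeline's actual data model: the `Span` class, its field names, and the helper `group_by_trace` are all hypothetical, chosen only to show how one record per request step could be read as a metric, as a log line, or as part of a trace.

```python
from dataclasses import dataclass, field
from collections import defaultdict


@dataclass
class Span:
    """Hypothetical unified record: one timed step of a request."""
    trace_id: str          # shared by every step of one request
    name: str              # which step this was
    start_ms: int
    end_ms: int
    tags: dict = field(default_factory=dict)   # free-form context

    def as_metric(self):
        # Metric view: a name plus a number that can be aggregated.
        return (f"{self.name}.duration_ms", self.end_ms - self.start_ms)

    def as_log(self):
        # Log view: the same event with its longer context attached.
        context = " ".join(f"{k}={v}" for k, v in sorted(self.tags.items()))
        return f"{self.start_ms} {self.name} trace={self.trace_id} {context}".rstrip()


def group_by_trace(spans):
    # Trace view: join spans on trace_id, ordered by start time.
    traces = defaultdict(list)
    for span in sorted(spans, key=lambda s: s.start_ms):
        traces[span.trace_id].append(span.name)
    return dict(traces)


# Made-up sample data for illustration only.
spans = [
    Span("t1", "db.query", 5, 45, {"table": "charges"}),
    Span("t1", "http.request", 0, 50, {"status": 500}),
    Span("t2", "http.request", 10, 20, {"status": 200}),
]
```

With this shape, `spans[0].as_metric()` yields a duration that a dashboard could aggregate, `spans[1].as_log()` yields a searchable line, and `group_by_trace(spans)` reassembles each request from its steps.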

Aditya Mukerjee

May 22, 2017

Transcript

  1. Tracing Production Services at Stripe Aditya Mukerjee Systems Engineer at Stripe @chimeracoder
  2. Tracing is about more than HTTP requests @chimeracoder

  3. None
  4. https://veneur.org

  5. @chimeracoder

  6. It’s 3:07 AM @chimeracoder

  7. Dashboard Count: 1 @chimeracoder

  8. Dashboard Count: 2 @chimeracoder

  9. Dashboard Count: 3 @chimeracoder

  10. Dashboard Count: 4 @chimeracoder

  11. @chimeracoder

  12. If you need to look at logs, there’s a gap in your observability tools @chimeracoder
  13. Dashboard Count: 5 @chimeracoder

  14. Metrics/dashboards? Logs? Request traces? No context! Hard to aggregate! Require planning! @chimeracoder
  15. Monitoring information is only as good as developers’ ability to predict the future @chimeracoder
  16. @chimeracoder

  17. @chimeracoder

  18. @chimeracoder

  19. @chimeracoder Application

  20. What’s the difference? • If you squint, it’s hard to tell them apart • A log is a metric with “longer” information • A trace is a metric that allows “inner joins” @chimeracoder
  21. What if we could have all three, all the time? @chimeracoder
  22. Standard Sensor Format @chimeracoder

  23. @chimeracoder

  24. @chimeracoder

  25. @chimeracoder Application

  26. Tradeoffs: Stacking the Deck @chimeracoder

  27. Distributed Collection @chimeracoder host1 host2 host3 Dashboard Tool

  28. Aggregation @chimeracoder host1 host2 host3 Global Aggregator Dashboard Tool

  29. Distributed Aggregation @chimeracoder host1 host2 host3 Dashboard Tool

  30. Stacking the Deck Histogram: t-digests @chimeracoder

  31. Let’s build the world we want to see @chimeracoder

  32. It’s 3:07 AM @chimeracoder

  33. @chimeracoder

  34. Veneur in 2017 • High availability • Host-local metrics • Global aggregate metrics • Probabilistic data structures • … and more! Veneur in 2018 • Automatic cardinality detection • Cross-dashboard integration • Unified client instrumentation • … help us decide the rest! @chimeracoder
  35. Thank you! https://veneur.org #veneur on Freenode Aditya Mukerjee @chimeracoder
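
The distributed aggregation idea in slides 27 to 30 can be sketched roughly as follows. This is an illustration under stated assumptions, not Veneur's implementation: `LocalAggregator` and `approx_percentile` are hypothetical names, and a fixed-width bucket histogram stands in for the t-digest named on slide 30. The point it shows is that each host keeps a small mergeable summary, so a global view can be built by merging summaries rather than shipping every raw sample.

```python
from collections import Counter


class LocalAggregator:
    """Hypothetical per-host aggregator holding mergeable summaries."""

    BUCKET_MS = 10  # bucket width; a simple stand-in for a sketch such as a t-digest

    def __init__(self):
        self.counts = Counter()      # counter name -> total
        self.histogram = Counter()   # bucket lower bound (ms) -> number of samples

    def incr(self, name, value=1):
        self.counts[name] += value

    def timing(self, duration_ms):
        bucket = (duration_ms // self.BUCKET_MS) * self.BUCKET_MS
        self.histogram[bucket] += 1

    def merge(self, other):
        # Merging is plain addition, so the global view never needs raw samples.
        self.counts.update(other.counts)
        self.histogram.update(other.histogram)


def approx_percentile(agg, q):
    """Return the lower bound of the bucket that contains quantile q."""
    total = sum(agg.histogram.values())
    if total == 0:
        raise ValueError("no samples recorded")
    threshold = q * total
    seen = 0
    for bucket in sorted(agg.histogram):
        seen += agg.histogram[bucket]
        if seen >= threshold:
            return bucket
    return max(agg.histogram)


# Two hosts record made-up timings locally.
host1, host2 = LocalAggregator(), LocalAggregator()
for ms in (12, 18, 25):
    host1.timing(ms)
    host1.incr("requests")
for ms in (31, 250):
    host2.timing(ms)
    host2.incr("requests")

# A global aggregator merges the per-host summaries.
global_view = LocalAggregator()
global_view.merge(host1)
global_view.merge(host2)
```

Percentiles cannot be averaged across hosts, which is why the summary itself has to be mergeable. A fixed-bucket histogram trades accuracy for simplicity; a t-digest plays the same role with better accuracy at the tails.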