Tracing Production Services at Stripe

If a microservice falls down in the middle of a server farm, does my pager make a sound?

If your service is automatically monitored, then the answer is “yes!”. But now that you’ve been paged and roused from your slumber… what happens next? Do you stumble to your computer, bleary-eyed, trying to find the elusive problem by cross-referencing dashboards and server logs across eleven different browser tabs? Or do you have better tools that you can use?

Fortunately, there’s a quick and easy way to get high-resolution, request-level traces for inspecting your services. At Stripe, we built an open-source distributed tracing and monitoring pipeline that lets us inspect each step of an HTTP request and diagnose the root causes of errors, no matter how obscure they may be. And with a monitoring pipeline that unifies metrics, logs, and traces, you can live the observability dream: the right data, in the right form, right when you need it.

Aditya Mukerjee

May 22, 2017

Transcript

  1. Tracing Production Services at Stripe
    Aditya Mukerjee
    Systems Engineer at Stripe
    @chimeracoder

  2. Tracing is about more than HTTP requests

  4. https://veneur.org

  6. It’s 3:07 AM

  7. Dashboard Count: 1

  8. Dashboard Count: 2

  9. Dashboard Count: 3

  10. Dashboard Count: 4

  12. If you need to look at logs, there’s a gap in your observability tools

  13. Dashboard Count: 5

  14. Metrics/dashboards? No context!
    Logs? Hard to aggregate!
    Request traces? Require planning!

  15. Monitoring information is only as good as developers’ ability to predict the future

  19. Application

  20. What’s the difference?
    • If you squint, it’s hard to tell them apart
    • A log is a metric with “longer” information
    • A trace is a metric that allows “inner joins”
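
Slide 20's framing can be made concrete with a small sketch. The Python below is hypothetical and is not Veneur's Standard Sensor Format: the `Span` fields and the helper names are assumptions chosen for illustration. It shows one record type from which a metric (a duration), a log (the same record with a longer message), and a trace (records joined on a shared trace ID) can all be read.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Span:
    """Hypothetical unified record; field names are illustrative only."""
    trace_id: int
    span_id: int
    parent_id: Optional[int]  # None marks the root span of a trace
    name: str
    start_ms: int
    end_ms: int
    message: str = ""
    tags: Dict[str, str] = field(default_factory=dict)


def as_metric(span):
    """Metric view: a name and a number (here, duration in milliseconds)."""
    return span.name + ".duration_ms", span.end_ms - span.start_ms


def as_log(span):
    """Log view: the same record with its "longer" free-form message."""
    return f"{span.start_ms} {span.name}: {span.message}"


def as_traces(spans):
    """Trace view: group spans by their shared trace_id (the "inner join")."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    return dict(traces)


# Made-up sample data: two requests, one of which touches a database.
spans = [
    Span(1, 10, None, "api.charge", 0, 120, "charge request received"),
    Span(1, 11, 10, "db.query", 5, 95, "SELECT took longer than expected"),
    Span(2, 20, None, "api.refund", 50, 70, "refund request received"),
]
```

Under these assumptions, `as_metric(spans[1])` yields the duration of the database call, while `as_traces(spans)[1]` returns both spans that belong to the first request.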

  21. What if we could have all three, all the time?

  22. Standard Sensor Format

  25. Application

  26. Tradeoffs: Stacking the Deck

  27. Distributed Collection
    host1
    host2
    host3
    Dashboard Tool

  28. Aggregation
    host1
    host2
    host3
    Global Aggregator
    Dashboard Tool

  29. Distributed Aggregation
    host1
    host2
    host3
    Dashboard Tool

  30. Stacking the Deck Histogram: t-digests
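
Slides 27 to 30 hinge on one property: a per-host summary has to be mergeable before a global aggregator can compute accurate percentiles. The sketch below is a simplified stand-in, not a t-digest and not Veneur's implementation. It uses fixed-width buckets (an assumption made for brevity), and every name and sample value in it is hypothetical. It illustrates that merging per-host summaries reproduces the summary of the combined data, whereas averaging per-host percentiles does not give the global percentile.

```python
from collections import Counter

BUCKET_MS = 10  # assumed fixed bucket width; a real t-digest adapts its centroids


def summarize(latencies_ms):
    """Host-local step: count observations per bucket."""
    return Counter(int(v // BUCKET_MS) for v in latencies_ms)


def merge(summaries):
    """Global step: bucket counts from different hosts simply add."""
    total = Counter()
    for s in summaries:
        total.update(s)
    return total


def percentile(summary, q):
    """Upper edge of the bucket containing the q-th quantile (0 < q <= 1)."""
    target = q * sum(summary.values())
    seen = 0
    for bucket in sorted(summary):
        seen += summary[bucket]
        if seen >= target:
            return (bucket + 1) * BUCKET_MS
    raise ValueError("empty summary")


# Made-up latencies in milliseconds for two hosts.
host1 = [12, 15, 18, 22, 25] * 20   # 100 fast requests
host2 = [14, 16, 19, 21, 950] * 2   # 10 requests, two of them very slow
host_summaries = [summarize(host1), summarize(host2)]

merged = merge(host_summaries)
p99_global = percentile(merged, 0.99)
p99_averaged = sum(percentile(s, 0.99) for s in host_summaries) / 2
```

With this sample data, the merged summary places the 99th percentile in the bucket that holds the 950 ms requests, while the average of the two per-host values lands on a latency that neither host observed.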

  31. Let’s build the world we want to see

  32. It’s 3:07 AM

  34. Veneur in 2017
    • High availability
    • Host-local metrics
    • Global aggregate metrics
    • Probabilistic data structures
    • … and more!
    Veneur in 2018
    • Automatic cardinality detection
    • Cross-dashboard integration
    • Unified client instrumentation
    • … help us decide the rest!

  35. Thank you!
    https://veneur.org
    #veneur on Freenode
    Aditya Mukerjee
    @chimeracoder