Unified Observability of Distributed Systems

“If a microservice falls down in the middle of a server farm, does my pager make a sound?”

If your service is automatically monitored, then the answer is “yes!”. But now that you’ve been paged and roused from your slumber… what happens next? Do you stumble to your computer, bleary-eyed, trying to find the elusive problem by cross-referencing dashboards and server logs across eleven different browser tabs? Or do you have better tools that you can use to integrate your monitoring data?

Unless you’re an engineer at a massive company like Google, the answer is probably “no”: most companies don’t have the resources to build all their own monitoring tools in-house. And when monitoring relies on a mix of third-party and open-source tools, there will always be gaps between them.

Fortunately, there’s a way for teams to get the best of both worlds: high-resolution visibility into their systems, without having to write the entire monitoring stack themselves. We built an open-source distributed tracing and monitoring pipeline that lets us inspect each step of an HTTP request and diagnose the root causes of errors, while leveraging the same open-source and third-party monitoring platforms you’re already used to. And with a monitoring pipeline that unifies metrics, logs, and traces, you can live the observability dream: the right data, in the right form, right when you need it.

Aditya Mukerjee

May 10, 2018
Transcript

  1. Unified Observability of Distributed
    Systems
    Aditya Mukerjee
    Systems Engineer at Stripe
    @chimeracoder

  3. Why are we here?

  4. It’s 3:07 AM

  5. Dashboard Count: 1

  6. Dashboard Count: 2

  7. Dashboard Count: 3

  9. Dashboard Count: 4

  12. What tools can we use?
    Metrics/dashboards? No context!
    Logs? Hard to aggregate!
    Request traces? Require planning!

  13. Monitoring information is only as good as
    developers’ ability to predict the future

  17. Application (diagram)

  18. What’s the difference?
    •If you squint, it’s hard to tell them apart
    •A log is a metric with “longer” information
    •A trace is a metric that allows “inner joins”
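
One way to picture the "squint" argument above is a single record type from which all three views are derived. The sketch below is only an illustration: the `Span` class, its field names, and the helper functions are hypothetical and are not the schema of any real format or library.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Span:
    """One timed unit of work. All field names here are hypothetical."""
    trace_id: int
    span_id: int
    parent_id: int  # 0 means "root span"
    service: str
    name: str
    duration_ms: float
    error: bool = False
    tags: dict = field(default_factory=dict)


def to_metrics(spans):
    """Metric view: drop the per-request detail, keep only counts."""
    counts = Counter()
    for s in spans:
        counts[(s.service, s.name, "error" if s.error else "ok")] += 1
    return counts


def to_log_line(span):
    """Log view: the same record, rendered with its "longer" detail."""
    tags = " ".join(f"{k}={v}" for k, v in sorted(span.tags.items()))
    line = (f"trace={span.trace_id} {span.service}.{span.name} "
            f"{span.duration_ms}ms error={span.error} {tags}")
    return line.rstrip()


def children_of(spans, parent):
    """Trace view: join spans on trace_id and parent_id."""
    return [s for s in spans
            if s.trace_id == parent.trace_id and s.parent_id == parent.span_id]


spans = [
    Span(1, 10, 0, "api", "POST /charge", 120.0),
    Span(1, 11, 10, "db", "insert", 35.5, tags={"table": "charges"}),
    Span(1, 12, 10, "fraud", "score", 80.0, error=True),
]
```

Here `to_metrics(spans)` gives the dashboard-style counts, `to_log_line(spans[1])` gives a greppable line for one event, and `children_of(spans, spans[0])` recovers the call tree under the root request.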

  19. What if we could have all three, all the time?

  20. Standard Sensor Format

  23. Application (diagram)

  24. Integrated Views

  28. Flexibility and Data Migrations

  29. Application (diagram with nodes A, B, C)

  30. “Because we’re in control of our pipeline, we could
    add a new data backend or migrate vendors
    without having to touch our application code at
    all.”
    “Having ownership over our pipeline gave us trust
    in our data. It made us confident that we
    hadn’t overlooked any parts of the migration
    process.”

  31. Tradeoffs: Stacking the Deck

  32. Distributed Collection
    host1
    host2
    host3
    Dashboard Tool

  33. Aggregation
    host1
    host2
    host3
    Global Aggregator
    Dashboard Tool

  34. Distributed Aggregation
    host1
    host2
    host3
    Dashboard Tool

  35. Stacking the Deck Histogram: t-digests
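
The aggregation slides above only work for percentiles if each host's summary can be merged without going back to the raw samples. The sketch below illustrates that property with a plain fixed-bucket histogram, which is a simpler stand-in and not a t-digest; the bucket bounds, host data, and function names are all made up for the example.

```python
import bisect

# Hypothetical bucket upper bounds, in milliseconds.
BOUNDS = [10, 50, 100, 500, 1000]


def local_histogram(samples):
    """Per-host step: count samples per bucket (last slot is overflow)."""
    counts = [0] * (len(BOUNDS) + 1)
    for x in samples:
        counts[bisect.bisect_left(BOUNDS, x)] += 1
    return counts


def merge(histograms):
    """Global step: bucket counts simply add, host by host."""
    return [sum(col) for col in zip(*histograms)]


def quantile_upper_bound(counts, q):
    """Smallest bucket bound with at least a fraction q of samples at or below it."""
    target = q * sum(counts)
    running = 0
    for i, c in enumerate(counts):
        running += c
        if running >= target:
            return BOUNDS[i] if i < len(BOUNDS) else float("inf")
    return float("inf")


host1 = [5, 7, 8, 9, 900]      # one slow request
host2 = [20, 30, 40, 45, 48]
host3 = [60, 70, 80, 90, 95]

merged = merge(local_histogram(h) for h in (host1, host2, host3))
```

Merging the three per-host histograms gives exactly the histogram of the pooled samples, so `quantile_upper_bound(merged, 0.99)` reflects the one slow request on `host1`. Averaging per-host percentiles would not have that guarantee. A t-digest is also mergeable, but uses adaptive clusters instead of fixed buckets.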

  36. Trying out Veneur
    •Free and open source! http://github.com/stripe/veneur
    •Six-week release cycle
    •Drop-in support for statsd, Graphite, Datadog, SignalFx, Prometheus, and more
    •Native Kubernetes support
    •Public images on Docker Hub
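
To give a feel for what "drop-in support for statsd" means on the client side, here is a minimal sketch of emitting a metric in the statsd line protocol over UDP. The function names are hypothetical, and the host and port are placeholders: use whatever address your listener is actually configured with.

```python
import socket


def statsd_line(name, value, kind="c", tags=None):
    """Format one metric in the statsd line protocol: name:value|type.

    The trailing |#k:v,... tag section is the DogStatsD extension,
    not part of plain statsd.
    """
    line = f"{name}:{value}|{kind}"
    if tags:
        line += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return line


def send(line, host="127.0.0.1", port=8126):
    """Fire-and-forget UDP send. Host and port are placeholder assumptions."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))
```

For example, `send(statsd_line("checkout.requests", 1))` would emit a counter increment, and `statsd_line("checkout.latency", 35.5, kind="ms", tags={"region": "us"})` formats a tagged timing sample. Because the client only speaks this wire format, the thing listening on the other end can be swapped without changing application code.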

  40. Veneur in 2017
    • High availability
    • Host-local metrics
    • Global aggregate metrics
    • Sketching data structures
    • … and more!
    Veneur in 2018
    • Automatic cardinality detection
    • Expanded cross-dashboard integration
    • Unified client instrumentation
    • … help us decide the rest!

  41. Let’s build the world we want to see

  42. Thank you!
    https://github.com/stripe/veneur
    #veneur on Freenode
    Aditya Mukerjee
    @chimeracoder