Making Service Launches Boring with Distributed Tracing

Karthik Kumar
September 18, 2019

I gave a talk at the SF Site Reliability Engineering Meetup. The talk focused on how incorporating distributed tracing while developing a service can help ensure a successful launch. It also covered a few tracing best practices that Lightstep uses internally.

Transcript

  1. Making Service Launches Boring with Tracing. Karthik Kumar, @k4rkum, LightStep Inc.

  2. “My team has Distributed Tracing, I just don’t know when it’s useful” -Francisco (Last month’s SRE meetup)

  3. Distributed Tracing: What is tracing? What are some best practices? When is it useful?

  4. Disclaimer

  5. What are traces? Logs Traces = CONTEXT Span Trace

  6. How do I get traces?

  7. What can I do with traces?

  8. When is tracing useful?

  9. Scenario: launching a new service

  10. Scenario: launching a new service

  11. Scenario: launching a new service

  12. Scenario: my service is crashing

  13. Scenario: my service is crashing

  14. Scenario: my service is crashing

  15. Scenario: my service isn’t doing its job

  16. Scenario: my service isn’t doing its job

  17. Scenario: my service is getting DDoS-ed

  18. Scenario: my service is getting DDoS-ed

  19. Scenario: my service is getting DDoS-ed

  20. Scenario: I want to track service SLOs: my service’s SLO, my provider’s SLO

  21. Scenario: I want to improve instrumentation

  22. Tracing Best Practices

  23. Tip #1: Trace close to business value. Identify important transactions. Identify critical path of request. Add traces!

  24. Critical path exposes bottlenecks

  25. Tip #2: Trace communication libraries

  26. Tip #3: Trace your dependencies. Identify downstream services/data store drivers/platform SDKs/3rd party APIs. Add or adopt traces!

  27. PaaS Down Detector

  28. Learn from our mistakes

  29. Tip #4: Add useful tags. Version tags to track deployments. Feature flags tags to track config changes. Host identifier tags to correlate with machine metrics. Customer identifier tags to correlate with business value.

  30. Summary: When is distributed tracing valuable? Rolling out new services. Finding and fixing issues. Identifying optimizations.

  31. Summary: What are some distributed tracing best practices? Trace close to your caller. Trace communication layer. Trace dependencies. Add useful tags.

  32. Questions? https://lightstep.com/careers @k4rkum

  33. Appendix

  34. Metrics

  35. Analyzing Metrics: Set thresholds and alerts. Look for correlated variance across metrics. Aggregate across dimensions*

  36. Logs

  37. Analyzing Logs: Complex, full-text search. Granular analysis of rare events.

  38. Metrics & Logs are not enough. Metrics: Scoped to a single system; vulnerable to high-cardinality tags. Logs: Scoped to a single system; cost scales with usage.

  39. Life is a box of trade-offs (columns: Logs, Metrics, Traces). Cost scales gracefully: – ✓ ✓. Accounts for all data (i.e., unsampled): ✓ ✓ –. Immune to cardinality: ✓ – ✓.
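
Slide 29 (Tip #4) lists four kinds of tags worth attaching to spans: version, feature flag, host identifier, and customer identifier. The sketch below shows roughly what that could look like in code. It is not Lightstep's API or any real tracing library: `Span`, `Tracer`, `start_span`, `set_tag`, `handle_checkout`, and every tag key and value are hypothetical names chosen for illustration, and the tracer only keeps finished spans in memory instead of exporting them.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Span:
    """One timed operation; spans sharing a trace_id form a trace."""
    name: str
    trace_id: str
    parent: Optional["Span"] = None
    tags: Dict[str, str] = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0

    def set_tag(self, key: str, value: str) -> None:
        self.tags[key] = value


class Tracer:
    """Hypothetical in-memory tracer; a real one would export spans."""

    def __init__(self) -> None:
        self.finished: List[Span] = []
        self._active: Optional[Span] = None

    @contextmanager
    def start_span(self, name: str) -> Iterator[Span]:
        parent = self._active
        # A child span inherits its parent's trace_id; a root span gets a new one.
        trace_id = parent.trace_id if parent else uuid.uuid4().hex
        span = Span(name, trace_id, parent, start=time.monotonic())
        self._active = span
        try:
            yield span
        finally:
            span.end = time.monotonic()
            self._active = parent
            self.finished.append(span)


def handle_checkout(tracer: Tracer, customer_id: str) -> None:
    # Hypothetical request handler; all tag values are made up.
    with tracer.start_span("checkout") as span:
        span.set_tag("service.version", "1.4.2")   # track deployments
        span.set_tag("feature.new_cart", "on")     # track config changes
        span.set_tag("host", "web-01")             # correlate with machine metrics
        span.set_tag("customer.id", customer_id)   # correlate with business value
        # Tip #3: a downstream dependency gets its own child span.
        with tracer.start_span("charge_card") as child:
            child.set_tag("dependency", "payments-api")


tracer = Tracer()
handle_checkout(tracer, "cust-42")
```

After the call, `tracer.finished` holds the `charge_card` span followed by the `checkout` span, both with the same `trace_id`. With tags like these on every span, a trace can be filtered by deployment version or by customer when investigating one of the scenarios from the talk.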