Making Service Launches Boring with Tracing
Karthik Kumar
@k4rkum
LightStep Inc.
Slide 2
Slide 2 text
“My team has Distributed Tracing, I just don’t know when it’s useful”
-Francisco
(Last month’s SRE meetup)
Slide 3
Slide 3 text
Distributed Tracing
What is tracing?
What are some best practices?
When is it useful?
Slide 4
Slide 4 text
Disclaimer
Slide 5
Slide 5 text
What are traces?
Logs
Traces
=
CONTEXT
Span
Trace
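Reading this slide as "a trace is logs plus context", the relationship between a span and a trace can be sketched as a small data model. This is an illustrative sketch only: the `Span` class, its field names, and `group_into_traces` are hypothetical and not taken from any particular tracing library.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """One timed operation plus the context that ties it to a trace (hypothetical model)."""
    operation: str
    trace_id: str                      # shared by every span in the same request
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None    # None marks the root span
    start_ms: float = 0.0
    end_ms: float = 0.0
    logs: List[str] = field(default_factory=list)  # ordinary log lines, now with context

    def child(self, operation: str, start_ms: float, end_ms: float) -> "Span":
        """Create a child span that inherits this span's trace_id."""
        return Span(operation, self.trace_id, parent_id=self.span_id,
                    start_ms=start_ms, end_ms=end_ms)


def group_into_traces(spans: List[Span]) -> Dict[str, List[Span]]:
    """A trace is simply all spans that share a trace_id."""
    traces: Dict[str, List[Span]] = {}
    for span in spans:
        traces.setdefault(span.trace_id, []).append(span)
    return traces


root = Span("checkout", trace_id="t1", start_ms=0, end_ms=120)
db = root.child("db.query", start_ms=10, end_ms=90)
db.logs.append("slow query")
other = Span("login", trace_id="t2")
traces = group_into_traces([root, db, other])
```

Here `traces["t1"]` holds both the `checkout` span and its `db.query` child, while `login` lands in a separate trace.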
Slide 6
Slide 6 text
How do I get traces?
Slide 7
Slide 7 text
What can I do with traces?
Slide 8
Slide 8 text
When is tracing useful?
Slide 9
Slide 9 text
Scenario: launching a new service
Slide 10
Slide 10 text
Scenario: launching a new service
Slide 11
Slide 11 text
Scenario: launching a new service
Slide 12
Slide 12 text
Scenario: my service is crashing
Slide 13
Slide 13 text
Scenario: my service is crashing
Slide 14
Slide 14 text
Scenario: my service is crashing
Slide 15
Slide 15 text
Scenario: my service isn’t doing its job
Slide 16
Slide 16 text
Scenario: my service isn’t doing its job
Slide 17
Slide 17 text
Scenario: my service is getting DDoS-ed
Slide 18
Slide 18 text
Scenario: my service is getting DDoS-ed
Slide 19
Slide 19 text
Scenario: my service is getting DDoS-ed
Slide 20
Slide 20 text
Scenario: I want to track service SLOs
my service’s SLO
my provider’s SLO
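One way to compare the two SLOs is to compute compliance separately from the span durations of your own service and of the provider calls it makes. The sketch below assumes latency-based SLOs; the function name, thresholds, and sample durations are made up for illustration.

```python
from typing import List, Optional


def slo_compliance(durations_ms: List[float], threshold_ms: float) -> Optional[float]:
    """Fraction of spans that finished within the latency threshold."""
    if not durations_ms:
        return None  # no data, so compliance is undefined
    within = sum(1 for d in durations_ms if d <= threshold_ms)
    return within / len(durations_ms)


# Hypothetical span durations pulled from traces.
my_service_ms = [120, 180, 250, 90]    # spans of my service's handler
provider_ms = [40, 60, 300, 45, 50]    # spans wrapping calls to the provider

my_slo = slo_compliance(my_service_ms, threshold_ms=200)
provider_slo = slo_compliance(provider_ms, threshold_ms=100)
```

With these sample numbers `my_slo` is 0.75 and `provider_slo` is 0.8. Because both figures come from the same traces, a breach in the first can be checked against the second.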
Slide 21
Slide 21 text
Scenario: I want to improve instrumentation
Slide 22
Slide 22 text
Tracing Best Practices
Slide 23
Slide 23 text
Tip #1: Trace close to business value
Identify important transactions
Identify critical path of request
Add traces!
Slide 24
Slide 24 text
Critical path exposes bottlenecks
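A simplified way to find the critical path in a single trace is to start at the root span and keep following the child that finishes last. This is a heuristic sketch, not how any specific tracing product computes it; the `Span` tuple and the sample trace are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional


class Span(NamedTuple):
    span_id: str
    parent_id: Optional[str]
    operation: str
    start_ms: float
    end_ms: float


def critical_path(spans: List[Span]) -> List[Span]:
    """Walk from the root, always descending into the child that ends last."""
    children: Dict[str, List[Span]] = {}
    root: Optional[Span] = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children.setdefault(span.parent_id, []).append(span)
    if root is None:
        return []
    path = [root]
    current = root
    while children.get(current.span_id):
        # The latest-finishing child is the one the parent was waiting on.
        current = max(children[current.span_id], key=lambda s: s.end_ms)
        path.append(current)
    return path


trace = [
    Span("1", None, "api", 0, 200),
    Span("2", "1", "auth", 5, 30),
    Span("3", "1", "render", 40, 195),
    Span("4", "3", "cache", 45, 60),
    Span("5", "3", "db", 50, 180),
]
path = [s.operation for s in critical_path(trace)]
```

For this sample trace the path is `api`, `render`, `db`, which points at the database call as the place to look for a bottleneck.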
Slide 25
Slide 25 text
Tip #2: Trace communication libraries
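Tracing the communication layer usually means the client library injects trace context into outgoing requests and the server library extracts it, so application code gets linked spans without extra work. A minimal sketch, assuming context travels in HTTP-style headers; the header names and helper functions are hypothetical, not a real library's API.

```python
import uuid
from typing import Dict, Optional, Tuple

# Hypothetical header names chosen for this sketch.
TRACE_HEADER = "x-trace-id"
PARENT_HEADER = "x-parent-span-id"


def inject(headers: Dict[str, str], trace_id: str, span_id: str) -> Dict[str, str]:
    """Client side: copy the headers and add the caller's trace context."""
    out = dict(headers)
    out[TRACE_HEADER] = trace_id
    out[PARENT_HEADER] = span_id
    return out


def extract(headers: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Server side: continue the caller's trace, or start a new one if none was sent."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    return trace_id, headers.get(PARENT_HEADER)


outgoing = inject({"accept": "application/json"}, trace_id="abc123", span_id="span-1")
trace_id, parent_id = extract(outgoing)
fresh_trace_id, fresh_parent = extract({})  # request arrived without context
```

Putting `inject` and `extract` inside the shared RPC or HTTP wrapper means every service that uses the wrapper participates in traces by default.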
Slide 26
Slide 26 text
Tip #3: Trace your dependencies
Identify downstream services, data store drivers, platform SDKs, and third-party APIs
Add or adopt traces!
Slide 27
Slide 27 text
PaaS Down Detector
Slide 28
Slide 28 text
Learn from our mistakes
Slide 29
Slide 29 text
Tip #4: Add useful tags
Version tags to track deployments
Feature flag tags to track config changes
Host identifier tags to correlate with machine metrics
Customer identifier tags to correlate with business value
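Tags like these pay off when spans are grouped by them. The sketch below, using plain dictionaries as stand-ins for spans, compares error rates across a version tag to spot a bad deployment; the tag names, span shape, and `error_rate_by_tag` helper are assumptions for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, List


def error_rate_by_tag(spans: List[Dict[str, Any]], tag: str) -> Dict[str, float]:
    """Group spans by one tag's value and return the error rate per value."""
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for span in spans:
        value = span["tags"].get(tag, "unknown")
        totals[value] += 1
        if span["error"]:
            errors[value] += 1
    return {value: errors[value] / totals[value] for value in totals}


# Hypothetical spans carrying version, host, and customer tags.
spans = [
    {"tags": {"version": "v1", "host": "a", "customer": "acme"}, "error": False},
    {"tags": {"version": "v1", "host": "b", "customer": "acme"}, "error": False},
    {"tags": {"version": "v2", "host": "a", "customer": "initech"}, "error": True},
    {"tags": {"version": "v2", "host": "b", "customer": "acme"}, "error": False},
]
rates = error_rate_by_tag(spans, "version")
```

Here `rates` is `{"v1": 0.0, "v2": 0.5}`; the same function works unchanged for the `host` or `customer` tag.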
Slide 30
Slide 30 text
Summary
When is distributed tracing valuable?
Rolling out new services
Finding and fixing issues
Identifying optimizations
Slide 31
Slide 31 text
Summary
What are some distributed tracing best practices?
Trace close to your caller
Trace communication layer
Trace dependencies
Add useful tags
Slide 32
Slide 32 text
Questions?
https://lightstep.com/careers
@k4rkum
Slide 33
Slide 33 text
Appendix
Slide 34
Slide 34 text
Metrics
Slide 35
Slide 35 text
Analyzing Metrics
Set thresholds and alerts
Look for correlated variance across metrics
Aggregate across dimensions*
Slide 36
Slide 36 text
Logs
Slide 37
Slide 37 text
Analyzing Logs
Complex, full-text search
Granular analysis of rare events
Slide 38
Slide 38 text
Metrics & Logs are not enough
Metrics
Scoped to a single system
Vulnerable to high-cardinality tags
Logs
Scoped to a single system
Cost scales with usage
Slide 39
Slide 39 text
Life is a box of trade-offs
                                         Logs   Metrics   Traces
Cost scales gracefully                    –       ✓         ✓
Accounts for all data (i.e., unsampled)   ✓       ✓         –
Immune to cardinality                     ✓       –         ✓