Slide 1

Slide 1 text

Making Service Launches Boring with Tracing
Karthik Kumar
@k4rkum
LightStep Inc.

Slide 2

Slide 2 text

“My team has Distributed Tracing, I just don’t know when it’s useful” -Francisco (Last month’s SRE meetup)

Slide 3

Slide 3 text

Distributed Tracing
  What is tracing?
  What are some best practices?
  When is it useful?

Slide 4

Slide 4 text

Disclaimer

Slide 5

Slide 5 text

What are traces?
  Logs
  Traces = CONTEXT
  Span
  Trace
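One way to picture the terms on this slide: a span is a single timed operation, and a trace is the set of spans that share a trace ID, linked through parent references. The sketch below is a minimal illustration only; the names `Span` and `children_of` and the sample operations are hypothetical, not the data model of any particular tracer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """One timed operation within a request (hypothetical minimal model)."""
    trace_id: str              # shared by every span in the same trace
    span_id: str               # unique within the trace
    parent_id: Optional[str]   # None for the root span
    operation: str
    start_ms: float
    end_ms: float
    tags: Dict[str, str] = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def children_of(trace: List[Span], parent: Span) -> List[Span]:
    """Return the direct children of `parent`, ordered by start time."""
    kids = [s for s in trace if s.parent_id == parent.span_id]
    return sorted(kids, key=lambda s: s.start_ms)


# A trace is simply the spans that share a trace_id.
trace = [
    Span("t1", "a", None, "GET /checkout", 0, 120),
    Span("t1", "b", "a", "auth.verify", 5, 25),
    Span("t1", "c", "a", "db.query", 30, 110),
]
root = next(s for s in trace if s.parent_id is None)
```

The parent links are what supply the context: each span says not only how long an operation took but which request, and which caller, it belonged to.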

Slide 6

Slide 6 text

How do I get traces?

Slide 7

Slide 7 text

What can I do with traces?

Slide 8

Slide 8 text

When is tracing useful?

Slide 9

Slide 9 text

Scenario: launching a new service

Slide 10

Slide 10 text

Scenario: launching a new service

Slide 11

Slide 11 text

Scenario: launching a new service

Slide 12

Slide 12 text

Scenario: my service is crashing

Slide 13

Slide 13 text

Scenario: my service is crashing

Slide 14

Slide 14 text

Scenario: my service is crashing

Slide 15

Slide 15 text

Scenario: my service isn’t doing its job

Slide 16

Slide 16 text

Scenario: my service isn’t doing its job

Slide 17

Slide 17 text

Scenario: my service is getting DDoS-ed

Slide 18

Slide 18 text

Scenario: my service is getting DDoS-ed

Slide 19

Slide 19 text

Scenario: my service is getting DDoS-ed

Slide 20

Slide 20 text

Scenario: I want to track service SLOs
  my service’s SLO
  my provider’s SLO
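As a rough illustration of tracking both SLOs from the same trace data, the sketch below computes the fraction of requests per service that finished within a latency threshold. The service names, durations, and the 200 ms threshold are invented for the example, and `slo_compliance` is a hypothetical helper rather than a feature of any tracing product.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (service, duration_ms) pairs, as might be derived from span data.
observations: List[Tuple[str, float]] = [
    ("my-service", 80.0), ("my-service", 95.0),
    ("my-service", 240.0), ("my-service", 60.0),
    ("provider", 30.0), ("provider", 35.0),
    ("provider", 210.0), ("provider", 220.0),
]


def slo_compliance(obs: List[Tuple[str, float]], threshold_ms: float) -> Dict[str, float]:
    """Fraction of requests per service that finished within the threshold."""
    total: Dict[str, int] = defaultdict(int)
    within: Dict[str, int] = defaultdict(int)
    for service, duration_ms in obs:
        total[service] += 1
        if duration_ms <= threshold_ms:
            within[service] += 1
    return {service: within[service] / total[service] for service in total}


compliance = slo_compliance(observations, threshold_ms=200.0)
```

Because spans for the provider's calls sit in the same traces as your own, the two compliance figures come from one data set and can be compared directly.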

Slide 21

Slide 21 text

Scenario: I want to improve instrumentation

Slide 22

Slide 22 text

Tracing Best Practices

Slide 23

Slide 23 text

Tip #1: Trace close to business value
  Identify important transactions
  Identify critical path of request
  Add traces!

Slide 24

Slide 24 text

Critical path exposes bottlenecks
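To make the idea concrete, here is a deliberately simplified sketch of finding a critical path: starting at the root span, it follows the child that finishes last at each level. Real critical-path analysis also has to handle overlapping and sequential siblings; the span data and the `critical_path` helper are hypothetical.

```python
from typing import Dict, List, Optional

# Each span is a plain dict: id, parent, name, start, end (milliseconds).
spans = [
    {"id": "a", "parent": None, "name": "GET /checkout", "start": 0, "end": 120},
    {"id": "b", "parent": "a", "name": "auth.verify", "start": 5, "end": 25},
    {"id": "c", "parent": "a", "name": "db.query", "start": 30, "end": 110},
    {"id": "d", "parent": "c", "name": "db.connect", "start": 30, "end": 90},
]


def critical_path(spans: List[dict]) -> List[str]:
    """Simplified heuristic: at each level, follow the child that ends last."""
    by_parent: Dict[Optional[str], List[dict]] = {}
    for s in spans:
        by_parent.setdefault(s["parent"], []).append(s)

    path: List[str] = []
    current: Optional[dict] = by_parent[None][0]  # assumes exactly one root span
    while current is not None:
        path.append(current["name"])
        kids = by_parent.get(current["id"], [])
        current = max(kids, key=lambda s: s["end"]) if kids else None
    return path


path = critical_path(spans)
```

In this sample, `auth.verify` is off the path, so speeding it up would not shorten the request; `db.connect` takes 60 of the 120 ms and is the place to look first.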

Slide 25

Slide 25 text

Tip #2: Trace communication libraries
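Instrumenting the communication layer mostly means carrying trace context across process boundaries. The sketch below shows the general inject/extract pattern over HTTP-style headers. The header names `x-trace-id` and `x-parent-span-id` are hypothetical placeholders, not a standard and not the wire format of any specific tracer.

```python
import uuid
from typing import Dict, Optional, Tuple

TRACE_HEADER = "x-trace-id"          # hypothetical header names
PARENT_HEADER = "x-parent-span-id"


def inject(headers: Dict[str, str], trace_id: str, span_id: str) -> Dict[str, str]:
    """Client side: copy the current trace context into outgoing headers."""
    out = dict(headers)
    out[TRACE_HEADER] = trace_id
    out[PARENT_HEADER] = span_id
    return out


def extract(headers: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Server side: continue the caller's trace, or start a new one."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    return trace_id, headers.get(PARENT_HEADER)


outgoing = inject({"accept": "application/json"}, trace_id="t1", span_id="a")
trace_id, parent_id = extract(outgoing)
```

Putting these two calls inside the shared HTTP or RPC client and server wrappers means every service that uses them joins the trace without per-endpoint work.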

Slide 26

Slide 26 text

Tip #3: Trace your dependencies
  Identify downstream services/data store drivers/platform SDKs/3rd party APIs
  Add or adopt traces!
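A minimal sketch of wrapping dependency calls, assuming no tracing library at all: a context manager records the duration and error status of whatever it wraps. The `traced` helper, the `recorded` list (a stand-in for a tracer's reporter), and both downstream calls are hypothetical.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List

recorded: List[Dict[str, object]] = []  # stand-in for a tracer's span reporter


@contextmanager
def traced(operation: str, **tags: str) -> Iterator[Dict[str, object]]:
    """Record duration and error status for the wrapped block."""
    span: Dict[str, object] = {"operation": operation, "tags": dict(tags), "error": False}
    start = time.perf_counter()
    try:
        yield span
    except Exception:
        span["error"] = True
        raise
    finally:
        span["duration_ms"] = (time.perf_counter() - start) * 1000.0
        recorded.append(span)


def fetch_profile(user_id: str) -> dict:
    """Hypothetical call to a downstream service."""
    with traced("profile-service.get", peer="profile-service"):
        return {"id": user_id}


def charge_card() -> None:
    """Hypothetical third-party API call that fails."""
    with traced("payments-api.charge", peer="payments-api"):
        raise TimeoutError("upstream timed out")


fetch_profile("u1")
try:
    charge_card()
except TimeoutError:
    pass
```

With every dependency call wrapped like this, a failing or slow provider shows up as its own span rather than as unexplained time inside your service.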

Slide 27

Slide 27 text

PaaS Down Detector

Slide 28

Slide 28 text

Learn from our mistakes

Slide 29

Slide 29 text

Tip #4: Add useful tags
  Version tags to track deployments
  Feature flag tags to track config changes
  Host identifier tags to correlate with machine metrics
  Customer identifier tags to correlate with business value
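To show why these tags pay off, the sketch below groups spans by a tag and compares error rates, for example across deployed versions. The tag keys (`version`, `feature.new_cart`, `host`, `customer`), the sample values, and the `error_rate_by_tag` helper are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List

spans: List[dict] = [
    {"tags": {"version": "1.4.0", "feature.new_cart": "off",
              "host": "web-1", "customer": "acme"}, "error": False},
    {"tags": {"version": "1.4.0", "feature.new_cart": "off",
              "host": "web-2", "customer": "globex"}, "error": False},
    {"tags": {"version": "1.5.0", "feature.new_cart": "on",
              "host": "web-1", "customer": "acme"}, "error": True},
    {"tags": {"version": "1.5.0", "feature.new_cart": "on",
              "host": "web-2", "customer": "globex"}, "error": False},
]


def error_rate_by_tag(spans: List[dict], key: str) -> Dict[str, float]:
    """Group spans by the value of one tag and compute the error rate per group."""
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for s in spans:
        value = s["tags"].get(key, "unknown")
        totals[value] += 1
        errors[value] += int(s["error"])
    return {value: errors[value] / totals[value] for value in totals}


by_version = error_rate_by_tag(spans, "version")
by_host = error_rate_by_tag(spans, "host")
```

The same function answers the deployment question (`version`), the config question (`feature.new_cart`), and the machine question (`host`), which is the point of tagging spans consistently.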

Slide 30

Slide 30 text

Summary
When is distributed tracing valuable?
  Rolling out new services
  Finding and fixing issues
  Identifying optimizations

Slide 31

Slide 31 text

Summary
What are some distributed tracing best practices?
  Trace close to your caller
  Trace communication layer
  Trace dependencies
  Add useful tags

Slide 32

Slide 32 text

Questions?
https://lightstep.com/careers
@k4rkum

Slide 33

Slide 33 text

Appendix

Slide 34

Slide 34 text

Metrics

Slide 35

Slide 35 text

Analyzing Metrics
  Set thresholds and alerts
  Look for correlated variance across metrics
  Aggregate across dimensions*

Slide 36

Slide 36 text

Logs

Slide 37

Slide 37 text

Analyzing Logs
  Complex, full-text search
  Granular analysis of rare events

Slide 38

Slide 38 text

Metrics & Logs are not enough
Metrics
  Scoped to a single system
  Vulnerable to high-cardinality tags
Logs
  Scoped to a single system
  Cost scales with usage
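The cardinality point can be made with simple arithmetic: a metrics system typically stores one time series per distinct combination of tag values, so the worst-case series count is the product of the tag cardinalities. The tag names and counts below are invented for the example, and `worst_case_series` is a hypothetical helper.

```python
from math import prod
from typing import Dict


def worst_case_series(tag_cardinalities: Dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct tag values."""
    return prod(tag_cardinalities.values())


# A modest set of tags on one latency metric.
base = {"endpoint": 20, "status": 5, "host": 100}
base_series = worst_case_series(base)

# Adding a single high-cardinality tag multiplies the total.
with_customer = dict(base, customer=50_000)
customer_series = worst_case_series(with_customer)
```

Here the three base tags allow up to 10,000 series, and adding a per-customer tag raises the bound to 500,000,000, which is why customer identifiers usually belong on spans rather than on metrics.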

Slide 39

Slide 39 text

Life is a box of trade-offs

                                          Logs   Metrics   Traces
Cost scales gracefully                     –        ✓         ✓
Accounts for all data (i.e., unsampled)    ✓        ✓         –
Immune to cardinality                      ✓        –         ✓