Always Bee Tracing

Honeycomb's first half-day tracing workshop, held on January 24th in SF. Intended as an interactive overview for anyone wanting to learn about tracing: creating traces and/or using them to answer questions.

We'll cover concepts + important tips for instrumenting an application for tracing, capturing instrumentation around external dependencies, and using traces to debug incidents. Honeycomb engineers will be on hand for the workshop and office hours afterwards, for hands-on consultation around instrumentation and tracing.

Christine Yen

January 24, 2019

Transcript

  1. Welcome to Always Bee Tracing! If you haven’t already, please clone the repository of your choice:
     ▸ Golang (into your $GOPATH): git clone git@github.com:honeycombio/tracing-workshop-go.git
     ▸ Node: git clone git@github.com:honeycombio/tracing-workshop-node.git
     Please also accept your invites to the "Always Bee Tracing" Honeycomb team and our Slack channel.

  2. A bit of history
     ▸ We used to have "one thing" (a monolithic application)
     ▸ Then we started to have "more things" (splitting monoliths into services)
     ▸ Now we have "yet more things", or even "Death Star" architectures (microservices, containers, serverless)

  3. A bit of history
     ▸ Now we have N² problems (one slow service bogs down everything, etc.)
     ▸ 2010 - Google releases the Dapper paper describing how they improve on existing tracing systems
     ▸ Key innovations: use of sampling, and common client libraries decoupling app code from tracing logic

  4. Why should GOOG have all the fun?
     ▸ 2012 - Zipkin was developed at Twitter for use with Thrift RPC
     ▸ 2015 - Uber releases Jaeger (also OpenTracing)
       ▸ Better sampling story, better client libraries, no Scribe/Kafka
     ▸ Various proprietary systems abound
     ▸ 2019 - Honeycomb is the best available due to best-in-class queries ;)

  5. A word on standards
     ▸ Standards for tracing exist: OpenTracing, OpenCensus, etc.
     ▸ Pros: collaboration, preventing vendor lock-in
     ▸ Cons: slower innovation, political battles/drama
     ▸ Honeycomb has integrations to bridge standard formats with the Honeycomb event model

  6. How Honeycomb fits in
     Understand how your production systems are behaving, right now.
     [Product diagram: Query Builder, Interactive Visuals, Raw Data, Traces, BubbleUp + Outliers; Beelines (automatic instrumentation + tracing APIs); Data Store]
     High Cardinality Data | High Dimensionality Data | Efficient storage

  7. Tracing is…
     ▸ For software engineers who need to understand their code
     ▸ Better when visualized (preferably first in aggregate)
     ▸ Best when layered on top of existing data streams (rather than adding another data silo to your toolkit)

  8. Our path today
     ▸ Establish a baseline: send simple events
     ▸ Customize: enrich with custom fields and extend into traces
     ▸ Explore: learn to query a collection of traces to find the most interesting one

  9. EXERCISE: Run the wall service
     Go:   go run ./wall.go
     Node: node ./wall.js
     ‣ Open up http://localhost:8080 in your browser and post some messages to your wall.
     ‣ Try writing messages like these:
       ‣ "hello #test #hashtag"
       ‣ "seems @twitteradmin isn’t a valid username but @honeycombio is"

  10. Custom Instrumentation
      ▸ Identify metadata that will help you isolate unexpected behavior in custom logic:
        ▸ Bits about your infrastructure (e.g. which host)
        ▸ Bits about your deploy (e.g. which version/build, which feature flags)
        ▸ Bits about your business (e.g. which customer, which shopping cart)
        ▸ Bits about your execution (e.g. payload characteristics, sub-timers)

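As one hypothetical way to attach that kind of metadata in the Go track, here is a minimal sketch using the Go Beeline's beeline.AddField helper inside an HTTP handler. The field names (app.username, app.message_length, app.build_id), route, and write key are invented for illustration, not taken from the workshop repo:

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	beeline "github.com/honeycombio/beeline-go"
	"github.com/honeycombio/beeline-go/wrappers/hnynethttp"
)

func main() {
	// Placeholder credentials for this sketch; the workshop repo wires up
	// its own Honeycomb configuration.
	beeline.Init(beeline.Config{WriteKey: "YOUR_WRITE_KEY", Dataset: "wall"})
	defer beeline.Close()

	http.HandleFunc("/message", func(w http.ResponseWriter, r *http.Request) {
		ctx := r.Context()
		// Bits about your business: which customer is doing this?
		beeline.AddField(ctx, "app.username", r.FormValue("username"))
		// Bits about your execution: payload characteristics.
		beeline.AddField(ctx, "app.message_length", len(r.FormValue("message")))
		// Bits about your deploy: which build served this request?
		beeline.AddField(ctx, "app.build_id", "2019.01.24-rc1")
		fmt.Fprintln(w, "ok")
	})

	// WrapHandler starts a span per incoming request, so the AddField calls
	// above have a span to attach their fields to.
	log.Fatal(http.ListenAndServe(":8080", hnynethttp.WrapHandler(http.DefaultServeMux)))
}
```

Because the fields land on the event for the current request, each one becomes a queryable dimension in Honeycomb with no extra plumbing.
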
  11. trace.trace_id: the ID of the trace this span belongs to
      trace.span_id: a unique ID for each span
      trace.parent_id: the ID of this span’s parent span, the call location the current span was called from
      service_name: the name of the service that generated this span
      name: the specific call location (like a function or method name)
      duration_ms: how much time the span took, in milliseconds
      [Diagram: TRACE 1, with event A as the root, event B whose parent is A, and event C whose parent is B]

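To make those fields concrete, here is a sketch of one span expressed as a plain Honeycomb event, sent by hand with libhoney-go rather than letting a Beeline manage the IDs. The trace/span ID values, service name, and write key are made up for illustration:

```go
package main

import (
	"log"

	libhoney "github.com/honeycombio/libhoney-go"
)

func main() {
	// Placeholder credentials for this sketch.
	if err := libhoney.Init(libhoney.Config{WriteKey: "YOUR_WRITE_KEY", Dataset: "wall"}); err != nil {
		log.Fatal(err)
	}
	defer libhoney.Close()

	// One span of a trace, expressed as a plain Honeycomb event.
	ev := libhoney.NewEvent()
	ev.Add(map[string]interface{}{
		"trace.trace_id":  "trace-1",       // shared by every span in the trace
		"trace.span_id":   "span-b",        // unique to this span
		"trace.parent_id": "span-a",        // the span that called us; omitted on the root span
		"service_name":    "wall",          // which service emitted this span
		"name":            "check_twitter", // the call location (function/method)
		"duration_ms":     12.4,            // how long this unit of work took
	})
	if err := ev.Send(); err != nil {
		log.Fatal(err)
	}
}
```
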
  12. [Diagram: TRACE 1 again, showing event A as the root, event B whose parent is A, and event C whose parent is B]

  13. EXERCISE: Find Checkpoint 2
      ‣ Try writing messages like these:
        ‣ "seems @twitteradmin isn’t a valid username but @honeycombio is"
        ‣ "have you tried @honeycombio for @mysql #observability?"

  14. Checkpoint 2 Takeaways
      ▸ Events can be used to trace across functions within a service just as easily as they can be "distributed"
      ▸ Store useful metadata on any event in a trace, and query against it!
      ▸ To aggregate per trace, filter to trace.parent_id does-not-exist (or break down by unique trace.trace_id values)

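A rough sketch of what that in-process tracing can look like with the Go Beeline: each function wraps its work in a child span via beeline.StartSpan, and both spans share the same trace. The function names and the app.username field are illustrative, not necessarily what the workshop code uses:

```go
package wall

import (
	"context"

	beeline "github.com/honeycombio/beeline-go"
)

// postToWall is the "parent" unit of work; checkTwitter becomes a child span
// in the same trace, even though no network hop is involved.
func postToWall(ctx context.Context, username, message string) {
	ctx, span := beeline.StartSpan(ctx, "post_to_wall")
	defer span.Send()

	beeline.AddField(ctx, "app.username", username)
	checkTwitter(ctx, username)
}

// checkTwitter wraps its own work in a child span, so its duration_ms shows
// up separately when you query with name = check_twitter.
func checkTwitter(ctx context.Context, username string) {
	ctx, span := beeline.StartSpan(ctx, "check_twitter")
	defer span.Send()

	// The same metadata is queryable on this span, too.
	beeline.AddField(ctx, "app.username", username)
	// ... call out to Twitter's API here ...
}
```
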
  15. EXERCISE: ID sources of latency
      ▸ Who’s experienced the longest delay when talking to Twitter?
        ▸ Hint: app.username, MAX(duration_ms), and name = check_twitter
      ▸ Who’s responsible for the most cumulative time talking to Twitter?
        ▸ Hint: use SUM(duration_ms) instead

  16. EXERCISE: Run the analysis service
      Go:   go run ./analysis.go
      Node: node ./analysis.js
      ‣ Open up http://localhost:8080 in your browser and post some messages to your wall.
      ‣ Try these:
        ‣ "everything is awesome!"
        ‣ "the sky is dark and gloomy and #winteriscoming"

  17. Checkpoint 3 Takeaways
      ▸ Tracing across services just requires serialization of tracing context over the wire
      ▸ Wrapping outbound HTTP requests is a simple form of tracing dependencies

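A sketch of both takeaways in Go Beeline terms: wrapping the outbound http.Client's transport with hnynethttp.WrapRoundTripper lets the Beeline serialize the current trace context into a request header for the next service (which unwraps it with its own handler wrapper, as in the earlier sketch). The analysis-service URL and the error field name are placeholders:

```go
package wall

import (
	"context"
	"net/http"
	"net/url"

	beeline "github.com/honeycombio/beeline-go"
	"github.com/honeycombio/beeline-go/wrappers/hnynethttp"
)

// callAnalysis makes an outbound request to the analysis service. Because the
// client's transport is wrapped, the Beeline serializes the current trace
// context into a request header, creates a span for the call, and times it.
func callAnalysis(ctx context.Context, message string) (*http.Response, error) {
	client := &http.Client{
		Transport: hnynethttp.WrapRoundTripper(http.DefaultTransport),
	}

	// The analysis service address here is a placeholder for this sketch.
	req, err := http.NewRequest("GET",
		"http://localhost:8088/sentiment?text="+url.QueryEscape(message), nil)
	if err != nil {
		return nil, err
	}
	// Passing ctx along is what ties the outbound span to the current trace.
	req = req.WithContext(ctx)

	resp, err := client.Do(req)
	if err != nil {
		beeline.AddField(ctx, "app.analysis_error", err.Error())
		return nil, err
	}
	return resp, nil
}
```
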
  18. Checkpoint 4 Takeaways
      ▸ Working with a black box? Instrument from the perspective of the code you can control.
      ▸ Similar to identifying test cases in TDD: capture fields to let you refine your understanding of the system.

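One hypothetical shape for that from-our-side instrumentation: wrap the opaque call in a span we own and record what we observed about it. The function name, field names, and error handling below are illustrative assumptions rather than the workshop's actual code:

```go
package wall

import (
	"context"
	"net/http"

	beeline "github.com/honeycombio/beeline-go"
)

// persistMessage calls a black-box HTTP API we don't own. We can't see inside
// it, but we can record everything visible from our side of the call.
func persistMessage(ctx context.Context, persistURL string) error {
	ctx, span := beeline.StartSpan(ctx, "persist_message")
	defer span.Send()

	// What we asked for...
	beeline.AddField(ctx, "request.url", persistURL)

	resp, err := http.Get(persistURL)
	if err != nil {
		beeline.AddField(ctx, "app.persist_error", err.Error())
		return err
	}
	defer resp.Body.Close()

	// ...and the signals of health we got back, so flaky behavior is queryable.
	beeline.AddField(ctx, "response.status_code", resp.StatusCode)
	beeline.AddField(ctx, "response.content_length", resp.ContentLength)
	return nil
}
```
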
  19. EXERCISE: Who’s knocking over my black box?
      ▸ First: what does "knocking over" mean? We know that we talk to our black box via an HTTP call. What are our signals of health?
      ▸ What’s the "usual worst" latency for this call out to AWS? (Explore different calculations: P95 = 95th percentile, MAX, HEATMAP)
        ▸ Hint: P95(duration_ms), and request.host contains aws

  20. Scenario #1
      Symptoms: we pulled in that last POST in order to persist messages somewhere, but we’re hearing from customer support that behavior has felt buggy lately, like it works sometimes but not always. What’s going on?
      Think about:
      ‣ Verify this claim. Are we sure persist has been flaky? What does failure look like?
      ‣ Look through all of the metadata we have to try and find some correlation across those failing requests.
      Hints: response.status_code, request.content_length. HEATMAPs are great :)

  21. Scenario #2
      Symptoms: everything feels slowed down, but more importantly the persistence behavior seems completely broken. What gives?
      Think about:
      ‣ What might failure mean in this case?
      ‣ Once you’ve figured out what these failures look like, can we do anything to stop the bleeding? What might we need to find out to answer that question?
      Hints: response.status_code, app.username

  22. Scenario #3
      Symptoms: persistence seems fine, but all requests seem to have slowed down to a snail’s pace. What could be impacting our overall latency so badly?
      Prompts:
      ‣ Hint! Think about adding a num_hashtags or num_handles field to your events if you’d like to capture more about the characteristics of your payload.
      ‣ It may be helpful to zoom in (aka add a filter) to just requests talking to amazonaws.com
      Hints: response.status_code, request.host contains aws

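If you take the num_hashtags / num_handles hint, a sketch of what that could look like in the Go version, with deliberately naive parsing and an invented function name:

```go
package wall

import (
	"context"
	"strings"

	beeline "github.com/honeycombio/beeline-go"
)

// annotatePayload adds rough payload-shape fields to the current span so you
// can break down latency by how "heavy" a message was.
func annotatePayload(ctx context.Context, message string) {
	var numHashtags, numHandles int
	for _, word := range strings.Fields(message) {
		switch {
		case strings.HasPrefix(word, "#"):
			numHashtags++
		case strings.HasPrefix(word, "@"):
			numHandles++
		}
	}
	beeline.AddField(ctx, "app.num_hashtags", numHashtags)
	beeline.AddField(ctx, "app.num_handles", numHandles)
}
```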