
Lynn Root - Tracing, Fast and Slow: Digging into and improving your web service’s performance

Do you maintain a [Rube Goldberg](https://s-media-cache-ak0.pinimg.com/564x/92/27/a6/9227a66f6028bd19d418c4fb3a55b379.jpg)-like service? Perhaps it’s highly distributed? Or you recently walked onto a team with an unfamiliar codebase? Have you noticed your service responds slower than molasses? This talk will walk you through how to pinpoint bottlenecks, the approaches and tools available for improving them, and how to look like the hero in the process. All in a day’s work.

The talk describes various approaches to tracing a web service, including black- and white-box tracing and tracing distributed systems, as well as the tools and external services available for measuring performance. I’ll also present a few different rabbit holes to dive into when trying to improve your service’s performance.

https://us.pycon.org/2017/schedule/presentation/565/


PyCon 2017

May 21, 2017

Transcript

  1. Lynn Root | SRE | @roguelynn Tracing: Fast & Slow: Digging into and improving your web service’s performance
  2. $ whoami

  3. agenda

  4. agenda • Overview and problem space

  5. agenda • Overview and problem space • Approaches to tracing

  6. agenda • Overview and problem space • Approaches to tracing • Tracing at scale

  7. agenda • Overview and problem space • Approaches to tracing • Tracing at scale • Diagnosing performance issues

  8. agenda • Overview and problem space • Approaches to tracing • Tracing at scale • Diagnosing performance issues • Tracing services & systems
  9. Tracing Overview

  10. machine-centric • Focus on a single machine

  11. machine-centric • Focus on a single machine • No view into a service’s dependencies

  12. workflow-centric • Understand causal relationships

  13. workflow-centric • Understand causal relationships • End-to-end tracing

  14. None

  15. why trace?

  16. why trace? • Performance analysis

  17. why trace? • Performance analysis • Anomaly detection

  18. why trace? • Performance analysis • Anomaly detection • Profiling

  19. why trace? • Performance analysis • Anomaly detection • Profiling • Resource attribution

  20. why trace? • Performance analysis • Anomaly detection • Profiling • Resource attribution • Workload modeling
  21. Tracing Approaches

  22. manual

  23. Manual tracing with a request-ID decorator:

```python
import uuid
from functools import wraps

from flask import Flask, request

app = Flask(__name__)

def request_id(f):
    @wraps(f)
    def decorated(*args, **kwargs):
        # Reuse the caller's ID if given, otherwise mint a new one
        req_id = request.headers.get("X-Request-Id", uuid.uuid4())
        return f(req_id, *args, **kwargs)
    return decorated

@app.route("/")
@request_id
def list_services(req_id):
    # log w/ ID for wherever you want to trace
    ...  # app logic
```
  24. Generating and returning a request ID at the nginx load balancer:

```nginx
upstream appserver {
    server 10.0.0.0:80;
}

server {
    listen 80;
    # Return to client
    add_header X-Request-ID $request_id;
    location / {
        proxy_pass http://appserver;  # Pass to app server
        proxy_set_header X-Request-ID $request_id;
    }
}
```

  25. Logging the request ID in nginx’s access log:

```nginx
log_format trace '$remote_addr … $request_id';

server {
    listen 80;
    add_header X-Request-ID $request_id;
    location / {
        proxy_pass http://app_server;
        proxy_set_header X-Request-ID $request_id;
        # Log $request_id
        access_log /var/log/nginx/access_trace.log trace;
    }
}
```
  26. blackbox

  27. metadata propagation
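
As an aside, here is a minimal sketch of what metadata propagation looks like in practice, assuming a Flask service and a hypothetical downstream feed-service (neither is from the slides): the incoming trace ID is reused and forwarded on every downstream call, so all spans of the workflow can be collated later.

```python
import uuid

import requests
from flask import Flask, request

app = Flask(__name__)

@app.route("/dashboard")
def dashboard():
    # Reuse the caller's ID if present; otherwise start a new trace
    req_id = request.headers.get("X-Request-Id", str(uuid.uuid4()))
    # Propagate the same ID to every dependency we call
    feed = requests.get(
        "http://feed-service/feed",  # hypothetical downstream service
        headers={"X-Request-Id": req_id},
    )
    return feed.text
```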

  28. None
  29. Tracing at Scale

  30. four things to think about

  31. four things to think about • What relationships to track

  32. four things to think about • What relationships to track • How to track them

  33. four things to think about • What relationships to track • How to track them • Which sampling approach to take

  34. four things to think about • What relationships to track • How to track them • Which sampling approach to take • Which visualization to employ
  35. what to track

  36. Submitter Flow PoV [diagram: Request One and Request Two traced from the submitter’s point of view]

  37. Trigger Flow PoV [diagram: Request One and Request Two traced from the trigger’s point of view]

  38. how to track

  39. request ID

  40. request ID + logical clock

  41. request ID + logical clock + previous trace points
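
A rough sketch of what that combined metadata could look like as it travels with a request (the field and method names are illustrative, not from the talk):

```python
import uuid
from dataclasses import dataclass, field
from typing import List

@dataclass
class TraceContext:
    """Metadata carried along with each request."""
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    clock: int = 0  # logical clock for ordering events across hosts
    previous_points: List[str] = field(default_factory=list)

    def advance(self, trace_point: str) -> "TraceContext":
        # Tick the clock and record the hop before forwarding downstream,
        # so receivers can order events and reconstruct the path taken
        return TraceContext(
            request_id=self.request_id,
            clock=self.clock + 1,
            previous_points=self.previous_points + [trace_point],
        )
```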

  42. tradeoffs

  43. tradeoffs • Payload size

  44. tradeoffs • Payload size • Explicit relationships

  45. tradeoffs • Payload size • Explicit relationships • Collate despite lost data

  46. tradeoffs • Payload size • Explicit relationships • Collate despite lost data • Immediate availability
  47. how to sample

  48. sampling approaches • Head-based

  49. sampling approaches • Head-based • Tail-based

  50. sampling approaches • Head-based • Tail-based • Unitary
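
To make the head-based option concrete, here is a minimal sketch of sampling at the first hop (the header convention is an assumption): the entry service decides once, and every service after it honors the propagated decision. Tail-based sampling instead defers the decision until the workflow completes, at the cost of buffering trace data for every request.

```python
import random
from typing import Optional

SAMPLE_RATE = 0.01  # trace 1% of requests; tune to your traffic

def should_sample(incoming_decision: Optional[str]) -> bool:
    # Head-based sampling: the decision is made once, at the first hop
    if incoming_decision is not None:
        # Honor the upstream decision so the whole workflow is
        # either fully traced or not traced at all
        return incoming_decision == "1"
    return random.random() < SAMPLE_RATE
```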

  51. what to visualize

  52. gantt chart [diagram: spans for GET /home, GET /feed, GET /profile, GET /messages, and GET /friends under Trace ID: de4db33f]

  53. request flow graph [diagram: call and reply edges between services A–E, each edge annotated with its latency, 100µs to 2200µs]

  54. context calling tree [diagram: calling tree over services A, B, C, C, D, E]
  55. keep in mind • What do I want to know?

  56. keep in mind • What do I want to know? • How much can I instrument?

  57. keep in mind • What do I want to know? • How much can I instrument? • How much do I want to know?

  58. suggested for performance

  59. suggested for performance • Trigger PoV

  60. suggested for performance • Trigger PoV • Head-based sampling

  61. suggested for performance • Trigger PoV • Head-based sampling • Flow graphs
  62. Diagnosing

  63. questions to ask • Batch requests?

  64. questions to ask • Batch requests? • Any parallelization opportunities?

  65. questions to ask • Batch requests? • Any parallelization opportunities? • Useful to add/fix caching?

  66. questions to ask • Batch requests? • Any parallelization opportunities? • Useful to add/fix caching? • Frontend resource loading?

  67. questions to ask • Batch requests? • Any parallelization opportunities? • Useful to add/fix caching? • Frontend resource loading? • Chunked or JIT responses?
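
As one example of the parallelization question above, a flow graph often reveals independent downstream calls being made back-to-back; here is a minimal sketch of issuing them concurrently with asyncio (the service URLs are hypothetical):

```python
import asyncio

import aiohttp

async def fetch_json(session, url):
    async with session.get(url) as resp:
        return await resp.json()

async def build_home_page():
    # The trace showed /feed and /friends don't depend on each other,
    # so issue both requests concurrently instead of sequentially
    async with aiohttp.ClientSession() as session:
        feed, friends = await asyncio.gather(
            fetch_json(session, "http://feed-service/feed"),
            fetch_json(session, "http://friend-service/friends"),
        )
    return {"feed": feed, "friends": friends}
```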
  68. Systems & Services

  69. OpenTracing

  70. self-hosted

  71. Zipkin (Twitter)

  72. Zipkin (Twitter) • Out-of-band reporting to remote collector

  73. Zipkin (Twitter) • Out-of-band reporting to remote collector • Report via HTTP, Kafka, and Scribe

  74. Zipkin (Twitter) • Out-of-band reporting to remote collector • Report via HTTP, Kafka, and Scribe • Python libs only support propagation via HTTP

  75. Zipkin (Twitter) • Out-of-band reporting to remote collector • Report via HTTP, Kafka, and Scribe • Python libs only support propagation via HTTP • Limited web UI
  76. Instrumenting a Flask endpoint with Zipkin (the API shown matches Yelp’s py_zipkin, so the imports below assume it):

```python
import requests
from flask import Flask
from py_zipkin.zipkin import zipkin_span

app = Flask(__name__)
app_port = 5000  # placeholder: your app server's port

def http_transport(span_data):
    # need to write own transport func
    requests.post(
        "http://zipkinserver:9411/api/v1/spans",
        data=span_data,
        headers={"Content-type": "application/x-thrift"},
    )

@app.route("/")
def index():
    with zipkin_span(
        service_name="myawesomeapp",
        span_name="index",
        transport_handler=http_transport,
        port=app_port,
        sample_rate=100,  # 0-100 percent
    ):
        ...  # do something
```
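
Note that `sample_rate=100` traces every request, which is convenient for a demo; in production you would typically lower it and let the head-based sampling trace only a fraction of traffic.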
  77. Jaeger (Uber)

  78. Jaeger (Uber) • Local daemon to collect & report

  79. Jaeger (Uber) • Local daemon to collect & report • Storage support only for Cassandra

  80. Jaeger (Uber) • Local daemon to collect & report • Storage support only for Cassandra • Lacking in documentation

  81. Jaeger (Uber) • Local daemon to collect & report • Storage support only for Cassandra • Lacking in documentation • Cringe-worthy client library
  82. The same endpoint instrumented via the OpenTracing API (the slide elides the imports and configuration; `Config` is assumed to come from jaeger_client):

```python
import time

import opentracing as ot
from flask import Flask
from jaeger_client import Config  # assumption: import elided on the slide

app = Flask(__name__)

config = Config(…)  # configuration elided on the slide
tracer = config.initialize_tracer()

@app.route("/")
def index():
    with ot.tracer.start_span("ASpan") as span:
        span.log_event("test message", payload={"life": 42})
        with ot.tracer.start_span("AChildSpan", child_of=span) as cspan:
            span.log_event("another test message")  # wat
    return "ok"

time.sleep(2)   # yield to IOLoop to flush the spans
tracer.close()  # flush any buffered spans
```
  83. honorable mentions • AppDash • LightStep (private beta)

  84. services

  85. Stackdriver Trace (Google)

  86. Stackdriver Trace (Google) • No Python client libraries; no gRPC client support

  87. Stackdriver Trace (Google) • No Python client libraries; no gRPC client support • Forward traces from Zipkin

  88. Stackdriver Trace (Google) • No Python client libraries; no gRPC client support • Forward traces from Zipkin • Storage limitation of 30 days
  89. X-Ray (AWS)

  90. X-Ray (AWS) • No first-class Python support; Boto available

  91. X-Ray (AWS) • No first-class Python support; Boto available • Configurable sampling, but not for Boto

  92. X-Ray (AWS) • No first-class Python support; Boto available • Configurable sampling, but not for Boto • Flow graphs with latency, response %, sample %
  93. honorable mentions • Datadog • New Relic

  94. TL;DR

  95. tl;dr • You need this

  96. tl;dr • You need this • Docs are lacking

  97. tl;dr • You need this • Docs are lacking • Language support lacking

  98. tl;dr • You need this • Docs are lacking • Language support lacking • One-size-fits-all approaches

  99. tl;dr • You need this • Docs are lacking • Language support lacking • One-size-fits-all approaches • But there’s an open spec!

  100. Thanks! Sources & links: rogue.ly/tracing Lynn Root | SRE | @roguelynn