Tracing, Fast & Slow: Digging into and improving your web service's performance

Lynn Root
October 31, 2018

* PyLadies in St Petersburg, Nov 2018
* EuroPython 2017
* PyCon 2017

Transcript

  1. Lynn Root | SRE | @roguelynn. Tracing: Fast & Slow: Digging into and improving your web service’s performance
  2. $ whoami

  3. agenda

  4. agenda • Overview and problem space

  5. agenda • Overview and problem space • Approaches to tracing

  6. agenda • Overview and problem space • Approaches to tracing • Tracing at scale

  7. agenda • Overview and problem space • Approaches to tracing • Tracing at scale • Diagnosing performance issues

  8. agenda • Overview and problem space • Approaches to tracing • Tracing at scale • Diagnosing performance issues • Tracing services & systems

  9. Tracing Overview

  10. machine-centric • Focus on a single machine

  11. machine-centric • Focus on a single machine • No view into a service’s dependencies

  12. workflow-centric • Understand causal relationships

  13. workflow-centric • Understand causal relationships • End-to-end tracing

  14. None

  15. 100k’s client connections • 100’s access point hosts • 1,000’s unique services running on 10k’s hosts

  16. why trace?

  17. why trace? • Performance analysis

  18. why trace? • Performance analysis • Anomaly detection

  19. why trace? • Performance analysis • Anomaly detection • Profiling

  20. why trace? • Performance analysis • Anomaly detection • Profiling • Resource attribution

  21. why trace? • Performance analysis • Anomaly detection • Profiling • Resource attribution • Workload modeling

  22. Tracing Approaches

  23. manual

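  24. import uuid
      from functools import wraps
      from flask import Flask, request
      app = Flask(__name__)
      def request_id(f):
          @wraps(f)
          def decorated(*args, **kwargs):
              # Reuse the caller's ID if present, otherwise mint a new one
              req_id = request.headers.get("X-Request-Id", str(uuid.uuid4()))
              return f(req_id, *args, **kwargs)
          return decorated
      @app.route("/")
      @request_id
      def list_services(req_id):
          # log w/ ID for wherever you want to trace
          # app logic
          ...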
  25. upstream appserver {
          server 10.0.0.0:80;
      }
      server {
          listen 80;
          # Return to client
          add_header X-Request-ID $request_id;
          location / {
              proxy_pass http://appserver;
              # Pass to app server
              proxy_set_header X-Request-ID $request_id;
          }
      }
  26. log_format trace '$remote_addr … $request_id';
      server {
          listen 80;
          add_header X-Request-ID $request_id;
          location / {
              proxy_pass http://app_server;
              proxy_set_header X-Request-ID $request_id;
              # Log $request_id
              access_log /var/log/nginx/access_trace.log trace;
          }
      }
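The "log w/ ID" step of the manual approach can be sketched with the standard library alone. This is a minimal sketch, not the speaker's implementation: `current_request_id`, `RequestIdFilter`, `make_logger`, and `handle_request` are hypothetical names, and a `contextvars` variable stands in for wherever a web framework would keep the per-request ID.

```python
import contextvars
import io
import logging
import uuid

# Hypothetical holder for the ID of the request currently being handled.
current_request_id = contextvars.ContextVar("current_request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record):
        record.request_id = current_request_id.get()
        return True


def make_logger(stream):
    """Build a logger whose lines start with the request ID."""
    logger = logging.getLogger("trace-demo")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(request_id)s %(message)s"))
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(headers, logger):
    """Reuse the upstream X-Request-Id if present, otherwise mint one."""
    req_id = headers.get("X-Request-Id") or str(uuid.uuid4())
    token = current_request_id.set(req_id)
    try:
        logger.info("listing services")  # app logic would go here
    finally:
        current_request_id.reset(token)
    return req_id


if __name__ == "__main__":
    out = io.StringIO()
    handle_request({"X-Request-Id": "abc123"}, make_logger(out))
    print(out.getvalue(), end="")
```

Because the ID is attached by a filter rather than passed to each log call, every line written while a request is being handled can be joined later with the nginx access log on the same ID.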
  27. blackbox

  28. metadata propagation

  29. None
  30. Tracing at Scale

  31. four things to think about

  32. four things to think about • What relationships to track

  33. four things to think about • What relationships to track • How to track them

  34. four things to think about • What relationships to track • How to track them • Which sampling approach to take

  35. four things to think about • What relationships to track • How to track them • Which sampling approach to take • How to visualize

  36. what to track

  37. Submitter Flow PoV (diagram labels: Request One, Request Two)

  38. Trigger Flow PoV (diagram labels: Request One, Request Two)

  39. how to track

  40. request ID

  41. request ID + logical clock

  42. request ID + logical clock + previous trace points

  43. tradeoffs

  44. tradeoffs • Payload size

  45. tradeoffs • Payload size • Explicit relationships

  46. tradeoffs • Payload size • Explicit relationships • Collate despite lost data

  47. tradeoffs • Payload size • Explicit relationships • Collate despite lost data • Immediate availability
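The three "how to track" options (request ID, plus a logical clock, plus previous trace points) differ mainly in how much metadata travels with each request. The sketch below is one hedged way to picture that, not a format from the talk: `TraceMetadata` is a hypothetical class, the clock is a simple counter carried with the request, and the `X-Trace-Clock` and `X-Trace-Path` header names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceMetadata:
    request_id: str
    clock: int = 0     # logical clock: orders events without wall-clock time
    path: tuple = ()   # previous trace points, oldest first

    def record(self, trace_point):
        """Return the metadata to send onward after passing `trace_point`."""
        return TraceMetadata(
            self.request_id, self.clock + 1, self.path + (trace_point,)
        )

    def to_headers(self):
        """Serialize for propagation; the payload grows with every hop."""
        return {
            "X-Request-Id": self.request_id,
            "X-Trace-Clock": str(self.clock),     # hypothetical header
            "X-Trace-Path": ",".join(self.path),  # hypothetical header
        }

    @classmethod
    def from_headers(cls, headers):
        """Rebuild metadata on the receiving side; missing parts default."""
        raw_path = headers.get("X-Trace-Path", "")
        return cls(
            request_id=headers["X-Request-Id"],
            clock=int(headers.get("X-Trace-Clock", "0")),
            path=tuple(p for p in raw_path.split(",") if p),
        )
```

Carrying only the request ID keeps the payload constant, while carrying the path makes relationships explicit and available immediately at the cost of a header that grows with each trace point.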
  48. how to sample

  49. sampling approaches • Head-based

  50. sampling approaches • Head-based • Tail-based

  51. sampling approaches • Head-based • Tail-based • Unitary
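As a rough illustration of the first two approaches: head-based sampling decides at the start of a request, tail-based sampling decides once the trace is complete. The sketch assumes a string trace ID and a "keep slow or failed traces" policy; `head_sample` and `tail_sample` are hypothetical helpers, not functions from any tracing library.

```python
import hashlib


def head_sample(trace_id, rate):
    """Head-based: decide once at the entry point, from the trace ID alone.

    Hashing the ID makes the decision deterministic, so every service that
    sees the same trace ID reaches the same answer.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < rate * 2**64


def tail_sample(trace_duration_ms, had_error, slow_threshold_ms):
    """Tail-based: decide after the trace has finished.

    Assumed policy for this sketch: keep traces that failed or were slow.
    """
    return had_error or trace_duration_ms >= slow_threshold_ms
```

The head-based decision needs no buffering but cannot know whether a trace will turn out to be interesting; the tail-based one can, but every span has to be held somewhere until the decision is made.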

  52. what to visualize

  53. gantt chart: GET /home, GET /feed, GET /profile, GET /messages, GET /friends (Trace ID: de4db33f)

  54. request flow graph: A call, B call, C call, C call, D call, E call, E reply, D reply, B reply, C reply, C reply, A reply (edge latencies: 2200µs, 1500µs, 500µs, 300µs, 400µs, 600µs, 800µs, 500µs, 500µs, 700µs, 500µs, 400µs, 600µs, 100µs)

  55. context calling tree: A, B, C, C, D, E
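A calling tree like the one on the slide can be rebuilt from span records that carry a parent pointer. The sketch below assumes each span is a dict with `id`, `parent`, and `name` keys; that record shape and both function names are assumptions, and the sample tree in the usage block is illustrative only, since the slide's actual structure is not recoverable from the transcript.

```python
from collections import defaultdict


def build_call_tree(spans):
    """Group spans under their parents; returns (roots, children_by_parent)."""
    children = defaultdict(list)
    roots = []
    for span in spans:
        if span["parent"] is None:
            roots.append(span)
        else:
            children[span["parent"]].append(span)
    return roots, children


def render_tree(spans):
    """Return the calling tree as indented lines, one per span."""
    roots, children = build_call_tree(spans)
    lines = []

    def walk(span, depth):
        lines.append("  " * depth + span["name"])
        for child in children[span["id"]]:
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines


if __name__ == "__main__":
    # Hypothetical trace: A calls B and C; B calls D; D calls E.
    example = [
        {"id": 1, "parent": None, "name": "A"},
        {"id": 2, "parent": 1, "name": "B"},
        {"id": 3, "parent": 1, "name": "C"},
        {"id": 4, "parent": 2, "name": "D"},
        {"id": 5, "parent": 4, "name": "E"},
    ]
    print("\n".join(render_tree(example)))
```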

  56. keep in mind • What do I want to know?

  57. keep in mind • What do I want to know? • How much can I instrument?

  58. keep in mind • What do I want to know? • How much can I instrument? • How much do I want to know?

  59. suggested for performance

  60. suggested for performance • Trigger PoV

  61. suggested for performance • Trigger PoV • Head-based sampling

  62. suggested for performance • Trigger PoV • Head-based sampling • Flow graphs

  63. Diagnosing

  64. questions to ask • Batch requests?

  65. questions to ask • Batch requests? • Any parallelization opportunities?

  66. questions to ask • Batch requests? • Any parallelization opportunities? • Useful to add/fix caching?

  67. questions to ask • Batch requests? • Any parallelization opportunities? • Useful to add/fix caching? • Frontend resource loading?

  68. questions to ask • Batch requests? • Any parallelization opportunities? • Useful to add/fix caching? • Frontend resource loading? • Chunked or JIT responses?
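The "any parallelization opportunities?" question can be asked of trace data directly: sibling calls that run back to back under the same parent are candidates. The sketch below assumes spans are dicts with `name`, `parent`, `start`, and `end` keys in one consistent time unit; `serial_siblings` is a hypothetical helper, and a hit is only a hint, because one call may genuinely depend on the previous one's result.

```python
from collections import defaultdict


def serial_siblings(spans):
    """Find adjacent sibling spans that ran one after another, not concurrently.

    Returns (parent, first_name, second_name) tuples.
    """
    by_parent = defaultdict(list)
    for span in spans:
        by_parent[span["parent"]].append(span)

    candidates = []
    for parent, siblings in by_parent.items():
        siblings.sort(key=lambda s: s["start"])
        for first, second in zip(siblings, siblings[1:]):
            if second["start"] >= first["end"]:  # no overlap in time
                candidates.append((parent, first["name"], second["name"]))
    return candidates


if __name__ == "__main__":
    # Hypothetical timings for the requests named on the Gantt chart slide.
    trace = [
        {"name": "GET /home", "parent": None, "start": 0, "end": 100},
        {"name": "GET /feed", "parent": "GET /home", "start": 5, "end": 40},
        {"name": "GET /profile", "parent": "GET /home", "start": 40, "end": 70},
        {"name": "GET /messages", "parent": "GET /home", "start": 60, "end": 95},
    ]
    print(serial_siblings(trace))
```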
  69. Frameworks, Systems & Services

  70. OpenTracing

  71. OpenCensus

  72. self-hosted

  73. Zipkin (Twitter)

  74. Zipkin (Twitter) • Out-of-band reporting to remote collector

  75. Zipkin (Twitter) • Out-of-band reporting to remote collector • Report via HTTP, Kafka, and Scribe

  76. Zipkin (Twitter) • Out-of-band reporting to remote collector • Report via HTTP, Kafka, and Scribe • Python libs only support propagation via HTTP

  77. Zipkin (Twitter) • Out-of-band reporting to remote collector • Report via HTTP, Kafka, and Scribe • Python libs only support propagation via HTTP • Limited web UI
  78. import requests
      from py_zipkin.zipkin import zipkin_span
      # `app` (the Flask app) and `app_port` are assumed to be defined elsewhere
      def http_transport(span_data):
          requests.post(
              "http://zipkinserver:9411/api/v1/spans",
              data=span_data,
              headers={"Content-type": "application/x-thrift"})
      @app.route("/")
      def index():
          with zipkin_span(service_name="myawesomeapp",
                           span_name="index",
                           # need to write own transport func
                           transport_handler=http_transport,
                           port=app_port,
                           # 0-100 percent
                           sample_rate=100):
              # do something
              ...
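Propagation via HTTP in Zipkin means passing the B3 headers (`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-ParentSpanId`, `X-B3-Sampled`) on outgoing calls. The standard-library sketch below shows the idea only; `new_id` and `child_b3_headers` are hypothetical helpers rather than py_zipkin functions, headers are treated as a plain case-sensitive dict, and the incoming span ID is assumed to identify the span that makes the outgoing call.

```python
import secrets


def new_id():
    """64-bit ID as 16 lower-hex characters, the B3 span ID encoding."""
    return secrets.token_hex(8)


def child_b3_headers(incoming, sampled=True):
    """Build B3 headers for an outgoing call made while handling `incoming`."""
    # Keep the trace ID for the whole request; start a new trace if absent.
    trace_id = incoming.get("X-B3-TraceId") or new_id()
    headers = {
        "X-B3-TraceId": trace_id,
        "X-B3-SpanId": new_id(),  # fresh ID for the outgoing call
        # Honor an upstream sampling decision instead of re-deciding.
        "X-B3-Sampled": incoming.get("X-B3-Sampled", "1" if sampled else "0"),
    }
    parent = incoming.get("X-B3-SpanId")
    if parent:
        headers["X-B3-ParentSpanId"] = parent
    return headers
```

Passing the returned dict as the `headers` argument of an HTTP client call is what lets the downstream service attach its spans to the same trace.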
  79. Jaeger (Uber)

  80. Jaeger (Uber) • Local daemon to collect & report

  81. Jaeger (Uber) • Local daemon to collect & report • Storage support for only Cassandra

  82. Jaeger (Uber) • Local daemon to collect & report • Storage support for only Cassandra • Lacking in documentation

  83. Jaeger (Uber) • Local daemon to collect & report • Storage support for only Cassandra • Lacking in documentation • Cringe-worthy client library
  84. import time
      import opentracing as ot
      from jaeger_client import Config
      config = Config(…)
      tracer = config.initialize_tracer()
      @app.route("/")
      def index():
          with ot.tracer.start_span("ASpan") as span:
              span.log_event("test message", payload={"life": 42})
              with ot.tracer.start_span("AChildSpan", child_of=span) as cspan:
                  cspan.log_event("another test message")
          # wat
          time.sleep(2)  # yield to IOLoop to flush the spans
          tracer.close()  # flush any buffered spans
  85. honorable mentions • AppDash

  86. services

  87. Stackdriver Trace (Google)

  88. Stackdriver Trace (Google) • OpenCensus Python library with gRPC support

  89. Stackdriver Trace (Google) • OpenCensus Python library with gRPC support • Forward traces from Zipkin

  90. Stackdriver Trace (Google) • OpenCensus Python library with gRPC support • Forward traces from Zipkin • Storage limitation of 30 days

  91. Stackdriver Trace (Google) • OpenCensus Python library with gRPC support • Forward traces from Zipkin • Storage limitation of 30 days • Recreate graphs per time period
  92. X-Ray (AWS)

  93. X-Ray (AWS) • Supports OpenCensus, not OpenTracing

  94. X-Ray (AWS) • Supports OpenCensus, not OpenTracing • SDK has Python support

  95. X-Ray (AWS) • Supports OpenCensus, not OpenTracing • SDK has Python support • Lots of flexibility with configuring sampling

  96. X-Ray (AWS) • Supports OpenCensus, not OpenTracing • SDK has Python support • Lots of flexibility with configuring sampling • Send metrics from outside AWS environment

  97. X-Ray (AWS) • Supports OpenCensus, not OpenTracing • SDK has Python support • Lots of flexibility with configuring sampling • Send metrics from outside AWS environment • Flow graphs with latency, response %, sample %

  98. honorable mentions • Datadog • New Relic • LightStep • Azure Monitor
  99. TL;DR

  100. tl;dr • You need this

  101. tl;dr • You need this • Docs are lacking

  102. tl;dr • You need this • Docs are lacking • Language support is improving

  103. tl;dr • You need this • Docs are lacking • Language support is improving • One-size-fits-all approaches

  104. tl;dr • You need this • Docs are lacking • Language support is improving • One-size-fits-all approaches • But there are open specs!

  105. Thanks! Write up: rogue.ly/tracing. Lynn Root | SRE | @roguelynn