a day in the life of a request

Igor Wiedler

March 23, 2019

Transcript

  1. a day in the life of a request

  2. None
  3. hello!

  4. why is it slow?

  5. latency

  6. t0 t1

  7. Designs, Lessons and Advice from Building Large Distributed Systems, Jeff Dean
  8. What is in the tail? [Plot: percentage of requests against latency (ms)] Measuring and Optimizing Tail Latency, Kathryn McKinley
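
Tail latency is usually reported as a high percentile of the latency distribution. The sketch below computes nearest-rank percentiles in Go; the `percentile` helper and the sample values are made up for illustration and do not come from the plot.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100)
// of the given latencies. It sorts a copy, leaving the input untouched.
func percentile(latenciesMs []float64, p float64) float64 {
	if len(latenciesMs) == 0 {
		return 0
	}
	sorted := append([]float64(nil), latenciesMs...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// Made-up sample: 98 fast requests and 2 slow ones.
	samples := make([]float64, 0, 100)
	for i := 0; i < 98; i++ {
		samples = append(samples, 1.0)
	}
	samples = append(samples, 40.0, 90.0)
	fmt.Println("p50:", percentile(samples, 50))
	fmt.Println("p99:", percentile(samples, 99))
	fmt.Println("p100:", percentile(samples, 100))
}
```

In this sample the median says nothing about the two slow requests; only the high percentiles reveal them.
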
  9. Benchmarking "Hello, World!", Dick Sites

  10. Amdahl's law, Wikipedia
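
For reference, Amdahl's law in its usual form. The symbols are the conventional ones rather than anything shown on the slide: p is the fraction of execution time that benefits from an optimization, and s is the speedup of that fraction.

```latex
S(s) = \frac{1}{(1 - p) + \frac{p}{s}},
\qquad
\lim_{s \to \infty} S(s) = \frac{1}{1 - p}
```

As a worked example, with p = 0.5 and s = 2 the overall speedup is 1 / (0.5 + 0.25) ≈ 1.33, and no value of s can push it past 1 / (1 - 0.5) = 2. For latency work this means that optimizing a component only pays off in proportion to its share of the request's time.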

  11. Example 2: Task Scheduling in Spark [Figure: driver and workers W1, W2, W3; % time per window under conventional profiling and under SnailTrail critical participation] SnailTrail, Hoffmann et al
  12. CPU Flame Graphs, Brendan Gregg

  13. Systems Performance by Brendan Gregg

  14. None
  15. <span>

  16. The Gantt Chart: A Working Tool of Management, Henry Wallace Clark
  17. Twitter Dot Com, Google Chrome

  18. Symfony

  19. Dapper, Google

  20. func ProcessVideo(ctx, video) {
          ctx, span := trace.StartSpan(ctx, "ProcessVideo")
          defer span.End()
          video.Process()
      }
  21. things this helps debug

  22. Travis CI

  23. func (rl *redisRateLimiter) RateLimit(...) {
          conn := rl.pool.Get()
          defer conn.Close()
          ctx, span := trace.StartSpan(ctx, "Redis.RateLimit")
          defer span.End()
          ...
      }
  24. tx0 tx1 tx2 tx3 tx4 tx5 ... blocked

  25. </span>

  26. context propagation

  27. Dapper, Google

  28. X-Request-ID

  29. SELECT COUNT(*) FROM likes WHERE artist = 'CHVRCHES'

  30. SELECT COUNT(*) FROM likes WHERE artist = 'CHVRCHES' /*request_id:123e4567-e89b-12d3-a456-426655440000*/
      Marginalia, Basecamp
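
Marginalia itself is a Ruby library. The Go sketch below only illustrates the idea of tagging a query with the request ID as a trailing comment. The `annotate` helper and its character whitelist are assumptions made for this example, not part of any library.

```go
package main

import (
	"fmt"
	"regexp"
)

// safeID matches the characters allowed in the annotation. Anything else
// is rejected so the ID cannot close the comment and inject SQL.
var safeID = regexp.MustCompile(`^[A-Za-z0-9-]+$`)

// annotate is a hypothetical helper that appends the request ID to a
// query as a trailing comment. It returns the query unchanged when the
// ID contains unexpected characters.
func annotate(query, requestID string) string {
	if !safeID.MatchString(requestID) {
		return query
	}
	return fmt.Sprintf("%s /*request_id:%s*/", query, requestID)
}

func main() {
	q := "SELECT COUNT(*) FROM likes WHERE artist = 'CHVRCHES'"
	fmt.Println(annotate(q, "123e4567-e89b-12d3-a456-426655440000"))
}
```

The comment does not change what the query does, but it shows up in slow query logs and lets a slow statement be matched back to the request that issued it.
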
  31. EXPLAIN ANALYZE SELECT COUNT(*) FROM likes WHERE artist = 'CHVRCHES' /*request_id:123e4567-e89b-12d3-a456-426655440000*/
  32. Aggregate
        Buffers: shared hit=74 read=41
        ->  Index Only Scan using likes_artist_idx on likes
              Index Cond: (artist = 'CHVRCHES'::text)
              Heap Fetches: 10000
              Buffers: shared hit=74 read=41
      Planning Time: 0.344 ms
      Execution Time: 5.182 ms
  33. req, err := http.NewRequest("GET", serviceURL, nil)
      req.Header.Add("X-Request-ID", requestID)
      resp, err := client.Do(req)
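
The snippet above sets the header on a single outgoing call. A fuller sketch of propagation, assuming a hypothetical `forward` handler and two local test servers from `net/http/httptest` standing in for real services, copies the incoming X-Request-ID onto the downstream request:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// forward builds a handler that copies the incoming X-Request-ID header
// onto the request it makes to a downstream service.
func forward(client *http.Client, serviceURL string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		req, err := http.NewRequestWithContext(r.Context(), "GET", serviceURL, nil)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		req.Header.Set("X-Request-ID", r.Header.Get("X-Request-ID"))
		resp, err := client.Do(req)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		io.Copy(w, resp.Body)
	}
}

// roundTrip sends one request through a frontend and returns the ID the
// downstream service observed.
func roundTrip(requestID string) string {
	downstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.Header.Get("X-Request-ID"))
	}))
	defer downstream.Close()
	frontend := httptest.NewServer(forward(downstream.Client(), downstream.URL))
	defer frontend.Close()

	req, _ := http.NewRequest("GET", frontend.URL, nil)
	req.Header.Set("X-Request-ID", requestID)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return ""
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	return string(body)
}

func main() {
	fmt.Println(roundTrip("123e4567-e89b-12d3-a456-426655440000"))
}
```

Every hop has to do this copying. A single service that drops the header breaks the chain for everything downstream of it.
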
  34. Canopy, Facebook

  35. sampling

  36. Dapper, Google

  37. sampling decision
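
One common way to make a sampling decision consistent across services is to derive it deterministically from the trace ID, so that every hop agrees without coordinating. The sketch below assumes that approach and uses a hypothetical `sampled` helper. It is not necessarily the scheme shown in the talk or the one used by Dapper.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// sampled makes a deterministic sampling decision from the trace ID, so
// every service that sees the same ID reaches the same answer.
// rate is the fraction of traces to keep, between 0 and 1.
func sampled(traceID string, rate float64) bool {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	// Map the 32-bit hash onto [0, 1) and compare it with the rate.
	return float64(h.Sum32())/float64(1<<32) < rate
}

func main() {
	kept := 0
	for i := 0; i < 10000; i++ {
		if sampled(fmt.Sprintf("trace-%d", i), 0.01) {
			kept++
		}
	}
	fmt.Printf("kept %d of 10000 traces\n", kept)
}
```

The alternative is to decide once at the root of the request and pass the decision along with the trace context, which avoids hashing at every hop.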

  38. Travis CI

  39. finding interesting traces

  40. Honeycomb

  41. LightStep

  42. group by customer

  43. happy path can also be interesting!

  44. visualization

  45. Jaeger, Uber

  46. where do we go from here?

  47. aggregation

  48. None
  49. Canopy, Facebook

  50. Canopy, Facebook

  51. Pivot Tracing, Mace et al

  52. Pivot Tracing, Mace et al

  53. kernel tracing

  54. Systems Performance by Brendan Gregg

  55. Debugging Latency in Go 1.11, Jaana B. Dogan

  56. eBPF

  57. None
  58. Performance Analysis of Cloud Applications, Google

  59. Performance Analysis of Cloud Applications, Google

  60. Benchmarking "Hello, World!", Dick Sites

  61. Benchmarking "Hello, World!", Dick Sites

  62. Go Dynamic Tools, Dmitry Vyukov, GopherCon 2015

  63. Visualization: Statemaps The Hurricane’s Butterfly, Bryan Cantrill

  64. Stacked statemaps across machines Visualizing Systems with Statemaps, Bryan Cantrill

  65. adaptively improving tail latency

  66. "long requests reveal themselves" ~ Kathryn McKinley

  67. The Tail: Longest 200 requests [Figure: latency (ms) of the top 200 requests, broken down into network and networking queueing time, idle time, CPU time, and dispatch queueing time; causes labeled as noise or not noise: network imperfections, OS imperfections, long requests, overload] Measuring and Optimizing Tail Latency, Kathryn McKinley
  68. dealing with noise

  69. speeding up work

  70. recap • tail latency matters • tracing helps debug it

  71. OpenCensus

  72. the morning paper blog.acolyer.org

  73. • Dapper, a Large-Scale Distributed Systems Tracing Infrastructure from Google, 2010
      • Scuba: Diving into Data at Facebook from Facebook, 2016
      • Canopy: An End-to-End Performance Tracing And Analysis System from Facebook, 2017
      • Performance Analysis of Cloud Applications from Google, 2018
      • Systems Performance: Enterprise and the Cloud by Brendan Gregg, 2013
      • The Tail at Scale by Jeff Dean and Luiz André Barroso, 2013
      • Designs, Lessons and Advice from Building Large Distributed Systems by Jeff Dean, 2009
      • Data Center Computers: Modern Challenges in CPU Design by Dick Sites, 2015
      • Measuring and Optimizing Tail Latency by Kathryn McKinley, Strange Loop 2017
      • Benchmarking "Hello, World!" by Dick Sites, 2018
      • Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems by Mace et al, 2015
      • RobinHood: Tail Latency Aware Caching by Berger et al, 2018
      • SnailTrail: Generalizing Critical Paths for Online Analysis of Distributed Dataflows by Hoffmann et al, 2018
  74. thanks! @igorwhilefalse