Wide Event Analytics (LISA19)

Software is becoming increasingly complex and difficult to debug in production. Yet most of today's monitoring systems are not equipped to handle the high-cardinality data needed to operate large-scale services effectively. It doesn't have to be this way! If we treat monitoring as an analytics problem, we can query our events with far more flexibility, get answers to questions that were previously out of reach, and do so at interactive query latencies.

Igor Wiedler

October 28, 2019

Transcript

  1. wide event analytics
    @igorwhilefalse

  2. hello!

  3. @igorwhilefalse

  4. gentle constructive rant

  5. debugging large scale
    systems using events

  6. understanding
    system behaviour

  7. events column store
    analytical
    queries
    { k: v }
    SELECT ...
    GROUP BY
    users

    app

  8. events column store
    analytical
    queries
    { k: v }
    SELECT ...
    GROUP BY
    users

    app
    you are here

  9. software is becoming
    increasingly complex

  10. Figure 3.15: Storage hierarchy of a WSC
    (The Datacenter as a Computer, Barroso et al)
    ONE SERVER: DRAM 256GB, 100ns, 150GB/s; DISK 80TB, 10ms, 800MB/s; FLASH 4TB, 100us, 3GB/s
    LOCAL RACK (40 SERVERS): DRAM 10TB, 20us, 5GB/s; DISK 3.2PB, 10ms, 5GB/s; FLASH 160TB, 120us, 5GB/s
    CLUSTER (125 RACKS): DRAM 1.28PB, 50us, 1.2GB/s; DISK 400PB, 10ms, 1.2GB/s; FLASH 20PB, 150us, 1.2GB/s

  11. Jaeger, Uber

  12. Philippe M Desveaux

  13. Alexandre Baron

  14. logs vs metrics:
    a false dichotomy
    Nick Stenning

  15. 10.2.3.4 - - [1/Jan/1970:18:32:20
    +0000] "GET / HTTP/1.1" 200 5324
    "-" "curl/7.54.0" "-"

  16. Honeycomb

  17. we can derive metrics
    from log streams

  18. $ cat access.log
    | grep ... | awk ...
    | sort | uniq -c

  19. {
    time = "1970-01-01T18:32:20"
    status = 200
    method = "GET"
    path = ...
    host = "i-123456af"
    client_ip = "10.2.3.4"
    user_agent = "curl/7.54.0"
    request_dur_ms = 325
    request_bytes = 2456
    response_bytes = 5324
    }

  20. structured logs
    summary events
    canonical log lines
    arbitrarily wide data blobs
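
One way to produce such blobs in practice is to build up a single wide event over the life of a request and emit it as one JSON line when the request finishes. A minimal sketch in Python; the field names, the wrapper function, and the stdout sink are illustrative assumptions, not something the talk prescribes.

    import json
    import sys
    import time

    def emit(event):
        # One event per request, written as a single JSON line ("canonical log line").
        sys.stdout.write(json.dumps(event) + "\n")

    def handle_request(method, path, client_ip, do_work):
        # Hypothetical request wrapper: collect fields as the request progresses,
        # then emit exactly one wide event at the end, even on failure.
        event = {
            "time": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime()),
            "method": method,
            "path": path,
            "client_ip": client_ip,
        }
        start = time.monotonic()
        try:
            status, extra = do_work()   # application code returns status + extra context
            event["status"] = status
            event.update(extra)         # e.g. user_id, cache, db_dur_ms, ...
        finally:
            event["request_dur_ms"] = int((time.monotonic() - start) * 1000)
            emit(event)

    # usage sketch
    handle_request("GET", "/", "10.2.3.4",
                   lambda: (200, {"user_id": 30032, "cache": "miss"}))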

  21. ~
    events
    ~

  22. a metric is an aggregation
    of events

  23. why do we aggregate?

  24. (image-only slide)

  25. (image-only slide)

  26. count
    p50
    p99
    max
    histogram
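
Each of these is just a different reduction over the same stream of raw events. A small illustration in Python with hypothetical events and a nearest-rank percentile; the catch, as the following slides argue, is that once only the aggregates are kept, the individual events are gone.

    import math
    from collections import Counter

    events = [{"request_dur_ms": d} for d in (12, 25, 31, 48, 250, 900, 1500)]
    durations = sorted(e["request_dur_ms"] for e in events)

    def percentile(sorted_values, p):
        # nearest-rank percentile over the raw values
        rank = math.ceil(p / 100 * len(sorted_values))
        return sorted_values[max(rank - 1, 0)]

    count = len(durations)
    p50 = percentile(durations, 50)
    p99 = percentile(durations, 99)
    maximum = durations[-1]
    # bucket counts for a latency histogram (bucket upper bounds in ms)
    histogram = Counter(next(b for b in (10, 100, 1000, float("inf")) if d <= b)
                        for d in durations)

    print(count, p50, p99, maximum, dict(histogram))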

  27. events column store
    analytical
    queries
    { k: v }
    SELECT ...
    GROUP BY
    users

    app
    you are here

  28. prometheus and the
    problem with metrics

  29. (image-only slide)

  30. domaso

  31. "it's slow"

  32. Honeycomb

  33. p99(request_latency)
    > 1000ms

  34. 300 requests were slow

    ... which ones?!

  35. group by

  36. most monitoring questions are
    ✨top-k

  37. top traffic by IP address
    top resource usage by customer
    top latency by country
    top error count by host
    top request size by client

  38. how many users
    are impacted?

  39. SELECT user_id, COUNT(*)
    FROM requests
    WHERE request_latency >= 1000
    GROUP BY user_id

  40. metrics will not
    tell you this

  41. ✨ cardinality

  42. Honeycomb

  43. Honeycomb

  44. http_requests_total{status=200}
    http_requests_total{status=201}
    http_requests_total{status=301}
    http_requests_total{status=304}
    ...
    http_requests_total{status=503}
    10

  45. user_id 10k

  46. ip address space = 2^32

    4 billion possible values
    100k

  47. kubectl get pods 100

  48. build_id 100

  49. the curse of
    dimensionality

  50. {
    status = 200
    method = "GET"
    path = ...
    host = "i-123456af"
    zone = "eu-central-1a"
    client_ip = "10.2.3.4"
    user_agent = "curl/7.54.0"
    client_country = "de"
    user_id = 30032
    partition_id = 31

    build_id = "9045e1"
    customer_plan = "platinum"
    endpoint = "tweet_detail"
    }

  51. {
    status = 200
    method = "GET"
    path = ...
    host = "i-123456af"
    zone = "eu-central-1a"
    client_ip = "10.2.3.4"
    user_agent = "curl/7.54.0"
    client_country = "de"
    user_id = 30032
    partition_id = 31

    build_id = "9045e1"
    customer_plan = "platinum"
    endpoint = "tweet_detail"
    }
    per-field cardinality estimates: 10, 5, 300, 20, 5, 1k, 300, 20, 1k, 32, 10, 3, 20

  52. 10 ✖ 5 ✖ 300 ✖ 20 ✖ 5 ✖ 1k ✖ 300 ✖ 20 ✖ 1k ✖ 32 ✖ 10 ✖ 3 ✖ 20
    = 172'800'000'000'000'000'000
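
As a sanity check, multiplying those thirteen per-field cardinality estimates gives the number of distinct label combinations a metrics system would have to track as separate time series. A one-liner to verify the figure on the slide:

    from math import prod

    cardinalities = [10, 5, 300, 20, 5, 1_000, 300, 20, 1_000, 32, 10, 3, 20]
    print(prod(cardinalities))   # 172800000000000000000, i.e. roughly 1.7 * 10^20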



  53. TheUjulala

  54. events column store
    analytical
    queries
    { k: v }
    SELECT ...
    GROUP BY
    users

    app
    you are here

  55. recording events

  56. (image-only slide)

  57. {
    time = "1970-01-01T18:32:20"
    status = 200
    method = "GET"
    path = ...
    host = "i-123456af"
    region = "eu-central-1"
    zone = "eu-central-1a"
    client_ip = "10.2.3.4"
    user_agent = "curl/7.54.0"
    client_country = "de"
    kernel = "5.0.0-1018-aws"
    user_id = 30032
    tweet_id = 2297111098
    partition_id = 31

    build_id = "9045e1"

    request_id = "f2a3bdc4"
    customer_plan = "platinum"
    feature_blub = true
    cache = "miss"
    endpoint = "tweet_detail"
    request_dur_ms = 325
    db_dur_ms = 5
    db_pool_dur_ms = 3
    db_query_count = 63
    cache_dur_ms = 2
    svc_a_dur_ms = 32
    svc_b_dur_ms = 90
    request_bytes = 2456
    response_bytes = 5324
    }

  58. {
    time = "1970-01-01T18:32:20"
    status = 200
    method = "GET"
    path = ...
    host = "i-123456af"
    region = "eu-central-1"
    zone = "eu-central-1a"
    client_ip = "10.2.3.4"
    user_agent = "curl/7.54.0"
    client_country = "de"
    kernel = "5.0.0-1018-aws"
    }

  59. {
    user_id = 30032
    tweet_id = 2297111098
    partition_id = 31

    build_id = "9045e1"

    request_id = "f2a3bdc4"
    customer_plan = "platinum"
    feature_blub = true
    cache = "miss"
    endpoint = "tweet_detail"
    }

  60. {
    request_dur_ms = 325
    db_dur_ms = 5
    db_pool_dur_ms = 3
    db_query_count = 63
    cache_dur_ms = 2
    svc_a_dur_ms = 32
    svc_b_dur_ms = 90
    request_bytes = 2456
    response_bytes = 5324
    }

  61. Jaeger, Uber

  62. traces vs events:
    a false dichotomy

  63. we can derive events
    from traces
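
One way to do that: collapse all spans of a trace into a single wide event keyed by the root span, with one duration field per downstream service. A sketch in Python with a made-up span format; the field names mirror the earlier example event but are assumptions.

    def trace_to_event(spans):
        # The root span (no parent) describes the request as a whole.
        root = next(s for s in spans if s["parent_id"] is None)
        event = {
            "request_id": root["trace_id"],
            "endpoint": root["name"],
            "request_dur_ms": root["duration_ms"],
        }
        # Fold child spans into per-service duration fields.
        for span in spans:
            if span is not root:
                key = f'{span["service"]}_dur_ms'
                event[key] = event.get(key, 0) + span["duration_ms"]
        return event

    spans = [
        {"trace_id": "f2a3bdc4", "parent_id": None, "name": "tweet_detail",
         "service": "api", "duration_ms": 325},
        {"trace_id": "f2a3bdc4", "parent_id": "api", "name": "select",
         "service": "db", "duration_ms": 5},
        {"trace_id": "f2a3bdc4", "parent_id": "api", "name": "get",
         "service": "cache", "duration_ms": 2},
    ]
    print(trace_to_event(spans))
    # {'request_id': 'f2a3bdc4', 'endpoint': 'tweet_detail',
    #  'request_dur_ms': 325, 'db_dur_ms': 5, 'cache_dur_ms': 2}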

  64. Canopy Events
    [Figure from the Canopy paper]
    (a) Engineers instrument Facebook components using a range of different
    Canopy instrumentation APIs. At runtime, requests traverse components and
    propagate a TraceID; when requests trigger instrumentation, Canopy
    generates and emits events.
    (b) Canopy's tailer aggregates events, constructs model-based traces,
    evaluates user-supplied feature extraction functions, and pipes output to
    user-defined datasets. Users subsequently run queries, view dashboards
    and explore datasets.
    Pipeline stages: Event Aggregation, Model Construction, Feature
    Extraction, Query Evaluation, Query Results / Visualizations / Graphs.
    Canopy, Facebook

  65. (image-only slide)

  66. stick those events in kafka
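
For instance, with the kafka-python client each wide event can be published as one message on a topic that the column store then consumes in batches. The topic name, broker address, and keying-by-host below are assumptions, not part of the talk.

    import json
    from kafka import KafkaProducer   # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda event: json.dumps(event).encode("utf-8"),
    )

    def publish(event):
        # keying by host keeps one host's events in order on a single partition
        producer.send("events", key=event.get("host", "").encode("utf-8"), value=event)

    publish({"status": 200, "host": "i-123456af", "request_dur_ms": 325})
    producer.flush()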

  67. events column store
    analytical
    queries
    { k: v }
    SELECT ...
    GROUP BY
    users

    app
    you are here

  68. columnar storage
    changed my life

  69. Table 2.3: Latency numbers that every WSC engineer should know
    (The Datacenter as a Computer, Barroso et al; updated version of the table from [Dea09])
    L1 cache reference: 1.5 ns
    L2 cache reference: 5 ns
    Branch misprediction: 6 ns
    Uncontended mutex lock/unlock: 20 ns
    L3 cache reference: 25 ns
    Main memory reference: 100 ns
    Decompress 1 KB with Snappy [Sna]: 500 ns
    "Far memory"/Fast NVM reference: 1,000 ns (1 us)
    Compress 1 KB with Snappy [Sna]: 2,000 ns (2 us)
    Read 1 MB sequentially from memory: 12,000 ns (12 us)
    SSD random read: 100,000 ns (100 us)
    Read 1 MB sequentially from SSD: 500,000 ns (500 us)
    Read 1 MB sequentially from 10Gbps network: 1,000,000 ns (1 ms)
    Read 1 MB sequentially from disk: 10,000,000 ns (10 ms)
    Disk seek: 10,000,000 ns (10 ms)
    Send packet California→Netherlands→California: 150,000,000 ns (150 ms)

  70. • 1TB Hitachi Deskstar 7K1000
    • disk seek time = 14ms
    • transfer rate = 69MB/s
    • 62.5 billion rows (= 1TB / 16 bytes)
    • 28 years (= 62.5 billion rows * 14 ms/row / 32×10^9 ms/year)
    The Trouble with Point Queries, Bradley C. Kuszmaul

  71. • 1TB Hitachi Deskstar 7K1000
    • transfer rate = 69MB/s
    • 4 hours (= 1.000.000MB / 69MB/s / 3600 s/hour)

  72. • SSD
    • transfer rate = 1GB/s
    • 15 minutes (= 1.000GB / 1GB/s / 60 s/min)

  73. 10GB

  74. Dremel: Interactive Analysis of Web-Scale Datasets, Google

  75. 10 GB / 8 bytes per data point
    = 1.3 billion
    events
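
The back-of-envelope arithmetic from the last few slides, spelled out; the sizes and transfer rates are the ones quoted on the slides.

    def scan_seconds(total_bytes, bytes_per_second):
        # sequential scan time: size divided by transfer rate
        return total_bytes / bytes_per_second

    print(scan_seconds(1e12, 69e6) / 3600)   # ~4 hours:   1 TB spinning disk at 69 MB/s
    print(scan_seconds(1e12, 1e9) / 60)      # ~17 minutes: 1 TB SSD at 1 GB/s
    print(scan_seconds(10e9, 1e9))           # ~10 seconds: one 10 GB column at 1 GB/s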

  76. status (raw column):
    200, 200, 200, 200, 404, 200, 200, 200, 404, 200
    status (run-length encoded):
    4 * 200, 404, 3 * 200, 404, 200
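
The second representation is essentially run-length encoding, which works well on low-cardinality columns such as status. A tiny sketch:

    from itertools import groupby

    status_column = [200, 200, 200, 200, 404, 200, 200, 200, 404, 200]
    # collapse consecutive runs of equal values into (run_length, value) pairs
    rle = [(len(list(group)), value) for value, group in groupby(status_column)]
    print(rle)   # [(4, 200), (1, 404), (3, 200), (1, 404), (1, 200)]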

  77. time-based partitioning
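
A sketch of what time-based partitioning can look like: events are routed into hourly segments, so a query over a short time range only scans a few segments, and expiring old data is a cheap delete. The directory layout here is made up.

    from datetime import datetime

    def partition_path(event):
        # route each event to an hourly segment based on its timestamp
        t = datetime.fromisoformat(event["time"])
        return t.strftime("events/date=%Y-%m-%d/hour=%H/part.jsonl")

    print(partition_path({"time": "1970-01-01T18:32:20"}))
    # events/date=1970-01-01/hour=18/part.jsonl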

  78. dynamic sampling

  79. it's lossy, but that's fine
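
Dynamic sampling is one way to accept that loss deliberately: keep everything interesting, keep a fraction of the boring majority, and record the sample rate so later aggregations can re-weight. A sketch with made-up thresholds:

    import random

    def maybe_keep(event, base_rate=100):
        # always keep errors and slow requests
        if event.get("status", 200) >= 500 or event.get("request_dur_ms", 0) >= 1000:
            event["sample_rate"] = 1
            return event
        # keep roughly 1 in base_rate of everything else
        if random.randrange(base_rate) == 0:
            event["sample_rate"] = base_rate   # each kept event stands in for ~base_rate dropped ones
            return event
        return None   # dropped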

  80. vectorized processing
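
Vectorized processing means the engine operates on whole column chunks at a time rather than row by row. A rough illustration with numpy standing in for a column store's scan kernels; the column names and data are made up.

    import numpy as np

    n = 10_000_000   # pretend this is one chunk of a much larger dataset
    request_dur_ms = np.random.exponential(scale=100, size=n)
    status = np.random.choice([200, 404, 500], size=n, p=[0.97, 0.02, 0.01])

    # one tight pass per column: filter, then aggregate
    slow_errors = (request_dur_ms > 1000) & (status >= 500)
    print(slow_errors.sum(), np.percentile(request_dur_ms, 99))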

  81. Scuba: Diving into Data at Facebook, Facebook

  82. sequential scans

    columnar layout

    time-based partitioning

    compression / sampling

    vectorized processing

    sharding

  83. (image-only slide)

  84. putting it all
    together

  85. events column store
    analytical
    queries
    { k: v }
    SELECT ...
    GROUP BY
    users

    app

  86. we need more of this
    in the monitoring space!

  87. SELECT user_id, COUNT(*)
    FROM requests
    WHERE status >= 500
    GROUP BY user_id
    ORDER BY COUNT(*) DESC
    LIMIT 10

  88. ✨ top-k
    ✨ cardinality
    ✨ events

  89. (image-only slide)

  90. • Dremel: Interactive Analysis of Web-Scale Datasets from Google, 2010
    • Scuba: Diving into Data at Facebook from Facebook, 2016
    • Canopy: An End-to-End Performance Tracing And Analysis System from Facebook, 2017
    • Look at Your Data by John Rauser, Velocity 2011
    • Observability for Emerging Infra by Charity Majors, Strange Loop 2017
    • Why We Built Our Own Distributed Column Store by Sam Stokes, Strange Loop 2017
    • The Design and Implementation of Modern Column-Oriented Database Systems by Abadi et al, 2013
    • Designing Data-Intensive Applications by Martin Kleppmann, 2017
    • Monitoring in the time of Cloud Native by Cindy Sridharan, 2017
    • Logs vs. metrics: a false dichotomy by Nick Stenning, 2019
    • Using Canonical Log Lines for Online Visibility by Brandur Leach, 2016
    • The Datacenter as a Computer: Designing Warehouse-Scale Machines by Barroso et al, 2018

  91. @igorwhilefalse
    [email protected]
    .io
