How Twitter Monitors Millions of Time Series

Yann Ramin

February 12, 2014

Transcript

  1. How Twitter Monitors
    Millions of Time-series
    Yann Ramin
    Observability at Twitter
    Strata Santa Clara - 2014
    @theatrus
    [email protected]

  2. Monitoring for all of Twitter
    Services and Infrastructure

  3. Time series data
    Generating, collecting, storing, querying
    Alerting
    For when you’re not watching
    Tracing
    Distributed systems call tracing
    Concerns

  4. Time series data

  7. Data from services
    Not just hosts

  8. Contrast:
    The “Nagios model”

  9. The website is slow

  10. “Nagios says it can’t connect
    to my webserver”

  11. Why?

  12. ssh me@host uptime
    ssh me@host top
    ssh me@host tail /var/log

  13. Now do that for n > 5
    servers

  14. Logs are unstructured

  15. “Log parsing” is a stop-gap
    Why deploy log parsing rules with applications?

  16. Move beyond logging
    structured statistics

  17. Provide rich and detailed
    instrumentation

  18. Make it cheap
    and easy

  19. First-tier aggregations and
    sampling are in the application
    Incrementing an atomic counter = cheap
    Writing to disk, sending a packet, etc. = expensive
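
    As a rough illustration of that cost gap (a minimal sketch, not Twitter's implementation; all names here are assumed), the first tier can be as simple as an in-process atomic counter that a reporter reads out periodically:

    import java.util.concurrent.atomic.AtomicLong

    object RequestStats {
      // Cheap: bump an in-memory counter on the hot path.
      val requests = new AtomicLong(0L)
      def markRequest(): Unit = requests.incrementAndGet()

      // Expensive work (serializing, writing, sending) happens off the hot path,
      // when a reporter periodically samples the current value.
      def report(): Unit = println(s"requests=${requests.get()}")
    }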

  20. Let's look at Finagle-based
    services
    http://twitter.github.io/finagle/

  21. Lots of great default
    instrumentation
    For network, JVM, etc

  22. Easy to add more

  23. case class StatsFilter(
        name: String,
        statsReceiver: StatsReceiver = NullStatsReceiver
      ) extends SimpleFilter[Things, Unit] {

        private[this] val stats = statsReceiver.scope(name)
        private[this] val all = stats.counter("all")

        def apply(set: Things, service: Service[Things, Unit]): Future[Unit] = {
          all.incr(set.length)
          stats.counter(set.service).incr(set.metrics.length)
          service(set)
        }
      }

  24. case class StatsFilter(
        name: String,
        statsReceiver: StatsReceiver = NullStatsReceiver
      ) extends SimpleFilter[Things, Unit] {

        private[this] val stats = statsReceiver.scope(name)
        private[this] val all = stats.counter("all")

        def apply(set: Things, service: Service[Things, Unit]): Future[Unit] = {
          all.incr(set.length)
          stats.counter(set.service).incr(set.metrics.length)
          service(set)
        }
      }
      Get a StatsReceiver
      Make a scoped receiver
      Create a counter named all
      Increment the counter
      Get a counter named by variable, increment by length
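
    A brief usage sketch (the underlying service is assumed, along with the imports from the code above): because StatsFilter is a SimpleFilter, it composes onto a service with andThen, and every request through the composed service updates the counters above.

    // Hypothetical composition; `underlying` is some Service[Things, Unit].
    val instrumented: Service[Things, Unit] =
      StatsFilter("things", statsReceiver) andThen underlying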

  25. Easy to get out

  26. http://server:port/admin/metrics.json

  27. {...
        "srv/http/request_latency_ms.avg": 45,
        "srv/http/request_latency_ms.count": 181094,
        "srv/http/request_latency_ms.max": 5333,
        "srv/http/request_latency_ms.min": 0,
        "srv/http/request_latency_ms.p50": 37,
        "srv/http/request_latency_ms.p90": 72,
        "srv/http/request_latency_ms.p95": 157,
        "srv/http/request_latency_ms.p99": 308,
        "srv/http/request_latency_ms.p9990": 820,
        "srv/http/request_latency_ms.p9999": 820,
        "srv/http/request_latency_ms.sum": 8149509,
        "srv/http/requests": 18109445,

  28. Great support for approximate
    histograms
    com.twitter.common.stats.ApproximateHistogram
    used as stats.stat("timing").add(datapoint)
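
    For example, a minimal sketch (the scoped `stats` receiver is assumed from the earlier filter, and `doWork` is a hypothetical unit of work) of feeding a request timing into a histogram-backed stat:

    val timing = stats.stat("timing")   // backed by an approximate histogram
    val start = System.currentTimeMillis()
    doWork()                            // hypothetical work being measured
    timing.add((System.currentTimeMillis() - start).toFloat)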

  29. Also, counters & gauges

  30. Twitter-Server
    A simple way to make a Finagle server

    Sets things up the right way

    https://github.com/twitter/twitter-server

  31. What about everything else?
    A very simple HTTP+JSON protocol means this is easy to add to other
    persistent servers

  32. We support ephemeral tasks
    They roll up into a persistent server

  33. Now we’ve replaced ssh with curl
    and this is where Observability comes in

  34. Collection

  36. Distributed Scala service

  37. Find endpoints:
    Zookeeper
    Asset database
    Other sources

  38. Fetch/sample data
    HTTP GET (via Finagle)
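
    A minimal sketch of that fetch path (the destination string is a placeholder and this is not the collector's actual code), using a Finagle HTTP client to pull one endpoint's metrics.json:

    import com.twitter.finagle.{Http, Service}
    import com.twitter.finagle.http.{Method, Request, Response}
    import com.twitter.util.{Await, Future}

    // Placeholder destination; in practice endpoints come from discovery (previous slide).
    val client: Service[Request, Response] = Http.client.newService("host:9990")

    val metricsJson: Future[String] =
      client(Request(Method.Get, "/admin/metrics.json")).map(_.contentString)

    println(Await.result(metricsJson))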

  39. Filter, cleanup, etc
    Hygiene for incoming data

  40. Route to storage layers!
    Time series database, memory pools, queues and
    HDFS aggregators

  41. Metrics are added by default
    Need instrumentation? Just add it!
    Shows up “instantly” in the system

  42. This is good

  43. Easy to use
    No management overhead
    “Can you add a rrd-file for us?”

  44. This is bad
    “Metric name on .toString”
    [I@579b7698
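
    For instance (a one-line illustration, not from the deck), a metric named by a JVM array's default toString produces exactly this kind of garbage:

    val badName = Array(1, 2, 3).toString   // something like "[I@579b7698"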

  45. Remove barriers
    Be defensive
    Pick both

  46. Time series storage

  48. Distributed Scala front-end
    service
    Databases, caches, data conversion, querying, etc.

  49. 220 million time series
    Updated every minute
    When this talk was proposed: 160 million time series

  50. Cassandra
    For real time storage

  51. (Now replaced with an internal
    database)
    Similar enough to Cassandra

  52. Uses KV storage
    For the most part

  53. Multiple clusters per DC
    For different access patterns

  54. We namespace metrics

  55. Service = group
    Source = where
    Metric = what

  56. Row key:
    (service, source, metric)

  57. Columns:
    timestamp = value

  58. Range scan for time series
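
    A minimal sketch (types assumed; not the actual schema code) of the model on the last few slides: one row per (service, source, metric), with timestamp-sorted columns, so reading a time range becomes a range scan over the columns:

    import scala.collection.immutable.SortedMap

    object TimeSeriesStoreSketch {
      // The row key identifies one time series.
      case class RowKey(service: String, source: String, metric: String)

      // Columns within a row: timestamp (ms) -> value, kept sorted by timestamp.
      type Columns = SortedMap[Long, Double]

      // A time-series read is a range scan over the column (timestamp) dimension.
      def read(row: Columns, fromMs: Long, toMs: Long): SortedMap[Long, Double] =
        row.range(fromMs, toMs)
    }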

  59. Tweaks: Optimizations for
    time series
    We never modify old data
    We time-bound old data writes

  60. Informed heuristics to reduce
    SSTables scanned

  61. Easy expiry - drop the whole
    SSTable

  62. Cassandra Counters
    Write time aggregations

  63. “Services as a whole”
    Why read every “source” all the time?
    Write them all into an aggregate

  64. Counters don't scale with cluster size

  65. Limited aggregations
    Sum, Count

  66. Non-idempotent writes

  67. Bad failure modes
    Overcounting? Undercounting? Who knows!

  68. Friends don’t let friends use
    counters
    http://aphyr.com/posts/294-call-me-maybe-cassandra

  69. Expanding storage tiers
    Memcache
    HDFS Logs
    On-demand high resolution samplers

  70. Name indexing

  71. What metrics exist?
    What instances? Hosts?
    Services?

  72. Used in language tools (globs, etc)
    and discovery tools (here is what you
    have)

  73. Index is temporal

  74. “All metrics matching
    http/*, from Oct 1-10”

  75. Maintained as a log of operations
    on a set

  76. t = 0, add metric r
    t = 2, remove metric q

  77. Snapshot to avoid long scans
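
    A minimal sketch (all names assumed) of that design: the index is a log of add/remove operations on the set of metric names, and periodic snapshots bound how much of the log a temporal query has to replay:

    object NameIndexSketch {
      sealed trait Op { def t: Long }
      case class Add(t: Long, metric: String) extends Op
      case class Remove(t: Long, metric: String) extends Op

      // Periodic snapshot: the full set of names as of time `t`.
      case class Snapshot(t: Long, metrics: Set[String])

      // Names that existed at time `at`: start from the latest snapshot at or
      // before `at`, then replay only the log entries in (snapshot.t, at].
      def namesAt(snapshot: Snapshot, log: Seq[Op], at: Long): Set[String] =
        log.filter(op => op.t > snapshot.t && op.t <= at)
          .foldLeft(snapshot.metrics) {
            case (s, Add(_, m))    => s + m
            case (s, Remove(_, m)) => s - m
          }
    }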

  78. Getting data

  80. Ad-hoc queries

  82. Dashboards

  85. Specialized Visualizations: Storm

  86. Everything is built on our query
    language

  87. CQL
    Not the Cassandra one

  88. Functional/declarative language

  89. On-demand
    Don’t need to pre-register queries

  90. Aggregate, correlate and explore

  91. and many more (cross-
    DC federation, etc)

  92. Support matchers and drill down
    from index
    e.g., explore by regex: http*latency.p9999

  93. Ratio of GC activity to requests served
    Get and combine two
    time series

  94. We didn’t create a stat :(
    Get and combine two
    time series

  95. But, we can query it!
    Get and combine two
    time series

  96. ts(cuckoo, members(role.cuckoo_frontend), jvm_gc_msec) /
    ts(cuckoo, members(role.cuckoo_frontend), api/query_count)
    Get and combine two
    time series

  97. Queries work with
    “interactive”
    performance
    When something is wrong, you need data yesterday
    p50 = 2 milliseconds
    p9999 = 2 seconds

  98. Support individual time series
    and aggregates

  99. Common to aggregate
    100-10,000 time
    series
    Over a week

    Still respond within 5 seconds, cold cache

  100. Aggregate partial
    caching
    max(rate(ts({10,000 time series match})))
    Cache this
    result!
    Time-limiting out-of-order arrivals makes this a safe operation

  101. Caching via
    Memcache
    Time-windowed immutable results

    e.g. 1-minute, 5-minute, 30-minute, 3-hour immutable spans

    Replacing with an internal time-series-optimized cache
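
    A minimal sketch (window sizes and names assumed) of why immutable time windows cache well: split the query range into aligned windows, and each completed window can be cached indefinitely under a key derived from the query and the window:

    object WindowCacheSketch {
      // Split [startMs, endMs) into spans aligned to `windowMs` boundaries.
      def windows(startMs: Long, endMs: Long, windowMs: Long): Seq[(Long, Long)] = {
        val first = (startMs / windowMs) * windowMs
        (first until endMs by windowMs).map(w => (w, w + windowMs))
      }

      // Hypothetical cache key for one immutable span of one query.
      def cacheKey(query: String, span: (Long, Long)): String =
        s"$query:${span._1}:${span._2}"
    }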

  102. Read federations
    Tiered storage:
    High temporal resolution, caches, long retention
    Different data centers and storage clusters

  103. Read federations
    Decomposes query, runs fragments next to storage

  104. On-demand secondly resolution
    sampling
    Launch sampler in Apache Mesos
    Discovery for read federation is automatic

  105. Query system uses a
    term rewriter
    structure
    Multi-pass optimizations
    Data source lookups
    Cache modeling
    Costing and very large query avoidance
    Inspired by Stratego/XT
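
    A minimal sketch (the AST and the single rule are assumed, not CQL's actual internals) of a pass in that style: each pass pattern-matches on the expression tree and returns a rewritten tree, so stages such as source lookup, cache modeling, and costing can be layered as separate passes:

    object RewriteSketch {
      sealed trait Expr
      case class Ts(service: String, source: String, metric: String) extends Expr
      case class Rate(e: Expr) extends Expr
      case class Max(e: Expr) extends Expr
      case class Cached(key: String, e: Expr) extends Expr

      // Example pass: mark aggregate-of-rate fragments as cacheable.
      def cachePass(e: Expr): Expr = e match {
        case Max(Rate(inner)) => Cached(s"max-rate:${inner.hashCode}", Max(Rate(inner)))
        case Max(x)           => Max(cachePass(x))
        case Rate(x)          => Rate(cachePass(x))
        case other            => other
      }
    }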

  106. Alerting

  108. Paging and e-mails

  109. Uses CQL
    Adds predicates for conditions
    See, unified access is a good thing!

  110. Widespread
    Watches all key services at Twitter

  111. Distributed Tracing

  112. Zipkin
    https://github.com/twitter/zipkin

    Based on the Dapper paper

  114. Sampled traces of services calling
    services
    Hash of the trace ID mapped to sampling ratio
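
    A minimal sketch (constants assumed) of that decision: hash the trace ID into a bucket and keep the trace when the bucket falls below the configured sample rate, so every service hashing the same trace ID makes the same keep/drop choice:

    object TraceSamplingSketch {
      private val Buckets = 10000

      def shouldSample(traceId: Long, sampleRate: Double): Boolean = {
        // Map the ID to a stable bucket in [0, Buckets).
        val bucket = ((traceId.## % Buckets) + Buckets) % Buckets
        bucket < (sampleRate * Buckets).toInt
      }
    }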

  115. Annotations on traces
    Request parameters, internal timing, servers, clients, etc.

  116. Finagle “upgrades” the Thrift
    protocol
    Finagle calls a test method; if it is supported, it adds a random trace ID
    and span ID to future messages on the connection

  117. Also for HTTP

  118. Force debug capability
    Now with Firefox plugin!
    https://blog.twitter.com/2013/zippy-traces-zipkin-your-browser

  120. Requires services to support
    tracing
    Limited support outside Finagle
    Contributions welcome!

  121. Thanks!
    Yann Ramin
    Observability @ Twitter
    @theatrus
    [email protected]
