
How Twitter Monitors Millions of Time Series

Yann Ramin
February 12, 2014


  1. How Twitter Monitors Millions of Time-series Yann Ramin Observability at

    Twitter Strata Santa Clara - 2014 @theatrus [email protected]
  2. Monitoring for all of Twitter Services and Infrastructure

  3. Time series data: generating, collecting, storing, querying. Alerting: for when you're not watching. Tracing: distributed systems call tracing. Concerns.
  4. Time series data

  5. (image slide)
  6. (image slide)
  7. Data from services Not just hosts

  8. Contrast: The “Nagios model”

  9. The website is slow

  10. “Nagios says it can’t connect to my webserver”

  11. Why?

  12. ssh [email protected] uptime
      ssh [email protected] top
      ssh [email protected] tail /var/log
  13. Now do that for n > 5 servers

  14. Logs are unstructured

  15. "Log parsing" is a stop-gap. Why deploy log-parsing rules with applications?
  16. Move beyond logging: structured statistics

  17. Provide rich and detailed instrumentation

  18. Make it cheap and easy

  19. First-tier aggregation and sampling happen in the application. Incrementing an atomic counter = cheap; writing to disk, sending a packet, etc. = expensive
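The cheap-hot-path idea above can be sketched in a few lines. This is a hypothetical illustration, not Twitter's API: the names `Counters`, `counter`, and `snapshot` are invented. Increments touch only an in-memory atomic; the expensive work (shipping values over the network) happens in a separate exporter off the hot path.

```scala
import java.util.concurrent.atomic.AtomicLong
import scala.collection.concurrent.TrieMap

// Hypothetical in-process counter registry: an increment touches only an
// AtomicLong; nothing is written to disk or the network on the hot path.
object Counters {
  private val counters = TrieMap.empty[String, AtomicLong]

  def counter(name: String): AtomicLong =
    counters.getOrElseUpdate(name, new AtomicLong(0))

  // A periodic exporter (not shown) would read this snapshot and ship it
  // over HTTP, keeping the expensive I/O out of the request path.
  def snapshot: Map[String, Long] =
    counters.map { case (k, v) => (k, v.get) }.toMap
}

Counters.counter("requests").incrementAndGet()
Counters.counter("requests").addAndGet(2L)
```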
  20. Let's look at Finagle-based services: http://twitter.github.io/finagle/

  21. Lots of great default instrumentation for network, JVM, etc.

  22. Easy to add more

  23. case class StatsFilter(
          name: String,
          statsReceiver: StatsReceiver = NullStatsReceiver
      ) extends SimpleFilter[Things, Unit] {

        private[this] val stats = statsReceiver.scope(name)
        private[this] val all = stats.counter("all")

        def apply(set: Things, service: Service[Things, Unit]): Future[Unit] = {
          all.incr(set.length)
          stats.counter(set.service).incr(set.metrics.length)
          service(set)
        }
      }
  24. case class StatsFilter(
          name: String,
          statsReceiver: StatsReceiver = NullStatsReceiver   // Get a StatsReceiver
      ) extends SimpleFilter[Things, Unit] {

        private[this] val stats = statsReceiver.scope(name)  // Make a scoped receiver
        private[this] val all = stats.counter("all")         // Create a counter named "all"

        def apply(set: Things, service: Service[Things, Unit]): Future[Unit] = {
          all.incr(set.length)                               // Increment the counter
          stats.counter(set.service).incr(set.metrics.length) // Get a counter named by variable, increment by length
          service(set)
        }
      }
  25. Easy to get out

  26. http://server:port/admin/metrics.json

  27. {
        ...
        "srv/http/request_latency_ms.avg": 45,
        "srv/http/request_latency_ms.count": 181094,
        "srv/http/request_latency_ms.max": 5333,
        "srv/http/request_latency_ms.min": 0,
        "srv/http/request_latency_ms.p50": 37,
        "srv/http/request_latency_ms.p90": 72,
        "srv/http/request_latency_ms.p95": 157,
        "srv/http/request_latency_ms.p99": 308,
        "srv/http/request_latency_ms.p9990": 820,
        "srv/http/request_latency_ms.p9999": 820,
        "srv/http/request_latency_ms.sum": 8149509,
        "srv/http/requests": 18109445,
        ...
      }
  28. Great support for approximate histograms: com.twitter.common.stats.ApproximateHistogram, used as stats.stat("timing").add(datapoint)
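To make the percentile outputs above concrete, here is a toy, exact stand-in for the histogram idea. It is not ApproximateHistogram: the real class keeps bounded memory via approximation, while this sketch buffers every sample and only illustrates what reading p50/p99 from recorded timings means.

```scala
// Toy histogram: buffer samples, read off percentiles by sorting.
// Unbounded memory; for illustration only.
class ToyHistogram {
  private var samples = Vector.empty[Long]

  def add(datapoint: Long): Unit = samples :+= datapoint

  // Nearest-rank style percentile; p in [0.0, 1.0].
  def percentile(p: Double): Long = {
    val sorted = samples.sorted
    sorted(((sorted.size - 1) * p).toInt)
  }
}

val h = new ToyHistogram
(1L to 100L).foreach(h.add)
// h.percentile(0.5) is the median of 1..100
```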

  29. Also, counters & gauges

  30. Twitter-Server: a simple way to make a Finagle server. Sets things up the right way. https://github.com/twitter/twitter-server
  31. What about everything else? A very simple HTTP+JSON protocol means this is easy to add to other persistent servers
  32. We support ephemeral tasks, rolled up into a persistent server

  33. Now we've replaced ssh with curl, and this is where Observability comes in
  34. Collection

  35. (image slide)
  36. Distributed Scala service

  37. Find endpoints: ZooKeeper, asset database, other sources

  38. Fetch/sample data HTTP GET (via Finagle)

  39. Filter, cleanup, etc.: hygiene for incoming data

  40. Route to storage layers: time series database, memory pools, queues and HDFS aggregators
  41. Metrics are added by default Need instrumentation? Just add it!

    Shows up “instantly” in the system
  42. This is good

  43. Easy to use, no management overhead. No more "Can you add an rrd-file for us?"
  44. This is bad: metric name from .toString ("[[email protected]")

  45. Remove barriers. Be defensive. Pick both.

  46. Time series storage

  47. (image slide)
  48. Distributed Scala front-end service Databases, caches, data conversion, querying, etc.

  49. 220 million time series, updated every minute. When this talk was proposed: 160 million time series.
  50. Cassandra For real time storage

  51. (Now replaced with an internal database) Similar enough to Cassandra

  52. Uses KV storage For the most part

  53. Multiple clusters per DC For different access patterns

  54. We namespace metrics

  55. Service = group Source = where Metric = what

  56. Row key: (service, source, metric)

  57. Columns: timestamp = value

  58. Range scan for time series
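The schema on slides 54-58 can be sketched with an in-memory sorted map standing in for Cassandra. This is an illustrative model only (`ToyTsdb` and its methods are invented names): one row per (service, source, metric), timestamp-keyed columns, and a time-range query as a single range scan within one row.

```scala
import scala.collection.immutable.TreeMap

// Row key: (service, source, metric); columns: timestamp -> value.
case class RowKey(service: String, source: String, metric: String)

class ToyTsdb {
  private var rows = Map.empty[RowKey, TreeMap[Long, Double]]

  def write(key: RowKey, ts: Long, value: Double): Unit = {
    val cols = rows.getOrElse(key, TreeMap.empty[Long, Double])
    rows += key -> (cols + (ts -> value))
  }

  // A time range query is one range scan over the sorted columns of a row.
  def scan(key: RowKey, from: Long, until: Long): Seq[(Long, Double)] =
    rows.getOrElse(key, TreeMap.empty[Long, Double]).range(from, until).toSeq
}

val db = new ToyTsdb
val key = RowKey("web", "host1", "request_latency_ms.p99")
db.write(key, 60L, 308.0)
db.write(key, 120L, 310.0)
db.write(key, 180L, 305.0)
```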

  59. Tweaks: optimizations for time series. We never modify old data; we time-bound writes to old data.
  60. Informed heuristics to reduce SSTables scanned

  61. Easy expiry - drop the whole SSTable

  62. Cassandra Counters Write time aggregations

  63. "Services as a whole": why read every "source" all the time? Write them all into an aggregate.
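The write-time aggregation idea can be sketched as follows. This is an invented illustration (`ToyAggregatingStore` is not Twitter's code; the real system used Cassandra counters for this): each write also updates a pre-computed service-wide row, so the aggregate read never fans out to every source. As the next slides note, the aggregations are limited to sum and count.

```scala
// Each write updates both the per-source cell and a service-wide
// (sum, count) aggregate row, so reads of "the service as a whole"
// never have to scan every source.
class ToyAggregatingStore {
  private var perSource = Map.empty[(String, String, String), Double] // (service, source, metric)
  private var aggregate = Map.empty[(String, String), (Double, Long)] // (service, metric) -> (sum, count)

  def write(service: String, source: String, metric: String, value: Double): Unit = {
    perSource = perSource + (((service, source, metric), value))
    val (sum, count) = aggregate.getOrElse((service, metric), (0.0, 0L))
    aggregate = aggregate + (((service, metric), (sum + value, count + 1))) // limited: Sum, Count
  }

  def serviceSum(service: String, metric: String): Double =
    aggregate.getOrElse((service, metric), (0.0, 0L))._1
}

val store = new ToyAggregatingStore
store.write("web", "host1", "requests", 10.0)
store.write("web", "host2", "requests", 20.0)
```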
  64. They don't scale with cluster size

  65. Limited aggregations Sum, Count

  66. Non-idempotent writes

  67. Bad failure modes: overcounting? Undercounting? Who knows!

  68. Friends don’t let friends use counters http://aphyr.com/posts/294-call-me-maybe-cassandra

  69. Expanding storage tiers: Memcache, HDFS, logs, on-demand high-resolution samplers

  70. Name indexing

  71. What metrics exist? What instances? Hosts? Services?

  72. Used in language tools (globs, etc) and discovery tools (here

    is what you have)
  73. Index is temporal

  74. “All metrics matching http/*, from Oct 1-10”

  75. Maintained as a log of operations on a set

  76. t = 0: add metric r; t = 2: remove metric q
  77. Snapshot to avoid long scans
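Slides 75-77 describe the index as a log of set operations plus snapshots. A minimal sketch, with invented names (`ToyIndex`, `Add`, `Remove`) and an arbitrary snapshot interval: answering "which metrics existed at time t" replays the log from the latest snapshot at or before t, instead of from the beginning.

```scala
// Temporal name index as a log of add/remove operations on a set,
// with periodic snapshots so queries avoid long scans of the log.
sealed trait Op { def t: Long; def metric: String }
case class Add(t: Long, metric: String) extends Op
case class Remove(t: Long, metric: String) extends Op

class ToyIndex {
  private var log = Vector.empty[Op]
  private var snapshots = Vector.empty[(Long, Set[String])] // (time, members)

  def append(op: Op): Unit = {
    log = log :+ op
    // Arbitrary snapshot interval for illustration.
    if (log.size % 1000 == 0) snapshots = snapshots :+ ((op.t, membersAt(op.t)))
  }

  // Metrics that existed at time t: start from the latest snapshot <= t,
  // replay only the tail of the log.
  def membersAt(t: Long): Set[String] = {
    val (t0, base) = snapshots.filter(_._1 <= t).lastOption
      .getOrElse((Long.MinValue, Set.empty[String]))
    log.filter(op => op.t > t0 && op.t <= t).foldLeft(base) {
      case (s, Add(_, m))    => s + m
      case (s, Remove(_, m)) => s - m
    }
  }
}

// The example from slide 76:
val idx = new ToyIndex
idx.append(Add(0L, "r"))
idx.append(Add(1L, "q"))
idx.append(Remove(2L, "q"))
```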

  78. Getting data

  79. (image slide)
  80. Ad-hoc queries

  81. (image slide)
  82. Dashboards

  83. (image slide)
  84. (image slide)
  85. Specialized Visualizations: Storm

  86. Everything is built on our query language

  87. CQL (not the Cassandra one)

  88. Functional/declarative language

  89. On-demand Don’t need to pre-register queries

  90. Aggregate, correlate and explore

  91. and many more (cross-DC federation, etc.)

  92. Support matchers and drill-down from the index, i.e., explore by regex: http*latency.p9999
  93. Ratio of GC activity to requests served. Get and combine two time series
  94. We didn’t create a stat :( Get and combine two

    time series
  95. But, we can query it! Get and combine two time series

  96. ts(cuckoo, members(role.cuckoo_frontend), jvm_gc_msec) /
      ts(cuckoo, members(role.cuckoo_frontend), api/query_count)
      Get and combine two time series
  97. Queries work with "interactive" performance. When something is wrong, you need data yesterday. p50 = 2 milliseconds, p9999 = 2 seconds
  98. Support individual time series and aggregates

  99. Common to aggregate 100-10,000 time series over a week. Still responds within 5 seconds, cold cache.
  100. Aggregate partial caching: max(rate(ts({10,000 time series match}))). Cache this result! Time-limiting out-of-order arrivals makes this a safe operation.
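The safety argument above can be sketched as a window-aligned cache. This is an invented illustration (`WindowCache` and its fields are not Twitter's code): because out-of-order arrivals are time-bounded, an aggregate over a closed window is immutable, so it is computed once and every later query reuses it.

```scala
// Cache of expensive aggregates keyed by window start time. Once a window
// is closed (no more late arrivals possible), its result never changes,
// so caching it is safe.
class WindowCache(windowMs: Long, compute: (Long, Long) => Double) {
  private var cache = Map.empty[Long, Double]
  var misses = 0 // exposed only so the example can show cache behavior

  def aggregate(windowStart: Long): Double =
    cache.getOrElse(windowStart, {
      misses += 1
      val v = compute(windowStart, windowStart + windowMs)
      cache += windowStart -> v
      v
    })
}

// Stand-in "expensive" aggregate: here just the window length.
val cache = new WindowCache(60000L, (from, until) => (until - from).toDouble)
cache.aggregate(0L)
cache.aggregate(0L) // second call is served from the cache
```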
  101. Caching via Memcache. Time-windowed immutable results, e.g. 1-minute, 5-minute, 30-minute, 3-hour immutable spans. Replacing with an internal time-series-optimized cache.
  102. Read federations. Tiered storage: high temporal resolution, caches, long retention. Different data centers and storage clusters.
  103. Read federation decomposes the query and runs fragments next to storage

  104. On-demand per-second-resolution sampling: launch a sampler in Apache Mesos; discovery for read federation is automatic
  105. Query system uses a term-rewriter structure: multi-pass optimizations, data source lookups, cache modeling, costing and very-large-query avoidance. Inspired by Stratego/XT.
  106. Alerting

  107. (image slide)
  108. Paging and e-mails

  109. Uses CQL, adds predicates for conditions. See, unified access is a good thing!
  110. Widespread Watches all key services at Twitter

  111. Distributed Tracing

  112. Zipkin: https://github.com/twitter/zipkin. Based on the Dapper paper.

  113. (image slide)
  114. Sampled traces of services calling services. A hash of the trace ID is mapped to the sampling ratio.
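Hash-based sampling as described above can be sketched in one function (an invented illustration, not Zipkin's implementation): hash the trace ID into [0, 1) and keep the trace when it falls under the ratio. Because the decision depends only on the trace ID, every service in the call chain makes the same keep/drop decision for a given trace.

```scala
// Deterministic trace sampling: same trace ID, same decision everywhere.
def sampled(traceId: Long, ratio: Double): Boolean = {
  // Mask to a non-negative value, normalize into [0, 1].
  val bucket = (traceId.hashCode.toLong & 0x7fffffffL).toDouble / Int.MaxValue
  bucket < ratio
}

// ratio = 1.0 keeps (nearly) everything; ratio = 0.0 drops everything.
```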
  115. Annotations on traces Request parameters, internal timing, servers, clients, etc.

  116. Finagle "upgrades" the Thrift protocol: it calls a test method and, if present, adds a random trace ID and span ID to future messages on the connection
  117. Also for HTTP

  118. Force debug capability. Now with Firefox plugin! https://blog.twitter.com/2013/zippy-traces-zipkin-your-browser

  119. (image slide)
  120. Requires services to support tracing; limited support outside Finagle. Contributions welcome.

  121. Thanks! Yann Ramin Observability @ Twitter ! @theatrus [email protected]