Slide 1

How Twitter Monitors Millions of Time-series
Yann Ramin
Observability at Twitter
Strata Santa Clara - 2014
@theatrus
[email protected]

Slide 2

Monitoring for all of Twitter:
Services and Infrastructure

Slide 3

Concerns:
Time-series data: generating, collecting, storing, querying
Alerting: for when you’re not watching
Tracing: distributed systems call tracing

Slide 4

Time series data

Slide 5

No content

Slide 6

No content

Slide 7

Data from services
Not just hosts

Slide 8

Contrast: The “Nagios model”

Slide 9

The website is slow

Slide 10

“Nagios says it can’t connect to my webserver”

Slide 11

Why?

Slide 12

ssh me@host uptime
ssh me@host top
ssh me@host tail /var/log

Slide 13

Now do that for n > 5 servers

Slide 14

Logs are unstructured

Slide 15

“Log parsing” is a stop-gap
Why deploy log parsing rules with applications?

Slide 16

Move beyond logging: structured statistics

Slide 17

Provide rich and detailed instrumentation

Slide 18

Make it cheap and easy

Slide 19

First-tier aggregations and sampling are in the application
Incrementing an atomic counter = cheap
Writing to disk, sending a packet, etc. = expensive
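
A minimal sketch of this idea, with nothing assumed about Twitter's internals: the hot path does only an atomic in-memory increment, and a background exporter pays the expensive I/O cost once per interval. All names here are illustrative.

  import java.util.concurrent.atomic.AtomicLong
  import java.util.concurrent.{Executors, TimeUnit}

  object InProcessCounter {
    private val requests = new AtomicLong(0)

    // Hot path: a single atomic increment, no disk or network I/O.
    def markRequest(): Unit = requests.incrementAndGet()

    // First-tier aggregation: the expensive work (serialize, send a packet,
    // write to disk) happens once per interval, not once per event.
    private val exporter = Executors.newSingleThreadScheduledExecutor()
    exporter.scheduleAtFixedRate(new Runnable {
      def run(): Unit = println(s"""{"requests": ${requests.get}}""")
    }, 60, 60, TimeUnit.SECONDS)
  }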

Slide 20

Let’s look at Finagle-based services
http://twitter.github.io/finagle/

Slide 21

Lots of great default instrumentation
For network, JVM, etc.

Slide 22

Easy to add more

Slide 23

case class StatsFilter(
  name: String,
  statsReceiver: StatsReceiver = NullStatsReceiver
) extends SimpleFilter[Things, Unit] {

  private[this] val stats = statsReceiver.scope(name)
  private[this] val all = stats.counter("all")

  def apply(set: Things, service: Service[Things, Unit]): Future[Unit] = {
    all.incr(set.length)
    stats.counter(set.service).incr(set.metrics.length)
    service(set)
  }
}

Slide 24

case class StatsFilter(
  name: String,
  statsReceiver: StatsReceiver = NullStatsReceiver  // Get a StatsReceiver
) extends SimpleFilter[Things, Unit] {

  private[this] val stats = statsReceiver.scope(name)  // Make a scoped receiver
  private[this] val all = stats.counter("all")         // Create a counter named "all"

  def apply(set: Things, service: Service[Things, Unit]): Future[Unit] = {
    all.incr(set.length)                                 // Increment the counter
    stats.counter(set.service).incr(set.metrics.length)  // Get a counter named by a variable, increment by length
    service(set)
  }
}
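
For context, a hedged usage sketch (not from the talk): a Finagle filter like this composes onto a service with andThen. The names receiver and thingsService below are stand-ins.

  val instrumented: Service[Things, Unit] =
    StatsFilter("things_handler", receiver).andThen(thingsService)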

Slide 25

Easy to get out

Slide 26

http://server:port/admin/metrics.json

Slide 27

{
  ...
  "srv/http/request_latency_ms.avg": 45,
  "srv/http/request_latency_ms.count": 181094,
  "srv/http/request_latency_ms.max": 5333,
  "srv/http/request_latency_ms.min": 0,
  "srv/http/request_latency_ms.p50": 37,
  "srv/http/request_latency_ms.p90": 72,
  "srv/http/request_latency_ms.p95": 157,
  "srv/http/request_latency_ms.p99": 308,
  "srv/http/request_latency_ms.p9990": 820,
  "srv/http/request_latency_ms.p9999": 820,
  "srv/http/request_latency_ms.sum": 8149509,
  "srv/http/requests": 18109445,
  ...

Slide 28

Great support for approximate histograms
com.twitter.common.stats.ApproximateHistogram
Used as: stats.stat("timing").add(datapoint)

Slide 29

Also, counters & gauges
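
A small sketch of the three primitives as exposed by Finagle's StatsReceiver API; the scope and metric names are made up, and DefaultStatsReceiver is assumed as the backing receiver.

  object StatsPrimitives {
    import com.twitter.finagle.stats.{DefaultStatsReceiver, StatsReceiver}

    val stats: StatsReceiver = DefaultStatsReceiver.scope("my_service")

    stats.counter("requests").incr()           // counter: monotonically increasing count
    stats.stat("request_latency_ms").add(37f)  // stat: feeds an (approximate) histogram
    val depth = stats.addGauge("queue_depth") { 42f }  // gauge: sampled instantaneous value
  }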

Slide 30

Twitter-Server
A simple way to make a Finagle server
Sets things up the right way
https://github.com/twitter/twitter-server
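
A minimal sketch along the lines of the twitter-server getting-started example (paraphrased, so treat details as approximate): extending TwitterServer gives you the admin HTTP endpoints, including /admin/metrics.json, plus a statsReceiver, flags, and logging. The service body and port here are placeholders.

  import com.twitter.finagle.{Http, Service}
  import com.twitter.finagle.http.{Request, Response, Status}
  import com.twitter.server.TwitterServer
  import com.twitter.util.{Await, Future}

  object ExampleServer extends TwitterServer {
    private val counter = statsReceiver.counter("requests_counter")

    val service = new Service[Request, Response] {
      def apply(request: Request): Future[Response] = {
        counter.incr()
        val response = Response(request.version, Status.Ok)
        response.contentString = "hello"
        Future.value(response)
      }
    }

    def main(): Unit = {
      val server = Http.serve(":8888", service)
      onExit { server.close() }
      Await.ready(adminHttpServer)
    }
  }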

Slide 31

What about everything else?
A very simple HTTP+JSON protocol means this is easy to add to other persistent servers
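
As an illustration of how little is required, here is a hedged sketch of a non-Finagle JVM process exposing a metrics.json-style endpoint with only the JDK's built-in HTTP server; the path and port mimic the convention above but are otherwise arbitrary.

  import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
  import java.net.InetSocketAddress
  import java.util.concurrent.atomic.AtomicLong

  object MetricsEndpoint {
    val requests = new AtomicLong(0)

    def main(args: Array[String]): Unit = {
      val server = HttpServer.create(new InetSocketAddress(9990), 0)
      server.createContext("/admin/metrics.json", new HttpHandler {
        def handle(exchange: HttpExchange): Unit = {
          val body = s"""{"requests": ${requests.get}}""".getBytes("UTF-8")
          exchange.sendResponseHeaders(200, body.length)
          exchange.getResponseBody.write(body)
          exchange.close()
        }
      })
      server.start()
    }
  }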

Slide 32

We support ephemeral tasks
They roll up into a persistent server

Slide 33

Now we’ve replaced ssh with curl
This is where Observability comes in

Slide 34

Collection

Slide 35

No content

Slide 36

Distributed Scala service

Slide 37

Find endpoints:
ZooKeeper
Asset database
Other sources

Slide 38

Fetch/sample data
HTTP GET (via Finagle)
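
A hedged sketch of the fetch step using the finagle-http client API, assuming the /admin/metrics.json endpoint shown earlier; the host and port are placeholders.

  object FetchMetrics {
    import com.twitter.finagle.{Http, Service}
    import com.twitter.finagle.http.{Request, Response}
    import com.twitter.util.{Await, Future}

    def main(args: Array[String]): Unit = {
      // Placeholder endpoint; real endpoints come from ZooKeeper, the asset database, etc.
      val client: Service[Request, Response] = Http.newService("server.example.com:9990")
      val rsp: Future[Response] = client(Request("/admin/metrics.json"))
      println(Await.result(rsp).contentString)  // the JSON map of metric name -> value
    }
  }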

Slide 39

Filter, cleanup, etc.
Hygiene for incoming data

Slide 40

Route to storage layers:
Time-series database, memory pools, queues, and HDFS aggregators

Slide 41

Metrics are added by default
Need instrumentation? Just add it!
Shows up “instantly” in the system

Slide 42

This is good

Slide 43

Easy to use
No management overhead
No more “Can you add an rrd-file for us?”

Slide 44

This is bad
“Metric name on .toString”: [I@579b7698

Slide 45

Remove barriers
Be defensive
Pick both

Slide 46

Time series storage

Slide 47

No content

Slide 48

Distributed Scala front-end service
Databases, caches, data conversion, querying, etc.

Slide 49

220 million time series
Updated every minute
When this talk was proposed: 160 million time series
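
For scale, a rough back-of-the-envelope from the slide's own numbers: 220 million series refreshed once a minute is on the order of 220,000,000 / 60, roughly 3.7 million datapoints ingested per second, before any write-time aggregation.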

Slide 50

Cassandra
For real-time storage

Slide 51

(Now replaced with an internal database)
Similar enough to Cassandra

Slide 52

Uses KV storage
For the most part

Slide 53

Multiple clusters per DC
For different access patterns

Slide 54

We namespace metrics

Slide 55

Service = group
Source = where
Metric = what

Slide 56

Row key: (service, source, metric)

Slide 57

Columns: timestamp = value

Slide 58

Range scan for time series
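
A schematic sketch (illustrative types only, not Twitter's actual schema) of the row-key and column layout from the last three slides, and why a time range becomes a single contiguous column scan within one row.

  object TimeSeriesModel {
    import scala.collection.SortedMap

    // Row key: (service, source, metric); columns within a row: timestamp -> value.
    case class RowKey(service: String, source: String, metric: String)
    type Columns = SortedMap[Long, Double]

    // A time-series read becomes a column range scan inside a single row: [from, until)
    def rangeScan(columns: Columns, from: Long, until: Long): Iterable[(Long, Double)] =
      columns.range(from, until)
  }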

Slide 59

Tweaks: optimizations for time series
We never modify old data
We time-bound writes to old data

Slide 60

Informed heuristics to reduce SSTables scanned

Slide 61

Easy expiry: drop the whole SSTable

Slide 62

Cassandra Counters
Write-time aggregations

Slide 63

“Services as a whole”
Why read every “source” all the time?
Write them all into an aggregate

Slide 64

Don’t scale with cluster size

Slide 65

Limited aggregations: Sum, Count

Slide 66

Non-idempotent writes

Slide 67

Bad failure modes
Overcounting? Undercounting? Who knows!

Slide 68

Friends don’t let friends use counters
http://aphyr.com/posts/294-call-me-maybe-cassandra

Slide 69

Expanding storage tiers:
Memcache
HDFS
Logs
On-demand high-resolution samplers

Slide 70

Name indexing

Slide 71

What metrics exist? What instances? Hosts? Services?

Slide 72

Used in language tools (globs, etc) and discovery tools (here is what you have)

Slide 73

Index is temporal

Slide 74

“All metrics matching http/*, from Oct 1-10”

Slide 75

Maintained as a log of operations on a set

Slide 76

t = 0: add metric r
t = 2: remove metric q

Slide 77

Snapshot to avoid long scans
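
A hedged sketch of the idea behind these three slides: the set of live metric names at time t is recovered by taking a snapshot at or before t and replaying the add/remove log from there, so scans stay short. All names are illustrative.

  object TemporalIndex {
    sealed trait Op { def t: Long; def metric: String }
    case class Add(t: Long, metric: String) extends Op
    case class Remove(t: Long, metric: String) extends Op

    // Metrics that exist at time t, given a snapshot taken at snapshotAt <= t
    // and the log of operations recorded after the snapshot.
    def metricsAt(t: Long, snapshotAt: Long, snapshot: Set[String], log: Seq[Op]): Set[String] =
      log.filter(op => op.t > snapshotAt && op.t <= t)
         .sortBy(_.t)
         .foldLeft(snapshot) {
           case (s, Add(_, m))    => s + m
           case (s, Remove(_, m)) => s - m
         }
  }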

Slide 78

Getting data

Slide 79

No content

Slide 80

Ad-hoc queries

Slide 81

No content

Slide 82

Dashboards

Slide 83

No content

Slide 84

No content

Slide 85

Specialized Visualizations: Storm

Slide 86

Everything is built on our query language

Slide 87

CQL (not the Cassandra one)

Slide 88

Functional/declarative language

Slide 89

On-demand
Don’t need to pre-register queries

Slide 90

Aggregate, correlate and explore

Slide 91

And many more (cross-DC federation, etc.)

Slide 92

Supports matchers and drill-down from the index
e.g., explore by regex: http*latency.p9999

Slide 93

Ratio of GC activity to requests served
Get and combine two time series

Slide 94

We didn’t create a stat :(
Get and combine two time series

Slide 95

But we can query it!
Get and combine two time series

Slide 96

ts(cuckoo, members(role.cuckoo_frontend), jvm_gc_msec) /
ts(cuckoo, members(role.cuckoo_frontend), api/query_count)

Get and combine two time series

Slide 97

Queries work with “interactive” performance
When something is wrong, you need data yesterday
p50 = 2 milliseconds
p9999 = 2 seconds

Slide 98

Support individual time series and aggregates

Slide 99

Common to aggregate 100-10,000 time series over a week
Still responds within 5 seconds with a cold cache
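
For a sense of the volume: at one-minute resolution a week is 7 x 24 x 60 = 10,080 points per series, so aggregating 10,000 series over a week touches roughly 100 million datapoints, which is why the partial-result caching on the next slides matters.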

Slide 100

Aggregate partial caching
max(rate(ts({ 10,000 time series match })))
Cache this result!
Time-limiting out-of-order arrivals makes this a safe operation

Slide 101

Caching via Memcache
Time-windowed immutable results
e.g. 1-minute, 5-minute, 30-minute, 3-hour immutable spans
Being replaced with an internal time-series-optimized cache
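
A hedged sketch of why time-windowed spans are safe to cache, assuming the bound on out-of-order arrivals described on the previous slide; the key format and names are invented for illustration.

  object WindowCache {
    // Cache key for one immutable span of a query's partial aggregate.
    def key(queryFingerprint: String, windowStartSec: Long, windowLenSec: Long): String =
      s"cql:$queryFingerprint:$windowLenSec:$windowStartSec"

    // A window may be cached forever once it is entirely in the past,
    // beyond the bound on how late out-of-order points may arrive.
    def isImmutable(windowStartSec: Long, windowLenSec: Long, nowSec: Long, graceSec: Long): Boolean =
      windowStartSec + windowLenSec + graceSec <= nowSec
  }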

Slide 102

Read federations
Tiered storage: high temporal resolution, caches, long retention
Different data centers and storage clusters

Slide 103

Read federations
Decomposes the query, runs fragments next to the storage

Slide 104

On-demand per-second-resolution sampling
Launch a sampler on Apache Mesos
Discovery for read federation is automatic

Slide 105

The query system uses a term-rewriter structure
Multi-pass optimizations
Data source lookups
Cache modeling
Costing and very-large-query avoidance
Inspired by Stratego/XT

Slide 106

Alerting

Slide 107

No content

Slide 108

Paging and e-mails

Slide 109

Uses CQL
Adds predicates for conditions
See, unified access is a good thing!

Slide 110

Widespread
Watches all key services at Twitter

Slide 111

Distributed Tracing

Slide 112

Zipkin
https://github.com/twitter/zipkin
Based on the Dapper paper

Slide 113

No content

Slide 114

Sampled traces of services calling services
A hash of the trace ID is mapped to a sampling ratio
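
A hedged sketch of hash-based sampling as described here: the decision is a deterministic function of the trace ID, so every service in a call chain makes the same keep/drop choice without coordination. The exact hash and bucket count are illustrative, not Zipkin's.

  object TraceSampler {
    // Keep a trace iff its ID hashes into the sampled fraction of the keyspace.
    def sampled(traceId: Long, sampleRate: Double): Boolean = {
      val bucket = math.abs(traceId.hashCode.toLong) % 10000
      bucket < math.round(sampleRate * 10000)
    }
  }

For example, sampled(traceId, 0.001) keeps roughly one trace in a thousand.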

Slide 115

Annotations on traces
Request parameters, internal timing, servers, clients, etc.

Slide 116

Finagle “upgrades” the Thrift protocol
It calls a test method; if present, it adds a random trace ID and span ID to future messages on the connection

Slide 117

Also for HTTP

Slide 118

Force-debug capability
Now with a Firefox plugin!
https://blog.twitter.com/2013/zippy-traces-zipkin-your-browser

Slide 119

No content

Slide 120

Requires services to support tracing
Limited support outside Finagle
Contributions welcome!

Slide 121

Thanks!
Yann Ramin
Observability @ Twitter
@theatrus
[email protected]