Graphite 101

A bitly tech talk on how we do internal app metrics with Graphite

Jehiah Czebotar

January 16, 2014

Transcript

  1. GRAPHITE 101

    What we count, How we count it, and Why
    Jehiah - bitly Tech Talk - Jan 16th 2014
  2. Why do we look at Numbers and Metrics every day?

    • Detect Trends in usage (desired or undesired)
    • Make Product Decisions
    • Quantify Effects of Changes
    • Ensure Health of Systems
  3. COUNTER

    Something you can discretely count via 1+1. You would measure entrances to the bitly office by counting +=1 each time someone enters. You can break the counting into parts (by time and by hour) and sum() to get the total.
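The counter idea above can be sketched in a few lines of Python. This is only an illustration: the `door` label, the sample events, and the `record_entrance` helper are assumptions made for the example, not something taken from the talk.

```python
from collections import Counter

# Hypothetical counts keyed by (door, hour); each entrance is a += 1.
entrances = Counter()

def record_entrance(door, hour):
    """Count one person entering through `door` during `hour`."""
    entrances[(door, hour)] += 1

# Made-up sample events: three people at 9:00, one at 10:00.
for door, hour in [("front", 9), ("front", 9), ("back", 9), ("front", 10)]:
    record_entrance(door, hour)

# Because every part is a plain count, sum() over any set of parts is exact.
total = sum(entrances.values())
at_nine = sum(n for (door, hour), n in entrances.items() if hour == 9)
print(total, at_nine)
```

Summing works here because each part counts disjoint events, which is what makes counters the easiest metric type to aggregate.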
  4. GAUGE

    A gauge is something whose absolute value you want to track over time (like the temperature outside). If you measure from multiple locations or multiple segments, you aggregate with max()/min()/avg().
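A small sketch of why gauges aggregate with max()/min()/avg() rather than sum(). The sensor names and readings below are invented for the example.

```python
from statistics import mean

# Hypothetical temperature readings (degrees) from three locations
# taken at the same moment.
readings = {"roof": 3.0, "window": 5.0, "door": 4.0}

hottest = max(readings.values())   # worst case across locations
coldest = min(readings.values())   # best case across locations
typical = mean(readings.values())  # overall level

# Adding gauges together gives 12.0, which is not a temperature
# anyone measured, so sum() is the wrong aggregate here.
misleading = sum(readings.values())
print(hottest, coldest, typical, misleading)
```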
  5. PERCENTILES

    When you want to know the distribution of a set of values. (What was the fastest/slowest/average time x took?) Stored as pre-calculated percentiles and averages for all the values in a time period (a minute). The 100th percentile is the max, the 0th percentile is the min, and the 50th percentile is the median. To combine values, you avg()/min()/max() the percentiles depending on the desired output.
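As a rough illustration of pre-calculating percentiles for one time period, the sketch below uses the nearest-rank method. That method is an assumption chosen for simplicity; the talk does not say how the percentiles are actually computed. The key names `lower` and `upper_95`, and the sample timings, are likewise made up for the example.

```python
import math
from statistics import mean

def percentile(values, p):
    """Nearest-rank percentile: p=0 gives the min, p=100 gives the max."""
    ordered = sorted(values)
    if p == 0:
        return ordered[0]
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

def summarize_period(values):
    """Collapse all timings seen in one period into a few stored numbers."""
    return {
        "count": len(values),
        "mean": mean(values),
        "lower": percentile(values, 0),
        "median": percentile(values, 50),
        "upper_95": percentile(values, 95),
        "upper_100": percentile(values, 100),
    }

# Hypothetical timings (ms) from two hosts during the same minute.
host_a = summarize_period([5, 7, 23])
host_b = summarize_period([4, 6, 8])

# Combining across hosts: max() of the maxima and min() of the minima
# are exact; other percentiles can only be approximated this way.
slowest = max(host_a["upper_100"], host_b["upper_100"])
fastest = min(host_a["lower"], host_b["lower"])
print(host_a["median"], slowest, fastest)
```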
  6. 434 servers

    14,000 HTTP Requests/second (1B/day)
    3,250 Link Clicks/second (223M/day)
    67,000 Link Metrics/second (4B/day)
    59,515 Graphite Metrics every minute
  7. App -> UDP -> statsdaemon -> TCP (*/60sec) -> Graphite

    Counter: App sends key:3|c and key:5|c; statsdaemon sends Graphite: key 8 {ts}
    Gauge: App sends temp:3|g and temp:5|g; statsdaemon sends Graphite: temp 5 {ts}
    Timer: App sends d:7|ms, d:5|ms and d:23|ms; statsdaemon sends Graphite: d.count 3 {ts}, d.upper_100 23 {ts}, d.median 7 {ts}
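The flow on this slide can be mimicked with a toy aggregator. This is a sketch of the idea only, not statsdaemon's implementation: the `aggregate` function and the timestamp are hypothetical, and the real daemon receives packets over UDP and flushes over TCP, which is omitted here. The input packets and output lines mirror the ones shown on the slide.

```python
from collections import defaultdict
from statistics import median

def aggregate(packets, ts):
    """Fold one flush interval of statsd-style packets into Graphite lines."""
    counters = defaultdict(int)   # "|c": values are summed
    gauges = {}                   # "|g": last value wins
    timers = defaultdict(list)    # "|ms": values kept for percentiles
    for packet in packets:
        name, rest = packet.split(":", 1)
        value, kind = rest.split("|", 1)
        if kind == "c":
            counters[name] += int(value)
        elif kind == "g":
            gauges[name] = int(value)
        elif kind == "ms":
            timers[name].append(int(value))
    # Graphite's plaintext format is "<metric path> <value> <timestamp>".
    lines = [f"{name} {value} {ts}" for name, value in counters.items()]
    lines += [f"{name} {value} {ts}" for name, value in gauges.items()]
    for name, values in timers.items():
        lines.append(f"{name}.count {len(values)} {ts}")
        lines.append(f"{name}.upper_100 {max(values)} {ts}")
        lines.append(f"{name}.median {median(values)} {ts}")
    return lines

packets = ["key:3|c", "key:5|c", "temp:3|g", "temp:5|g",
           "d:7|ms", "d:5|ms", "d:23|ms"]
flushed = aggregate(packets, ts=1389830400)  # arbitrary example timestamp
print("\n".join(flushed))
```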
  8. GRAPHITE FUNCTIONS

    sumSeries(…)
    summarize(…,"1d","avg")
    scale(…, n)
    asPercent(…, …)
    alias(…, "key")
    maxSeries(…)
    groupByNode(…, n)
    highestAverage(…, n)
    timeShift(…, "7d")
    movingAverage(…,"1d")
    http://bit.ly/graphite-functions
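Functions like these are composed inside the `target` parameter of a Graphite render request. The sketch below builds such a URL; the host `graphite.example.com`, the metric path `app.*.requests`, and the `render_url` helper are placeholders invented for the example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def render_url(base, targets, start="-7d", fmt="json"):
    """Build a render URL with one `target` parameter per series expression."""
    params = [("target", t) for t in targets]
    params += [("from", start), ("format", fmt)]
    return f"{base}/render?{urlencode(params)}"

# Hypothetical metric path: total requests now, and the same series
# shifted back a week for comparison.
this_week = 'alias(sumSeries(app.*.requests), "requests")'
last_week = 'alias(timeShift(sumSeries(app.*.requests), "7d"), "last week")'

url = render_url("http://graphite.example.com", [this_week, last_week])
print(url)
```

Using `urlencode` keeps the quotes, commas and wildcards in the function expressions safely escaped in the query string.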
  9. GOTCHAS

    • At less than 1 datapoint per hour, per-host metrics are inaccurate (nulls and zeros don’t average the same way)
    • 95th percentile counted across five hosts is the 99th percentile.
    • Percentiles combine in low volume situations (when you have fewer datapoints than percentiles)
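The first gotcha is easy to reproduce. The sketch below assumes that an averaging step skips null datapoints, which is how a sparse series ends up with a different average than the same series with zeros filled in. The sample series is made up.

```python
def avg_skipping_nulls(points):
    """Average only the datapoints that exist; None means 'no data'."""
    present = [p for p in points if p is not None]
    return sum(present) / len(present) if present else None

def avg_nulls_as_zero(points):
    """Average after treating every missing datapoint as a real zero."""
    return sum(p or 0 for p in points) / len(points)

# Hypothetical low-volume host: one event in six intervals, nothing
# reported in the other five.
sparse = [None, None, 1, None, None, None]

skipped = avg_skipping_nulls(sparse)   # 1.0: looks like 1 event per interval
zeroed = avg_nulls_as_zero(sparse)     # about 0.17: the actual rate
print(skipped, zeroed)
```

The two answers differ by a factor of six for the same underlying traffic, which is why sparse per-host metrics are hard to trust.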