openTSDB - metrics for a distributed world

This is the talk about our metrics solution at gutefrage.net, where we use openTSDB. I gave this talk at the Open Source Monitoring Conference 2013 in Nuremberg.

Oliver Hankeln

October 24, 2013

Transcript

  1. openTSDB - Metrics for a distributed world

    Oliver Hankeln / gutefrage.net
    @mydalon
    Thursday, October 24, 2013
  2. Who am I?

    Senior Engineer - Data and Infrastructure at gutefrage.net GmbH
    Was doing software development before
    DevOps advocate
  3. Who is Gutefrage.net?

    Germany's biggest Q&A platform
    #1 German site (mobile), about 5M unique users
    #3 German site (desktop), about 17M unique users
    > 4 million page impressions (PI) per day
    Part of the Holtzbrinck group
    Running several platforms (Gutefrage.net, Helpster.de, Cosmiq, Comprano, ...)
  4. What you will get

    Why we chose openTSDB
    What is openTSDB?
    How does openTSDB store the data?
    Our experiences
    Some advice
  5. Why we chose openTSDB

  6. We were looking at some options

                        Munin   Graphite   openTSDB   Ganglia
    Scales well         no      sort of    yes        yes
    Keeps all data      no      no         yes        no
    Creating metrics    easy    easy       easy       easy
  7. We have a winner!

                        Munin   Graphite   openTSDB   Ganglia
    Scales well         no      sort of    yes        yes
    Keeps all data      no      no         yes        no
    Creating metrics    easy    easy       easy       easy

    Bingo!
  8. Separation of concerns

    $ unzip|strip|touch|finger|grep|mount|fsck|more|yes|fsck|fsck|fsck|umount|sleep

  9. The ecosystem

    App feeds metrics in via RabbitMQ
    We base Icinga checks on the metrics
    We evaluate Etsy Skyline for anomaly detection
    We deploy sensors via Chef
  10. openTSDB

    Written at StumbleUpon, but open source
    Uses HBase (which is built on HDFS) as storage
    Distributed system (multiple TSDs)
  11. The big picture

    [Diagram with the components: HBase, four TSDs, UI, API, tcollector. Annotation: "This is really a cluster"]
  12. Putting data into openTSDB

    $ telnet tsd01.acme.com 4242
    put proc.load.avg5min 1382536472 23.2 host=db01.acme.com
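
The telnet session above can also be scripted. The following is a minimal sketch in Python, assuming a TSD is listening on the host and port from the slide. `format_put` and `send_put` are hypothetical helper names chosen for illustration; they are not part of openTSDB.

```python
import socket


def format_put(metric, timestamp, value, tags):
    """Build one line of the text protocol shown on the slide."""
    # Tags are rendered as key=value pairs; sorting only keeps the output stable.
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {timestamp} {value} {tag_str}\n"


def send_put(host, port, line):
    """Send a single put line to a TSD over TCP (needs a reachable TSD)."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line.encode("ascii"))


line = format_put("proc.load.avg5min", 1382536472, 23.2, {"host": "db01.acme.com"})
# send_put("tsd01.acme.com", 4242, line)  # uncomment when a real TSD is available
```

The send call is left commented out so the sketch runs without a TSD; the formatted line is the same text that was typed into telnet.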
  13. It gets even better

    tcollector is a Python script that runs your collectors
    handles network connection, starts your collectors at set intervals
    does basic process management
    adds host tag, does deduplication
  14. A simple tcollector script

    #!/usr/bin/php
    <?php
    # Cast a die
    $die = rand(1,6);
    echo "roll.a.d6 " . time() . " " . $die . "\n";
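
Since tcollector only reads what a collector prints, the same collector can be written in any language. Here is a sketch of the die-rolling collector in Python, assuming the same "metric timestamp value" line format as the PHP script. The function name `roll_line` and its parameters are made up for this example.

```python
import random
import time


def roll_line(now=None, rng=random):
    """Return one 'metric timestamp value' line for a six-sided die roll."""
    timestamp = int(time.time() if now is None else now)
    return f"roll.a.d6 {timestamp} {rng.randint(1, 6)}"


if __name__ == "__main__":
    # A collector reports by printing to stdout, as the PHP script does.
    print(roll_line(), flush=True)
```

No host tag is printed here, because the previous slide says tcollector adds it.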
  15. What was that HDFS again?

    HDFS is a distributed filesystem suitable for petabytes of data on thousands of machines.
    Runs on commodity hardware
    Takes care of redundancy
    Used by e.g. Facebook, Spotify, eBay, ...
  16. Okay... and HBase?

    HBase is a NoSQL database / data store on top of HDFS
    Modeled after Google's BigTable
    Built for big tables (billions of rows, millions of columns)
    Automatic sharding by row key
  17. How openTSDB stores the data

  18. Keys are key!

    Data is sharded across regions based on their row key
    You query data based on the row key
    You can query row key ranges (e.g. A...D)
    So: think about key design
  19. Take 1

    Row key format: timestamp, metric id
  20. Take 1

    Row key format: timestamp, metric id

    1382536472, 5    17

    Server A    Server B
  21. Take 1

    Row key format: timestamp, metric id

    1382536472, 5    17
    1382536472, 6    24

    Server A    Server B
  22. Take 1

    Row key format: timestamp, metric id

    1382536472, 5    17
    1382536472, 6    24
    1382536472, 8    12
    1382536473, 5    134
    1382536473, 6    10
    1382536473, 8    99

    Server A    Server B
  23. Take 1

    Row key format: timestamp, metric id

    1382536472, 5    17
    1382536472, 6    24
    1382536472, 8    12
    1382536473, 5    134
    1382536473, 6    10
    1382536473, 8    99
    1382536474, 5    12
    1382536474, 6    42

    Server A    Server B
  24. Solution: Swap timestamp and metric id

    Row key format: metric id, timestamp

    5, 1382536472    17
    6, 1382536472    24
    8, 1382536472    12
    5, 1382536473    134
    6, 1382536473    10
    8, 1382536473    99
    5, 1382536474    12
    6, 1382536474    42

    Server A    Server B
  25. Solution: Swap timestamp and metric id

    Row key format: metric id, timestamp

    5, 1382536472    17
    6, 1382536472    24
    8, 1382536472    12
    5, 1382536473    134
    6, 1382536473    10
    8, 1382536473    99
    5, 1382536474    12
    6, 1382536474    42

    Server A    Server B
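
The two key layouts can be compared with a small simulation. This is only a sketch under stated assumptions: a table is split into two key ranges ("regions") at a fixed boundary, and the boundaries used below are invented for illustration. It shows why timestamp-first keys send all current writes to one region while metric-first keys spread them out, which appears to be the point of the Server A / Server B pictures.

```python
from bisect import bisect_right


def region_for(row_key, split_keys):
    """Index of the key range (region) that a row key falls into."""
    return bisect_right(split_keys, row_key)


now = 1382536474
metric_ids = [5, 6, 8]

# Take 1: (timestamp, metric id). Region boundary at an earlier timestamp.
time_first = {region_for((now, m), [(1382536473, 0)]) for m in metric_ids}

# Solution: (metric id, timestamp). Region boundary between metric 5 and 6.
metric_first = {region_for((m, now), [(6, 0)]) for m in metric_ids}

print(time_first)    # every current write lands in the same region
print(metric_first)  # current writes are spread over both regions
```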
  26. Take 2

    Metric ID first, then timestamp
    Searching through many rows is slower than searching through fewer rows. (Obviously)
    So: Put multiple data points into one row
  27. Take 2 continued

    5, 1382608800    +23   +35   +94   +142
    5, 1382608800    17    1     23    42
    5, 1382612400    +13   +25   +88   +89
    5, 1382612400    3     44    12    2
  28. Take 2 continued

    5, 1382608800    +23   +35   +94   +142
    5, 1382608800    17    1     23    42
    5, 1382612400    +13   +25   +88   +89
    5, 1382612400    3     44    12    2

    Labels: Row key
  29. Take 2 continued

    5, 1382608800    +23   +35   +94   +142
    5, 1382608800    17    1     23    42
    5, 1382612400    +13   +25   +88   +89
    5, 1382612400    3     44    12    2

    Labels: Row key, Cell Name
  30. Take 2 continued

    5, 1382608800    +23   +35   +94   +142
    5, 1382608800    17    1     23    42
    5, 1382612400    +13   +25   +88   +89
    5, 1382612400    3     44    12    2

    Labels: Row key, Cell Name, Data point
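
The row layout above can be sketched in a few lines of Python. This assumes rows cover one hour, which matches the hour-aligned timestamps in the example and the "rounded down to the hour" note on the row key slide. `split_timestamp` and `group_into_rows` are illustrative names, not openTSDB functions.

```python
from collections import defaultdict

ROW_SPAN = 3600  # seconds per row; matches the hour-aligned keys on the slide


def split_timestamp(timestamp):
    """Return (base time for the row key, offset used as the cell name)."""
    offset = timestamp % ROW_SPAN
    return timestamp - offset, offset


def group_into_rows(metric_id, points):
    """Group (timestamp, value) pairs into rows keyed by (metric id, base)."""
    rows = defaultdict(dict)
    for timestamp, value in points:
        base, offset = split_timestamp(timestamp)
        rows[(metric_id, base)][offset] = value
    return dict(rows)


points = [(1382608823, 17), (1382608835, 1), (1382612413, 3), (1382612425, 44)]
rows = group_into_rows(5, points)
```

Each data point ends up in the row for its hour, with the offset in seconds (the +23, +35, ... on the slide) as the cell name.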
  31. Where are the tags stored?

    They are put at the end of the row key
    Both tag names and tag values are represented by IDs
  32. The Row Key

    3 Bytes - metric ID
    4 Bytes - timestamp (rounded down to the hour)
    3 Bytes - tag ID
    3 Bytes - tag value ID
    Total: 7 Bytes + 6 Bytes * Number of tags
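
To make the byte layout concrete, here is a sketch that assembles such a key in Python. The field widths come from the slide; the big-endian byte order, the sorting of tag pairs, and the name `build_row_key` are assumptions made for this example.

```python
def build_row_key(metric_id, timestamp, tags):
    """Assemble a row key: metric ID, hour-aligned time, then tag ID pairs.

    tags is a list of (tag ID, tag value ID) integer pairs.
    """
    base = timestamp - timestamp % 3600  # round down to the hour
    key = metric_id.to_bytes(3, "big") + base.to_bytes(4, "big")
    # Assumption: pairs are sorted so the same tag set always yields the same key.
    for tag_id, value_id in sorted(tags):
        key += tag_id.to_bytes(3, "big") + value_id.to_bytes(3, "big")
    return key


key = build_row_key(5, 1382608823, [(1, 7), (2, 9)])
print(len(key))  # 7 bytes + 6 bytes * 2 tags
```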
  33. Let's look at some graphs

  34. Our experiences

  35. What works well

    We store about 200M data points in several thousand time series with no issues
    tcollector decouples measurement from storage
    Creating new metrics is really easy
  36. Challenges

    The UI is seriously lacking
    No annotation support out of the box
    Only 1s time resolution (and only 1 value per second per time series)
  37. Salvation is coming

    OpenTSDB 2 is around the corner
    millisecond precision
    annotations and meta data
    improved API
  38. Friendly advice

    Pick a naming scheme and stick to it
    Use tags wisely (not more than 6 or 7 tags per data point)
    Use tcollector
    Wait for openTSDB 2 ;-)
  39. Questions?

    Please contact me:
    oliver.hankeln@gutefrage.net
    @mydalon
    I'll upload the slides and tweet about it