This talk covers openTSDB and the culture around monitoring. It looks into how openTSDB stores its data in HBase, which in turn is built on HDFS. I share our experiences and give advice on how to use this measurement tool.
Who am I?
- Senior Engineer, Data and Infrastructure at gutefrage.net GmbH
- Did software development before that
- DevOps advocate

Wednesday, 30 October 13
Who is Gutefrage.net?
- Germany's biggest Q&A platform
- #1 German site (mobile): about 5M unique users
- #3 German site (desktop): about 17M unique users
- > 4 million page impressions per day
- Part of the Holtzbrinck group
- Running several platforms (Gutefrage.net, Helpster.de, Cosmiq, Comprano, ...)
We were looking at some options

                     Munin   Graphite   openTSDB   Ganglia
  Scales well        no      sort of    yes        yes
  Keeps all data     no      no         yes        no
  Creating metrics   easy    easy       easy       easy

We have a winner!
openTSDB is the only option that both scales well and keeps all data. Bingo!
Separation of concerns
- The UI was not important for our decision
- Alerting is not what we are looking for in our time series database

$ unzip|strip|touch|finger|grep|mount|fsck|more|yes|fsck|fsck|fsck|umount|sleep
The ecosystem
- The app feeds metrics in via RabbitMQ
- We base Icinga checks on the metrics
- We are evaluating Skyline and Oculus by Etsy for anomaly detection
- We deploy sensors via Chef
openTSDB
- Written by Benoît Sigoure at StumbleUpon
- Open source (get it from GitHub)
- Uses HBase (which is based on HDFS) as storage
- A distributed system (multiple TSDs)
It gets even better
- tcollector is a Python script that runs your collectors
- Handles the network connection, starts your collectors at set intervals
- Does basic process management
- Adds the host tag, does deduplication
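A collector for tcollector is just a script that prints one data point per line to stdout; tcollector picks the lines up and ships them to the TSD. A minimal sketch of such a collector, assuming a Linux host (the metric name `sys.loadavg.1min` and the `host` tag value are illustrative choices, not names mandated by tcollector):

```python
import time


def format_datapoint(metric, value, ts=None, **tags):
    """Format one data point the way tcollector expects on stdout:
    <metric> <unix-timestamp> <value> [tag=value ...]"""
    ts = int(ts if ts is not None else time.time())
    tag_str = "".join(" %s=%s" % kv for kv in sorted(tags.items()))
    return "%s %d %s%s" % (metric, ts, value, tag_str)


def read_loadavg():
    """Read the 1-minute load average from /proc/loadavg (Linux only)."""
    with open("/proc/loadavg") as f:
        return f.read().split()[0]


if __name__ == "__main__":
    # A real collector would loop with time.sleep(interval) and flush
    # stdout after every line; tcollector restarts it if it exits.
    print(format_datapoint("sys.loadavg.1min", read_loadavg(), host="web01"))
```

tcollector then takes care of the TSD connection, retries, and tagging, so the collector itself stays this small.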
What was that HDFS again?
- HDFS is a distributed filesystem suitable for petabytes of data on thousands of machines
- Runs on commodity hardware
- Takes care of redundancy
- Used by e.g. Facebook, Spotify, eBay, ...
Okay... and HBase?
- HBase is a NoSQL database / data store on top of HDFS
- Modeled after Google's BigTable
- Built for big tables (billions of rows, millions of columns)
- Automatic sharding by row key
Keys are key!
- Data is sharded across regions based on the row key
- You query data by row key
- You can query row key ranges (e.g. A...D)
- So: think about key design
Take 1
Row key format: timestamp, metric ID

  1382536472, 5    17
  1382536472, 6    24    } Server A
  1382536472, 8    12
  1382536473, 5    134
  1382536473, 6    10    } Server B
  1382536473, 8    99

With the timestamp first, the data points of a single metric are scattered across regions (and thus servers), so reading one metric over a time range touches many rows on many machines.
Take 2
- Metric ID first, then timestamp
- Searching through many rows is slower than searching through fewer rows (obviously)
- So: put multiple data points into one row
The Row Key
- 3 bytes: metric ID
- 4 bytes: timestamp (rounded down to the hour)
- 3 bytes: tag name ID
- 3 bytes: tag value ID
- Total: 7 bytes + 6 bytes * number of tags
Myth: Keeping data is expensive
- Gartner put the price of enterprise SSDs at $1/GB in 2013
- A data point is compressed down to 2-3 bytes
- A metric that you measure every second then uses about 18.9 ct of disk space per year
- Usually it is even cheaper
If your work costs $50 per hour and it takes you only one minute to think about and configure your RRD compaction settings, you could have collected that metric on a second-by-second basis for 4.4 YEARS instead.
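The arithmetic behind those two figures can be checked directly. Assuming 2 bytes per compressed data point and HDFS's default 3x replication (the replication factor is my assumption; the deck only states 2-3 bytes per point), the 18.9 ct and 4.4 year numbers fall out:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # 31,536,000 data points at 1s resolution
BYTES_PER_POINT = 2                  # compressed size (deck says 2-3 bytes)
REPLICATION = 3                      # HDFS default replication (assumption)
PRICE_PER_GB = 1.00                  # USD, the Gartner 2013 enterprise SSD figure

bytes_per_year = SECONDS_PER_YEAR * BYTES_PER_POINT * REPLICATION
cost_per_year = bytes_per_year / 1e9 * PRICE_PER_GB
print("Storage per metric-year: %.1f ct" % (cost_per_year * 100))   # ~18.9 ct

# One minute of engineering time at $50/hour buys how many metric-years?
minute_cost = 50 / 60
print("Years of 1s-resolution data: %.1f" % (minute_cost / cost_per_year))  # ~4.4
```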
Myth: The amount of metrics is too limited
- Don't confuse the Graphite metric count with the openTSDB metric count
- 3 bytes of metric ID = 16.7M possibilities
- 3 bytes of tag value ID = 16.7M possibilities
- => at least 280 T metrics (counting the Graphite way)
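The 280 T figure follows from multiplying the ID spaces: each (metric, tag value) combination would be a separate metric under Graphite's flat naming scheme, so even a single tag multiplies the count:

```python
metric_ids = 2 ** 24       # 3-byte metric ID -> 16,777,216 possibilities
tag_value_ids = 2 ** 24    # 3-byte tag value ID -> another 16.7M

# With one tag, every (metric, tag value) pair is its own "metric"
# in Graphite terms:
combinations = metric_ids * tag_value_ids
print("%.0f trillion" % (combinations / 1e12))   # ~281 trillion
```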
Tools shape culture shapes tools
- It is time for a new monitoring culture!
- Embrace machine learning!
- Monitor everything in your organisation!
- Throw off the shackles of fixed intervals!
- Come, join the revolution!
What works well
- We store about 200M data points in several thousand time series with no issues
- tcollector decouples measurement from storage
- Creating new metrics is really easy
- You are free to choose your own rhythm
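"Really easy" here means that writing a data point is all it takes: openTSDB can auto-create a metric the first time it sees it. A minimal sketch of sending one point over the TSD's line-based `put` protocol; the host, port, metric name, and tags are placeholders, not values from the deck:

```python
import socket
import time


def format_put_line(metric, value, tags, ts=None):
    """Build openTSDB's line protocol: put <metric> <ts> <value> <tag=value ...>"""
    ts = int(ts if ts is not None else time.time())
    tag_str = " ".join("%s=%s" % kv for kv in sorted(tags.items()))
    return "put %s %d %s %s\n" % (metric, ts, value, tag_str)


def send_datapoint(metric, value, tags, host="tsd.example.com", port=4242):
    """Open a TCP connection to a TSD and write one data point.
    Host and port are placeholders for your own TSD."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(format_put_line(metric, value, tags).encode("ascii"))


# Example (needs a reachable TSD):
# send_datapoint("web.requests.count", 42, {"host": "web01"})
```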
Challenges
- The UI is seriously lacking
- No annotation support out of the box
- No metadata for time series
- Only 1-second time resolution (and only one value per second per time series)
Friendly advice
- Pick a naming scheme and stick to it
- Use tags wisely (no more than 6 or 7 tags per data point)
- Use tcollector
- Wait for openTSDB 2 ;-)