openTSDB - metrics for a distributed world

This is the talk about our metrics solution at gutefrage.net, which is built on openTSDB. I gave it at the Open Source Monitoring Conference 2013 in Nuremberg.

Oliver Hankeln

October 24, 2013

Transcript

  1. openTSDB - Metrics for
    a distributed world
    Oliver Hankeln / gutefrage.net
    @mydalon
    Thursday, October 24, 2013

  2. Who am I?
    Senior Engineer - Data and Infrastructure at
    gutefrage.net GmbH
    Previously worked in software development
    DevOps advocate

  3. Who is Gutefrage.net?
    Germany’s biggest Q&A platform
    #1 German site (mobile), about 5M unique users
    #3 German site (desktop), about 17M unique users
    > 4 million page impressions (PI) per day
    Part of the Holtzbrinck group
    Running several platforms (Gutefrage.net,
    Helpster.de, Cosmiq, Comprano, ...)

  4. What you will get
    Why we chose openTSDB
    What is openTSDB?
    How does openTSDB store the data?
    Our experiences
    Some advice

  5. Why we chose
    openTSDB

  6. We were looking at some options
                        Munin   Graphite   openTSDB   Ganglia
    Scales well         no      sort of    yes        yes
    Keeps all data      no      no         yes        no
    Creating metrics    easy    easy       easy       easy

  7. We have a winner!
                        Munin   Graphite   openTSDB   Ganglia
    Scales well         no      sort of    yes        yes
    Keeps all data      no      no         yes        no
    Creating metrics    easy    easy       easy       easy
    Bingo!

  8. Separation of concerns
    $ unzip|strip|touch|finger|grep|mount|fsck|more|yes|
    fsck|fsck|fsck|umount|sleep

  9. The ecosystem
    App feeds metrics in via RabbitMQ
    We base Icinga checks on the metrics
    We evaluate etsy Skyline for anomaly
    detection
    We deploy sensors via chef

  10. openTSDB
    Written at StumbleUpon, but open source
    Uses HBase (which runs on top of HDFS) as its
    storage
    Distributed system (multiple TSDs)

  11. The big picture
    [Diagram: HBase, four TSD instances, UI, API, tcollector.
    Callout: "This is really a cluster"]

  12. Putting data into
    openTSDB
    $ telnet tsd01.acme.com 4242
    put proc.load.avg5min 1382536472 23.2 host=db01.acme.com
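
The same put line can also be sent from a program instead of a telnet session. The following is a minimal sketch, assuming the TSD accepts the plain-text line shown above on its TCP port; `format_put` and `send_put` are hypothetical helper names, not part of openTSDB.

```python
import socket


def format_put(metric, timestamp, value, tags):
    """Build one line of the telnet-style put command shown on the slide."""
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {timestamp} {value} {tag_str}\n"


def send_put(host, port, line, timeout=5.0):
    """Open a TCP connection to a TSD and send one put line."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    line = format_put("proc.load.avg5min", 1382536472, 23.2,
                      {"host": "db01.acme.com"})
    print(line, end="")
    # Host and port are the example values from the slide; uncomment to send.
    # send_put("tsd01.acme.com", 4242, line)
```

In practice tcollector takes over the connection handling, so a helper like this is mostly useful for quick experiments.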

  13. It gets even better
    tcollector is a Python script that runs your
    collectors
    Handles the network connection, starts your
    collectors at set intervals
    Does basic process management
    Adds the host tag, does deduplication

  14. A simple tcollector script
    #!/usr/bin/php
    <?php
    // Cast a die and print one data point: metric, timestamp, value
    $die = rand(1, 6);
    echo "roll.a.d6 " . time() . " " . $die . "\n";
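
Since tcollector itself is written in Python, the same collector might look like this in Python. This is a sketch under the assumption that a collector only needs to print "metric timestamp value" lines to standard output; `format_point` is a hypothetical helper introduced here for illustration.

```python
#!/usr/bin/env python3
import random
import sys
import time


def format_point(metric, value, now=None):
    """Return one 'metric timestamp value' line, like the PHP script prints."""
    ts = int(time.time()) if now is None else int(now)
    return f"{metric} {ts} {value}"


def main():
    die = random.randint(1, 6)  # cast a die
    print(format_point("roll.a.d6", die))
    sys.stdout.flush()  # make sure the line is not stuck in a buffer


if __name__ == "__main__":
    main()
```

No host tag is printed here because, as noted on the previous slide, tcollector adds it.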

  15. What was that HDFS
    again?
    HDFS is a distributed filesystem suitable for
    Petabytes of data on thousands of machines.
    Runs on commodity hardware
    Takes care of redundancy
    Used by e.g. Facebook, Spotify, eBay,...

  16. Okay... and HBase?
    HBase is a NoSQL database / data store on
    top of HDFS
    Modeled after Google’s Bigtable
    Built for big tables (billions of rows, millions
    of columns)
    Automatic sharding by row key

  17. How openTSDB stores
    the data

  18. Keys are key!
    Data is sharded across regions based on
    their row key
    You query data based on the row key
    You can query row key ranges (say e.g. A...D)
    So: think about key design
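
To make the row key range idea concrete, here is a toy model of a table that keeps its rows sorted by key. It is not the HBase API; `SortedStore` is a hypothetical class, and treating the end of the range as exclusive is an assumption made for this sketch.

```python
import bisect


class SortedStore:
    """Toy stand-in for a row-key-sorted table (not the HBase API)."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep keys in sorted order
        self._values[key] = value

    def scan(self, start, stop):
        """Yield (key, value) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedStore()
for key in ["E", "A", "C", "B", "D"]:
    store.put(key, key.lower())

print(list(store.scan("A", "D")))
```

Because the keys are sorted, a range query only touches a contiguous slice, which is why the layout of the key decides which queries are cheap.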

  19. Take 1
    Row key format: timestamp, metric id

  20. Take 1
    Row key format: timestamp, metric id
    1382536472, 5 17
    Server A
    Server B

  21. Take 1
    Row key format: timestamp, metric id
    1382536472, 5 17
    1382536472, 6 24
    Server A
    Server B

  22. Take 1
    Row key format: timestamp, metric id
    1382536472, 5 17
    1382536472, 6 24
    1382536472, 8 12
    1382536473, 5 134
    1382536473, 6 10
    1382536473, 8 99
    Server A
    Server B

  23. Take 1
    Row key format: timestamp, metric id
    1382536472, 5 17
    1382536472, 6 24
    1382536472, 8 12
    1382536473, 5 134
    1382536473, 6 10
    1382536473, 8 99
    1382536474, 5 12
    1382536474, 6 42
    Server A
    Server B

  24. Solution: Swap
    timestamp and metric id
    Row key format: metric id, timestamp
    5, 1382536472 17
    6, 1382536472 24
    8, 1382536472 12
    5, 1382536473 134
    6, 1382536473 10
    8, 1382536473 99
    5, 1382536474 12
    6, 1382536474 42
    Server A
    Server B
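
The slide text does not spell out why the swap helps. One reading, sketched below, is that timestamp-first keys always sort after every existing key, so all new writes land on the same server, while metric-first keys spread writes by metric. The two-region setup and the split points are hypothetical values chosen for illustration.

```python
from collections import Counter


def region_for(key, split_points):
    """Return the index of the region whose key range contains `key`.

    Region i holds keys k with split_points[i-1] <= k < split_points[i].
    """
    for i, split in enumerate(split_points):
        if key < split:
            return i
    return len(split_points)


metrics = [5, 6, 8]
timestamps = [1382536472, 1382536473, 1382536474]

# Take 1: (timestamp, metric id); hypothetical split at an older timestamp.
take1_split = [(1382536000, 0)]
take1 = Counter(region_for((ts, m), take1_split)
                for ts in timestamps for m in metrics)

# Solution: (metric id, timestamp); hypothetical split between metric 5 and 6.
take2_split = [(6, 0)]
take2 = Counter(region_for((m, ts), take2_split)
                for ts in timestamps for m in metrics)

print("timestamp first:", dict(take1))
print("metric id first:", dict(take2))
```

With the timestamp first, every write in this toy run goes to region 1; with the metric ID first, both regions receive writes.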

  25. Solution: Swap
    timestamp and metric id
    Row key format: metric id, timestamp
    5, 1382536472 17
    6, 1382536472 24
    8, 1382536472 12
    5, 1382536473 134
    6, 1382536473 10
    8, 1382536473 99
    5, 1382536474 12
    6, 1382536474 42
    Server A
    Server B

  26. Take 2
    Metric ID first, then timestamp
    Searching through many rows is slower than
    searching through fewer rows. (Obviously)
    So: Put multiple data points into one row

  27. Take 2 continued
    5, 1382608800
    +23 +35 +94 +142
    5, 1382608800
    17 1 23 42
    5, 1382612400
    +13 +25 +88 +89
    5, 1382612400
    3 44 12 2

  28. Take 2 continued
    5, 1382608800
    +23 +35 +94 +142
    5, 1382608800
    17 1 23 42
    5, 1382612400
    +13 +25 +88 +89
    5, 1382612400
    3 44 12 2
    Row key

  29. Take 2 continued
    5, 1382608800
    +23 +35 +94 +142
    5, 1382608800
    17 1 23 42
    5, 1382612400
    +13 +25 +88 +89
    5, 1382612400
    3 44 12 2
    Row key
    Cell Name

  30. Take 2 continued
    Row key          Cell name    +23   +35   +94   +142
    5, 1382608800    Data point    17     1    23     42

    Row key          Cell name    +13   +25   +88    +89
    5, 1382612400    Data point     3    44    12      2
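
The grouping on these slides can be sketched in a few lines: the row key carries the timestamp rounded down to the hour, and the cell name carries the offset in seconds within that hour. The function names and the sample points are assumptions for illustration; the timestamps are built from the row keys and offsets shown above.

```python
from collections import defaultdict

ROW_SPAN = 3600  # seconds per row, one row per hour as in the slides


def split_timestamp(ts):
    """Return (base_time, offset): the row's hour and the cell's offset."""
    offset = ts % ROW_SPAN
    return ts - offset, offset


def build_rows(metric_id, points):
    """Group (timestamp, value) pairs into rows keyed by (metric_id, base)."""
    rows = defaultdict(dict)
    for ts, value in points:
        base, offset = split_timestamp(ts)
        rows[(metric_id, base)][offset] = value
    return dict(rows)


points = [(1382608823, 17), (1382608835, 1), (1382608894, 23),
          (1382608942, 42), (1382612413, 3)]
rows = build_rows(5, points)
for key, cells in sorted(rows.items()):
    print(key, cells)
```

Reading one hour of one metric then means fetching a single row instead of up to 3600 separate rows.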

  31. Where are the tags
    stored?
    They are put at the end of the row key
    Both tag names and tag values are
    represented by IDs

  32. The Row Key
    3 Bytes - metric ID
    4 Bytes - timestamp (rounded down to the
    hour)
    3 Bytes - tag ID (per tag)
    3 Bytes - tag value ID (per tag)
    Total: 7 Bytes + 6 Bytes * number of tags
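
As a rough illustration of this layout, the sketch below packs a row key from integer IDs. The byte widths follow the slide; the big-endian byte order, the sorting of tags, and the names `uid` and `row_key` are assumptions made for this example and are not taken from the openTSDB source.

```python
import struct


def uid(n):
    """Encode an integer ID as 3 bytes (big-endian is assumed here)."""
    return n.to_bytes(3, "big")


def row_key(metric_id, timestamp, tags):
    """Build metric ID + hour-aligned timestamp + (tag ID, tag value ID) pairs.

    `tags` maps tag IDs to tag value IDs; sorting them is an assumption
    made here so that the same tag set always yields the same key.
    """
    base_time = timestamp - timestamp % 3600  # round down to the hour
    key = uid(metric_id) + struct.pack(">I", base_time)
    for tag_id, value_id in sorted(tags.items()):
        key += uid(tag_id) + uid(value_id)
    return key


key = row_key(5, 1382536472, {1: 7, 2: 42})
print(len(key), key.hex())
```

With two tags the key is 7 + 6 * 2 = 19 bytes long, which matches the formula on the slide.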

  33. Let’s look at some
    graphs

  34. Our experiences

  35. What works well
    We store about 200M data points in several
    thousand time series with no issues
    tcollector decouples measurement from
    storage
    Creating new metrics is really easy

  36. Challenges
    The UI is seriously lacking
    No annotation support out of the box
    Only 1 s time resolution (and only one value
    per second per time series)

  37. Salvation is coming
    OpenTSDB 2 is around the corner
    millisecond precision
    annotations and meta data
    improved API

  38. Friendly advice
    Pick a naming scheme and stick to it
    Use tags wisely (not more than 6 or 7 tags
    per data point)
    Use tcollector
    Wait for openTSDB 2 ;-)
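
One way to make the first two points stick is to check data points before they are sent. The sketch below assumes a hypothetical house rule of lowercase, dot-separated metric names and takes the tag limit from the slide; neither is an openTSDB constant, and `check_point` is an illustrative helper.

```python
import re

MAX_TAGS = 7  # upper bound suggested on the slide, not an openTSDB constant
# Hypothetical house rule: lowercase dotted names such as "proc.load.avg5min".
NAME_RE = re.compile(r"^[a-z0-9_]+(\.[a-z0-9_]+)*$")


def check_point(metric, tags):
    """Return a list of problems with a metric name and its tags."""
    problems = []
    if not NAME_RE.match(metric):
        problems.append(f"metric name {metric!r} breaks the naming scheme")
    if len(tags) > MAX_TAGS:
        problems.append(f"{len(tags)} tags, more than {MAX_TAGS}")
    return problems


print(check_point("proc.load.avg5min", {"host": "db01.acme.com"}))
print(check_point("Proc Load", {f"tag{i}": "x" for i in range(8)}))
```

A check like this could run inside each collector or in a shared helper, so that naming drift is caught before it reaches storage.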

  39. Questions?
    Please contact me:
    [email protected]
    @mydalon
    I’ll upload the slides and tweet about it