Time-Series Metrics with Cassandra

Librato's Metrics platform relies on Cassandra as its sole storage system for time-series data. This session discusses how we scaled from a single six-node Cassandra ring two years ago to the multiple storage rings that handle all time-series measurement data today.

Mike Heffner

June 11, 2013

Transcript

  1. #CASSANDRA13
    Time-Series Metrics with Cassandra
    Mike Heffner

  2. #CASSANDRA13
    What we do.

  3. #CASSANDRA13
    October 2011

    Decision: All measurements in Cassandra

    Single EC2 Ring: 6 * m1.large

    Cassandra 0.8.x

    How does this work?

  4. #CASSANDRA13
    Today

    Multiple sharded rings

    EC2: m1.xlarge and m2.4xlarge

    Cassandra 1.1.x

    Read load: < 1%

  5. #CASSANDRA13
    Talk Highlights

    Adapting Schema to Storage

    Optimally Expiring Data

    Monitor Everything

  6. #CASSANDRA13
    Adapting Schema to Storage

  7. #CASSANDRA13
    What is a Measurement?
    ( Metric ID, Source )
    (X, Y) => (Epoch Timestamp, Value)
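    A minimal sketch of that data model in Python (hypothetical names; the production store keeps these as Cassandra rows and columns, not in-memory dicts):

    from collections import defaultdict

    # Hypothetical illustration only: each series is keyed by (metric id, source),
    # and each point maps an epoch timestamp to a value.
    series = defaultdict(dict)

    def record(metric_id, source, epoch, value):
        series[(metric_id, source)][epoch] = value

    record(12, "web-frontend-1", 1370911200, 0.42)
    record(12, "web-frontend-1", 1370911260, 0.57)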

  8. #CASSANDRA13
    Measurement CF

  9. #CASSANDRA13
    Locating Measurement Rows
    Maximum row size math:

    1 minute records

    1 week TTL

    7 days * 24 hours * 60 minutes => ~10k

    4 Longs * 8 bytes * 10k => ~320KB (not bad)
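    The same estimate, worked through as a quick check:

    # Reproduce the slide's row-size math.
    columns_per_row = 7 * 24 * 60      # one record per minute for a week => 10,080 (~10k)
    bytes_per_column = 4 * 8           # four longs at 8 bytes each
    print(columns_per_row * bytes_per_column)   # 322,560 bytes => ~320KB per row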

  10. #CASSANDRA13
    We have a problem

  11. #CASSANDRA13
    Examining CF SSTables
    nodetool cfhistograms Metrics metric_id_epochs_60
    Metrics/metric_id_epochs_60 histograms (Offset = SSTables touched per read):
    Offset  SSTables
    1         28821
    2         58859
    3        201198
    4        178326
    5        223016
    6        154952
    7         83289
    8         21552
    10        81104

  12. #CASSANDRA13
    Storage Over Time

  13. #CASSANDRA13

  14. #CASSANDRA13

  15. #CASSANDRA13
    Rotating the Rows
    Retrieve time bases for times 31->45 for metric ID 12:
    mget(Rows: [12, EBase_30], [12, EBase_40], Columns: {31->45})
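    A sketch of how such a lookup maps a time range onto rotated row keys (a base width of 10 is chosen only to match the EBase_30 / EBase_40 example; production buckets would differ):

    BASE_WIDTH = 10

    def row_keys(metric_id, start, end):
        # Rows are keyed by (metric id, epoch base); a range query fans out
        # over every base the requested range touches.
        first = (start // BASE_WIDTH) * BASE_WIDTH
        last = (end // BASE_WIDTH) * BASE_WIDTH
        return [(metric_id, base) for base in range(first, last + 1, BASE_WIDTH)]

    print(row_keys(12, 31, 45))   # [(12, 30), (12, 40)] -> mget those rows, columns 31..45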

  16. #CASSANDRA13
    Examining CF SSTables
    nodetool cfhistograms Metrics metric_id_epochs_60
    Metrics/metric_id_epochs_60

    Before:
    Offset  SSTables
    1         28821
    2         58859
    3        201198
    4        178326
    5        223016
    6        154952
    7         83289
    8         21552
    10        81104

    After:
    Offset  SSTables
    1       3491820
    2       5389762
    3       4095760
    4       1310741
    5          9976

  17. #CASSANDRA13
    /graph me

  18. #CASSANDRA13
    Optimally Expiring Data

  19. #CASSANDRA13
    TTL Expiration

    Churn of about 750 GB/day

    12 TB total

    ~6% of the data set

    gc_grace = 0

    STCS (size-tiered compaction)

  20. #CASSANDRA13
    Synchronized Compactions

  21. #CASSANDRA13
    nodetool compact

  22. #CASSANDRA13
    * http://hight3ch.com/garbage-truck-crushing-a-car/

  23. #CASSANDRA13
    nodetool cleanup

  24. #CASSANDRA13
    Cleanup

    Not just for topology changes

    Tombstoned rows (not referenced)

    Rotated row keys decrease references

    Cons: Must process every sstable.

  25. #CASSANDRA13
    Immutable SSTables

  26. #CASSANDRA13
    Leverage SSTable Mod Time

    If now - mtime > TTL => all data is expired

    We can quickly eliminate entire SSTables:
    find <data_dir> -mtime +<TTL days> -name '*.db' | xargs rm

    Fast and low overhead

    Cons: Rolling restart
    26G 2013-05-17 09:44 Metrics-metrics_60-hf-7209-Data.db
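    A minimal sketch of that check, assuming every column in the CF is written with the same fixed TTL (so an SSTable untouched for longer than the TTL can hold only expired data); paths and TTL here are hypothetical:

    import glob, os, time

    TTL_SECONDS = 7 * 24 * 3600                      # e.g. a one-week TTL
    DATA_DIR = "/var/lib/cassandra/data/Metrics"     # hypothetical data directory

    now = time.time()
    for path in glob.glob(os.path.join(DATA_DIR, "*-Data.db")):
        if now - os.path.getmtime(path) > TTL_SECONDS:
            print("fully expired:", path)   # candidate for removal, then rolling restart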

  27. #CASSANDRA13
    nodetool setcompactionthreshold

  28. #CASSANDRA13
    Increasing minor compactions

    By default, STCS requires a minimum of 4 SSTables per minor compaction

    Leads to large, non-compacted SSTables

    Dropping the minimum to 2 can flatten the storage growth:
    nodetool setcompactionthreshold <keyspace> <cf> 2 <max>

    Cons: CPU/IO increase

  29. #CASSANDRA13
    Result

  30. #CASSANDRA13
    New in 1.2

  31. #CASSANDRA13
    1.2

    Off-heap memory

    TTL Histograms

  32. #CASSANDRA13
    Effective Monitoring

  33. #CASSANDRA13
    Ring Dashboards

  34. #CASSANDRA13
    Disk Errors => Throw Away

    If you ever see this, replace!
    end_request: I/O error, dev xvdb, sector 467940617
    end_request: I/O error, dev xvdb, sector 467940617

    Mark node down, bootstrap new

    No metric for this?
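    One way to get a metric for this is to count matching kernel log lines yourself; a hedged sketch (log path is hypothetical):

    import re

    KERN_LOG = "/var/log/kern.log"       # hypothetical location of the kernel log
    pattern = re.compile(r"end_request: I/O error, dev (\S+),")

    errors_per_device = {}
    with open(KERN_LOG) as log:
        for line in log:
            m = pattern.search(line)
            if m:
                dev = m.group(1)
                errors_per_device[dev] = errors_per_device.get(dev, 0) + 1

    print(errors_per_device)   # any non-zero count => mark the node down, bootstrap new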

  35. #CASSANDRA13
    Cassandra Log Volume

    Count log lines seen every 10 minutes

    Track over time

    Can identify:
    – Unbalanced workloads
    – Schema disagreements
    – Phantom gossip nodes
    – GC activity

    grep -v '.java' => exceptions
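    A minimal sketch of that per-10-minute count, assuming the 1.1-era log line format "LEVEL [Thread] YYYY-MM-DD HH:MM:SS,ms ..." and a hypothetical log path:

    from collections import Counter
    from datetime import datetime

    LOG = "/var/log/cassandra/system.log"   # hypothetical location
    buckets = Counter()

    with open(LOG) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 4:
                continue
            try:
                ts = datetime.strptime(parts[2] + " " + parts[3].split(",")[0],
                                       "%Y-%m-%d %H:%M:%S")
            except ValueError:
                continue                    # stack traces and other continuation lines
            window = ts.replace(minute=ts.minute - ts.minute % 10, second=0)
            buckets[window] += 1

    for window, count in sorted(buckets.items()):
        print(window, count)                # track these counts over time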

  36. #CASSANDRA13
    Q & A
    Mike Heffner
    /mheffner
    /mheffner
    We're Hiring!
