Slide 1

Slide 1 text

#CASSANDRA13 Time-Series Metrics with Cassandra Mike Heffner

Slide 2

Slide 2 text

What we do.

Slide 3

Slide 3 text

October 2011
● Decision: All measurements in Cassandra
● Single EC2 Ring: 6 * m1.large
● Cassandra 0.8.x
● How does this work?

Slide 4

Slide 4 text

Today
● Multiple sharded rings
● EC2: m1.xlarge and m2.4xlarge
● Cassandra 1.1.x
● Read load: < 1%

Slide 5

Slide 5 text

Talk Highlights
● Adapting Schema to Storage
● Optimally Expiring Data
● Monitor Everything

Slide 6

Slide 6 text

Adapting Schema to Storage

Slide 7

Slide 7 text

What is a Measurement?
(Metric ID, Source) (X, Y) => (Epoch Timestamp, Value)

Slide 8

Slide 8 text

Measurement CF

Slide 9

Slide 9 text

Locating Measurement Rows
Maximum row size math:
● 1 minute records
● 1 week TTL
● 7 days * 24 hours * 60 minutes => ~10k
● 4 Longs * 8 bytes * 10k => ~320KB (not bad)
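The row size estimate can be checked with a few lines of Python. This is only a worked version of the slide's arithmetic; the constant names are illustrative, and it assumes, as the slide states, four 8-byte longs per record.

```python
# Worked check of the maximum-row-size estimate from the slide.
MINUTES_PER_WEEK = 7 * 24 * 60   # 1-minute records kept for a 1-week TTL
LONGS_PER_RECORD = 4             # per the slide: 4 longs per record
BYTES_PER_LONG = 8

records_per_row = MINUTES_PER_WEEK
row_bytes = LONGS_PER_RECORD * BYTES_PER_LONG * records_per_row

print(records_per_row)   # 10080, i.e. ~10k
print(row_bytes)         # 322560 bytes, i.e. ~320KB
```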

Slide 10

Slide 10 text

We have a problem

Slide 11

Slide 11 text

Examining CF SSTables
nodetool cfhistograms Metrics metric_id_epochs_60

Metrics/metric_id_epochs_60 histograms
Offset  SSTables
1       28821
2       58859
3       201198
4       178326
5       223016
6       154952
7       83289
8       21552
10      81104

Slide 12

Slide 12 text

Storage Over Time

Slide 13

Slide 13 text


Slide 14

Slide 14 text


Slide 15

Slide 15 text

Rotating the Rows
Retrieve Time Bases for Times 31->45 for metric ID 12:
mget(Rows: [12, EBase_30], [12, EBase_40], Columns: {31->45})
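The row-key lookup for a time range can be sketched as follows. This is a minimal illustration, not the actual client code: the function names are hypothetical, the key shape (metric ID, "EBase_<base>") simply mirrors the slide's notation, and the base width of 10 is an assumption inferred from the EBase_30 and EBase_40 example.

```python
def time_bases(start, end, width):
    """Return the time bases (row buckets) that cover start..end inclusive."""
    first = (start // width) * width
    last = (end // width) * width
    return list(range(first, last + width, width))


def row_keys(metric_id, start, end, width=10):
    """Build the row keys to multiget for one metric over a time range."""
    return [(metric_id, "EBase_%d" % base)
            for base in time_bases(start, end, width)]


# Times 31->45 for metric ID 12 span two rotated rows.
print(row_keys(12, 31, 45))   # [(12, 'EBase_30'), (12, 'EBase_40')]
```

Because each row only ever receives writes for one time base, a read for a recent range touches a small, bounded set of rows.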

Slide 16

Slide 16 text

Examining CF SSTables
nodetool cfhistograms Metrics metric_id_epochs_60

Before
Metrics/metric_id_epochs_60
Offset  SSTables
1       28821
2       58859
3       201198
4       178326
5       223016
6       154952
7       83289
8       21552
10      81104

After
Metrics/metric_id_epochs_60
Offset  SSTables
1       3491820
2       5389762
3       4095760
4       1310741
5       9976
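One way to summarize the before and after histograms is the average number of SSTables touched per read. The sketch below assumes the usual reading of the cfhistograms SSTables column (each count is the number of reads that touched `Offset` SSTables); the function name is illustrative, and the two samples cover different numbers of reads, so only the averages are comparable.

```python
def mean_sstables_per_read(histogram):
    """Weighted average of SSTables touched, given {offset: read_count}."""
    total_reads = sum(histogram.values())
    touched = sum(offset * reads for offset, reads in histogram.items())
    return touched / total_reads


# Counts copied from the slide.
BEFORE = {1: 28821, 2: 58859, 3: 201198, 4: 178326, 5: 223016,
          6: 154952, 7: 83289, 8: 21552, 10: 81104}
AFTER = {1: 3491820, 2: 5389762, 3: 4095760, 4: 1310741, 5: 9976}

print(round(mean_sstables_per_read(BEFORE), 2))   # 4.92
print(round(mean_sstables_per_read(AFTER), 2))    # 2.23
```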

Slide 17

Slide 17 text

/graph me

Slide 18

Slide 18 text

Optimally Expiring Data

Slide 19

Slide 19 text

TTL Expiration
● Churn of about 750GB / day
● 12 TB total
● 6% of data set
● gc_grace = 0
● STC (size-tiered compaction)

Slide 20

Slide 20 text

Synchronized Compactions

Slide 21

Slide 21 text

nodetool compact

Slide 22

Slide 22 text

* http://hight3ch.com/garbage-truck-crushing-a-car/

Slide 23

Slide 23 text

nodetool cleanup

Slide 24

Slide 24 text

Cleanup
● Not just for topology changes
● Tombstoned rows (not referenced)
● Rotated row keys decrease references
● Cons: Must process every sstable.

Slide 25

Slide 25 text

Immutable SSTables

Slide 26

Slide 26 text

Leverage SSTable Mod Time
● If now - mtime > TTL => all data is expired
● We can quickly eliminate entire sstables:
  find -mtime + -name *.db | xargs rm
● Fast and low overhead
● Cons: Rolling restart
26G 2013-05-17 09:44 Metrics-metrics_60-hf-7209-Data.db
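The mtime check can be sketched in Python as a dry run that only lists candidates and deletes nothing. This is an illustration under assumptions, not the procedure used in the talk: the function name and parameters are hypothetical, it matches files ending in .db as the find command does, and it relies on every column in the column family sharing the same TTL.

```python
import os
import time


def expired_sstables(data_dir, ttl_seconds, now=None):
    """List .db files whose mtime is older than the TTL (dry run only)."""
    now = time.time() if now is None else now
    expired = []
    for name in sorted(os.listdir(data_dir)):
        if not name.endswith(".db"):
            continue
        path = os.path.join(data_dir, name)
        # SSTables are immutable, so mtime marks when the newest data
        # in the file was written; past the TTL, all of it has expired.
        if now - os.path.getmtime(path) > ttl_seconds:
            expired.append(path)
    return expired
```

Calling `expired_sstables(data_dir, 7 * 24 * 3600)` with the column family's data directory would return the files older than a one-week TTL. Removing them still requires the restart noted on the slide.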

Slide 27

Slide 27 text

nodetool setcompactionthreshold

Slide 28

Slide 28 text

Increasing minor compactions
● By default, STC requires a minimum of 4 sstables
● Leads to large non-compacted sstables
● Dropping to 2 can flatten the storage growth
  nodetool setcompactionthreshold 2
● Cons: CPU/IO increase

Slide 29

Slide 29 text

Result

Slide 30

Slide 30 text

New in 1.2

Slide 31

Slide 31 text

1.2
● Off-heap memory
● TTL Histograms

Slide 32

Slide 32 text

Effective Monitoring

Slide 33

Slide 33 text

Ring Dashboards

Slide 34

Slide 34 text

Disk Errors => Throw Away
● If you ever see this, replace!
  end_request: I/O error, dev xvdb, sector 467940617
  end_request: I/O error, dev xvdb, sector 467940617
● Mark node down, bootstrap new
● No metric for this?

Slide 35

Slide 35 text

Cassandra Log Volume
● Count log lines seen every 10 minutes
● Track over time
● Can identify:
  – Unbalanced workloads
  – Schema disagreements
  – Phantom gossip nodes
  – GC activity
● grep -v '.java' => exceptions
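Counting log lines per 10-minute window can be sketched as below. This is a minimal illustration, not the tooling from the talk: the function name is hypothetical, and it assumes each log line carries a `YYYY-MM-DD HH:MM:SS` timestamp somewhere in it and that the window length divides 60 evenly.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed timestamp shape; adjust to the actual log layout.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def log_volume(lines, window_minutes=10):
    """Count log lines per window, keyed by the window's start time."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # lines without a timestamp are skipped
        stamp = datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S")
        start_minute = stamp.minute - stamp.minute % window_minutes
        counts[stamp.replace(minute=start_minute, second=0)] += 1
    return counts
```

Submitting each window's count as a metric per node gives a series to track over time, and a node whose count diverges from its peers stands out.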

Slide 36

Slide 36 text

Q & A
Mike Heffner
/mheffner /mheffner
We're Hiring!