
How Do You Eat An Elephant? (RICON East 2013)

Presented by Theo Schlossnagle and Robert Treat at RICON East 2013

When OmniTI first set out to build a next-generation monitoring system, we turned to one of our most trusted tools for data management: Postgres. While this worked well for developing the initial open source application, as we continued to grow the Circonus public monitoring service, we eventually ran into scaling issues. This talk will cover some of the changes we made to make the original Postgres system work better, talk about some of the other systems we evaluated, and discuss the eventual solution to our problem: building our own time series database. Of course, that's only half the story. We'll also go into how we swapped out these backend data storage pieces in our production environment, all the while capturing and reporting on millions of metrics, without downtime or customer interruption.

About Theo

Theo Schlossnagle is a Founder and Principal at OmniTI, where he designs and implements scalable solutions for highly trafficked sites and other clients in need of sound, scalable architectural engineering. He is the architect of the highly scalable Momentum mail transport agent, principal architect of Fontdeck, which delivers professional typefaces optimized for the web, Project Lead and Architect for OmniOS, an Illumos-based operating system distribution, and Founder and Principal Architect of Circonus, a cloud platform designed for monitoring and marrying systems and business analytics. He authored Scalable Internet Architectures (Sams) and is a veteran speaker on the open source conference circuit. A member of the Apache Software Foundation and IEEE, and a senior member of the ACM, he serves on the editorial board of ACM’s Queue Magazine.

About Robert

Having worked on database-backed, internet-based systems for over a decade, Robert Treat is co-author of the book Beginning PHP and PostgreSQL 8, maintains the phpPgAdmin software package, and has been recognized as a major contributor to the PostgreSQL project for his work over the years. An international speaker on databases, open source, and managing web operations at scale, he spends his days as COO of OmniTI, a consultancy focused on building and managing large scale web infrastructure.

Basho Technologies

May 14, 2013

Transcript

  1. OmniTI • A consultancy focused on web scalability • We manage lots of large scale systems • We hated our tools and we like to build new products...

  2. Reconnoiter had simple roots. • And I’ll add: a very low bar. • We just needed to analyze telemetry better than: • Nagios, Cacti, RRD-based systems, munin, ganglia, HP Openview, Tivoli

  3. Telemetry • Turns out there’s a lot of telemetry data in our systems • We chose Postgres as we had good experience scaling it into the multiple terabyte range.

  4. All was well • Console based tools • GUI for monitoring • Collecting lots of data • This was the path forward

  5. All was well until... • We launched RaaS (Reconnoiter as a Service) • The scale upped. • The downtime tolerances dropped. • Sometimes I seriously regret founding a monitoring company. • Circonus was born... on Postgres.

  6. Handling data growth.
     CREATE TABLE rollup_matrix_numeric_5m (
         sid integer NOT NULL,
         name text NOT NULL,
         rollup_time timestamp with time zone NOT NULL,
         count_rows integer,
         avg_value numeric,
         counter_dev numeric
     );

  7. Handling data growth.
     CREATE TABLE rollup_matrix_numeric_5m (
         sid integer NOT NULL,
         name text NOT NULL,
         rollup_time timestamp with time zone NOT NULL,
         count_rows integer,
         avg_value numeric,
         counter_dev numeric
     );
     • Aggregate data for faster retrieval • hold data forever (aka 10 years) • Lots of rows • one million rows per metric • typical machine had 50 metrics
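
     A minimal sketch (not from the deck) of the kind of read this layout serves: pulling pre-aggregated 5-minute rollups for a graph rather than scanning raw samples. The sid, metric name, and time range are hypothetical.

         SELECT rollup_time, avg_value, counter_dev
           FROM rollup_matrix_numeric_5m
          WHERE sid = 12345                  -- hypothetical metric stream id
            AND name = 'cpu_idle'            -- hypothetical metric name
            AND rollup_time >= now() - interval '1 day'
          ORDER BY rollup_time;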

  8. Handling data growth. • Rollups happen at reliably consistent intervals: • 5m, 20m, 60m, etc. • Users expect them to be aligned (1:00, 1:05, 1:10, etc.) • This means we could stop storing every single timestamp • ... switch to make-shift column storage using arrays.

  9. Handling data growth. • Rollups happen at reliably consistent intervals: • 5m, 20m, 60m, etc. • Users expect them to be aligned (1:00, 1:05, 1:10, etc.) • This means we could stop storing every single timestamp • ... switch to make-shift column storage using arrays.
     CREATE TABLE metric_numeric_rollup_5m (
         sid integer NOT NULL,
         name text NOT NULL,
         rollup_time timestamp with time zone NOT NULL,
         count_rows integer[],
         avg_value numeric[],
         counter_dev numeric[]
     );
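
     Because rollups are aligned, a sample’s position in the array can be derived from its timestamp instead of being stored. A minimal sketch (not from the deck), assuming for illustration that each row covers a one-hour window, i.e. twelve aligned 5-minute slots; the sid, metric name, and timestamp are hypothetical.

         -- Illustrative only: fetch the 5-minute average recorded at 01:35,
         -- assuming each row spans one hour and array slot 1 is the top of that hour.
         SELECT avg_value[ (extract(epoch FROM (timestamptz '2013-05-14 01:35:00'
                                                - rollup_time))::int / 300) + 1 ]
           FROM metric_numeric_rollup_5m
          WHERE sid = 12345                  -- hypothetical stream id
            AND name = 'cpu_idle'            -- hypothetical metric name
            AND rollup_time = date_trunc('hour', timestamptz '2013-05-14 01:35:00');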

  10. The next disaster. • Every time new data comes in between 1:00 and 1:05, you must recalculate your rollups. • We chose to do this at 1:05.1 for obvious reasons. • With a large, distributed service, we always had data streams that were delayed causing massive re-rolling of data. • Next step...

  11. The next disaster. • Separate each data source into its own schema for decoupled rollup jobs. • individual “noits” might fall behind • overall system impact was minimal
      CREATE TABLE noit_r2112.metric_numeric_rollup_5m (
          sid integer NOT NULL,
          name text NOT NULL,
          rollup_time timestamp with time zone NOT NULL,
          count_rows integer[],
          avg_value numeric[],
          counter_dev numeric[]
      );
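
     A minimal sketch (not from the deck) of why the per-noit schemas helped: when a delayed sample finally arrives, only the affected noit’s schema has to be re-rolled, and only the array slot for that interval changes. The slot number, values, and identifiers are hypothetical, and the slot arithmetic assumes the same one-hour-per-row layout as the example above.

         -- Illustrative only: recompute the 01:35 slot for one stream in one noit's schema.
         UPDATE noit_r2112.metric_numeric_rollup_5m
            SET avg_value[8]  = 42.7,        -- hypothetical recomputed 5-minute average
                count_rows[8] = 17           -- hypothetical recomputed sample count
          WHERE sid = 12345                  -- hypothetical stream id
            AND name = 'cpu_idle'
            AND rollup_time = timestamptz '2013-05-14 01:00:00';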

  12. Where were we now? • We were actually pretty convinced this would scale. • The economic model wasn’t superb, but it fit within the business model.

  13. The next disaster • Fault tolerance. • With large clusters, we had machine failures regularly. • Automating failover and slave rebuilds was very challenging.

  14. The next disaster • Fault tolerance. • With large clusters, we had machine failures regularly. • Automating failover and slave rebuilds was very challenging. • Postgres 8.4 • Log Shipping (not the biggest problem) • Rebuild Masters as Slave (time consuming, only getting worse)

  15. The next disaster • Fault tolerance. • With large clusters, we had machine failures regularly. • Automating failover and slave rebuilds was very challenging. • For an internal tool, this would have been OK. • For an external tool, we needed to do better. • Postgres 8.4 • Log Shipping (not the biggest problem) • Rebuild Masters as Slave (time consuming, only getting worse)
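
     For context, a minimal sketch (not from the deck) of what file-based log shipping looked like on Postgres 8.4: the primary archives WAL segments and a warm standby replays them with pg_standby. The paths are illustrative, and this omits the failover and rebuild automation the slide notes was the hard part.

         # primary postgresql.conf (archive path is illustrative)
         archive_mode = on
         archive_command = 'cp %p /wal_archive/%f'

         # standby recovery.conf
         restore_command = 'pg_standby /wal_archive %f %p'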

  16. Where to next? • Teradata? • Oracle RAC? • Hadoop? • MongoDB? • Riak? • Cassandra?

  17. Not so obviously... • not Hadoop • our data was very structured • we did not have complex questions • operationally immature (spof and such)

  18. Not so obviously... • not Cassandra • operationally painful at the time. • we don’t delete old data... which means Cassandra didn’t want to be our friend.

  19. Riak... so good... but... • Riak seemed like a very good fit. • However, the storage design didn’t fit the type of data we were storing. • We really need a column store. • And... it turns out we need a lot less agreement.

  20. What would it take to build our own? • That’s stupid... that’s a bad idea. • (true)

  21. What would it take to build our own? • That exercise should only be done on a napkin, in a bar, after far too many drinks.

  22. What would it take to build our own? • That exercise led us to believe we could get: • 80% reduction in storage space, • no single point of failure, • (and no Erlang)

  23. Snowth • We had two years of wildly diverse data... (we understood the problem) • We prototyped the column store and got 80% reduction in storage space... (napkins at bars kick ass)

  24. Snowth • We had two years of wildly diverse data... (we understood the problem) • We prototyped the column store and got 80% reduction in storage space... (napkins at bars kick ass) • We did more work with lzjb on ZFS and got that to > 93% reduction in storage space.

  25. Snowth • We had two years of wildly diverse data... (we understood the problem) • We prototyped the column store and got 80% reduction in storage space... (napkins at bars kick ass) • We did more work with lzjb on ZFS and got that to > 93% reduction in storage space. • It took 8 man-weeks over three months to build.

  26. Snowth • We had two years of wildly diverse data... (we understood the problem) • We prototyped the column store and got 80% reduction in storage space... (napkins at bars kick ass) • We did more work with lzjb on ZFS and got that to > 93% reduction in storage space. • It took 8 man-weeks over three months to build. • It’s taken approximately 4 man months to maintain since then (approximately 6 weeks/year)
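
     A minimal sketch (not from the deck) of the ZFS side of that number: lzjb is a built-in ZFS compression setting enabled per dataset. The dataset name below is hypothetical.

         # enable lzjb compression on the (hypothetical) dataset holding snowth data
         zfs set compression=lzjb data/snowth
         # check the achieved compression ratio
         zfs get compressratio data/snowth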

  27. Easing into it. • Started collecting / populating data long before we looked at it • Used data in Postgres to verify data in Snowth

  28. Easing into it.
      my $cnt = $Request->Params('cnt') || $default_cnt;
      my $driver = Circonus::Graph::Driver::Flot->new($start, $end, $cnt, $type);
      +$driver->dataSetClass('Circonus::Graph::DataSet::Snowth') if $r->feature('snowth');
      my $json = {};
      if (@args == 1 || @args == 2) {
          my $graph = Circonus::Graph->new($args[0]);

  29. Teasing into it. • Started building graphs from Snowth • Internal folks only • Didn’t tell all of the internal folks :-)

  30. Meanwhile back on Postgres • Hardware failure rates increasing • Capacity planning was an almost daily exercise • Luckily made easy with Circonus projections :-)

  31. And snowth it goes • Switch everyone to snowth • Continue to run both for ~ 1 month • Turn off the Postgres

  32. Lessons “Learned” • No sacred cows • Heavy applications of time help reduce risk • If you do it right, it’s anti-climactic • Understand when your problems change

  33. The Right Tool For The Job? • Understand your problem • Problems are not static • Understand your problem