How Do You Eat An Elephant? (RICON East 2013)

Presented by Theo Schlossnagle and Robert Treat at RICON East 2013

When OmniTI first set out to build a next-generation monitoring system, we turned to one of our most trusted tools for data management: Postgres. While this worked well for developing the initial open source application, we eventually ran into scaling issues as we continued to grow the Circonus public monitoring service. This talk will cover some of the changes we made to make the original Postgres system work better, some of the other systems we evaluated, and the eventual solution to our problem: building our own time series database. Of course, that's only half the story. We'll also go into how we swapped out these backend data storage pieces in our production environment, all the while capturing and reporting on millions of metrics, without downtime or customer interruption.

About Theo

Theo Schlossnagle is a Founder and Principal at OmniTI, where he designs and implements scalable solutions for highly trafficked sites and other clients in need of sound, scalable architectural engineering. He is the architect of the highly scalable Momentum mail transport agent; principal architect of Fontdeck, which delivers professional typefaces optimized for the web; Project Lead and Architect for OmniOS, an Illumos-based operating system distribution; and Founder and Principal Architect of Circonus, a cloud platform designed for monitoring and marrying systems and business analytics. He authored Scalable Internet Architectures (Sams) and is a veteran speaker on the open source conference circuit. A member of the Apache Software Foundation and IEEE, and a senior member of the ACM, he serves on the editorial board of ACM’s Queue Magazine.

About Robert

Having worked on database-backed, internet-based systems for over a decade, Robert Treat is co-author of the book Beginning PHP and PostgreSQL 8, maintains the phpPgAdmin software package, and has been recognized as a major contributor to the PostgreSQL project for his work over the years. An international speaker on databases, open source, and managing web operations at scale, he spends his days as COO of OmniTI, a consultancy focused on building and managing large-scale web infrastructure.

Basho Technologies

May 14, 2013

Transcript

  1. Eating an elephant: a journey from Postgres to elsewhere

  2. Theo @postwait

  3. Robert @robtreat2

  4. OmniTI • A consultancy focused on web scalability • We manage lots of large scale systems • We hated our tools and we like to build new products...
  5. Reconnoiter • Enter Reconnoiter: “Large-scale monitoring and trend analysis system” • http://labs.omniti.com/labs/reconnoiter
  6. Reconnoiter had simple roots. • And I’ll add: a very low bar. • We just needed to analyze telemetry better than: • Nagios, Cacti, RRD-based systems, munin, ganglia, HP Openview, Tivoli
  7. Telemetry • Turns out there’s a lot of telemetry data in our systems • We chose Postgres as we had good experience scaling it into the multiple terabyte range.
  8. What’s going on here?

  9. All was well • Console based tools • GUI for monitoring • Collecting lots of data • This was the path forward
  10. All was well until... • We launched RaaS (Reconnoiter as a Service) • The scale upped. • The downtime tolerances dropped. • Sometimes I seriously regret founding a monitoring company. • Circonus was born... on Postgres.
  11. Handling data growth.
      CREATE TABLE rollup_matrix_numeric_5m (
        sid integer NOT NULL,
        name text NOT NULL,
        rollup_time timestamp with time zone NOT NULL,
        count_rows integer,
        avg_value numeric,
        counter_dev numeric
      );
  12. Handling data growth.
      CREATE TABLE rollup_matrix_numeric_5m (
        sid integer NOT NULL,
        name text NOT NULL,
        rollup_time timestamp with time zone NOT NULL,
        count_rows integer,
        avg_value numeric,
        counter_dev numeric
      );
      • Aggregate data for faster retrieval • hold data forever (aka 10 years) • Lots of rows • one million rows per metric • typical machine had 50 metrics
  13. Handling data growth. • Rollups happen at reliably consistent intervals: • 5m, 20m, 60m, etc. • Users expect them to be aligned (1:00, 1:05, 1:10, etc.) • This means we could stop storing every single timestamp • ... switch to make-shift column storage using arrays.
  14. Handling data growth. • Rollups happen at reliably consistent intervals: • 5m, 20m, 60m, etc. • Users expect them to be aligned (1:00, 1:05, 1:10, etc.) • This means we could stop storing every single timestamp • ... switch to make-shift column storage using arrays.
      CREATE TABLE metric_numeric_rollup_5m (
        sid integer NOT NULL,
        name text NOT NULL,
        rollup_time timestamp with time zone NOT NULL,
        count_rows integer[],
        avg_value numeric[],
        counter_dev numeric[]
      );
  15. The next disaster. • Every time new data comes in between 1:00 and 1:05, you must recalculate your rollups. • We chose to do this at 1:05.1 for obvious reasons. • With a large, distributed service, we always had data streams that were delayed causing massive re-rolling of data. • Next step...
  16. It’s a shame about Ray

  17. The next disaster. • Separate each data source into its own schema for decoupled rollup jobs. • individual “noits” might fall behind • overall system impact was minimal
      CREATE TABLE noit_r2112.metric_numeric_rollup_5m (
        sid integer NOT NULL,
        name text NOT NULL,
        rollup_time timestamp with time zone NOT NULL,
        count_rows integer[],
        avg_value numeric[],
        counter_dev numeric[]
      );
  18. Where were we now? • We were actually pretty convinced this would scale. • The economic model wasn’t superb, but it fit within the business model.
  19. The next disaster • Fault tolerance. • With large clusters, we had machine failures regularly. • Automating failover and slave rebuilds was very challenging.
  20. The next disaster • Fault tolerance. • With large clusters, we had machine failures regularly. • Automating failover and slave rebuilds was very challenging. • Postgres 8.4 • Log Shipping (not the biggest problem) • Rebuild Masters as Slave (time consuming, only getting worse)
  21. The next disaster • Fault tolerance. • With large clusters, we had machine failures regularly. • Automating failover and slave rebuilds was very challenging. • For an internal tool, this would have been OK. • For an external tool, we needed to do better. • Postgres 8.4 • Log Shipping (not the biggest problem) • Rebuild Masters as Slave (time consuming, only getting worse)
  22. Where to next? • Teradata? • Oracle RAC? • Hadoop? • MongoDB? • Riak? • Cassandra?
  23. Obviously... not MongoDB

  24. Perhaps obviously... • not Teradata, Oracle RAC, etc... • expensive • broke the economic model
  25. Not so obviously... • not Hadoop • our data was very structured • we did not have complex questions • operationally immature (spof and such)
  26. Not so obviously... • not Cassandra • operationally painful at the time. • we don’t delete old data... which means Cassandra didn’t want to be our friend.
  27. Riak... so good... but... • Riak seemed like a very good fit. • However, the storage design didn’t fit the type of data we were storing. • We really need a column store. • And... it turns out we need a lot less agreement.
  28. Maybe we should build our own? “No, we shouldn’t” -- Robert Treat
  29. What would it take to build our own?

  30. What would it take to build our own? • That’s stupid.. that’s a bad idea.
  31. What would it take to build our own? • That’s stupid.. that’s a bad idea. • (true)
  32. What would it take to build our own? • That exercise should only be done on a napkin, in a bar, after far too many drinks.
  33. What would it take to build our own?

  34. What would it take to build our own? • That exercise led us to believe we could get: • 80% reduction in storage space, • no single point of failure, • (and no Erlang)
  35. Napkin -> Whiteboard

  36. Snowth

  37. Snowth • We had two years of wildly diverse data... (we understood the problem)
  38. Snowth • We had two years of wildly diverse data... (we understood the problem) • We prototyped the column store and got 80% reduction in storage space... (napkins at bars kick ass)
  39. Snowth • We had two years of wildly diverse data... (we understood the problem) • We prototyped the column store and got 80% reduction in storage space... (napkins at bars kick ass) • We did more work with lzjb on ZFS and got that to > 93% reduction in storage space.
  40. Snowth • We had two years of wildly diverse data... (we understood the problem) • We prototyped the column store and got 80% reduction in storage space... (napkins at bars kick ass) • We did more work with lzjb on ZFS and got that to > 93% reduction in storage space. • It took 8 man-weeks over three months to build.
  41. Snowth • We had two years of wildly diverse data... (we understood the problem) • We prototyped the column store and got 80% reduction in storage space... (napkins at bars kick ass) • We did more work with lzjb on ZFS and got that to > 93% reduction in storage space. • It took 8 man-weeks over three months to build. • It’s taken approximately 4 man months to maintain since then (approximately 6 weeks/year)
  42. Easing into it. • We ran Postgres and snowth in parallel for 14 months
  43. Easing into it. • Started collecting / populating data long before we looked at it • Used data in Postgres to verify data in Snowth
  44. Easing into it. • Started building graphs from Snowth • Internal folks only
  45. Easing into it.
       my $cnt = $Request->Params('cnt') || $default_cnt;
       my $driver = Circonus::Graph::Driver::Flot->new($start, $end, $cnt, $type);
      +$driver->dataSetClass('Circonus::Graph::DataSet::Snowth') if $r->feature('snowth');
       my $json = {};
       if (@args == 1 || @args == 2) {
         my $graph = Circonus::Graph->new($args[0]);
  46. Teasing into it. • Started building graphs from Snowth • Internal folks only • Didn’t tell all of the internal folks :-)
  47. Easing into it. • We ran Postgres and snowth in parallel for 14 months
  48. Meanwhile back on Postgres • Hardware failure rates increasing • Capacity planning was an almost daily exercise • Luckily made easy with Circonus projections :-)
  49. And snowth it goes • Switch everyone to snowth • Continue to run both for ~ 1 month • Turn off the Postgres
  50. Lessons “Learned” • No sacred cows • Heavy applications of time help reduce risk • If you do it right, it’s anti-climactic • Understand when your problems change
  51. The Right Tool For The Job? • Understand your problem • Problems are not static • Understand your problem
  52. Want More?
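
The array layout on slides 13 and 14 relies on rollups being aligned to fixed intervals, so a timestamp can be recovered from an array position instead of being stored per value. A minimal Python sketch of that mapping follows. It assumes one array row per UTC day (288 five-minute slots); the talk does not state the row granularity, and the names `slot_for`, `timestamp_for`, `ROLLUP_SECONDS`, and `SLOTS_PER_ROW` are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

ROLLUP_SECONDS = 5 * 60   # 5m rollup interval, as on the slides
SLOTS_PER_ROW = 288       # ASSUMPTION: one array row covers one UTC day


def slot_for(ts):
    """Locate a timezone-aware timestamp as (row_start, array index).

    row_start plays the role of rollup_time in the array-based table;
    index is the position inside avg_value[], count_rows[], etc.
    """
    ts = ts.astimezone(timezone.utc)
    row_start = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    offset = int((ts - row_start).total_seconds())
    return row_start, offset // ROLLUP_SECONDS


def timestamp_for(row_start, index):
    """Rebuild the aligned timestamp that an array position stands for."""
    if not 0 <= index < SLOTS_PER_ROW:
        raise IndexError("slot index out of range for one row")
    return row_start + timedelta(seconds=index * ROLLUP_SECONDS)


if __name__ == "__main__":
    ts = datetime(2013, 5, 14, 1, 5, tzinfo=timezone.utc)
    row_start, index = slot_for(ts)
    print(row_start.isoformat(), index)
    print(timestamp_for(row_start, index).isoformat())
```

Under these assumptions only `rollup_time` is stored once per row, and every other timestamp is implied by its position in the arrays.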