
GlueCon 2013 - Think Backwards: Realtime Analytics + Cassandra

At Parse, the demands of capturing and supporting realtime analytics differ enough from the standard load of a Backend-as-a-Service to necessitate exploring new options for data stores and data processing. This talk will cover the reasons we chose to work with Cassandra as our data store, and the inversion of the standard analysis pipeline.

Christine Yen

May 23, 2013

Transcript

  1. THINK BACKWARDS: REALTIME ANALYTICS + CASSANDRA
     Christine Yen, Parse (@cyen)
     - thanks to COLOURlovers: http://www.colourlovers.com/palette/46688/fresh_cut_day
  2. STATUS QUO: OUR BACKEND
     As flexible as possible for our developers, optimized for our load
     • Lots of reads
     • Caching layers in between
     • Steady load over all apps
     - At Parse, we're a Backend-as-a-Service. Our role as a platform means collecting analytics should be invisible to devs and end users.
     - Our earliest and primary product is data storage in the cloud, and our systems are focused primarily on that.
  3. WANTED: ANALYTICS
     • Lots of writes
     • Bursty
     • Fairly consistent data model
     • Start with a known load... THEN ONWARDS
     - Lots of writes, likely to exceed what our current data store can handle.
     - Unlike our data-access patterns, analytics could be very bursty.
     - But the structure of time-series data is fairly standard.
     - So we decided to start with a known load: the push service.
  4. TRADITIONAL BIG DATA
     • Capture data as generically as possible
     • Post-process to your heart's content
     • Instrument first, evaluate later
     - Spend as little time as possible on the write path.
     - Take your time during post-processing; this is where MapReduce lovers can go crazy.
     - Focus on capturing very generic data, with as much info as we might later need.
  5. THINGS FALL APART
     • aggregation: on-the-fly aggregation slows to a crawl
     • user-facing UI: minimize dependence on batch jobs
     - If analytics are user-facing (like push analytics), we can't depend solely on batch jobs.
     - And once users want to compare data, the problem multiplies.
     - Aggregation is slow, especially with Cassandra; more on that later.
  6. CASSANDRA
     • Write-optimized
     • Designed for data to be distributed
     • Highly available, eventually consistent
     • "Load balancing" + analytics?
     - Writes are much, much faster than reads (an order of magnitude?).
     - Lends itself well to replication and sharding (its Dynamo heritage); no single point of failure.
     - We can imagine analytics now fearlessly living in any code path.
     - Some sacrifices, but ultimately OK: reasonable load balancing without too much manual intervention.
  7. NEEDLE SCRATCH
     ...but reads can be frustratingly slow. HMM.
     - So we know all these good things about Cassandra, but the traditional approach centers processing around reads.
     - Let's explore: what unusual things can we do with a write-optimized store?
     - How can we take advantage of how Cassandra stores and works with its data?
  8. AGGREGATE AS WE GO
     • Example: 'push sent to iOS devices via REST'
     • track: ('push sent')
     • track: ('push sent', 'ios')
     • track: ('push sent', 'rest')
     • track: ('push sent', 'ios', 'rest')
     • Fast reads, no matter how we slice it
     - Push notifications are part of our offering.
     - Instead of one write (with a bag of properties as a blob, or column values to query over later), we write to FOUR places: every counter that might be affected. In this case, one per combination of event name + properties (see the sketch below).
     - A single read, from a known key, is fast.
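     A minimal sketch of that fan-out, assuming a flat colon-separated key scheme (the helper name, separator, and the increment_counter call are illustrative, not Parse's actual code):

     from itertools import combinations

     def counter_keys(event, properties):
         """Yield one counter key per combination of event name + properties,
         so any slice (all pushes, iOS pushes, REST pushes, iOS-via-REST pushes)
         is a single read from a known key."""
         props = sorted(properties)  # canonical order keeps keys deterministic
         for r in range(len(props) + 1):
             for combo in combinations(props, r):
                 yield ":".join([event, *combo])

     # 'push sent to iOS devices via REST' fans out to four increments:
     for key in counter_keys("PushSent", ["ios", "rest"]):
         print(key)
         # increment_counter(key, bucket=current_hour)  # hypothetical Cassandra write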
  9. SCHEMA PLANS
     • Row key: as much as we can (within reason)
     • Column keys: bucket by time interval
     • Column values: store a counter
     - Let's take a close look at how Cassandra stores its data and match our data model to its storage patterns.
     - All traffic related to one row is handled by a single node (or set of replicas, to be honest); take advantage of wide rows and read from the minimum number of machines. See the sketch below.
     - A row's column data is stored in order by key, so grabbing a time slice is a trivial slice over a row's columns.
     - Values are straightforward: counter columns are a special column type.
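     A minimal sketch of mapping one event occurrence to its storage coordinates, assuming colon-joined row keys and hour-of-the-month column buckets (the app id, helper name, and bucket format below are illustrative):

     from datetime import datetime, timezone

     def storage_coordinates(app_id, event, properties, ts):
         """Map one event occurrence to (row key, column key):
         - row key packs everything we know up front (app, event, property combo)
         - column key is the time bucket (here: hour of the month), so columns
           within a row sort chronologically and a date range is a column slice."""
         row_key = ":".join([app_id, event, *sorted(properties)])
         column_key = ts.strftime("%d%H")  # day-of-month + hour; interval is an assumption
         return row_key, column_key

     row, col = storage_coordinates("myapp", "PushSent", ["ios"],
                                    datetime(2013, 5, 23, 14, tzinfo=timezone.utc))
     print(row, col)  # myapp:PushSent:ios 2314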
  10. READING OUR DATA
     "Get counts of iOS pushes over the last week"
     [diagram: rows[key] = row; a wide row keyed by "<appId>:PushSent:ios:", whose columns are time buckets holding special "counter" value types]
     - Notice a few things: the row key is just a string, with some values concatenated together (more on that later).
     - The row itself is just a key-value store of column names to values; Cassandra is essentially a map of a map.
     - We've shown how we map column names to "hour of the month."
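     A toy model of that read, treating Cassandra as the map of a map described in the notes (the app id, bucket values, and counts are made up; hour-of-month keys only sort correctly within a single month, which keeps the toy simple):

     from datetime import datetime, timedelta, timezone

     # rows[row_key] -> ordered {column_key (time bucket): counter value}
     rows = {
         "myapp:PushSent:ios": {  # "myapp" is a hypothetical app id
             "1600": 40,   # May 16, 00:00
             "1714": 12,   # May 17, 14:00
             "2009": 61,   # May 20, 09:00
             "2310": 72,   # May 23, 10:00
         }
     }

     def counts_between(row_key, start, end, fmt="%d%H"):
         """'Get counts of iOS pushes over the last week': because column keys
         sort chronologically within a row, the read is one contiguous slice."""
         lo, hi = start.strftime(fmt), end.strftime(fmt)
         row = rows.get(row_key, {})
         return {col: count for col, count in sorted(row.items()) if lo <= col <= hi}

     end = datetime(2013, 5, 23, 14, tzinfo=timezone.utc)
     print(counts_between("myapp:PushSent:ios", end - timedelta(days=7), end))
     # {'1714': 12, '2009': 61, '2310': 72}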
  11. QUIRKS
     • De-normalize and duplicate
     • Know thy data
     • Too much flexibility?
     • Standardize key generation
     - Most data stores trend toward normalization: foreign keys, relations, etc.
     - Think of use cases and sample queries, and use them to plan the data model.
     - Watch out: diverging key-generation schemes could be awful. We standardize on the same separator as our other generated keys.
     - We wrap key-building logic in helper methods and MAKE SURE all logic goes through them (see the sketch below).
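     One way such a chokepoint might look; the separator and validation are assumptions, and the point is simply that every write path and read path builds keys through the same helper:

     SEPARATOR = ":"  # assumed; what matters is that there is exactly one scheme

     def build_counter_key(*parts):
         """Single chokepoint for key generation, so write-side and read-side
         key formats can never silently diverge."""
         clean = [str(p).strip() for p in parts if p]
         if any(SEPARATOR in p for p in clean):
             raise ValueError("key parts must not contain the separator")
         return SEPARATOR.join(clean)

     # Both the write path and the read path call the same helper:
     write_key = build_counter_key("myapp", "PushSent", "ios")  # hypothetical app id
     read_key = build_counter_key("myapp", "PushSent", "ios")
     assert write_key == read_key
     print(write_key)  # myapp:PushSent:ios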
  12. CAVEATS
     • Batch jobs aren't evil
     • Old time-series data is unlikely to change
     - This goes back to "know your data."
     - You CAN do batch / rollup jobs (hourly columns rolled up into year rows with daily columns?); a sketch follows.
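     A sketch of what such a rollup could look like under the hour-of-month bucketing assumed above (not Parse's actual job):

     from collections import defaultdict

     def roll_up_hourly_to_daily(hourly_row):
         """Collapse hourly columns ("DDHH" = day + hour of month) into daily
         columns ("DD"). Old time-series data is unlikely to change, so this
         kind of batch job is safe to run lazily."""
         daily = defaultdict(int)
         for bucket, count in hourly_row.items():
             daily[bucket[:2]] += count  # keep the day, drop the hour
         return dict(daily)

     print(roll_up_hourly_to_daily({"2300": 10, "2301": 7, "2214": 3}))
     # {'23': 17, '22': 3}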
  13. RECAP!
     • Cassandra: write-optimized K-V store
     • Know your data
     • ...and know how Cassandra will store it
     • Think backwards
     QUESTIONS? @cyen. P.S. we're hiring :)
     - Cassandra's write/read behavior comes from Bigtable.
     - THINK BACKWARDS: start from your queries. If your store's strengths are "inverted," then maybe your approach should be, too.