
GlueCon 2013 - Think Backwards: Realtime Analytics + Cassandra

At Parse, the demands of capturing and supporting realtime analytics differ enough from the standard load of a Backend-as-a-Service to necessitate exploring new options for data stores and data processing. This talk will cover the reasons we chose to work with Cassandra as our data store, and the inversion of the standard analysis pipeline.

Christine Yen

May 23, 2013

Transcript

  1. THINK BACKWARDS: REALTIME ANALYTICS + CASSANDRA
     Christine Yen, Parse (@cyen)
     - thanks to COLOURlovers: http://www.colourlovers.com/palette/46688/fresh_cut_day
  2. STATUS QUO: OUR BACKEND
     As flexible as possible for our developers, optimized for our load
     • Lots of reads
     • Caching layers in between
     • Steady load over all apps
     - At Parse, we're a Backend-as-a-Service. Our role as a platform means collecting analytics should be invisible to devs and end users.
     - Our earliest and primary product is data storage in the cloud, and our systems are focused primarily on that.
  3. WANTED: ANALYTICS
     • Lots of writes
     • Bursty
     • Fairly consistent data model
     • Start with a known load... THEN ONWARDS
     - Lots of writes, likely to exceed what our current data store can handle.
     - Unlike our data-access patterns, analytics could be very bursty.
     - But the structure of time-series data is fairly standard.
     - So we decided to start with a known load: the push service.
  4. TRADITIONAL BIG DATA
     • Capture data as generically as possible
     • Post-process to your heart's content
     • Instrument first, evaluate later
     - Spend as little time as possible on the write path.
     - Take your time during post-processing; this is where MapReduce lovers can go crazy.
     - Focus on capturing very generic data, with as much info as we might later need.
  5. THINGS FALL APART
     • aggregation: on-the-fly aggregation slows to a crawl
     • user-facing UI: minimize dependence on batch jobs
     - If analytics are user-facing (like push analytics), we can't depend solely on batch jobs.
     - And once users want to compare data, the problem multiplies.
     - Aggregation is slow, especially with Cassandra; more on that later.
  6. CASSANDRA
     • Write-optimized
     • Designed for data to be distributed
     • Highly available, eventually consistent
     • "Load balancing" + analytics?
     - Writes are much, much faster than reads (an order of magnitude?).
     - Lends itself well to replication and sharding (its Dynamo heritage); no single point of failure.
     - We can imagine analytics now fearlessly living in any code path.
     - Some sacrifices, but ultimately OK: reasonable load balancing without too much manual intervention.
  7. NEEDLE SCRATCH
     ...but reads can be frustratingly slow. HMM.
     - So we know all these good things about Cassandra, but the traditional approach centers processing around reads.
     - Let's explore: what unusual things can we do with a write-optimized store?
     - How can we take advantage of how Cassandra stores and works with its data?
  8. AGGREGATE AS WE GO
     • Example: 'push sent to iOS devices via REST'
     • track: ('push sent')
     • track: ('push sent', 'ios')
     • track: ('push sent', 'rest')
     • track: ('push sent', 'ios', 'rest')
     • Fast reads, no matter how we slice it
     - Push notifications are part of our offering.
     - Instead of one write (with a bag of properties as a blob, or column values to query over later), we write to FOUR places: every counter that might be affected. In this case, one per combination of event name + properties (see the sketch below).
     - A single read, from a known key, is fast.
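     A minimal sketch of that fan-out, assuming a flat colon-separated key scheme (the helper name, separator, and the increment_counter call are illustrative, not Parse's actual code):

     from itertools import combinations

     def counter_keys(event, properties):
         """Yield one counter key per combination of event name + properties,
         so any slice (all pushes, iOS pushes, REST pushes, iOS-via-REST pushes)
         is a single read from a known key."""
         props = sorted(properties)  # canonical order keeps keys deterministic
         for r in range(len(props) + 1):
             for combo in combinations(props, r):
                 yield ":".join([event, *combo])

     # 'push sent to iOS devices via REST' fans out to four increments:
     for key in counter_keys("PushSent", ["ios", "rest"]):
         print(key)
         # increment_counter(key, bucket=current_hour)  # hypothetical Cassandra write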
  9. SCHEMA PLANS
     • Row key: as much as we can (within reason)
     • Column keys: bucket by time interval
     • Column values: store a counter
     - Let's take a close look at how Cassandra stores its data and match our data model to its storage patterns.
     - All traffic related to one row is handled by a single node (or set of replicas, to be honest); take advantage of wide rows and read from the minimum number of machines. See the sketch below.
     - A row's column data is stored in order by key, so grabbing a time slice is a trivial slice over a row's columns.
     - Values are straightforward: counter columns are a special column type.
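     A minimal sketch of mapping one event occurrence to its storage coordinates, assuming colon-joined row keys and hour-of-the-month column buckets (the app id, helper name, and bucket format below are illustrative):

     from datetime import datetime, timezone

     def storage_coordinates(app_id, event, properties, ts):
         """Map one event occurrence to (row key, column key):
         - row key packs everything we know up front (app, event, property combo)
         - column key is the time bucket (here: hour of the month), so columns
           within a row sort chronologically and a date range is a column slice."""
         row_key = ":".join([app_id, event, *sorted(properties)])
         column_key = ts.strftime("%d%H")  # day-of-month + hour; interval is an assumption
         return row_key, column_key

     row, col = storage_coordinates("myapp", "PushSent", ["ios"],
                                    datetime(2013, 5, 23, 14, tzinfo=timezone.utc))
     print(row, col)  # myapp:PushSent:ios 2314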
  10. READING OUR DATA
     "Get counts of iOS pushes over the last week"
     [diagram: rows[key] = row; a wide row keyed by "<appId>:PushSent:ios:", whose columns are time buckets holding special "counter" value types]
     - Notice a few things: the row key is just a string, with some values concatenated together (more on that later).
     - The row itself is just a key-value store of column names to values; Cassandra is essentially a map of a map.
     - We've shown how we map column names to "hour of the month."
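     A toy model of that read, treating Cassandra as the map of a map described in the notes (the app id, bucket values, and counts are made up; hour-of-month keys only sort correctly within a single month, which keeps the toy simple):

     from datetime import datetime, timedelta, timezone

     # rows[row_key] -> ordered {column_key (time bucket): counter value}
     rows = {
         "myapp:PushSent:ios": {  # "myapp" is a hypothetical app id
             "1600": 40,   # May 16, 00:00
             "1714": 12,   # May 17, 14:00
             "2009": 61,   # May 20, 09:00
             "2310": 72,   # May 23, 10:00
         }
     }

     def counts_between(row_key, start, end, fmt="%d%H"):
         """'Get counts of iOS pushes over the last week': because column keys
         sort chronologically within a row, the read is one contiguous slice."""
         lo, hi = start.strftime(fmt), end.strftime(fmt)
         row = rows.get(row_key, {})
         return {col: count for col, count in sorted(row.items()) if lo <= col <= hi}

     end = datetime(2013, 5, 23, 14, tzinfo=timezone.utc)
     print(counts_between("myapp:PushSent:ios", end - timedelta(days=7), end))
     # {'1714': 12, '2009': 61, '2310': 72}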
  11. QUIRKS
     • De-normalize and duplicate
     • Know thy data
     • Too much flexibility?
     • Standardize key generation
     - Most data stores trend toward normalization: foreign keys, relations, etc.
     - Think of use cases and sample queries, and use them to plan the data model.
     - Watch out: diverging key-generation schemes could be awful. We standardize on the same separator as our other generated keys.
     - We wrap key-building logic in helper methods and MAKE SURE all logic goes through them (see the sketch below).
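     One way such a chokepoint might look; the separator and validation are assumptions, and the point is simply that every write path and read path builds keys through the same helper:

     SEPARATOR = ":"  # assumed; what matters is that there is exactly one scheme

     def build_counter_key(*parts):
         """Single chokepoint for key generation, so write-side and read-side
         key formats can never silently diverge."""
         clean = [str(p).strip() for p in parts if p]
         if any(SEPARATOR in p for p in clean):
             raise ValueError("key parts must not contain the separator")
         return SEPARATOR.join(clean)

     # Both the write path and the read path call the same helper:
     write_key = build_counter_key("myapp", "PushSent", "ios")  # hypothetical app id
     read_key = build_counter_key("myapp", "PushSent", "ios")
     assert write_key == read_key
     print(write_key)  # myapp:PushSent:ios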
  12. CAVEATS
     • Batch jobs aren't evil
     • Old time-series data is unlikely to change
     - This goes back to "know your data."
     - You CAN do batch / rollup jobs (hourly columns rolled up into year rows with daily columns?); a sketch follows.
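     A sketch of what such a rollup could look like under the hour-of-month bucketing assumed above (not Parse's actual job):

     from collections import defaultdict

     def roll_up_hourly_to_daily(hourly_row):
         """Collapse hourly columns ("DDHH" = day + hour of month) into daily
         columns ("DD"). Old time-series data is unlikely to change, so this
         kind of batch job is safe to run lazily."""
         daily = defaultdict(int)
         for bucket, count in hourly_row.items():
             daily[bucket[:2]] += count  # keep the day, drop the hour
         return dict(daily)

     print(roll_up_hourly_to_daily({"2300": 10, "2301": 7, "2214": 3}))
     # {'23': 17, '22': 3}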
  13. RECAP!
     • Cassandra: write-optimized K-V store
     • Know your data
     • ...and know how Cassandra will store it
     • Think backwards
     QUESTIONS? @cyen. P.S. we're hiring :)
     - Cassandra's write/read behavior comes from Bigtable.
     - THINK BACKWARDS: start from your queries. If your store's strengths are "inverted," then maybe your approach should be, too.