
Lessons Learned from Parse Analytics

Christine Yen
September 07, 2016


Transcript

  1. THE PAST: OVERVIEW
     ▸ Backend as a Service
     ▸ Mobile apps saved objects and ran queries, backed by Mongo
     ▸ New product: Analytics
     ▸ Knew (in 2012) that Mongo wouldn't support a real analytics workload
  2. A NEW ANALYTICS PRODUCT, YOU SAY?
     WANTED
     ▸ Scale well for write-heavy load
     ▸ Coarse, well-defined time series
     ▸ Realtime results
     ▸ Minimize post-processing … aka write-time aggregation
     CASSANDRA
     ▸ Ideas from Dynamo + Bigtable
     ▸ Designed to be distributed
     ▸ Tradeoff: highly available, eventually consistent
     ▸ Tradeoff: speedy writes, slower reads
  3. A LOOK INSIDE: UNDERLYING STORAGE
     ▸ Example event: "API Request: received a GET request from an iOS device"
     ▸ Counter rows shown on the slide, with columns keyed by time bucket (TIME ——>):
         api_key:ApiRequest:201604
         api_key:ApiRequest:ios:201604
         api_key:ApiRequest:get:201604
         api_key:ApiRequest:ios:get:201604
       [per-bucket counts omitted; only the row keys are recoverable from the slide image]
     ▸ Random partitioning by row key
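The four row keys above fall out of a write-time fanout: every subset of an event's dimension values gets its own counter row. A minimal sketch of that key generation, assuming illustrative names (`fanout_row_keys` is hypothetical, not Parse's actual code):

```python
from itertools import combinations

def fanout_row_keys(api_key, event, dims, month):
    """One counter row key per subset of the event's dimension values
    (hypothetical helper; key layout inferred from the slide)."""
    values = list(dims.values())  # insertion order: platform, then method
    keys = []
    for r in range(len(values) + 1):
        for subset in combinations(values, r):
            keys.append(":".join([api_key, event, *subset, month]))
    return keys

# A GET request from an iOS device touches all four rows from the slide:
for key in fanout_row_keys("api_key", "ApiRequest",
                           {"platform": "ios", "method": "get"}, "201604"):
    print(key)
```

Each incoming event then increments one time-bucket column in every one of these rows, which is what makes reads cheap and writes heavy.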
  4. WHAT COULD POSSIBLY GO WRONG?
     NOW LET'S ALLOW USERS TO DEFINE EVENTS + DIMENSIONS!
     IT WORKS!
     (photo credit: Jared Erondu / Unsplash)
  5. INITIAL BELIEFS: KNOW THY (TIME SERIES) DATA (AKA FORESHADOWING)
     AKA, USERS WILL…
     ▸ Understand product constraints and only send data that makes sense
     ▸ Value flexibility above all else, and self-regulate
     ▸ … and also be OK with limits on the # of k/v pairs per event (to prevent combinatorial explosion of Cassandra writes)
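The combinatorial explosion that the k/v limit guards against: with one counter row per subset of an event's dimensions, an event carrying k key/value pairs costs 2^k row increments. A quick back-of-the-envelope (the function name is illustrative):

```python
def writes_per_event(k):
    """Counter-row increments per event with k dimension k/v pairs:
    one per subset of the dimensions, i.e. 2**k (assumed fanout model)."""
    return 2 ** k

for k in (2, 4, 8, 16):
    print(f"{k} k/v pairs -> {writes_per_event(k)} row increments per event")
```

Without a hard cap, a user sending 16 dimensions per event turns every logged event into tens of thousands of Cassandra writes.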
  6. READING YOUR CUSTOM EVENTS: RETRIEVING CUSTOM TIME SERIES
     ▸ Needed to store all dimension keys/values to generate time series row keys
     ▸ Started off with documents in Mongo + a write-through cache

     MONGO
     "api_key": {
       "ApiRequest": {
         "platform": ["ios", "android"],
         "method": ["get", "post", "put"]
       }
     }

     CASSANDRA
     "api_key:ApiRequest:ios:get:201604": { 0: 150, 1: 29, ... }

     WHICH WORKED GREAT! UNTIL…
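The Mongo document exists so the read path can reconstruct which Cassandra rows to fetch: take the cross product of every known dimension value. A sketch under that assumption (`read_row_keys` is a hypothetical helper, not Parse's code):

```python
from itertools import product

# The Mongo document from the slide, tracking every dimension value seen:
schema = {
    "ApiRequest": {
        "platform": ["ios", "android"],
        "method": ["get", "post", "put"],
    }
}

def read_row_keys(api_key, event, month):
    """Enumerate the cross product of known dimension values to get the
    Cassandra row keys holding each fully-dimensioned time series."""
    return [":".join([api_key, event, *combo, month])
            for combo in product(*schema[event].values())]

keys = read_row_keys("api_key", "ApiRequest", "201604")
```

For 2 platforms x 3 methods this yields 6 row keys; each newly seen value multiplies the set, which is why keeping these documents small and accurate mattered.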
  7. OLD & BUSTED
     # Mongo:
     "api_key": {
       "ApiRequest": {
         "platform": ["ios", "android"],
         "method": ["get", "post", "put"]
       }
     }

     NEW HOTNESS
     # Mongo:
     "api_key": [{
       "event": "ApiRequest",
       "dimensions": { "platform": false, "method": false }
     }, …]

     # Cassandra:
     "api_key:ApiRequest:platform"      "api_key:ApiRequest:method"
     | "ios" | "android" |              | "get" | "post" | "put" |
     |   7   |     3     |              |   5   |    2   |   3   |
  8. IN RETROSPECT: PROBLEMS, A RECAP
     ▸ Row keys were constructed with a granularity of 1 month
     ▸ Random partitioning + reliance on write-time fanout to spread load
     ▸ Still ran into hotspots
     ▸ Nodes were CPU-bound (compaction!)
     ▸ Ran Cassandra 1.1.8 (no vnodes yet)
     ▸ A solution that should theoretically have scaled linearly in fact required exponential growth
     ▸ All things considered, Cassandra itself was fine; our requirements were not.
  9. IN RETROSPECT: LESSONS LEARNED
     ▸ Pair concrete limits with best-practices suggestions
     ▸ Don't trust the user to "do what's right," especially when assumptions abound
     ▸ Extreme "flexibility" benefits no one
     ▸ "Random" distribution of data isn't good enough
     ▸ Write-time aggregation is not the future
     THANKS! @CYEN