
Lessons Learned from Parse Analytics

Christine Yen
September 07, 2016


Transcript

  1. THE PAST: OVERVIEW
     ▸ Backend as a Service
     ▸ Mobile apps saved objects and ran queries, backed by Mongo
     ▸ New product: Analytics
     ▸ Knew (in 2012) that Mongo wouldn't support a real analytics workload
  2. A NEW ANALYTICS PRODUCT, YOU SAY?
     WANTED
     ▸ Scale well for write-heavy load
     ▸ Coarse, well-defined time series
     ▸ Realtime results
     ▸ Minimize post-processing … aka write-time aggregation
     CASSANDRA
     ▸ Ideas from Dynamo + Bigtable
     ▸ Designed to be distributed
     ▸ Tradeoff: highly available, eventually consistent
     ▸ Tradeoff: speedy writes, slower reads
  3. A LOOK INSIDE: UNDERLYING STORAGE
     ▸ Example event: "API Request: received a GET request from an iOS device"
     ▸ Counter rows shown on the slide, with columns keyed by time bucket (TIME ——>):
         api_key:ApiRequest:201604
         api_key:ApiRequest:ios:201604
         api_key:ApiRequest:get:201604
         api_key:ApiRequest:ios:get:201604
       [per-bucket counts omitted; only the row keys are recoverable from the slide image]
     ▸ Random partitioning by row key
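The four row keys above fall out of a write-time fanout: every subset of an event's dimension values gets its own counter row. A minimal sketch of that key generation, assuming illustrative names (`fanout_row_keys` is hypothetical, not Parse's actual code):

```python
from itertools import combinations

def fanout_row_keys(api_key, event, dims, month):
    """One counter row key per subset of the event's dimension values
    (hypothetical helper; key layout inferred from the slide)."""
    values = list(dims.values())  # insertion order: platform, then method
    keys = []
    for r in range(len(values) + 1):
        for subset in combinations(values, r):
            keys.append(":".join([api_key, event, *subset, month]))
    return keys

# A GET request from an iOS device touches all four rows from the slide:
for key in fanout_row_keys("api_key", "ApiRequest",
                           {"platform": "ios", "method": "get"}, "201604"):
    print(key)
```

Each incoming event then increments one time-bucket column in every one of these rows, which is what makes reads cheap and writes heavy.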
  4. WHAT COULD POSSIBLY GO WRONG?
     NOW LET'S ALLOW USERS TO DEFINE EVENTS + DIMENSIONS!
     IT WORKS!
     (photo credit: Jared Erondu / Unsplash)
  5. INITIAL BELIEFS: KNOW THY (TIME SERIES) DATA (AKA FORESHADOWING)
     AKA, USERS WILL…
     ▸ Understand product constraints and only send data that makes sense
     ▸ Value flexibility above all else, and self-regulate
     ▸ … and also be OK with limits on the # of k/v pairs per event (to prevent combinatorial explosion of Cassandra writes)
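The combinatorial explosion that the k/v limit guards against: with one counter row per subset of an event's dimensions, an event carrying k key/value pairs costs 2^k row increments. A quick back-of-the-envelope (the function name is illustrative):

```python
def writes_per_event(k):
    """Counter-row increments per event with k dimension k/v pairs:
    one per subset of the dimensions, i.e. 2**k (assumed fanout model)."""
    return 2 ** k

for k in (2, 4, 8, 16):
    print(f"{k} k/v pairs -> {writes_per_event(k)} row increments per event")
```

Without a hard cap, a user sending 16 dimensions per event turns every logged event into tens of thousands of Cassandra writes.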
  6. READING YOUR CUSTOM EVENTS: RETRIEVING CUSTOM TIME SERIES
     ▸ Needed to store all dimension keys/values to generate time series row keys
     ▸ Started off with documents in Mongo + a write-through cache

     MONGO
     "api_key": {
       "ApiRequest": {
         "platform": ["ios", "android"],
         "method": ["get", "post", "put"]
       }
     }

     CASSANDRA
     "api_key:ApiRequest:ios:get:201604": { 0: 150, 1: 29, ... }

     WHICH WORKED GREAT! UNTIL…
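The Mongo document exists so the read path can reconstruct which Cassandra rows to fetch: take the cross product of every known dimension value. A sketch under that assumption (`read_row_keys` is a hypothetical helper, not Parse's code):

```python
from itertools import product

# The Mongo document from the slide, tracking every dimension value seen:
schema = {
    "ApiRequest": {
        "platform": ["ios", "android"],
        "method": ["get", "post", "put"],
    }
}

def read_row_keys(api_key, event, month):
    """Enumerate the cross product of known dimension values to get the
    Cassandra row keys holding each fully-dimensioned time series."""
    return [":".join([api_key, event, *combo, month])
            for combo in product(*schema[event].values())]

keys = read_row_keys("api_key", "ApiRequest", "201604")
```

For 2 platforms x 3 methods this yields 6 row keys; each newly seen value multiplies the set, which is why keeping these documents small and accurate mattered.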
  7. OLD & BUSTED
     # Mongo:
     "api_key": {
       "ApiRequest": {
         "platform": ["ios", "android"],
         "method": ["get", "post", "put"]
       }
     }

     NEW HOTNESS
     # Mongo:
     "api_key": [{
       "event": "ApiRequest",
       "dimensions": { "platform": false, "method": false }
     }, …]

     # Cassandra:
     "api_key:ApiRequest:platform"      "api_key:ApiRequest:method"
     | "ios" | "android" |              | "get" | "post" | "put" |
     |   7   |     3     |              |   5   |    2   |   3   |
  8. IN RETROSPECT: PROBLEMS, A RECAP
     ▸ Row keys were constructed with a granularity of 1 month
     ▸ Random partitioning + reliance on write-time fanout to spread load
     ▸ Still ran into hotspots
     ▸ Nodes were CPU-bound (compaction!)
     ▸ Ran Cassandra 1.1.8 (no vnodes yet)
     ▸ A solution that should theoretically have scaled linearly in fact required exponential growth
     ▸ All things considered, Cassandra itself was fine; our requirements were not.
  9. IN RETROSPECT: LESSONS LEARNED
     ▸ Pair concrete limits with best-practices suggestions
     ▸ Don't trust the user to "do what's right," especially when assumptions abound
     ▸ Extreme "flexibility" benefits no one
     ▸ "Random" distribution of data isn't good enough
     ▸ Write-time aggregation is not the future
     THANKS! @CYEN