
MoPub's migration from App Engine to MongoDB - Nafis Jamal, MoPub

mongodb
January 03, 2012

MongoSV 2011

This talk introduces MoPub, our data needs, and the challenges we faced as we grew from 50M requests/day to 1 billion requests/day. We now use MongoDB for realtime stats (lots of counters!), our budgeting system, and our user store.

Transcript

  1. Migrating from Google App Engine to AWS + MongoDB
     Nafis Jamal (Co-Founder, VP Eng), MongoSV 2011
  2. A little about ...
     • Mobile monetization platform
       ‣ iPhone and Android application developers use our service as a one-stop shop to manage their entire ads strategy
     • We've built for scale since day one, about a year ago (and made lots of mistakes along the way :) )
       ‣ Currently ~3K QPS
       ‣ Planning for 30K+ QPS (>1 billion ads per day)
     • 10-person engineering team
  3. Challenges we face
     • Web serving scale (horizontal scaling)
     • Low latency, high availability
     • Realtime statistics
       ‣ We can't just serve ads; we have to record many events along the way (lots of counters!)
       ‣ Many writes, few reads
     • Mobile user database
  4. Challenges we face
     • Web serving scale (horizontal scaling)
     • Low latency, high availability
     • Realtime statistics
       ‣ We can't just serve ads; we have to record many events along the way (lots of counters!)
       ‣ Many writes, few reads
     • Mobile user database
  5. When it all began...
     • 3 founders hoping to bring Angry Birds onto the platform
     • We needed something that would scale as we quickly brought on new customers (stair-step growth)
     • Zero ability to spend time on ops or sysadmin
     • We chose Google App Engine
       ‣ Python! (Guido works on this team!!)
       ‣ Anonymous "instances" (really single-threaded processes) with very little memory
       ‣ Autoscaling (pay per CPU-hour)
       ‣ Uber-scalable, managed Datastore (built on BigTable)
  6. What stats do we collect?
     [Diagram: Account at the top; publisher side: App > AdUnit; advertiser side: Campaign > Creative; "Ad here" marks where an AdUnit meets a Creative]
     The lowest-level interaction is adunit-creative.
  7. What stats do we collect?
     [Same diagram: Account, with Publisher side (App > AdUnit) and Advertiser side (Campaign > Creative); "Ad here"]
     We need easy access to all "cross sections".
  8. GAE Stats Model

     class StatsModel(db.Model):
         pub = db.ReferenceProperty()   # App, AdUnit or *
         adv = db.ReferenceProperty()   # Campaign, Creative or *
         acct = db.ReferenceProperty()  # Account
         date = db.DateProperty()
         count = db.IntegerProperty()

         @classmethod
         def get_primary_key(cls, date, pub_id, adv_id, acct_id):
             """Returns the `key_name` of the model based on pub_id, adv_id, acct_id and date."""
             return "%s:%s:%s:%s" % (date, acct_id, pub_id, adv_id)
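     For contrast with the buffering approach on the following slides, a hypothetical direct datastore write against this model would look roughly like this (helper name invented; reference properties omitted for brevity):

     from google.appengine.ext import db

     def increment_stat(date, pub_id, adv_id, acct_id, delta=1):
         key_name = StatsModel.get_primary_key(date, pub_id, adv_id, acct_id)

         def txn():
             row = StatsModel.get_by_key_name(key_name)
             if row is None:
                 row = StatsModel(key_name=key_name, date=date, count=0)
             row.count += delta
             row.put()

         # Read-modify-write in a transaction: correct, but too slow and too
         # expensive at ad-serving volume, hence the memcache buffering below.
         db.run_in_transaction(txn)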
  9. Realtime Stats in GAE
     • Google App Engine gave us a good number of building blocks to build such a system
       ‣ Datastore
         - Expensive writes
         - Slow writes (~1 write / second / "entity group")
         - Fast reads (~3ms)
       ‣ Memcache
       ‣ TaskQueues
     • Cannot sum over objects; no native map/reduce
       ‣ All higher-level "rollups" must be pre-computed
     • Solution: buffer into memcache, then periodically flush to the datastore
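     A rough illustration of the buffer-then-flush idea (not MoPub's code; the flush handler URL and the 60-second delay are assumptions):

     from google.appengine.api import memcache, taskqueue

     def count_event(acct_id, time_bucket, stat_key):
         # Buffer the increment in memcache instead of writing to the datastore.
         memcache.incr("%s:%s:%s" % (acct_id, time_bucket, stat_key),
                       initial_value=0)

     def schedule_flush(acct_id, time_bucket):
         # Once per time bucket: a worker later sums the buffered counters and
         # writes the pre-computed rollups into StatsModel rows.
         taskqueue.add(url="/tasks/flush", countdown=60,
                       params={"acct": acct_id, "tb": time_bucket})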
  10. Realtime Stats in GAE
      [Architecture diagram: API instances write raw events into memcache; a taskqueue dispatches (account, time_bucket) work to TQ workers #1..#n, which read the raw events, perform aggregation and rollups, and write to the Datastore.]
  11. Example
      [Diagram: Account: Rovio (id=account1). Apps: Angry Birds (id=app1), with AdUnits Welcome (id=adunit1) and GameOver (id=adunit2). Campaigns: Coke (id=camp1) with Creative CokeZero (id=crtv1), and RedBull (id=camp2) with Creative Wings (id=crtv2).]
  12. Example
      [Same diagram, with two served ads marked "Ad here" and "Ad there" at two adunit-creative intersections.]
  13. Pseudo-Files in Memcache
      • We create pseudo-files in memcache using a specific key-naming template
      • Each account-time_bucket (aka time window) has a dedicated pseudo-file
      • We append one "line" at a time within each pseudo-file
      • We flush the entire pseudo-file when syncing
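      A minimal sketch of the append step, assuming the key template shown in the examples that follow (the helper name append_line is hypothetical):

      from google.appengine.api import memcache

      def append_line(acct_id, time_bucket, line):
          counter_key = "%s:%s" % (acct_id, time_bucket)      # e.g. "acct1:tb0"
          # Atomically bump the line counter; initial_value creates it on first use.
          n = memcache.incr(counter_key, initial_value=0)
          # Store the event itself as line n of the pseudo-file, e.g. "acct1:tb0:3".
          memcache.set("%s:%s" % (counter_key, n), line)
          return n

      # e.g. append_line("acct1", "tb0", ("adunit1", "crtv2"))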
  14. Example: time = 1
      [Memcache: counter acct1:tb0 = 1; line acct1:tb0:1 = (adunit1, crtv1). Key format: <account>:<time_bucket>]
  15. Example: time = 2
      [Memcache: counter acct1:tb0 = 2; new line acct1:tb0:2 = (adunit1, crtv2)]
  16. Example: time = 3
      [Memcache: counter acct1:tb0 = 3; new line acct1:tb0:3 = (adunit1, crtv1)]
  17. Example: time = 4
      [Memcache: counter acct1:tb0 = 4; new line acct1:tb0:4 = (adunit2, crtv1)]
  18. Example: time = 5
      [Memcache: counter acct1:tb0 = 5; new line acct1:tb0:5 = (adunit2, crtv2)]
  19. Example: time = n
      [Memcache: counter acct1:tb0 = n; lines acct1:tb0:1 .. acct1:tb0:n, one per recorded event]
  20. Flush to DB
      [Diagram: a taskqueue task for (acct, tb0) reads the "page" of n lines (acct1:tb0:1 .. acct1:tb0:n) from memcache, aggregates and rolls them up, then writes the results to the Datastore in a transaction:
        StatsModel(account=acct, pub=adunit1, adv=crtv1, count=5)
        StatsModel(account=acct, pub=adunit2, adv=crtv1, count=4)
        StatsModel(account=acct, pub=adunit1, adv=crtv2, count=2)
        StatsModel(account=acct, pub=adunit2, adv=crtv2, count=5)
        StatsModel(account=acct, pub=app1, adv=crtv1, count=9)
        StatsModel(account=acct, pub=app1, adv=crtv2, count=7)
        StatsModel(account=acct, pub=*, adv=*, count=16)
        ... ]
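      A rough sketch of the flush step (not the deck's code); key names follow the pseudo-file examples above, and the app-level rollups, which need the adunit-to-app mapping, are elided:

      from collections import Counter
      from google.appengine.api import memcache

      def flush_pseudo_file(acct_id, time_bucket):
          prefix = "%s:%s" % (acct_id, time_bucket)
          n = int(memcache.get(prefix) or 0)                  # number of lines
          keys = ["%s:%d" % (prefix, i) for i in range(1, n + 1)]
          lines = memcache.get_multi(keys).values()           # [(adunit, creative), ...]

          counts = Counter()
          for adunit, creative in lines:
              counts[(adunit, creative)] += 1                 # lowest level
              counts[(adunit, "*")] += 1                      # adunit across all creatives
              counts[("*", creative)] += 1                    # creative across all adunits
              counts[("*", "*")] += 1                         # account-wide total
          return counts  # then written as StatsModel rows in a datastore transaction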
  21. GAE Shortcomings
      • Transient downtime for writes
      • Read before write (not fire and forget)
      • Memcache is fleeting => data loss
      • Objects have become too large for transactions
      • Writing to the datastore, even if buffered, is expensive!
        ‣ Pay for datastore CPU and application CPU
        ‣ The realtime stats system accounts for 60% of our hefty GAE bill
      • The duct tape is starting to become threadbare
  22. AWS + Mongo
      • We have control of each service and the machines on which it runs
      • Access to RAM! (simple pleasures)
      • Fire-and-forget increments!
      • Our entire machinery for recording multiple stats is a series of increment instructions sent to MongoDB (which handles all the buffering and flushing)
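      A minimal sketch of a fire-and-forget increment in modern pymongo (not MoPub's code; collection and field names are illustrative, and w=0 requests unacknowledged writes, i.e. the old fire-and-forget behaviour):

      from pymongo import MongoClient, WriteConcern

      client = MongoClient("mongodb://localhost:27017")
      stats = client.mopub.get_collection(
          "realtime_stats", write_concern=WriteConcern(w=0))

      def record_event(doc_id, counter_key):
          stats.update_one(
              {"_id": doc_id},
              {"$inc": {"counts.%s" % counter_key: 1}},  # increment one counter
              upsert=True,                               # create the doc if missing
          )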
  23. First Attempt: Data Model
      One document per pub-adv per month. Each document stores data for all days and hours within the month:

      class StatsDocument(mdb.Document):
          _id = mdb.StringField(primary_key=True)
          pub_id = mdb.StringField()
          adv_id = mdb.StringField()
          acct_id = mdb.StringField()
          date = mdb.YearMonthField(require=True)
          day_counts = mdb.MapField(field=int)  # key is day of month

          @classmethod
          def get_primary_key(cls, date, pub_id, adv_id, acct_id):
              """Returns the `key_name` of the model based on pub_id, adv_id, acct_id and date."""
              return "%s:%s:%s:%s" % (date, acct_id, pub_id, adv_id)
  24. Example
      [Same account diagram as before: Rovio (account1) with App Angry Birds (app1), AdUnits Welcome (adunit1) and GameOver (adunit2), Campaigns Coke (camp1) and RedBull (camp2), Creatives CokeZero (crtv1) and Wings (crtv2); ads marked "Ad here" and "Ad there".]
  25. First Attempt: Updates
      One logical increment, update((adunit1, crtv2), $inc(count)), gets translated into 9 db increments:
        Document:(adunit1, crtv2)   Document:(app1, crtv2)   Document:(*, crtv2)
        Document:(adunit1, camp2)   Document:(app1, camp2)   Document:(*, camp2)
        Document:(adunit1, *)       Document:(app1, *)       Document:(*, *)
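      A rough sketch of that 3x3 fan-out (not MoPub's code); `stats` is a pymongo collection, and the app and campaign are assumed to be looked up from the adunit and creative beforehand:

      def increment_first_model(stats, date, acct, adunit, app, crtv, camp, day):
          for pub in (adunit, app, "*"):
              for adv in (crtv, camp, "*"):
                  doc_id = "%s:%s:%s:%s" % (date, acct, pub, adv)
                  stats.update_one(
                      {"_id": doc_id},
                      {"$inc": {"day_counts.%d" % day: 1}},
                      upsert=True,
                  )  # 9 scattered writes for a single ad event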
  26. Initial Results
      • Pros:
        ‣ Scales with the # of business objects; don't have to think about accounts getting too large
        ‣ Queries for higher-level cross products (e.g. App-Campaign) are precomputed and are a simple db fetch
        ‣ Queries for stats over a given date range read only as many documents as there are months in the range
      • Cons:
        ‣ Lots of little documents
        ‣ Each increment is really 9 increments to 9 documents
        ‣ Random write access, resulting in lots of cache misses and intense load on the cluster
  27. Solution: New Data Model
      One document per account per day. Each document stores all the lowest-level stats:

      class StatsDocument(mdb.Document):
          _id = mdb.StringField(primary_key=True)
          acct_id = mdb.StringField()
          date = mdb.DateField(require=True)
          counts = mdb.MapField(field=int)  # key is "<adunit>:<creative>"

          @classmethod
          def get_primary_key(cls, date, acct_id):
              """Returns the `_id` of the model based on acct_id and date."""
              return "%s:%s" % (date, acct_id)
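      For concreteness, a stored document might look roughly like this (a sketch: the date is illustrative, and the counts are borrowed from the update example two slides on):

      {
          "_id": "2011-12-04:account1",      # "<date>:<acct_id>"
          "acct_id": "account1",
          "date": "2011-12-04",
          "counts": {
              "adunit1:crtv1": 11,
              "adunit1:crtv2": 10,
              "adunit2:crtv2": 1,
          },
      }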
  28. Solution: New Data Model
      We created an IndexDocument to store the reverse indices used for the higher-level cross sections:

      class StatsIndex(mdb.Document):
          _id = mdb.StringField(primary_key=True)
          # e.g. { 'app1:camp1': ['adunit1:crtv1', 'adunit2:crtv1'], ...,
          #        'app1:camp2': ['adunit1:crtv2', 'adunit2:crtv2'], ... }
          indx = mdb.MapField(field=mdb.ListField())

          @classmethod
          def get_primary_key(cls, acct_id):
              """Returns the `_id` of the model based on acct_id."""
              return acct_id
  29. Solution: Updates
      One logical update, update((acct1, adunit1, crtv2), $inc(count)), touches two documents:

      StatsDocument:(account, date)
        (adunit1, crtv1) => 11
        (adunit1, crtv2) => 10
        (adunit2, crtv2) => 1
        ...

      StatsIndexDocument:(account)  (update index)
        "app1:camp2" => $push(adunit1:crtv2)
        "*:camp2"    => $push(adunit1:crtv2)
        "app1:*"     => $push(adunit1:crtv2)
        "*:*"        => $push(adunit1:crtv2)
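      Roughly, in pymongo terms (a sketch, not the deck's code; $addToSet is used here so repeated updates don't duplicate index entries, where the slide shows $push):

      def increment_new_model(stats, index, date, acct, adunit, app, crtv, camp):
          cross = "%s:%s" % (adunit, crtv)                 # lowest-level key
          stats.update_one(
              {"_id": "%s:%s" % (date, acct)},             # one doc per account per day
              {"$inc": {"counts.%s" % cross: 1}},
              upsert=True,
          )
          index.update_one(
              {"_id": acct},
              {"$addToSet": {
                  "indx.%s:%s" % (app, camp): cross,
                  "indx.*:%s" % camp: cross,
                  "indx.%s:*" % app: cross,
                  "indx.*:*": cross,
              }},
              upsert=True,
          )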
  30. Solution: Reads

      def get_stats(acct_id, app_id, campaign_id, date):
          # get and read the index model
          indx_key = "%s:%s" % (app_id, campaign_id)
          index_obj = StatsIndex.objects(_id=acct_id).\
              only("indx.%s" % indx_key).first() or StatsIndex()

          # lowest-level keys in the StatsModel
          count_keys = index_obj.indx.get(indx_key, [])

          # fetch only the relevant parts of the stats model
          stats_model_key = StatsModel.get_primary_key(date, acct_id)
          stats_model = StatsModel.objects(_id=stats_model_key).\
              only(*["counts.%s" % k for k in count_keys]).first() or StatsModel()

          return sum(stats_model.counts.values())
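      A hypothetical call against the example account (the date value is illustrative):

      # Total count for Angry Birds x RedBull on one day.
      total = get_stats("account1", "app1", "camp2", "2011-12-04")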
  31. Results
      • Two machines handle all the writes (one could do it, but two for redundancy)
      • Mongo load is now minuscule
      • Each update is really two updates (StatsDocument and StatsIndex)
      • Each get is really two gets (StatsIndex, then StatsDocument)
      • One document per account per day
        ‣ # of documents scales with the number of accounts
        ‣ We do the simple aggregation in Python
        ‣ Size of document scales with the # of business objects within the account
        ‣ We've got a lot of headroom: our largest account has 60K+ possible stat "cross-sections" and is still only a ~0.1MB object (<< 64 MB)
  32. Mongo Lessons Learned
      • It's better to have fewer, larger documents when there is a logical hierarchy than many smaller documents arranged flatly and disparately
      • Log everything, because you will make mistakes!
        ‣ We require audit trails for billing, etc., so we are accustomed to having this fail-safe
        ‣ This has saved us on numerous occasions during development, and we have just cut the cord to spare the rest of the system
        ‣ It allows us to test and build new features pretty aggressively
      • Buy more memory!
        ‣ It's cheap!
        ‣ We've learned a few lessons the hard way, but a lot of challenges can be solved with $$$ by just throwing memory at the problem (we're in the process of doing this now)
  33. Future Work
      • Mobile user store
        ‣ Record every interaction of all users within our network
        ‣ ML: Map/Reduce jobs to glean data about our users and segment them
        ‣ Per-user targeting with minimum latency
        ‣ Relevancy => better user experience => app developers make more money