MongoDB Aggregation Framework - Millennial Media

Aggregating with MongoDB
February 19th 2014
Bertrand Dolimier

We are hiring:
￭ 13 current Tech openings
￭ + 7 more to open any day…
￭ Baltimore, Boston & DC
￭ Software Engineers, System Engineers, Optimization Engineers, DevOps Engineers and Data Scientists
￭ Drew McCarl: [email protected]

Millennial Media
￭ Millennial Media - What we do?
￭ Campaign Metrics reporting, which problem to solve!
￭ How we selected MongoDB
  ￭ Why noSql?
  ￭ Which one?
￭ How we are using it
  ￭ Aggregation strategies and Data Modeling
  ￭ Flexibility, Performance, Volume
  ￭ 3 Aggregation tactical choices
    ￭ Map Reduce, Aggregation Framework and Group command
  ￭ TTL
  ￭ One more level of extraction for configuration
￭ Some code

What we do?
￭ Serve Advertising on Mobile Apps
￭ Volume: about 20,000 per second
￭ For each impression we log: When, What, Where, from Whom, to Whom, How and Why?
￭ Example: @12:25 displayed a 280x46 banner for an upcoming Universal Studio blockbuster on an iPhone 5s through a Music App with Millennial Media SDK x on AT&T in Kansas City to a Movie Buff... who clicked on the Ad…

What can we do with it?
￭ Scheduled Need
  ￭ Billing / Payout
  ￭ Limited & Stable Dataset
  ￭ Small output
  ￭ Current Data
￭ Data Exploration
  ￭ Data science
  ￭ Changing Dimensions, Dataset, Trends
  ￭ Wide Dataset
  ￭ Evolving data
  ￭ High volume
  ￭ Historical Data
￭ Monitoring Campaign Activities
  ￭ Reporting
  ￭ Visualization & Dashboard
  ￭ Limited Dataset
  ￭ Changing Dataset
  ￭ Fluctuates with Business
  ￭ Timely & Trend Data

Requirements
￭ Support the fastest Monitoring UI possible
￭ Data as real time as possible
￭ Flexible / Adaptable as business / trends change
￭ Be able to aggregate from multiple data sources
￭ Simple!!

Aggregation ??
￭ Raw = 60 Million / hour
  Where | When | App | Device | Ad | … | …
  Rome | 10:25 | Game A | iPhone9 | Big Blockbuster | AT&T | V 2
￭ Group on all dimensions ≈ 10 Million / Hour
  Where | When | App | Device | Ad | … | … | View Count | Click Count
  Rome | 10 | Game A | iPhone9 | Big Blockbuster | AT&T | V 2 | 6 | 1
￭ Many groupings on 1 dimension (e.g. country ≈ 300,000 / Hour)
  Ad | Where | When | View Count | Click Count
  Big Blockbuster | Rome | 10 to 11 | 9,999 | 1,111
  Ad | App | When | View Count | Click Count
  Big Blockbuster | Game A | 10 to 11 | 9,999,999 | 1,111,111
  Ad | Device | When | View Count | Click Count
  Big Blockbuster | iPhone 9 | 10 to 11 | 9,999,999 | 111
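
The grouping described on this slide can be sketched in plain JavaScript (runnable in Node.js, no MongoDB needed). This is only an illustration: the field names `where`, `when`, `app`, `device`, `ad`, `clicked` and the helper `groupByDimensions` are assumptions, not the production schema.

```javascript
// Hypothetical raw impression records; field names are illustrative only.
const rawImpressions = [
  { where: "Rome", when: "10:25", app: "Game A", device: "iPhone9", ad: "Big Blockbuster", clicked: false },
  { where: "Rome", when: "10:40", app: "Game A", device: "iPhone9", ad: "Big Blockbuster", clicked: true },
  { where: "Rome", when: "10:55", app: "Game B", device: "iPhone9", ad: "Big Blockbuster", clicked: false },
];

// Group records on the given dimensions, truncating the time to the hour,
// and accumulate a view count and a click count per group.
function groupByDimensions(records, dimensions) {
  const groups = new Map();
  for (const rec of records) {
    const keyDoc = { hour: parseInt(rec.when.split(":")[0], 10) };
    for (const dim of dimensions) keyDoc[dim] = rec[dim];
    const key = JSON.stringify(keyDoc);
    if (!groups.has(key)) groups.set(key, { _id: keyDoc, views: 0, clicks: 0 });
    const group = groups.get(key);
    group.views += 1;
    if (rec.clicked) group.clicks += 1;
  }
  return Array.from(groups.values());
}

// Grouping on all dimensions keeps more rows than grouping on one dimension.
const byAllDimensions = groupByDimensions(rawImpressions, ["where", "app", "device", "ad"]);
const byCountry = groupByDimensions(rawImpressions, ["where"]);
console.log(byAllDimensions.length, byCountry.length);
```

Grouping on fewer dimensions collapses more records into each output document, which is why the single-dimension roll-ups are so much smaller than the all-dimension one.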

Aggregation? 3 approaches
1. Key by all Dimensions
2. Roll up by preset dimensions
3. Cascade Aggregation
￭ Only active data
￭ Constant aggregation and drop
￭ Dimension on different time frame
￭ Adaptive TTL

The Cascade approach
[Diagram: the source of records feeds a cascade of aggregate collections. Labels shown: Hourly Campaign, Daily Campaign, Monthly, Year; Hourly Creative, Daily Creative, Month; Daily Crea. Type, Month, Year; Daily Campaign / Carrier, Camp. Group, Month, Year.]
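
A minimal sketch of the cascade idea in plain JavaScript: each level is computed from the level just below it rather than from the raw records. The document shape (all dimensions held in `_id`, with `views` and `clicks` counters) and the helper `rollUp` are assumptions made for illustration.

```javascript
// Hypothetical hourly aggregates, with every dimension kept in _id.
const hourlyCampaign = [
  { _id: { campaign: "ABC", year: 2013, month: 12, day: 1, hour: 10 }, views: 100, clicks: 4 },
  { _id: { campaign: "ABC", year: 2013, month: 12, day: 1, hour: 11 }, views: 150, clicks: 6 },
  { _id: { campaign: "ABC", year: 2013, month: 12, day: 2, hour: 9 }, views: 50, clicks: 1 },
  { _id: { campaign: "XYZ", year: 2013, month: 12, day: 1, hour: 10 }, views: 70, clicks: 2 },
];

// Roll one level up to a coarser one: keep only `keepFields` in the key
// and sum the counters of all documents that share the reduced key.
function rollUp(docs, keepFields) {
  const out = new Map();
  for (const doc of docs) {
    const id = {};
    for (const field of keepFields) id[field] = doc._id[field];
    const key = JSON.stringify(id);
    const acc = out.get(key) || { _id: id, views: 0, clicks: 0 };
    acc.views += doc.views;
    acc.clicks += doc.clicks;
    out.set(key, acc);
  }
  return Array.from(out.values());
}

// Hourly feeds daily, daily feeds monthly.
const dailyCampaign = rollUp(hourlyCampaign, ["campaign", "year", "month", "day"]);
const monthlyCampaign = rollUp(dailyCampaign, ["campaign", "year", "month"]);
console.log(monthlyCampaign);
```

Because each level reads only the already reduced level beneath it, the coarser roll-ups touch far fewer documents than a pass over the raw data would.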

Prioritize features
[Radar chart, scale 0 to 6. Axes: Read Performance, Flexibility, Heterogeneous Source, Write Performance, Complex Data Type, Scalability/Growth, Integrity, CRUD. The Reporting needs are plotted first, then compared in successive builds with Relational, then Key Value and Column, then Document and Graph stores.]

The NoSql options
￭ Key-Value: Terracotta, Redis, Riak, Dynamo, Dynomite, Tokyo, MemCacheDB, Voldemort, Caché, U2, …
￭ Columnar Store: Cassandra, Hbase, Hypertable, …
￭ Graph: Neo4j, AllegroGraph, Virtuoso, FlockDB, InfiniteGraph, …
￭ Document: MongoDB, CouchBase, CouchDB, BaseX, SimpleDB, OrientDB, Jackrabbit, MarkLogic, …

Preliminary Stress Test
￭ Intel(R) Xeon(R) CPU @ 2.50GHz
￭ Memory: 32 GB
￭ CPU MHz: 2500
￭ Cache: 15 MB
￭ 2 sockets / 6 cores per socket / 2 threads per core…
Latency in milliseconds | Mongo | CouchBase
Write 500,000 docs | 0.116 (8,000 inserts / sec) | 0.097
Read 500,000 docs | 0.006 | 0.018

￭ Single purpose
  ￭ count, distinct, group
￭ Map Reduce
  ￭ Multi Phase for large volume
￭ Aggregation pipeline
  ￭ Document through a pipe of

Which Aggregation to choose?
Group()
￭ Pro:
  ￭ Simple
  ￭ Allows custom JavaScript code
  ￭ Returns an array
￭ Con:
  ￭ Read lock
  ￭ Does not work with sharded collections
  ￭ 16 MB result limit
Map Reduce()
￭ Pro:
  ￭ Several output options (inline, new collection, merge, replace, reduce)
  ￭ Incremental aggregation over large collections
￭ Con:
  ￭ Difficult to debug
  ￭ Slow
Aggregate(), aka Aggregation Framework
￭ Pro:
  ￭ Designed for performance and usability
  ￭ Returns result set inline
  ￭ Supports sharded collections
  ￭ Uses a pipeline approach through operators (match, project, sort, group…)
  ￭ Virtual, computed, sub-fields…
￭ Con:
  ￭ Limited to the framework, no custom functions
  ￭ 16 MB result limit

// Anatomy of a shell command:
// db . collection . method ( query / criteria , action / projection )
db.lkp_campaigns.find(
  { _id : 1234 },
  { "campaign_id" : 1, "nb_view" : 1, "nb_click" : 1 }
);
db.lkp_campaigns.insert(
  { _id : 1235, "campaign_id" : 4578, "name" : "A Blockbuster Campaign", "view" : 78451, "click" : 4578 }
);
db.lkp_campaigns.update(
  { "nb_view" : { $gt : 100000 } },
  { $set : { "some_indicator" : "big" }, $inc : { qualifier : 1 } },
  { upsert : false, multi : true }  // options are passed as a single document
);

Aggregate Commands
￭ $match
￭ $group
￭ $sort
￭ $limit
￭ $project : renames, excludes, calculates fields…
￭ $skip : skips documents in the pipeline
￭ $unwind : explodes an internal array of documents into separate documents
￭ $geoNear : returns documents in distance order

$match and $group
// Documents in agg_campaigns_day
{ year : 2013, month : 12, day : 1, campaign : "ABC" }
{ year : 2013, month : 12, day : 2, campaign : "ABC" }
{ year : 2013, month : 12, day : 1, campaign : "XYZ" }
{ year : 2013, month : 11, day : 30, campaign : "XYZ" }
// After $match
{ year : 2013, month : 12, day : 1, campaign : "ABC" }
{ year : 2013, month : 12, day : 2, campaign : "ABC" }
{ year : 2013, month : 12, day : 1, campaign : "XYZ" }
// After $group
{ _id : "ABC", count : 2 }
{ _id : "XYZ", count : 1 }
db.agg_campaigns_day.aggregate(
  { $match : { month : 12 } },
  { $group : { _id : "$campaign", count : { $sum : 1 } } }
);
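
To make the two stages concrete, here is a small in-memory stand-in written in plain JavaScript. It is a sketch, not MongoDB's implementation: `matchStage` and `groupCountStage` are hypothetical helpers that handle only equality matching and counting, and the sample array is modeled on the documents shown on the slide.

```javascript
// Sample documents modeled on the slide.
const aggCampaignsDay = [
  { year: 2013, month: 12, day: 1, campaign: "ABC" },
  { year: 2013, month: 12, day: 2, campaign: "ABC" },
  { year: 2013, month: 12, day: 1, campaign: "XYZ" },
  { year: 2013, month: 11, day: 30, campaign: "XYZ" },
];

// $match stand-in: keep documents whose fields equal the criteria values.
function matchStage(docs, criteria) {
  return docs.filter((doc) =>
    Object.keys(criteria).every((field) => doc[field] === criteria[field])
  );
}

// $group stand-in for { _id : "$<field>", count : { $sum : 1 } }.
function groupCountStage(docs, field) {
  const counts = new Map();
  for (const doc of docs) {
    counts.set(doc[field], (counts.get(doc[field]) || 0) + 1);
  }
  return Array.from(counts, ([id, count]) => ({ _id: id, count }));
}

// Each stage consumes the output of the previous one, as in a pipeline.
const matched = matchStage(aggCampaignsDay, { month: 12 });
const grouped = groupCountStage(matched, "campaign");
console.log(grouped); // [ { _id: 'ABC', count: 2 }, { _id: 'XYZ', count: 1 } ]
```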

￭ lkp_apps
{ _id : "App A", verticals : [ "Games", "Entertainment" ] }
{ _id : "App B", verticals : [ "News", "Entertainment" ] }
{ _id : "App C", verticals : [ "Sports", "Entertainment", "Mens Interest" ] }
// Top 5 verticals by number of apps
db.lkp_apps.aggregate( [
  { $unwind : "$verticals" },
  { $group : { _id : "$verticals", number : { $sum : 1 } } },
  { $sort : { number : -1 } },
  { $limit : 5 }
] )
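
The effect of $unwind followed by $group, $sort and $limit can be traced with a plain JavaScript stand-in. The helpers `unwindStage` and `topValues` are hypothetical names used only for this sketch; they mimic the stages on in-memory arrays and are not MongoDB APIs.

```javascript
const lkpApps = [
  { _id: "App A", verticals: ["Games", "Entertainment"] },
  { _id: "App B", verticals: ["News", "Entertainment"] },
  { _id: "App C", verticals: ["Sports", "Entertainment", "Mens Interest"] },
];

// $unwind stand-in: emit one document per element of the array field.
function unwindStage(docs, field) {
  return docs.flatMap((doc) =>
    doc[field].map((value) => ({ ...doc, [field]: value }))
  );
}

// $group + $sort + $limit stand-in: count per value, largest first, top n.
function topValues(docs, field, n) {
  const counts = new Map();
  for (const doc of docs) {
    counts.set(doc[field], (counts.get(doc[field]) || 0) + 1);
  }
  return Array.from(counts, ([id, number]) => ({ _id: id, number }))
    .sort((a, b) => b.number - a.number)
    .slice(0, n);
}

const unwound = unwindStage(lkpApps, "verticals"); // 7 documents, one per vertical
const top5 = topValues(unwound, "verticals", 5);
console.log(top5[0]); // { _id: 'Entertainment', number: 3 }
```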

theMonth = 12;
theYear = 2013;
monthly_camp = db.agg_campaigns_day.aggregate(
  { $match : { "_id.year" : { $gte : theYear }, "_id.month" : { $gte : theMonth } } },
  { $group : {
      _id : { campaign_id : "$_id.campaign_id", year : "$_id.year", month : "$_id.month" },
      count : { $sum : 1 },
      views : { $sum : "$views" },
      clicks : { $sum : "$clicks" }
  } }
);
// In the MongoDB 2.4 shell, aggregate() returns a document whose result field holds the output array
db.agg_campaigns_month.insert( monthly_camp.result );

Aggregate using config collection
var agg_suite_cur = db.cfg_aggsuite.find().sort( { _id : 1 } );
while ( agg_suite_cur.hasNext() ) {
  var agg_suite_doc = agg_suite_cur.next();
  var from_collection = agg_suite_doc.from_collection;
  var to_collection = agg_suite_doc.to_collection;
  // match, group and rm_match are stored as strings and turned into objects with eval
  var match = eval( "(" + agg_suite_doc.match + ")" );
  var group = eval( "(" + agg_suite_doc.group + ")" );
  var rm_match = eval( "(" + agg_suite_doc.rm_match + ")" );
  var aggput = db[from_collection].aggregate( { $match : match }, { $group : group } );
  // drop the previously aggregated documents, then insert the fresh ones
  db[to_collection].remove( rm_match );
  db[to_collection].insert( aggput.result );
}
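
The loop reads its stage definitions from `cfg_aggsuite`, but the deck does not show what such a document looks like. The sketch below assumes one possible shape; every value in `aggSuiteDoc` is hypothetical, and `parseStage` is an illustrative helper that repeats the `eval` trick from the loop in plain JavaScript.

```javascript
// Hypothetical cfg_aggsuite document: stage definitions are stored as strings.
const aggSuiteDoc = {
  _id: 1,
  from_collection: "agg_campaigns_day",
  to_collection: "agg_campaigns_month",
  match: '{ "_id.year" : { $gte : 2013 } }',
  group: '{ _id : { campaign_id : "$_id.campaign_id", month : "$_id.month" }, views : { $sum : "$views" } }',
  rm_match: '{ "_id.year" : { $gte : 2013 } }',
};

// Turn a stored string into an object the same way the loop does.
// eval() executes arbitrary code, so this only suits trusted config data.
function parseStage(text) {
  return eval("(" + text + ")");
}

// Build the pipeline stages that would be handed to aggregate().
const pipeline = [
  { $match: parseStage(aggSuiteDoc.match) },
  { $group: parseStage(aggSuiteDoc.group) },
];
console.log(JSON.stringify(pipeline));
```

Keeping the stages in a collection means a new roll-up can be added by inserting a config document instead of changing the script.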

TTL : Time To Live
￭ Agg. collection created with TTL:
db[to_collection].ensureIndex( { campaign_end_date : 1 }, { expireAfterSeconds : el_offset } );
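
A TTL index makes a document eligible for removal once the indexed date plus `expireAfterSeconds` has passed. The arithmetic can be sketched in plain JavaScript; `isExpired` is a hypothetical helper and the 30-day offset is an assumed value, not the one used in the deck.

```javascript
// Returns true once `now` is at least `expireAfterSeconds` past the
// document's indexed date field, i.e. when a TTL index would make the
// document eligible for removal.
function isExpired(doc, dateField, expireAfterSeconds, now) {
  const expiresAt = doc[dateField].getTime() + expireAfterSeconds * 1000;
  return now.getTime() >= expiresAt;
}

const campaignDoc = { _id: 1, campaign_end_date: new Date("2014-01-01T00:00:00Z") };
const thirtyDays = 30 * 24 * 60 * 60; // hypothetical offset, in seconds

// 14 days after the campaign end: still kept.
const midJanuary = isExpired(campaignDoc, "campaign_end_date", thirtyDays, new Date("2014-01-15T00:00:00Z"));
// 31 days after the campaign end: eligible for removal.
const earlyFebruary = isExpired(campaignDoc, "campaign_end_date", thirtyDays, new Date("2014-02-01T00:00:00Z"));
console.log(midJanuary, earlyFebruary); // false true
```

Indexing on the campaign end date rather than the insert date keeps aggregates around for as long as the campaign is running, plus the chosen offset.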

Edge cases
￭ Result set greater than 16 MB
￭ Late data
￭ Entity name from other source

Take away
￭ 60 Million raw records aggregate into 275,000 inserts / hour
  ￭ in 33 aggregate collections
  ￭ … in 4 minutes
￭ Speed : Kept all query dimensions in the key ( _id )
￭ Flexibility : Queries, grouping, sorting options in Config_collection
￭ Speed : Used insert instead of update
￭ Scope : Used collection specific TTL to drop old documents
￭ Scope : Separate collection for each dimension / time frame

Thank you
Bertrand Dolimier
DataBase Architect
[email protected]

Map Reduce
db.runCommand(
  {
    mapReduce: <collection>,
    map: <function>,
    reduce: <function>,
    out: <output>,
    query: <document>,
    sort: <document>,
    limit: <number>,
    finalize: <function>,
    scope: <document>,
    jsMode: <boolean>,
    verbose: <boolean>
  }
)
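
How the `map`, `reduce` and `finalize` functions cooperate can be illustrated with a tiny in-memory stand-in in plain JavaScript. This is a sketch under stated assumptions, not MongoDB's implementation: the `mapReduce` helper, the sample documents and the `ctr` field are hypothetical, and `emit` is passed as an argument here, whereas the real shell provides it as a global.

```javascript
// Minimal in-memory stand-in for the map / reduce / finalize flow.
function mapReduce(docs, mapFn, reduceFn, finalizeFn) {
  const emitted = new Map();
  const emit = (key, value) => {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  // map: each document (bound to `this`) may emit any number of pairs
  for (const doc of docs) mapFn.call(doc, emit);
  const results = [];
  for (const [key, values] of emitted) {
    // reduce is only needed when a key received more than one value
    let value = values.length === 1 ? values[0] : reduceFn(key, values);
    if (finalizeFn) value = finalizeFn(key, value);
    results.push({ _id: key, value });
  }
  return results;
}

const campaignDays = [
  { campaign: "ABC", views: 100, clicks: 4 },
  { campaign: "ABC", views: 300, clicks: 6 },
  { campaign: "XYZ", views: 50, clicks: 1 },
];

const mrResult = mapReduce(
  campaignDays,
  function (emit) { emit(this.campaign, { views: this.views, clicks: this.clicks }); },
  (key, values) => values.reduce(
    (acc, v) => ({ views: acc.views + v.views, clicks: acc.clicks + v.clicks }),
    { views: 0, clicks: 0 }
  ),
  // finalize: derive a click-through rate from the reduced totals
  (key, value) => ({ ...value, ctr: value.clicks / value.views })
);
console.log(mrResult);
```

The reduce function returns the same shape it receives, which is what allows it to be applied again to partial results during incremental aggregation.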

Michael Barrett
