Using MongoDB for Data Processing

mongodb
April 06, 2012

MongoDB Shanghai 2012

Transcript

  2. Aggregation Framework: motivation

    Map/Reduce is a big hammer
    • Used to perform complex analytics tasks on massive amounts of data
    • Users are currently using it for aggregation…
      • totaling, averaging, etc.
    But...
  3. Aggregation Framework: motivation

    • It should be easier to do simple aggregations
    • Want to avoid overhead of JavaScript engine
  4. New Aggregation Framework

    • Out now in 2.1! (unstable)
    • Declarative
    • No JavaScript required
    • C++ implementation
    • Higher performance than JavaScript
    • Expression evaluation
    • Return computed values
    • Framework: we can add new operations easily
  5. Pipeline

    • Series of operations
    • Members of a collection are passed through a pipeline to produce a result
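
The pipeline idea can be illustrated in plain JavaScript. This is only an in-memory sketch under stated assumptions, not how the server implements it: `runPipeline`, `keepPopular`, `titlesOnly` and the sample documents are hypothetical names invented for illustration.

```javascript
// In-memory sketch of the pipeline idea; not the server's C++ implementation.
// Each stage is a function that takes an array of documents and returns a new array.
function runPipeline(docs, stages) {
  // Feed the output of each stage into the next one, in order.
  return stages.reduce((current, stage) => stage(current), docs);
}

// Hypothetical stages, for illustration only.
const keepPopular = (docs) => docs.filter((d) => d.pageViews >= 5);
const titlesOnly = (docs) => docs.map((d) => ({ title: d.title }));

const samplePosts = [
  { title: "first", pageViews: 5 },
  { title: "second", pageViews: 2 },
];

const pipelineResult = runPipeline(samplePosts, [keepPopular, titlesOnly]);
console.log(pipelineResult); // [ { title: 'first' } ]
```

Because every stage has the same shape (documents in, documents out), stages can be reordered or added without changing the runner.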
  6. Pipeline Operations

    • $match
      • Uses a query predicate (like .find({…})) as a filter
    • $project
      • Uses a sample document to determine the shape of the result (similar to .find()’s optional argument)
      • This can include computed values
    • $unwind
      • Hands out array elements one at a time
    • $group
      • Aggregates items into buckets defined by a key
  7. Pipeline Operations (continued)

    • $sort
      • Sort documents
    • $limit
      • Only allow the specified number of documents to pass
    • $skip
      • Skip over the specified number of documents
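
As a rough illustration of how $sort, $skip and $limit combine, the following plain-JavaScript stand-ins page through a small array. The helper names (`sortBy`, `skip`, `limit`) and the data are assumptions made for this sketch; they are not MongoDB APIs.

```javascript
// Plain-JavaScript stand-ins for $sort, $skip and $limit (a sketch only).
const sortBy = (field, direction) => (docs) =>
  [...docs].sort((a, b) => direction * (a[field] - b[field])); // numeric fields only

const skip = (n) => (docs) => docs.slice(n);
const limit = (n) => (docs) => docs.slice(0, n);

// Hypothetical data: number of check-ins per location.
const counts = [
  { _id: "A", numEntries: 3 },
  { _id: "B", numEntries: 17 },
  { _id: "C", numEntries: 9 },
  { _id: "D", numEntries: 1 },
];

// Second "page" of size 2, ordered by numEntries descending.
const page2 = limit(2)(skip(2)(sortBy("numEntries", -1)(counts)));
console.log(page2.map((d) => d._id)); // [ 'A', 'D' ]
```

The order matters: skipping or limiting before sorting would select documents from the unsorted input instead.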
  9. Example: finding popular checkins

    checkin_1 = {
      location: location_id,
      user: user_id,
      ts: "20/03/2010"
    }

    // Find most popular locations in last 3 hours
    > agg = db.checkins.aggregate(
        {$match: {ts: {$gt: now_minus_3_hrs}}},
        {$group: {_id: "$location", numEntries: {$sum: 1}}},
        {$sort: {numEntries: -1}}
      )
    > agg.result
    [{"_id": "Din Tai Fung", "numEntries": 17}, ....]
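
The same match, group and sort steps can be traced by hand on a few documents. This is a minimal in-memory sketch, not the server's behavior: the check-in data is made up, and `ts` is assumed to hold `Date` values, since a string such as "20/03/2010" would compare lexicographically rather than chronologically.

```javascript
// In-memory sketch of $match -> $group ($sum: 1) -> $sort; data is made up.
const now = new Date("2012-04-06T12:00:00Z"); // fixed "now" for repeatability
const nowMinus3Hrs = new Date(now.getTime() - 3 * 60 * 60 * 1000);

const checkins = [
  { location: "Din Tai Fung", user: 1, ts: new Date("2012-04-06T11:30:00Z") },
  { location: "Din Tai Fung", user: 2, ts: new Date("2012-04-06T10:15:00Z") },
  { location: "The Bund", user: 3, ts: new Date("2012-04-06T09:30:00Z") },
  { location: "The Bund", user: 1, ts: new Date("2012-04-06T06:00:00Z") }, // too old
];

// $match: keep check-ins newer than the cutoff.
const recent = checkins.filter((c) => c.ts > nowMinus3Hrs);

// $group with $sum: 1: count documents per location.
const tally = new Map();
for (const c of recent) {
  tally.set(c.location, (tally.get(c.location) || 0) + 1);
}

// $sort: most popular first.
const popular = [...tally]
  .map(([_id, numEntries]) => ({ _id, numEntries }))
  .sort((a, b) => b.numEntries - a.numEntries);

console.log(popular);
```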
  10. More details: $project

    • Project can reshape a document
      • add, remove, rename, move
    • Similar to .find()’s field selection syntax
      • But much more powerful
    • Can generate computed values
  13. Example 2: $project

    {
      title : "this is my title",
      author : "bob",
      posted : new Date(1079895594000),
      pageViews : 5,
      tags : [ "fun", "good", "fun" ],
      comments : [
        { author : "steve", ts : "4/1/12", text : "I like your post" }
      ]
    }

    > db.posts.aggregate(
        { $project : { title : 1, author : 1, "comments.author" : 1 } }
      )

    result : [ {
      title : "this is my title",
      author : "bob",
      comments : [ { author : "steve" } ]
    } ]
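
To make the effect of a dotted path such as "comments.author" concrete, here is a small in-memory sketch of inclusion-style projection. `projectDoc` and `copyPath` are hypothetical helpers written for this illustration; they only approximate what $project does and ignore `_id` handling and computed values.

```javascript
// Sketch of inclusion-style projection with dotted paths (not MongoDB's code).
function projectDoc(doc, paths) {
  const out = {};
  for (const path of paths) {
    copyPath(doc, out, path.split("."));
  }
  return out;
}

// Copy one path from src into dst, mapping over arrays of sub-documents.
function copyPath(src, dst, [head, ...rest]) {
  if (src === null || typeof src !== "object" || !(head in src)) return;
  const value = src[head];
  if (rest.length === 0) {
    dst[head] = value;
  } else if (Array.isArray(value)) {
    // Keep one output element per input element, filled in path by path.
    const existing = dst[head] || value.map(() => ({}));
    value.forEach((item, i) => copyPath(item, existing[i], rest));
    dst[head] = existing;
  } else {
    dst[head] = dst[head] || {};
    copyPath(value, dst[head], rest);
  }
}

const post = {
  title: "this is my title",
  author: "bob",
  pageViews: 5,
  tags: ["fun", "good", "fun"],
  comments: [{ author: "steve", ts: "4/1/12", text: "I like your post" }],
};

console.log(projectDoc(post, ["title", "author", "comments.author"]));
```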
  14. More details: $unwind

    • Produces a document for each value in an array, where the array field is replaced by a single array element

    {
      title : "this is my title",
      author : "Kevin",
      tags : [ "fun", "good", "awesome" ]
    }
  15. {
        ...
        tags : "fun"
        ...
      },
      {
        ...
        tags : "good"
        ...
      },
      {
        ...
        tags : "awesome"
        ...
      }
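
The expansion shown above can be reproduced with a few lines of plain JavaScript. This is a sketch of the idea only: `unwind` is a hypothetical helper, and in this version documents that lack the field simply produce no output.

```javascript
// Sketch of $unwind on one field: emit one copy of the document per array element.
function unwind(docs, field) {
  return docs.flatMap((doc) =>
    (doc[field] || []).map((element) => ({ ...doc, [field]: element }))
  );
}

const articles = [
  { title: "this is my title", author: "Kevin", tags: ["fun", "good", "awesome"] },
];

const unwound = unwind(articles, "tags");
console.log(unwound.map((d) => d.tags)); // [ 'fun', 'good', 'awesome' ]
```

Every output document keeps the other fields (`title`, `author`) unchanged, which is what makes a later grouping by tag possible.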
  16. $unwind

    db.article.aggregate(
      { $project : { author : 1, tags : 1 } },
      { $unwind : "$tags" }
    );
  17. "result" : [
        {
          "_id" : ObjectId("4e6e4ef557b77501a49233f6"),
          "author" : "Kevin",
          "tags" : "fun"
        },
        {
          "_id" : ObjectId("4e6e4ef557b77501a49233f6"),
          "author" : "Kevin",
          "tags" : "good"
        },
        {
          "_id" : ObjectId("4e6e4ef557b77501a49233f6"),
          "author" : "Kevin",
          "tags" : "fun"
        }
      ]
  18. db.article.aggregate(
        { $project : { author : 1, tags : 1 } },
        { $unwind : "$tags" },
        { $group : {
            _id : "$tags",
            authors : { $addToSet : "$author" }
        } }
      );
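
The grouping step in the pipeline above can be approximated in memory as follows. This is a sketch under assumptions, not the server's algorithm: `groupAddToSet` is a hypothetical helper, the input documents are invented, and the bucket key is emitted as a plain value.

```javascript
// Sketch of $group with $addToSet: one bucket per key, each holding a set of values.
function groupAddToSet(docs, keyField, valueField, outField) {
  const buckets = new Map();
  for (const doc of docs) {
    if (!buckets.has(doc[keyField])) buckets.set(doc[keyField], new Set());
    buckets.get(doc[keyField]).add(doc[valueField]); // a Set ignores duplicates
  }
  return [...buckets].map(([key, values]) => ({ _id: key, [outField]: [...values] }));
}

// Hypothetical unwound documents (one tag per document).
const tagged = [
  { author: "bob", tags: "fun" },
  { author: "dave", tags: "fun" },
  { author: "bob", tags: "fun" },
  { author: "bob", tags: "good" },
];

const byTag = groupAddToSet(tagged, "tags", "author", "authors");
console.log(byTag);
```

Note that "bob" appears twice under "fun" in the input but only once in the output, which is the difference between a set-style accumulator and a plain list.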
  19. "result" : [
        {
          "_id" : { "tags" : "cool" },
          "authors" : [ "jane", "dave" ]
        },
        {
          "_id" : { "tags" : "fun" },
          "authors" : [ "dave", "bob" ]
        },
        {
          "_id" : { "tags" : "good" },
          "authors" : [ "bob" ]
        },
        {
          "_id" : { "tags" : "awful" },
          "authors" : [ "jane" ]
        }
      ]
  20. Usage Tips

    • Use $match in a pipeline as early as possible
      • The query optimizer can then be used to choose an index and avoid scanning the entire collection
    • Use $sort in a pipeline as early as possible
      • The query optimizer can sometimes be used to choose an index to scan instead of sorting the result
  21. Driver Support

    • Initial version is a command
      • For any language, build a JSON database object, and execute the command
      • { aggregate : <collection>, pipeline : […] }
    • Beware of command result size limit
      • Document size limit is 16MB
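
A minimal sketch of building that command document is shown below. `buildAggregateCommand` is a hypothetical helper, not part of any driver, and the collection name and stages are taken from the earlier slides purely for illustration. In the mongo shell, a document of this shape can be passed to `db.runCommand(...)`; other drivers expose their own generic "run command" call.

```javascript
// Sketch: build the command document described on the slide.
// buildAggregateCommand is a hypothetical helper, not part of any driver.
function buildAggregateCommand(collectionName, stages) {
  return { aggregate: collectionName, pipeline: stages };
}

const command = buildAggregateCommand("article", [
  { $project: { author: 1, tags: 1 } },
  { $unwind: "$tags" },
]);

// A driver would send this document through its generic "run command" call.
console.log(JSON.stringify(command));
```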
  22. Sharding support

    • Initial release does support sharding
    • Mongos analyzes pipeline, and forwards operations up to $group or $sort to shards; combines shard server results and returns them
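
One way to picture the split is a function that cuts the pipeline at the first $group or $sort. This is only a sketch of the idea, not mongos's actual logic: `splitPipeline` is a hypothetical name, and the slide does not say whether the boundary stage itself also runs on the shards, so this version assumes it stays on the merging side.

```javascript
// Sketch of the split described above: stages before the first $group or $sort
// go to the shards; the rest runs on the merging side. Not mongos's actual code.
function splitPipeline(stages) {
  const isBoundary = (stage) => "$group" in stage || "$sort" in stage;
  const index = stages.findIndex(isBoundary);
  if (index === -1) return { shardPart: stages, mergePart: [] };
  return { shardPart: stages.slice(0, index), mergePart: stages.slice(index) };
}

const stagesExample = [
  { $project: { author: 1, tags: 1 } },
  { $unwind: "$tags" },
  { $group: { _id: "$tags", authors: { $addToSet: "$author" } } },
  { $sort: { _id: 1 } },
];

const split = splitPipeline(stagesExample);
console.log(split.shardPart.length, split.mergePart.length); // 2 2
```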
  23. Pipeline Operations – Future Plans

    • $out
      • Saves the document stream to a collection
      • Similar to M/R $out, but with sharded output
      • Functions like a tee, so that intermediate results can be saved