Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Webinar: MongoDB's New Aggregation Framework

mongodb
May 10, 2012
1.5k

Webinar: MongoDB's New Aggregation Framework

We're working on a new aggregation framework for MongoDB that will introduce a new aggregation system that will make it a lot easier to do simple tasks like counting, averaging, and finding minima or maxima while grouping by keys in a collection. The new aggregation features are not a replacement for map-reduce, but will make it possible to do a number of things much more easily, without having to resort to the big hammer that is map-reduce. After introducing the syntax and usage patterns for the new aggregation system, we will give some demonstrations of aggregation using the new system.

mongodb

May 10, 2012
Tweet

Transcript

  1. What we want to do SELECT customer_id, SUM(price) FROM orders

    WHERE active=true GROUP BY customer_id Thursday, May 10, 12
  2. What we want to do SELECT customer_id, SUM(price) FROM orders

    WHERE active=true GROUP BY customer_id Computed   Fields Grouping Thursday, May 10, 12
  3. Map Reduce • Used to perform complex analytics tasks on

    massive amounts of data • Users are currently using it for aggregation… • totaling, averaging, etc Map/Reduce is a big hammer Thursday, May 10, 12
  4. • It should be easier to do simple aggregations •

    Shouldn’t need to write JavaScript • Avoid the overhead of JavaScript engine Problem Thursday, May 10, 12
  5. New Aggregation Framework • Declarative • No JavaScript required •

    C++ implementation • Higher performance than JavaScript • Expression evaluation • Return computed values • Framework: we can add new operations easily Thursday, May 10, 12
  6. Pipeline • Series of operations • Members of a collection

    are passed through a pipeline to produce a result Thursday, May 10, 12
  7. The Aggregation Command • Takes two arguments • Aggregate --

    name of collection • Pipeline -- array of pipeline operators db.runCommand(    {          aggregate  :  "article",        pipeline  :  [  {$op1,  $op2,  ...}  ]    } ); Thursday, May 10, 12
  8. Aggregation helper db.article.aggregate( {  $pipeline_op1  }, {  $pipeline_op2  }, {

     $pipeline_op3  }, {  $pipeline_op4  }, ... ); Thursday, May 10, 12
  9. Pipeline Operators • $match • $sort • $limit • $skip

    • $project • $unwind • $group Old  Faves New  Hotness Thursday, May 10, 12
  10. $match • Uses a query predicate (like .find({…})) as a

    filter {            title  :  "this  is  my  title"  ,              author  :  "bob"  ,              posted  :  new  Date(1079895594000)  ,              pageViews  :  5  ,              tags  :  [  "fun"  ,  "good"  ,  "fun"  ]  , } { $match : { author : "bob" } } { $match : { pgv : { $gt : 50, $lte : 90 } } } Thursday, May 10, 12
  11. • Sorts input documents • Requires sort key -- specified

    like index keys { $sort : { name : 1, age: -1 } } $sort Thursday, May 10, 12
  12. $limit • Limits the number of JSON documents $skip •

    Skips a number of JSON documents { $limit : 5 } { $skip : 5 } Thursday, May 10, 12
  13. $project • Project can reshape a document • add, remove,

    rename, move • Similar to .find()’s field selection syntax • But much more powerful • Can generate computed values Thursday, May 10, 12
  14. $project (include and exclude fields) { $project : { title

    : 1 , /* include this field, if it exists */ author : 1 , /* include this field, if it exists */ "comments.author" : 1 } } { $project : { title : 0 , /* exclude this field */ author : 0 , /* exclude this field */ } } Thursday, May 10, 12
  15. $project (computed fields) { $project : { title : 1,

    /* include this field if it exists */ doctoredPageViews : { $add: ["$pageViews", 10] } } } Thursday, May 10, 12
  16. Computed Fields • Prefix expression language • Add two fields

    • $add:[“$field1”, “$field2”] • Provide a value for a missing field • $ifnull:[“$field1”, “$field2”] • Nesting • $add:[“$field1”, $ifnull:[“$field2”, “$field3”]] • Date field extraction • Get year, month, day, hour, etc, from Date • Date arithmetic Thursday, May 10, 12
  17. $project (rename and pull fields up) { $project : {

    title : 1 , page_views : "$pageViews" , /* rename this field */ upgrade : "$other.foo" /* move to top level */ } } Thursday, May 10, 12
  18. $project (push fields down) { $project : { title :

    1 , stats : { pv : "$pageViews", /* rename this from the top-level */ } } } Thursday, May 10, 12
  19. $unwind • Produces document for each value in an array

    where the array value is single array element {            title  :  "this  is  my  title"  ,              author  :  "bob"  ,              posted  :  new  Date(1079895594000)  ,              pageViews  :  5  ,              tags  :  [  "fun"  ,  "good"  ,  "awesome"  ]  ,            comments  :  [                      {  author  :"joe"  ,  text  :  "this  is  cool"  }  ,                      {  author  :"sam"  ,  text  :  "this  is  bad"  }          ],            other  :  {  foo  :  5  } } Thursday, May 10, 12
  20. {            ...      

         tags  :  "fun"            ... }, {            ...            tags  :  "good"            ... } {            ...            tags  :  "awesome"            ... } Thursday, May 10, 12
  21. $unwind db.article.aggregate( { $project : { author : 1 ,

    /* include this field */ title : 1 , /* include this field */ tags : 1 /* include this field */ }}, { $unwind : "$tags" } ); Thursday, May 10, 12
  22. { "result" : [ { "_id" : ObjectId("4e6e4ef557b77501a49233f6"), "title" :

    "this is my title", "author" : "bob", "tags" : "fun" }, { "_id" : ObjectId("4e6e4ef557b77501a49233f6"), "title" : "this is my title", "author" : "bob", "tags" : "good" }, { "_id" : ObjectId("4e6e4ef557b77501a49233f6"), "title" : "this is my title", "author" : "bob", "tags" : "fun" } ], "ok" : 1 } Thursday, May 10, 12
  23. Grouping • $group aggregation expressions • Total of column values:

    $sum • Average of column values: $avg • Collect column values in an array: $push { $group : { _id: "$author", fieldname: { $aggfunc: “$field” } } } Thursday, May 10, 12
  24. $group example db.article.aggregate( { $group : { _id : "$author",

    viewsPerAuthor : { $sum : "$pageViews" } }} ); Thursday, May 10, 12
  25. {   "result"  :  [     {    

      "_id"  :  "jane",       "viewsPerAuthor"  :  6     },     {       "_id"  :  "dave",       "viewsPerAuthor"  :  7     },     {       "_id"  :  "bob",       "viewsPerAuthor"  :  5     }   ],   "ok"  :  1 } Thursday, May 10, 12
  26. Pulling it all together {          

     title  :  "this  is  my  title"  ,              author  :  "bob"  ,              posted  :  new  Date(1079895594000)  ,              pageViews  :  5  ,              tags  :  [  "fun"  ,  "good"  ,  "fun"  ] } {          tag  :  “fun” authors:  [  ...,  ...,  ...  ] }, { tag:  “good” authors:  [  ...,  ...,  ...  ] } Thursday, May 10, 12
  27. db.article.aggregate( { $project : { author : 1, tags :

    1, }}, { $unwind : "$tags" }, { $group : { _id : “$tags”, authors : { $addToSet : "$author" } }} ); Thursday, May 10, 12
  28.   "result"  :  [     {      

    "_id"  :  {  "tags"  :  "cool"  },       "authors"  :  [  "jane","dave"  ]     },     {       "_id"  :  {  "tags"  :  "fun"  },       "authors"  :  [  "dave",  "bob"  ]     },     {       "_id"  :  {  "tags"  :  "good"  },       "authors"  :  [  "bob"  ]     },     {       "_id"  :  {  "tags"  :  "awful"  },       "authors"  :  [  "jane"  ]     }   ] Thursday, May 10, 12
  29. Usage Tips • Use $match in a pipeline as early

    as possible • The query optimizer can then be used to choose an index and avoid scanning the entire collection Thursday, May 10, 12
  30. Driver Support • Initial version is a command • For

    any language, build a JSON database object, and execute the command • { aggregate : <collection>, pipeline : [ ] } • Beware of command result size limit Thursday, May 10, 12
  31. Sharding support • Initial release will support sharding • Mongos

    analyzes pipeline, and forwards operations up to first $group or $sort to shards; combines shard server results and continues Thursday, May 10, 12
  32. Common SQL • Distinct • aggregate({ $group: { _id: "$author"

    }}) • Count • aggregate({ $group: {_id:null, count: {$sum:1}}}]) • Sum • aggregate({ $group: {_id:null, total: {$sum: "$price"}}}) Thursday, May 10, 12