Slide 1

Slide 1 text

MongoDB’s New Aggregation Framework Jared Rosoff @forjared Thursday, May 10, 12

Slide 2

Slide 2 text

What we want to do SELECT customer_id, SUM(price) FROM orders WHERE active=true GROUP BY customer_id Thursday, May 10, 12

Slide 3

Slide 3 text

What we want to do SELECT customer_id, SUM(price) FROM orders WHERE active=true GROUP BY customer_id Computed   Fields Grouping Thursday, May 10, 12

Slide 4

Slide 4 text

2.1 available now (unstable) Thursday, May 10, 12

Slide 5

Slide 5 text

Map Reduce • Used to perform complex analytics tasks on massive amounts of data • Users are currently using it for aggregation… • totaling, averaging, etc Map/Reduce is a big hammer Thursday, May 10, 12

Slide 6

Slide 6 text

• It should be easier to do simple aggregations • Shouldn’t need to write JavaScript • Avoid the overhead of JavaScript engine Problem Thursday, May 10, 12

Slide 7

Slide 7 text

New Aggregation Framework • Declarative • No JavaScript required • C++ implementation • Higher performance than JavaScript • Expression evaluation • Return computed values • Framework: we can add new operations easily Thursday, May 10, 12

Slide 8

Slide 8 text

Pipeline • Series of operations • Members of a collection are passed through a pipeline to produce a result Thursday, May 10, 12

Slide 9

Slide 9 text

The Aggregation Command • Takes two arguments • Aggregate -- name of collection • Pipeline -- array of pipeline operators db.runCommand(    {          aggregate  :  "article",        pipeline  :  [  {$op1,  $op2,  ...}  ]    } ); Thursday, May 10, 12

Slide 10

Slide 10 text

Aggregation helper db.article.aggregate( {  $pipeline_op1  }, {  $pipeline_op2  }, {  $pipeline_op3  }, {  $pipeline_op4  }, ... ); Thursday, May 10, 12

Slide 11

Slide 11 text

Pipeline Operators • $match • $sort • $limit • $skip • $project • $unwind • $group Old  Faves New  Hotness Thursday, May 10, 12

Slide 12

Slide 12 text

$match • Uses a query predicate (like .find({…})) as a filter {            title  :  "this  is  my  title"  ,              author  :  "bob"  ,              posted  :  new  Date(1079895594000)  ,              pageViews  :  5  ,              tags  :  [  "fun"  ,  "good"  ,  "fun"  ]  , } { $match : { author : "bob" } } { $match : { pgv : { $gt : 50, $lte : 90 } } } Thursday, May 10, 12

Slide 13

Slide 13 text

• Sorts input documents • Requires sort key -- specified like index keys { $sort : { name : 1, age: -1 } } $sort Thursday, May 10, 12

Slide 14

Slide 14 text

$limit • Limits the number of JSON documents $skip • Skips a number of JSON documents { $limit : 5 } { $skip : 5 } Thursday, May 10, 12

Slide 15

Slide 15 text

$project • Project can reshape a document • add, remove, rename, move • Similar to .find()’s field selection syntax • But much more powerful • Can generate computed values Thursday, May 10, 12

Slide 16

Slide 16 text

$project (include and exclude fields) { $project : { title : 1 , /* include this field, if it exists */ author : 1 , /* include this field, if it exists */ "comments.author" : 1 } } { $project : { title : 0 , /* exclude this field */ author : 0 , /* exclude this field */ } } Thursday, May 10, 12

Slide 17

Slide 17 text

$project (computed fields) { $project : { title : 1, /* include this field if it exists */ doctoredPageViews : { $add: ["$pageViews", 10] } } } Thursday, May 10, 12

Slide 18

Slide 18 text

Computed Fields • Prefix expression language • Add two fields • $add:[“$field1”, “$field2”] • Provide a value for a missing field • $ifnull:[“$field1”, “$field2”] • Nesting • $add:[“$field1”, $ifnull:[“$field2”, “$field3”]] • Date field extraction • Get year, month, day, hour, etc, from Date • Date arithmetic Thursday, May 10, 12

Slide 19

Slide 19 text

$project (rename and pull fields up) { $project : { title : 1 , page_views : "$pageViews" , /* rename this field */ upgrade : "$other.foo" /* move to top level */ } } Thursday, May 10, 12

Slide 20

Slide 20 text

$project (push fields down) { $project : { title : 1 , stats : { pv : "$pageViews", /* rename this from the top-level */ } } } Thursday, May 10, 12

Slide 21

Slide 21 text

$unwind • Produces document for each value in an array where the array value is single array element {            title  :  "this  is  my  title"  ,              author  :  "bob"  ,              posted  :  new  Date(1079895594000)  ,              pageViews  :  5  ,              tags  :  [  "fun"  ,  "good"  ,  "awesome"  ]  ,            comments  :  [                      {  author  :"joe"  ,  text  :  "this  is  cool"  }  ,                      {  author  :"sam"  ,  text  :  "this  is  bad"  }          ],            other  :  {  foo  :  5  } } Thursday, May 10, 12

Slide 22

Slide 22 text

{            ...            tags  :  "fun"            ... }, {            ...            tags  :  "good"            ... } {            ...            tags  :  "awesome"            ... } Thursday, May 10, 12

Slide 23

Slide 23 text

$unwind db.article.aggregate( { $project : { author : 1 , /* include this field */ title : 1 , /* include this field */ tags : 1 /* include this field */ }}, { $unwind : "$tags" } ); Thursday, May 10, 12

Slide 24

Slide 24 text

{ "result" : [ { "_id" : ObjectId("4e6e4ef557b77501a49233f6"), "title" : "this is my title", "author" : "bob", "tags" : "fun" }, { "_id" : ObjectId("4e6e4ef557b77501a49233f6"), "title" : "this is my title", "author" : "bob", "tags" : "good" }, { "_id" : ObjectId("4e6e4ef557b77501a49233f6"), "title" : "this is my title", "author" : "bob", "tags" : "fun" } ], "ok" : 1 } Thursday, May 10, 12

Slide 25

Slide 25 text

Grouping • $group aggregation expressions • Total of column values: $sum • Average of column values: $avg • Collect column values in an array: $push { $group : { _id: "$author", fieldname: { $aggfunc: “$field” } } } Thursday, May 10, 12

Slide 26

Slide 26 text

$group example db.article.aggregate( { $group : { _id : "$author", viewsPerAuthor : { $sum : "$pageViews" } }} ); Thursday, May 10, 12

Slide 27

Slide 27 text

{   "result"  :  [     {       "_id"  :  "jane",       "viewsPerAuthor"  :  6     },     {       "_id"  :  "dave",       "viewsPerAuthor"  :  7     },     {       "_id"  :  "bob",       "viewsPerAuthor"  :  5     }   ],   "ok"  :  1 } Thursday, May 10, 12

Slide 28

Slide 28 text

Group Aggregation Functions $min $avg $push $sum $addToSet $first $last $max Thursday, May 10, 12

Slide 29

Slide 29 text

Pulling it all together {            title  :  "this  is  my  title"  ,              author  :  "bob"  ,              posted  :  new  Date(1079895594000)  ,              pageViews  :  5  ,              tags  :  [  "fun"  ,  "good"  ,  "fun"  ] } {          tag  :  “fun” authors:  [  ...,  ...,  ...  ] }, { tag:  “good” authors:  [  ...,  ...,  ...  ] } Thursday, May 10, 12

Slide 30

Slide 30 text

db.article.aggregate( { $project : { author : 1, tags : 1, }}, { $unwind : "$tags" }, { $group : { _id : “$tags”, authors : { $addToSet : "$author" } }} ); Thursday, May 10, 12

Slide 31

Slide 31 text

  "result"  :  [     {       "_id"  :  {  "tags"  :  "cool"  },       "authors"  :  [  "jane","dave"  ]     },     {       "_id"  :  {  "tags"  :  "fun"  },       "authors"  :  [  "dave",  "bob"  ]     },     {       "_id"  :  {  "tags"  :  "good"  },       "authors"  :  [  "bob"  ]     },     {       "_id"  :  {  "tags"  :  "awful"  },       "authors"  :  [  "jane"  ]     }   ] Thursday, May 10, 12

Slide 32

Slide 32 text

Usage Tips • Use $match in a pipeline as early as possible • The query optimizer can then be used to choose an index and avoid scanning the entire collection Thursday, May 10, 12

Slide 33

Slide 33 text

Driver Support • Initial version is a command • For any language, build a JSON database object, and execute the command • { aggregate : , pipeline : [ ] } • Beware of command result size limit Thursday, May 10, 12

Slide 34

Slide 34 text

Sharding support • Initial release will support sharding • Mongos analyzes pipeline, and forwards operations up to first $group or $sort to shards; combines shard server results and continues Thursday, May 10, 12

Slide 35

Slide 35 text

Common SQL • Distinct • aggregate({ $group: { _id: "$author" }}) • Count • aggregate({ $group: {_id:null, count: {$sum:1}}}]) • Sum • aggregate({ $group: {_id:null, total: {$sum: "$price"}}}) Thursday, May 10, 12