MongoDB Aggregation Framework Overview

Ben Wen
November 28, 2012

Two walkthroughs on use of the new MongoDB Aggregation Framework

Transcript

  1. About MongoLab
     • MongoDB-as-a-Service
     • IaaS: Amazon AWS, Rackspace, Joyent Cloud, Windows Azure
     • PaaS: Heroku, AppFog, EngineYard, AppHarbor, CloudControl, Azure, Nodejitsu
     • Automatic backups, 24/7 monitoring, expert support, Replica Set provisioning
     • Free 500MB databases
  2. What is MongoDB
     • JSON denormalized document database
     • The world is denormalized!
     • Has primary and secondary indexes
     • Great for web applications that have a main data pattern (e.g. blog posts, tweets, customers, emails)
     • But what about alternate paths through data?
     • A built-in Map/Reduce engine using JavaScript was the typical way
  3. Aggregation Framework
     • Roughly akin to GROUP BY in SQL
     • Popular calculations include averages, maxima, minima, and other statistics
     • Can also pivot data from an internal array
     • Alternative to map/reduce (M/R)
     • C++ based = better performance than MongoDB M/R
     • Declarative chained pipeline of operations
     • Official documentation: http://docs.mongodb.org/manual/applications/aggregation/
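The GROUP BY comparison can be made concrete with a small sketch. The `pipeline` literal below is the kind of stage list one would hand to `aggregate` in the mongo shell. The `posts` array and the `groupStats` helper are hypothetical and not from the slides: they are plain JavaScript, meant only to show what such a `$group` stage computes, not how the server implements it.

```javascript
// SQL analogue:
//   SELECT author, AVG(pageViews), MAX(pageViews), MIN(pageViews)
//   FROM posts GROUP BY author;

// The same request written as a declarative pipeline (an array of stage documents).
const pipeline = [
  { $group: {
      _id: "$author",
      avgViews: { $avg: "$pageViews" },
      maxViews: { $max: "$pageViews" },
      minViews: { $min: "$pageViews" }
  } }
];

// Hypothetical sample data, not taken from the slides.
const posts = [
  { author: "bob", pageViews: 5 },
  { author: "bob", pageViews: 9 },
  { author: "dave", pageViews: 7 }
];

// Plain-JavaScript emulation of the $group stage above, for illustration only.
function groupStats(docs, keyField, valueField) {
  const groups = new Map();
  for (const doc of docs) {
    const key = doc[keyField];
    const value = doc[valueField];
    const g = groups.get(key) ||
      { _id: key, sum: 0, count: 0, maxViews: -Infinity, minViews: Infinity };
    g.sum += value;
    g.count += 1;
    g.maxViews = Math.max(g.maxViews, value);
    g.minViews = Math.min(g.minViews, value);
    groups.set(key, g);
  }
  // Turn the running totals into one output document per group.
  return Array.from(groups.values()).map(g => ({
    _id: g._id,
    avgViews: g.sum / g.count,
    maxViews: g.maxViews,
    minViews: g.minViews
  }));
}

const stats = groupStats(posts, "author", "pageViews");
console.log(JSON.stringify(pipeline));
console.log(stats);
```

The pipeline is data rather than code, which is what "declarative" means here: the server decides how to execute it.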
  4. Pipeline Operators

     Operator  Description                                   Notes
     $group    group documents together                      most important operator
     $unwind   “unfold” an internal array                    like a cross-product of array x parent
     $match    query-like interface to filter out documents  uses indexes; place early in pipeline
     $project  filter and modify fields in a document        can create computed fields
     $sort     sort documents                                uses indexes, if possible
     $limit    limit result set size                         use with $skip to paginate
     $skip     skip documents in result set                  for pagination
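The notes on `$match`, `$sort`, `$skip` and `$limit` combine naturally into a paging pipeline. The sketch below only builds the stage list. The `pagePipeline` helper, its parameters, and the choice of sorting by `pageViews` are assumptions made for illustration, and the field names are borrowed from the blog-post documents used later in the deck.

```javascript
// Hypothetical helper: builds the stages for one page of an author's posts.
function pagePipeline(author, pageNumber, pageSize) {
  return [
    // Filter first, so the stage can use an index and later stages see fewer documents.
    { $match: { author: author } },
    // Most-viewed posts first.
    { $sort: { pageViews: -1 } },
    // Skip the earlier pages, then cap the result at one page.
    { $skip: (pageNumber - 1) * pageSize },
    { $limit: pageSize },
    // Keep only the fields the page needs (_id is included by default).
    { $project: { title: 1, pageViews: 1 } }
  ];
}

const page3 = pagePipeline("bob", 3, 10);
console.log(JSON.stringify(page3, null, 2));
```

In a shell session the returned array would be passed to the collection's `aggregate` helper. The order matters: `$skip` has to come before `$limit`, otherwise the limit is applied to the wrong window.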
  5. Pipeline Operators (same table as slide 4)
  6. MongoDB 2.2 Aggregation Framework (follow the "fun" tag)

     Collection
       { "_id" : ObjectId("…06"), "title": "this is my title", "author": "bob", "posted": …, "pageViews": 5, "tags": [ "fun", "good", "fun" ], "comments": … "other": … },
       { "_id" : ObjectId("…07"), "title": "this is your title", "author": "dave", "posted": …, "pageViews": 7, "tags": [ "fun", "nasty" ], … },
       { "_id" : ObjectId("…08"), "title": "this is some other title", "author": "jane", "posted": …, "pageViews": 6, "tags": [ "nasty", "filthy" ], … }

     $project : { author : 1, tags : 1, pageViews : 1 }

     Intermediate-1
       { "_id" : ObjectId("…06"), "author": "bob", "tags": [ "fun", "good", "fun" ], "pageViews": 5 },
       { "_id" : ObjectId("…07"), "author": "dave", "tags": [ "fun", "nasty" ], "pageViews": 7 },
       { "_id" : ObjectId("…08"), "author": "jane", "tags": [ "nasty", "filthy" ], "pageViews": 6 }

     $unwind : "$tags"

     Intermediate-2
       { "_id" : ObjectId("…06"), "author" : "bob", "pageViews" : 5, "tags" : "fun" },
       { "_id" : ObjectId("…06"), "author" : "bob", "pageViews" : 5, "tags" : "good" },
       { "_id" : ObjectId("…06"), "author" : "bob", "pageViews" : 5, "tags" : "fun" },
       { "_id" : ObjectId("…07"), "author" : "dave", "pageViews" : 7, "tags" : "fun" },
       { "_id" : ObjectId("…07"), "author" : "dave", "pageViews" : 7, "tags" : "nasty" },
       { "_id" : ObjectId("…08"), "author" : "jane", "pageViews" : 6, "tags" : "nasty" },
       { "_id" : ObjectId("…08"), "author" : "jane", "pageViews" : 6, "tags" : "filthy" }

     $group : { _id : "$tags", docsByTag : { $sum : 1 }, viewsByTag : { $sum : "$pageViews" }, mostViewsByTag : { $max : "$pageViews" }, avgByTag : { $avg : "$pageViews" } }

     Result
       { "result" : [
         { "_id" : "filthy", "docsByTag" : 1, "viewsByTag" : 6, "mostViewsByTag" : 6, "avgByTag" : 6 },
         { "_id" : "good", "docsByTag" : 1, "viewsByTag" : 5, "mostViewsByTag" : 5, "avgByTag" : 5 },
         { "_id" : "nasty", "docsByTag" : 2, "viewsByTag" : 13, "mostViewsByTag" : 7, "avgByTag" : 6.5 },
         { "_id" : "fun", "docsByTag" : 3, "viewsByTag" : 17, "mostViewsByTag" : 7, "avgByTag" : 5.666…7 } ],
         "ok" : 1 }

     Follow the flow of three documents in a MongoDB 2.2 Collection as they undergo three stages of an aggregation pipeline to the Result. The path of the three embedded array elements “fun” highlights the $group expression. It pivots on array elements, while calculating cross-document values. For more information, including the actual aggregation pipeline and dataset, see http://blog.mongolab.com/2012/07/aggregation-example/ or http://bit.ly/22agg

     Infographic © 2012 ObjectLabs Corp. “MongoDB” and “Mongo” are trademarks of 10gen, Inc. and are used with permission. Code via Chris Westin, 10gen. Tnx!
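To check the numbers in the infographic by hand, the three stages can be emulated in plain JavaScript. This is a sketch of what each stage does to the documents, not the server's implementation. The helper names `project`, `unwind` and `groupByTag` are made up for this example, and the `_id` values are left out because the slide elides them.

```javascript
// The three documents from the Collection panel, reduced to the fields shown in full.
const collection = [
  { title: "this is my title", author: "bob", pageViews: 5, tags: ["fun", "good", "fun"] },
  { title: "this is your title", author: "dave", pageViews: 7, tags: ["fun", "nasty"] },
  { title: "this is some other title", author: "jane", pageViews: 6, tags: ["nasty", "filthy"] }
];

// $project : { author : 1, tags : 1, pageViews : 1 } -> keep only the named fields.
const project = docs =>
  docs.map(({ author, tags, pageViews }) => ({ author, tags, pageViews }));

// $unwind : "$tags" -> one output document per array element.
const unwind = docs =>
  docs.flatMap(doc => doc.tags.map(tag => ({ ...doc, tags: tag })));

// $group on "$tags" with $sum, $max and $avg accumulators.
function groupByTag(docs) {
  const groups = new Map();
  for (const { tags, pageViews } of docs) {
    const g = groups.get(tags) ||
      { _id: tags, docsByTag: 0, viewsByTag: 0, mostViewsByTag: -Infinity };
    g.docsByTag += 1;                                          // $sum : 1
    g.viewsByTag += pageViews;                                 // $sum : "$pageViews"
    g.mostViewsByTag = Math.max(g.mostViewsByTag, pageViews);  // $max : "$pageViews"
    groups.set(tags, g);
  }
  // $avg : "$pageViews" is the running sum divided by the count.
  return Array.from(groups.values())
    .map(g => ({ ...g, avgByTag: g.viewsByTag / g.docsByTag }));
}

const intermediate1 = project(collection);
const intermediate2 = unwind(intermediate1);
const result = groupByTag(intermediate2);
console.log(result);
```

The groups come out in first-seen order here, which differs from the order in the Result panel; the values per tag are what should match.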
  7. Infographic detail (cropped view of slide 6): Collection to Intermediate-1 via $project
  8. Infographic detail (cropped view of slide 6): Collection and Intermediate-1, with $project and the start of $unwind
  9. Infographic detail (cropped view of slide 6): Intermediate-1 to Intermediate-2 via $unwind
  10. Infographic detail (cropped view of slide 6): Intermediate-2 to Result via $group
  11. MongoDB 2.2 Aggregation Framework infographic (full view, same as slide 6)
  12. Pipeline Operators (same table as slide 4)
  13. CEX 2009 BLS data sample

      Collection
        [ { "seasonal": "U",
            "column_text": "All Consumer Units",
            "item_text": "Bakery products",
            "seriesdata": [
              { "period": "A01", "value": 178.000, "year": 1984 },
              { "period": "A01", "value": 194.000, "year": 1985 },
              { "period": "A01", "value": 183.000, "year": 1986 },
              …
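The transcript shows only the start of one BLS document and does not include the pipeline used in this second walkthrough. As a guess at the kind of query this shape invites, the sketch below unwinds `seriesdata` and averages `value` per `item_text`. The `blsPipeline` stages are an assumption, not the presenter's code, and the plain-JavaScript check uses only the three entries visible on the slide.

```javascript
// The visible part of the sample document; the slide truncates the rest.
const sample = {
  seasonal: "U",
  column_text: "All Consumer Units",
  item_text: "Bakery products",
  seriesdata: [
    { period: "A01", value: 178.0, year: 1984 },
    { period: "A01", value: 194.0, year: 1985 },
    { period: "A01", value: 183.0, year: 1986 }
  ]
};

// Assumed pipeline (not from the slides): average yearly value per item.
const blsPipeline = [
  { $unwind: "$seriesdata" },
  { $group: { _id: "$item_text", avgValue: { $avg: "$seriesdata.value" } } }
];

// Plain-JavaScript equivalent for the single document above.
const values = sample.seriesdata.map(entry => entry.value);
const avgValue = values.reduce((sum, v) => sum + v, 0) / values.length;

console.log(JSON.stringify(blsPipeline));
console.log({ _id: sample.item_text, avgValue: avgValue });
```

The pattern is the same as in the tag example: `$unwind` turns each array entry into its own document so that `$group` can aggregate across them.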
  14. Summary
      • Aggregation Framework in MongoDB helps compute across documents
      • Uses a declarative pipeline of operations
      • Also reshapes and pivots documents
      • New in 2.1 and 2.2
      • Try it out, all the kids are doing it
  15. Thanks to
      • 10gen for MongoDB
      • Chris Westin, initial author of the aggregation framework
      • SVWB and Hurricane Electric for hosting