Slide 1

Slide 1 text

Aggregation Framework
Ben Wen, VP MongoLab
@benwen
November 2012
Slides available at speakerdeck.com/benwen

Slide 2

Slide 2 text

About MongoLab
• MongoDB-as-a-Service
• IaaS: Amazon AWS, Rackspace, Joyent Cloud, Windows Azure
• PaaS: Heroku, AppFog, EngineYard, AppHarbor, CloudControl, Azure, Nodejitsu
• Automatic backups, 24/7 monitoring, expert support, Replica Set provisioning
• Free 500MB databases

Slide 3

Slide 3 text

What is MongoDB
• JSON denormalized document database
• The world is denormalized!
• Has primary and secondary indexes
• Great for web applications that have a main data pattern (e.g. blog posts, tweets, customers, emails)
• But what about alternate paths through data?
• A built-in Map/Reduce engine using JavaScript was the typical way

Slide 4

Slide 4 text

Aggregation Framework
• Roughly akin to GROUP BY in SQL
• Some popular calculations include finding averages, maxima and minima, statistics
• Can also pivot data from an internal array
• Alternative to map/reduce (M/R)
• C++ based = better performance than MongoDB M/R
• Declarative chained pipeline of operations
• Official documentation: http://docs.mongodb.org/manual/applications/aggregation/
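To make "declarative chained pipeline" and the GROUP BY analogy concrete, here is a minimal sketch in plain JavaScript. It does not call MongoDB: `runPipeline`, `match`, and `groupStats` are hypothetical helper names, and the `posts` data is made up for illustration. The idea is only that each stage takes a list of documents and hands a new list to the next stage.

```javascript
// A pipeline is a list of stages; each stage maps an array of documents to a new array.
const runPipeline = (docs, stages) => stages.reduce((acc, stage) => stage(acc), docs);

// Hypothetical stage: keep only documents that satisfy a predicate.
const match = (pred) => (docs) => docs.filter(pred);

// Hypothetical stage: group by `key` and compute avg, max and min of `field`,
// roughly what GROUP BY with AVG/MAX/MIN would do in SQL.
const groupStats = (key, field) => (docs) => {
  const buckets = new Map();
  for (const d of docs) {
    if (!buckets.has(d[key])) buckets.set(d[key], []);
    buckets.get(d[key]).push(d[field]);
  }
  return [...buckets].map(([k, vals]) => ({
    _id: k,
    avg: vals.reduce((a, b) => a + b, 0) / vals.length,
    max: Math.max(...vals),
    min: Math.min(...vals),
  }));
};

// Made-up sample data, not from the talk.
const posts = [
  { author: "bob", pageViews: 5 },
  { author: "bob", pageViews: 9 },
  { author: "dave", pageViews: 7 },
  { author: "jane", pageViews: 0 },
];

const stats = runPipeline(posts, [
  match((d) => d.pageViews > 0),
  groupStats("author", "pageViews"),
]);
console.log(stats);
```

In MongoDB itself the stages are written as data (for example `{ $match: ... }`, `{ $group: ... }`) rather than as functions, and the server executes them.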

Slide 5

Slide 5 text

Pipeline Operators   Description                                    Notes
$group               group documents together                       most important operator
$unwind              "unfold" an internal array                     like a cross-product of array x parent
$match               query-like interface to filter out documents   uses indexes; place early in pipeline
$project             filter and modify fields in a document         can create computed fields
$sort                sort documents                                 uses indexes, if possible
$limit               limit result set size                          use with $skip to paginate
$skip                skip documents in result set                   for pagination
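The notes on $match ("place early in pipeline") and on $limit with $skip ("to paginate") can be sketched as follows. The pipeline built by `pagePipeline` uses only the operators listed above; the `docs` data, the field names `n` and `even`, and the `emulate` function are assumptions for illustration. `emulate` is a plain-JavaScript stand-in that understands only these four stages in their simplest form, not MongoDB's implementation.

```javascript
// Hypothetical collection contents: ten documents with a counter `n`.
const docs = Array.from({ length: 10 }, (_, i) => ({ n: i + 1, even: (i + 1) % 2 === 0 }));

// Build the pipeline for one page (pages counted from 0).
// $match comes first so that later stages see fewer documents.
function pagePipeline(pageNumber, pageSize) {
  return [
    { $match: { even: true } },
    { $sort: { n: -1 } },
    { $skip: pageNumber * pageSize },
    { $limit: pageSize },
  ];
}

// Plain-JavaScript emulation of these four stages only
// (equality $match, single-key numeric $sort).
function emulate(input, pipeline) {
  return pipeline.reduce((acc, stage) => {
    if (stage.$match) {
      const entries = Object.entries(stage.$match);
      return acc.filter((d) => entries.every(([k, v]) => d[k] === v));
    }
    if (stage.$sort) {
      const [[key, dir]] = Object.entries(stage.$sort);
      return [...acc].sort((a, b) => dir * (a[key] - b[key]));
    }
    if (stage.$skip !== undefined) return acc.slice(stage.$skip);
    if (stage.$limit !== undefined) return acc.slice(0, stage.$limit);
    throw new Error("unsupported stage");
  }, input);
}

const secondPage = emulate(docs, pagePipeline(1, 2));
console.log(secondPage.map((d) => d.n)); // [ 6, 4 ]
```

Against a real server, the array returned by `pagePipeline` would be handed to the collection's aggregate helper instead of `emulate`.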

Slide 6

Slide 6 text

Pipeline Operators (the operator table from Slide 5, shown again)

Slide 7

Slide 7 text

Demo One: Pivot to tags on blog articles

Slide 8

Slide 8 text

Project Unwind Group

Slide 9

Slide 9 text

MongoDB 2.2 Aggregation Framework (follow the "fun" tag)

Collection

{ "_id" : ObjectId("…06"), "title": "this is my title", "author": "bob", "posted": …, "pageViews": 5, "tags": [ "fun", "good", "fun" ], "comments": … "other": … },
{ "_id" : ObjectId("…07"), "title": "this is your title", "author": "dave", "posted": …, "pageViews": 7, "tags": [ "fun", "nasty" ], … },
{ "_id" : ObjectId("…08"), "title": "this is some other title", "author": "jane", "posted": …, "pageViews": 6, "tags": [ "nasty", "filthy" ], … }

$project : { author : 1, tags : 1, pageViews : 1 }

Intermediate-1

{ "_id" : ObjectId("…06"), "author": "bob", "tags": [ "fun", "good", "fun" ], "pageViews": 5 },
{ "_id" : ObjectId("…07"), "author": "dave", "tags": [ "fun", "nasty" ], "pageViews": 7 },
{ "_id" : ObjectId("…08"), "author": "jane", "tags": [ "nasty", "filthy" ], "pageViews": 6 }

$unwind : "$tags"

Intermediate-2

{ "_id" : ObjectId("…06"), "author" : "bob", "pageViews" : 5, "tags" : "fun" },
{ "_id" : ObjectId("…06"), "author" : "bob", "pageViews" : 5, "tags" : "good" },
{ "_id" : ObjectId("…06"), "author" : "bob", "pageViews" : 5, "tags" : "fun" },
{ "_id" : ObjectId("…07"), "author" : "dave", "pageViews" : 7, "tags" : "fun" },
{ "_id" : ObjectId("…07"), "author" : "dave", "pageViews" : 7, "tags" : "nasty" },
{ "_id" : ObjectId("…08"), "author" : "jane", "pageViews" : 6, "tags" : "nasty" },
{ "_id" : ObjectId("…08"), "author" : "jane", "pageViews" : 6, "tags" : "filthy" }

$group : { _id : "$tags", docsByTag : { $sum : 1 }, viewsByTag : { $sum : "$pageViews" }, mostViewsByTag : { $max : "$pageViews" }, avgByTag : { $avg : "$pageViews" } }

Result

{ "result" : [
  { "_id" : "filthy", "docsByTag" : 1, "viewsByTag" : 6, "mostViewsByTag" : 6, "avgByTag" : 6 },
  { "_id" : "good", "docsByTag" : 1, "viewsByTag" : 5, "mostViewsByTag" : 5, "avgByTag" : 5 },
  { "_id" : "nasty", "docsByTag" : 2, "viewsByTag" : 13, "mostViewsByTag" : 7, "avgByTag" : 6.5 },
  { "_id" : "fun", "docsByTag" : 3, "viewsByTag" : 17, "mostViewsByTag" : 7, "avgByTag" : 5.666…7 } ],
  "ok" : 1 }

Follow the flow of three documents in a MongoDB 2.2 Collection as they undergo three stages of an aggregation pipeline to the Result. The path of the three embedded array elements "fun" highlights the $group expression. It pivots on array elements, while calculating cross-document values. For more information, including the actual aggregation pipeline and dataset, see http://blog.mongolab.com/2012/07/aggregation-example/ or http://bit.ly/22agg

Infographic © 2012 ObjectLabs Corp. "MongoDB" and "Mongo" are trademarks of 10gen, Inc. and are used with permission. Code via Chris Westin, 10gen. Tnx!
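The three stages in the infographic can be traced in plain JavaScript. This is a sketch, not MongoDB code: the `pipeline` array repeats the stages from the slide as they would be written for the shell, while the hand-written steps below it only imitate what each stage does to the three sample documents. The `_id` strings are stand-ins for the elided ObjectIds, the elided fields are left out, and the variable names are hypothetical.

```javascript
// Sample documents from the slide; elided fields are omitted.
const articles = [
  { _id: "…06", title: "this is my title", author: "bob", pageViews: 5, tags: ["fun", "good", "fun"] },
  { _id: "…07", title: "this is your title", author: "dave", pageViews: 7, tags: ["fun", "nasty"] },
  { _id: "…08", title: "this is some other title", author: "jane", pageViews: 6, tags: ["nasty", "filthy"] },
];

// The pipeline from the slide, written as data. In the mongo shell it would be
// passed to the collection's aggregate helper; here it is shown for reference only.
const pipeline = [
  { $project: { author: 1, tags: 1, pageViews: 1 } },
  { $unwind: "$tags" },
  { $group: {
      _id: "$tags",
      docsByTag: { $sum: 1 },
      viewsByTag: { $sum: "$pageViews" },
      mostViewsByTag: { $max: "$pageViews" },
      avgByTag: { $avg: "$pageViews" },
  } },
];

// Intermediate-1 ($project): keep _id, author, tags and pageViews.
const projected = articles.map(({ _id, author, tags, pageViews }) => ({ _id, author, tags, pageViews }));

// Intermediate-2 ($unwind): one document per element of the tags array.
const unwound = projected.flatMap((doc) => doc.tags.map((tag) => ({ ...doc, tags: tag })));

// Result ($group): accumulate count, sum and max per tag, then derive the average.
const groups = new Map();
for (const doc of unwound) {
  const g = groups.get(doc.tags) ||
    { _id: doc.tags, docsByTag: 0, viewsByTag: 0, mostViewsByTag: -Infinity };
  g.docsByTag += 1;
  g.viewsByTag += doc.pageViews;
  g.mostViewsByTag = Math.max(g.mostViewsByTag, doc.pageViews);
  groups.set(doc.tags, g);
}
const result = [...groups.values()].map((g) => ({ ...g, avgByTag: g.viewsByTag / g.docsByTag }));
console.log(result);
```

Note that bob's document carries "fun" twice, so it is counted twice under that tag; this is why docsByTag for "fun" is 3 although only two articles use it.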

Slide 10

Slide 10 text

(Detail of the infographic: the Collection documents and the start of Intermediate-1, produced by $project : { author : 1, tags : 1, pageViews : 1 }.)

Slide 11

Slide 11 text

(Detail of the infographic: Collection to Intermediate-1 via $project : { author : 1, tags : 1, pageViews : 1 }, with $unwind : "$tags" coming into view. Follow the "fun" tag.)

Slide 12

Slide 12 text

(Detail of the infographic: Intermediate-1 to Intermediate-2 via $unwind : "$tags". Each element of a tags array becomes its own document, so bob's [ "fun", "good", "fun" ] yields three documents. Follow the "fun" tag.)

Slide 13

Slide 13 text

(Detail of the infographic: Intermediate-2 to Result via $group : { _id : "$tags", docsByTag : { $sum : 1 }, viewsByTag : { $sum : "$pageViews" }, mostViewsByTag : { $max : "$pageViews" }, avgByTag : { $avg : "$pageViews" } }. The three "fun" documents collapse into one result with docsByTag 3 and viewsByTag 17.)

Slide 14

Slide 14 text

(The full infographic from Slide 9, shown again: Collection, $project, Intermediate-1, $unwind, Intermediate-2, $group, Result.)

Slide 15

Slide 15 text

Demo Two: Pivot and sum by year

Slide 16

Slide 16 text

Pipeline Operators (the operator table from Slide 5, shown again)

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

CEX 2009 BLS data sample

Collection

[
  {
    "seasonal": "U",
    "column_text": "All Consumer Units",
    "item_text": "Bakery products",
    "seriesdata": [
      { "period": "A01", "value": 178.000, "year": 1984 },
      { "period": "A01", "value": 194.000, "year": 1985 },
      { "period": "A01", "value": 183.000, "year": 1986 },
      …

Slide 19

Slide 19 text

Pipeline
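The pipeline shown on this slide did not survive in the text, so the following is only a guess at what "pivot and sum by year" might look like for the CEX sample: unwind the seriesdata array, then group on its year and sum its value. The stage list in `guessedPipeline`, the output field name `total`, and the second item ("Hypothetical item" with made-up values) are all assumptions; only the "Bakery products" figures come from the slide. The steps after it imitate those stages in plain JavaScript rather than calling MongoDB.

```javascript
// First document: values from the slide. Second document: made up for illustration.
const cex = [
  { item_text: "Bakery products", seriesdata: [
    { period: "A01", value: 178.0, year: 1984 },
    { period: "A01", value: 194.0, year: 1985 },
    { period: "A01", value: 183.0, year: 1986 },
  ] },
  { item_text: "Hypothetical item", seriesdata: [
    { period: "A01", value: 100.0, year: 1984 },
    { period: "A01", value: 110.0, year: 1985 },
    { period: "A01", value: 120.0, year: 1986 },
  ] },
];

// Assumed pipeline, written as data for reference (not taken from the talk).
const guessedPipeline = [
  { $unwind: "$seriesdata" },
  { $group: { _id: "$seriesdata.year", total: { $sum: "$seriesdata.value" } } },
  { $sort: { _id: 1 } },
];

// $unwind: one document per seriesdata entry.
const unwoundSeries = cex.flatMap((doc) => doc.seriesdata.map((s) => ({ ...doc, seriesdata: s })));

// $group: sum the values per year.
const totals = new Map();
for (const d of unwoundSeries) {
  const { year, value } = d.seriesdata;
  totals.set(year, (totals.get(year) || 0) + value);
}

// $sort: ascending by year.
const byYear = [...totals]
  .map(([year, total]) => ({ _id: year, total }))
  .sort((a, b) => a._id - b._id);
console.log(byYear);
```

The pivot here is the same move as in Demo One: the grouping key comes from inside the embedded array rather than from the parent document.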

Slide 20

Slide 20 text

Summary
• Aggregation Framework in MongoDB helps compute across documents
• Uses a declarative pipeline of operations
• Also reshapes and pivots documents
• New in 2.1 and 2.2
• Try it out, all the kids are doing it

Slide 21

Slide 21 text

Thanks to
• 10gen for MongoDB
• Chris Westin, initial author of the aggregation framework
• SVWB and Hurricane Electric for hosting

Slide 22

Slide 22 text

Aggregation Framework
Ben Wen, VP MongoLab
@benwen
November 2012
Slides available at speakerdeck.com/benwen