Using MongoDB for Data Processing

Kevin Matulef - [email protected]
Richard Kreuter - [email protected]

This talk
• New Aggregation Framework
• Map/Reduce
• Hadoop integration

Aggregation Framework: motivation
Map/Reduce is a big hammer
• Used to perform complex analytics tasks on massive amounts of data
• Users are currently using it for aggregation…
  • totaling, averaging, etc.
But...
• It should be easier to do simple aggregations
• Want to avoid the overhead of the JavaScript engine

New Aggregation Framework
• Out now in 2.1! (unstable)
• Declarative
  • No JavaScript required
• C++ implementation
  • Higher performance than JavaScript
• Expression evaluation
  • Return computed values
• Framework: we can add new operations easily

Pipeline
• Series of operations
• Members of a collection are passed through a pipeline to produce a result
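To make the pipeline idea concrete, here is a minimal sketch in plain JavaScript, not MongoDB code. It assumes each stage can be modeled as a function from an array of documents to a new array; `runPipeline`, `keepFun`, and `countAll` are hypothetical names used only for illustration.

```javascript
// Model of a pipeline: every stage takes an array of documents and
// returns a new array. The stages run in order, each one consuming
// the output of the previous stage.
function runPipeline(docs, stages) {
  return stages.reduce((current, stage) => stage(current), docs);
}

// Two hypothetical stages for this illustration.
const keepFun = (docs) => docs.filter((d) => d.tag === "fun"); // filter step
const countAll = (docs) => [{ count: docs.length }];           // summary step

const pipelineResult = runPipeline(
  [{ tag: "fun" }, { tag: "good" }, { tag: "fun" }],
  [keepFun, countAll]
);
console.log(pipelineResult); // [ { count: 2 } ]
```

Reordering the stages changes the result, which is why the order of operations in a pipeline matters.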

Aggregation helper

    db.mycollection.aggregate(
      { $pipeline_op1 },
      { $pipeline_op2 },
      { $pipeline_op3 },
      { $pipeline_op4 },
      ...
    );

Pipeline Operations
• $match
  • Uses a query predicate (like .find({…})) as a filter
• $project
  • Uses a sample document to determine the shape of the result (similar to .find()'s optional argument)
  • This can include computed values
• $unwind
  • Hands out array elements one at a time
• $group
  • Aggregates items into buckets defined by a key

Pipeline Operations (continued)
• $sort
  • Sort documents
• $limit
  • Only allow the specified number of documents to pass
• $skip
  • Skip over the specified number of documents
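The effect of $sort, $skip, and $limit applied in that order can be mimicked in plain JavaScript. This is only a sketch of the semantics described above, with made-up sample data (`scores`); it does not call MongoDB.

```javascript
// Hypothetical sample documents.
const scores = [
  { name: "a", score: 3 },
  { name: "b", score: 9 },
  { name: "c", score: 5 },
  { name: "d", score: 7 },
];

// Like { $sort : { score : -1 } }: highest score first (copy, do not mutate).
const sorted = [...scores].sort((x, y) => y.score - x.score);

// Like { $skip : 1 }: skip over the first document.
const skipped = sorted.slice(1);

// Like { $limit : 2 }: only allow two documents to pass.
const page = skipped.slice(0, 2);

console.log(page.map((d) => d.name)); // [ 'd', 'c' ]
```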

Example: finding popular checkins

    checkin_1 = {
      location: location_id,
      user: user_id,
      ts: "20/03/2010"
    }

    // Find most popular locations in last 3 hours
    > agg = db.checkins.aggregate(
        {$match: {ts: {$gt: now_minus_3_hrs}}},
        {$group: {_id: "$location", numEntries: {$sum: 1}}},
        {$sort: {numEntries: -1}}
      )
    > agg.result
    [{"_id": "Din Tai Fung", "numEntries": 17}, ....]
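The three stages of the checkins query can be traced by hand in plain JavaScript. The sketch below is an illustration of what each stage computes, not the server's implementation; the sample data, the fixed `now` value, and the numeric millisecond timestamps are all assumptions made so the example runs on its own.

```javascript
// Fixed "current time" in milliseconds so the run is repeatable (assumption).
const now = 1000000000000;
const nowMinus3Hrs = now - 3 * 60 * 60 * 1000;

// Hypothetical checkins with numeric timestamps.
const checkins = [
  { location: "Din Tai Fung", user: 1, ts: now - 10 * 60 * 1000 },
  { location: "Din Tai Fung", user: 2, ts: now - 30 * 60 * 1000 },
  { location: "Cafe X", user: 3, ts: now - 60 * 60 * 1000 },
  { location: "Cafe X", user: 1, ts: now - 5 * 60 * 60 * 1000 }, // older than 3 hours
];

// Like {$match: {ts: {$gt: now_minus_3_hrs}}}: keep recent checkins only.
const recent = checkins.filter((c) => c.ts > nowMinus3Hrs);

// Like {$group: {_id: "$location", numEntries: {$sum: 1}}}: count per location.
const counts = new Map();
for (const c of recent) {
  counts.set(c.location, (counts.get(c.location) || 0) + 1);
}
const grouped = [...counts].map(([_id, numEntries]) => ({ _id, numEntries }));

// Like {$sort: {numEntries: -1}}: most popular first.
const popular = grouped.sort((a, b) => b.numEntries - a.numEntries);

console.log(popular);
// [ { _id: 'Din Tai Fung', numEntries: 2 }, { _id: 'Cafe X', numEntries: 1 } ]
```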

More details: $project
• Project can reshape a document
  • add, remove, rename, move
• Similar to .find()'s field selection syntax
  • But much more powerful
• Can generate computed values

Example 2: $project

    {
      title : "this is my title" ,
      author : "bob" ,
      posted : new Date(1079895594000) ,
      pageViews : 5 ,
      tags : [ "fun" , "good" , "fun" ] ,
      comments : [
        { author : "steve" , ts : "4/1/12", text : "I like your post" }
      ]
    }

    > db.posts.aggregate(
        { $project : { title : 1, author : 1, "comments.author" : 1 }}
      )

    result : [{
      title : "this is my title" ,
      author : "bob" ,
      comments : [ { author : "steve" } ]
    }]
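The reshaping in this example, plus one computed value, can be imitated in plain JavaScript. This is a sketch of the idea only: `projectPost` and the computed field `adjustedViews` are hypothetical and do not appear in the slides.

```javascript
// The sample post from the slide (the `posted` date is omitted for brevity).
const post = {
  title: "this is my title",
  author: "bob",
  pageViews: 5,
  tags: ["fun", "good", "fun"],
  comments: [{ author: "steve", ts: "4/1/12", text: "I like your post" }],
};

// Keep title and author, keep only `author` inside each comment, and
// add one computed value (hypothetical: pageViews plus ten).
function projectPost(doc) {
  return {
    title: doc.title,
    author: doc.author,
    comments: doc.comments.map((c) => ({ author: c.author })),
    adjustedViews: doc.pageViews + 10,
  };
}

const projected = projectPost(post);
console.log(projected);
```

Fields that are not listed, such as `tags` and `pageViews`, are dropped from the output, which is the "remove" part of reshaping.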

More details: $unwind
• Produces one document for each value in an array; in each output document the array field holds a single array element

    { title : "this is my title" , author : "Kevin" , tags : [ "fun" , "good" , "awesome" ] }

    { ... tags : "fun" ... },
    { ... tags : "good" ... },
    { ... tags : "awesome" ... }

$unwind

    db.article.aggregate(
      { $project : { author : 1 , tags : 1 }},
      { $unwind : "$tags" }
    );

    "result" : [
      { "_id" : ObjectId("4e6e4ef557b77501a49233f6"), "author" : "Kevin", "tags" : "fun" },
      { "_id" : ObjectId("4e6e4ef557b77501a49233f6"), "author" : "Kevin", "tags" : "good" },
      { "_id" : ObjectId("4e6e4ef557b77501a49233f6"), "author" : "Kevin", "tags" : "awesome" }
    ]

    db.article.aggregate(
      { $project : { author : 1, tags : 1 }},
      { $unwind : "$tags" },
      { $group : { _id : "$tags", authors : { $addToSet : "$author" } }}
    );

    "result" : [
      { "_id" : { "tags" : "cool" }, "authors" : [ "jane", "dave" ] },
      { "_id" : { "tags" : "fun" }, "authors" : [ "dave", "bob" ] },
      { "_id" : { "tags" : "good" }, "authors" : [ "bob" ] },
      { "_id" : { "tags" : "awful" }, "authors" : [ "jane" ] }
    ]
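The combination of $unwind and $group with $addToSet can be followed step by step in plain JavaScript. This sketch assumes three hypothetical articles chosen to be consistent with the result above, and it uses the bare tag string as the group key rather than a nested `{ tags : ... }` document.

```javascript
// Hypothetical input documents, consistent with the result shown above.
const articles = [
  { author: "jane", tags: ["cool", "awful"] },
  { author: "dave", tags: ["cool", "fun"] },
  { author: "bob", tags: ["fun", "good"] },
];

// Like { $unwind : "$tags" }: one output document per array element.
const unwound = articles.flatMap((a) =>
  a.tags.map((tag) => ({ author: a.author, tags: tag }))
);

// Like $group with $addToSet: collect the distinct authors for each tag.
const authorsByTag = new Map();
for (const doc of unwound) {
  if (!authorsByTag.has(doc.tags)) {
    authorsByTag.set(doc.tags, new Set());
  }
  authorsByTag.get(doc.tags).add(doc.author);
}
const tagGroups = [...authorsByTag].map(([tag, authors]) => ({
  _id: tag,
  authors: [...authors],
}));

console.log(tagGroups);
```

A `Set` is used here because $addToSet keeps each value at most once per group; an author who used the same tag twice would still appear only once.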

Usage Tips
• Use $match in a pipeline as early as possible
  • The query optimizer can then be used to choose an index and avoid scanning the entire collection
• Use $sort in a pipeline as early as possible
  • The query optimizer can sometimes be used to choose an index to scan instead of sorting the result

Driver Support
• Initial version is a command
  • For any language, build a JSON database object, and execute the command
  • { aggregate : <collection>, pipeline : […] }
• Beware of command result size limit
  • Document size limit is 16MB
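As a rough illustration of the command form, the document can be built as an ordinary object. This sketch assumes the pipeline is given as an array of stage documents; `buildAggregateCommand` is a hypothetical helper, and the collection name and stages are borrowed from the $unwind example.

```javascript
// Build the aggregate command document as a plain object.
function buildAggregateCommand(collectionName, stages) {
  return { aggregate: collectionName, pipeline: stages };
}

const command = buildAggregateCommand("article", [
  { $project: { author: 1, tags: 1 } },
  { $unwind: "$tags" },
]);

// In the mongo shell a command document is executed with db.runCommand(command);
// drivers offer an equivalent way to run a command document.
console.log(JSON.stringify(command));
```

Because the whole result comes back in a single command reply, the size limit noted above applies to the result as a whole.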

Sharding support
• Initial release does support sharding
• Mongos analyzes the pipeline, and forwards operations up to $group or $sort to shards; combines shard server results and returns them
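One way to picture the split is a function that cuts the pipeline at the first $group or $sort. This is a simplified sketch and not how mongos is actually implemented: it assumes "up to" means the stages before the first $group or $sort go to the shards and the remainder is handled when results are combined. `splitPipeline` is a hypothetical name.

```javascript
// Split a pipeline (array of stage documents) at the first $group or $sort.
function splitPipeline(stages) {
  const isBoundary = (stage) => "$group" in stage || "$sort" in stage;
  const index = stages.findIndex(isBoundary);
  // With no $group or $sort, the whole pipeline can go to the shards.
  const cut = index === -1 ? stages.length : index;
  return {
    shardPart: stages.slice(0, cut), // forwarded to each shard
    mergePart: stages.slice(cut),    // applied to the combined shard results
  };
}

const split = splitPipeline([
  { $match: { author: "bob" } },
  { $unwind: "$tags" },
  { $group: { _id: "$tags", n: { $sum: 1 } } },
  { $sort: { n: -1 } },
]);
console.log(split.shardPart.length, split.mergePart.length); // 2 2
```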

Pipeline Operations – Future Plans
• $out
  • Saves the document stream to a collection
  • Similar to M/R $out, but with sharded output
  • Functions like a tee, so that intermediate results can be saved
