Upgrade to Pro — share decks privately, control downloads, hide ads and more …

aggregation

mpobrien
March 21, 2012
660

 aggregation

mpobrien

March 21, 2012
Tweet

Transcript

  1. Quick Overview of Document-oriented Schemaless JSON-style documents Rich Queries Scales

    Horizontally db.users.find({ last_name: 'Smith', age: {$gt : 10} }); SELECT * FROM users WHERE last_name=‘Smith’ AND age > 10; Thursday, March 22, 12
  2. Computing Aggregations in Databases SQL-based RDBMS JOIN GROUP BY AVG(),

    COUNT(), SUM(), FIRST(), LAST(), etc. MongoDB 2.0 MapReduce MongoDB 2.1+ MapReduce Aggregation Framework Thursday, March 22, 12
  3. MapReduce var map = function(){ ... emit(key, val); } var

    reduce = function(key, vals){ ... return resultVal; } Data Map() emit(k,v) Sort(k) Group(k) Reduce(k,values) k,v Finalize(k,v) k,v MongoDB map iterates on documents Document is $this 1 at time per shard Input matches output Can run multiple times Thursday, March 22, 12
  4. What’s wrong with just using MapReduce? • Map/Reduce is very

    powerful, but often overkill • Lots of users relying on it for simple aggregation tasks Thursday, March 22, 12
  5. What’s wrong with just using MapReduce? • Easy to screw

    up JavaScript • Debugging a M/R job sucks • Writing more JS for simple tasks should not be necessary (ಠợಠ) Thursday, March 22, 12
  6. New Framework • Declarative (no need to write JS) •

    Implemented directly in C++ • Expression Evaluation • Return computed values • Framework: We can extend it with new ops Thursday, March 22, 12
  7. db.article.aggregate( { $project : {author : 1,tags : 1}}, {

    $unwind : "$tags" }, { $group : {_id : “$tags”, authors:{ $addToSet:"$author"}} } ); Here’s what an aggregation query looks like: Thursday, March 22, 12
  8. db.article.aggregate( { $project : {author : 1, tags : 1}},

    { $unwind : "$tags" }, { $group : { _id : “$tags”, authors : { $addToSet:"$author"} }} ); New Helper Method: .aggregate() Operator pipeline db.runCommand({ aggregate : "article", pipeline : [ {$op1, $op2, ...} ] } Thursday, March 22, 12
  9. { "result" : [ { "_id" : "art", "authors" :

    [ "bill", "bob" ] }, { "_id" : "sports", "authors" : [ "jane", "bob" ] }, { "_id" : "food", "authors" : [ "jane", "bob" ] }, { "_id" : "science", "authors" : [ "jane", "bill", "bob" ] } ], "ok" : 1 } Output Document Looks like this: result: array of pipeline output ok: 1 for success, 0 otherwise Thursday, March 22, 12
  10. Pipeline • Input to the start of the pipeline is

    a collection • Series of operators - each one filters or transforms its input • Passes output data to next operator in the pipeline • Output of the pipeline is the result document ps -ax | tee processes.txt | more Kind of like UNIX: Thursday, March 22, 12
  11. Let’s do: 1. Tour of the pipeline operators 2. A

    couple examples based on common SQL aggregation tasks $match $unwind $group $project $skip $limit $sort Thursday, March 22, 12
  12. filters documents from pipeline with a query predicate filtered with:

    {$match: {author:”bob”}} $match {author: "bob", pageViews:5, title:"Lorem Ipsum..."} {author: "bill", pageViews:3, title:"dolor sit amet..."} {author: "joe", pageViews:52, title:"consectetur adipi..."} {author: "jane", pageViews:51, title:"sed diam..."} {author: "bob", pageViews:14, title:"magna aliquam..."} {author: "bob", pageViews:53, title:"claritas est..."} filtered with: {$match: {pageViews:{$gt:50}} {author:"bob",pageViews:5,title:"Lorem Ipsum..."} {author:"bob",pageViews:14,title:"magna aliquam..."} {author:"bob",pageViews:53,title:"claritas est..."} {author: "joe", pageViews:52, title:"consectetur adipiscing..."} {author: "jane", pageViews:51, title:"sed diam..."} {author: "bob", pageViews:53, title:"claritas est..."} Input: Thursday, March 22, 12
  13. $unwind { "_id" : ObjectId("4f...146"), "author" : "bob", "tags" :[

    "fun","good","awesome"] } explode the “tags” array with: { $unwind : ”$tags” } { _id : ObjectId("4f...146"), author : "bob", tags:"fun"}, { _id : ObjectId("4f...146"), author : "bob", tags:"good"}, { _id : ObjectId("4f...146"), author : "bob", tags:"awesome"} produces output: Produce a new document for each value in an input array Thursday, March 22, 12
  14. Bucket a subset of docs together, calculate an aggregated output

    doc from the bucket $sum $max, $min $avg $first, $last $addToSet $push db.article.aggregate( { $group : { _id : "$author", viewsPerAuthor : { $sum : "$pageViews" } } } ); $group Output Calculation Operators: Thursday, March 22, 12
  15. db.article.aggregate( { $group : { _id : "$author", viewsPerAuthor :

    { $sum : "$pageViews" } } } ); _id: selects a field to use as bucket key for grouping Output field name Operation used to calculate the output value ($sum, $max, $avg, etc.) $group (cont’d) • dot notation (nested fields) • a constant • a multi-key expression inside {...} also allowed here: Thursday, March 22, 12
  16. An example with $match and $group SELECT SUM(price) FROM orders

    WHERE customer_id = 4; MongoDB: SQL: db.orders.aggregate( {$match : {“$customer_id” : 4}}, {$group : { _id : null, total: {$sum : “price”}}) English: Find the sum of all prices of the orders placed by customer #4 Thursday, March 22, 12
  17. An example with $unwind and $group MongoDB: SQL: English: db.posts.aggregate(

    { $unwind : "$tags" }, { $group : { _id : “$tags”, authors : { $addToSet : "$author" } }} ); For all tags used in blog posts, produce a list of authors that have posted under each tag SELECT tag, author FROM post_tags LEFT JOIN posts ON post_tags.post_id = posts.id GROUP BY tag, author; Thursday, March 22, 12
  18. More operators - Controlling Pipeline Input $skip $limit $sort Similar

    to: .skip() .limit() .sort() in a regular Mongo query Thursday, March 22, 12
  19. $sort specified the same way as index keys: { $sort

    : { name : 1, age: -1 } } Must be used in order to take advantage of $first/$last with $group. order input documents Thursday, March 22, 12
  20. $limit limit the number of input documents {$limit : 5}

    $skip skips over documents {$skip : 5} Thursday, March 22, 12
  21. $project Use for: Add, Remove, Pull up, Push down, Rename

    Fields Building computed fields Reshape a document Thursday, March 22, 12
  22. $project (cont’d) Include or exclude fields {$project : { title

    : 1, author : 1} } Only pass on fields “title” and “author” {$project : { comments : 0} Exclude “comments” field, keep everything else Thursday, March 22, 12
  23. Moving + Renaming fields {$project : { page_views : “$pageViews”,

    catName : “$category.name”, info : { published : “$ctime”, update : “$mtime” } } } Rename page_views to pageViews Take nested field “category.name”, move it into top-level field called “catName” Populate a new sub-document into the output $project (cont’d) Thursday, March 22, 12
  24. db.article.aggregate( { $project : { name : 1, ! age_fixed

    : { $add:["$age", 2] } }} ); Building a Computed Field Output (computed field) Operands Expression $project (cont’d) Thursday, March 22, 12
  25. Lots of Available Expressions $project (cont’d) Numeric $add $sub $mod

    $divide $multiply Logical $eq $lte/$lt $gte/$gt $and $not $or $eq Dates $dayOfMonth $dayOfYear $dayOfWeek $second $minute $hour $week $month $isoDate Strings $substr $add $toLower $toUpper $strcasecmp Thursday, March 22, 12
  26. Example: $sort → $limit → $project → $group MongoDB: SQL:

    English: Of the most recent 1000 blog posts, how many were posted within each calendar year? SELECT YEAR(pub_time) as pub_year, COUNT(*) FROM (SELECT pub_time FROM posts ORDER BY pub_time desc) GROUP BY pub_year; db.test.aggregate( {$sort : {pub_time: -1}}, {$limit : 1000}, {$project:{pub_year:{$year:["$pub_time"]}}}, {$group: {_id:"$pub_year", num_year:{$sum:1}}} ) Thursday, March 22, 12
  27. Some Usage Notes In BSON, order matters - so computed

    fields always show up after regular fields We use $ in front of field names to distinguish fields from string literals in expressions “$name” “name” vs. Thursday, March 22, 12
  28. Some Usage Notes Use a $match,$sort and $limit first in

    pipeline if possible Cumulative Operators $group: be aware of memory usage Use $project to discard unneeded fields Remember the 16MB output limit Thursday, March 22, 12
  29. MapReduce is still important • Framework is geared towards counting/accumulating

    • If you need something more exotic, use MapReduce • No 16MB constraint on output size with MapReduce • JS in M/R is not limited to any fixed set of expressions • Hadoop Thursday, March 22, 12
  30. thanks! ✌(-‿-)✌ questions? $$$ BTW: we are hiring! http://10gen.com/jobs $$$

    @mpobrien github.com/mpobrien hit me up: Thursday, March 22, 12