Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Berlin MUG - Aggregation Overview

Berlin MUG - Aggregation Overview

MongoDB 2.2 Aggregation overview

cj_harris5

June 08, 2012
Tweet

More Decks by cj_harris5

Other Decks in Technology

Transcript

  1. Terminology RDBMS MongoDB Table Collection Row(s) JSON Document Index Index

    Join Embedding & Linking Partition Shard Partition Key Shard Key Friday, 8 June 12
  2. Here is a “simple” SQL Model mysql> select * from

    book; +----+----------------------------------------------------------+ | id | title | +----+----------------------------------------------------------+ | 1 | The Demon-Haunted World: Science as a Candle in the Dark | | 2 | Cosmos | | 3 | Programming in Scala | +----+----------------------------------------------------------+ 3 rows in set (0.00 sec) mysql> select * from bookauthor; +---------+-----------+ | book_id | author_id | +---------+-----------+ | 1 | 1 | | 2 | 1 | | 3 | 2 | | 3 | 3 | | 3 | 4 | +---------+-----------+ 5 rows in set (0.00 sec) mysql> select * from author; +----+-----------+------------+-------------+-------------+---------------+ | id | last_name | first_name | middle_name | nationality | year_of_birth | +----+-----------+------------+-------------+-------------+---------------+ | 1 | Sagan | Carl | Edward | NULL | 1934 | | 2 | Odersky | Martin | NULL | DE | 1958 | | 3 | Spoon | Lex | NULL | NULL | NULL | | 4 | Venners | Bill | NULL | NULL | NULL | +----+-----------+------------+-------------+-------------+---------------+ 4 rows in set (0.00 sec) Friday, 8 June 12
  3. The Same Data in MongoDB { "_id" : ObjectId("4dfa6baa9c65dae09a4bbda5"), "title"

    : "Programming in Scala", "author" : [ { "first_name" : "Martin", "last_name" : "Odersky", "nationality" : "DE", "year_of_birth" : 1958 }, { "first_name" : "Lex", "last_name" : "Spoon" }, { "first_name" : "Bill", "last_name" : "Venners" } ] } Friday, 8 June 12
  4. Map Reduce The caller provides map and reduce functions written

    in JavaScript // Emit each tag > map = "this['tags'].forEach( function(item) {emit(item, 1);} );" // Calculate totals > reduce = "function(key, values) { var total = 0; var valuesSize = values.length; for (var i=0; i < valuesSize; i++) { total += parseInt(values[i], 10); } return total; }; Friday, 8 June 12
  5. // run the map reduce > db.posts.mapReduce(map, reduce, {"out": {

    inline : 1}}); { "results" : [ {"_id" : "databases", "value" : 1}, {"_id" : "tech", "value" : 1 } ], "timeMillis" : 1, "counts" : { "input" : 1, "emit" : 2, "reduce" : 0, "output" : 2 }, "ok" : 1, } Map Reduce Friday, 8 June 12
  6. What problem are we solving? • Map/Reduce can be used

    for aggregation… • Currently being used for totaling, averaging, etc • Map/Reduce is a big hammer • Simpler tasks should be easier • Shouldn’t need to write JavaScript • Avoid the overhead of JavaScript engine • We’re seeing requests for help in handling complex documents • Select only matching subdocuments or arrays Friday, 8 June 12
  7. How will we solve the problem? • New aggregation framework

    • Declarative framework (no JavaScript) • Describe a chain of operations to apply • Expression evaluation • Return computed values • Framework: new operations added easily • C++ implementation Friday, 8 June 12
  8. Aggregation - Pipelines • Aggregation requests specify a pipeline •

    A pipeline is a series of operations • Members of a collection are passed through a pipeline to produce a result • ps -ef | grep -i mongod Friday, 8 June 12
  9. Example - twitter { "_id" : ObjectId("4f47b268fb1c80e141e9888c"), "user" : {

    "friends_count" : 73, "location" : "Brazil", "screen_name" : "Bia_cunha1", "name" : "Beatriz Helena Cunha", "followers_count" : 102, } } • Find the # of followers and # friends by location Friday, 8 June 12
  10. Example - twitter db.tweets.aggregate( {$match: {"user.friends_count": { $gt: 0 },

    "user.followers_count": { $gt: 0 } } }, {$project: { location: "$user.location", friends: "$user.friends_count", followers: "$user.followers_count" } }, {$group: {_id: "$location", friends: {$sum: "$friends"}, followers: {$sum: "$followers"} } } ); Friday, 8 June 12
  11. Example - twitter db.tweets.aggregate( {$match: {"user.friends_count": { $gt: 0 },

    "user.followers_count": { $gt: 0 } } }, {$project: { location: "$user.location", friends: "$user.friends_count", followers: "$user.followers_count" } }, {$group: {_id: "$location", friends: {$sum: "$friends"}, followers: {$sum: "$followers"} } } ); Predicate Friday, 8 June 12
  12. Example - twitter db.tweets.aggregate( {$match: {"user.friends_count": { $gt: 0 },

    "user.followers_count": { $gt: 0 } } }, {$project: { location: "$user.location", friends: "$user.friends_count", followers: "$user.followers_count" } }, {$group: {_id: "$location", friends: {$sum: "$friends"}, followers: {$sum: "$followers"} } } ); Predicate Parts of the document you want to project Friday, 8 June 12
  13. Example - twitter db.tweets.aggregate( {$match: {"user.friends_count": { $gt: 0 },

    "user.followers_count": { $gt: 0 } } }, {$project: { location: "$user.location", friends: "$user.friends_count", followers: "$user.followers_count" } }, {$group: {_id: "$location", friends: {$sum: "$friends"}, followers: {$sum: "$followers"} } } ); Predicate Parts of the document you want to project Function to apply to the result set Friday, 8 June 12
  14. Example - twitter { "result" : [ { "_id" :

    "Far Far Away", "friends" : 344, "followers" : 789 }, ... ], "ok" : 1 } Friday, 8 June 12
  15. Projections • $project can reshape results • Include or exclude

    fields • Computed fields • Arithmetic expressions • Pull fields from nested documents to the top • Push fields from the top down into new virtual documents Friday, 8 June 12
  16. Unwinding • $unwind can “stream” arrays • Array values are

    doled out one at time in the context of their surrounding documents • Makes it possible to filter out elements before returning Friday, 8 June 12
  17. Grouping • $group aggregation expressions • Define a grouping key

    as the _id of the result • Total grouped column values: $sum • Average grouped column values: $avg • Collect grouped column values in an array or set: $push, $addToSet • Other functions • $min, $max, $first, $last Friday, 8 June 12
  18. Sorting • $sort can sort documents • Sort specifications are

    the same as today, e.g., $sort:{ key1: 1, key2: -1, …} Friday, 8 June 12
  19. Computed Expressions • Available in $project operations • Prefix expression

    language • $add:[“$field1”, “$field2”] • $ifNull:[“$field1”, “$field2”] • Nesting: $add:[“$field1”, $ifNull:[“$field2”, “$field3”]] • Other functions…. • $divide, $mod, $multiply Friday, 8 June 12
  20. Computed Expressions • String functions • $toUpper, $toLower, $substr •

    Date field extraction • $year, $month, $day, $hour... • Date arithmetic • $ifNull • Ternary conditional • Return one of two values based on a predicate Friday, 8 June 12
  21. Current Situation Reader Reader Reader Reader MongoDB • Global Reader/Writer

    Lock • Supports Concurrent Readers on a MongoDB instance Friday, 8 June 12
  22. Current Situation Reader Reader Writer Reader MongoDB • Only one

    thread can write at a time • Writer will block all readers • All data is consistent Friday, 8 June 12
  23. How can we scale? Reader Reader Writer Reader Shard 1

    Reader Reader Writer Reader Shard 2 • Writes can be distributed across shards Friday, 8 June 12
  24. Challenges • Need to shard more • Read starvation for

    secondaries on a high write system • Shard even if you don’t have a lot of data • Increases Hardware Cost • Increase Complexity and Operational Cost Friday, 8 June 12
  25. Database Locking Reader Reader Reader Reader MongoDB • Move to

    Database Level Locking • Concurrent reads on a database User DB Product DB Friday, 8 June 12
  26. Database Locking Reader Reader Reader MongoDB • Writes will only

    lock the target database • People using a Replica Set will have at least two database User DB Poduct DB Writer Friday, 8 June 12