Upgrade to Pro — share decks privately, control downloads, hide ads and more …

MongoDB and Aggregation

Sponsored · Ship Features Fearlessly Turn features on and off without deploys. Used by thousands of Ruby developers.
Avatar for rozza rozza
October 13, 2012

MongoDB and Aggregation

Talk given at the Dublin MongoDB Usergroup (http://www.meetup.com/DublinMUG/)

Avatar for rozza

rozza

October 13, 2012
Tweet

More Decks by rozza

Other Decks in Technology

Transcript

  1. Group /* SQL */ SELECT a, b, SUM(c) csum FROM

    coll WHERE active = 1 GROUP BY a, b /* In MongoDB */ db.coll.group({ key: { a: true, b: true }, cond: { active: 1 }, reduce: function(obj, prev) { prev.csum += obj.c; }, initial: { csum: 0 } );
  2. Group • Similar to SQL Group By • Limited results

    size (10k) • Doesn't work in sharded systems • No one uses this...
  3. Map Reduce • Map/Reduce can be used for aggregation… •

    Currently being used for totalling, averaging, etc • Map/Reduce is a big hammer • Simpler tasks should be easier • Shouldn't need to write JavaScript • Avoid the overhead of JavaScript engine
  4. Hows it work? post = { "author": "Bob", "title": "Best

    Blog Post Ever", "body": "Blah blah blah", "tags": ['best', 'blog', '#omg'], "comments": [ {"author": "Arnie", "body": "wow - just wow" } ] }
  5. Hows it work? map = function() { this.tags.forEach(function(tag) { emit(tag,

    1); }); } reduce = function(key, values) { var total = 0; values.forEach(function(value){ total += value; }); return total; }
  6. Results { "results" : [ {"_id" : "#omg", "value" :

    1}, {"_id" : "best", "value" : 1}, {"_id" : "blog", "value" : 2}, ], "timeMillis" : 0, "counts" : { "input" : 2, "emit" : 6, "reduce" : 1, "output" : 5 }, "ok" : 1, }
  7. Map Reduce • Simple - if you know javascript •

    Nightmare to debug • Single threaded • Parsing BSON in and out of the JS engine expensive • Reduce only called if more than 1 emit...
  8. Aggregation Framework • Declarative (BSON, not JavaScript) • Implemented in

    C++ • Flexible, functional and simple • Operation pipeline • Computational expressions • Plays nice with sharding
  9. All about the pipeline • Process a stream of documents

    • Original input is a collection • Final output is a result document • Series of operators • Filter or transform data • Input/output chain
  10. All about the pipeline • Process a stream of documents

    • Original input is a collection • Final output is a result document • Series of operators • Filter or transform data • Input/output chain ps ax | grep mongod | head -n 1
  11. Core Pipeline Operations $match • Filter documents • Place early

    in the pipeline if possible • Uses existing query syntax • No geospatial operations or $where
  12. { title: "The Great Gatsby", pages: 218, language: "English" },

    { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } { title: "The Great Gatsby", pages: 218, language: "English" }, { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } $match Find all Russian books
  13. { title: "The Great Gatsby", pages: 218, language: "English" },

    { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } { title: "The Great Gatsby", pages: 218, language: "English" }, { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } $match { $match: { language: "Russian" }} Find all Russian books
  14. { title: "The Great Gatsby", pages: 218, language: "English" },

    { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } $match { $match: { language: "Russian" }} { title: "War and Peace", pages: 1440, language: "Russian" } Find all Russian books
  15. $group • Group documents by an id • Field path

    reference • Object with multiple references • Constant value • Other output fields are computed $max, $min, $avg, $sum, $addToSet, $push, $first, $last • Processes all data in memory
  16. { title: "The Great Gatsby", pages: 218, language: "English" },

    { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } { title: "The Great Gatsby", pages: 218, language: "English" }, { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } $group Calculate the average number of pages per language
  17. { title: "The Great Gatsby", pages: 218, language: "English" },

    { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } { title: "The Great Gatsby", pages: 218, language: "English" }, { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } $group { $group: { _id: "$language", avgPages: {$avg: "$pages"} }} Calculate the average number of pages per language
  18. { title: "The Great Gatsby", pages: 218, language: "English" },

    { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } $group { $group: { _id: "$language", avgPages: {$avg: "$pages"} }} { _id: "Russian", avgPages: 1440 }, { _id: "English", avgPages: 653 } Calculate the average number of pages per language
  19. { title: "The Great Gatsby", pages: 218, language: "English" },

    { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } { title: "The Great Gatsby", pages: 218, language: "English" }, { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } $group { $group: { _id: "$language", numTitles: { $sum: 1 }, sumPages: { $sum:"$pages"} }} Calculate number of books and pages per language
  20. { title: "The Great Gatsby", pages: 218, language: "English" },

    { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } $group { $group: { _id: "$language", numTitles: { $sum: 1 }, sumPages: { $sum:"$pages"} }} { _id: "Russian", numTitles: 1, sumPages: 1440 }, { _id: "English", numTitles: 2, sumPages: 1306 } Calculate number of books and pages per language
  21. $unwind • Operate on an array field • Yield documents

    for each array value • Nothing for absent or empty fields • Error for non-array fields • Complements $match / $group
  22. { title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long

    Island", "New York", "1920s" ] } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long Island", "New York", "1920s" ] } $unwind Yielding multiple documents from one
  23. { title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long

    Island", "New York", "1920s" ] } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long Island", "New York", "1920s" ] } $unwind { $unwind: "$subjects" } Yielding multiple documents from one
  24. { title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long

    Island", "New York", "1920s" ] } $unwind { $unwind: "$subjects" } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island" }, { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York" }, { title: "The Great Gatsby", ISBN: "9781857150193", Yielding multiple documents from one
  25. Other operators $sort • Sort documents by one or more

    fields • Uses familiar cursor format • Waits for earlier pipeline operator to return • In-memory unless early and indexed $skip • Skip over documents in the pipeline $limit • Skip over documents in the pipeline
  26. { title: "The Great Gatsby" }, { title: "Brave New

    World" }, { title: "The Grapes of Wrath"}, { title: "Animal Farm" }, { title: "Lord of the Flies" }, { title: "Fathers and Sons" }, { title: "Invisible Man" }, { title: "Fahrenheit 451" } { title: "The Great Gatsby" }, { title: "Brave New World" }, { title: "The Grapes of Wrath"}, { title: "Animal Farm" }, { title: "Lord of the Flies" }, { title: "Fathers and Sons" }, { title: "Invisible Man" }, { title: "Fahrenheit 451" } $sort, limit, skip Paginate through an alphabetical list
  27. { $sort: { title: 1 }} { title: "The Great

    Gatsby" }, { title: "Brave New World" }, { title: "The Grapes of Wrath"}, { title: "Animal Farm" }, { title: "Lord of the Flies" }, { title: "Fathers and Sons" }, { title: "Invisible Man" }, { title: "Fahrenheit 451" } $sort, limit, skip { title: "Animal Farm" }, { title: "Brave New World" }, { title: "Fahrenheit 451" }, { title: "Fathers and Sons" }, { title: "Invisible Man" }, { title: "Lord of the Flies" }, { title: "The Grapes of Wrath"}, { title: "The Great Gatsby" } Paginate through an alphabetical list
  28. { $sort: { title: 1 }} { $sort: { title:

    1 }}, { $skip : 2} { title: "The Great Gatsby" }, { title: "Brave New World" }, { title: "The Grapes of Wrath"}, { title: "Animal Farm" }, { title: "Lord of the Flies" }, { title: "Fathers and Sons" }, { title: "Invisible Man" }, { title: "Fahrenheit 451" } $sort, limit, skip Paginate through an alphabetical list { title: "Fahrenheit 451" }, { title: "Fathers and Sons" }, { title: "Invisible Man" }, { title: "Lord of the Flies" }, { title: "The Grapes of Wrath"}, { title: "The Great Gatsby" }
  29. { $sort: { title: 1 }} { $sort: { title:

    1 }}, { $skip : 2} { title: "The Great Gatsby" }, { title: "Brave New World" }, { title: "The Grapes of Wrath"}, { title: "Animal Farm" }, { title: "Lord of the Flies" }, { title: "Fathers and Sons" }, { title: "Invisible Man" }, { title: "Fahrenheit 451" } $sort, limit, skip { $sort: { title: 1 }}, { $skip : 2}, { $limit : 2 } Paginate through an alphabetical list { title: "Fahrenheit 451" }, { title: "Fathers and Sons" }
  30. Usage db.col.aggregate() db.runCommand({ aggregate : <col>, pipeline : [] });

    Limitations • Beware of command result size limit • Document size limit is 16MB • Unsupported field types • Binary, Code, MinKey, MaxKey • Deprecated BSON types
  31. Sharding support • Supports sharding • Mongos analyses pipeline, and

    forwards operations up to $group or $sort to shards; combines shard server results and returns them
  32. Usage Tips • Use $match in a pipeline as early

    as possible • The query optimizer can then choose to scan an index and avoid scanning the entire collection • Use $sort in a pipeline as early as possible • The query optimizer can then be used to choose an index to scan instead of sorting the result
  33. Future Plans • More optimisations • $out pipeline operation •

    Saves the document stream to a collection • Similar to M/R $out, but with sharded output • Functions like a tee, so that intermediate results can be saved
  34. Financial Candlesticks • Plotting the EUR/GBP exchange rate • For

    every minute calculate the: $high, $low, $min, $max, $avg • Paginate through the dataset
  35. /* Required output */ { "_id" : Date("2012-02-16T12:57:00Z"), "bid" :

    { "open" : 1.29994, "close" : 1.29997, "high" : 1.3001, "low" : 1.29992, "avg" : 1.2999995 } } /* Sample */ { "_id" : ObjectId("xxx"), "bid" : 1.3002, "ts" : Date("2012-02-16T12:57:00Z") } Financial Candlesticks
  36. Possible in map reduce // Candle stick graph m =

    function() { new_date = new Date(this.ts); // Turn the ts into a minute bucket minute = new Date(new_date.setSeconds(0)) emit(minute, this); } r = function(key, values) { var bid = { low: Number.MAX_VALUE, high: Number.MIN_VALUE, count: 0, total: 0, open: 0, open_ts: 0, close: 0, close_ts: 0, } values.forEach(function(v) { var b = v['bid']; var a = v['ask']; var ts = v['ts']; // first - find opening bid in the range if (bid.open_ts == 0) { bid.open_ts = ts; bid.open = b; } else if ( ts < bid.open_ts ) { bid.open_ts = ts; bid.open = b; } // last - final closing bid in the range if (bid.close_ts == 0) { bid.close_ts = ts; bid.close = b; } else if ( ts > bid.close_ts ) { bid.close_ts = ts; bid.close = b; } // max - find max bid for the range if (b > bid.high) { bid.high = b; } // min - find min bid for the range if (b < bid.low) { bid.low = b; } // count - count the number of entries for the range bid.count += 1; // sum - sum the bid for the range bid.total += b; }); return { minute: key, bid: bid}; } f = function(key, obj) { // average - calc the average for if (obj.bid.count > 0){ obj.bid.avg = obj.bid.total / } delete obj.bid.count; delete obj.bid.open_ts; delete obj.bid.close_ts; delete obj.bid.total; delete obj.minute; return obj; } // Run the Map/Reduce db.money.mapReduce(m, r, {out : {in
  37. Aggregation framework {$project : { minute : { 0: {"$year":

    "$ts"}, 1: {"$month": "$ts"}, 2: {"$dayOfMonth": "$ts"}, 3: {"$hour": "$ts"}, 4: {"$minute": "$ts"} }, ts : 1, bid : 1 }}, { $sort : { "ts" : 1 }}, { $group : { _id : "$minute", ts : { $first: "$ts"}, bid_open : { $first : "$bid"}, bid_close : { $last : "$bid"}, bid_high : { $max : "$bid"}, bid_low : { $min : "$bid"}, bid_avg : { $avg : "$bid"} }}, // Continued... { $sort : { _id : 1 }}, { $skip : 0 }, { $limit : 5 }, { $project : { _id : "$ts", bid : { open : "$bid_open", close : "$bid_close", high : "$bid_high", low : "$bid_low", avg : "$bid_avg" } } }