Upgrade to Pro — share decks privately, control downloads, hide ads and more …

MongoDB and Aggregation

rozza
October 13, 2012

MongoDB and Aggregation

Talk given at the Dublin MongoDB Usergroup (http://www.meetup.com/DublinMUG/)

rozza

October 13, 2012
Tweet

More Decks by rozza

Other Decks in Technology

Transcript

  1. Group /* SQL */ SELECT a, b, SUM(c) csum FROM

    coll WHERE active = 1 GROUP BY a, b /* In MongoDB */ db.coll.group({ key: { a: true, b: true }, cond: { active: 1 }, reduce: function(obj, prev) { prev.csum += obj.c; }, initial: { csum: 0 } );
  2. Group • Similar to SQL Group By • Limited results

    size (10k) • Doesn't work in sharded systems • No one uses this...
  3. Map Reduce • Map/Reduce can be used for aggregation… •

    Currently being used for totalling, averaging, etc • Map/Reduce is a big hammer • Simpler tasks should be easier • Shouldn't need to write JavaScript • Avoid the overhead of JavaScript engine
  4. Hows it work? post = { "author": "Bob", "title": "Best

    Blog Post Ever", "body": "Blah blah blah", "tags": ['best', 'blog', '#omg'], "comments": [ {"author": "Arnie", "body": "wow - just wow" } ] }
  5. Hows it work? map = function() { this.tags.forEach(function(tag) { emit(tag,

    1); }); } reduce = function(key, values) { var total = 0; values.forEach(function(value){ total += value; }); return total; }
  6. Results { "results" : [ {"_id" : "#omg", "value" :

    1}, {"_id" : "best", "value" : 1}, {"_id" : "blog", "value" : 2}, ], "timeMillis" : 0, "counts" : { "input" : 2, "emit" : 6, "reduce" : 1, "output" : 5 }, "ok" : 1, }
  7. Map Reduce • Simple - if you know javascript •

    Nightmare to debug • Single threaded • Parsing BSON in and out of the JS engine expensive • Reduce only called if more than 1 emit...
  8. Aggregation Framework • Declarative (BSON, not JavaScript) • Implemented in

    C++ • Flexible, functional and simple • Operation pipeline • Computational expressions • Plays nice with sharding
  9. All about the pipeline • Process a stream of documents

    • Original input is a collection • Final output is a result document • Series of operators • Filter or transform data • Input/output chain
  10. All about the pipeline • Process a stream of documents

    • Original input is a collection • Final output is a result document • Series of operators • Filter or transform data • Input/output chain ps ax | grep mongod | head -n 1
  11. Core Pipeline Operations $match • Filter documents • Place early

    in the pipeline if possible • Uses existing query syntax • No geospatial operations or $where
  12. { title: "The Great Gatsby", pages: 218, language: "English" },

    { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } { title: "The Great Gatsby", pages: 218, language: "English" }, { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } $match Find all Russian books
  13. { title: "The Great Gatsby", pages: 218, language: "English" },

    { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } { title: "The Great Gatsby", pages: 218, language: "English" }, { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } $match { $match: { language: "Russian" }} Find all Russian books
  14. { title: "The Great Gatsby", pages: 218, language: "English" },

    { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } $match { $match: { language: "Russian" }} { title: "War and Peace", pages: 1440, language: "Russian" } Find all Russian books
  15. $group • Group documents by an id • Field path

    reference • Object with multiple references • Constant value • Other output fields are computed $max, $min, $avg, $sum, $addToSet, $push, $first, $last • Processes all data in memory
  16. { title: "The Great Gatsby", pages: 218, language: "English" },

    { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } { title: "The Great Gatsby", pages: 218, language: "English" }, { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } $group Calculate the average number of pages per language
  17. { title: "The Great Gatsby", pages: 218, language: "English" },

    { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } { title: "The Great Gatsby", pages: 218, language: "English" }, { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } $group { $group: { _id: "$language", avgPages: {$avg: "$pages"} }} Calculate the average number of pages per language
  18. { title: "The Great Gatsby", pages: 218, language: "English" },

    { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } $group { $group: { _id: "$language", avgPages: {$avg: "$pages"} }} { _id: "Russian", avgPages: 1440 }, { _id: "English", avgPages: 653 } Calculate the average number of pages per language
  19. { title: "The Great Gatsby", pages: 218, language: "English" },

    { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } { title: "The Great Gatsby", pages: 218, language: "English" }, { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } $group { $group: { _id: "$language", numTitles: { $sum: 1 }, sumPages: { $sum:"$pages"} }} Calculate number of books and pages per language
  20. { title: "The Great Gatsby", pages: 218, language: "English" },

    { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" } $group { $group: { _id: "$language", numTitles: { $sum: 1 }, sumPages: { $sum:"$pages"} }} { _id: "Russian", numTitles: 1, sumPages: 1440 }, { _id: "English", numTitles: 2, sumPages: 1306 } Calculate number of books and pages per language
  21. $unwind • Operate on an array field • Yield documents

    for each array value • Nothing for absent or empty fields • Error for non-array fields • Complements $match / $group
  22. { title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long

    Island", "New York", "1920s" ] } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long Island", "New York", "1920s" ] } $unwind Yielding multiple documents from one
  23. { title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long

    Island", "New York", "1920s" ] } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long Island", "New York", "1920s" ] } $unwind { $unwind: "$subjects" } Yielding multiple documents from one
  24. { title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long

    Island", "New York", "1920s" ] } $unwind { $unwind: "$subjects" } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island" }, { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York" }, { title: "The Great Gatsby", ISBN: "9781857150193", Yielding multiple documents from one
  25. Other operators $sort • Sort documents by one or more

    fields • Uses familiar cursor format • Waits for earlier pipeline operator to return • In-memory unless early and indexed $skip • Skip over documents in the pipeline $limit • Skip over documents in the pipeline
  26. { title: "The Great Gatsby" }, { title: "Brave New

    World" }, { title: "The Grapes of Wrath"}, { title: "Animal Farm" }, { title: "Lord of the Flies" }, { title: "Fathers and Sons" }, { title: "Invisible Man" }, { title: "Fahrenheit 451" } { title: "The Great Gatsby" }, { title: "Brave New World" }, { title: "The Grapes of Wrath"}, { title: "Animal Farm" }, { title: "Lord of the Flies" }, { title: "Fathers and Sons" }, { title: "Invisible Man" }, { title: "Fahrenheit 451" } $sort, limit, skip Paginate through an alphabetical list
  27. { $sort: { title: 1 }} { title: "The Great

    Gatsby" }, { title: "Brave New World" }, { title: "The Grapes of Wrath"}, { title: "Animal Farm" }, { title: "Lord of the Flies" }, { title: "Fathers and Sons" }, { title: "Invisible Man" }, { title: "Fahrenheit 451" } $sort, limit, skip { title: "Animal Farm" }, { title: "Brave New World" }, { title: "Fahrenheit 451" }, { title: "Fathers and Sons" }, { title: "Invisible Man" }, { title: "Lord of the Flies" }, { title: "The Grapes of Wrath"}, { title: "The Great Gatsby" } Paginate through an alphabetical list
  28. { $sort: { title: 1 }} { $sort: { title:

    1 }}, { $skip : 2} { title: "The Great Gatsby" }, { title: "Brave New World" }, { title: "The Grapes of Wrath"}, { title: "Animal Farm" }, { title: "Lord of the Flies" }, { title: "Fathers and Sons" }, { title: "Invisible Man" }, { title: "Fahrenheit 451" } $sort, limit, skip Paginate through an alphabetical list { title: "Fahrenheit 451" }, { title: "Fathers and Sons" }, { title: "Invisible Man" }, { title: "Lord of the Flies" }, { title: "The Grapes of Wrath"}, { title: "The Great Gatsby" }
  29. { $sort: { title: 1 }} { $sort: { title:

    1 }}, { $skip : 2} { title: "The Great Gatsby" }, { title: "Brave New World" }, { title: "The Grapes of Wrath"}, { title: "Animal Farm" }, { title: "Lord of the Flies" }, { title: "Fathers and Sons" }, { title: "Invisible Man" }, { title: "Fahrenheit 451" } $sort, limit, skip { $sort: { title: 1 }}, { $skip : 2}, { $limit : 2 } Paginate through an alphabetical list { title: "Fahrenheit 451" }, { title: "Fathers and Sons" }
  30. Usage db.col.aggregate() db.runCommand({ aggregate : <col>, pipeline : [] });

    Limitations • Beware of command result size limit • Document size limit is 16MB • Unsupported field types • Binary, Code, MinKey, MaxKey • Deprecated BSON types
  31. Sharding support • Supports sharding • Mongos analyses pipeline, and

    forwards operations up to $group or $sort to shards; combines shard server results and returns them
  32. Usage Tips • Use $match in a pipeline as early

    as possible • The query optimizer can then choose to scan an index and avoid scanning the entire collection • Use $sort in a pipeline as early as possible • The query optimizer can then be used to choose an index to scan instead of sorting the result
  33. Future Plans • More optimisations • $out pipeline operation •

    Saves the document stream to a collection • Similar to M/R $out, but with sharded output • Functions like a tee, so that intermediate results can be saved
  34. Financial Candlesticks • Plotting the EUR/GBP exchange rate • For

    every minute calculate the: $high, $low, $min, $max, $avg • Paginate through the dataset
  35. /* Required output */ { "_id" : Date("2012-02-16T12:57:00Z"), "bid" :

    { "open" : 1.29994, "close" : 1.29997, "high" : 1.3001, "low" : 1.29992, "avg" : 1.2999995 } } /* Sample */ { "_id" : ObjectId("xxx"), "bid" : 1.3002, "ts" : Date("2012-02-16T12:57:00Z") } Financial Candlesticks
  36. Possible in map reduce // Candle stick graph m =

    function() { new_date = new Date(this.ts); // Turn the ts into a minute bucket minute = new Date(new_date.setSeconds(0)) emit(minute, this); } r = function(key, values) { var bid = { low: Number.MAX_VALUE, high: Number.MIN_VALUE, count: 0, total: 0, open: 0, open_ts: 0, close: 0, close_ts: 0, } values.forEach(function(v) { var b = v['bid']; var a = v['ask']; var ts = v['ts']; // first - find opening bid in the range if (bid.open_ts == 0) { bid.open_ts = ts; bid.open = b; } else if ( ts < bid.open_ts ) { bid.open_ts = ts; bid.open = b; } // last - final closing bid in the range if (bid.close_ts == 0) { bid.close_ts = ts; bid.close = b; } else if ( ts > bid.close_ts ) { bid.close_ts = ts; bid.close = b; } // max - find max bid for the range if (b > bid.high) { bid.high = b; } // min - find min bid for the range if (b < bid.low) { bid.low = b; } // count - count the number of entries for the range bid.count += 1; // sum - sum the bid for the range bid.total += b; }); return { minute: key, bid: bid}; } f = function(key, obj) { // average - calc the average for if (obj.bid.count > 0){ obj.bid.avg = obj.bid.total / } delete obj.bid.count; delete obj.bid.open_ts; delete obj.bid.close_ts; delete obj.bid.total; delete obj.minute; return obj; } // Run the Map/Reduce db.money.mapReduce(m, r, {out : {in
  37. Aggregation framework {$project : { minute : { 0: {"$year":

    "$ts"}, 1: {"$month": "$ts"}, 2: {"$dayOfMonth": "$ts"}, 3: {"$hour": "$ts"}, 4: {"$minute": "$ts"} }, ts : 1, bid : 1 }}, { $sort : { "ts" : 1 }}, { $group : { _id : "$minute", ts : { $first: "$ts"}, bid_open : { $first : "$bid"}, bid_close : { $last : "$bid"}, bid_high : { $max : "$bid"}, bid_low : { $min : "$bid"}, bid_avg : { $avg : "$bid"} }}, // Continued... { $sort : { _id : 1 }}, { $skip : 0 }, { $limit : 5 }, { $project : { _id : "$ts", bid : { open : "$bid_open", close : "$bid_close", high : "$bid_high", low : "$bid_low", avg : "$bid_avg" } } }