
Aggregation Framework Performance Pitfalls

Naomi Pentrel
November 06, 2018

The aggregation framework is incredibly powerful. But with great power comes great responsibility. For application developers, this means ensuring that queries are performant. In this talk, I will share with you some key tips to avoid the most common aggregation framework performance pitfalls in your own deployments.


Transcript

  1. By Naomi Pentrel Aggregation Framework: Performance Pitfalls. 30min in the

    life of a Support Engineer @naomi_pen ➜ db.collection.aggregate( [ { <stage> }, ... ] )
  2. @naomi_pen MongoDB's Aggregation Framework allows you to manipulate and transform

    data in ways that are not possible with normal queries.
  3. Aggregation Framework: A Reminder. Aggregation Pipeline: In Theory @naomi_pen #

    Pipeline Stages (think Unix pipes): ➜ db.collection.aggregate( [ { <stage> }, ... ] ) # Operator expressions: ➜ { <operator>: [ <argument1>, <argument2>, ... ] } ➜ { <operator>: <argument> }
  4. Aggregation Framework: A Reminder Example: Grouping Orders @naomi_pen ➜ db.orders.aggregate(

    [ { $match: { status: "A"} }, { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }, ] ) pipeline specification Operator
  5. Aggregation Framework: A Reminder. Pipeline Stages @naomi_pen $addFields

    $bucket $bucketAuto $collStats $count $currentOp $facet $geoNear $graphLookup $group $indexStats $limit $listLocalSessions $listSessions $lookup $match $out $project $redact $replaceRoot $sample $skip $sort $sortByCount $unwind
  6. Aggregation Framework: A Reminder Pipeline Operators Arithmetic Expression Operators $abs

    $add $ceil $divide $exp $floor $ln $log $log10 $mod $multiply $pow $sqrt $subtract $trunc 
 Array Expression Operators $arrayElemAt $arrayToObject $concatArrays $filter $in $indexOfArray $isArray $map $objectToArray $range $reduce $reverseArray $size $slice $zip Boolean Expression Operators $and $not $or Comparison Expression Operators $cmp $eq $gt $gte $lt $lte $ne 
 Conditional Expression Operators $cond $ifNull $switch Date Expression Operators $dateFromParts $dateFromString $dateToParts $dateToString $dayOfMonth $dayOfWeek $dayOfYear $hour $isoDayOfWeek $isoWeek $isoWeekYear $millisecond $minute $month $second $toDate $week $year $add $subtract Literal Expression Operator $literal Object Expression Operators $mergeObjects $objectToArray Set Expression Operators $allElementsTrue $anyElementTrue $setDifference $setEquals $setIntersection $setIsSubset $setUnion
 String Expression Operators $concat $dateFromString $dateToString $indexOfBytes $indexOfCP $ltrim $rtrim $split $strLenBytes $strLenCP $strcasecmp $substrBytes $substrCP $toLower $toString $trim $toUpper Text Expression Operator $meta Type Expression Operators $convert $toBool $toDate $toDecimal $toDouble $toInt $toLong $toObjectId $toString $type Accumulators ($group) $addToSet $avg $first $last $max $mergeObjects $min $push $stdDevPop $stdDevSamp $sum Accumulators ($project) $avg $max $min $stdDevPop $stdDevSamp $sum Variable Expression Operators $let @naomi_pen A bit much? Here’s the reference: bit.ly/MongoDBAR
  7. Aggregation Framework: A Reminder Explain…? @naomi_pen ➜ db.sales.aggregate( [ {

    $group: { _id: "$date", revenue: {$sum: "$amount"} } } ], {explain: true} )
  8. @naomi_pen Truths about Aggregations: 1 Aggregations require more RAM and CPU

    2 Each stage has a manipulated copy of the data 3 May use lots of memory; allocated outside WiredTiger 4 Aggregations need consideration, especially if: the amount of data to process is large, the operation will be run frequently, or the operation is intended to service a real-time app feature
  9. Automatic Optimizations The Optimizer will transform your queries @naomi_pen •

    $project or $addFields + $match Sequence Optimization • $sort + $match Sequence Optimization • $redact + $match Sequence Optimization • $project + $skip Sequence Optimization • $sort + $limit Coalescence • $limit + $limit Coalescence • $skip + $skip Coalescence • $match + $match Coalescence • $lookup + $unwind Coalescence • $sort + $skip + $limit Sequence
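The $sort + $limit coalescence is worth internalizing: instead of sorting the full result set, the server only has to track the top N documents. A plain-JavaScript sketch of the idea (a concept demo, not MongoDB's actual implementation):

```javascript
// Concept sketch: why coalescing $sort + $limit is cheaper.
// A full sort materializes every document; the coalesced top-N
// only ever holds `limit` documents in memory at once.

function fullSortThenLimit(docs, limit) {
  // Naive pipeline: sort everything, then slice.
  return [...docs].sort((a, b) => b.rating - a.rating).slice(0, limit);
}

function coalescedTopN(docs, limit) {
  // Coalesced version: maintain a small buffer of the current top N.
  const top = [];
  for (const doc of docs) {
    top.push(doc);
    top.sort((a, b) => b.rating - a.rating);
    if (top.length > limit) top.pop(); // never hold more than `limit` docs
  }
  return top;
}

const movies = [
  { title: "A", rating: 6.1 },
  { title: "B", rating: 8.6 },
  { title: "C", rating: 7.9 },
  { title: "D", rating: 9.0 },
];

console.log(coalescedTopN(movies, 2).map((m) => m.title)); // same answer as the full sort, far less memory on large inputs
```

The same intuition explains why placing an explicit $limit directly after a $sort helps: the optimizer can only coalesce the two stages when nothing between them changes the number of documents.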
  10. The Customer Meet Robin @naomi_pen Robin is a Software Engineer

    at Petflix They love MongoDB for its ease of use Robin made the decision to move Petflix to MongoDB
  11. @naomi_pen { _id: "tt0944947", title: "Game of Throws", genres: [

    "Drama", "Adventure", "Fantasy" ], director: "Jeremy Doggo", rating: 8.6, description: "Nine noble dog families fight over throws, (…)", actors: [ "Emilia Samoyed", "Peter Terrier", … ], year: "2011" … } { _id: { user: "naomi", time_started: "1541364977160", }, movie_id: "tt0944947", location: "Germany" } { _id: "naomi", email: "[email protected]", … } Videos Views Users The Customer: Data model
  12. Support Ticket 1 @naomi_pen CPU Spikes with Query Hi there,

    We have an issue with one of our queries. We count movies for two genres, and list titles for each genre in our new UI. But it’s slow and causes spikes: db.movies.aggregate([ { $unwind: "$genres" }, { $match: { "genres": { $in: ["Fantasy", "Adventure"] } } }, { $group: {"_id": "$genres", "count": {$sum: 1}, "titles": {$push: "$title"}}} ]) Thanks, Robin
 Robin from Petflix 1 minute ago
  13. Support Ticket 1 Investigate the query @naomi_pen ➜ db.movies.aggregate([ {

    $unwind: "$genres" }, { $match: { "genres": { $in: ["Fantasy", "Adventure"] } } }, { $group: { "_id": "$genres", "count":{$sum: 1}, "titles": {$push: "$title"} } } ])
  14. Support Ticket 1 Investigate the logs* @naomi_pen 2018-11-04T15:11:59.670-0800 I COMMAND

    [conn2] command petflix.movies appName: "MongoDB Shell" command: aggregate { aggregate: "movies", pipeline: [ { $unwind: "$genres" }, { $match: { genres: { $in: [ "Fantasy", "Adventure" ] } } }, { $group: { _id: "$genres", count: { $sum: 1.0 }, titles: { $push: "$title" } } } ], cursor: {}, lsid: { id: UUID("610579f8- c68e-44f7-881d-2ddc407e5594") }, $db: "petflix" } planSummary: COLLSCAN keysExamined:0 docsExamined:49997 cursorExhausted:1 numYields:392 nreturned:2 reslen:813 locks:{ Global: { acquireCount: { r: 397 } }, Database: { acquireCount: { r: 397 } }, Collection: { acquireCount: { r: 397 } } } protocol:op_msg 121ms *Queries only appear in logs if they run for 100+ ms. Queries not in the logs can still cause load!
  15. Support Ticket 1 Investigate the query @naomi_pen ➜ db.movies.aggregate([ {

    $unwind: "$genres" }, { $match: { "genres": { $in: ["Fantasy", "Adventure"] } } }, { $group: { "_id": "$genres", "count":{$sum: 1}, "titles": {$push: "$title"} } } ])
  16. Support Ticket 1 { $unwind: "$genres" } @naomi_pen { _id:

    "tt0944947", title: "Game of Throws", genres: [ "Drama", "Adventure", "Fantasy" ], director: "Jeremy Doggo", rating: 8.6, description: "Nine noble dog families fight over throws, (…)", actors: [ "Emilia Samoyed", "Peter Terrier", … ], year: "2011" … } { _id: "tt0944947", title: "Game of Throws", genres: "Drama", director: "Jeremy Doggo", rating: 8.6, description: "Nine noble dog families fight over throws, (…)", actors: [ "Emilia Samoyed", "Peter Terrier", … ], year: "2011" … } { _id: "tt0944947", title: "Game of Throws", genres: "Adventure", director: "Jeremy Doggo", rating: 8.6, description: "Nine noble dog families fight over throws, (…)", actors: [ "Emilia Samoyed", "Peter Terrier", … ], year: "2011" … } { _id: "tt0944947", title: "Game of Throws", genres: "Fantasy", director: "Jeremy Doggo", rating: 8.6, description: "Nine noble dog families fight over throws, (…)", actors: [ "Emilia Samoyed", "Peter Terrier", … ], year: "2011" … } { $match: { "genres": { $in: ["Fantasy", "Adventure"] } }
  17. Support Ticket 1 { $unwind: "$genres" } @naomi_pen { _id:

    "tt1032765", title: "Catmanji", genres: [ "Comedy", "Family", ], director: "Kit Cat", rating: 7.9, description: "Four cats play a game that gets way too real (…)", actors: [ "Cat McCatface", "Kitty Fish"… ], year: "2018" … } { _id: "tt1032765", title: "Catmanji", genres: "Comedy", director: "Kit Cat", rating: 7.9, description: "Four cats play a game that gets way too real (…)", actors: [ "Cat McCatface", "Kitty Fish"… ], year: "2018" … } { _id: "tt1032765", title: "Catmanji", genres: "Family", director: "Kit Cat", rating: 7.9, description: "Four cats play a game that gets way too real (…)", actors: [ "Cat McCatface", "Kitty Fish"… ], year: "2018" … } { $match: { "genres": { $in: [ "Fantasy", “Adventure” ] } } }
  18. Support Ticket 1 Solution: Duplicate the $match stage & ensure

    index @naomi_pen ➜ db.movies.aggregate([ { $match: { "genres": { $in: ["Fantasy", "Adventure"] } } }, { $unwind: "$genres" }, { $match: { "genres": { $in: ["Fantasy", "Adventure"] } } }, { $group: { "_id": "$genres", "count":{$sum: 1}, "titles": {$push: "$title"} } } ])
  19. Support Ticket 1 @naomi_pen Re: CPU Spikes with Query We’re

    happy to help. To improve your query’s performance you need to limit input as much as possible. In your query the match after the unwind discards all documents that aren't in those genres. If you add an additional match before the unwind, then documents that don’t have those genres are never unwound, because they never enter the pipeline. db.movies.aggregate([ { $match: {"genres": {$in: ["Fantasy", "Adventure"]}} }, { $unwind: "$genres" }, { $match: {"genres": {$in: ["Fantasy", "Adventure"]}} }, { $group: {"_id": "$genres", "count": {"$sum": 1}, "titles": {$push: "$title"}} } ]) As a bonus, the initial match stage can also make use of existing indexes, so ensure you have an index on genres. Does this help? Cheers, Naomi Naomi from MongoDB
  20. Support Ticket 1 Now in the logs* @naomi_pen 2018-11-04T15:14:51.491-0800 I

    COMMAND [conn2] command petflix.movies appName: "MongoDB Shell" command: aggregate { aggregate: "movies", pipeline: [ { $match: { genres: { $in: [ "Fantasy", "Adventure" ] } } }, { $unwind: "$genres" }, { $match: { genres: { $in: [ "Fantasy", "Adventure" ] } } }, { $group: { _id: "$genres", count: { $sum: 1.0 }, titles: { $push: "$title" } } } ], cursor: {}, lsid: { id: UUID("610579f8-c68e-44f7-881d-2ddc407e5594") }, $db: "petflix" } planSummary: IXSCAN { genres: 1 } keysExamined:28 docsExamined:18 cursorExhausted:1 numYields:0 nreturned:2 reslen:813 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_msg 1ms *Logs by default only log queries that run longer than 100ms.
  21. Support Ticket 2 @naomi_pen New Join Query is Slow Hi

    team, We're again experiencing performance issues. Can you look into this? db.movies.aggregate([ { $match: {"genres": "Drama"} }, { $lookup: { from : "views", localField: "_id", foreignField: "movie_id", as: "viewsData" } }, { $project: { viewsData: 1, title: 1} }, { $sort: { "viewsData.time_started": -1}} ]) Cheers, Robin Robin from Petflix 3 minutes ago
  22. Support Ticket 2 Investigate the query @naomi_pen ➜ db.movies.aggregate([ {

    $match: {"genres": "Drama"} }, { $lookup: { from : "views", localField: "_id", foreignField: "movie_id", as: "viewsData" } }, { $project: { viewsData: 1, title: 1} }, { $sort: { "viewsData.time_started": -1} } ])
  23. Support Ticket 2 Investigate the logs* @naomi_pen 2018-11-04T16:35:00.235-0800 I COMMAND

    [conn4] command petflix.views appName: "MongoDB Shell" command: aggregate { aggregate: "movies", pipeline: [ { $match: { genres: { $in: [ "Drama" ] } } }, { $lookup: { from: "views", localField: "_id", foreignField: "movie_id", as: "viewsData" } }, { $project: { viewsData: 1.0, title: 1.0 } }, { $sort: { viewsData.time_started: -1.0 } } ], cursor: {}, lsid: { id: UUID("617b7851-4985-47e8-a647-991173e60667") }, $db: "petflix" } planSummary: IXSCAN { genres: 1 } cursorid:8208857423668239231 keysExamined:24997 docsExamined:24997 hasSortStage:1 numYields:200310 nreturned:51 reslen:15709 locks:{ Global: { acquireCount: { r: 250310 } }, Database: { acquireCount: { r: 250310 } }, Collection: { acquireCount: { r: 250309 } } } protocol:op_msg 11172ms *Queries only appear in logs if they run for 100+ ms. Queries not in the logs can still cause load!
  24. Support Ticket 2 Investigate the query & indexes @naomi_pen ➜

    db.movies.aggregate([ { $match: {"genres": "Drama"} }, { $lookup: { from : "views", localField: "_id", foreignField: "movie_id", as: "viewsData" } }, { $project: { viewsData: 1, title: 1} }, { $sort: { "viewsData.time_started": -1} } ]) ➜ db.movies.getIndexes() # result: _id, genres ➜ db.views.getIndexes() # result: _id
  25. Support Ticket 2 Solution: Index views.movie_id, optionally add limit @naomi_pen

    ➜ db.movies.aggregate([ { $match: {"genres": { $in : ["Drama"] }} }, { $lookup: { from : "views", localField: "_id", foreignField: "movie_id", as: "viewsData" } }, { $sort: { "viewsData.time_started": -1}}, { $limit: 50 } ])
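The reason the index on views.movie_id matters so much: without it, every joined input document triggers a full scan of the views collection. A plain-JavaScript sketch of the difference (illustrative only; the Map plays the role of the index):

```javascript
// Concept sketch: $lookup probes the foreign collection once per input
// document. Without an index on the foreignField, each probe is a full
// collection scan (O(movies × views)); with one, each probe is cheap.

function lookupWithoutIndex(movies, views) {
  // Unindexed: scan every view for every movie.
  return movies.map((m) => ({
    ...m,
    viewsData: views.filter((v) => v.movie_id === m._id),
  }));
}

function lookupWithIndex(movies, views) {
  // Indexed: build the "index" once, then each probe is a keyed lookup.
  const byMovieId = new Map();
  for (const v of views) {
    if (!byMovieId.has(v.movie_id)) byMovieId.set(v.movie_id, []);
    byMovieId.get(v.movie_id).push(v);
  }
  return movies.map((m) => ({ ...m, viewsData: byMovieId.get(m._id) ?? [] }));
}

const movies = [{ _id: "tt0944947", title: "Game of Throws" }];
const views = [
  { movie_id: "tt0944947", location: "Germany" },
  { movie_id: "tt1032765", location: "France" },
];

console.log(lookupWithIndex(movies, views)[0].viewsData.length); // identical join result either way; only the cost differs
```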
  26. Support Ticket 2 @naomi_pen Re: New Join Query is Slow

    Hi Robin, Of course! We looked into the query and have a few suggestions: 1. The foreignField specified in the $lookup should be indexed. Please add an index on views.movie_id. This should boost performance quite a bit. 2. If possible you could also add a $limit stage to improve the performance of the sort. Even with these changes, this query may still be slow. If this is a common query pattern, we would suggest considering changes to your data model. Does this help?
 Best,
 Naomi Naomi from MongoDB
  27. Support Ticket 2 Now in the logs @naomi_pen 2018-11-05T03:43:22.109-0800 I

    COMMAND [conn4] command petflix.views appName: "MongoDB Shell" command: aggregate { aggregate: "movies", pipeline: [ { $match: { genres: { $in: [ "Drama" ] } } }, { $lookup: { from: "views", localField: "_id", foreignField: "movie_id", as: "viewsData" } }, { $sort: { viewsData.time_started: -1.0 } }, { $limit: 50.0 } ], cursor: {}, lsid: { id: UUID("617b7851-4985-47e8-a647-991173e60667") }, $db: "petflix" } planSummary: IXSCAN { genres: 1 } keysExamined:24997 docsExamined: 24997 hasSortStage:1 cursorExhausted:1 numYields:208 nreturned:51 reslen: 9100 locks:{ Global: { acquireCount: { r: 50206 } }, Database: { acquireCount: { r: 50206 } }, Collection: { acquireCount: { r: 50205 } } } protocol:op_msg 1245ms
  28. Support Ticket 3 @naomi_pen Help with Metrics Hello old friend,

    Thanks so much for all your help so far. I think I got most of the aggregation framework performance pitfalls figured out by now… Unfortunately our Marketing team is pushing us hard to deliver metrics around how many movies are streamed per country at any given time. We tried to run the queries but I think if we run this query every few seconds it will bring down our deployment. Is there any magic solution? Thanks, Robin Robin from Petflix 5 minutes ago
  29. Support Ticket 3 @naomi_pen Re: Help with Metrics Hi Robin,

    I understand your concern. We agree that the additional load might be a bit much for your deployment. We would suggest you consider using a dedicated analytics node for these queries. That way the additional load would not impact your other nodes. Let us know if this would suit you. If you have any questions let me know.
 Best,
 Naomi Naomi from MongoDB
  30. Support Ticket 4 @naomi_pen URGENT HI, Our analytics node was

    going great. So great in fact we decided to use the analytics node for a search feature. Now everything is burning. But we don't want to completely turn it off. There are only some queries that are really bad. Can we somehow limit a query's impact? HELP Thx, R Robin from Petflix 1 minute ago
  31. Support Ticket 4 @naomi_pen Re: URGENT Hi Robin, You can

    consider setting maxTimeMS (http://bit.ly/maxTimeMS, http://bit.ly/maxTimeMS2) on the queries.
 This setting will protect the database server by ending operations that take longer than maxTimeMS. That said, be aware that:
 • In some cases, partial results are returned by terminated operations. • maxTimeMS is not a guarantee. In high-queue situations, operations that are queued but not active may not be terminated until they become active again. 
 Please note that this is a safeguard, not a solution. To help alleviate the load you should add indexes for commonly used query patterns (to all nodes or only the analytics node). 
 Best, Naomi Naomi from MongoDB
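In the shell, maxTimeMS is passed as an option on the command; a minimal sketch against the collections used in this deck (the pipeline itself is illustrative):

```javascript
db.movies.aggregate(
  [ { $match: { genres: "Drama" } } ],
  { maxTimeMS: 5000 }  // terminate the operation server-side if it exceeds 5 seconds
)
```

As the reply notes, treat this as a safeguard against runaway queries, not as a substitute for proper indexing.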
  32. Support Ticket 5 @naomi_pen How do I query this…? We

    started tracking views information by deviceIDs in the collection `deviceviews`. Now we're trying to find all views where `movie_completed` is false. However, I don't know how to do that since the children of `views` are uuids… { 
 "_id": ObjectId("5be090ca8dd84581f7120e5f"), "deviceId": { user: "Naomi", device: "iPhone", first_seen: "1541364977160" }, "views": { "e5a52daf-f02a-4eca-acaa-690498c8590e": { "id": "e5a52daf-f02a-4eca-acaa-690498c8590e", "movie_title": "It's a wonderful dog life", "movie_completed": false, "time_started": "1541443718297" }, … } } Robin from Petflix 7 minute ago
  33. Support Ticket 5 Investigate the data model @naomi_pen { 


    "_id": ObjectId("5be090ca8dd84581f7120e5f"), "deviceId": { user: "Naomi", device: "iPhone", first_seen: "1541364977160" }, "views": { "e5a52daf-f02a-4eca-acaa-690498c8590e": { "id": "e5a52daf-f02a-4eca-acaa-690498c8590e", "movie_title": "It's a wonderful dog life", "movie_completed": false, "time_started": "1541443718297" }, … } }
  34. Support Ticket 5 @naomi_pen Re: How do I query this…?

    You can solve this with $objectToArray (code attached) but this would result in extra calculation load. We would suggest restructuring your schema instead: { "_id": "e5a52daf-f02a-4eca-acaa-690498c8590e", "movie_title": "It’s a wonderful dog life", "movie_completed": false, "time_started": "1541443718297", "deviceId": { user: "Naomi", device: "iPhone", first_seen: "1541364977160" } } … Naomi from MongoDB
  35. Support Ticket 5 $project: { views: { $objectToArray: "$views" },

    deviceId: 1 } @naomi_pen { 
 "_id": ObjectId("5be090ca8d…"), "deviceId": { user: "Naomi", device: "iPhone", first_seen: "1541364977160" }, "views": { "e5a52daf-…": { "id": "e5a52daf-…", "movie_title": "It's a wonderful dog life", "movie_completed": false, "time_started": "1541443718297" }, "f7d34ced-…": { … }, … } } { 
 "_id": ObjectId("5be090ca8d…"), "deviceId": { user: "Naomi", device: "iPhone", first_seen: "1541364977160" }, "views" : [ { "k" : "e5a52daf-…", "v" : { "id" : "e5a52daf-…", "movie_title" : "It's a wonderful dog life", "movie_completed" : false, "time_started" : "1541443718297" } }, { "k" : "f7d34ced-…", "v" : { … } } ] }
  36. Support Ticket 5 Investigate the query @naomi_pen ➜ db.deviceviews.aggregate([ {

    $project: { views: { $objectToArray: "$views" }, deviceId: 1 } }, { $unwind: "$views" }, { $match: { "views.v.movie_completed": false } } ])
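The transformation in this pipeline can be mimicked in plain JavaScript (a concept demo, not the server implementation): Object.entries plays the role of $objectToArray, producing { k, v } pairs that can then be unwound and matched like any other array.

```javascript
// Concept sketch of $objectToArray + $unwind + $match in plain JavaScript.
// $objectToArray turns an object's keys into an array of { k, v } pairs,
// which makes the unknown UUID keys queryable like any other array field.

function objectToArray(obj) {
  return Object.entries(obj).map(([k, v]) => ({ k, v }));
}

const doc = {
  views: {
    "e5a52daf": { movie_title: "It's a wonderful dog life", movie_completed: false },
    "f7d34ced": { movie_title: "Catmanji", movie_completed: true },
  },
};

// Equivalent of { $unwind: "$views" } + { $match: { "views.v.movie_completed": false } }:
const incomplete = objectToArray(doc.views).filter((e) => e.v.movie_completed === false);
console.log(incomplete.map((e) => e.k)); // IDs of views that were never finished
```

This works, but as the reply above says, it recomputes the array on every query, which is why restructuring the schema is the better long-term fix.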
  37. @naomi_pen Aggregated Aggregation Tips 1 Indexable stages before unindexable stages: Start with stages

    that can use indexes (e.g. $sort). Index the foreignField of a $lookup. 2 $match documents early: Start with selective stages. If matching on fields in an array after an $unwind, $match before too. 3 Use $limit: Place $limit close to $sort to allow for more efficient sorting. 4 Use a dedicated analytics node: maxTimeMS can be used as a safeguard. 5 Consider changing the data model: Expensive aggregations that run often may indicate a need for a schema change.
  38. @naomi_pen Thank you! Hi folks, We’re off for today. Thank

    you so much for all your work!! Have a wonderful day, Robin Robin from Petflix 1 minute ago
  39. @naomi_pen By Naomi Pentrel Thank you! Find me later with

    any questions!
 
 Slides are available at speakerdeck.com/npentrel @naomi_pen