Aggregation Framework Performance Pitfalls

Slide 1

Slide 1 text

By Naomi Pentrel Aggregation Framework: Performance Pitfalls 30min in the life of a Support Engineer @naomi_pen ➜ db.collection.aggregate( [ { }, ... ] ) {…} {…} {…} {…} {…} {…} {…} {…} {…} {…} {…} {…} {…} {…} {…}

Slide 2

Slide 2 text

@naomi_pen A Reminder Automatic Optimizations The Customer Support Tickets 5 Tips (#4 will SHOCK you) 3 4 5 1 2

Slide 3

Slide 3 text

@naomi_pen MongoDB's Aggregation Framework allows you to manipulate and transform data in ways that are not possible with normal queries.

Slide 4

Slide 4 text

Aggregation Framework: A Reminder Aggregation Pipeline: In Theory @naomi_pen # Pipeline Stages (think unix pipes): ➜ db.collection.aggregate( [ { }, ... ] ) # Operator expressions: ➜ { : [ , ... ] } ➜ { : } {…} {…} {…} {…} {…} {…} {…} {…} {…} {…} {…} {…} {…} {…} {…}

Slide 5

Slide 5 text

Aggregation Framework: A Reminder Example: Grouping Orders @naomi_pen ➜ db.orders.aggregate( [ { $match: { status: "A"} }, { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }, ] ) pipeline specification Operator

Slide 6

Slide 6 text

Aggregation Framework: A Reminder Example: Grouping Orders @naomi_pen Stages

Slide 7

Slide 7 text

Aggregation Framework: A Reminder Pipeline Stages @naomi_pen $addFields  $bucket  $bucketAuto  $collStats  $count  $currentOp  $facet  $geoNear  $graphLookup  $group  $indexStats  $limit  $listLocalSessions  $listSessions  $lookup  $match  $out  $project  $redact  $replaceRoot  $sample  $skip  $sort  $sortByCount   $unwind

Slide 8

Slide 8 text

Aggregation Framework: A Reminder Pipeline Operators Arithmetic Expression Operators $abs $add $ceil $divide $exp $floor $ln $log $log10 $mod $multiply $pow $sqrt $subtract $trunc   Array Expression Operators $arrayElemAt $arrayToObject $concatArrays $filter $in $indexOfArray $isArray $map $objectToArray $range $reduce $reverseArray $size $slice $zip Boolean Expression Operators $and $not $or Comparison Expression Operators $cmp $eq $gt $gte $lt $lte $ne   Conditional Expression Operators $cond $ifNull $switch Date Expression Operators $dateFromParts $dateFromString $dateToParts $dateToString $dayOfMonth $dayOfWeek $dayOfYear $hour $isoDayOfWeek $isoWeek $isoWeekYear $millisecond $minute $month $second $toDate $week $year $add $subtract Literal Expression Operator $literal Object Expression Operators $mergeObjects $objectToArray Set Expression Operators $allElementsTru e $anyElementTru e $setDifference $setEquals $setIntersection $setIsSubset $setUnion   String Expression Operators $concat $dateFromString $dateToString $indexOfBytes $indexOfCP $ltrim $rtrim $split $strLenBytes $strLenCP $strcasecmp $substrBytes $substrCP $toLower $toString $trim $toUpper Text Expression Operator $meta Type Expression Operators $convert $toBool $toDate $toDecimal $toDouble $toInt $toLong $toObjectId $toString $type Accumulators ($group) $addToSet $avg $first $last $max $mergeObjects $min $push $stdDevPop $stdDevSamp $sum Accumulators ($project) $avg $max $min $stdDevPop $stdDevSamp $sum Variable Expression Operators $let @naomi_pen A bit much? Here’s the reference: bit.ly/MongoDBAR

Slide 9

Slide 9 text

Aggregation Framework: A Reminder Explain…? @naomi_pen ➜ db.sales.aggregate( [ { $group: { _id: "$date", revenue: {$sum: “$amount"} } } ], {explain: true} )

Slide 10

Slide 10 text

Aggregation Framework: A Reminder Explain…? @naomi_pen ➜ db.sales.explain("executionStats").aggregate( [ { $group: { _id: “$date", revenue: {$sum: "$amount"} } } ] )

Slide 11

Slide 11 text

@naomi_pen 1 Aggregations require more RAM and CPU Truths about Aggregations 1 2 3 The amount of data to process is large The operation will be run frequently The operation is intended to service a real-time app feature 2 Each stage has a manipulated copy of the data 3 May use lots of memory; allocated outside WT 4 Aggregations need consideration, especially if:

Slide 12

Slide 12 text

@naomi_pen A Reminder Automatic Optimizations The Customer Support Tickets 5 Tips (#4 will SHOCK you) 3 4 5 1 2

Slide 13

Slide 13 text

Automatic Optimizations The Optimizer will transform your queries @naomi_pen • $project or $addFields + $match Sequence Optimization • $sort + $match Sequence Optimization • $redact + $match Sequence Optimization • $project + $skip Sequence Optimization • $sort + $limit Coalescence • $limit + $limit Coalescence • $skip + $skip Coalescence • $match + $match Coalescence • $lookup + $unwind Coalescence • $sort + $skip + $limit Sequence

Slide 14

Slide 14 text

@naomi_pen A Reminder Automatic Optimizations The Customer Learnings 5 Tips (#4 will SHOCK you) 4 5 1 3 2

Slide 15

Slide 15 text

The Customer Meet Robin @naomi_pen Robin is a Software Engineer at Petflix They love MongoDB for its ease of use Robin made the decision to move Petflix to MongoDB

Slide 16

Slide 16 text

The Customer @naomi_pen

Slide 17

Slide 17 text

@naomi_pen { _id: "tt0944947", title: "Game of Throws", genres: [ "Drama", "Adventure", "Fantasy" ], director: "Jeremy Doggo", rating: 8.6, description: "Nine noble dog families fight over throws, (…)", actors: [ "Emilia Samoyed", "Peter Terrier", … ], year: "2011" … } { _id: { user: "naomi", time_started: "1541364977160", }, movie_id: "tt0944947", location: "Germany" } { _id: "naomi", email: "[email protected]", … } Videos Views Users The Customer: Data model

Slide 18

Slide 18 text

@naomi_pen A Reminder Automatic Optimizations The Customer Support Tickets 5 Tips (#4 will SHOCK you) 5 1 2 4 3

Slide 19

Slide 19 text

Support Ticket 1 @naomi_pen CPU Spikes with Query Hi there, We have an issue with one of our queries. We count movies for two genres, and list titles for each genre in our new UI. But it’s slow and causes spikes: db.movies.aggregate([ { $unwind: "$genres" }, { $match: { "genres": { $in: ["Fantasy", "Adventure"] } } }, { $group: {"_id": "$genres", "count": {$sum: 1}, "titles": {$push: "$title"}}} ]) Thanks, Robin  Robin from Petflix 1 minute ago

Slide 20

Slide 20 text

Support Ticket 1 Investigate the query @naomi_pen ➜ db.movies.aggregate([ { $unwind: "$genres" }, { $match: { "genres": { $in: ["Fantasy", "Adventure"] } } }, { $group: { "_id": "$genres", "count":{$sum: 1}, "titles": {$push: "$title"} } } ])

Slide 21

Slide 21 text

Support Ticket 1 Investigate the logs* @naomi_pen 2018-11-04T15:11:59.670-0800 I COMMAND [conn2] command petflix.movies appName: "MongoDB Shell" command: aggregate { aggregate: "movies", pipeline: [ { $unwind: "$genres" }, { $match: { genres: { $in: [ "Fantasy", "Adventure" ] } } }, { $group: { _id: "$genres", count: { $sum: 1.0 }, titles: { $push: "$title" } } } ], cursor: {}, lsid: { id: UUID("610579f8- c68e-44f7-881d-2ddc407e5594") }, $db: "petflix" } planSummary: COLLSCAN keysExamined:0 docsExamined:49997 cursorExhausted:1 numYields:392 nreturned:2 reslen:813 locks:{ Global: { acquireCount: { r: 397 } }, Database: { acquireCount: { r: 397 } }, Collection: { acquireCount: { r: 397 } } } protocol:op_msg 121ms *Queries only appear in logs if they run for 100+ ms. Queries not in the logs can still cause load!

Slide 22

Slide 22 text

Slide 23

Slide 23 text

Support Ticket 1 { $unwind: "$genres" } @naomi_pen { _id: "tt0944947", title: "Game of Throws", genres: [ "Drama", "Adventure", "Fantasy" ], director: "Jeremy Doggo", rating: 8.6, description: "Nine noble dog families fight over throws, (…)", actors: [ "Emilia Samoyed", "Peter Terrier", … ], year: "2011" … } { _id: "tt0944947", title: "Game of Throws", genres: "Drama", director: "Jeremy Doggo", rating: 8.6, description: "Nine noble dog families fight over throws, (…)", actors: [ "Emilia Samoyed", "Peter Terrier", … ], year: "2011" … } { _id: "tt0944947", title: "Game of Throws", genres: "Adventure", director: "Jeremy Doggo", rating: 8.6, description: "Nine noble dog families fight over throws, (…)", actors: [ "Emilia Samoyed", "Peter Terrier", … ], year: "2011" … } { _id: "tt0944947", title: "Game of Throws", genres: "Fantasy", director: "Jeremy Doggo", rating: 8.6, description: "Nine noble dog families fight over throws, (…)", actors: [ "Emilia Samoyed", "Peter Terrier", … ], year: "2011" … } { $match: { "genres": { $in: ["Fantasy", "Adventure"] } }

Slide 24

Slide 24 text

Support Ticket 1 { $unwind: "$genres" } @naomi_pen { _id: "tt1032765", title: "Catmanji", genres: [ "Comedy", "Family", ], director: "Kit Cat", rating: 7.9, description: "Four cats play a game that gets way too real (…)", actors: [ "Cat McCatface", "Kitty Fish"… ], year: "2018" … } { _id: "tt1032765", title: "Catmanji", genres: "Comedy", director: "Kit Cat", rating: 7.9, description: "Four cats play a game that gets way too real (…)", actors: [ "Cat McCatface", "Kitty Fish"… ], year: "2018" … } { _id: "tt1032765", title: "Catmanji", genres: "Family", director: "Kit Cat", rating: 7.9, description: "Four cats play a game that gets way too real (…)", actors: [ "Cat McCatface", "Kitty Fish"… ], year: "2018" … } { $match: { "genres": { $in: [ "Fantasy", “Adventure” ] } } }

Slide 25

Slide 25 text

Support Ticket 1 Solution: Duplicate the $match stage & ensure index @naomi_pen ➜ db.movies.aggregate([ { $match: { "genres": { $in: ["Fantasy", "Adventure"] } } }, { $unwind: "$genres" }, { $match: { "genres": { $in: ["Fantasy", "Adventure"] } } } { $group: { "_id": "$genres", "count":{$sum: 1}, "titles": {$push: "$title"} } } ])

Slide 26

Slide 26 text

Support Ticket 1 @naomi_pen Re: CPU Spikes with Query We’re happy to help. To improve your query’s performance you need to limit input as much as possible. In your query the match after the unwind discards all documents that aren't in those genres. If you add an additional match before the unwind, then documents that don’t have those genres are never unwound, because they never enter the pipeline. db.movies.aggregate([ { $match: {"genres": {$in: ["Fantasy", "Family"]}}} { $unwind: "genres"}, { $match: {"genres": {$in: ["Fantasy", "Family"]}}} { "$group": {"_id": "$genres", "count": {"$sum": 1}, "titles": {$push: "$title"} } ] ) As a bonus, the initial match stage can also make use of existing indexes, so ensure you have an index on genres. Does this help? Cheers, Naomi Naomi from MongoDB

Slide 27

Slide 27 text

Support Ticket 1 Now in the logs* @naomi_pen 2018-11-04T15:14:51.491-0800 I COMMAND [conn2] command petflix.movies appName: "MongoDB Shell" command: aggregate { aggregate: "movies", pipeline: [ { $match: { genres: { $in: [ "Fantasy", "Adventure" ] } } }, { $unwind: "$genres" }, { $match: { genres: { $in: [ "Fantasy", "Adventure" ] } } }, { $group: { _id: "$genres", count: { $sum: 1.0 }, titles: { $push: "$title" } } } ], cursor: {}, lsid: { id: UUID("610579f8-c68e-44f7-881d-2ddc407e5594") }, $db: "petflix" } planSummary: IXSCAN { genres: 1 } keysExamined:28 docsExamined:18 cursorExhausted:1 numYields:0 nreturned:2 reslen:813 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_msg 1ms *Logs by default only log queries than run longer than 100ms.

Slide 28

Slide 28 text

Support Ticket 2 @naomi_pen New Join Query is Slow Hi team, We're again experiencing performance issues. Can you look into this? db.movies.aggregate([ { $match: {"genres": "Drama"} }, { $lookup: { from : "views", localField: "_id", foreignField: "movie_id", as: "viewsData" } }, { $project: { viewsData: 1, title: 1} }, { $sort: { "viewsData.time_started": -1}} ]) Cheers, Robin Robin from Petflix 3 minute ago

Slide 29

Slide 29 text

Support Ticket 2 Investigate the query @naomi_pen ➜ db.movies.aggregate([ { $match: {"genres": "Drama"} }, { $lookup: { from : "views", localField: "_id", foreignField: "movie_id", as: "viewsData" } }, { $project: { viewsData: 1, title: 1} }, { $sort: { "viewsData.time_started": -1} } ])

Slide 30

Slide 30 text

Support Ticket 2 Investigate the logs* @naomi_pen 2018-11-04T16:35:00.235-0800 I COMMAND [conn4] command petflix.views appName: "MongoDB Shell" command: aggregate { aggregate: "movies", pipeline: [ { $match: { genres: { $in: [ "Drama" ] } } }, { $lookup: { from: "views", localField: "_id", foreignField: "movie_id", as: "viewsData" } }, { $project: { viewsData: 1.0, title: 1.0 } }, { $sort: { viewsData.time_started: -1.0 } } ], cursor: {}, lsid: { id: UUID("617b7851-4985-47e8-a647-991173e60667") }, $db: "petflix" } planSummary: IXSCAN { genres: 1 } cursorid:8208857423668239231 keysExamined:24997 docsExamined:24997 hasSortStage:1 numYields:200310 nreturned:51 reslen:15709 locks:{ Global: { acquireCount: { r: 250310 } }, Database: { acquireCount: { r: 250310 } }, Collection: { acquireCount: { r: 250309 } } } protocol:op_msg 11172ms *Queries only appear in logs if they run for 100+ ms. Queries not in the logs can still cause load!

Slide 31

Slide 31 text

Support Ticket 2 Investigate the query & indexes @naomi_pen ➜ db.movies.aggregate([ { $match: {"genres": "Drama"} }, { $lookup: { from : "views", localField: "_id", foreignField: "movie_id", as: "viewsData" } }, { $project: { viewsData: 1, title: 1} }, { $sort: { "viewsData.time_started": -1} } ]) ➜ db.movies.getIndexes() # result: _id, genres ➜ db.views.getIndexes() # result: _id

Slide 32

Slide 32 text

Support Ticket 2 Solution: Index views.movie_id, optionally add limit @naomi_pen ➜ db.movies.aggregate([ { $match: {"genres": { $in : ["Drama"] }} }, { $lookup: { from : "views", localField: "_id", foreignField: "movie_id", as: "viewsData" } }, { $sort: { "viewsData.time_started": -1}}, { $limit: 50 } ])

Slide 33

Slide 33 text

Support Ticket 2 @naomi_pen Re: New Join Query is Slow Hi Robin, Of course! We looked into the query and have a few suggestions: 1. The foreignField specified in the $lookup should be indexed. Please add an index on views.movie_id. This should boost performance quite a bit. 2. If possible you could also add a $limit stage to improve the performance of the sort. Even with these changes, this query may still be slow. If this is a common query pattern, we would suggest considering changes to your data model. Does this help?  Best,  Naomi Naomi from MongoDB

Slide 34

Slide 34 text

Support Ticket 2 Now in the logs @naomi_pen 2018-11-05T03:43:22.109-0800 I COMMAND [conn4] command petflix.views appName: "MongoDB Shell" command: aggregate { aggregate: "movies", pipeline: [ { $match: { genres: { $in: [ "Drama" ] } } }, { $lookup: { from: "views", localField: "_id", foreignField: "movie_id", as: "viewsData" } }, { $sort: { viewsData.time_started: -1.0 } }, { $limit: 50.0 } ], cursor: {}, lsid: { id: UUID("617b7851-4985-47e8-a647-991173e60667") }, $db: "petflix" } planSummary: IXSCAN { genres: 1 } keysExamined:24997 docsExamined: 24997 hasSortStage:1 cursorExhausted:1 numYields:208 nreturned:51 reslen: 9100 locks:{ Global: { acquireCount: { r: 50206 } }, Database: { acquireCount: { r: 50206 } }, Collection: { acquireCount: { r: 50205 } } } protocol:op_msg 1245ms

Slide 35

Slide 35 text

Support Ticket 3 @naomi_pen Help with Metrics Hello old friend, Thanks so much for all your help so far. I think I got most of the aggregation framework performance pitfalls figured out by now… Unfortunately our Marketing team is pushing us hard to deliver metrics around how many movies are streamed per country at any given time. We tried to run the queries but I think if we run this query every few seconds it will bring down our deployment. Is there any magic solution? Thanks, Robin Robin from Petflix 5 minute ago

Slide 36

Slide 36 text

Support Ticket 3 @naomi_pen Re: Help with Metrics Hi Robin, I understand your concern. We agree that the additional load might be a bit much for your deployment. We would suggest you consider using a dedicated analytics node for these queries. That way the additional load would not impact your other nodes. Let us know if this would suit you. If you have any questions let me know.  Best,  Naomi Naomi from MongoDB

Slide 37

Slide 37 text

Support Ticket 4 @naomi_pen URGENT HI, Our analytics node was going great. So great in fact we decided to use the analytics node for a search feature. Now everything is burning. But we don't want to completely turn it off. There's only some queries that are really bad. Can we somehow limit a queries impact? HELP Thx, R Robin from Petflix 1 minute ago

Slide 38

Slide 38 text

Support Ticket 4 @naomi_pen Re: URGENT Hi Robin, You can consider setting maxTimeMS (http://bit.ly/maxTimeMS, http://bit.ly/ maxTimeMS2) on the queries.   This setting will protect the database server by ending operations that take longer than maxTimeMS. That said, be aware that:  • In some cases, partial results are returned by terminated operations. • maxTimeMS is not a guarantee. In high-queue situations, operations that are queued but not active may not be terminated until they become active again.   Please note that this is a safeguard, not a solution. To help alleviate the load you should add indexes for commonly used query patterns (to all nodes or only the analytics node).   Best, Naomi Naomi from MongoDB

Slide 39

Slide 39 text

Support Ticket 5 @naomi_pen How do I query this…? We started tracking views information by deviceIDs in the collection `deviceviews`. Now we're trying to find all views where `movie_completed` is false. However, I don't know how to do that since the children of `views` are uuids… {   "_id": ObjectId("5be090ca8dd84581f7120e5f"), "deviceId": { user: "Naomi", device: "iPhone", first_seen: "1541364977160" }, "views": { "e5a52daf-f02a-4eca-acaa-690498c8590e": { "id": "e5a52daf-f02a-4eca-acaa-690498c8590e", "movie_title": "It's a wonderful dog life", "movie_completed": false, "time_started": "1541443718297" }, … } } Robin from Petflix 7 minute ago

Slide 40

Slide 40 text

Support Ticket 5 Investigate the data model @naomi_pen {   "_id": ObjectId("5be090ca8dd84581f7120e5f"), "deviceId": { user: "Naomi", device: "iPhone", first_seen: "1541364977160" }, "views": { "e5a52daf-f02a-4eca-acaa-690498c8590e": { "id": "e5a52daf-f02a-4eca-acaa-690498c8590e", "movie_title": "It's a wonderful dog life", "movie_completed": false, "time_started": "1541443718297" }, … } }

Slide 41

Slide 41 text

Support Ticket 5 @naomi_pen Re: How do I query this…? You can solve this with $objectToArray (code attached) but this would result in extra calculation load. We would suggest restructuring your schema instead: { "_id": "e5a52daf-f02a-4eca-acaa-690498c8590e", "movie_title": "It’s a wonderful dog life", "movie_completed": false, "time_started": "1541443718297" "deviceId": { user: "Naomi", device: "iPhone", first_seen: "1541364977160" }, } } … Naomi from MongoDB

Slide 42

Slide 42 text

Support Ticket 5 $project: { views: { $objectToArray: "$views" }, deviceId: 1 } @naomi_pen {   "_id": ObjectId("5be090ca8d…"), "deviceId": { user: "Naomi", device: "iPhone", first_seen: "1541364977160" }, "views": { "e5a52daf-…": { "id": "e5a52daf-…", "movie_title": "It's a wonderful dog life", "movie_completed": false, "time_started": "1541443718297" }, "f7d34ced-…": { … }, … } } {   "_id": ObjectId("5be090ca8d…"), "deviceId": { user: "Naomi", device: "iPhone", first_seen: "1541364977160" }, "views" : [ { "k" : "e5a52daf-…", "v" : { "id" : "e5a52daf-…", "movie_title" : "It's a wonderful dog life", "movie_completed" : false, "time_started" : "1541443718297" } }, { "k" : "f7d34ced-…", "v" : { … } } ] }

Slide 43

Slide 43 text

Support Ticket 5 Investigate the query @naomi_pen ➜ db.deviceviews.aggregate([ { $project: { views: { $objectToArray: "$views" }, deviceId: 1 } }, { $unwind: "$views" }, { $match: { "views.v.movie_completed": false } } ])

Slide 44

Slide 44 text

@naomi_pen A Reminder Automatic Optimizations The Customer Support Tickets 5 Tips (#4 will SHOCK you) 4 1 2 5 3

Slide 45

Slide 45 text

@naomi_pen 1 Indexable stages before unindexable stages Start with stages that can use indexes (e.g. $sort). Index the foreignField of a $lookup. Aggregated Aggregation Tips 3 2 Start with selective stages. If matching on fields in an array after an $unwind, $match before too. $match documents early 3 Place $limit close to $sort to allow for more efficient sorting. Use $limit 5 maxTimeMS can be used as a safeguard. Use a dedicated analytics node 4 Expensive aggregations that run often may indicate a need for a schema change. Consider changing the data model

Slide 46

Slide 46 text

@naomi_pen Thank you! Hi folks, We’re off for today. Thank you so much for all your work!! Have a wonderful day, Robin Robin from Petflix 1 minute ago

Slide 47

Slide 47 text

@naomi_pen By Naomi Pentrel Thank you! Find me later with any questions!    Slides are available at speakerdeck.com/npentrel @naomi_pen