Aggregation Framework Performance Pitfalls

By Naomi Pentrel Aggregation Framework: Performance Pitfalls 30min in the
life of a Support Engineer @naomi_pen ➜ db.collection.aggregate( [ { <stage> }, ... ] ) {…} {…} {…} {…} {…} {…} {…} {…} {…} {…} {…} {…} {…} {…} {…}

@naomi_pen A Reminder Automatic Optimizations The Customer Support Tickets 5
Tips (#4 will SHOCK you) 3 4 5 1 2

@naomi_pen MongoDB's Aggregation Framework allows you to manipulate and transform
data in ways that are not possible with normal queries.

Aggregation Framework: A Reminder Aggregation Pipeline: In Theory @naomi_pen #
Pipeline Stages (think unix pipes): ➜ db.collection.aggregate( [ { <stage> }, ... ] ) # Operator expressions: ➜ { <operator>: [ <argument1>, <argument2> ... ] } ➜ { <operator>: <argument> } {…} {…} {…} {…} {…} {…} {…} {…} {…} {…} {…} {…} {…} {…} {…}

Aggregation Framework: A Reminder Example: Grouping Orders @naomi_pen ➜ db.orders.aggregate(
[ { $match: { status: "A"} }, { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }, ] ) pipeline specification Operator

Aggregation Framework: A Reminder Example: Grouping Orders @naomi_pen Stages

Aggregation Framework: A Reminder Pipeline Stages @naomi_pen $addFields  $bucket  $bucketAuto 
$collStats  $count  $currentOp  $facet  $geoNear  $graphLookup  $group  $indexStats  $limit  $listLocalSessions  $listSessions  $lookup  $match  $out  $project  $redact  $replaceRoot  $sample  $skip  $sort  $sortByCount   $unwind

Aggregation Framework: A Reminder Pipeline Operators Arithmetic Expression Operators $abs
$add $ceil $divide $exp $floor $ln $log $log10 $mod $multiply $pow $sqrt $subtract $trunc   Array Expression Operators $arrayElemAt $arrayToObject $concatArrays $filter $in $indexOfArray $isArray $map $objectToArray $range $reduce $reverseArray $size $slice $zip Boolean Expression Operators $and $not $or Comparison Expression Operators $cmp $eq $gt $gte $lt $lte $ne   Conditional Expression Operators $cond $ifNull $switch Date Expression Operators $dateFromParts $dateFromString $dateToParts $dateToString $dayOfMonth $dayOfWeek $dayOfYear $hour $isoDayOfWeek $isoWeek $isoWeekYear $millisecond $minute $month $second $toDate $week $year $add $subtract Literal Expression Operator $literal Object Expression Operators $mergeObjects $objectToArray Set Expression Operators $allElementsTru e $anyElementTru e $setDifference $setEquals $setIntersection $setIsSubset $setUnion   String Expression Operators $concat $dateFromString $dateToString $indexOfBytes $indexOfCP $ltrim $rtrim $split $strLenBytes $strLenCP $strcasecmp $substrBytes $substrCP $toLower $toString $trim $toUpper Text Expression Operator $meta Type Expression Operators $convert $toBool $toDate $toDecimal $toDouble $toInt $toLong $toObjectId $toString $type Accumulators ($group) $addToSet $avg $first $last $max $mergeObjects $min $push $stdDevPop $stdDevSamp $sum Accumulators ($project) $avg $max $min $stdDevPop $stdDevSamp $sum Variable Expression Operators $let @naomi_pen A bit much? Here’s the reference: bit.ly/MongoDBAR

Aggregation Framework: A Reminder Explain…? @naomi_pen ➜ db.sales.aggregate( [ {
$group: { _id: "$date", revenue: {$sum: “$amount"} } } ], {explain: true} )

Aggregation Framework: A Reminder Explain…? @naomi_pen ➜ db.sales.explain("executionStats").aggregate( [ {
$group: { _id: “$date", revenue: {$sum: "$amount"} } } ] )

@naomi_pen 1 Aggregations require more RAM and CPU Truths about
Aggregations 1 2 3 The amount of data to process is large The operation will be run frequently The operation is intended to service a real-time app feature 2 Each stage has a manipulated copy of the data 3 May use lots of memory; allocated outside WT 4 Aggregations need consideration, especially if:

Automatic Optimizations The Optimizer will transform your queries @naomi_pen •
$project or $addFields + $match Sequence Optimization • $sort + $match Sequence Optimization • $redact + $match Sequence Optimization • $project + $skip Sequence Optimization • $sort + $limit Coalescence • $limit + $limit Coalescence • $skip + $skip Coalescence • $match + $match Coalescence • $lookup + $unwind Coalescence • $sort + $skip + $limit Sequence

@naomi_pen A Reminder Automatic Optimizations The Customer Learnings 5 Tips
(#4 will SHOCK you) 4 5 1 3 2

The Customer Meet Robin @naomi_pen Robin is a Software Engineer
at Petflix They love MongoDB for its ease of use Robin made the decision to move Petflix to MongoDB

The Customer @naomi_pen

@naomi_pen { _id: "tt0944947", title: "Game of Throws", genres: [
"Drama", "Adventure", "Fantasy" ], director: "Jeremy Doggo", rating: 8.6, description: "Nine noble dog families fight over throws, (…)", actors: [ "Emilia Samoyed", "Peter Terrier", … ], year: "2011" … } { _id: { user: "naomi", time_started: "1541364977160", }, movie_id: "tt0944947", location: "Germany" } { _id: "naomi", email: "naomi.pentrel@mongodb.com", … } Videos Views Users The Customer: Data model

Support Ticket 1 @naomi_pen CPU Spikes with Query Hi there,
We have an issue with one of our queries. We count movies for two genres, and list titles for each genre in our new UI. But it’s slow and causes spikes: db.movies.aggregate([ { $unwind: "$genres" }, { $match: { "genres": { $in: ["Fantasy", "Adventure"] } } }, { $group: {"_id": "$genres", "count": {$sum: 1}, "titles": {$push: "$title"}}} ]) Thanks, Robin  Robin from Petflix 1 minute ago

Support Ticket 1 Investigate the query @naomi_pen ➜ db.movies.aggregate([ {
$unwind: "$genres" }, { $match: { "genres": { $in: ["Fantasy", "Adventure"] } } }, { $group: { "_id": "$genres", "count":{$sum: 1}, "titles": {$push: "$title"} } } ])

Support Ticket 1 Investigate the logs* @naomi_pen 2018-11-04T15:11:59.670-0800 I COMMAND
[conn2] command petflix.movies appName: "MongoDB Shell" command: aggregate { aggregate: "movies", pipeline: [ { $unwind: "$genres" }, { $match: { genres: { $in: [ "Fantasy", "Adventure" ] } } }, { $group: { _id: "$genres", count: { $sum: 1.0 }, titles: { $push: "$title" } } } ], cursor: {}, lsid: { id: UUID("610579f8- c68e-44f7-881d-2ddc407e5594") }, $db: "petflix" } planSummary: COLLSCAN keysExamined:0 docsExamined:49997 cursorExhausted:1 numYields:392 nreturned:2 reslen:813 locks:{ Global: { acquireCount: { r: 397 } }, Database: { acquireCount: { r: 397 } }, Collection: { acquireCount: { r: 397 } } } protocol:op_msg 121ms *Queries only appear in logs if they run for 100+ ms. Queries not in the logs can still cause load!

$unwind: "$genres" }, { $match: { "genres": { $in: ["Fantasy", "Adventure"] } } }, { $group: { "_id": "$genres", "count":{$sum: 1}, "titles": {$push: "$title"} } } ])

Support Ticket 1 { $unwind: "$genres" } @naomi_pen { _id:
"tt0944947", title: "Game of Throws", genres: [ "Drama", "Adventure", "Fantasy" ], director: "Jeremy Doggo", rating: 8.6, description: "Nine noble dog families fight over throws, (…)", actors: [ "Emilia Samoyed", "Peter Terrier", … ], year: "2011" … } { _id: "tt0944947", title: "Game of Throws", genres: "Drama", director: "Jeremy Doggo", rating: 8.6, description: "Nine noble dog families fight over throws, (…)", actors: [ "Emilia Samoyed", "Peter Terrier", … ], year: "2011" … } { _id: "tt0944947", title: "Game of Throws", genres: "Adventure", director: "Jeremy Doggo", rating: 8.6, description: "Nine noble dog families fight over throws, (…)", actors: [ "Emilia Samoyed", "Peter Terrier", … ], year: "2011" … } { _id: "tt0944947", title: "Game of Throws", genres: "Fantasy", director: "Jeremy Doggo", rating: 8.6, description: "Nine noble dog families fight over throws, (…)", actors: [ "Emilia Samoyed", "Peter Terrier", … ], year: "2011" … } { $match: { "genres": { $in: ["Fantasy", "Adventure"] } }

Support Ticket 1 { $unwind: "$genres" } @naomi_pen { _id:
"tt1032765", title: "Catmanji", genres: [ "Comedy", "Family", ], director: "Kit Cat", rating: 7.9, description: "Four cats play a game that gets way too real (…)", actors: [ "Cat McCatface", "Kitty Fish"… ], year: "2018" … } { _id: "tt1032765", title: "Catmanji", genres: "Comedy", director: "Kit Cat", rating: 7.9, description: "Four cats play a game that gets way too real (…)", actors: [ "Cat McCatface", "Kitty Fish"… ], year: "2018" … } { _id: "tt1032765", title: "Catmanji", genres: "Family", director: "Kit Cat", rating: 7.9, description: "Four cats play a game that gets way too real (…)", actors: [ "Cat McCatface", "Kitty Fish"… ], year: "2018" … } { $match: { "genres": { $in: [ "Fantasy", “Adventure” ] } } }

Support Ticket 1 Solution: Duplicate the $match stage & ensure
index @naomi_pen ➜ db.movies.aggregate([ { $match: { "genres": { $in: ["Fantasy", "Adventure"] } } }, { $unwind: "$genres" }, { $match: { "genres": { $in: ["Fantasy", "Adventure"] } } } { $group: { "_id": "$genres", "count":{$sum: 1}, "titles": {$push: "$title"} } } ])

Support Ticket 1 @naomi_pen Re: CPU Spikes with Query We’re
happy to help. To improve your query’s performance you need to limit input as much as possible. In your query the match after the unwind discards all documents that aren't in those genres. If you add an additional match before the unwind, then documents that don’t have those genres are never unwound, because they never enter the pipeline. db.movies.aggregate([ { $match: {"genres": {$in: ["Fantasy", "Family"]}}} { $unwind: "genres"}, { $match: {"genres": {$in: ["Fantasy", "Family"]}}} { "$group": {"_id": "$genres", "count": {"$sum": 1}, "titles": {$push: "$title"} } ] ) As a bonus, the initial match stage can also make use of existing indexes, so ensure you have an index on genres. Does this help? Cheers, Naomi Naomi from MongoDB

Support Ticket 1 Now in the logs* @naomi_pen 2018-11-04T15:14:51.491-0800 I
COMMAND [conn2] command petflix.movies appName: "MongoDB Shell" command: aggregate { aggregate: "movies", pipeline: [ { $match: { genres: { $in: [ "Fantasy", "Adventure" ] } } }, { $unwind: "$genres" }, { $match: { genres: { $in: [ "Fantasy", "Adventure" ] } } }, { $group: { _id: "$genres", count: { $sum: 1.0 }, titles: { $push: "$title" } } } ], cursor: {}, lsid: { id: UUID("610579f8-c68e-44f7-881d-2ddc407e5594") }, $db: "petflix" } planSummary: IXSCAN { genres: 1 } keysExamined:28 docsExamined:18 cursorExhausted:1 numYields:0 nreturned:2 reslen:813 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_msg 1ms *Logs by default only log queries than run longer than 100ms.

Support Ticket 2 @naomi_pen New Join Query is Slow Hi
team, We're again experiencing performance issues. Can you look into this? db.movies.aggregate([ { $match: {"genres": "Drama"} }, { $lookup: { from : "views", localField: "_id", foreignField: "movie_id", as: "viewsData" } }, { $project: { viewsData: 1, title: 1} }, { $sort: { "viewsData.time_started": -1}} ]) Cheers, Robin Robin from Petflix 3 minute ago

$match: {"genres": "Drama"} }, { $lookup: { from : "views", localField: "_id", foreignField: "movie_id", as: "viewsData" } }, { $project: { viewsData: 1, title: 1} }, { $sort: { "viewsData.time_started": -1} } ])

Support Ticket 2 Investigate the logs* @naomi_pen 2018-11-04T16:35:00.235-0800 I COMMAND
[conn4] command petflix.views appName: "MongoDB Shell" command: aggregate { aggregate: "movies", pipeline: [ { $match: { genres: { $in: [ "Drama" ] } } }, { $lookup: { from: "views", localField: "_id", foreignField: "movie_id", as: "viewsData" } }, { $project: { viewsData: 1.0, title: 1.0 } }, { $sort: { viewsData.time_started: -1.0 } } ], cursor: {}, lsid: { id: UUID("617b7851-4985-47e8-a647-991173e60667") }, $db: "petflix" } planSummary: IXSCAN { genres: 1 } cursorid:8208857423668239231 keysExamined:24997 docsExamined:24997 hasSortStage:1 numYields:200310 nreturned:51 reslen:15709 locks:{ Global: { acquireCount: { r: 250310 } }, Database: { acquireCount: { r: 250310 } }, Collection: { acquireCount: { r: 250309 } } } protocol:op_msg 11172ms *Queries only appear in logs if they run for 100+ ms. Queries not in the logs can still cause load!

Support Ticket 2 Investigate the query & indexes @naomi_pen ➜
db.movies.aggregate([ { $match: {"genres": "Drama"} }, { $lookup: { from : "views", localField: "_id", foreignField: "movie_id", as: "viewsData" } }, { $project: { viewsData: 1, title: 1} }, { $sort: { "viewsData.time_started": -1} } ]) ➜ db.movies.getIndexes() # result: _id, genres ➜ db.views.getIndexes() # result: _id

Support Ticket 2 Solution: Index views.movie_id, optionally add limit @naomi_pen
➜ db.movies.aggregate([ { $match: {"genres": { $in : ["Drama"] }} }, { $lookup: { from : "views", localField: "_id", foreignField: "movie_id", as: "viewsData" } }, { $sort: { "viewsData.time_started": -1}}, { $limit: 50 } ])

Support Ticket 2 @naomi_pen Re: New Join Query is Slow
Hi Robin, Of course! We looked into the query and have a few suggestions: 1. The foreignField specified in the $lookup should be indexed. Please add an index on views.movie_id. This should boost performance quite a bit. 2. If possible you could also add a $limit stage to improve the performance of the sort. Even with these changes, this query may still be slow. If this is a common query pattern, we would suggest considering changes to your data model. Does this help?  Best,  Naomi Naomi from MongoDB

Support Ticket 2 Now in the logs @naomi_pen 2018-11-05T03:43:22.109-0800 I
COMMAND [conn4] command petflix.views appName: "MongoDB Shell" command: aggregate { aggregate: "movies", pipeline: [ { $match: { genres: { $in: [ "Drama" ] } } }, { $lookup: { from: "views", localField: "_id", foreignField: "movie_id", as: "viewsData" } }, { $sort: { viewsData.time_started: -1.0 } }, { $limit: 50.0 } ], cursor: {}, lsid: { id: UUID("617b7851-4985-47e8-a647-991173e60667") }, $db: "petflix" } planSummary: IXSCAN { genres: 1 } keysExamined:24997 docsExamined: 24997 hasSortStage:1 cursorExhausted:1 numYields:208 nreturned:51 reslen: 9100 locks:{ Global: { acquireCount: { r: 50206 } }, Database: { acquireCount: { r: 50206 } }, Collection: { acquireCount: { r: 50205 } } } protocol:op_msg 1245ms

Support Ticket 3 @naomi_pen Help with Metrics Hello old friend,
Thanks so much for all your help so far. I think I got most of the aggregation framework performance pitfalls figured out by now… Unfortunately our Marketing team is pushing us hard to deliver metrics around how many movies are streamed per country at any given time. We tried to run the queries but I think if we run this query every few seconds it will bring down our deployment. Is there any magic solution? Thanks, Robin Robin from Petflix 5 minute ago

Support Ticket 3 @naomi_pen Re: Help with Metrics Hi Robin,
I understand your concern. We agree that the additional load might be a bit much for your deployment. We would suggest you consider using a dedicated analytics node for these queries. That way the additional load would not impact your other nodes. Let us know if this would suit you. If you have any questions let me know.  Best,  Naomi Naomi from MongoDB

Support Ticket 4 @naomi_pen URGENT HI, Our analytics node was
going great. So great in fact we decided to use the analytics node for a search feature. Now everything is burning. But we don't want to completely turn it off. There's only some queries that are really bad. Can we somehow limit a queries impact? HELP Thx, R Robin from Petflix 1 minute ago

Support Ticket 4 @naomi_pen Re: URGENT Hi Robin, You can
consider setting maxTimeMS (http://bit.ly/maxTimeMS, http://bit.ly/ maxTimeMS2) on the queries.   This setting will protect the database server by ending operations that take longer than maxTimeMS. That said, be aware that:  • In some cases, partial results are returned by terminated operations. • maxTimeMS is not a guarantee. In high-queue situations, operations that are queued but not active may not be terminated until they become active again.   Please note that this is a safeguard, not a solution. To help alleviate the load you should add indexes for commonly used query patterns (to all nodes or only the analytics node).   Best, Naomi Naomi from MongoDB

Support Ticket 5 @naomi_pen How do I query this…? We
started tracking views information by deviceIDs in the collection `deviceviews`. Now we're trying to find all views where `movie_completed` is false. However, I don't know how to do that since the children of `views` are uuids… {   "_id": ObjectId("5be090ca8dd84581f7120e5f"), "deviceId": { user: "Naomi", device: "iPhone", first_seen: "1541364977160" }, "views": { "e5a52daf-f02a-4eca-acaa-690498c8590e": { "id": "e5a52daf-f02a-4eca-acaa-690498c8590e", "movie_title": "It's a wonderful dog life", "movie_completed": false, "time_started": "1541443718297" }, … } } Robin from Petflix 7 minute ago

Support Ticket 5 Investigate the data model @naomi_pen {  
"_id": ObjectId("5be090ca8dd84581f7120e5f"), "deviceId": { user: "Naomi", device: "iPhone", first_seen: "1541364977160" }, "views": { "e5a52daf-f02a-4eca-acaa-690498c8590e": { "id": "e5a52daf-f02a-4eca-acaa-690498c8590e", "movie_title": "It's a wonderful dog life", "movie_completed": false, "time_started": "1541443718297" }, … } }

Support Ticket 5 @naomi_pen Re: How do I query this…?
You can solve this with $objectToArray (code attached) but this would result in extra calculation load. We would suggest restructuring your schema instead: { "_id": "e5a52daf-f02a-4eca-acaa-690498c8590e", "movie_title": "It’s a wonderful dog life", "movie_completed": false, "time_started": "1541443718297" "deviceId": { user: "Naomi", device: "iPhone", first_seen: "1541364977160" }, } } … Naomi from MongoDB

Support Ticket 5 $project: { views: { $objectToArray: "$views" },
deviceId: 1 } @naomi_pen {   "_id": ObjectId("5be090ca8d…"), "deviceId": { user: "Naomi", device: "iPhone", first_seen: "1541364977160" }, "views": { "e5a52daf-…": { "id": "e5a52daf-…", "movie_title": "It's a wonderful dog life", "movie_completed": false, "time_started": "1541443718297" }, "f7d34ced-…": { … }, … } } {   "_id": ObjectId("5be090ca8d…"), "deviceId": { user: "Naomi", device: "iPhone", first_seen: "1541364977160" }, "views" : [ { "k" : "e5a52daf-…", "v" : { "id" : "e5a52daf-…", "movie_title" : "It's a wonderful dog life", "movie_completed" : false, "time_started" : "1541443718297" } }, { "k" : "f7d34ced-…", "v" : { … } } ] }

Support Ticket 5 Investigate the query @naomi_pen ➜ db.deviceviews.aggregate([ {
$project: { views: { $objectToArray: "$views" }, deviceId: 1 } }, { $unwind: "$views" }, { $match: { "views.v.movie_completed": false } } ])

@naomi_pen 1 Indexable stages before unindexable stages Start with stages
that can use indexes (e.g. $sort). Index the foreignField of a $lookup. Aggregated Aggregation Tips 3 2 Start with selective stages. If matching on fields in an array after an $unwind, $match before too. $match documents early 3 Place $limit close to $sort to allow for more efficient sorting. Use $limit 5 maxTimeMS can be used as a safeguard. Use a dedicated analytics node 4 Expensive aggregations that run often may indicate a need for a schema change. Consider changing the data model

@naomi_pen Thank you! Hi folks, We’re off for today. Thank
you so much for all your work!! Have a wonderful day, Robin Robin from Petflix 1 minute ago

@naomi_pen By Naomi Pentrel Thank you! Find me later with
any questions!    Slides are available at speakerdeck.com/npentrel @naomi_pen

Aggregation Framework Performance Pitfalls

Aggregation Framework Performance Pitfalls

More Decks by Naomi Pentrel

Other Decks in Technology

Featured

Transcript