MongoDB: What's New in Aggregation

WHAT'S NEW IN AGGREGATION Jeremy Mikola @jmikola

State of Aggregation Pipeline Expressions Usage and Limitations Sharding Looking
Ahead AGENDA

STATE OF AGGREGATION We're storing our data in MongoDB. We
need to do ad-hoc reporting, grouping, common aggregations, etc. What are we using for this?

DATA WAREHOUSING SQL for reporting and analytics Infrastructure complications Additional
maintenance Data duplication ETL processes Real time?

MAPREDUCE Extremely versatile, powerful Intended for complex data analysis Overkill
for simple aggregation tasks Averages Summation Grouping

MAPREDUCE IN MONGODB Implemented with JavaScript Single-threaded Difficult to debug
Concurrency Appearance of parallelism Write locks (without i n l i n e or j s M o d e )

And now for something completely different…

AGGREGATION FRAMEWORK Declared with BSON, executes in C++ Flexible, functional,
and simple Operation pipeline Computational expressions Plays nice with sharding

PIPELINE

PIPELINE Process a stream of documents Original input is a
collection Final output is a result document Series of operators Filter or transform data Input/output chain p s a x | g r e p m o n g o d | h e a d ‐ n 1

PIPELINE OPERATORS $ m a t c h $ p
r o j e c t $ g r o u p $ u n w i n d $ s o r t $ l i m i t $ s k i p

OUR EXAMPLE DATA Library Books { _ i d :
3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , l a n g u a g e : " E n g l i s h " , p u b l i s h e r : { c i t y : " L o n d o n " , n a m e : " R a n d o m H o u s e " } }

$MATCH Filter documents Uses existing query syntax No geospatial operations
or $ w h e r e $ w i t h i n support coming in 2.4

► ▼ $MATCH Matching field values { t i t
l e : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } { $ m a t c h : { l a n g u a g e : " R u s s i a n " } } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " }

► ▼ $MATCH Matching with query operators { t i
t l e : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } { $ m a t c h : { p a g e s : { $ g t : 1 0 0 0 } } } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " }

$PROJECT Reshape documents Include, exclude or rename fields Inject computed
fields Manipulate sub-document fields

► ▼ $PROJECT Including and excluding fields { _ i
d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , l a n g u a g e : " E n g l i s h " } { $ p r o j e c t : { _ i d : 0 , t i t l e : 1 , l a n g u a g e : 1 } } { t i t l e : " T h e G r e a t G a t s b y " , l a n g u a g e : " E n g l i s h " }

► ▼ $PROJECT Renaming and computing fields { _ i
d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , l a n g u a g e : " E n g l i s h " } { $ p r o j e c t : { a v g P a g e s P e r C h a p t e r : { $ d i v i d e : [ " $ p a g e s " , " $ c h a p t e r s " ] } , l a n g : " $ l a n g u a g e " } } { _ i d : 3 7 5 , a v g P a g e s P e r C h a p t e r : 2 4 . 2 2 2 2 2 2 2 2 2 2 2 2 2 2 , l a n g : " E n g l i s h " }

► ▼ $PROJECT Creating and extracting sub-document fields { _
i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , p u b l i s h e r : { c i t y : " L o n d o n " , n a m e : " R a n d o m H o u s e " } } { $ p r o j e c t : { t i t l e : 1 , s t a t s : { p a g e s : " $ p a g e s " , c h a p t e r s : " $ c h a p t e r s " , } , p u b _ c i t y : " $ p u b l i s h e r . c i t y " } } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s t a t s : { p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } , p u b _ c i t y : " L o n d o n " }

$GROUP Group documents by an ID Field reference, object, constant
Other output fields are computed $ m a x , $ m i n , $ a v g , $ s u m $ a d d T o S e t , $ p u s h $ f i r s t , $ l a s t Processes all data in memory

► ▼ $GROUP Calculating an average { t i t
l e : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } { $ g r o u p : { _ i d : " $ l a n g u a g e " , a v g P a g e s : { $ a v g : " $ p a g e s " } } } { _ i d : " R u s s i a n " , a v g P a g e s : 1 4 4 0 } { _ i d : " E n g l i s h " , a v g P a g e s : 6 5 3 }

► ▼ $GROUP Summating fields and counting { t i
t l e : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } { $ g r o u p : { _ i d : " $ l a n g u a g e " , n u m T i t l e s : { $ s u m : 1 } , s u m P a g e s : { $ s u m : " $ p a g e s " } } } { _ i d : " R u s s i a n " , n u m T i t l e s : 1 , s u m P a g e s : 1 4 4 0 } { _ i d : " E n g l i s h " , n u m T i t l e s : 2 , s u m P a g e s : 1 3 0 6 }

► ▼ $GROUP Collecting distinct values { t i t
l e : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } { $ g r o u p : { _ i d : " $ l a n g u a g e " , t i t l e s : { $ a d d T o S e t : " $ t i t l e " } } } { _ i d : " R u s s i a n " , t i t l e s : [ " W a r a n d P e a c e " ] } { _ i d : " E n g l i s h " , t i t l e s : [ " A t l a s S h r u g g e d " , " T h e G r e a t G a t s b y " ] }

$UNWIND Operate on an array field Yield new documents for
each array element Array replaced by element value Missing/empty fields → no output Non-array fields → error Pipe to $ g r o u p to aggregate array values

► ▼ $UNWIND Yielding multiple documents from one { _
i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] } { $ u n w i n d : " $ s u b j e c t s " } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : " L o n g I s l a n d " } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : " N e w Y o r k " } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : " 1 9 2 0 s " }

$SORT, $LIMIT, $SKIP Sort documents by one or more fields
Same order syntax as cursors Waits for earlier pipeline operator to return In-memory unless early and indexed Limit and skip follow cursor behavior

► ▼ $SORT Sort all documents in the pipeline {
t i t l e : " T h e G r e a t G a t s b y " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " T h e G r a p e s o f W r a t h " } { t i t l e : " A n i m a l F a r m " } { t i t l e : " L o r d o f t h e F l i e s " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { $ s o r t : { t i t l e : 1 } } { t i t l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } { t i t l e : " L o r d o f t h e F l i e s " } { t i t l e : " T h e G r a p e s o f W r a t h " } { t i t l e : " T h e G r e a t G a t s b y " }

► ▼ $LIMIT Limit documents through the pipeline { t
i t l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } { t i t l e : " L o r d o f t h e F l i e s " } { t i t l e : " T h e G r a p e s o f W r a t h " } { t i t l e : " T h e G r e a t G a t s b y " } { $ l i m i t : 5 } { t i t l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " }

► ▼ $SKIP Skip over documents in the pipeline {
t i t l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } { $ s k i p : 2 } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " }

EXPRESSIONS Expressions Usage and Limitations Sharding Looking Ahead State of
Aggregation Pipeline

EXPRESSIONS Return computed values Used with $ p r o
j e c t and $ g r o u p Reference fields using $ (e.g. " $ x " ) Expressions may be nested

EXPRESSIONS Logic $ a n d , $ o r
, $ n o t … Comparison $ c m p , $ e q , $ g t … Arithmetic $ a d d , $ d i v i d e … String $ s t r c a s e c m p , $ s u b s t r … Date $ y e a r , $ d a y O f M o n t h … Conditional $ c o n d , $ i f N u l l …

USAGE Usage and Limitations Sharding Looking Ahead State of Aggregation
Pipeline Expressions

USAGE a g g r e g a t e
database command c o l l e c t i o n . a g g r e g a t e ( ) method Mongo shell Most drivers

COLLECTION METHOD ▼ d b . b o o k
s . a g g r e g a t e ( [ { $ s o r t : { c r e a t e d : 1 } } , { $ u n w i n d : " $ s u b j e c t s " } , { $ g r o u p : { _ i d : " $ s u b j e c t s " , n : { $ s u m : 1 } , f c : { $ f i r s t : " $ c r e a t e d " } } } , { $ p r o j e c t : { _ i d : 1 , n : 1 , f c : { $ y e a r : " $ f c " } } } ] ) ; { r e s u l t : [ { " _ i d " : " F a n t a s y " , " n u m " : 6 , " f c " : 2 0 0 8 } , { " _ i d " : " H i s t o r i c a l " , " n u m " : 7 , " f c " : 2 0 1 2 } , { " _ i d " : " W o r l d L i t e r a t u r e " , " n " : 2 , " f c " : 2 0 0 9 } / / O t h e r r e s u l t s f o l l o w … ] , o k : 1 }

DATABASE COMMAND ▼ d b . r u n C
o m m a n d ( { a g g r e g a t e : " b o o k s " , p i p e l i n e : [ { $ s o r t : { c r e a t e d : 1 } } , { $ u n w i n d : " $ s u b j e c t s " } , { $ g r o u p : { _ i d : " $ s u b j e c t s " , n : { $ s u m : 1 } , f c : { $ f i r s t : " $ c r e a t e d " } } } , { $ p r o j e c t : { _ i d : 1 , n : 1 , f c : { $ y e a r : " $ f c " } } } ] } ) ; { r e s u l t : [ { " _ i d " : " F a n t a s y " , " n u m " : 6 , " f c " : 2 0 0 8 } , { " _ i d " : " H i s t o r i c a l " , " n u m " : 7 , " f c " : 2 0 1 2 } , { " _ i d " : " W o r l d L i t e r a t u r e " , " n " : 2 , " f c " : 2 0 0 9 } / / O t h e r r e s u l t s f o l l o w … ] , o k : 1 }

LIMITATIONS Result limited by BSON document size Final command result
Intermediate shard results Pipeline operator memory limits

SHARDING Sharding Looking Ahead State of Aggregation Pipeline Expressions Usage
and Limitations

SHARDING Split the pipeline at first $ g r o
u p or $ s o r t Shards execute pipeline up to that point mongos merges results and continues Early $ m a t c h may excuse shards CPU and memory implications for mongos

SHARDING [ { $ m a t c h :
{ / * f i l t e r b y s h a r d k e y * / } } , { $ g r o u p : { / * g r o u p b y s o m e f i e l d * / } } , { $ s o r t : { / * s o r t b y s o m e f i e l d * / } } , { $ p r o j e c t : { / * r e s h a p e r e s u l t * / } } ]

SHARDING shard1 $ m a t c h $ g
r o u p 1 ↘ shard2 $ m a t c h $ g r o u p 1 ↓ shard3 mongos $ g r o u p 2 $ s o r t $ p r o j e c t ↓ Result

LOOKING AHEAD Looking Ahead State of Aggregation Pipeline Expressions Usage
and Limitations Sharding

FRAMEWORK USE CASES Basic aggregation queries Ad-hoc reporting Real-time analytics
Visualizing time series data

EXTENDING THE FRAMEWORK Adding new pipeline operators, expressions $ w
i t h i n expression for $ m a t c h $ g e o N e a r pipeline operator $ o u t operator for output control

FUTURE ENHANCEMENTS Improved handling of n u l l values
Optimizing $ m a t c h position Pipeline explain facility Support BSON binary, code, etc. Memory usage improvements Grouping input sorted by _ i d Sorting with limited output (top k)

ENABLING DEVELOPERS Doing more within MongoDB, faster Refactoring MapReduce and
groupings Replace pages of JavaScript Longer aggregation pipelines Quick aggregations from the shell

THANKS! Server and drivers Aggregation documentation Time-series visualization demo app
MongoDB shell enhancements for hackers QUESTIONS?

PHOTO CREDITS http://dilbert.com/strips/comic/2012-09-05 http://www.flickr.com/photos/toolstop/4324416999 http://www.ristart.ee/web2/files/Product/large/13443203307.jpg http://www.flickr.com/photos/vascorola/3164882131 http://www.flickr.com/photos/capcase/4970062156 http://img.timeinc.net/time/photoessays/2009/monty_python/monty_python_02.jpg

MongoDB: What's New in Aggregation

MongoDB: What's New in Aggregation

More Decks by Jeremy Mikola

Other Decks in Programming

Featured

Transcript