Upgrade to Pro — share decks privately, control downloads, hide ads and more …

MongoDB: What's New in Aggregation

MongoDB: What's New in Aggregation

Presented January 31, 2013 for a 10gen.com webinar.

Presented October 24, 2012 at MongoDB Boston.

Presented September 6, 2012 at An Evening with MongoDB: Orlando.

Presented March 3, 2012 at MidwestPHP: https://joind.in/8223

Reveal.js presentation published at: http://jmikola.github.com/slides/mongodb_aggregation_framework/

Jeremy Mikola

March 03, 2013
Tweet

More Decks by Jeremy Mikola

Other Decks in Programming

Transcript

  1. STATE OF AGGREGATION We're storing our data in MongoDB. We

    need to do ad-hoc reporting, grouping, common aggregations, etc. What are we using for this?
  2. MAPREDUCE Extremely versatile, powerful Intended for complex data analysis Overkill

    for simple aggregation tasks Averages Summation Grouping
  3. MAPREDUCE IN MONGODB Implemented with JavaScript Single-threaded Difficult to debug

    Concurrency Appearance of parallelism Write locks (without i n l i n e or j s M o d e )
  4. AGGREGATION FRAMEWORK Declared with BSON, executes in C++ Flexible, functional,

    and simple Operation pipeline Computational expressions Plays nice with sharding
  5. PIPELINE Process a stream of documents Original input is a

    collection Final output is a result document Series of operators Filter or transform data Input/output chain p s a x | g r e p m o n g o d | h e a d ‐ n 1
  6. PIPELINE OPERATORS $ m a t c h $ p

    r o j e c t $ g r o u p $ u n w i n d $ s o r t $ l i m i t $ s k i p
  7. OUR EXAMPLE DATA Library Books { _ i d :

    3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , l a n g u a g e : " E n g l i s h " , p u b l i s h e r : { c i t y : " L o n d o n " , n a m e : " R a n d o m H o u s e " } }
  8. $MATCH Filter documents Uses existing query syntax No geospatial operations

    or $ w h e r e $ w i t h i n support coming in 2.4
  9. ► ▼ $MATCH Matching field values { t i t

    l e : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } { $ m a t c h : { l a n g u a g e : " R u s s i a n " } } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " }
  10. ► ▼ $MATCH Matching with query operators { t i

    t l e : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } { $ m a t c h : { p a g e s : { $ g t : 1 0 0 0 } } } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " }
  11. ► ▼ $PROJECT Including and excluding fields { _ i

    d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , l a n g u a g e : " E n g l i s h " } { $ p r o j e c t : { _ i d : 0 , t i t l e : 1 , l a n g u a g e : 1 } } { t i t l e : " T h e G r e a t G a t s b y " , l a n g u a g e : " E n g l i s h " }
  12. ► ▼ $PROJECT Renaming and computing fields { _ i

    d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , l a n g u a g e : " E n g l i s h " } { $ p r o j e c t : { a v g P a g e s P e r C h a p t e r : { $ d i v i d e : [ " $ p a g e s " , " $ c h a p t e r s " ] } , l a n g : " $ l a n g u a g e " } } { _ i d : 3 7 5 , a v g P a g e s P e r C h a p t e r : 2 4 . 2 2 2 2 2 2 2 2 2 2 2 2 2 2 , l a n g : " E n g l i s h " }
  13. ► ▼ $PROJECT Creating and extracting sub-document fields { _

    i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , p u b l i s h e r : { c i t y : " L o n d o n " , n a m e : " R a n d o m H o u s e " } } { $ p r o j e c t : { t i t l e : 1 , s t a t s : { p a g e s : " $ p a g e s " , c h a p t e r s : " $ c h a p t e r s " , } , p u b _ c i t y : " $ p u b l i s h e r . c i t y " } } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s t a t s : { p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } , p u b _ c i t y : " L o n d o n " }
  14. $GROUP Group documents by an ID Field reference, object, constant

    Other output fields are computed $ m a x , $ m i n , $ a v g , $ s u m $ a d d T o S e t , $ p u s h $ f i r s t , $ l a s t Processes all data in memory
  15. ► ▼ $GROUP Calculating an average { t i t

    l e : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } { $ g r o u p : { _ i d : " $ l a n g u a g e " , a v g P a g e s : { $ a v g : " $ p a g e s " } } } { _ i d : " R u s s i a n " , a v g P a g e s : 1 4 4 0 } { _ i d : " E n g l i s h " , a v g P a g e s : 6 5 3 }
  16. ► ▼ $GROUP Summating fields and counting { t i

    t l e : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } { $ g r o u p : { _ i d : " $ l a n g u a g e " , n u m T i t l e s : { $ s u m : 1 } , s u m P a g e s : { $ s u m : " $ p a g e s " } } } { _ i d : " R u s s i a n " , n u m T i t l e s : 1 , s u m P a g e s : 1 4 4 0 } { _ i d : " E n g l i s h " , n u m T i t l e s : 2 , s u m P a g e s : 1 3 0 6 }
  17. ► ▼ $GROUP Collecting distinct values { t i t

    l e : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } { $ g r o u p : { _ i d : " $ l a n g u a g e " , t i t l e s : { $ a d d T o S e t : " $ t i t l e " } } } { _ i d : " R u s s i a n " , t i t l e s : [ " W a r a n d P e a c e " ] } { _ i d : " E n g l i s h " , t i t l e s : [ " A t l a s S h r u g g e d " , " T h e G r e a t G a t s b y " ] }
  18. $UNWIND Operate on an array field Yield new documents for

    each array element Array replaced by element value Missing/empty fields → no output Non-array fields → error Pipe to $ g r o u p to aggregate array values
  19. ► ▼ $UNWIND Yielding multiple documents from one { _

    i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] } { $ u n w i n d : " $ s u b j e c t s " } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : " L o n g I s l a n d " } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : " N e w Y o r k " } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : " 1 9 2 0 s " }
  20. $SORT, $LIMIT, $SKIP Sort documents by one or more fields

    Same order syntax as cursors Waits for earlier pipeline operator to return In-memory unless early and indexed Limit and skip follow cursor behavior
  21. ► ▼ $SORT Sort all documents in the pipeline {

    t i t l e : " T h e G r e a t G a t s b y " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " T h e G r a p e s o f W r a t h " } { t i t l e : " A n i m a l F a r m " } { t i t l e : " L o r d o f t h e F l i e s " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { $ s o r t : { t i t l e : 1 } } { t i t l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } { t i t l e : " L o r d o f t h e F l i e s " } { t i t l e : " T h e G r a p e s o f W r a t h " } { t i t l e : " T h e G r e a t G a t s b y " }
  22. ► ▼ $LIMIT Limit documents through the pipeline { t

    i t l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } { t i t l e : " L o r d o f t h e F l i e s " } { t i t l e : " T h e G r a p e s o f W r a t h " } { t i t l e : " T h e G r e a t G a t s b y " } { $ l i m i t : 5 } { t i t l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " }
  23. ► ▼ $SKIP Skip over documents in the pipeline {

    t i t l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } { $ s k i p : 2 } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " }
  24. EXPRESSIONS Return computed values Used with $ p r o

    j e c t and $ g r o u p Reference fields using $ (e.g. " $ x " ) Expressions may be nested
  25. EXPRESSIONS Logic $ a n d , $ o r

    , $ n o t … Comparison $ c m p , $ e q , $ g t … Arithmetic $ a d d , $ d i v i d e … String $ s t r c a s e c m p , $ s u b s t r … Date $ y e a r , $ d a y O f M o n t h … Conditional $ c o n d , $ i f N u l l …
  26. USAGE a g g r e g a t e

    database command c o l l e c t i o n . a g g r e g a t e ( ) method Mongo shell Most drivers
  27. COLLECTION METHOD ▼ d b . b o o k

    s . a g g r e g a t e ( [ { $ s o r t : { c r e a t e d : 1 } } , { $ u n w i n d : " $ s u b j e c t s " } , { $ g r o u p : { _ i d : " $ s u b j e c t s " , n : { $ s u m : 1 } , f c : { $ f i r s t : " $ c r e a t e d " } } } , { $ p r o j e c t : { _ i d : 1 , n : 1 , f c : { $ y e a r : " $ f c " } } } ] ) ; { r e s u l t : [ { " _ i d " : " F a n t a s y " , " n u m " : 6 , " f c " : 2 0 0 8 } , { " _ i d " : " H i s t o r i c a l " , " n u m " : 7 , " f c " : 2 0 1 2 } , { " _ i d " : " W o r l d L i t e r a t u r e " , " n " : 2 , " f c " : 2 0 0 9 } / / O t h e r r e s u l t s f o l l o w … ] , o k : 1 }
  28. DATABASE COMMAND ▼ d b . r u n C

    o m m a n d ( { a g g r e g a t e : " b o o k s " , p i p e l i n e : [ { $ s o r t : { c r e a t e d : 1 } } , { $ u n w i n d : " $ s u b j e c t s " } , { $ g r o u p : { _ i d : " $ s u b j e c t s " , n : { $ s u m : 1 } , f c : { $ f i r s t : " $ c r e a t e d " } } } , { $ p r o j e c t : { _ i d : 1 , n : 1 , f c : { $ y e a r : " $ f c " } } } ] } ) ; { r e s u l t : [ { " _ i d " : " F a n t a s y " , " n u m " : 6 , " f c " : 2 0 0 8 } , { " _ i d " : " H i s t o r i c a l " , " n u m " : 7 , " f c " : 2 0 1 2 } , { " _ i d " : " W o r l d L i t e r a t u r e " , " n " : 2 , " f c " : 2 0 0 9 } / / O t h e r r e s u l t s f o l l o w … ] , o k : 1 }
  29. LIMITATIONS Result limited by BSON document size Final command result

    Intermediate shard results Pipeline operator memory limits
  30. SHARDING Split the pipeline at first $ g r o

    u p or $ s o r t Shards execute pipeline up to that point mongos merges results and continues Early $ m a t c h may excuse shards CPU and memory implications for mongos
  31. SHARDING [ { $ m a t c h :

    { / * f i l t e r b y s h a r d k e y * / } } , { $ g r o u p : { / * g r o u p b y s o m e f i e l d * / } } , { $ s o r t : { / * s o r t b y s o m e f i e l d * / } } , { $ p r o j e c t : { / * r e s h a p e r e s u l t * / } } ]
  32. SHARDING shard1 $ m a t c h $ g

    r o u p 1 ↘ shard2 $ m a t c h $ g r o u p 1 ↓ shard3 mongos $ g r o u p 2 $ s o r t $ p r o j e c t ↓ Result
  33. EXTENDING THE FRAMEWORK Adding new pipeline operators, expressions $ w

    i t h i n expression for $ m a t c h $ g e o N e a r pipeline operator $ o u t operator for output control
  34. FUTURE ENHANCEMENTS Improved handling of n u l l values

    Optimizing $ m a t c h position Pipeline explain facility Support BSON binary, code, etc. Memory usage improvements Grouping input sorted by _ i d Sorting with limited output (top k)
  35. ENABLING DEVELOPERS Doing more within MongoDB, faster Refactoring MapReduce and

    groupings Replace pages of JavaScript Longer aggregation pipelines Quick aggregations from the shell