Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Real-time Data Aggregation with MongoDB

Real-time Data Aggregation with MongoDB

Presented October 5, 2013 at PHP South Africa: https://joind.in/talk/view/9465

Presented May 24, 2013 at Symfony Live: Portland: https://joind.in/talk/view/8663

Reveal.js presentation published at: http://jmikola.github.io/slides/mongodb_real_time_aggregation/

Jeremy Mikola

October 05, 2013
Tweet

More Decks by Jeremy Mikola

Other Decks in Programming

Transcript

  1. SETTING THE STAGE We have data stored in MongoDB. We

    need to do ad-hoc reporting, grouping, common aggregations, etc. What can we use for this?
  2. SCHEMA-BASED Documents hold pre-aggregated data Processing at time of insert/update

    Optimized for real-time queries In-place updates avoid fragmentation by MongoDB for Analytics John Nunemaker
  3. Count d b . r e c o r d

    s . i n s e r t ( { a : 1 , b : 0 } ) ; d b . r e c o r d s . i n s e r t ( { a : 1 , b : 1 } ) ; d b . r e c o r d s . i n s e r t ( { a : 1 , b : 4 } ) ; d b . r e c o r d s . i n s e r t ( { a : 2 , b : 2 } ) ; d b . r e c o r d s . c o u n t ( ) ; / / R e t u r n s 4 d b . r e c o r d s . c o u n t ( { a : 1 } ) ; / / R e t u r n s 3
  4. Distinct d b . o r d e r s

    . i n s e r t ( { c u s t o m e r : " A 1 2 3 " , a m o u n t : 5 0 0 , s t a t u s : " A " } ) ; d b . o r d e r s . i n s e r t ( { c u s t o m e r : " A 1 2 3 " , a m o u n t : 2 5 0 , s t a t u s : " A " } ) ; d b . o r d e r s . i n s e r t ( { c u s t o m e r : " B 2 1 2 " , a m o u n t : 2 0 0 , s t a t u s : " A " } ) ; d b . o r d e r s . i n s e r t ( { c u s t o m e r : " A 1 2 3 " , a m o u n t : 3 0 0 , s t a t u s : " D " } ) ; d b . r e c o r d s . d i s t i n c t ( ) ; / / R e t u r n s [ " A 1 2 3 " , " B 2 1 2 " ]
  5. Group d b . r e c o r d

    s . i n s e r t ( { a : 1 , c o u n t : 4 } ) ; d b . r e c o r d s . i n s e r t ( { a : 1 , c o u n t : 2 } ) ; d b . r e c o r d s . i n s e r t ( { a : 1 , c o u n t : 4 } ) ; d b . r e c o r d s . i n s e r t ( { a : 2 , c o u n t : 3 } ) ; d b . r e c o r d s . i n s e r t ( { a : 2 , c o u n t : 1 } ) ; d b . r e c o r d s . i n s e r t ( { a : 1 , c o u n t : 5 } ) ; d b . r e c o r d s . i n s e r t ( { a : 4 , c o u n t : 4 } ) ; d b . r e c o r d s . g r o u p ( { k e y : { a : 1 } , c o n d : { a : { $ l t : 3 } } , r e d u c e : f u n c t i o n ( c u r , r e s u l t ) { r e s u l t . c o u n t + = c u r . c o u n t } , i n i t i a l : { c o u n t : 0 } } ) ; / / R e t u r n s [ { a : 1 , c o u n t : 1 5 } , { a : 2 , c o u n t : 4 } ]
  6. MAPREDUCE Extremely versatile, powerful Intended for complex data analysis Overkill

    for simple aggregation tasks Averages Summation Grouping
  7. MapReduce in MongoDB Implemented with JavaScript Convenient, but difficult to

    debug for batches Better concurrency with V8 Incremental map-reduce
  8. MapReduce in Action: MongoQP MongoDB includes a Use map-reduce for

    query log analysis Parse BSON query objects with JavaScript Group and aggregate similar queries database profiler
  9. Profile Data { " o p " : " q

    u e r y " , " n s " : " d b . c o l l e c t i o n " , " q u e r y " : { … } , " n t o r e t u r n " : 0 , " n t o s k i p " : 0 , " n s c a n n e d " : 1 1 4 2 6 , " l o c k S t a t s " : { … } , " n r e t u r n e d " : 0 , " r e s p o n s e L e n g t h " : 2 0 , " m i l l i s " : 1 2 , " t s " : I S O D a t e ( " 2 0 1 3 - 0 5 - 2 3 T 2 1 : 2 4 : 3 9 . 3 2 7 Z " ) , }
  10. AGGREGATION FRAMEWORK Declared with BSON, like queries Executes in C++,

    not JavaScript Flexible, functional, and simple Operation pipeline Computational expressions
  11. PIPELINE Process a stream of documents Original input is a

    collection Final output is a result document Series of operators Filter or transform data Input/output chain p s a x | g r e p m o n g o d | h e a d - n 1
  12. PIPELINE OPERATORS $ m a t c h $ p

    r o j e c t $ g r o u p $ u n w i n d $ s o r t $ l i m i t $ s k i p $ g e o N e a r
  13. OUR EXAMPLE DATA Library Books { _ i d :

    3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , l a n g u a g e : " E n g l i s h " , p u b l i s h e r : { c i t y : " L o n d o n " , n a m e : " R a n d o m H o u s e " } }
  14. $MATCH Matching field values { t i t l e

    : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } ► ▼ { $ m a t c h : { l a n g u a g e : " R u s s i a n " } } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " }
  15. $MATCH Matching with query operators { t i t l

    e : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } ► ▼ { $ m a t c h : { p a g e s : { $ g t : 1 0 0 0 } } } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " }
  16. $PROJECT Including and excluding fields { _ i d :

    3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , l a n g u a g e : " E n g l i s h " } ► ▼ { $ p r o j e c t : { _ i d : 0 , t i t l e : 1 , l a n g u a g e : 1 } } { t i t l e : " T h e G r e a t G a t s b y " , l a n g u a g e : " E n g l i s h " }
  17. $PROJECT Renaming and computing fields { _ i d :

    3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , l a n g u a g e : " E n g l i s h " } ► ▼ { $ p r o j e c t : { a v g P a g e s P e r C h a p t e r : { $ d i v i d e : [ " $ p a g e s " , " $ c h a p t e r s " ] } , l a n g : " $ l a n g u a g e " } } { _ i d : 3 7 5 , a v g P a g e s P e r C h a p t e r : 2 4 . 2 2 2 2 2 2 2 2 2 2 , l a n g : " E n g l i s h " }
  18. $PROJECT Creating and extracting sub-document fields { _ i d

    : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , p u b l i s h e r : { c i t y : " L o n d o n " , n a m e : " R a n d o m H o u s e " } } ► ▼ { $ p r o j e c t : { t i t l e : 1 , s t a t s : { p a g e s : " $ p a g e s " , c h a p t e r s : " $ c h a p t e r s " , } , p u b _ c i t y : " $ p u b l i s h e r . c i t y " } } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s t a t s : { p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } , p u b _ c i t y : " L o n d o n " }
  19. $GROUP Group documents by an ID Field reference, object, constant

    Other output fields are computed $ m a x , $ m i n , $ a v g , $ s u m $ a d d T o S e t , $ p u s h $ f i r s t , $ l a s t
  20. $GROUP Calculating an average { t i t l e

    : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } ► ▼ { $ g r o u p : { _ i d : " $ l a n g u a g e " , a v g P a g e s : { $ a v g : " $ p a g e s " } } } { _ i d : " R u s s i a n " , a v g P a g e s : 1 4 4 0 } { _ i d : " E n g l i s h " , a v g P a g e s : 6 5 3 }
  21. $GROUP Summating fields and counting { t i t l

    e : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } ► ▼ { $ g r o u p : { _ i d : " $ l a n g u a g e " , n u m T i t l e s : { $ s u m : 1 } , s u m P a g e s : { $ s u m : " $ p a g e s " } } } { _ i d : " R u s s i a n " , n u m T i t l e s : 1 , s u m P a g e s : 1 4 4 0 } { _ i d : " E n g l i s h " , n u m T i t l e s : 2 , s u m P a g e s : 1 3 0 6 }
  22. $GROUP Collecting distinct values { t i t l e

    : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } ► ▼ { $ g r o u p : { _ i d : " $ l a n g u a g e " , t i t l e s : { $ a d d T o S e t : " $ t i t l e " } } } { _ i d : " R u s s i a n " , t i t l e s : [ " W a r a n d P e a c e " ] } { _ i d : " E n g l i s h " , t i t l e s : [ " A t l a s S h r u g g e d " , " T h e G r e a t G a t s b y " ] }
  23. $UNWIND Operate on an array field For each element, yield

    a document Pipe to $ g r o u p to aggregate array values
  24. $UNWIND Yielding multiple documents from one { _ i d

    : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] } ► ▼ { $ u n w i n d : " $ s u b j e c t s " } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : " L o n g I s l a n d " } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : " N e w Y o r k " } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : " 1 9 2 0 s " }
  25. $SORT, $LIMIT, $SKIP Sort documents by one or more fields

    Same order syntax as Limit and skip follow cursor behavior cursors
  26. $SORT Sort all documents in the pipeline { t i

    t l e : " T h e G r e a t G a t s b y " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " T h e G r a p e s o f W r a t h " } { t i t l e : " A n i m a l F a r m " } { t i t l e : " L o r d o f t h e F l i e s " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } { t i t l e : " F a h r e n h e i t 4 5 1 " } ► ▼ { $ s o r t : { t i t l e : 1 } } { t i t l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } { t i t l e : " L o r d o f t h e F l i e s " } { t i t l e : " T h e G r a p e s o f W r a t h " } { t i t l e : " T h e G r e a t G a t s b y " }
  27. $LIMIT Limit documents through the pipeline { t i t

    l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } { t i t l e : " L o r d o f t h e F l i e s " } { t i t l e : " T h e G r a p e s o f W r a t h " } { t i t l e : " T h e G r e a t G a t s b y " } ► ▼ { $ l i m i t : 5 } { t i t l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " }
  28. $SKIP Skip over documents in the pipeline { t i

    t l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } ► ▼ { $ s k i p : 2 } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " }
  29. $GEONEAR Only used to start the pipeline Returns documents near

    a point, in order Proxies the command Supports the same options Adds a d i s t a n c e F i e l d option geoNear
  30. EXPRESSIONS Return computed values Used with $ p r o

    j e c t and $ g r o u p Reference fields using $ (e.g. " $ x " ) Expressions may be nested
  31. EXPRESSIONS Logic $ a n d , $ o r

    , $ n o t … Comparison $ c m p , $ e q , $ g t … Arithmetic $ a d d , $ d i v i d e … String $ s t r c a s e c m p , $ s u b s t r … Date $ y e a r , $ d a y O f M o n t h … Conditional $ c o n d , $ i f N u l l …
  32. USAGE a g g r e g a t e

    database command c o l l e c t i o n . a g g r e g a t e ( ) shell helper in PHP MongoCollection::aggregate()
  33. COLLECTION METHOD d b . b o o k s

    . a g g r e g a t e ( [ { $ s o r t : { c r e a t e d : 1 } } , { $ u n w i n d : " $ s u b j e c t s " } , { $ g r o u p : { _ i d : " $ s u b j e c t s " , n : { $ s u m : 1 } , f c : { $ f i r s t : " $ c r e a t e d " } } } , { $ p r o j e c t : { _ i d : 1 , n : 1 , f c : { $ y e a r : " $ f c " } } } ] ) ; ▼ { r e s u l t : [ { " _ i d " : " F a n t a s y " , " n u m " : 6 , " f c " : 2 0 0 8 } , { " _ i d " : " H i s t o r i c a l " , " n u m " : 7 , " f c " : 2 0 1 2 } , { " _ i d " : " W o r l d L i t e r a t u r e " , " n " : 2 , " f c " : 2 0 0 9 } / / O t h e r r e s u l t s f o l l o w … ] , o k : 1 }
  34. DATABASE COMMAND d b . r u n C o

    m m a n d ( { a g g r e g a t e : " b o o k s " , p i p e l i n e : [ { $ s o r t : { c r e a t e d : 1 } } , { $ u n w i n d : " $ s u b j e c t s " } , { $ g r o u p : { _ i d : " $ s u b j e c t s " , n : { $ s u m : 1 } , f c : { $ f i r s t : " $ c r e a t e d " } } } , { $ p r o j e c t : { _ i d : 1 , n : 1 , f c : { $ y e a r : " $ f c " } } } ] } ) ; ▼ { r e s u l t : [ { " _ i d " : " F a n t a s y " , " n u m " : 6 , " f c " : 2 0 0 8 } , { " _ i d " : " H i s t o r i c a l " , " n u m " : 7 , " f c " : 2 0 1 2 } , { " _ i d " : " W o r l d L i t e r a t u r e " , " n " : 2 , " f c " : 2 0 0 9 } / / O t h e r r e s u l t s f o l l o w … ] , o k : 1 }
  35. LIMITATIONS Result limited by BSON document size Final command result

    Intermediate shard results Pipeline operator memory limits
  36. Aggregation Framework in Action Stock ticker data Real-time candlestick charts

    Open/close rates per minute Plot peak points and averages Time-series visualization
  37. EXAMPLE DATA { " _ i d " : O

    b j e c t I d ( " 4 f 4 b 8 9 1 6 f b 1 c 8 0 e 1 4 1 e a 6 2 0 1 " ) , " a s k " : 1 . 3 0 0 2 8 , " b i d " : 1 . 3 0 0 2 , " t s " : I S O D a t e ( " 2 0 1 2 - 0 2 - 1 6 T 1 2 : 4 8 : 0 0 Z " ) }
  38. SHARDING Split the pipeline at first $ g r o

    u p or $ s o r t Shards execute pipeline up to that point mongos merges results and continues Early $ m a t c h may excuse shards
  39. SHARDING [ { $ m a t c h :

    { / * f i l t e r b y s h a r d k e y * / } } , { $ g r o u p : { / * g r o u p b y s o m e f i e l d * / } } , { $ s o r t : { / * s o r t b y s o m e f i e l d * / } } , { $ p r o j e c t : { / * r e s h a p e r e s u l t * / } } ]
  40. SHARDING shard1 $ m a t c h $ g

    r o u p 1 ↘ shard2 $ m a t c h $ g r o u p 1 ↓ shard3 mongos $ g r o u p 2 $ s o r t $ p r o j e c t ↓ Result
  41. Future Enhancements Coming in MongoDB 2.6 operator for output control

    Down the road $ o u t Pipeline explain facility Optimizing $ m a t c h position Grouping sorted input
  42. Enabling Developers Doing more within MongoDB, faster Refactoring MapReduce and

    groupings Replace pages of JavaScript Longer aggregation pipelines Quick aggregations from the shell
  43. THANKS! Server and drivers Aggregation documentation MapReduce query profiler demo

    app Time-series visualization demo app MongoDB shell enhancements for hackers Questions?