Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Real-time Data Aggregation with MongoDB

Real-time Data Aggregation with MongoDB

Presented October 5, 2013 at PHP South Africa: https://joind.in/talk/view/9465

Presented May 24, 2013 at Symfony Live: Portland: https://joind.in/talk/view/8663

Reveal.js presentation published at: http://jmikola.github.io/slides/mongodb_real_time_aggregation/

F23700b51dc0c196c1dc02f84aeeecdf?s=128

Jeremy Mikola

October 05, 2013
Tweet

Transcript

  1. REAL-TIME DATA AGGREGATION WITH Jeremy Mikola @jmikola

  2. AGENDA Available Tools Schemas Commands MapReduce Aggregation Framework Pipeline, Expressions

    Usage, Sharding Looking Ahead
  3. SETTING THE STAGE We have data stored in MongoDB. We

    need to do ad-hoc reporting, grouping, common aggregations, etc. What can we use for this?
  4. DATA WAREHOUSING SQL for reporting and analytics Infrastructure complications Additional

    maintenance Data duplication ETL processes Real time?
  5. SCHEMA-BASED Documents hold pre-aggregated data Processing at time of insert/update

    Optimized for real-time queries In-place updates avoid fragmentation by MongoDB for Analytics John Nunemaker
  6. SINGLE-PURPOSE COMMANDS count distinct group

  7. Count d b . r e c o r d

    s . i n s e r t ( { a : 1 , b : 0 } ) ; d b . r e c o r d s . i n s e r t ( { a : 1 , b : 1 } ) ; d b . r e c o r d s . i n s e r t ( { a : 1 , b : 4 } ) ; d b . r e c o r d s . i n s e r t ( { a : 2 , b : 2 } ) ; d b . r e c o r d s . c o u n t ( ) ; / / R e t u r n s 4 d b . r e c o r d s . c o u n t ( { a : 1 } ) ; / / R e t u r n s 3
  8. Distinct d b . o r d e r s

    . i n s e r t ( { c u s t o m e r : " A 1 2 3 " , a m o u n t : 5 0 0 , s t a t u s : " A " } ) ; d b . o r d e r s . i n s e r t ( { c u s t o m e r : " A 1 2 3 " , a m o u n t : 2 5 0 , s t a t u s : " A " } ) ; d b . o r d e r s . i n s e r t ( { c u s t o m e r : " B 2 1 2 " , a m o u n t : 2 0 0 , s t a t u s : " A " } ) ; d b . o r d e r s . i n s e r t ( { c u s t o m e r : " A 1 2 3 " , a m o u n t : 3 0 0 , s t a t u s : " D " } ) ; d b . r e c o r d s . d i s t i n c t ( ) ; / / R e t u r n s [ " A 1 2 3 " , " B 2 1 2 " ]
  9. Group d b . r e c o r d

    s . i n s e r t ( { a : 1 , c o u n t : 4 } ) ; d b . r e c o r d s . i n s e r t ( { a : 1 , c o u n t : 2 } ) ; d b . r e c o r d s . i n s e r t ( { a : 1 , c o u n t : 4 } ) ; d b . r e c o r d s . i n s e r t ( { a : 2 , c o u n t : 3 } ) ; d b . r e c o r d s . i n s e r t ( { a : 2 , c o u n t : 1 } ) ; d b . r e c o r d s . i n s e r t ( { a : 1 , c o u n t : 5 } ) ; d b . r e c o r d s . i n s e r t ( { a : 4 , c o u n t : 4 } ) ; d b . r e c o r d s . g r o u p ( { k e y : { a : 1 } , c o n d : { a : { $ l t : 3 } } , r e d u c e : f u n c t i o n ( c u r , r e s u l t ) { r e s u l t . c o u n t + = c u r . c o u n t } , i n i t i a l : { c o u n t : 0 } } ) ; / / R e t u r n s [ { a : 1 , c o u n t : 1 5 } , { a : 2 , c o u n t : 4 } ]
  10. MAPREDUCE Extremely versatile, powerful Intended for complex data analysis Overkill

    for simple aggregation tasks Averages Summation Grouping
  11. MapReduce in MongoDB Implemented with JavaScript Convenient, but difficult to

    debug for batches Better concurrency with V8 Incremental map-reduce
  12. MapReduce in Action: MongoQP MongoDB includes a Use map-reduce for

    query log analysis Parse BSON query objects with JavaScript Group and aggregate similar queries database profiler
  13. Profile Data { " o p " : " q

    u e r y " , " n s " : " d b . c o l l e c t i o n " , " q u e r y " : { … } , " n t o r e t u r n " : 0 , " n t o s k i p " : 0 , " n s c a n n e d " : 1 1 4 2 6 , " l o c k S t a t s " : { … } , " n r e t u r n e d " : 0 , " r e s p o n s e L e n g t h " : 2 0 , " m i l l i s " : 1 2 , " t s " : I S O D a t e ( " 2 0 1 3 - 0 5 - 2 3 T 2 1 : 2 4 : 3 9 . 3 2 7 Z " ) , }
  14. None
  15. And now for something completely different…

  16. AGGREGATION FRAMEWORK Declared with BSON, like queries Executes in C++,

    not JavaScript Flexible, functional, and simple Operation pipeline Computational expressions
  17. PIPELINE

  18. PIPELINE Process a stream of documents Original input is a

    collection Final output is a result document Series of operators Filter or transform data Input/output chain p s a x | g r e p m o n g o d | h e a d - n 1
  19. PIPELINE OPERATORS $ m a t c h $ p

    r o j e c t $ g r o u p $ u n w i n d $ s o r t $ l i m i t $ s k i p $ g e o N e a r
  20. OUR EXAMPLE DATA Library Books { _ i d :

    3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , l a n g u a g e : " E n g l i s h " , p u b l i s h e r : { c i t y : " L o n d o n " , n a m e : " R a n d o m H o u s e " } }
  21. $MATCH Filter documents Uses existing query syntax

  22. $MATCH Matching field values { t i t l e

    : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } ► ▼ { $ m a t c h : { l a n g u a g e : " R u s s i a n " } } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " }
  23. $MATCH Matching with query operators { t i t l

    e : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } ► ▼ { $ m a t c h : { p a g e s : { $ g t : 1 0 0 0 } } } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " }
  24. $PROJECT Reshape documents Include, exclude or rename fields Inject computed

    fields Manipulate sub-document fields
  25. $PROJECT Including and excluding fields { _ i d :

    3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , l a n g u a g e : " E n g l i s h " } ► ▼ { $ p r o j e c t : { _ i d : 0 , t i t l e : 1 , l a n g u a g e : 1 } } { t i t l e : " T h e G r e a t G a t s b y " , l a n g u a g e : " E n g l i s h " }
  26. $PROJECT Renaming and computing fields { _ i d :

    3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , l a n g u a g e : " E n g l i s h " } ► ▼ { $ p r o j e c t : { a v g P a g e s P e r C h a p t e r : { $ d i v i d e : [ " $ p a g e s " , " $ c h a p t e r s " ] } , l a n g : " $ l a n g u a g e " } } { _ i d : 3 7 5 , a v g P a g e s P e r C h a p t e r : 2 4 . 2 2 2 2 2 2 2 2 2 2 , l a n g : " E n g l i s h " }
  27. $PROJECT Creating and extracting sub-document fields { _ i d

    : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , p u b l i s h e r : { c i t y : " L o n d o n " , n a m e : " R a n d o m H o u s e " } } ► ▼ { $ p r o j e c t : { t i t l e : 1 , s t a t s : { p a g e s : " $ p a g e s " , c h a p t e r s : " $ c h a p t e r s " , } , p u b _ c i t y : " $ p u b l i s h e r . c i t y " } } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s t a t s : { p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } , p u b _ c i t y : " L o n d o n " }
  28. $GROUP Group documents by an ID Field reference, object, constant

    Other output fields are computed $ m a x , $ m i n , $ a v g , $ s u m $ a d d T o S e t , $ p u s h $ f i r s t , $ l a s t
  29. $GROUP Calculating an average { t i t l e

    : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } ► ▼ { $ g r o u p : { _ i d : " $ l a n g u a g e " , a v g P a g e s : { $ a v g : " $ p a g e s " } } } { _ i d : " R u s s i a n " , a v g P a g e s : 1 4 4 0 } { _ i d : " E n g l i s h " , a v g P a g e s : 6 5 3 }
  30. $GROUP Summating fields and counting { t i t l

    e : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } ► ▼ { $ g r o u p : { _ i d : " $ l a n g u a g e " , n u m T i t l e s : { $ s u m : 1 } , s u m P a g e s : { $ s u m : " $ p a g e s " } } } { _ i d : " R u s s i a n " , n u m T i t l e s : 1 , s u m P a g e s : 1 4 4 0 } { _ i d : " E n g l i s h " , n u m T i t l e s : 2 , s u m P a g e s : 1 3 0 6 }
  31. $GROUP Collecting distinct values { t i t l e

    : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } ► ▼ { $ g r o u p : { _ i d : " $ l a n g u a g e " , t i t l e s : { $ a d d T o S e t : " $ t i t l e " } } } { _ i d : " R u s s i a n " , t i t l e s : [ " W a r a n d P e a c e " ] } { _ i d : " E n g l i s h " , t i t l e s : [ " A t l a s S h r u g g e d " , " T h e G r e a t G a t s b y " ] }
  32. $UNWIND Operate on an array field For each element, yield

    a document Pipe to $ g r o u p to aggregate array values
  33. $UNWIND Yielding multiple documents from one { _ i d

    : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] } ► ▼ { $ u n w i n d : " $ s u b j e c t s " } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : " L o n g I s l a n d " } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : " N e w Y o r k " } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : " 1 9 2 0 s " }
  34. $SORT, $LIMIT, $SKIP Sort documents by one or more fields

    Same order syntax as Limit and skip follow cursor behavior cursors
  35. $SORT Sort all documents in the pipeline { t i

    t l e : " T h e G r e a t G a t s b y " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " T h e G r a p e s o f W r a t h " } { t i t l e : " A n i m a l F a r m " } { t i t l e : " L o r d o f t h e F l i e s " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } { t i t l e : " F a h r e n h e i t 4 5 1 " } ► ▼ { $ s o r t : { t i t l e : 1 } } { t i t l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } { t i t l e : " L o r d o f t h e F l i e s " } { t i t l e : " T h e G r a p e s o f W r a t h " } { t i t l e : " T h e G r e a t G a t s b y " }
  36. $LIMIT Limit documents through the pipeline { t i t

    l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } { t i t l e : " L o r d o f t h e F l i e s " } { t i t l e : " T h e G r a p e s o f W r a t h " } { t i t l e : " T h e G r e a t G a t s b y " } ► ▼ { $ l i m i t : 5 } { t i t l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " }
  37. $SKIP Skip over documents in the pipeline { t i

    t l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } ► ▼ { $ s k i p : 2 } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " }
  38. $GEONEAR Only used to start the pipeline Returns documents near

    a point, in order Proxies the command Supports the same options Adds a d i s t a n c e F i e l d option geoNear
  39. EXPRESSIONS

  40. EXPRESSIONS Return computed values Used with $ p r o

    j e c t and $ g r o u p Reference fields using $ (e.g. " $ x " ) Expressions may be nested
  41. EXPRESSIONS Logic $ a n d , $ o r

    , $ n o t … Comparison $ c m p , $ e q , $ g t … Arithmetic $ a d d , $ d i v i d e … String $ s t r c a s e c m p , $ s u b s t r … Date $ y e a r , $ d a y O f M o n t h … Conditional $ c o n d , $ i f N u l l …
  42. USAGE

  43. USAGE a g g r e g a t e

    database command c o l l e c t i o n . a g g r e g a t e ( ) shell helper in PHP MongoCollection::aggregate()
  44. COLLECTION METHOD d b . b o o k s

    . a g g r e g a t e ( [ { $ s o r t : { c r e a t e d : 1 } } , { $ u n w i n d : " $ s u b j e c t s " } , { $ g r o u p : { _ i d : " $ s u b j e c t s " , n : { $ s u m : 1 } , f c : { $ f i r s t : " $ c r e a t e d " } } } , { $ p r o j e c t : { _ i d : 1 , n : 1 , f c : { $ y e a r : " $ f c " } } } ] ) ; ▼ { r e s u l t : [ { " _ i d " : " F a n t a s y " , " n u m " : 6 , " f c " : 2 0 0 8 } , { " _ i d " : " H i s t o r i c a l " , " n u m " : 7 , " f c " : 2 0 1 2 } , { " _ i d " : " W o r l d L i t e r a t u r e " , " n " : 2 , " f c " : 2 0 0 9 } / / O t h e r r e s u l t s f o l l o w … ] , o k : 1 }
  45. DATABASE COMMAND d b . r u n C o

    m m a n d ( { a g g r e g a t e : " b o o k s " , p i p e l i n e : [ { $ s o r t : { c r e a t e d : 1 } } , { $ u n w i n d : " $ s u b j e c t s " } , { $ g r o u p : { _ i d : " $ s u b j e c t s " , n : { $ s u m : 1 } , f c : { $ f i r s t : " $ c r e a t e d " } } } , { $ p r o j e c t : { _ i d : 1 , n : 1 , f c : { $ y e a r : " $ f c " } } } ] } ) ; ▼ { r e s u l t : [ { " _ i d " : " F a n t a s y " , " n u m " : 6 , " f c " : 2 0 0 8 } , { " _ i d " : " H i s t o r i c a l " , " n u m " : 7 , " f c " : 2 0 1 2 } , { " _ i d " : " W o r l d L i t e r a t u r e " , " n " : 2 , " f c " : 2 0 0 9 } / / O t h e r r e s u l t s f o l l o w … ] , o k : 1 }
  46. LIMITATIONS Result limited by BSON document size Final command result

    Intermediate shard results Pipeline operator memory limits
  47. Aggregation Framework in Action Stock ticker data Real-time candlestick charts

    Open/close rates per minute Plot peak points and averages Time-series visualization
  48. EXAMPLE DATA { " _ i d " : O

    b j e c t I d ( " 4 f 4 b 8 9 1 6 f b 1 c 8 0 e 1 4 1 e a 6 2 0 1 " ) , " a s k " : 1 . 3 0 0 2 8 , " b i d " : 1 . 3 0 0 2 , " t s " : I S O D a t e ( " 2 0 1 2 - 0 2 - 1 6 T 1 2 : 4 8 : 0 0 Z " ) }
  49. http://git.io/mongodb-demo

  50. SHARDING

  51. SHARDING Split the pipeline at first $ g r o

    u p or $ s o r t Shards execute pipeline up to that point mongos merges results and continues Early $ m a t c h may excuse shards
  52. SHARDING [ { $ m a t c h :

    { / * f i l t e r b y s h a r d k e y * / } } , { $ g r o u p : { / * g r o u p b y s o m e f i e l d * / } } , { $ s o r t : { / * s o r t b y s o m e f i e l d * / } } , { $ p r o j e c t : { / * r e s h a p e r e s u l t * / } } ]
  53. SHARDING shard1 $ m a t c h $ g

    r o u p 1 ↘ shard2 $ m a t c h $ g r o u p 1 ↓ shard3 mongos $ g r o u p 2 $ s o r t $ p r o j e c t ↓ Result
  54. LOOKING AHEAD

  55. Framework Use Cases Basic aggregation queries Ad-hoc reporting Real-time analytics

    Visualizing time series data
  56. Future Enhancements Coming in MongoDB 2.6 operator for output control

    Down the road $ o u t Pipeline explain facility Optimizing $ m a t c h position Grouping sorted input
  57. Enabling Developers Doing more within MongoDB, faster Refactoring MapReduce and

    groupings Replace pages of JavaScript Longer aggregation pipelines Quick aggregations from the shell
  58. THANKS! Server and drivers Aggregation documentation MapReduce query profiler demo

    app Time-series visualization demo app MongoDB shell enhancements for hackers Questions?
  59. Photo Credits http://dilbert.com/strips/comic/2012-09-05 http://img.timeinc.net/time/photoessays/2009/monty_python/monty_python_02.jpg http://elevatedscholars.files.wordpress.com/2011/01/dscf0459.jpg http://purenintendo.com/wp-content/uploads/2009/10/i_20476.jpg http://i198.photobucket.com/albums/aa77/Stinger1118/Video%20Games/SuperMarioBrosWarpZone.jpg http://www.blogcdn.com/www.joystiq.com/media/2009/10/marioart102613.jpg