Slide 1

Slide 1 text

REAL-TIME DATA AGGREGATION WITH Jeremy Mikola @jmikola

Slide 2

Slide 2 text

AGENDA Available Tools Schemas Commands MapReduce Aggregation Framework Pipeline, Expressions Usage, Sharding Looking Ahead

Slide 3

Slide 3 text

SETTING THE STAGE We have data stored in MongoDB. We need to do ad-hoc reporting, grouping, common aggregations, etc. What can we use for this?

Slide 4

Slide 4 text

DATA WAREHOUSING SQL for reporting and analytics Infrastructure complications Additional maintenance Data duplication ETL processes Real time?

Slide 5

Slide 5 text

SCHEMA-BASED Documents hold pre-aggregated data Processing at time of insert/update Optimized for real-time queries In-place updates avoid fragmentation by MongoDB for Analytics John Nunemaker

Slide 6

Slide 6 text

SINGLE-PURPOSE COMMANDS count distinct group

Slide 7

Slide 7 text

Count d b . r e c o r d s . i n s e r t ( { a : 1 , b : 0 } ) ; d b . r e c o r d s . i n s e r t ( { a : 1 , b : 1 } ) ; d b . r e c o r d s . i n s e r t ( { a : 1 , b : 4 } ) ; d b . r e c o r d s . i n s e r t ( { a : 2 , b : 2 } ) ; d b . r e c o r d s . c o u n t ( ) ; / / R e t u r n s 4 d b . r e c o r d s . c o u n t ( { a : 1 } ) ; / / R e t u r n s 3

Slide 8

Slide 8 text

Distinct d b . o r d e r s . i n s e r t ( { c u s t o m e r : " A 1 2 3 " , a m o u n t : 5 0 0 , s t a t u s : " A " } ) ; d b . o r d e r s . i n s e r t ( { c u s t o m e r : " A 1 2 3 " , a m o u n t : 2 5 0 , s t a t u s : " A " } ) ; d b . o r d e r s . i n s e r t ( { c u s t o m e r : " B 2 1 2 " , a m o u n t : 2 0 0 , s t a t u s : " A " } ) ; d b . o r d e r s . i n s e r t ( { c u s t o m e r : " A 1 2 3 " , a m o u n t : 3 0 0 , s t a t u s : " D " } ) ; d b . r e c o r d s . d i s t i n c t ( ) ; / / R e t u r n s [ " A 1 2 3 " , " B 2 1 2 " ]

Slide 9

Slide 9 text

Group d b . r e c o r d s . i n s e r t ( { a : 1 , c o u n t : 4 } ) ; d b . r e c o r d s . i n s e r t ( { a : 1 , c o u n t : 2 } ) ; d b . r e c o r d s . i n s e r t ( { a : 1 , c o u n t : 4 } ) ; d b . r e c o r d s . i n s e r t ( { a : 2 , c o u n t : 3 } ) ; d b . r e c o r d s . i n s e r t ( { a : 2 , c o u n t : 1 } ) ; d b . r e c o r d s . i n s e r t ( { a : 1 , c o u n t : 5 } ) ; d b . r e c o r d s . i n s e r t ( { a : 4 , c o u n t : 4 } ) ; d b . r e c o r d s . g r o u p ( { k e y : { a : 1 } , c o n d : { a : { $ l t : 3 } } , r e d u c e : f u n c t i o n ( c u r , r e s u l t ) { r e s u l t . c o u n t + = c u r . c o u n t } , i n i t i a l : { c o u n t : 0 } } ) ; / / R e t u r n s [ { a : 1 , c o u n t : 1 5 } , { a : 2 , c o u n t : 4 } ]

Slide 10

Slide 10 text

MAPREDUCE Extremely versatile, powerful Intended for complex data analysis Overkill for simple aggregation tasks Averages Summation Grouping

Slide 11

Slide 11 text

MapReduce in MongoDB Implemented with JavaScript Convenient, but difficult to debug for batches Better concurrency with V8 Incremental map-reduce

Slide 12

Slide 12 text

MapReduce in Action: MongoQP MongoDB includes a Use map-reduce for query log analysis Parse BSON query objects with JavaScript Group and aggregate similar queries database profiler

Slide 13

Slide 13 text

Profile Data { " o p " : " q u e r y " , " n s " : " d b . c o l l e c t i o n " , " q u e r y " : { … } , " n t o r e t u r n " : 0 , " n t o s k i p " : 0 , " n s c a n n e d " : 1 1 4 2 6 , " l o c k S t a t s " : { … } , " n r e t u r n e d " : 0 , " r e s p o n s e L e n g t h " : 2 0 , " m i l l i s " : 1 2 , " t s " : I S O D a t e ( " 2 0 1 3 - 0 5 - 2 3 T 2 1 : 2 4 : 3 9 . 3 2 7 Z " ) , }

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

And now for something completely different…

Slide 16

Slide 16 text

AGGREGATION FRAMEWORK Declared with BSON, like queries Executes in C++, not JavaScript Flexible, functional, and simple Operation pipeline Computational expressions

Slide 17

Slide 17 text

PIPELINE

Slide 18

Slide 18 text

PIPELINE Process a stream of documents Original input is a collection Final output is a result document Series of operators Filter or transform data Input/output chain p s a x | g r e p m o n g o d | h e a d - n 1

Slide 19

Slide 19 text

PIPELINE OPERATORS $ m a t c h $ p r o j e c t $ g r o u p $ u n w i n d $ s o r t $ l i m i t $ s k i p $ g e o N e a r

Slide 20

Slide 20 text

OUR EXAMPLE DATA Library Books { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , l a n g u a g e : " E n g l i s h " , p u b l i s h e r : { c i t y : " L o n d o n " , n a m e : " R a n d o m H o u s e " } }

Slide 21

Slide 21 text

$MATCH Filter documents Uses existing query syntax

Slide 22

Slide 22 text

$MATCH Matching field values { t i t l e : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } ► ▼ { $ m a t c h : { l a n g u a g e : " R u s s i a n " } } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " }

Slide 23

Slide 23 text

$MATCH Matching with query operators { t i t l e : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } ► ▼ { $ m a t c h : { p a g e s : { $ g t : 1 0 0 0 } } } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " }

Slide 24

Slide 24 text

$PROJECT Reshape documents Include, exclude or rename fields Inject computed fields Manipulate sub-document fields

Slide 25

Slide 25 text

$PROJECT Including and excluding fields { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , l a n g u a g e : " E n g l i s h " } ► ▼ { $ p r o j e c t : { _ i d : 0 , t i t l e : 1 , l a n g u a g e : 1 } } { t i t l e : " T h e G r e a t G a t s b y " , l a n g u a g e : " E n g l i s h " }

Slide 26

Slide 26 text

$PROJECT Renaming and computing fields { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , l a n g u a g e : " E n g l i s h " } ► ▼ { $ p r o j e c t : { a v g P a g e s P e r C h a p t e r : { $ d i v i d e : [ " $ p a g e s " , " $ c h a p t e r s " ] } , l a n g : " $ l a n g u a g e " } } { _ i d : 3 7 5 , a v g P a g e s P e r C h a p t e r : 2 4 . 2 2 2 2 2 2 2 2 2 2 , l a n g : " E n g l i s h " }

Slide 27

Slide 27 text

$PROJECT Creating and extracting sub-document fields { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , p u b l i s h e r : { c i t y : " L o n d o n " , n a m e : " R a n d o m H o u s e " } } ► ▼ { $ p r o j e c t : { t i t l e : 1 , s t a t s : { p a g e s : " $ p a g e s " , c h a p t e r s : " $ c h a p t e r s " , } , p u b _ c i t y : " $ p u b l i s h e r . c i t y " } } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s t a t s : { p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } , p u b _ c i t y : " L o n d o n " }

Slide 28

Slide 28 text

$GROUP Group documents by an ID Field reference, object, constant Other output fields are computed $ m a x , $ m i n , $ a v g , $ s u m $ a d d T o S e t , $ p u s h $ f i r s t , $ l a s t

Slide 29

Slide 29 text

$GROUP Calculating an average { t i t l e : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } ► ▼ { $ g r o u p : { _ i d : " $ l a n g u a g e " , a v g P a g e s : { $ a v g : " $ p a g e s " } } } { _ i d : " R u s s i a n " , a v g P a g e s : 1 4 4 0 } { _ i d : " E n g l i s h " , a v g P a g e s : 6 5 3 }

Slide 30

Slide 30 text

$GROUP Summating fields and counting { t i t l e : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } ► ▼ { $ g r o u p : { _ i d : " $ l a n g u a g e " , n u m T i t l e s : { $ s u m : 1 } , s u m P a g e s : { $ s u m : " $ p a g e s " } } } { _ i d : " R u s s i a n " , n u m T i t l e s : 1 , s u m P a g e s : 1 4 4 0 } { _ i d : " E n g l i s h " , n u m T i t l e s : 2 , s u m P a g e s : 1 3 0 6 }

Slide 31

Slide 31 text

$GROUP Collecting distinct values { t i t l e : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } ► ▼ { $ g r o u p : { _ i d : " $ l a n g u a g e " , t i t l e s : { $ a d d T o S e t : " $ t i t l e " } } } { _ i d : " R u s s i a n " , t i t l e s : [ " W a r a n d P e a c e " ] } { _ i d : " E n g l i s h " , t i t l e s : [ " A t l a s S h r u g g e d " , " T h e G r e a t G a t s b y " ] }

Slide 32

Slide 32 text

$UNWIND Operate on an array field For each element, yield a document Pipe to $ g r o u p to aggregate array values

Slide 33

Slide 33 text

$UNWIND Yielding multiple documents from one { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] } ► ▼ { $ u n w i n d : " $ s u b j e c t s " } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : " L o n g I s l a n d " } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : " N e w Y o r k " } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : " 1 9 2 0 s " }

Slide 34

Slide 34 text

$SORT, $LIMIT, $SKIP Sort documents by one or more fields Same order syntax as Limit and skip follow cursor behavior cursors

Slide 35

Slide 35 text

$SORT Sort all documents in the pipeline { t i t l e : " T h e G r e a t G a t s b y " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " T h e G r a p e s o f W r a t h " } { t i t l e : " A n i m a l F a r m " } { t i t l e : " L o r d o f t h e F l i e s " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } { t i t l e : " F a h r e n h e i t 4 5 1 " } ► ▼ { $ s o r t : { t i t l e : 1 } } { t i t l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } { t i t l e : " L o r d o f t h e F l i e s " } { t i t l e : " T h e G r a p e s o f W r a t h " } { t i t l e : " T h e G r e a t G a t s b y " }

Slide 36

Slide 36 text

$LIMIT Limit documents through the pipeline { t i t l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } { t i t l e : " L o r d o f t h e F l i e s " } { t i t l e : " T h e G r a p e s o f W r a t h " } { t i t l e : " T h e G r e a t G a t s b y " } ► ▼ { $ l i m i t : 5 } { t i t l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " }

Slide 37

Slide 37 text

$SKIP Skip over documents in the pipeline { t i t l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } ► ▼ { $ s k i p : 2 } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " }

Slide 38

Slide 38 text

$GEONEAR Only used to start the pipeline Returns documents near a point, in order Proxies the command Supports the same options Adds a d i s t a n c e F i e l d option geoNear

Slide 39

Slide 39 text

EXPRESSIONS

Slide 40

Slide 40 text

EXPRESSIONS Return computed values Used with $ p r o j e c t and $ g r o u p Reference fields using $ (e.g. " $ x " ) Expressions may be nested

Slide 41

Slide 41 text

EXPRESSIONS Logic $ a n d , $ o r , $ n o t … Comparison $ c m p , $ e q , $ g t … Arithmetic $ a d d , $ d i v i d e … String $ s t r c a s e c m p , $ s u b s t r … Date $ y e a r , $ d a y O f M o n t h … Conditional $ c o n d , $ i f N u l l …

Slide 42

Slide 42 text

USAGE

Slide 43

Slide 43 text

USAGE a g g r e g a t e database command c o l l e c t i o n . a g g r e g a t e ( ) shell helper in PHP MongoCollection::aggregate()

Slide 44

Slide 44 text

COLLECTION METHOD d b . b o o k s . a g g r e g a t e ( [ { $ s o r t : { c r e a t e d : 1 } } , { $ u n w i n d : " $ s u b j e c t s " } , { $ g r o u p : { _ i d : " $ s u b j e c t s " , n : { $ s u m : 1 } , f c : { $ f i r s t : " $ c r e a t e d " } } } , { $ p r o j e c t : { _ i d : 1 , n : 1 , f c : { $ y e a r : " $ f c " } } } ] ) ; ▼ { r e s u l t : [ { " _ i d " : " F a n t a s y " , " n u m " : 6 , " f c " : 2 0 0 8 } , { " _ i d " : " H i s t o r i c a l " , " n u m " : 7 , " f c " : 2 0 1 2 } , { " _ i d " : " W o r l d L i t e r a t u r e " , " n " : 2 , " f c " : 2 0 0 9 } / / O t h e r r e s u l t s f o l l o w … ] , o k : 1 }

Slide 45

Slide 45 text

DATABASE COMMAND d b . r u n C o m m a n d ( { a g g r e g a t e : " b o o k s " , p i p e l i n e : [ { $ s o r t : { c r e a t e d : 1 } } , { $ u n w i n d : " $ s u b j e c t s " } , { $ g r o u p : { _ i d : " $ s u b j e c t s " , n : { $ s u m : 1 } , f c : { $ f i r s t : " $ c r e a t e d " } } } , { $ p r o j e c t : { _ i d : 1 , n : 1 , f c : { $ y e a r : " $ f c " } } } ] } ) ; ▼ { r e s u l t : [ { " _ i d " : " F a n t a s y " , " n u m " : 6 , " f c " : 2 0 0 8 } , { " _ i d " : " H i s t o r i c a l " , " n u m " : 7 , " f c " : 2 0 1 2 } , { " _ i d " : " W o r l d L i t e r a t u r e " , " n " : 2 , " f c " : 2 0 0 9 } / / O t h e r r e s u l t s f o l l o w … ] , o k : 1 }

Slide 46

Slide 46 text

LIMITATIONS Result limited by BSON document size Final command result Intermediate shard results Pipeline operator memory limits

Slide 47

Slide 47 text

Aggregation Framework in Action Stock ticker data Real-time candlestick charts Open/close rates per minute Plot peak points and averages Time-series visualization

Slide 48

Slide 48 text

EXAMPLE DATA { " _ i d " : O b j e c t I d ( " 4 f 4 b 8 9 1 6 f b 1 c 8 0 e 1 4 1 e a 6 2 0 1 " ) , " a s k " : 1 . 3 0 0 2 8 , " b i d " : 1 . 3 0 0 2 , " t s " : I S O D a t e ( " 2 0 1 2 - 0 2 - 1 6 T 1 2 : 4 8 : 0 0 Z " ) }

Slide 49

Slide 49 text

http://git.io/mongodb-demo

Slide 50

Slide 50 text

SHARDING

Slide 51

Slide 51 text

SHARDING Split the pipeline at first $ g r o u p or $ s o r t Shards execute pipeline up to that point mongos merges results and continues Early $ m a t c h may excuse shards

Slide 52

Slide 52 text

SHARDING [ { $ m a t c h : { / * f i l t e r b y s h a r d k e y * / } } , { $ g r o u p : { / * g r o u p b y s o m e f i e l d * / } } , { $ s o r t : { / * s o r t b y s o m e f i e l d * / } } , { $ p r o j e c t : { / * r e s h a p e r e s u l t * / } } ]

Slide 53

Slide 53 text

SHARDING shard1 $ m a t c h $ g r o u p 1 ↘ shard2 $ m a t c h $ g r o u p 1 ↓ shard3 mongos $ g r o u p 2 $ s o r t $ p r o j e c t ↓ Result

Slide 54

Slide 54 text

LOOKING AHEAD

Slide 55

Slide 55 text

Framework Use Cases Basic aggregation queries Ad-hoc reporting Real-time analytics Visualizing time series data

Slide 56

Slide 56 text

Future Enhancements Coming in MongoDB 2.6 operator for output control Down the road $ o u t Pipeline explain facility Optimizing $ m a t c h position Grouping sorted input

Slide 57

Slide 57 text

Enabling Developers Doing more within MongoDB, faster Refactoring MapReduce and groupings Replace pages of JavaScript Longer aggregation pipelines Quick aggregations from the shell

Slide 58

Slide 58 text

THANKS! Server and drivers Aggregation documentation MapReduce query profiler demo app Time-series visualization demo app MongoDB shell enhancements for hackers Questions?

Slide 59

Slide 59 text

Photo Credits http://dilbert.com/strips/comic/2012-09-05 http://img.timeinc.net/time/photoessays/2009/monty_python/monty_python_02.jpg http://elevatedscholars.files.wordpress.com/2011/01/dscf0459.jpg http://purenintendo.com/wp-content/uploads/2009/10/i_20476.jpg http://i198.photobucket.com/albums/aa77/Stinger1118/Video%20Games/SuperMarioBrosWarpZone.jpg http://www.blogcdn.com/www.joystiq.com/media/2009/10/marioart102613.jpg