Real-time Data Aggregation with MongoDB

REAL-TIME DATA AGGREGATION WITH Jeremy Mikola @jmikola

AGENDA Available Tools Schemas Commands MapReduce Aggregation Framework Pipeline, Expressions
Usage, Sharding Looking Ahead

SETTING THE STAGE We have data stored in MongoDB. We
need to do ad-hoc reporting, grouping, common aggregations, etc. What can we use for this?

DATA WAREHOUSING SQL for reporting and analytics Infrastructure complications Additional
maintenance Data duplication ETL processes Real time?

SCHEMA-BASED Documents hold pre-aggregated data Processing at time of insert/update
Optimized for real-time queries In-place updates avoid fragmentation by MongoDB for Analytics John Nunemaker

SINGLE-PURPOSE COMMANDS count distinct group

Count d b . r e c o r d
s . i n s e r t ( { a : 1 , b : 0 } ) ; d b . r e c o r d s . i n s e r t ( { a : 1 , b : 1 } ) ; d b . r e c o r d s . i n s e r t ( { a : 1 , b : 4 } ) ; d b . r e c o r d s . i n s e r t ( { a : 2 , b : 2 } ) ; d b . r e c o r d s . c o u n t ( ) ; / / R e t u r n s 4 d b . r e c o r d s . c o u n t ( { a : 1 } ) ; / / R e t u r n s 3

Distinct d b . o r d e r s
. i n s e r t ( { c u s t o m e r : " A 1 2 3 " , a m o u n t : 5 0 0 , s t a t u s : " A " } ) ; d b . o r d e r s . i n s e r t ( { c u s t o m e r : " A 1 2 3 " , a m o u n t : 2 5 0 , s t a t u s : " A " } ) ; d b . o r d e r s . i n s e r t ( { c u s t o m e r : " B 2 1 2 " , a m o u n t : 2 0 0 , s t a t u s : " A " } ) ; d b . o r d e r s . i n s e r t ( { c u s t o m e r : " A 1 2 3 " , a m o u n t : 3 0 0 , s t a t u s : " D " } ) ; d b . r e c o r d s . d i s t i n c t ( ) ; / / R e t u r n s [ " A 1 2 3 " , " B 2 1 2 " ]

Group d b . r e c o r d
s . i n s e r t ( { a : 1 , c o u n t : 4 } ) ; d b . r e c o r d s . i n s e r t ( { a : 1 , c o u n t : 2 } ) ; d b . r e c o r d s . i n s e r t ( { a : 1 , c o u n t : 4 } ) ; d b . r e c o r d s . i n s e r t ( { a : 2 , c o u n t : 3 } ) ; d b . r e c o r d s . i n s e r t ( { a : 2 , c o u n t : 1 } ) ; d b . r e c o r d s . i n s e r t ( { a : 1 , c o u n t : 5 } ) ; d b . r e c o r d s . i n s e r t ( { a : 4 , c o u n t : 4 } ) ; d b . r e c o r d s . g r o u p ( { k e y : { a : 1 } , c o n d : { a : { $ l t : 3 } } , r e d u c e : f u n c t i o n ( c u r , r e s u l t ) { r e s u l t . c o u n t + = c u r . c o u n t } , i n i t i a l : { c o u n t : 0 } } ) ; / / R e t u r n s [ { a : 1 , c o u n t : 1 5 } , { a : 2 , c o u n t : 4 } ]

MAPREDUCE Extremely versatile, powerful Intended for complex data analysis Overkill
for simple aggregation tasks Averages Summation Grouping

MapReduce in MongoDB Implemented with JavaScript Convenient, but difficult to
debug for batches Better concurrency with V8 Incremental map-reduce

MapReduce in Action: MongoQP MongoDB includes a Use map-reduce for
query log analysis Parse BSON query objects with JavaScript Group and aggregate similar queries database profiler

Profile Data { " o p " : " q
u e r y " , " n s " : " d b . c o l l e c t i o n " , " q u e r y " : { … } , " n t o r e t u r n " : 0 , " n t o s k i p " : 0 , " n s c a n n e d " : 1 1 4 2 6 , " l o c k S t a t s " : { … } , " n r e t u r n e d " : 0 , " r e s p o n s e L e n g t h " : 2 0 , " m i l l i s " : 1 2 , " t s " : I S O D a t e ( " 2 0 1 3 - 0 5 - 2 3 T 2 1 : 2 4 : 3 9 . 3 2 7 Z " ) , }

And now for something completely different…

AGGREGATION FRAMEWORK Declared with BSON, like queries Executes in C++,
not JavaScript Flexible, functional, and simple Operation pipeline Computational expressions

PIPELINE

PIPELINE Process a stream of documents Original input is a
collection Final output is a result document Series of operators Filter or transform data Input/output chain p s a x | g r e p m o n g o d | h e a d - n 1

PIPELINE OPERATORS $ m a t c h $ p
r o j e c t $ g r o u p $ u n w i n d $ s o r t $ l i m i t $ s k i p $ g e o N e a r

OUR EXAMPLE DATA Library Books { _ i d :
3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , l a n g u a g e : " E n g l i s h " , p u b l i s h e r : { c i t y : " L o n d o n " , n a m e : " R a n d o m H o u s e " } }

$MATCH Filter documents Uses existing query syntax

$MATCH Matching field values { t i t l e
: " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } ► ▼ { $ m a t c h : { l a n g u a g e : " R u s s i a n " } } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " }

$MATCH Matching with query operators { t i t l
e : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } ► ▼ { $ m a t c h : { p a g e s : { $ g t : 1 0 0 0 } } } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " }

$PROJECT Reshape documents Include, exclude or rename fields Inject computed
fields Manipulate sub-document fields

$PROJECT Including and excluding fields { _ i d :
3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , l a n g u a g e : " E n g l i s h " } ► ▼ { $ p r o j e c t : { _ i d : 0 , t i t l e : 1 , l a n g u a g e : 1 } } { t i t l e : " T h e G r e a t G a t s b y " , l a n g u a g e : " E n g l i s h " }

$PROJECT Renaming and computing fields { _ i d :
3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , l a n g u a g e : " E n g l i s h " } ► ▼ { $ p r o j e c t : { a v g P a g e s P e r C h a p t e r : { $ d i v i d e : [ " $ p a g e s " , " $ c h a p t e r s " ] } , l a n g : " $ l a n g u a g e " } } { _ i d : 3 7 5 , a v g P a g e s P e r C h a p t e r : 2 4 . 2 2 2 2 2 2 2 2 2 2 , l a n g : " E n g l i s h " }

$PROJECT Creating and extracting sub-document fields { _ i d
: 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , I S B N : " 9 7 8 1 8 5 7 1 5 0 1 9 3 " , a v a i l a b l e : t r u e , p a g e s : 2 1 8 , c h a p t e r s : 9 , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] , p u b l i s h e r : { c i t y : " L o n d o n " , n a m e : " R a n d o m H o u s e " } } ► ▼ { $ p r o j e c t : { t i t l e : 1 , s t a t s : { p a g e s : " $ p a g e s " , c h a p t e r s : " $ c h a p t e r s " , } , p u b _ c i t y : " $ p u b l i s h e r . c i t y " } } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s t a t s : { p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } , p u b _ c i t y : " L o n d o n " }

$GROUP Group documents by an ID Field reference, object, constant
Other output fields are computed $ m a x , $ m i n , $ a v g , $ s u m $ a d d T o S e t , $ p u s h $ f i r s t , $ l a s t

$GROUP Calculating an average { t i t l e
: " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } ► ▼ { $ g r o u p : { _ i d : " $ l a n g u a g e " , a v g P a g e s : { $ a v g : " $ p a g e s " } } } { _ i d : " R u s s i a n " , a v g P a g e s : 1 4 4 0 } { _ i d : " E n g l i s h " , a v g P a g e s : 6 5 3 }

$GROUP Summating fields and counting { t i t l
e : " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } ► ▼ { $ g r o u p : { _ i d : " $ l a n g u a g e " , n u m T i t l e s : { $ s u m : 1 } , s u m P a g e s : { $ s u m : " $ p a g e s " } } } { _ i d : " R u s s i a n " , n u m T i t l e s : 1 , s u m P a g e s : 1 4 4 0 } { _ i d : " E n g l i s h " , n u m T i t l e s : 2 , s u m P a g e s : 1 3 0 6 }

$GROUP Collecting distinct values { t i t l e
: " T h e G r e a t G a t s b y " , p a g e s : 2 1 8 , l a n g u a g e : " E n g l i s h " } { t i t l e : " W a r a n d P e a c e " , p a g e s : 1 4 4 0 , l a n g u a g e : " R u s s i a n " } { t i t l e : " A t l a s S h r u g g e d " , p a g e s : 1 0 8 8 , l a n g u a g e : " E n g l i s h " } ► ▼ { $ g r o u p : { _ i d : " $ l a n g u a g e " , t i t l e s : { $ a d d T o S e t : " $ t i t l e " } } } { _ i d : " R u s s i a n " , t i t l e s : [ " W a r a n d P e a c e " ] } { _ i d : " E n g l i s h " , t i t l e s : [ " A t l a s S h r u g g e d " , " T h e G r e a t G a t s b y " ] }

$UNWIND Operate on an array field For each element, yield
a document Pipe to $ g r o u p to aggregate array values

$UNWIND Yielding multiple documents from one { _ i d
: 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : [ " L o n g I s l a n d " , " N e w Y o r k " , " 1 9 2 0 s " ] } ► ▼ { $ u n w i n d : " $ s u b j e c t s " } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : " L o n g I s l a n d " } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : " N e w Y o r k " } { _ i d : 3 7 5 , t i t l e : " T h e G r e a t G a t s b y " , s u b j e c t s : " 1 9 2 0 s " }

$SORT, $LIMIT, $SKIP Sort documents by one or more fields
Same order syntax as Limit and skip follow cursor behavior cursors

$SORT Sort all documents in the pipeline { t i
t l e : " T h e G r e a t G a t s b y " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " T h e G r a p e s o f W r a t h " } { t i t l e : " A n i m a l F a r m " } { t i t l e : " L o r d o f t h e F l i e s " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } { t i t l e : " F a h r e n h e i t 4 5 1 " } ► ▼ { $ s o r t : { t i t l e : 1 } } { t i t l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } { t i t l e : " L o r d o f t h e F l i e s " } { t i t l e : " T h e G r a p e s o f W r a t h " } { t i t l e : " T h e G r e a t G a t s b y " }

$LIMIT Limit documents through the pipeline { t i t
l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } { t i t l e : " L o r d o f t h e F l i e s " } { t i t l e : " T h e G r a p e s o f W r a t h " } { t i t l e : " T h e G r e a t G a t s b y " } ► ▼ { $ l i m i t : 5 } { t i t l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " }

$SKIP Skip over documents in the pipeline { t i
t l e : " A n i m a l F a r m " } { t i t l e : " B r a v e N e w W o r l d " } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " } ► ▼ { $ s k i p : 2 } { t i t l e : " F a h r e n h e i t 4 5 1 " } { t i t l e : " F a t h e r s a n d S o n s " } { t i t l e : " I n v i s i b l e M a n " }

$GEONEAR Only used to start the pipeline Returns documents near
a point, in order Proxies the command Supports the same options Adds a d i s t a n c e F i e l d option geoNear

EXPRESSIONS

EXPRESSIONS Return computed values Used with $ p r o
j e c t and $ g r o u p Reference fields using $ (e.g. " $ x " ) Expressions may be nested

EXPRESSIONS Logic $ a n d , $ o r
, $ n o t … Comparison $ c m p , $ e q , $ g t … Arithmetic $ a d d , $ d i v i d e … String $ s t r c a s e c m p , $ s u b s t r … Date $ y e a r , $ d a y O f M o n t h … Conditional $ c o n d , $ i f N u l l …

USAGE a g g r e g a t e
database command c o l l e c t i o n . a g g r e g a t e ( ) shell helper in PHP MongoCollection::aggregate()

COLLECTION METHOD d b . b o o k s
. a g g r e g a t e ( [ { $ s o r t : { c r e a t e d : 1 } } , { $ u n w i n d : " $ s u b j e c t s " } , { $ g r o u p : { _ i d : " $ s u b j e c t s " , n : { $ s u m : 1 } , f c : { $ f i r s t : " $ c r e a t e d " } } } , { $ p r o j e c t : { _ i d : 1 , n : 1 , f c : { $ y e a r : " $ f c " } } } ] ) ; ▼ { r e s u l t : [ { " _ i d " : " F a n t a s y " , " n u m " : 6 , " f c " : 2 0 0 8 } , { " _ i d " : " H i s t o r i c a l " , " n u m " : 7 , " f c " : 2 0 1 2 } , { " _ i d " : " W o r l d L i t e r a t u r e " , " n " : 2 , " f c " : 2 0 0 9 } / / O t h e r r e s u l t s f o l l o w … ] , o k : 1 }

DATABASE COMMAND d b . r u n C o
m m a n d ( { a g g r e g a t e : " b o o k s " , p i p e l i n e : [ { $ s o r t : { c r e a t e d : 1 } } , { $ u n w i n d : " $ s u b j e c t s " } , { $ g r o u p : { _ i d : " $ s u b j e c t s " , n : { $ s u m : 1 } , f c : { $ f i r s t : " $ c r e a t e d " } } } , { $ p r o j e c t : { _ i d : 1 , n : 1 , f c : { $ y e a r : " $ f c " } } } ] } ) ; ▼ { r e s u l t : [ { " _ i d " : " F a n t a s y " , " n u m " : 6 , " f c " : 2 0 0 8 } , { " _ i d " : " H i s t o r i c a l " , " n u m " : 7 , " f c " : 2 0 1 2 } , { " _ i d " : " W o r l d L i t e r a t u r e " , " n " : 2 , " f c " : 2 0 0 9 } / / O t h e r r e s u l t s f o l l o w … ] , o k : 1 }

LIMITATIONS Result limited by BSON document size Final command result
Intermediate shard results Pipeline operator memory limits

Aggregation Framework in Action Stock ticker data Real-time candlestick charts
Open/close rates per minute Plot peak points and averages Time-series visualization

EXAMPLE DATA { " _ i d " : O
b j e c t I d ( " 4 f 4 b 8 9 1 6 f b 1 c 8 0 e 1 4 1 e a 6 2 0 1 " ) , " a s k " : 1 . 3 0 0 2 8 , " b i d " : 1 . 3 0 0 2 , " t s " : I S O D a t e ( " 2 0 1 2 - 0 2 - 1 6 T 1 2 : 4 8 : 0 0 Z " ) }

http://git.io/mongodb-demo

SHARDING

SHARDING Split the pipeline at first $ g r o
u p or $ s o r t Shards execute pipeline up to that point mongos merges results and continues Early $ m a t c h may excuse shards

SHARDING [ { $ m a t c h :
{ / * f i l t e r b y s h a r d k e y * / } } , { $ g r o u p : { / * g r o u p b y s o m e f i e l d * / } } , { $ s o r t : { / * s o r t b y s o m e f i e l d * / } } , { $ p r o j e c t : { / * r e s h a p e r e s u l t * / } } ]

SHARDING shard1 $ m a t c h $ g
r o u p 1 ↘ shard2 $ m a t c h $ g r o u p 1 ↓ shard3 mongos $ g r o u p 2 $ s o r t $ p r o j e c t ↓ Result

LOOKING AHEAD

Framework Use Cases Basic aggregation queries Ad-hoc reporting Real-time analytics
Visualizing time series data

Future Enhancements Coming in MongoDB 2.6 operator for output control
Down the road $ o u t Pipeline explain facility Optimizing $ m a t c h position Grouping sorted input

Enabling Developers Doing more within MongoDB, faster Refactoring MapReduce and
groupings Replace pages of JavaScript Longer aggregation pipelines Quick aggregations from the shell

THANKS! Server and drivers Aggregation documentation MapReduce query profiler demo
app Time-series visualization demo app MongoDB shell enhancements for hackers Questions?

Photo Credits http://dilbert.com/strips/comic/2012-09-05 http://img.timeinc.net/time/photoessays/2009/monty_python/monty_python_02.jpg http://elevatedscholars.files.wordpress.com/2011/01/dscf0459.jpg http://purenintendo.com/wp-content/uploads/2009/10/i_20476.jpg http://i198.photobucket.com/albums/aa77/Stinger1118/Video%20Games/SuperMarioBrosWarpZone.jpg http://www.blogcdn.com/www.joystiq.com/media/2009/10/marioart102613.jpg

Real-time Data Aggregation with MongoDB

Real-time Data Aggregation with MongoDB

More Decks by Jeremy Mikola

Other Decks in Programming

Featured

Transcript