Data Processing and Aggregation MongoDB

Norberto Leite #MongoDBMadrid Data Processing and Aggregation @nleite

Big Data

Exponential Data Growth
[Chart: billions of URLs indexed by Google, 2000-2008; y-axis 0 to 1000]

Who am I?
• Norberto Leite - @nleite
• 10gen Solutions Architect
• Love databases

For over a decade Big Data == Custom Software

In the past few years, open source software emerged, enabling
the rest of us to handle Big Data

Applications and data Store Process

MongoDB solves our needs
• MongoDB is an ideal operational database
• MongoDB provides high performance for storage and retrieval at large scale
• MongoDB has a robust query interface permitting intelligent operations
• MongoDB is not a data processing engine, but provides processing functionality

Data Processing in MongoDB
• Process in MongoDB using Map/Reduce
• Process in MongoDB using Aggregation Framework
• Process outside MongoDB using Hadoop and other external tools

Twitter
{
  "created_at" : "Thu Feb 21 11:27:16 +0000 2013",
  "id" : 304552867820339200,
  "id_str" : "304552867820339200",
  "text" : "I've collected 58,158 gold coins! http://t.co/oqcrKhQXdv #android, #androidgames, #gameinsight",
  "source" : "<a href=\"http://bit.ly/tribez_itw\" rel=\"nofollow\">The Tribez for Android</a>",
  "truncated" : false,
  "in_reply_to_status_id" : null,
  "in_reply_to_status_id_str" : null,
  "in_reply_to_user_id" : null,
  "in_reply_to_user_id_str" : null,
  "in_reply_to_screen_name" : null,
  "user" : {
    "id" : 1089088963,
    "id_str" : "1089088963",
    "name" : "timothy m farmer",
    "screen_name" : "iceyknight8461",
    "location" : "",
    "url" : null,
    ...

curl -u <u>:<pw> https://stream.twitter.com/1.1/statuses/sample.json | mongoimport -d test -c tweets

Inspecting Hashtags
{
  "created_at" : "Thu Feb 21 11:27:16 +0000 2013",
  "id" : 304552867820339200,
  "text" : "I've collected 58,158 gold coins! http://t.co/oqcrKhQXdv #android, #androidgames, #gameinsight",
  "user" : {...},
  "entities" : {
    "hashtags" : [
      { "text" : "android", "indices" : [57, 65] },
      { "text" : "androidgames", "indices" : [67, 80] }
    ]
  }
}
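As a rough illustration of the document shape above, the following Python sketch pulls the hashtag strings out of one tweet. The helper name `hashtag_texts` is hypothetical, and the `tweet` dictionary is a trimmed copy of the slide's document, not live data. It also shows that each `indices` pair is a half-open character range into `text`.

```python
# Trimmed copy of the tweet document from the slide (illustrative only).
tweet = {
    "id": 304552867820339200,
    "text": "I've collected 58,158 gold coins! http://t.co/oqcrKhQXdv "
            "#android, #androidgames, #gameinsight",
    "entities": {
        "hashtags": [
            {"text": "android", "indices": [57, 65]},
            {"text": "androidgames", "indices": [67, 80]},
        ]
    },
}


def hashtag_texts(doc):
    """Return the hashtag strings of one tweet, or [] if it has none."""
    entities = doc.get("entities") or {}
    return [tag["text"] for tag in entities.get("hashtags", [])]


def hashtag_spans(doc):
    """Return the slices of the tweet text that each hashtag's indices cover."""
    entities = doc.get("entities") or {}
    return [doc["text"][start:end]
            for start, end in (tag["indices"] for tag in entities.get("hashtags", []))]


print(hashtag_texts(tweet))
print(hashtag_spans(tweet))
```

Tweets without an `entities` field simply yield an empty list, which is the same guard the map function needs.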

MongoDB Map/Reduce

MongoDB Map/Reduce
[Diagram: MongoDB, map, reduce, finalise]

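Map Function
> var map = function() {
    // Skip tweets that carry no hashtag entities
    if (this.entities && this.entities.hashtags) {
      for (var i = 0; i < this.entities.hashtags.length; i++) {
        // Emit (hashtag, 1) once per hashtag occurrence
        emit(this.entities.hashtags[i].text, 1);
      }
    }
  }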

Reduce
> var reduce = function (key, values) {
    var sum = 0;
    values.forEach( function (val) { sum += val; } );
    return sum;
  }
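To make the data flow of the map and reduce functions concrete, here is a minimal pure-Python sketch of the same hashtag count. It does not talk to MongoDB; `map_tweet`, `reduce_counts`, `map_reduce` and the sample tweets are hypothetical stand-ins. It also checks one property MongoDB relies on: reduce may be called again on its own partial results for a key, so summing is a safe choice.

```python
from collections import defaultdict


def map_tweet(doc):
    """Mirror of the map function: yield (hashtag, 1) per hashtag."""
    entities = doc.get("entities") or {}
    for tag in entities.get("hashtags", []):
        yield tag["text"], 1


def reduce_counts(key, values):
    """Mirror of the reduce function: sum the values emitted for one key."""
    return sum(values)


def map_reduce(docs):
    """Group emitted values by key, then reduce each group."""
    emitted = defaultdict(list)
    for doc in docs:
        for key, value in map_tweet(doc):
            emitted[key].append(value)
    return {key: reduce_counts(key, values) for key, values in emitted.items()}


# Made-up sample tweets, just enough to exercise the functions.
sample_tweets = [
    {"entities": {"hashtags": [{"text": "android"}, {"text": "tbt"}]}},
    {"entities": {"hashtags": [{"text": "android"}]}},
    {"text": "a tweet without entities"},
]

print(map_reduce(sample_tweets))

# Re-reducing a partial result gives the same total as reducing everything at once.
partial = reduce_counts("android", [1, 1])
print(reduce_counts("android", [partial, 1]) == reduce_counts("android", [1, 1, 1]))
```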

Execute MongoDB Map Reduce
> db.tweets.mapReduce(map, reduce, {out: 'tweet_hashtags'})
{
  "result" : "tweet_hashtags",
  "timeMillis" : 6297,
  "counts" : {
    "input" : 328978,
    "emit" : 54027,
    "reduce" : 9789,
    "output" : 27046
  },
  "ok" : 1
}

Results
> db.tweet_hashtags.find().sort({value: -1}).limit(10)
{ "_id" : "RT", "value" : 559 }
{ "_id" : "20SongsThatILike", "value" : 508 }
{ "_id" : "BrazilLovesHarryStyles", "value" : 411 }
{ "_id" : "WeLiveInAGenerationWhere", "value" : 358 }
{ "_id" : "tbt", "value" : 358 }
{ "_id" : "TeamFollowBack", "value" : 335 }
{ "_id" : "SCTVthanksForFullSMTOWN", "value" : 314 }
{ "_id" : "MientoComoHombre", "value" : 304 }
{ "_id" : "musicfans", "value" : 233 }
{ "_id" : "StoryBehindMyScar", "value" : 203 }

MongoDB Map/Reduce
• Real-time
• Output directly to document or collection
• Runs inside MongoDB on local data
- Adds load to your DB
- In JavaScript
- V8 engine

Aggregation Framework

MongoDB Aggregation Framework
[Diagram: MongoDB, op1, op2, opN]

Aggregation framework operators
• $project
• $match
• $limit
• $skip
• $sort
• $unwind
• $group

$group operators
• $first
• $last
• $max
• $min
• $avg
• $sum
• $push
• $addToSet
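As a rough guide to what each $group operator computes, the sketch below applies plain-Python equivalents to one group of documents. The field names `tag`, `user` and `retweets` and the sample values are made up for illustration; this mimics the semantics only and is not how MongoDB implements them.

```python
# Hypothetical documents that would all fall into the group _id = "android".
docs = [
    {"tag": "android", "user": "ann", "retweets": 3},
    {"tag": "android", "user": "bob", "retweets": 7},
    {"tag": "android", "user": "ann", "retweets": 5},
]

values = [d["retweets"] for d in docs]
users = [d["user"] for d in docs]

group = {
    "_id": "android",
    "first": values[0],                # $first: value from the first document in the group
    "last": values[-1],                # $last: value from the last document in the group
    "max": max(values),                # $max
    "min": min(values),                # $min
    "avg": sum(values) / len(values),  # $avg
    "sum": sum(values),                # $sum
    "push": users,                     # $push: every value, duplicates kept
    "addToSet": sorted(set(users)),    # $addToSet: unique values (sorted here only for stable output)
}

for name, value in group.items():
    print(name, value)
```

$first and $last depend on the order in which documents reach $group, so they are usually only meaningful after a $sort stage.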

Aggregate Hashtags
> hash_tags = [
    { $unwind : "$entities.hashtags" },
    { $match : { "entities.hashtags.text" : { $exists : true } } },
    { $group : { _id : "$entities.hashtags.text", count : { $sum : 1 } } },
    { $sort : { count : -1 } },
    { $limit : 10 }
  ]
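The pipeline stages can be traced by hand with a small pure-Python sketch. This is only a model of what each stage does to the documents, under the assumption that tweets look like the sample shown earlier; `unwind_hashtags`, `top_hashtags` and the sample tweets are hypothetical names and data, and nothing here calls MongoDB.

```python
from collections import Counter


def unwind_hashtags(docs):
    """$unwind: one output document per element of entities.hashtags.

    Documents with a missing or empty array produce no output.
    """
    for doc in docs:
        entities = doc.get("entities") or {}
        for tag in entities.get("hashtags", []):
            flat = dict(doc)
            flat["entities"] = dict(entities, hashtags=tag)
            yield flat


def top_hashtags(docs, limit=10):
    unwound = unwind_hashtags(docs)
    # $match: keep documents whose unwound hashtag has a "text" field
    matched = (d for d in unwound if "text" in d["entities"]["hashtags"])
    # $group with count: {$sum: 1}
    counts = Counter(d["entities"]["hashtags"]["text"] for d in matched)
    # $sort by count descending, then $limit
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [{"_id": tag, "count": n} for tag, n in ranked[:limit]]


# Made-up sample tweets.
sample_tweets = [
    {"entities": {"hashtags": [{"text": "android"}, {"text": "gameinsight"}]}},
    {"entities": {"hashtags": [{"text": "android"}]}},
    {"text": "no entities here"},
    {"entities": {"hashtags": [{"text": "android"}, {"text": "tbt"}]}},
    {"entities": {"hashtags": [{"text": "tbt"}]}},
]

print(top_hashtags(sample_tweets, limit=2))
```

Note that the output field is named `count`, because that is the name given in the $group stage.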


Results
> db.tweets.aggregate(hash_tags)
{
  "result" : [
    { "_id" : "RT", "count" : 559 },
    { "_id" : "20SongsThatILike", "count" : 508 },
    { "_id" : "BrazilLovesHarryStyles", "count" : 411 },
    { "_id" : "WeLiveInAGenerationWhere", "count" : 358 },
    { "_id" : "tbt", "count" : 358 },
    { "_id" : "TeamFollowBack", "count" : 335 },
    { "_id" : "SCTVthanksForFullSMTOWN", "count" : 314 },
    { "_id" : "MientoComoHombre", "count" : 304 },
    { "_id" : "musicfans", "count" : 233 },
    { "_id" : "StoryBehindMyScar", "count" : 203 }
  ],
  "ok" : 1
}

Aggregation Framework Benefits
• Real-time
• Simple yet powerful interface
• Declared in JSON, executes in C++
• Runs inside MongoDB on local data
- Adds load to your DB
- Limited in how much data it can return

Analysing MongoDB Data in External Systems

MongoDB with Hadoop
[Diagram: MongoDB]

MongoDB with Hadoop
[Diagram: MongoDB, warehouse]

MongoDB with Hadoop
[Diagram: MongoDB, ETL]

Map Hashtags in Python
#!/usr/bin/env python
import sys
sys.path.append(".")

from pymongo_hadoop import BSONMapper

def mapper(documents):
    # Emit one {'_id': <hashtag>, 'count': 1} record per hashtag
    for doc in documents:
        for hashtag in doc['entities']['hashtags']:
            yield {'_id': hashtag['text'], 'count': 1}

BSONMapper(mapper)
sys.stderr.write("Done Mapping.\n")

Reduce Hashtags in Python
#!/usr/bin/env python
import sys
sys.path.append(".")

from pymongo_hadoop import BSONReducer

def reducer(key, values):
    # Sum the 'count' field of every record emitted for this hashtag
    sys.stderr.write("Hashtag %s\n" % key)
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'count': _count}

BSONReducer(reducer)
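Before submitting a streaming job, the mapper and reducer logic can be exercised locally. The sketch below is one possible harness, not part of mongo-hadoop: it copies the two functions without the `pymongo_hadoop` wrappers and uses a hypothetical `run_locally` helper that imitates Hadoop's shuffle by sorting the mapped records by key and grouping them. The sample documents are made up.

```python
from itertools import groupby
from operator import itemgetter


def mapper(documents):
    """Same logic as the streaming mapper, minus the BSONMapper wrapper."""
    for doc in documents:
        for hashtag in doc.get("entities", {}).get("hashtags", []):
            yield {"_id": hashtag["text"], "count": 1}


def reducer(key, values):
    """Same logic as the streaming reducer, minus the BSONReducer wrapper."""
    return {"_id": key, "count": sum(v["count"] for v in values)}


def run_locally(documents):
    """Imitate the shuffle phase: sort by key, then reduce each group of records."""
    mapped = sorted(mapper(documents), key=itemgetter("_id"))
    return [reducer(key, group)
            for key, group in groupby(mapped, key=itemgetter("_id"))]


# Made-up sample documents.
sample_docs = [
    {"entities": {"hashtags": [{"text": "tbt"}, {"text": "android"}]}},
    {"entities": {"hashtags": [{"text": "android"}]}},
    {"text": "no hashtags"},
]

print(run_locally(sample_docs))
```

The sort step matters: `itertools.groupby` only merges adjacent records, which is also why Hadoop sorts mapper output by key before handing it to reducers.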

Execute MongoDB Hadoop M/R
hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
  -mapper examples/twitter/twit_hashtag_map.py \
  -reducer examples/twitter/twit_hashtag_reduce.py \
  -inputURI mongodb://127.0.0.1/test.tweets \
  -outputURI mongodb://127.0.0.1/test.tweet_hashtags \
  -file examples/twitter/twit_hashtag_map.py \
  -file examples/twitter/twit_hashtag_reduce.py

Popular Hashtags
> db.tweet_hashtags.find().sort({count: -1}).limit(10)
{ "_id" : "RT", "count" : 559 }
{ "_id" : "20SongsThatILike", "count" : 508 }
{ "_id" : "BrazilLovesHarryStyles", "count" : 411 }
{ "_id" : "WeLiveInAGenerationWhere", "count" : 358 }
{ "_id" : "tbt", "count" : 358 }
{ "_id" : "TeamFollowBack", "count" : 335 }
{ "_id" : "SCTVthanksForFullSMTOWN", "count" : 314 }
{ "_id" : "MientoComoHombre", "count" : 304 }
{ "_id" : "musicfans", "count" : 233 }
{ "_id" : "StoryBehindMyScar", "count" : 203 }

MongoDB and Hadoop
• Away from data store
• Can leverage existing data processing infrastructure
• Can horizontally scale your data processing
- Offline batch processing
- Requires synchronisation between store & processor
- Infrastructure is much more complex

The Future of Big Data and MongoDB

What is Big? Big today is normal tomorrow

Big is only getting bigger
[Chart: billions of URLs indexed by Google, 2000-2011; y-axis 0 to 9000]

MongoDB enables you to scale to the redefinition of BIG.

MongoDB is evolving to enable you to process the new
BIG.

MongoDB is committed to working with the best data processing tools
• Aggregation Framework
• Hadoop https://github.com/mongodb/mongo-hadoop
• Storm https://github.com/christkv/mongo-storm
• Disco https://github.com/mongodb/mongo-disco
• Spark Coming soon!

Norberto Leite #MongoDBMadrid Obrigado! @nleite
