Data Processing and Aggregation MongoDB

This talk was presented at the Madrid MUG (http://www.meetup.com/Madrid-MongoDB-User-Group/events/117219142/). It covers data processing mechanisms and the aggregation framework in MongoDB.

Norberto

May 22, 2013

Transcript

  1. Exponential Data Growth

    [Chart: billions of URLs indexed by Google per year, 2000 to 2008; y-axis from 0 to 1000]
  2. MongoDB solves our needs

    • MongoDB is an ideal operational database
    • MongoDB provides high performance for storage and retrieval at large scale
    • MongoDB has a robust query interface permitting intelligent operations
    • MongoDB is not a data processing engine, but provides processing functionality
  3. Data Processing in MongoDB

    • Process in MongoDB using Map/Reduce
    • Process in MongoDB using Aggregation Framework
    • Process outside MongoDB using Hadoop and other external tools
  4. Twitter

    {
      "created_at" : "Thu Feb 21 11:27:16 +0000 2013",
      "id" : 304552867820339200,
      "id_str" : "304552867820339200",
      "text" : "I've collected 58,158 gold coins! http://t.co/oqcrKhQXdv #android, #androidgames, #gameinsight",
      "source" : "<a href=\"http://bit.ly/tribez_itw\" rel=\"nofollow\">The Tribez for Android</a>",
      "truncated" : false,
      "in_reply_to_status_id" : null,
      "in_reply_to_status_id_str" : null,
      "in_reply_to_user_id" : null,
      "in_reply_to_user_id_str" : null,
      "in_reply_to_screen_name" : null,
      "user" : {
        "id" : 1089088963,
        "id_str" : "1089088963",
        "name" : "timothy m farmer",
        "screen_name" : "iceyknight8461",
        "location" : "",
        "url" : null,

    curl -u <u>:<pw> https://stream.twitter.com/1.1/statuses/sample.json | mongoimport -d test -c tweets
  5. Inspecting Hashtags

    {
      "created_at" : "Thu Feb 21 11:27:16 +0000 2013",
      "id" : 304552867820339200,
      "text" : "I've collected 58,158 gold coins! http://t.co/oqcrKhQXdv #android, #androidgames, #gameinsight",
      "user" : {...},
      "entities" : {
        "hashtags" : [
          { "text" : "android", "indices" : [57, 65] },
          { "text" : "androidgames", "indices" : [67, 80] }
        ]
      }
    }
  6. Map Function

    > var map = function() {
        // Emit (hashtag, 1) once for every hashtag in the tweet
        if (this.entities && this.entities.hashtags) {
          for (var i = 0; i < this.entities.hashtags.length; i++) {
            emit(this.entities.hashtags[i].text, 1);
          }
        }
      }
  7. Reduce

    > var reduce = function (key, values) {
        // Sum the partial counts emitted for this hashtag
        var sum = 0;
        values.forEach( function (val) { sum += val; } );
        return sum;
      }
  8. Execute MongoDB Map Reduce

    > db.tweets.mapReduce(map, reduce, {out: 'tweet_hashtags'})
    {
      "result" : "tweet_hashtags",
      "timeMillis" : 6297,
      "counts" : {
        "input" : 328978,
        "emit" : 54027,
        "reduce" : 9789,
        "output" : 27046
      },
      "ok" : 1
    }
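To make the map, emit, and reduce steps above concrete, here is a minimal in-memory sketch in Python. It is not how MongoDB executes the job; `map_tweet`, `reduce_counts`, `map_reduce`, and the sample tweets are hypothetical names and data chosen for illustration. The loop feeds partial results back into the reduce function, because MongoDB may call reduce more than once for the same key, so the reduce function has to return a value of the same shape as the values it receives.

```python
from collections import defaultdict


def map_tweet(tweet):
    """Python mirror of the map function: yield one (hashtag, 1) pair per hashtag."""
    entities = tweet.get("entities") or {}
    for tag in entities.get("hashtags", []):
        yield tag["text"], 1


def reduce_counts(key, values):
    """Python mirror of the reduce function: sum the partial counts for a key."""
    return sum(values)


def map_reduce(tweets):
    """Toy map/reduce over a list of dicts (an assumption, not the server's algorithm)."""
    emitted = defaultdict(list)
    for tweet in tweets:
        for key, value in map_tweet(tweet):
            emitted[key].append(value)

    results = {}
    for key, values in emitted.items():
        # Reduce two values at a time and push the partial result back,
        # imitating the re-reduce passes a real job may perform.
        while len(values) > 1:
            head, values = values[:2], values[2:]
            values.append(reduce_counts(key, head))
        # A key with a single emitted value never reaches reduce.
        results[key] = values[0]
    return results


sample_tweets = [
    {"entities": {"hashtags": [{"text": "android"}, {"text": "androidgames"}]}},
    {"entities": {"hashtags": [{"text": "android"}]}},
    {"entities": {"hashtags": [{"text": "android"}]}},
    {"text": "a tweet without entities"},
]

hashtag_counts = map_reduce(sample_tweets)
print(hashtag_counts)
```

Because summing is associative, the repeated partial reductions give the same totals as a single pass; a reduce function that returned, say, an average would not survive this treatment.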
  9. Results

    > db.tweet_hashtags.find().sort({value: -1}).limit(10)
    { "_id" : "RT", "value" : 559 }
    { "_id" : "20SongsThatILike", "value" : 508 }
    { "_id" : "BrazilLovesHarryStyles", "value" : 411 }
    { "_id" : "WeLiveInAGenerationWhere", "value" : 358 }
    { "_id" : "tbt", "value" : 358 }
    { "_id" : "TeamFollowBack", "value" : 335 }
    { "_id" : "SCTVthanksForFullSMTOWN", "value" : 314 }
    { "_id" : "MientoComoHombre", "value" : 304 }
    { "_id" : "musicfans", "value" : 233 }
    { "_id" : "StoryBehindMyScar", "value" : 203 }
  10. MongoDB Map/Reduce

    • Real-time
    • Output directly to document or collection
    • Runs inside MongoDB on local data
    - Adds load to your DB
    - In JavaScript
    - V8 engine
  11. Aggregation framework operators

    • $project
    • $match
    • $limit
    • $skip
    • $sort
    • $unwind
    • $group
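As a rough mental model of how pipeline stages chain together, the sketch below imitates five of the listed operators ($match, $sort, $skip, $limit, $project) with plain Python generators. These functions are hypothetical stand-ins, not the MongoDB API, and they simplify the real semantics (for example, this `project` does not keep `_id` by default). The `users` documents are made up.

```python
from itertools import islice


def match(docs, predicate):
    """Like $match: keep only documents that satisfy the predicate."""
    return (d for d in docs if predicate(d))


def sort(docs, key, descending=False):
    """Like $sort: needs every input document before it can emit the first one."""
    return iter(sorted(docs, key=lambda d: d[key], reverse=descending))


def skip(docs, n):
    """Like $skip: drop the first n documents."""
    return islice(docs, n, None)


def limit(docs, n):
    """Like $limit: pass through at most n documents."""
    return islice(docs, n)


def project(docs, fields):
    """Like $project with an inclusion list (simplified: no implicit _id)."""
    return ({f: d[f] for f in fields if f in d} for d in docs)


users = [
    {"name": "ana", "followers": 120, "lang": "es"},
    {"name": "bob", "followers": 15, "lang": "en"},
    {"name": "eva", "followers": 300, "lang": "es"},
    {"name": "luis", "followers": 42, "lang": "es"},
]

# Each stage consumes the output of the previous one, as in a pipeline array.
stage = match(users, lambda d: d["lang"] == "es")
stage = sort(stage, "followers", descending=True)
stage = skip(stage, 1)
stage = limit(stage, 2)
page = list(project(stage, ["name", "followers"]))
print(page)
```

Stage order matters here just as it does in a real pipeline: filtering before sorting keeps the sort input small.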
  12. $group operators

    • $first
    • $last
    • $max
    • $min
    • $avg
    • $sum
    • $push
    • $addToSet
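The following sketch shows what each $group operator computes over the documents that fall into one group. It is a toy emulation under stated assumptions: `ACCUMULATORS`, `group`, and the sample `tweets` are hypothetical, the spec format `(operator, input field)` is invented for brevity, and `$addToSet` is sorted here only to make the output deterministic (MongoDB does not guarantee an order for it).

```python
# Each accumulator receives the list of input values for one group, in arrival order.
ACCUMULATORS = {
    "$first": lambda vals: vals[0],
    "$last": lambda vals: vals[-1],
    "$max": max,
    "$min": min,
    "$avg": lambda vals: sum(vals) / len(vals),
    "$sum": sum,
    "$push": list,
    "$addToSet": lambda vals: sorted(set(vals)),  # sorted only for stable output
}


def group(docs, key_field, spec):
    """Toy $group: spec maps an output field to (operator, input field)."""
    buckets = {}
    for doc in docs:
        buckets.setdefault(doc[key_field], []).append(doc)

    rows = []
    for key, members in buckets.items():
        row = {"_id": key}
        for out_field, (op, in_field) in spec.items():
            row[out_field] = ACCUMULATORS[op]([m[in_field] for m in members])
        rows.append(row)
    return rows


tweets = [
    {"lang": "es", "retweets": 4, "user": "ana"},
    {"lang": "en", "retweets": 10, "user": "bob"},
    {"lang": "es", "retweets": 6, "user": "eva"},
    {"lang": "es", "retweets": 2, "user": "ana"},
]

summary = group(tweets, "lang", {
    "total": ("$sum", "retweets"),
    "average": ("$avg", "retweets"),
    "highest": ("$max", "retweets"),
    "lowest": ("$min", "retweets"),
    "first_user": ("$first", "user"),
    "last_user": ("$last", "user"),
    "all_users": ("$push", "user"),
    "unique_users": ("$addToSet", "user"),
})
for row in summary:
    print(row)
```

Note that `$first` and `$last` depend on the order in which documents reach the group, which is why they are normally used after a sort.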
  13. Aggregate Hashtags

    > hash_tags = [
        { $unwind : "$entities.hashtags" },
        { $match : { "entities.hashtags.text" : { $exists : true } } },
        { $group : { _id : "$entities.hashtags.text", count : { $sum : 1 } } },
        { $sort : { count : -1 } },
        { $limit : 10 }
      ]
  15. Results

    > db.tweets.aggregate(hash_tags)
    {
      "result" : [
        { "_id" : "RT", "count" : 559 },
        { "_id" : "20SongsThatILike", "count" : 508 },
        { "_id" : "BrazilLovesHarryStyles", "count" : 411 },
        { "_id" : "WeLiveInAGenerationWhere", "count" : 358 },
        { "_id" : "tbt", "count" : 358 },
        { "_id" : "TeamFollowBack", "count" : 335 },
        { "_id" : "SCTVthanksForFullSMTOWN", "count" : 314 },
        { "_id" : "MientoComoHombre", "count" : 304 },
        { "_id" : "musicfans", "count" : 233 },
        { "_id" : "StoryBehindMyScar", "count" : 203 }
      ],
      "ok" : 1
    }
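To show what the hashtag pipeline computes, here is a small Python equivalent that runs on a list of dicts instead of a collection. It is only a sketch of the logic: `top_hashtags` and the sample tweets are hypothetical, and the comments map each step to the pipeline stage it imitates.

```python
from collections import Counter


def top_hashtags(tweets, n=10):
    """In-memory imitation of the $unwind, $match, $group, $sort, $limit pipeline."""
    counts = Counter()
    for tweet in tweets:
        # $unwind: visit each element of entities.hashtags separately;
        # tweets without that array contribute nothing.
        for tag in (tweet.get("entities") or {}).get("hashtags") or []:
            # $match: keep only elements that carry a text field.
            if "text" in tag:
                # $group with count: {$sum: 1}
                counts[tag["text"]] += 1
    # $sort on count descending, then $limit n
    return [{"_id": text, "count": count} for text, count in counts.most_common(n)]


sample_tweets = [
    {"entities": {"hashtags": [{"text": "android"}, {"text": "androidgames"}]}},
    {"entities": {"hashtags": [{"text": "android"}, {"text": "tbt"}]}},
    {"entities": {"hashtags": [{"text": "tbt"}, {"text": "android"}]}},
    {"text": "a tweet without entities"},
]

top_two = top_hashtags(sample_tweets, n=2)
print(top_two)
```

The output documents use the field name `count` because that is the name chosen in the $group stage, unlike map/reduce output, which always stores the reduced result under `value`.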
  16. Aggregation Framework Benefits

    • Real-time
    • Simple yet powerful interface
    • Declared in JSON, executes in C++
    • Runs inside MongoDB on local data
    - Adds load to your DB
    - Limited in how much data it can return
  17. Map Hashtags in Python

    #!/usr/bin/env python
    import sys
    sys.path.append(".")

    from pymongo_hadoop import BSONMapper

    def mapper(documents):
        # Emit one {_id: hashtag, count: 1} record per hashtag in each tweet
        for doc in documents:
            for hashtag in doc['entities']['hashtags']:
                yield {'_id': hashtag['text'], 'count': 1}

    BSONMapper(mapper)
    print >> sys.stderr, "Done Mapping."  # Python 2 print syntax
  18. Reduce Hashtags in Python

    #!/usr/bin/env python
    import sys
    sys.path.append(".")

    from pymongo_hadoop import BSONReducer

    def reducer(key, values):
        # Sum the partial counts emitted by the mapper for this hashtag
        print >> sys.stderr, "Hashtag %s" % key  # Python 2 print syntax
        _count = 0
        for v in values:
            _count += v['count']
        return {'_id': key, 'count': _count}

    BSONReducer(reducer)
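The mapper and reducer above can be sanity-checked without a Hadoop cluster. The sketch below is an assumption-laden stand-in: it does not import `pymongo_hadoop`, it redefines the two functions in Python 3 with a defensive `.get()`, and `run_locally` is a hypothetical helper that replaces the shuffle phase with a sort followed by `itertools.groupby`, so that each key's records arrive at the reducer together.

```python
from itertools import groupby
from operator import itemgetter


def mapper(documents):
    """Same logic as the streaming mapper: one record per hashtag."""
    for doc in documents:
        for hashtag in doc.get("entities", {}).get("hashtags", []):
            yield {"_id": hashtag["text"], "count": 1}


def reducer(key, values):
    """Same logic as the streaming reducer: sum the counts for one key."""
    return {"_id": key, "count": sum(v["count"] for v in values)}


def run_locally(documents):
    """Hypothetical local driver: sort by key, then reduce each run of equal keys."""
    shuffled = sorted(mapper(documents), key=itemgetter("_id"))
    return [
        reducer(key, list(records))
        for key, records in groupby(shuffled, key=itemgetter("_id"))
    ]


sample_tweets = [
    {"entities": {"hashtags": [{"text": "android"}, {"text": "androidgames"}]}},
    {"entities": {"hashtags": [{"text": "tbt"}, {"text": "android"}]}},
    {"entities": {"hashtags": [{"text": "android"}, {"text": "tbt"}]}},
    {"text": "a tweet without entities"},
]

local_result = run_locally(sample_tweets)
print(local_result)
```

Sorting first matters because `groupby` only merges adjacent records with the same key.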
  19. Execute MongoDB Hadoop M/R

    hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
      -mapper examples/twitter/twit_hashtag_map.py \
      -reducer examples/twitter/twit_hashtag_reduce.py \
      -inputURI mongodb://127.0.0.1/test.tweets \
      -outputURI mongodb://127.0.0.1/test.tweet_hashtags \
      -file examples/twitter/twit_hashtag_map.py \
      -file examples/twitter/twit_hashtag_reduce.py
  20. Popular Hashtags

    > db.tweet_hashtags.find().sort({count: -1}).limit(10)
    { "_id" : "RT", "count" : 559 }
    { "_id" : "20SongsThatILike", "count" : 508 }
    { "_id" : "BrazilLovesHarryStyles", "count" : 411 }
    { "_id" : "WeLiveInAGenerationWhere", "count" : 358 }
    { "_id" : "tbt", "count" : 358 }
    { "_id" : "TeamFollowBack", "count" : 335 }
    { "_id" : "SCTVthanksForFullSMTOWN", "count" : 314 }
    { "_id" : "MientoComoHombre", "count" : 304 }
    { "_id" : "musicfans", "count" : 233 }
    { "_id" : "StoryBehindMyScar", "count" : 203 }
  21. MongoDB and Hadoop

    • Away from data store
    • Can leverage existing data processing infrastructure
    • Can horizontally scale your data processing
    - Offline batch processing
    - Requires synchronisation between store & processor
    - Infrastructure is much more complex
  22. Big is only getting bigger

    [Chart: billions of URLs indexed by Google per year, 2000 to 2011; y-axis from 0 to 9000]
  23. MongoDB is committed to working with the best data processing tools

    • Aggregation Framework
    • Hadoop https://github.com/mongodb/mongo-hadoop
    • Storm https://github.com/christkv/mongo-storm
    • Disco https://github.com/mongodb/mongo-disco
    • Spark (coming soon!)
  24. Q&A