Hadoop Mongo.pdf

mpobrien
November 07, 2012

Transcript

  1. Big Data Problems A) Storage of huge (TB? PB?) sizes B) Reading + Writing all that data C) Processing/Analytics on large data sets
  2. Large Dataset Solutions: Distributed Computing [diagram: data items a–l spread across the nodes of a cluster]
  3. [diagram, continued: the data items a–l are partitioned across the distributed nodes]
  4. [same cluster diagram] A) Storage. Hadoop: HDFS; Mongo: Shards
  5. [same cluster diagram] B) Reading/Writing: Query for name : “frank”
  6. [same cluster diagram, continued] B) Reading/Writing: Query for name : “frank”
  7. [same cluster diagram] C) Processing - MapReduce: “calculate name:count for all names in the dataset” map(...) map(...) map(...)
  8. [same cluster diagram] C) Processing - MapReduce: “calculate name:count for all names in the dataset” reduce(...) reduce(...) reduce(...)
  9. [same cluster diagram] C) Processing - MapReduce: “calculate name:count for all names in the dataset” christina: 3, harry: 4, ... adam: 10, frank: 1, ... danielle: 3, jennifer: 4, ...
  10. Hadoop: framework for map/reduce (processing large volumes of data). MongoDB: schemaless non-relational database (querying + updating data).
  11. Mongo MapReduce

    map = function() {
        this.tags.forEach(function(z) {
            emit(z, { count : 1 });
        });
    };

    reduce = function(key, values) {
        var total = 0;
        for (var i = 0; i < values.length; i++)
            total += values[i].count;
        return { count : total };
    };

    res = db.things.mapReduce(map, reduce, { out : "myoutput" });
  12. Mongo MapReduce
    • Capable, but sometimes limiting
    • Code must be written in JS
    • Single threaded on each machine
    • Adds load to data store
  13. Hadoop MapReduce

    public void map(LongWritable key, Text value, Context context)
        throws IOException { ... }

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException { ... }

    (abridged)
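For context, the abridged signatures above expand to full Mapper and Reducer classes along the lines of the word-count sketch below. This example is illustrative rather than taken from the deck; the class names and the whitespace tokenizing are assumptions, and it targets the org.apache.hadoop.mapreduce API.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: called once per input record; emits (word, 1) for every token.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Reducer: receives every value emitted for one key; emits (word, total).
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }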
  14. Mongo/Hadoop Connector
    • Use MongoDB data as input to a Hadoop MapReduce job
    • Translates to/from MongoDB storage format (BSON)
    • Splits data from MongoDB for parallelization

    public void map( Object key, BSONObject value, Context context )
        throws IOException { ... }
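As a concrete illustration of that map() signature, a connector mapper might pull a field out of each BSON document and emit it with a count. The class name and the headers.From field below are assumptions made for this sketch (the field mirrors the Enron documents shown later in the deck), not code shipped with the connector.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.bson.BSONObject;

    // Hypothetical mapper: each input value is one MongoDB document, already
    // decoded from BSON by the connector before map() is called.
    public class SenderCountMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        public void map(Object key, BSONObject value, Context context)
                throws IOException, InterruptedException {
            // Assumed field layout: headers.From, as in the Enron sample document.
            BSONObject headers = (BSONObject) value.get("headers");
            if (headers != null && headers.get("From") != null) {
                context.write(new Text(headers.get("From").toString()), ONE);
            }
        }
    }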
  15. Benefits
    • Access to the entire Hadoop toolchain
    • Access to the entire JVM ecosystem
    • Full multi-core parallelism
    • Support for Hadoop Streaming: write Map/Reduce jobs in any language
  16. [pipeline diagram: MongoDB cluster → map(k, v, ctx) / ctx.write(k’, v’) → sort(k’) → Partitioner(k’) → reduce(k’, values’), with Mongo on one side and Hadoop on the other] Step 1: Input splits are calculated from the MongoDB collection.
  17. [same pipeline diagram] Step 2: Splits are loaded from MongoDB and sent as input to Hadoop (com.mongodb.hadoop.MongoInputFormat).
  18. [same pipeline diagram] Step 3: Hadoop nodes execute the map function in parallel.
  19. [same pipeline diagram] Step 4: The Partitioner collects the results of map functions with the same output key.
  20. [same pipeline diagram] Step 5: Results are sorted and sent to the reduce phase.
  21. [same pipeline diagram] Step 6: Outputs of reduce() are stored in a MongoDB collection (com.mongodb.hadoop.output.MongoRecordWriter).
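Putting steps 1 through 6 together, a job driver wired up against the connector might look roughly like the sketch below. MongoInputFormat and MongoRecordWriter are named on the slides; MongoOutputFormat, the mongo.input.uri / mongo.output.uri configuration keys, the database/collection names, and the mapper/reducer classes (reused from the sketches above) are assumptions, so treat this as an outline rather than the connector's exact API.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    import com.mongodb.hadoop.MongoInputFormat;
    import com.mongodb.hadoop.MongoOutputFormat;

    public class SenderCountJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed connector settings: which collection to read documents from
            // and which collection the reduced output should be written back to.
            conf.set("mongo.input.uri", "mongodb://localhost:27017/enron.messages");
            conf.set("mongo.output.uri", "mongodb://localhost:27017/enron.sender_counts");

            Job job = Job.getInstance(conf, "sender counts");
            job.setJarByClass(SenderCountJob.class);

            // Steps 1-2: input splits are computed from the MongoDB collection.
            job.setInputFormatClass(MongoInputFormat.class);
            // Step 6: reduce output flows back to MongoDB via the connector's record writer.
            job.setOutputFormatClass(MongoOutputFormat.class);

            job.setMapperClass(SenderCountMapper.class);   // hypothetical mapper sketched earlier
            job.setReducerClass(WordCountReducer.class);   // any Reducer<Text, IntWritable, ...>
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }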
  22. Hadoop Streaming: but what if i don’t wanna use Java :(
    [diagram: hadoop (JVM) exchanging map()/reduce() records with Python / Ruby / JS functions running in an external OS process over STDIN/STDOUT]
    • Do computation for map() and reduce() in an external process
    • Params and return values exchanged over std input/output streams
  23. Demo - MapReduce with Mongo/Hadoop: using the Enron e-mail corpus (501,000 records, 1.75 GB). E-mails are loaded into a MongoDB collection. Download: http://bit.ly/wmelPm
  24. { "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"), "body" : "Here is our forecast\n\n

    ", "subFolder" : "allen-p/_sent_mail", "mailbox" : "maildir", "filename" : "1.", "headers" : { "X-cc" : "", "From" : "[email protected]", "Subject" : "", "X-Folder" : "\\Phillip_Allen_Jan2002_1\\Allen, Phillip K.\\'Sent Mail", "Content-Transfer-Encoding" : "7bit", "X-bcc" : "", "To" : "[email protected]", "X-Origin" : "Allen-P", "X-FileName" : "pallen (Non-Privileged).pst", "X-From" : "Phillip K Allen", "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)", "X-To" : "Tim Belden ", "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>", "Content-Type" : "text/plain; charset=us-ascii", "Mime-Version" : "1.0" } } Sample Input Document Let’s build a graph of (senders → recipients) and the count of messages exchanged between each pair Wednesday, May 16, 12
  25. Demo - MapReduce with Mongo/Hadoop: we’ll use Python here for simplicity and legibility. With Hadoop Streaming, the Python interpreter will be invoked to perform the actual computation.
  26. Mapper Code

    #!/usr/bin/env python
    import sys
    sys.path.append(".")
    from pymongo_hadoop import BSONMapper

    def mapper(documents):
        i = 0
        for doc in documents:
            i = i + 1
            if 'headers' in doc and 'To' in doc['headers'] and 'From' in doc['headers']:
                from_field = doc['headers']['From']
                to_field = doc['headers']['To']
                recips = [x.strip() for x in to_field.split(',')]
                for r in recips:
                    yield {'_id': {'f': from_field, 't': r}, 'count': 1}

    BSONMapper(mapper)
    print >> sys.stderr, "Done Mapping."
  27. Reducer Code

    #!/usr/bin/env python
    import sys
    sys.path.append(".")
    from pymongo_hadoop import BSONReducer

    def reducer(key, values):
        print >> sys.stderr, "Processing from/to %s" % str(key)
        _count = 0
        for v in values:
            _count += v['count']
        return {'_id': key, 'count': _count}

    BSONReducer(reducer)
  28. Results

    mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 4 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 4 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 6 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    has more
  29. More to come
    • Support for reading from multiple collections
    • Static BSON support - read/write from/to MongoDB backup files on S3, HDFS, etc.
    • Support for additional languages in streaming - currently: Python, JavaScript, Ruby
    • Expanded tools support
    • Performance improvements
  30. github.com/mongodb/mongo-hadoop thanks! ✌(-‿-)✌ Credits to the people who built it: Brendan McAdams, Max Afonov, Evan Korth, Joseph Shraibman, Sumin Xia, Priya Manda, Rushin Shah, Russel Jurney, and many more.