Hadoop Mongo.pdf

mpobrien
November 07, 2012

Transcript

  1. Big Data Problems A) Storage of huge (TB? PB?) sizes B) Reading + Writing all that data C) Processing/Analytics on large data sets
  2. Large Dataset Solutions: Distributed Computing [diagram: data items a–l spread across the nodes of a cluster]
  3. [diagram, continued: the data items a–l are partitioned across the distributed nodes]
  4. [same cluster diagram] A) Storage. Hadoop: HDFS; Mongo: Shards
  5. [same cluster diagram] B) Reading/Writing: Query for name : “frank”
  6. [same cluster diagram, continued] B) Reading/Writing: Query for name : “frank”
  7. [same cluster diagram] C) Processing - MapReduce: “calculate name:count for all names in the dataset” map(...) map(...) map(...)
  8. [same cluster diagram] C) Processing - MapReduce: “calculate name:count for all names in the dataset” reduce(...) reduce(...) reduce(...)
  9. [same cluster diagram] C) Processing - MapReduce: “calculate name:count for all names in the dataset” christina: 3, harry: 4, ... adam: 10, frank: 1, ... danielle: 3, jennifer: 4, ...
  10. Hadoop: framework for map/reduce (processing large volumes of data). MongoDB: schemaless non-relational database (querying + updating data).
  11. Mongo MapReduce

    map = function() {
        this.tags.forEach(function(z) {
            emit(z, { count : 1 });
        });
    };

    reduce = function(key, values) {
        var total = 0;
        for (var i = 0; i < values.length; i++)
            total += values[i].count;
        return { count : total };
    };

    res = db.things.mapReduce(map, reduce, { out : "myoutput" });
  12. Mongo MapReduce
    • Capable, but sometimes limiting
    • Code must be written in JS
    • Single threaded on each machine
    • Adds load to data store
  13. Hadoop MapReduce

    public void map(LongWritable key, Text value, Context context)
        throws IOException { ... }

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException { ... }

    (abridged)
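For context, the abridged signatures above expand to full Mapper and Reducer classes along the lines of the word-count sketch below. This example is illustrative rather than taken from the deck; the class names and the whitespace tokenizing are assumptions, and it targets the org.apache.hadoop.mapreduce API.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: called once per input record; emits (word, 1) for every token.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Reducer: receives every value emitted for one key; emits (word, total).
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }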
  14. Mongo/Hadoop Connector
    • Use MongoDB data as input to a Hadoop MapReduce job
    • Translates to/from MongoDB storage format (BSON)
    • Splits data from MongoDB for parallelization

    public void map( Object key, BSONObject value, Context context )
        throws IOException { ... }
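As a concrete illustration of that map() signature, a connector mapper might pull a field out of each BSON document and emit it with a count. The class name and the headers.From field below are assumptions made for this sketch (the field mirrors the Enron documents shown later in the deck), not code shipped with the connector.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.bson.BSONObject;

    // Hypothetical mapper: each input value is one MongoDB document, already
    // decoded from BSON by the connector before map() is called.
    public class SenderCountMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        public void map(Object key, BSONObject value, Context context)
                throws IOException, InterruptedException {
            // Assumed field layout: headers.From, as in the Enron sample document.
            BSONObject headers = (BSONObject) value.get("headers");
            if (headers != null && headers.get("From") != null) {
                context.write(new Text(headers.get("From").toString()), ONE);
            }
        }
    }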
  15. Benefits
    • Access to the entire Hadoop toolchain
    • Access to the entire JVM ecosystem
    • Full multi-core parallelism
    • Support for Hadoop Streaming: write Map/Reduce jobs in any language
  16. [pipeline diagram: MongoDB cluster → map(k, v, ctx) / ctx.write(k’, v’) → sort(k’) → Partitioner(k’) → reduce(k’, values’), with Mongo on one side and Hadoop on the other] Step 1: Input splits are calculated from the MongoDB collection.
  17. [same pipeline diagram] Step 2: Splits are loaded from MongoDB and sent as input to Hadoop (com.mongodb.hadoop.MongoInputFormat).
  18. [same pipeline diagram] Step 3: Hadoop nodes execute the map function in parallel.
  19. [same pipeline diagram] Step 4: The Partitioner collects the results of map functions with the same output key.
  20. [same pipeline diagram] Step 5: Results are sorted and sent to the reduce phase.
  21. [same pipeline diagram] Step 6: Outputs of reduce() are stored in a MongoDB collection (com.mongodb.hadoop.output.MongoRecordWriter).
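Putting steps 1 through 6 together, a job driver wired up against the connector might look roughly like the sketch below. MongoInputFormat and MongoRecordWriter are named on the slides; MongoOutputFormat, the mongo.input.uri / mongo.output.uri configuration keys, the database/collection names, and the mapper/reducer classes (reused from the sketches above) are assumptions, so treat this as an outline rather than the connector's exact API.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    import com.mongodb.hadoop.MongoInputFormat;
    import com.mongodb.hadoop.MongoOutputFormat;

    public class SenderCountJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed connector settings: which collection to read documents from
            // and which collection the reduced output should be written back to.
            conf.set("mongo.input.uri", "mongodb://localhost:27017/enron.messages");
            conf.set("mongo.output.uri", "mongodb://localhost:27017/enron.sender_counts");

            Job job = Job.getInstance(conf, "sender counts");
            job.setJarByClass(SenderCountJob.class);

            // Steps 1-2: input splits are computed from the MongoDB collection.
            job.setInputFormatClass(MongoInputFormat.class);
            // Step 6: reduce output flows back to MongoDB via the connector's record writer.
            job.setOutputFormatClass(MongoOutputFormat.class);

            job.setMapperClass(SenderCountMapper.class);   // hypothetical mapper sketched earlier
            job.setReducerClass(WordCountReducer.class);   // any Reducer<Text, IntWritable, ...>
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }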
  22. Hadoop Streaming: but what if i don’t wanna use Java :(
    [diagram: hadoop (JVM) exchanging map()/reduce() records with Python / Ruby / JS functions running in an external OS process over STDIN/STDOUT]
    • Do computation for map() and reduce() in an external process
    • Params and return values exchanged over std input/output streams
  23. Demo - MapReduce with Mongo/Hadoop: using the Enron e-mail corpus (501,000 records, 1.75 GB). E-mails are loaded into a MongoDB collection. Download: http://bit.ly/wmelPm
  24. { "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"), "body" : "Here is our forecast\n\n

    ", "subFolder" : "allen-p/_sent_mail", "mailbox" : "maildir", "filename" : "1.", "headers" : { "X-cc" : "", "From" : "[email protected]", "Subject" : "", "X-Folder" : "\\Phillip_Allen_Jan2002_1\\Allen, Phillip K.\\'Sent Mail", "Content-Transfer-Encoding" : "7bit", "X-bcc" : "", "To" : "[email protected]", "X-Origin" : "Allen-P", "X-FileName" : "pallen (Non-Privileged).pst", "X-From" : "Phillip K Allen", "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)", "X-To" : "Tim Belden ", "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>", "Content-Type" : "text/plain; charset=us-ascii", "Mime-Version" : "1.0" } } Sample Input Document Let’s build a graph of (senders → recipients) and the count of messages exchanged between each pair Wednesday, May 16, 12
  25. Demo - MapReduce with Mongo/Hadoop: we’ll use Python here for simplicity and legibility. With Hadoop Streaming, the Python interpreter will be invoked to perform the actual computation.
  26. Mapper Code

    #!/usr/bin/env python
    import sys
    sys.path.append(".")
    from pymongo_hadoop import BSONMapper

    def mapper(documents):
        i = 0
        for doc in documents:
            i = i + 1
            if 'headers' in doc and 'To' in doc['headers'] and 'From' in doc['headers']:
                from_field = doc['headers']['From']
                to_field = doc['headers']['To']
                recips = [x.strip() for x in to_field.split(',')]
                for r in recips:
                    yield {'_id': {'f': from_field, 't': r}, 'count': 1}

    BSONMapper(mapper)
    print >> sys.stderr, "Done Mapping."
  27. Reducer Code

    #!/usr/bin/env python
    import sys
    sys.path.append(".")
    from pymongo_hadoop import BSONReducer

    def reducer(key, values):
        print >> sys.stderr, "Processing from/to %s" % str(key)
        _count = 0
        for v in values:
            _count += v['count']
        return {'_id': key, 'count': _count}

    BSONReducer(reducer)
  28. Results

    mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 4 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 4 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 6 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    has more
  29. More to come
    • Support for reading from multiple collections
    • Static BSON support - read/write from/to MongoDB backup files on S3, HDFS, etc.
    • Support for additional languages in streaming - currently: Python, JavaScript, Ruby
    • Expanded tools support
    • Performance improvements
  30. github.com/mongodb/mongo-hadoop thanks! ✌(-‿-)✌ Credits to the people who built it: Brendan McAdams, Max Afonov, Evan Korth, Joseph Shraibman, Sumin Xia, Priya Manda, Rushin Shah, Russel Jurney, and many more.