
Taming the Elephant In the Room with the MongoDB Hadoop Connector

Mongo Philly 2012

Brendan McAdams

April 09, 2012

Transcript

  1. Brendan McAdams, 10gen, Inc. | [email protected] | @rit
     Taming The Elephant In The Room with MongoDB + Hadoop Integration

  2. Introducing the MongoDB Hadoop Connector
     • For the past year and a half, 10gen has been exploring ways to integrate MongoDB + Hadoop
     • Outgrowth from work I did prior to joining 10gen: "Luau", focused on Pig support for ETL

  3. Introducing the MongoDB Hadoop Connector
     • What have we been doing for 18 months?
     • Building a polished, reliable product which provides real benefit to our users
     • Improving and testing feature sets
     • Enhancing integration between the MongoDB server and the Hadoop Connector for maximum performance

  4. Introducing the MongoDB Hadoop Connector
     • Today, we are releasing v1.0.0 of this integration: the MongoDB Hadoop Connector
     • Read/write between MongoDB + Hadoop (core MapReduce) in Java
     • Write Pig (ETL) jobs' output to MongoDB
     • Write MapReduce jobs in Python via Hadoop Streaming
     • Collect massive amounts of logging output into MongoDB via Flume

  5. Community Contributions are Key
     • Lots of effort from the community to make this project come together
     • Max Afonov (@max4f) helped conceive and build the original Luau
     • Evan Korth (@evankorth) led a New York University projects class which built the initial input split support
     • Joseph Shraibman, Sumin Xia, Priya Manda, and Rushin Shah worked on this feature
     • Russell Jurney (@rjurney) has done a lot of heavy lifting on improving the Pig integration

  6. Separation of Concerns
     • Data storage and data processing are often separate concerns
     • MongoDB has limited ability to aggregate and process large datasets (JavaScript parallelism; alleviated somewhat by the new Aggregation Framework)
     • Hadoop is built for scalable processing of large datasets

  7. The Right Tool for the Job
     • JavaScript isn't the ideal language for many types of calculations:
       • Slow
       • Limited datatypes
       • No access to the complex analytics libraries available on the JVM
     • The JVM, by contrast, offers a rich, powerful ecosystem
     • Hadoop has machine learning, ETL, and many other tools which are much more flexible than the processing tools in MongoDB

  8. Being a Good Neighbor
     • Integration with customers' existing stacks & toolchains is crucial
     • Many users & customers already have Hadoop in their stacks
     • They want us to "play nicely" with their existing toolchains
     • Different groups within a company may mandate that all data be processable in Hadoop

  9. Hadoop Connector Capabilities
     • Split large datasets into smaller chunks ("input splits") for parallel Hadoop processing
     • Without splits, only one mapper can run
     • The connector can split both sharded & unsharded collections
     • Sharded: read the individual chunks from the config server into Hadoop
     • Unsharded: create splits, similar to how sharding chunks are calculated (see the sketch below)

  10. Parallel Processing of Splits
     • Ship "splits" to mappers as hostname, database, collection, & query
     • Each mapper reads the relevant documents in (sketched below)
     • Parallel processing for high performance
     • Speaks BSON between all layers!

  11. MapReduce in MongoDB (JavaScript)
     [Diagram] Data -> map() emits (k,v) -> sort(k) -> group(k) -> reduce(k, values) -> (k,v) -> finalize(k,v) -> (k,v)
     • map() iterates over the documents inside MongoDB; the current document is `this`, one at a time per shard
     • reduce()'s input must match its output, since it can run multiple times

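     For comparison, a hedged sketch of driving MongoDB's built-in JavaScript mapReduce from pymongo (the map_reduce helper shown exists in pymongo 2.x/3.x; the database, collection, and output names are assumptions for illustration):

         # Hedged sketch of MongoDB's server-side (JavaScript) mapReduce.
         from pymongo import MongoClient
         from bson.code import Code

         db = MongoClient()["enron_mail"]   # assumed database name

         # map() runs once per document inside mongod; the document is `this`
         map_fn = Code("function() {"
                       "  if (this.headers && this.headers.From)"
                       "    emit(this.headers.From, 1);"
                       "}")

         # reduce() may run multiple times per key, so its output must look like its input
         reduce_fn = Code("function(key, values) { return Array.sum(values); }")

         db.messages.map_reduce(map_fn, reduce_fn, out="sender_counts")
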
  12. Hadoop MapReduce w/ MongoDB
     [Diagram] MongoDB (single server or sharded cluster) -> InputFormat creates a list of input splits, the same size as Mongo's shard chunks (64 MB) -> one MongoDB RecordReader per split, running on the same thread as the map -> Map(k1, v1, ctx) calls ctx.write(k2, v2); many map operations, one at a time per input split -> Combiner(k2, values2) -> (k2, v3) -> Sort(k2) -> Partitioner(k2) -> reducer threads: Reduce(k2, values3) runs once per key -> (kf, vf) -> OutputFormat writes back to MongoDB

  13. Python Streaming
     • The Hadoop Streaming interface is much easier to demo (it's also my favorite feature, and was the hardest to implement)
     • Java gets a bit ... "verbose" on slides versus Python
     • Processing 1.75 gigabytes of the Enron Email Corpus (501,513 emails)
     • I ran this test on a 6-node Hadoop cluster
     • Grab your own copy of this dataset at: http://goo.gl/fSleC

  14. A Sample Input Doc

         {
           "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
           "body" : "Here is our forecast\n\n ",
           "subFolder" : "allen-p/_sent_mail",
           "mailbox" : "maildir",
           "filename" : "1.",
           "headers" : {
             "X-cc" : "",
             "From" : "[email protected]",
             "Subject" : "",
             "X-Folder" : "\\Phillip_Allen_Jan2002_1\\Allen, Phillip K.\\'Sent Mail",
             "Content-Transfer-Encoding" : "7bit",
             "X-bcc" : "",
             "To" : "[email protected]",
             "X-Origin" : "Allen-P",
             "X-FileName" : "pallen (Non-Privileged).pst",
             "X-From" : "Phillip K Allen",
             "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
             "X-To" : "Tim Belden ",
             "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
             "Content-Type" : "text/plain; charset=us-ascii",
             "Mime-Version" : "1.0"
           }
         }

  15. Setting up Hadoop Streaming
     • Install the Python support module on each Hadoop node:

         $ sudo pip install pymongo_hadoop

     • Build (or download) the Streaming module for the Hadoop adapter:

         $ git clone http://github.com/mongodb/mongo-hadoop.git
         $ ./sbt mongo-hadoop-streaming/assembly

  16. Mapper Code (enron_map.py)

         #!/usr/bin/env python
         import sys
         sys.path.append(".")

         from pymongo_hadoop import BSONMapper

         def mapper(documents):
             i = 0
             for doc in documents:
                 i = i + 1
                 if 'headers' in doc and 'To' in doc['headers'] and 'From' in doc['headers']:
                     from_field = doc['headers']['From']
                     to_field = doc['headers']['To']
                     recips = [x.strip() for x in to_field.split(',')]
                     # emit one {from, to} pair per recipient
                     for r in recips:
                         yield {'_id': {'f': from_field, 't': r}, 'count': 1}

         BSONMapper(mapper)
         print >> sys.stderr, "Done Mapping."

  17. Reducer Code (enron_reduce.py)

         #!/usr/bin/env python
         import sys
         sys.path.append(".")

         from pymongo_hadoop import BSONReducer

         def reducer(key, values):
             print >> sys.stderr, "Processing from/to %s" % str(key)
             _count = 0
             for v in values:
                 _count += v['count']
             return {'_id': key, 'count': _count}

         BSONReducer(reducer)

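     A quick, purely hypothetical way to sanity-check the two functions above without a Hadoop cluster: paste the mapper() and reducer() definitions into a Python 2 session (skipping the BSONMapper/BSONReducer calls) and fake the grouping that Hadoop performs between the map and reduce phases. The sample documents and addresses below are made up.

         # Hypothetical local smoke test -- not part of mongo-hadoop.
         from itertools import groupby

         sample_docs = [
             {'headers': {'From': 'a@example.com', 'To': 'b@example.com, c@example.com'}},
             {'headers': {'From': 'a@example.com', 'To': 'b@example.com'}},
         ]

         # Sort the mapper's output by key, then group it, as Hadoop's shuffle would.
         pairs = sorted(mapper(sample_docs), key=lambda p: sorted(p['_id'].items()))
         for key, group in groupby(pairs, key=lambda p: p['_id']):
             print(reducer(key, list(group)))
         # expected: {'_id': {'f': 'a@example.com', 't': 'b@example.com'}, 'count': 2}
         #           {'_id': {'f': 'a@example.com', 't': 'c@example.com'}, 'count': 1}
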
  18. Running the MapReduce

         hadoop jar mongo-hadoop-streaming-assembly-1.0.0.jar \
             -mapper /home/ec2-user/enron_map.py \
             -reducer /home/ec2-user/enron_reduce.py \
             -inputURI mongodb://test_mongodb:27020/enron_mail.messages \
             -outputURI mongodb://test_mongodb:27020/enron_mail.sender_map

  19. Results!

         mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 4 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 4 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 6 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
         has more

  20. Parallelism is Good
     The input data was split into 44 pieces for parallel processing...
     ... coincidentally, there were exactly 44 chunks on my sharded setup.
     Even with an unsharded collection, MongoHadoop can calculate splits!

  21. Looking Forward
     • Mongo Hadoop Connector 1.0.0 is released and available as of today
     • Docs: http://api.mongodb.org/hadoop/
     • Downloads & Code: http://github.com/mongodb/mongo-hadoop

  22. Looking Forward
     • Lots more coming; 1.1.0 expected in May 2012
     • Support for reading from multiple input collections ("MultiMongo")
     • Static BSON support... read from and write to Mongo backup files!
       • S3 / HDFS stored, mongodump format
       • Great for big offline batch jobs (this is how Foursquare does it)
     • Pig input (read from MongoDB into Pig)
     • Ruby support in Streaming
     • Performance improvements (e.g. pipelining BSON for streaming)
     • Future: expanded ecosystem support (Cascading, Oozie, Mahout, etc.)

  23. Looking Forward
     • We are committed to growing our integration with Big Data
     • Not only Hadoop, but other data processing systems our users want, such as Storm, Disco, and Spark
     • Initial Disco support is almost complete; look for it this summer
     • If you have other data processing toolchains you'd like to see integration with, let us know!

  24. Conferences, appearances, and meetups: http://www.10gen.com/events
     http://linkd.in/joinmongo | @mongodb | http://bit.ly/mongofb
     Did I mention we're hiring? http://www.10gen.com/careers (jobs of all sorts, all over the world!)
     Contact me: [email protected] (twitter: @rit)