Slide 1

Taming The Elephant In The Room with MongoDB + Hadoop Integration
Brendan McAdams, 10gen, Inc.
[email protected] (@rit)

Slide 2

Integrating MongoDB + Hadoop

Slide 3

Introducing the MongoDB Hadoop Connector

Slide 4

Introducing the MongoDB Hadoop Connector
• For the past year and a half, 10gen has been exploring ways to integrate MongoDB + Hadoop

Slide 5

Introducing the MongoDB Hadoop Connector
• For the past year and a half, 10gen has been exploring ways to integrate MongoDB + Hadoop
• An outgrowth of work I did prior to joining 10gen (“Luau”), which focused on Pig support for ETL

Slide 6

Introducing the MongoDB Hadoop Connector

Slide 7

Introducing the MongoDB Hadoop Connector
• What have we been doing for 18 months?

Slide 8

Introducing the MongoDB Hadoop Connector
• What have we been doing for 18 months?
• Building a polished, reliable product which provides real benefit to our users

Slide 9

Introducing the MongoDB Hadoop Connector
• What have we been doing for 18 months?
• Building a polished, reliable product which provides real benefit to our users
• Improving and testing feature sets

Slide 10

Introducing the MongoDB Hadoop Connector
• What have we been doing for 18 months?
• Building a polished, reliable product which provides real benefit to our users
• Improving and testing feature sets
• Enhancing integration between the MongoDB Server + Hadoop Connector for maximum performance

Slide 11

Introducing the MongoDB Hadoop Connector

Slide 12

Introducing the MongoDB Hadoop Connector
• Today, we are releasing v1.0.0 of this Integration: The MongoDB Hadoop Connector

Slide 13

Introducing the MongoDB Hadoop Connector
• Today, we are releasing v1.0.0 of this Integration: The MongoDB Hadoop Connector
• Read/Write between MongoDB + Hadoop (Core MapReduce) in Java

Slide 14

Introducing the MongoDB Hadoop Connector
• Today, we are releasing v1.0.0 of this Integration: The MongoDB Hadoop Connector
• Read/Write between MongoDB + Hadoop (Core MapReduce) in Java
• Write Pig (ETL) jobs’ output to MongoDB

Slide 15

Introducing the MongoDB Hadoop Connector
• Today, we are releasing v1.0.0 of this Integration: The MongoDB Hadoop Connector
• Read/Write between MongoDB + Hadoop (Core MapReduce) in Java
• Write Pig (ETL) jobs’ output to MongoDB
• Write MapReduce jobs in Python via Hadoop Streaming

Slide 16

Introducing the MongoDB Hadoop Connector
• Today, we are releasing v1.0.0 of this Integration: The MongoDB Hadoop Connector
• Read/Write between MongoDB + Hadoop (Core MapReduce) in Java
• Write Pig (ETL) jobs’ output to MongoDB
• Write MapReduce jobs in Python via Hadoop Streaming
• Collect massive amounts of logging output into MongoDB via Flume

Slide 17

Community Contributions are Key

Slide 18

Community Contributions are Key
• Lots of effort from the community to make this project come together

Slide 19

Community Contributions are Key
• Lots of effort from the community to make this project come together
• Max Afonov (@max4f) helped conceive and build the original Luau

Slide 20

Community Contributions are Key
• Lots of effort from the community to make this project come together
• Max Afonov (@max4f) helped conceive and build the original Luau
• Evan Korth (@evankorth) led a New York University projects class which built the initial input split support

Slide 21

Community Contributions are Key
• Lots of effort from the community to make this project come together
• Max Afonov (@max4f) helped conceive and build the original Luau
• Evan Korth (@evankorth) led a New York University projects class which built the initial input split support
• Joseph Shraibman, Sumin Xia, Priya Manda, and Rushin Shah worked on this feature

Slide 22

Community Contributions are Key
• Lots of effort from the community to make this project come together
• Max Afonov (@max4f) helped conceive and build the original Luau
• Evan Korth (@evankorth) led a New York University projects class which built the initial input split support
• Joseph Shraibman, Sumin Xia, Priya Manda, and Rushin Shah worked on this feature
• Russell Jurney (@rjurney) has done a lot of heavy lifting on improving Pig

Slide 23

Why Integrate MongoDB + Hadoop?

Slide 24

Separation of Concerns

Slide 25

Separation of Concerns
• Data storage and data processing are often separate concerns

Slide 26

Separation of Concerns
• Data storage and data processing are often separate concerns
• MongoDB has limited ability to aggregate and process large datasets (limited JavaScript parallelism, alleviated somewhat by the new Aggregation Framework)

Slide 27

Separation of Concerns
• Data storage and data processing are often separate concerns
• MongoDB has limited ability to aggregate and process large datasets (limited JavaScript parallelism, alleviated somewhat by the new Aggregation Framework)
• Hadoop is built for scalable processing of large datasets
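
For context on the new Aggregation Framework mentioned above, a minimal pymongo sketch of an in-database group-and-count, assuming pymongo 3.x (where aggregate() returns a cursor); the host, database, collection, and field names are hypothetical. Rollups like this are what MongoDB handles well on its own; heavier, multi-stage processing over large datasets is where Hadoop fits.

from pymongo import MongoClient

# Hypothetical names throughout; an illustration, not connector code.
client = MongoClient("mongodb://localhost:27017")
events = client["demo_db"]["events"]

# Count documents per "type" field, most common first.
pipeline = [
    {"$group": {"_id": "$type", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for row in events.aggregate(pipeline):
    print(row["_id"], row["count"])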

Slide 28

The Right Tool for the Job

Slide 29

The Right Tool for the Job
• JavaScript isn’t an ideal language for many types of calculations

Slide 30

The Right Tool for the Job
• JavaScript isn’t an ideal language for many types of calculations
• Slow

Slide 31

The Right Tool for the Job
• JavaScript isn’t an ideal language for many types of calculations
• Slow
• Limited datatypes

Slide 32

The Right Tool for the Job
• JavaScript isn’t an ideal language for many types of calculations
• Slow
• Limited datatypes
• No access to complex analytics libraries available on the JVM

Slide 33

The Right Tool for the Job
• JavaScript isn’t an ideal language for many types of calculations
• Slow
• Limited datatypes
• No access to complex analytics libraries available on the JVM
• Rich, powerful ecosystem

Slide 34

The Right Tool for the Job
• JavaScript isn’t an ideal language for many types of calculations
• Slow
• Limited datatypes
• No access to complex analytics libraries available on the JVM
• Rich, powerful ecosystem
• Hadoop has machine learning, ETL, and many other tools which are much more flexible than the processing tools in MongoDB

Slide 35

Being a Good Neighbor

Slide 36

Being a Good Neighbor
• Integration with Customers’ Existing Stacks & Toolchains is Crucial

Slide 37

Being a Good Neighbor
• Integration with Customers’ Existing Stacks & Toolchains is Crucial
• Many users & customers already have Hadoop in their stacks

Slide 38

Being a Good Neighbor
• Integration with Customers’ Existing Stacks & Toolchains is Crucial
• Many users & customers already have Hadoop in their stacks
• They want us to “play nicely” with their existing toolchains

Slide 39

Being a Good Neighbor
• Integration with Customers’ Existing Stacks & Toolchains is Crucial
• Many users & customers already have Hadoop in their stacks
• They want us to “play nicely” with their existing toolchains
• Different groups in companies may mandate all data be processable in Hadoop

Slide 40

Capabilities

Slide 41

Hadoop Connector Capabilities

Slide 42

Hadoop Connector Capabilities
• Split large datasets into smaller chunks (“Input Splits”) for parallel Hadoop processing

Slide 43

Hadoop Connector Capabilities
• Split large datasets into smaller chunks (“Input Splits”) for parallel Hadoop processing
• Without splits, only one mapper can run

Slide 44

Hadoop Connector Capabilities
• Split large datasets into smaller chunks (“Input Splits”) for parallel Hadoop processing
• Without splits, only one mapper can run
• Connector can split both sharded & unsharded collections

Slide 45

Hadoop Connector Capabilities
• Split large datasets into smaller chunks (“Input Splits”) for parallel Hadoop processing
• Without splits, only one mapper can run
• Connector can split both sharded & unsharded collections
• Sharded: Read individual chunks from config server into Hadoop

Slide 46

Hadoop Connector Capabilities
• Split large datasets into smaller chunks (“Input Splits”) for parallel Hadoop processing
• Without splits, only one mapper can run
• Connector can split both sharded & unsharded collections
• Sharded: Read individual chunks from config server into Hadoop
• Unsharded: Create splits, similar to how sharding chunks are calculated
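
A rough sketch of the idea behind split creation on an unsharded collection: MongoDB has an internal splitVector command that returns split points for a namespace, and the connector does something along these lines to build its input splits. The command name and options here are recalled from memory, so treat them as assumptions and check the MongoDB and connector docs; the host, database, and collection names are hypothetical, and the command must run against a mongod.

from pymongo import MongoClient

# Hypothetical host/db/collection; illustration only, not connector code.
client = MongoClient("mongodb://localhost:27017")
db = client["enron_mail"]

result = db.command(
    "splitVector", "enron_mail.messages",   # namespace to compute split points for
    keyPattern={"_id": 1},                  # compute split points over _id
    maxChunkSize=64,                        # target split size in MB
)

# Turn N split keys into N+1 (lower, upper) ranges; None means an open end.
bounds = [None] + result["splitKeys"] + [None]
for lower, upper in zip(bounds, bounds[1:]):
    print("split range:", lower, "->", upper)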

Slide 47

Parallel Processing of Splits

Slide 48

Parallel Processing of Splits
• Ship “Splits” to Mappers as hostname, database, collection, & query

Slide 49

Parallel Processing of Splits
• Ship “Splits” to Mappers as hostname, database, collection, & query
• Each Mapper reads in the relevant documents

Slide 50

Parallel Processing of Splits
• Ship “Splits” to Mappers as hostname, database, collection, & query
• Each Mapper reads in the relevant documents
• Parallel processing for high performance

Slide 51

Parallel Processing of Splits
• Ship “Splits” to Mappers as hostname, database, collection, & query
• Each Mapper reads in the relevant documents
• Parallel processing for high performance
• Speaks BSON between all layers!
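
To make the slides above concrete, a minimal pymongo sketch of what one mapper's share of the work looks like once its split arrives: the four fields a split is shipped as, and the read loop over just that slice of the collection. The actual connector does this in Java via its RecordReader; the host and boundary values below are hypothetical placeholders.

from bson import ObjectId
from pymongo import MongoClient

# A "split" boiled down to the four things it is shipped as.
split = {
    "host": "mongodb://shard0001.example.com:27018",   # hypothetical shard host
    "database": "enron_mail",
    "collection": "messages",
    "query": {"_id": {"$gte": ObjectId("4f2ad4c4d1e2d3f15a000000"),   # example bounds
                      "$lt": ObjectId("4f2ad4c4d1e2d3f15a00ffff")}},
}

coll = MongoClient(split["host"])[split["database"]][split["collection"]]
for doc in coll.find(split["query"]):   # only this split's documents
    pass  # a real map() would emit key/value pairs for each doc here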

Slide 52

MapReduce in MongoDB (JavaScript)
(Diagram: Data → Map(), which emits (k, v) → Sort(k) → Group(k) → Reduce(k, values) → (k, v) → Finalize(k, v) → (k, v) → MongoDB. The map function iterates over documents one at a time per shard, with the current document as $this; Reduce’s input matches its output, so it can run multiple times.)

Slide 53

Hadoop MapReduce w/ MongoDB
(Diagram: an InputFormat creates a list of Input Splits from MongoDB, single server or sharded cluster, with splits the same as Mongo's shard chunks (64 MB). For each split, a MongoDB RecordReader runs on the same thread as Map(k1, v1, ctx), which calls ctx.write(k2, v2); many map operations run, one at a time per input split. Each split’s output passes through Combiner(k2, values2) → (k2, v3), is sorted by Sort(k2), and is routed by Partitioner(k2) to the Reducer threads, where Reduce(k2, values3) → (kf, vf) runs once per key and the Output Format writes results back to MongoDB.)
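
To make the diagram concrete, a tiny pure-Python simulation of the flow it shows: each split is mapped and locally combined, keys are partitioned to reducer threads, and each reducer sorts and reduces its keys once per key. This is not connector or Hadoop API code, only an illustration of the data flow.

from collections import defaultdict

splits = [                                        # stand-ins for MongoDB input splits
    [{"from": "a"}, {"from": "b"}, {"from": "a"}],
    [{"from": "b"}, {"from": "c"}],
]
NUM_REDUCERS = 2

def map_fn(doc):
    yield doc["from"], 1                          # ctx.write(k2, v2)

def combine(pairs):                               # pre-aggregate within one split
    acc = defaultdict(int)
    for k, v in pairs:
        acc[k] += v
    return acc.items()

def partition(key):                               # Partitioner(k2)
    return hash(key) % NUM_REDUCERS

def reduce_fn(key, values):                       # Reduce(k2, values3) -> (kf, vf)
    return key, sum(values)

reducer_inputs = [defaultdict(list) for _ in range(NUM_REDUCERS)]
for split in splits:
    mapped = [kv for doc in split for kv in map_fn(doc)]
    for k, v in combine(mapped):
        reducer_inputs[partition(k)][k].append(v)

for r, grouped in enumerate(reducer_inputs):
    for k in sorted(grouped):                     # Sort(k2) before reducing
        print("reducer", r, reduce_fn(k, grouped[k]))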

Slide 54

MongoDB Hadoop Connector In Action

Slide 55

Python Streaming

Slide 56

Python Streaming
• The Hadoop Streaming interface is much easier to demo (it’s also my favorite feature, and was the hardest to implement)

Slide 57

Python Streaming
• The Hadoop Streaming interface is much easier to demo (it’s also my favorite feature, and was the hardest to implement)
• Java gets a bit ... “verbose” on slides versus Python

Slide 58

Python Streaming
• The Hadoop Streaming interface is much easier to demo (it’s also my favorite feature, and was the hardest to implement)
• Java gets a bit ... “verbose” on slides versus Python
• Processing 1.75 gigabytes of the Enron Email Corpus (501,513 emails)

Slide 59

Python Streaming
• The Hadoop Streaming interface is much easier to demo (it’s also my favorite feature, and was the hardest to implement)
• Java gets a bit ... “verbose” on slides versus Python
• Processing 1.75 gigabytes of the Enron Email Corpus (501,513 emails)
• I ran this test on a 6 node Hadoop cluster

Slide 60

Python Streaming
• The Hadoop Streaming interface is much easier to demo (it’s also my favorite feature, and was the hardest to implement)
• Java gets a bit ... “verbose” on slides versus Python
• Processing 1.75 gigabytes of the Enron Email Corpus (501,513 emails)
• I ran this test on a 6 node Hadoop cluster
• Grab your own copy of this dataset at: http://goo.gl/fSleC

Slide 61

A Sample Input Doc
{
    "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
    "body" : "Here is our forecast\n\n ",
    "subFolder" : "allen-p/_sent_mail",
    "mailbox" : "maildir",
    "filename" : "1.",
    "headers" : {
        "X-cc" : "",
        "From" : "[email protected]",
        "Subject" : "",
        "X-Folder" : "\\Phillip_Allen_Jan2002_1\\Allen, Phillip K.\\'Sent Mail",
        "Content-Transfer-Encoding" : "7bit",
        "X-bcc" : "",
        "To" : "[email protected]",
        "X-Origin" : "Allen-P",
        "X-FileName" : "pallen (Non-Privileged).pst",
        "X-From" : "Phillip K Allen",
        "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
        "X-To" : "Tim Belden ",
        "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
        "Content-Type" : "text/plain; charset=us-ascii",
        "Mime-Version" : "1.0"
    }
}

Slide 62

Setting up Hadoop Streaming
• Install the Python support module on each Hadoop Node:
    $ sudo pip install pymongo_hadoop
• Build (or download) the Streaming module for the Hadoop adapter:
    $ git clone http://github.com/mongodb/mongo-hadoop.git
    $ ./sbt mongo-hadoop-streaming/assembly

Slide 63

Mapper Code
#!/usr/bin/env python

import sys
sys.path.append(".")

from pymongo_hadoop import BSONMapper

def mapper(documents):
    i = 0
    for doc in documents:
        i = i + 1
        if 'headers' in doc and 'To' in doc['headers'] and 'From' in doc['headers']:
            from_field = doc['headers']['From']
            to_field = doc['headers']['To']
            recips = [x.strip() for x in to_field.split(',')]
            for r in recips:
                yield {'_id': {'f': from_field, 't': r}, 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."

Slide 64

Reducer Code
#!/usr/bin/env python

import sys
sys.path.append(".")

from pymongo_hadoop import BSONReducer

def reducer(key, values):
    print >> sys.stderr, "Processing from/to %s" % str(key)
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'count': _count}

BSONReducer(reducer)

Slide 65

Running the MapReduce
hadoop jar mongo-hadoop-streaming-assembly-1.0.0.jar \
    -mapper /home/ec2-user/enron_map.py \
    -reducer /home/ec2-user/enron_reduce.py \
    -inputURI mongodb://test_mongodb:27020/enron_mail.messages \
    -outputURI mongodb://test_mongodb:27020/enron_mail.sender_map
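
As an alternative to the mongo shell on the next slide, a minimal pymongo sketch for inspecting the job's output from Python, assuming the output collection named in the -outputURI above and a reachable mongos at test_mongodb:27020; adjust the names to match your own setup.

from pymongo import MongoClient

# Pull the top sender/recipient pairs out of the job's output collection.
coll = MongoClient("mongodb://test_mongodb:27020")["enron_mail"]["sender_map"]
for doc in coll.find().sort("count", -1).limit(10):
    pair, count = doc["_id"], doc["count"]
    print("%s -> %s: %d" % (pair["f"], pair["t"], count))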

Slide 66

Results!
mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 4 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 4 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 6 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
has more

Slide 67

Parallelism is Good
The Input Data was split into 44 pieces for parallel processing...

Slide 68

Parallelism is Good
The Input Data was split into 44 pieces for parallel processing...
... coincidentally, there were exactly 44 chunks on my sharded setup.

Slide 69

Parallelism is Good
The Input Data was split into 44 pieces for parallel processing...
... coincidentally, there were exactly 44 chunks on my sharded setup.

Slide 70

Parallelism is Good
The Input Data was split into 44 pieces for parallel processing...
... coincidentally, there were exactly 44 chunks on my sharded setup.
Even with an unsharded collection, MongoHadoop can calculate splits!

Slide 71

This is just the beginning...

Slide 72

Looking Forward
• Mongo Hadoop Connector 1.0.0 is released and available as of today
• Docs: http://api.mongodb.org/hadoop/
• Downloads & Code: http://github.com/mongodb/mongo-hadoop

Slide 73

Looking Forward
• Lots More Coming; 1.1.0 expected in May 2012
• Support for reading from Multiple Input Collections (“MultiMongo”)
• Static BSON Support... Read from and Write to Mongo Backup files!
• S3 / HDFS stored, mongodump format
• Great for big offline batch jobs (this is how Foursquare does it)
• Pig input (Read from MongoDB into Pig)
• Ruby support in Streaming
• Performance improvements (e.g. pipelining BSON for streaming)
• Future: Expanded Ecosystem support (Cascading, Oozie, Mahout, etc.)

Slide 74

Looking Forward
• We are committed to growing our integration with Big Data
• Not only Hadoop, but other data processing systems our users want, such as Storm, Disco, and Spark
• Initial Disco support is almost complete; look for it this summer
• If you have other data processing toolchains you’d like to see integration with, let us know!

Slide 75

conferences, appearances, and meetups: http://www.10gen.com/events
http://linkd.in/joinmongo
@mongodb
http://bit.ly/mongofb

Did I Mention We’re Hiring? http://www.10gen.com/careers (Jobs of all sorts, all over the world!)

Contact Me: [email protected] (twitter: @rit)