
Taming the Elephant In the Room with the MongoDB Hadoop Connector

Mongo Philly 2012

Brendan McAdams

April 09, 2012

Transcript

  1. Brendan McAdams, 10gen, Inc. | [email protected] | @rit
     Taming The Elephant In The Room with MongoDB + Hadoop Integration

  2. Introducing the MongoDB Hadoop Connector
     • For the past year and a half, 10gen has been exploring ways to integrate MongoDB + Hadoop
     • Outgrowth from work I did prior to joining 10gen: "Luau", focused on Pig support for ETL

  3. Introducing the MongoDB Hadoop Connector
     • What have we been doing for 18 months?
     • Building a polished, reliable product which provides real benefit to our users
     • Improving and testing feature sets
     • Enhancing integration between the MongoDB server and the Hadoop Connector for maximum performance

  4. Introducing the MongoDB Hadoop Connector
     • Today, we are releasing v1.0.0 of this integration: the MongoDB Hadoop Connector
     • Read/write between MongoDB + Hadoop (core MapReduce) in Java
     • Write Pig (ETL) jobs' output to MongoDB
     • Write MapReduce jobs in Python via Hadoop Streaming
     • Collect massive amounts of logging output into MongoDB via Flume

  5. Community Contributions are Key
     • Lots of effort from the community to make this project come together
     • Max Afonov (@max4f) helped conceive and build the original Luau
     • Evan Korth (@evankorth) led a New York University projects class which built the initial input split support
     • Joseph Shraibman, Sumin Xia, Priya Manda, and Rushin Shah worked on this feature
     • Russell Jurney (@rjurney) has done a lot of heavy lifting on improving the Pig integration

  6. Separation of Concerns
     • Data storage and data processing are often separate concerns
     • MongoDB has limited ability to aggregate and process large datasets (JavaScript parallelism; alleviated somewhat by the new Aggregation Framework)
     • Hadoop is built for scalable processing of large datasets

  7. The Right Tool for the Job
     • JavaScript isn't the ideal language for many types of calculations:
       • Slow
       • Limited datatypes
       • No access to the complex analytics libraries available on the JVM
     • The JVM, by contrast, offers a rich, powerful ecosystem
     • Hadoop has machine learning, ETL, and many other tools which are much more flexible than the processing tools in MongoDB

  8. Being a Good Neighbor
     • Integration with customers' existing stacks & toolchains is crucial
     • Many users & customers already have Hadoop in their stacks
     • They want us to "play nicely" with their existing toolchains
     • Different groups within a company may mandate that all data be processable in Hadoop

  9. Hadoop Connector Capabilities
     • Split large datasets into smaller chunks ("input splits") for parallel Hadoop processing
     • Without splits, only one mapper can run
     • The connector can split both sharded & unsharded collections
     • Sharded: read the individual chunks from the config server into Hadoop
     • Unsharded: create splits, similar to how sharding chunks are calculated (see the sketch below)

  10. Parallel Processing of Splits
     • Ship "splits" to mappers as hostname, database, collection, & query
     • Each mapper reads the relevant documents in (sketched below)
     • Parallel processing for high performance
     • Speaks BSON between all layers!

  11. MapReduce in MongoDB (JavaScript)
     [Diagram] Data -> map() emits (k,v) -> sort(k) -> group(k) -> reduce(k, values) -> (k,v) -> finalize(k,v) -> (k,v)
     • map() iterates over the documents inside MongoDB; the current document is `this`, one at a time per shard
     • reduce()'s input must match its output, since it can run multiple times

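     For comparison, a hedged sketch of driving MongoDB's built-in JavaScript mapReduce from pymongo (the map_reduce helper shown exists in pymongo 2.x/3.x; the database, collection, and output names are assumptions for illustration):

         # Hedged sketch of MongoDB's server-side (JavaScript) mapReduce.
         from pymongo import MongoClient
         from bson.code import Code

         db = MongoClient()["enron_mail"]   # assumed database name

         # map() runs once per document inside mongod; the document is `this`
         map_fn = Code("function() {"
                       "  if (this.headers && this.headers.From)"
                       "    emit(this.headers.From, 1);"
                       "}")

         # reduce() may run multiple times per key, so its output must look like its input
         reduce_fn = Code("function(key, values) { return Array.sum(values); }")

         db.messages.map_reduce(map_fn, reduce_fn, out="sender_counts")
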
  12. Hadoop MapReduce w/ MongoDB
     [Diagram] MongoDB (single server or sharded cluster) -> InputFormat creates a list of input splits, the same size as Mongo's shard chunks (64 MB) -> one MongoDB RecordReader per split, running on the same thread as the map -> Map(k1, v1, ctx) calls ctx.write(k2, v2); many map operations, one at a time per input split -> Combiner(k2, values2) -> (k2, v3) -> Sort(k2) -> Partitioner(k2) -> reducer threads: Reduce(k2, values3) runs once per key -> (kf, vf) -> OutputFormat writes back to MongoDB

  13. Python Streaming
     • The Hadoop Streaming interface is much easier to demo (it's also my favorite feature, and was the hardest to implement)
     • Java gets a bit ... "verbose" on slides versus Python
     • Processing 1.75 gigabytes of the Enron Email Corpus (501,513 emails)
     • I ran this test on a 6-node Hadoop cluster
     • Grab your own copy of this dataset at: http://goo.gl/fSleC

  14. A Sample Input Doc

         {
           "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
           "body" : "Here is our forecast\n\n ",
           "subFolder" : "allen-p/_sent_mail",
           "mailbox" : "maildir",
           "filename" : "1.",
           "headers" : {
             "X-cc" : "",
             "From" : "[email protected]",
             "Subject" : "",
             "X-Folder" : "\\Phillip_Allen_Jan2002_1\\Allen, Phillip K.\\'Sent Mail",
             "Content-Transfer-Encoding" : "7bit",
             "X-bcc" : "",
             "To" : "[email protected]",
             "X-Origin" : "Allen-P",
             "X-FileName" : "pallen (Non-Privileged).pst",
             "X-From" : "Phillip K Allen",
             "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
             "X-To" : "Tim Belden ",
             "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
             "Content-Type" : "text/plain; charset=us-ascii",
             "Mime-Version" : "1.0"
           }
         }

  15. Setting up Hadoop Streaming
     • Install the Python support module on each Hadoop node:

         $ sudo pip install pymongo_hadoop

     • Build (or download) the Streaming module for the Hadoop adapter:

         $ git clone http://github.com/mongodb/mongo-hadoop.git
         $ ./sbt mongo-hadoop-streaming/assembly

  16. Mapper Code (enron_map.py)

         #!/usr/bin/env python
         import sys
         sys.path.append(".")

         from pymongo_hadoop import BSONMapper

         def mapper(documents):
             i = 0
             for doc in documents:
                 i = i + 1
                 if 'headers' in doc and 'To' in doc['headers'] and 'From' in doc['headers']:
                     from_field = doc['headers']['From']
                     to_field = doc['headers']['To']
                     recips = [x.strip() for x in to_field.split(',')]
                     # emit one {from, to} pair per recipient
                     for r in recips:
                         yield {'_id': {'f': from_field, 't': r}, 'count': 1}

         BSONMapper(mapper)
         print >> sys.stderr, "Done Mapping."

  17. Reducer Code (enron_reduce.py)

         #!/usr/bin/env python
         import sys
         sys.path.append(".")

         from pymongo_hadoop import BSONReducer

         def reducer(key, values):
             print >> sys.stderr, "Processing from/to %s" % str(key)
             _count = 0
             for v in values:
                 _count += v['count']
             return {'_id': key, 'count': _count}

         BSONReducer(reducer)

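     A quick, purely hypothetical way to sanity-check the two functions above without a Hadoop cluster: paste the mapper() and reducer() definitions into a Python 2 session (skipping the BSONMapper/BSONReducer calls) and fake the grouping that Hadoop performs between the map and reduce phases. The sample documents and addresses below are made up.

         # Hypothetical local smoke test -- not part of mongo-hadoop.
         from itertools import groupby

         sample_docs = [
             {'headers': {'From': 'a@example.com', 'To': 'b@example.com, c@example.com'}},
             {'headers': {'From': 'a@example.com', 'To': 'b@example.com'}},
         ]

         # Sort the mapper's output by key, then group it, as Hadoop's shuffle would.
         pairs = sorted(mapper(sample_docs), key=lambda p: sorted(p['_id'].items()))
         for key, group in groupby(pairs, key=lambda p: p['_id']):
             print(reducer(key, list(group)))
         # expected: {'_id': {'f': 'a@example.com', 't': 'b@example.com'}, 'count': 2}
         #           {'_id': {'f': 'a@example.com', 't': 'c@example.com'}, 'count': 1}
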
  18. Running the MapReduce

         hadoop jar mongo-hadoop-streaming-assembly-1.0.0.jar \
             -mapper /home/ec2-user/enron_map.py \
             -reducer /home/ec2-user/enron_reduce.py \
             -inputURI mongodb://test_mongodb:27020/enron_mail.messages \
             -outputURI mongodb://test_mongodb:27020/enron_mail.sender_map

  19. Results!

         mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 4 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 4 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 6 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
         { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
         has more

  20. Parallelism is Good
     The input data was split into 44 pieces for parallel processing...
     ... coincidentally, there were exactly 44 chunks on my sharded setup.
     Even with an unsharded collection, MongoHadoop can calculate splits!

  21. Looking Forward
     • Mongo Hadoop Connector 1.0.0 is released and available as of today
     • Docs: http://api.mongodb.org/hadoop/
     • Downloads & Code: http://github.com/mongodb/mongo-hadoop

  22. Looking Forward
     • Lots more coming; 1.1.0 expected in May 2012
     • Support for reading from multiple input collections ("MultiMongo")
     • Static BSON support... read from and write to Mongo backup files!
       • S3 / HDFS stored, mongodump format
       • Great for big offline batch jobs (this is how Foursquare does it)
     • Pig input (read from MongoDB into Pig)
     • Ruby support in Streaming
     • Performance improvements (e.g. pipelining BSON for streaming)
     • Future: expanded ecosystem support (Cascading, Oozie, Mahout, etc.)

  23. Looking Forward
     • We are committed to growing our integration with Big Data
     • Not only Hadoop, but other data processing systems our users want, such as Storm, Disco, and Spark
     • Initial Disco support is almost complete; look for it this summer
     • If you have other data processing toolchains you'd like to see integration with, let us know!

  24. Conferences, appearances, and meetups: http://www.10gen.com/events
     http://linkd.in/joinmongo | @mongodb | http://bit.ly/mongofb
     Did I mention we're hiring? http://www.10gen.com/careers (jobs of all sorts, all over the world!)
     Contact me: [email protected] (twitter: @rit)