
MongoNYC2012: MongoDB and Hadoop

mongodb
May 29, 2012

MongoNYC2012: MongoDB and Hadoop, Brendan McAdams, 10gen. Learn how to integrate MongoDB with Hadoop for large-scale distributed data processing. Using Hadoop's MapReduce and Streaming, you will learn how to do analytics and ETL on large datasets, with the ability to load and save data against MongoDB. Support for Hadoop Streaming goes beyond native Java, enabling MapReduce jobs to be written in languages like Python and Ruby.


Transcript

  1. Big Data at a Glance • Big Data can be

    gigabytes, terabytes, petabytes or exabytes • An ideal big data system scales up and down around various data sizes – while providing a uniform view • Major concerns • Can I read & write this data efficiently at different scale? • Can I run calculations on large portions of this data? Large Dataset Primary Key as “username”
  2. Big Data at a Glance • Systems like Google File

    System (which inspired Hadoop’s HDFS) and MongoDB’s Sharding handle the scale problem by chunking • Break up pieces of data into smaller chunks, spread across many data nodes • Each data node contains many chunks • If a chunk gets too large or a node overloaded, data can be rebalanced Large Dataset Primary Key as “username” a b c d e f g h s t u v w x y z ...
  3. Big Data at a Glance Large Dataset Primary Key as

    “username” a b c d e f g h s t u v w x y z
  4. Big Data at a Glance Large Dataset Primary Key as

    “username” a b c d e f g h s t u v w x y z MongoDB Sharding (as well as HDFS) breaks data into chunks (~64 MB)
  5. Large Dataset Primary Key as “username” Scaling Data Node 1

    25% of chunks Data Node 2 25% of chunks Data Node 3 25% of chunks Data Node 4 25% of chunks a b c d e f g h s t u v w x y z Representing data as chunks allows many levels of scale across n data nodes
  6. Scaling Data Node 1 Data Node 2 Data Node 3

    Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z The set of chunks can be evenly distributed across n data nodes
  7. Add Nodes: Chunk Rebalancing Data Node 1 Data Node 2

    Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z The goal is equilibrium - an equal distribution. As nodes are added (or even removed) chunks can be redistributed for balance.
  8. Writes Routed to Appropriate Chunk Data Node 1 Data Node

    2 Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z
  9. Writes Routed to Appropriate Chunk Data Node 1 Data Node

    2 Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z Write to key “ziggy” z Writes are efficiently routed to the appropriate node & chunk
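    To make the routing concrete, here is a minimal sketch of range-based lookup over sorted chunk boundaries. It is an illustration only: the chunk ranges and node names are invented, and this is not how mongos is actually implemented.

    # Hypothetical chunk table: each chunk covers [lower, upper) of the shard key
    # ("username") and lives on one data node.
    import bisect

    chunks = [
        {"lower": "a", "upper": "n", "node": "Data Node 1"},
        {"lower": "n", "upper": "t", "node": "Data Node 2"},
        {"lower": "t", "upper": "{", "node": "Data Node 3"},   # "{" sorts just after "z"
    ]
    lower_bounds = [c["lower"] for c in chunks]

    def route(key):
        """Find the chunk whose [lower, upper) range contains the key."""
        i = bisect.bisect_right(lower_bounds, key) - 1
        chunk = chunks[i]
        assert chunk["lower"] <= key < chunk["upper"]
        return chunk

    print(route("ziggy")["node"])   # the write for key "ziggy" lands on Data Node 3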
  10. Chunk Splitting & Balancing Data Node 1 Data Node 2

    Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z Write to key “ziggy” z If a chunk gets too large (default in MongoDB - 64 MB per chunk), it is split into two new chunks
  11. Chunk Splitting & Balancing Data Node 1 Data Node 2

    Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z z If a chunk gets too large (default in MongoDB - 64 MB per chunk), it is split into two new chunks
  12. Chunk Splitting & Balancing Data Node 1 Data Node 2

    Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z z If a chunk gets too large (default in MongoDB - 64 MB per chunk), it is split into two new chunks
  13. Chunk Splitting & Balancing Data Node 1 Data Node 2

    Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z z2 If a chunk gets too large (default in MongoDB - 64 MB per chunk), it is split into two new chunks z1
  14. Chunk Splitting & Balancing Data Node 1 Data Node 2

    Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z z2 If a chunk gets too large (default in MongoDB - 64 MB per chunk), it is split into two new chunks z1
  15. Chunk Splitting & Balancing Data Node 1 Data Node 2

    Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z z2 z1 Each new part of the Z chunk (left & right) now contains half of the keys
  16. Chunk Splitting & Balancing Data Node 1 Data Node 2

    Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z z2 z1 As chunks continue to grow and split, they can be rebalanced to keep an equal share of data on each server.
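    The split rule itself is simple enough to sketch. The snippet below is an illustration under assumptions, not MongoDB's implementation: real splitting picks split points from shard-key statistics rather than an in-memory key list, and the chunk structure here is made up.

    # MongoDB's default maximum chunk size is 64 MB.
    MAX_CHUNK_BYTES = 64 * 1024 * 1024

    def maybe_split(chunk):
        """If a chunk is over the size threshold, split it at its median key ("z" -> "z1"/"z2")."""
        if chunk["size_bytes"] <= MAX_CHUNK_BYTES:
            return [chunk]
        keys = sorted(chunk["keys"])
        median = keys[len(keys) // 2]
        left = {"lower": chunk["lower"], "upper": median,
                "keys": [k for k in keys if k < median],
                "size_bytes": chunk["size_bytes"] // 2}
        right = {"lower": median, "upper": chunk["upper"],
                 "keys": [k for k in keys if k >= median],
                 "size_bytes": chunk["size_bytes"] // 2}
        return [left, right]   # each new half holds roughly half of the keys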
  17. Reads with Key Routed Efficiently Data Node 1 Data Node

    2 Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z1 Read Key “xavier” Reading a single value by Primary Key Read routed efficiently to specific chunk containing key z2
  18. Reads with Key Routed Efficiently Data Node 1 Data Node

    2 Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y Read Key “xavier” Reading a single value by Primary Key Read routed efficiently to specific chunk containing key z1 z2
  19. Reads with Key Routed Efficiently Data Node 1 Data Node

    2 Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y Read Keys “T”->”X” Reading multiple values by Primary Key Reads routed efficiently to specific chunks in range t u v w x z1 z2
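    Range reads follow the same idea: only the chunks whose ranges overlap the requested key range need to be touched. A minimal sketch, again with an invented chunk table:

    # Hypothetical chunk table over the shard key "username".
    chunks = [
        {"lower": "a", "upper": "i", "node": "Data Node 1"},
        {"lower": "i", "upper": "s", "node": "Data Node 2"},
        {"lower": "s", "upper": "{", "node": "Data Node 3"},
    ]

    def route_range(start, end):
        """Return only the chunks whose [lower, upper) range overlaps [start, end]."""
        return [c for c in chunks if c["lower"] <= end and c["upper"] > start]

    for c in route_range("t", "x"):
        print(c["node"])   # only the node holding the "s".."{" chunk serves keys "t".."x"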
  20. Processing Scalable Big Data •Just as we must be able

    to scale our storage of data (from gigabytes through exabytes and beyond), we must be able to process it. • We had two questions, one of which we’ve answered... • Can I read & write this data efficiently at different scale? • Can I run calculations on large portions of this data?
  21. Processing Scalable Big Data •Just as we must be able

    to scale our storage of data (from gigabytes through exabytes and beyond), we must be able to process it. • We had two questions, one of which we’ve answered... • Can I read & write this data efficiently at different scale? • Can I run calculations on large portions of this data?
  22. Processing Scalable Big Data • The answer to calculating big

    data is much the same as storing it • We need to break our data into bite sized pieces • Build functions which can be composed together repeatedly on partitions of our data • Process portions of the data across multiple calculation nodes • Aggregate the results into a final set of results
  23. Processing Scalable Big Data • These pieces are not chunks

    – rather, the individual data points that make up each chunk • Chunks make useful data transfer units for processing as well • Transfer chunks as “Input Splits” to calculation nodes, allowing for scalable parallel processing • The most common application of these techniques is MapReduce • Based on a Google whitepaper, it works with two primary functions – map and reduce – to calculate against large datasets
  24. MapReduce to Calculate Big Data • MapReduce is designed to

    effectively process data at varying scales • Composable function units can be reused repeatedly for scaled results • MongoDB supports MapReduce with JavaScript, with limitations on its scalability • In addition to the HDFS storage component, Hadoop is built around MapReduce for calculation • MongoDB can be integrated with Hadoop to MapReduce its data • No HDFS storage needed - data moves directly between MongoDB and Hadoop’s MapReduce engine
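    For reference, MongoDB's built-in JavaScript MapReduce can be driven from a client roughly as below. This is a sketch: the database, collection, and output names are assumptions, and driver versions differ in how they expose map_reduce.

    from pymongo import MongoClient
    from bson.code import Code

    db = MongoClient()["enron_mail"]        # assumed database name

    map_js = Code("""
    function () {
        // emit one count per message received by this recipient
        emit(this.headers.To, { count: 1 });
    }
    """)

    reduce_js = Code("""
    function (key, values) {
        var total = 0;
        values.forEach(function (v) { total += v.count; });
        return { count: total };
    }
    """)

    # The JavaScript runs server-side, one thread per mongod -- the scalability
    # limitation noted above. Results land in the "recipient_counts" collection.
    db.messages.map_reduce(map_js, reduce_js, "recipient_counts")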
  25. MapReduce to Calculate Big Data • MapReduce is made up of

    a series of phases, the primary of which are • Map • Shuffle • Reduce • Let’s look at a typical MapReduce job • Email records • Count # of times a particular user has received email
  26. MapReducing Email

    to: tyler   from: brendan  subject: Ruby Support
    to: brendan from: tyler    subject: Re: Ruby Support
    to: mike    from: brendan  subject: Node Support
    to: brendan from: mike     subject: Re: Node Support
    to: mike    from: tyler    subject: COBOL Support
    to: tyler   from: mike     subject: Re: COBOL Support (WTF?)
  27. Map Step

    The map function calls emit(k, v), breaking each document into a key (grouping) & value:

    to: tyler   from: brendan  subject: Ruby Support             ->  key: tyler,   value: {count: 1}
    to: brendan from: tyler    subject: Re: Ruby Support         ->  key: brendan, value: {count: 1}
    to: mike    from: brendan  subject: Node Support             ->  key: mike,    value: {count: 1}
    to: brendan from: mike     subject: Re: Node Support         ->  key: brendan, value: {count: 1}
    to: mike    from: tyler    subject: COBOL Support            ->  key: mike,    value: {count: 1}
    to: tyler   from: mike     subject: Re: COBOL Support (WTF?) ->  key: tyler,   value: {count: 1}
  28. Group/Shuffle Step

    key: tyler   value: {count: 1}
    key: brendan value: {count: 1}
    key: mike    value: {count: 1}
    key: brendan value: {count: 1}
    key: mike    value: {count: 1}
    key: tyler   value: {count: 1}

    Group like keys together, creating an array of their distinct values (automatically done by M/R frameworks)
  29. Group/Shuffle Step

    key: brendan values: [{count: 1}, {count: 1}]
    key: mike    values: [{count: 1}, {count: 1}]
    key: tyler   values: [{count: 1}, {count: 1}]

    Group like keys together, creating an array of their distinct values (automatically done by M/R frameworks)
  30. Reduce Step

    key: brendan values: [{count: 1}, {count: 1}]
    key: mike    values: [{count: 1}, {count: 1}]
    key: tyler   values: [{count: 1}, {count: 1}]

    For each key, the reduce function flattens the list of values to a single result (aggregate the values, return the result):

    key: tyler   value: {count: 2}
    key: mike    value: {count: 2}
    key: brendan value: {count: 2}
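    To tie the three phases together, here is the same email count as a minimal, self-contained Python sketch (illustration only; it uses the sample records above rather than any MapReduce framework):

    from collections import defaultdict

    emails = [
        {"to": "tyler",   "from": "brendan", "subject": "Ruby Support"},
        {"to": "brendan", "from": "tyler",   "subject": "Re: Ruby Support"},
        {"to": "mike",    "from": "brendan", "subject": "Node Support"},
        {"to": "brendan", "from": "mike",    "subject": "Re: Node Support"},
        {"to": "mike",    "from": "tyler",   "subject": "COBOL Support"},
        {"to": "tyler",   "from": "mike",    "subject": "Re: COBOL Support (WTF?)"},
    ]

    def map_fn(doc):
        # Map: emit (key, value) -- key is the recipient, value is a count of 1.
        yield doc["to"], {"count": 1}

    def reduce_fn(key, values):
        # Reduce: flatten one key's list of values into a single result.
        return {"count": sum(v["count"] for v in values)}

    # Shuffle: group like keys together (a real M/R framework does this for you).
    grouped = defaultdict(list)
    for doc in emails:
        for key, value in map_fn(doc):
            grouped[key].append(value)

    results = {key: reduce_fn(key, values) for key, values in grouped.items()}
    print(results)   # every recipient ends up with {'count': 2}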
  31. Processing Scalable Big Data •MapReduce provides an effective system for

    calculating and processing our large datasets (from gigabytes through exabytes and beyond) • MapReduce is supported in many places including MongoDB & Hadoop • We have effective answers for both of our concerns. • Can I read & write this data efficiently at different scale? • Can I run calculations on large portions of this data?
  32. Processing Scalable Big Data •MapReduce provides an effective system for

    calculating and processing our large datasets (from gigabytes through exabytes and beyond) • MapReduce is supported in many places including MongoDB & Hadoop • We have effective answers for both of our concerns. • Can I read & write this data efficiently at different scale? • Can I run calculations on large portions of this data?
  33. Separation of Concerns •Data storage and data processing are often

    separate concerns •MongoDB has limited ability to aggregate and process large datasets (limited JavaScript parallelism, alleviated somewhat by the new Aggregation Framework) •Hadoop is built for scalable processing of large datasets
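    As a point of comparison, the Aggregation Framework mentioned above can express the same recipient count declaratively. A sketch via pymongo, with assumed database and collection names (older drivers return the results under a "result" key rather than a cursor):

    from pymongo import MongoClient

    db = MongoClient()["enron_mail"]

    pipeline = [
        {"$group": {"_id": "$headers.To", "count": {"$sum": 1}}},   # group by recipient
        {"$sort": {"count": -1}},                                   # most-emailed first
    ]

    for doc in db.messages.aggregate(pipeline):
        print(doc)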
  34. MapReducing in MongoDB - Single Server Large Dataset (single mongod)

    Primary Key as “username” Only one MapReduce thread available
  35. MapReducing in MongoDB - Sharding One MapReduce thread per shard

    (no per-chunk parallelism) Data Node 1 Data Node 2 Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z z ... Architecturally, the number of processing nodes is limited to our number of data storage nodes.
  36. The Right Tool for the Job •JavaScript isn’t always the

    ideal language for many types of calculations •Slow •Limited datatypes •No access to the complex analytics libraries available on the JVM •The JVM, by contrast, offers a rich, powerful ecosystem •Hadoop has machine learning, ETL, and many other tools which are much more flexible than the processing tools in MongoDB
  37. Being a Good Neighbor •Integration with Customers’ Existing Stacks &

    Toolchains is Crucial •Many users & customers already have Hadoop in their stacks •They want us to “play nicely” with their existing toolchains •Different groups in companies may mandate all data be processable in Hadoop
  38. Introducing the MongoDB Hadoop Connector •Recently, we released v1.0.0 of

    this Integration: The MongoDB Hadoop Connector • Read/Write between MongoDB + Hadoop (Core MapReduce) in Java •Write Pig (ETL) jobs’ output to MongoDB •Write MapReduce jobs in Python via Hadoop Streaming •Collect massive amounts of Logging output into MongoDB via Flume
  39. Hadoop Connector Capabilities •Split large datasets into smaller chunks (“Input

    Splits”) for parallel Hadoop processing •Without splits, only one mapper can run •Connector can split both sharded & unsharded collections •Sharded: Read individual chunks from config server into Hadoop •Unsharded: Create splits, similar to how sharding chunks are calculated
  40. MapReducing MongoDB + Hadoop - Single Server Large Dataset (single

    mongod) Primary Key as “username” a b c d e f g h s t u v w x y z Each Hadoop node runs a processing task per core
  41. MapReducing MongoDB + Hadoop - Sharding Data Node 1 Data

    Node 2 Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z z Each Hadoop node runs a processing task per core
  42. Parallel Processing of Splits •Ship “Splits” to Mappers as hostname,

    database, collection, & query •Each Mapper reads the relevant documents in •Parallel processing for high performance •Speaks BSON between all layers!
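    Conceptually, a streamed split boils down to just enough information for a mapper to open its own cursor against the right node. The field names below are invented for illustration and are not the connector's actual wire format:

    from pymongo import MongoClient

    split = {
        "host": "shard0001.example.com:27018",        # where this chunk's data lives
        "database": "enron_mail",
        "collection": "messages",
        "query": {"_id": {"$gte": "s", "$lt": "t"}},  # this split's slice of the key space
    }

    def read_split(split):
        """Each mapper reads only the documents covered by its split, in parallel with the others."""
        coll = MongoClient(split["host"])[split["database"]][split["collection"]]
        for doc in coll.find(split["query"]):
            yield doc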
  43. Python Streaming •The Hadoop Streaming interface is much easier to

    demo (it’s also my favorite feature, and was the hardest to implement) • Java gets a bit ... “verbose” on slides versus Python • Java Hadoop + MongoDB integrates cleanly though for those inclined • Map functions get an initial key of type Object and value of type BSONObject • Represent _id and the full document, respectively •Processing 1.75 gigabytes of the Enron Email Corpus (501,513 emails) • I ran this test on a 6 node Hadoop cluster • Grab your own copy of this dataset at: http://goo.gl/fSleC
  44. A Sample Input Doc

    {
      "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
      "body" : "Here is our forecast\n\n ",
      "subFolder" : "allen-p/_sent_mail",
      "mailbox" : "maildir",
      "filename" : "1.",
      "headers" : {
        "X-cc" : "",
        "From" : "[email protected]",
        "Subject" : "",
        "X-Folder" : "\\Phillip_Allen_Jan2002_1\\Allen, Phillip K.\\'Sent Mail",
        "Content-Transfer-Encoding" : "7bit",
        "X-bcc" : "",
        "To" : "[email protected]",
        "X-Origin" : "Allen-P",
        "X-FileName" : "pallen (Non-Privileged).pst",
        "X-From" : "Phillip K Allen",
        "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
        "X-To" : "Tim Belden ",
        "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
        "Content-Type" : "text/plain; charset=us-ascii",
        "Mime-Version" : "1.0"
      }
    }
  45. Setting up Hadoop Streaming

    Install the Python support module on each Hadoop node:

    $ sudo pip install pymongo_hadoop

    Build (or download) the Streaming module for the Hadoop adapter:

    $ git clone http://github.com/mongodb/mongo-hadoop.git
    $ ./sbt mongo-hadoop-streaming/assembly
  46. Mapper Code

    #!/usr/bin/env python
    import sys
    sys.path.append(".")

    from pymongo_hadoop import BSONMapper

    def mapper(documents):
        i = 0
        for doc in documents:
            i = i + 1
            if 'headers' in doc and 'To' in doc['headers'] and 'From' in doc['headers']:
                from_field = doc['headers']['From']
                to_field = doc['headers']['To']
                recips = [x.strip() for x in to_field.split(',')]
                for r in recips:
                    yield {'_id': {'f': from_field, 't': r}, 'count': 1}

    BSONMapper(mapper)
    print >> sys.stderr, "Done Mapping."
  47. Reducer Code

    #!/usr/bin/env python
    import sys
    sys.path.append(".")

    from pymongo_hadoop import BSONReducer

    def reducer(key, values):
        print >> sys.stderr, "Processing from/to %s" % str(key)
        _count = 0
        for v in values:
            _count += v['count']
        return {'_id': key, 'count': _count}

    BSONReducer(reducer)
  48. Running the MapReduce

    hadoop jar mongo-hadoop-streaming-assembly-1.0.0.jar \
      -mapper /home/ec2-user/enron_map.py \
      -reducer /home/ec2-user/enron_reduce.py \
      -inputURI mongodb://test_mongodb:27020/enron_mail.messages \
      -outputURI mongodb://test_mongodb:27020/enron_mail.sender_map
  49. Results!

    mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 4 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 4 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 6 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
    { "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
    has more
  50. Parallelism is Good The Input Data was split into 44

    pieces for parallel processing... ... coincidentally, there were exactly 44 chunks on my sharded setup. Even with an unsharded collection, Mongo-Hadoop can calculate splits!
  51. We aren’t restricted to Python •For Mongo-Hadoop 1.0, Streaming only

    shipped Python support •Currently in git master, and due to be released with 1.1 is support for two additional languages • Ruby (Tyler Brock - @tylerbrock) • Node.JS (Mike O’Brien - @mpobrien) •The same Enron MapReduce job can be accomplished with either of these languages as well
  52. Ruby + Mongo-Hadoop Streaming •As there isn’t an official release

    for Ruby support yet, you’ll need to build the gem by hand out of git •Like with Python, make sure you install this gem on each of your Hadoop nodes •Once the gem is built & installed, you’ll have access to the mongo-hadoop module from Ruby
  53. Enron Map from Ruby

    #!/usr/bin/env ruby
    require 'mongo-hadoop'

    MongoHadoop.map do |document|
      if document.has_key?('headers')
        headers = document['headers']
        if ['To', 'From'].all? { |header| headers.has_key?(header) }
          to_field = headers['To']
          from_field = headers['From']
          recipients = to_field.split(',').map { |recipient| recipient.strip }
          recipients.map { |recipient| {:_id => {:f => from_field, :t => recipient}, :count => 1} }
        end
      end
    end
  54. Enron Reduce from Ruby

    #!/usr/bin/env ruby
    require 'mongo-hadoop'

    MongoHadoop.reduce do |key, values|
      count = values.reduce(0) { |sum, current| sum + current['count'] }
      { :_id => key, :count => count }
    end
  55. Running the Ruby MapReduce

    hadoop jar mongo-hadoop-streaming-assembly-1.0.0.jar \
      -mapper examples/enron/enron_map.rb \
      -reducer examples/enron/enron_reduce.rb \
      -inputURI mongodb://127.0.0.1/enron_mail.messages \
      -outputURI mongodb://127.0.0.1/enron_mail.output
  56. Node.JS + Mongo-Hadoop Streaming •As there isn’t an official release

    for Node.JS support yet, you’ll need to build the Node module by hand out of git •Like with Python, make sure you install this module on each of your Hadoop nodes •Once the module is built & installed, you’ll have access to the node_mongo_hadoop module from Node.JS
  57. Enron Map from Node.JS

    #!/usr/bin/env node
    var node_mongo_hadoop = require('node_mongo_hadoop')

    var trimString = function(str){
      return String(str).replace(/^\s+|\s+$/g, '');
    }

    function mapFunc(doc, callback){
      if(doc.headers && doc.headers.From && doc.headers.To){
        var from_field = doc['headers']['From']
        var to_field = doc['headers']['To']
        to_field.split(',').forEach(function(to){
          callback( {'_id': {'f': from_field, 't': trimString(to)}, 'count': 1} )
        });
      }
    }

    node_mongo_hadoop.MapBSONStream(mapFunc);
  58. Enron Reduce from Node.JS

    #!/usr/bin/env node
    var node_mongo_hadoop = require('node_mongo_hadoop')

    function reduceFunc(key, values, callback){
      var count = 0;
      values.forEach(function(v){ count += v.count });
      callback( {'_id': key, 'count': count } );
    }

    node_mongo_hadoop.ReduceBSONStream(reduceFunc);
  59. Running the Node.JS MapReduce

    hadoop jar mongo-hadoop-streaming-assembly-1.0.0.jar \
      -mapper examples/enron/enron_map.js \
      -reducer examples/enron/enron_reduce.js \
      -inputURI mongodb://127.0.0.1/enron_mail.messages \
      -outputURI mongodb://127.0.0.1/enron_mail.output
  60. Looking Forward •Mongo Hadoop Connector 1.0.0 is released and available

    •Docs: http://api.mongodb.org/hadoop/ •Downloads & Code: http://github.com/mongodb/mongo-hadoop
  61. Looking Forward •Lots More Coming; 1.1.0 expected in Summer 2012

    • Support for reading from Multiple Input Collections (“MultiMongo”) •Static BSON Support... Read from and Write to Mongo Backup files! •S3 / HDFS stored, mongodump format •Great for big offline batch jobs (this is how Foursquare does it) •Pig input (Read from MongoDB into Pig) •Performance improvements (e.g. pipelining BSON for streaming) •Future: Expanded Ecosystem support (Cascading, Oozie, Mahout, etc)
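    To make the static BSON idea concrete: a mongodump .bson file is just a concatenation of BSON documents, so it can be iterated without a live mongod. A sketch using PyMongo's bundled bson module (the file path is hypothetical, and a real batch job would stream rather than read the whole dump into memory):

    import bson

    with open("dump/enron_mail/messages.bson", "rb") as f:   # hypothetical dump path
        docs = bson.decode_all(f.read())                     # one dict per stored document
    print("%d documents read from the backup file" % len(docs))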
  62. Looking Forward •We are looking to grow our integration with

    Big Data • Not only Hadoop, but other data processing systems our users want such as Storm, Disco and Spark. •Initial Disco support (Nokia’s Python MapReduce framework) is almost complete; look for it this summer •If you have other data processing toolchains you’d like to see integration with, let us know!
  63. http://linkd.in/joinmongo @mongodb http://bit.ly/mongofb Did I Mention We’re Hiring? http://www.10gen.com/careers (

    Jobs of all sorts, all over the world! ) More Questions? Join me for a Whiteboard Session Later... London Room - 6th Floor @ 4PM *Contact Me* [email protected] (twitter: @rit)