
MongoDB + Hadoop: Taming the Elephant in the Room

June 21, 2012

10gen has released v1.0.0 of the Hadoop plugin for MongoDB. In this session, Brendan will go through how to integrate MongoDB with Hadoop for large-scale distributed data processing. Using tools like MapReduce, Pig and Streaming, you will learn how to do analytics and ETL on large datasets, with the ability to load and save data against MongoDB. With Hadoop MapReduce, Java and Scala programmers will find a native solution for using MapReduce to process their data with MongoDB. Programmers of all kinds will find a new way to work with ETL, using Pig to extract and analyze large datasets and persist the results to MongoDB. Python and Ruby programmers can rejoice as well in a new way to write native Mongo MapReduce using the Hadoop Streaming interfaces.


mongodb


Transcript

  1. Brendan McAdams 10gen, Inc. brendan@10gen.com @rit Taming The Elephant In

    The Room with MongoDB + Hadoop Integration
  2. Big Data at a Glance • Big Data can be

    gigabytes, terabytes, petabytes or exabytes • An ideal big data system scales up and down around various data sizes – while providing a uniform view • Major concerns • Can I read & write this data efficiently at different scale? • Can I run calculations on large portions of this data? Large Dataset Primary Key as “username”
  3. Storing & Scaling Big Data MongoDB and Hadoop

  4. Big Data at a Glance • Systems like Google File

    System (which inspired Hadoop’s HDFS) and MongoDB’s Sharding handle the scale problem by chunking • Break up pieces of data into smaller chunks, spread across many data nodes • Each data node contains many chunks • If a chunk gets too large or a node overloaded, data can be rebalanced Large Dataset Primary Key as “username” a b c d e f g h s t u v w x y z ...
  5. Big Data at a Glance Large Dataset Primary Key as

    “username” a b c d e f g h s t u v w x y z
  6. Big Data at a Glance Large Dataset Primary Key as

    “username” a b c d e f g h s t u v w x y z MongoDB Sharding (as well as HDFS) breaks data into chunks (~64 MB)
  7. Large Dataset Primary Key as “username” Scaling Data Node 1

    25% of chunks Data Node 2 25% of chunks Data Node 3 25% of chunks Data Node 4 25% of chunks a b c d e f g h s t u v w x y z Representing data as chunks allows many levels of scale across n data nodes
  8. Scaling Data Node 1 Data Node 2 Data Node 3

    Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z The set of chunks can be evenly distributed across n data nodes
  9. Add Nodes: Chunk Rebalancing Data Node 1 Data Node 2

    Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z The goal is equilibrium - an equal distribution. As nodes are added (or even removed) chunks can be redistributed for balance.
  10. Writes Routed to Appropriate Chunk Data Node 1 Data Node

    2 Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z
  11. Writes Routed to Appropriate Chunk Data Node 1 Data Node

    2 Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z Write to key“ziggy” z Writes are efficiently routed to the appropriate node & chunk
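The routing shown above can be sketched in a few lines of Python. This is a toy illustration, not MongoDB's actual implementation: the chunk table, node names, and `route` helper are hypothetical stand-ins for the config metadata a mongos router consults.

```python
import bisect

# Hypothetical chunk table: sorted lower bounds of each chunk's key range,
# and the data node that owns the chunk starting at each bound.
chunk_bounds = ["a", "i", "q", "z"]
chunk_owner = {"a": "node1", "i": "node2", "q": "node3", "z": "node4"}

def route(key):
    """Return the node owning the chunk whose key range contains `key`."""
    idx = max(bisect.bisect_right(chunk_bounds, key) - 1, 0)
    return chunk_owner[chunk_bounds[idx]]

print(route("ziggy"))    # falls in the "z" chunk -> node4
print(route("brendan"))  # falls in the "a"-"i" chunk -> node1
```

Because the bounds are kept sorted, lookup is a binary search; both reads and writes can be routed the same way.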
  12. Chunk Splitting & Balancing Data Node 1 Data Node 2

    Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z Write to key “ziggy” z If a chunk gets too large (default in MongoDB: 64 MB per chunk), it is split into two new chunks
  13. Chunk Splitting & Balancing Data Node 1 Data Node 2

    Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z z If a chunk gets too large (default in MongoDB: 64 MB per chunk), it is split into two new chunks
  14. Chunk Splitting & Balancing Data Node 1 Data Node 2

    Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z z If a chunk gets too large (default in MongoDB: 64 MB per chunk), it is split into two new chunks
  15. Chunk Splitting & Balancing Data Node 1 Data Node 2

    Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z z2 If a chunk gets too large (default in MongoDB: 64 MB per chunk), it is split into two new chunks z1
  16. Chunk Splitting & Balancing Data Node 1 Data Node 2

    Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z z2 If a chunk gets too large (default in MongoDB: 64 MB per chunk), it is split into two new chunks z1
  17. Chunk Splitting & Balancing Data Node 1 Data Node 2

    Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z z2 z1 Each new part of the Z chunk (left & right) now contains half of the keys
  18. Chunk Splitting & Balancing Data Node 1 Data Node 2

    Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z z2 z1 As chunks continue to grow and split, they can be rebalanced to keep an equal share of data on each server.
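The split step above can be sketched minimally in Python, assuming keys are sorted within a chunk. The threshold and helper below are toy stand-ins for MongoDB's ~64 MB chunk size and its internal split logic:

```python
MAX_CHUNK = 4  # toy threshold standing in for the ~64 MB default

def split_if_needed(chunk):
    """Split an oversized chunk into two new chunks around its median key."""
    if len(chunk) <= MAX_CHUNK:
        return [chunk]
    mid = len(chunk) // 2
    return [chunk[:mid], chunk[mid:]]  # the "z1" and "z2" halves

z_chunk = sorted(["ziggy", "zoe", "zack", "zelda", "zane", "zara"])
z1, z2 = split_if_needed(z_chunk)
print(z1, z2)  # each new chunk holds half of the keys
```

Each resulting half covers a contiguous, non-overlapping key range, so either half can later be moved to another node independently during rebalancing.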
  19. Reads with Key Routed Efficiently Data Node 1 Data Node

    2 Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z1 Read Key “xavier” Reading a single value by Primary Key Read routed efficiently to specific chunk containing key z2
  20. Reads with Key Routed Efficiently Data Node 1 Data Node

    2 Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y Read Key “xavier” Reading a single value by Primary Key Read routed efficiently to specific chunk containing key z1 z2
  21. Reads with Key Routed Efficiently Data Node 1 Data Node

    2 Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y Read Keys “T”->”X” Reading multiple values by Primary Key Reads routed efficiently to specific chunks in range t u v w x z1 z2
  22. Processing Big Data MongoDB and Hadoop

  23. Processing Scalable Big Data •Just as we must be able

    to scale our storage of data (from gigabytes through exabytes and beyond), we must be able to process it. • We had two questions, one of which we’ve answered... • Can I read & write this data efficiently at different scale? • Can I run calculations on large portions of this data?
  24. Processing Scalable Big Data •Just as we must be able

    to scale our storage of data (from gigabytes through exabytes and beyond), we must be able to process it. • We had two questions, one of which we’ve answered... • Can I read & write this data efficiently at different scale? • Can I run calculations on large portions of this data?
  25. Processing Scalable Big Data • The answer to calculating big

    data is much the same as storing it • We need to break our data into bite sized pieces • Build functions which can be composed together repeatedly on partitions of our data • Process portions of the data across multiple calculation nodes • Aggregate the results into a final set of results
  26. Processing Scalable Big Data • These pieces are not chunks

    – rather, the individual data points that make up each chunk • Chunks also make useful data transfer units for processing • Transfer chunks as “Input Splits” to calculation nodes, allowing for scalable parallel processing • The most common application of these techniques is MapReduce • Based on a Google whitepaper, it works with two primary functions – map and reduce – to calculate against large datasets
  27. MapReduce to Calculate Big Data • MapReduce is designed to

    effectively process data at varying scales • Composable function units can be reused repeatedly for scaled results • MongoDB supports MapReduce with JavaScript • There are limitations on its scalability • In addition to the HDFS storage component, Hadoop is built around MapReduce for calculation • MongoDB can be integrated with Hadoop to MapReduce data • No HDFS storage needed - data moves directly between MongoDB and Hadoop’s MapReduce engine
  28. MapReduce to Calculate Big Data • MapReduce is made up of

    a series of phases, the primary of which are • Map • Shuffle • Reduce • Let’s look at a typical MapReduce job • Email records • Count # of times a particular user has received email
  29. MapReducing Email to: tyler from: brendan subject: Ruby Support to:

    brendan from: tyler subject: Re: Ruby Support to: mike from: brendan subject: Node Support to: brendan from: mike subject: Re: Node Support to: mike from: tyler subject: COBOL Support to: tyler from: mike subject: Re: COBOL Support (WTF?)
  30. Map Step to: tyler from: brendan subject: Ruby Support to:

    brendan from: tyler subject: Re: Ruby Support to: mike from: brendan subject: Node Support to: brendan from: mike subject: Re: Node Support to: mike from: tyler subject: COBOL Support to: tyler from: mike subject: Re: COBOL Support (WTF?) key: tyler value: {count: 1} key: brendan value: {count: 1} key: mike value: {count: 1} key: brendan value: {count: 1} key: mike value: {count: 1} key: tyler value: {count: 1} map function emit(k, v) map function breaks each document into a key (grouping) & value
  31. Group/Shuffle Step key: tyler value: {count: 1} key: brendan value:

    {count: 1} key: mike value: {count: 1} key: brendan value: {count: 1} key: mike value: {count: 1} key: tyler value: {count: 1} Group like keys together, creating an array of their distinct values (Automatically done by M/R frameworks)
  32. Group/Shuffle Step key: brendan values: [{count: 1}, {count: 1}] key:

    mike values: [{count: 1}, {count: 1}] key: tyler values: [{count: 1}, {count: 1}] Group like keys together, creating an array of their distinct values (Automatically done by M/R frameworks)
  33. Reduce Step key: brendan values: [{count: 1}, {count: 1}] key:

    mike values: [{count: 1}, {count: 1}] key: tyler values: [{count: 1}, {count: 1}] For each key reduce function flattens the list of values to a single result reduce function aggregate values return (result) key: tyler value: {count: 2} key: mike value: {count: 2} key: brendan value: {count: 2}
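The whole walkthrough above fits in a few lines of in-process Python. This is a toy single-machine sketch of the three phases, not a distributed job — the framework normally performs the shuffle and fans the map and reduce work out across nodes:

```python
from collections import defaultdict

emails = [
    {"to": "tyler",   "from": "brendan", "subject": "Ruby Support"},
    {"to": "brendan", "from": "tyler",   "subject": "Re: Ruby Support"},
    {"to": "mike",    "from": "brendan", "subject": "Node Support"},
    {"to": "brendan", "from": "mike",    "subject": "Re: Node Support"},
    {"to": "mike",    "from": "tyler",   "subject": "COBOL Support"},
    {"to": "tyler",   "from": "mike",    "subject": "Re: COBOL Support (WTF?)"},
]

def map_fn(doc):
    # Break each document into a key (grouping) & value: emit(k, v)
    yield doc["to"], {"count": 1}

def reduce_fn(key, values):
    # Flatten the list of values for a key down to a single result
    return {"count": sum(v["count"] for v in values)}

# Shuffle step: group like keys together (M/R frameworks do this for us)
groups = defaultdict(list)
for doc in emails:
    for k, v in map_fn(doc):
        groups[k].append(v)

results = {k: reduce_fn(k, vs) for k, vs in groups.items()}
print(results)  # each user received 2 emails
```

Because `reduce_fn` only ever sees one key's values at a time, the reduce phase parallelizes naturally: each key group can be handled by a different reducer.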
  34. Processing Scalable Big Data • MapReduce provides an effective system

    for calculating and processing our large datasets (from gigabytes through exabytes and beyond) • MapReduce is supported in many places including MongoDB & Hadoop • We have effective answers for both of our concerns. • Can I read & write this data efficiently at different scale? • Can I run calculations on large portions of this data?
  35. Processing Scalable Big Data • MapReduce provides an effective system

    for calculating and processing our large datasets (from gigabytes through exabytes and beyond) • MapReduce is supported in many places including MongoDB & Hadoop • We have effective answers for both of our concerns. • Can I read & write this data efficiently at different scale? • Can I run calculations on large portions of this data?
  36. Integrating MongoDB + Hadoop

  37. Separation of Concern • Data storage and data processing are

    often separate concerns • MongoDB has limited ability to aggregate and process large datasets (JavaScript parallelism - alleviated some with New Aggregation Framework) • Hadoop is built for scalable processing of large datasets
  38. MapReducing in MongoDB - Single Server Large Dataset (single mongod)

    Primary Key as “username” Only one MapReduce thread available
  39. MapReducing in MongoDB - Sharding

    Data Node 1 Data Node 2 Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z z ... One MapReduce thread per shard (no per-chunk parallelism). Architecturally, the number of processing nodes is limited to our number of data storage nodes.
  40. The Right Tool for the Job • JavaScript isn’t always

    the ideal language for many types of calculations • Slow • Limited datatypes • No access to complex analytics libraries available on the JVM • Rich, powerful ecosystem of tools on the JVM + Hadoop • Hadoop has machine learning, ETL, and many other tools which are much more flexible than the processing tools in MongoDB
  41. Being a Good Neighbor • Integration with Customers’ Existing Stacks

    & Toolchains is Crucial • Many users & customers already have Hadoop in their stacks • They want us to “play nicely” with their existing toolchains • Different groups in companies may mandate all data be processable in Hadoop
  42. Capabilities

  43. Introducing the MongoDB Hadoop Connector • Recently, we released v1.0.0

    of this integration: the MongoDB Hadoop Connector • Read/Write between MongoDB + Hadoop (Core MapReduce) in Java • Write Pig (ETL) jobs’ output to MongoDB • Write MapReduce jobs in Python via Hadoop Streaming • Collect massive amounts of logging output into MongoDB via Flume
  44. Hadoop Connector Capabilities • Split large datasets into smaller chunks

    (“Input Splits”) for parallel Hadoop processing • Without splits, only one mapper can run • Connector can split both sharded & unsharded collections • Sharded: Read individual chunks from config server into Hadoop • Unsharded: Create splits, similar to how sharding chunks are calculated
  45. MapReducing MongoDB + Hadoop - Single Server

    Large Dataset (single mongod) Primary Key as “username” a b c d e f g h s t u v w x y z Each Hadoop node runs a processing task per core.
  46. MapReducing MongoDB + Hadoop - Sharding

    Data Node 1 Data Node 2 Data Node 3 Data Node 4 Data Node 5 a b c d e f g h s t u v w x y z z Each Hadoop node runs a processing task per core.
  47. Parallel Processing of Splits • Ship “Splits” to Mappers as

    hostname, database, collection, & query • Each Mapper reads the relevant documents in • Parallel processing for high performance • Speaks BSON between all layers!
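A hedged sketch of what one such split might look like as plain data — the field names and `make_split` helper below are illustrative stand-ins, not the connector's actual wire format:

```python
def make_split(host, db, collection, lower, upper):
    """Describe one chunk as a self-contained unit of work for one mapper."""
    return {
        "host": host,                      # where the chunk's data lives
        "ns": "%s.%s" % (db, collection),  # namespace to read from
        # Range query selecting only this chunk's documents:
        "query": {"_id": {"$gte": lower, "$lt": upper}},
    }

split = make_split("shard1.example.com", "enron_mail", "messages", "a", "i")
# A mapper would open a cursor against split["host"], run split["query"]
# over split["ns"], and process the resulting BSON documents -- in
# parallel with the mappers handling every other split.
print(split["ns"], split["query"])
```

Because each split's query covers a disjoint key range, every mapper reads only its own slice of the collection and no document is processed twice.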
  48. MongoDB Hadoop Connector In Action

  49. Python Streaming •The Hadoop Streaming interface is much easier to

    demo (it’s also my favorite feature, and was the hardest to implement) • Java gets a bit ... “verbose” on slides versus Python • Java Hadoop + MongoDB integrates cleanly though for those inclined • Map functions get an initial key of type Object and value of type BSONObject • Represent _id and the full document, respectively •Processing 1.75 gigabytes of the Enron Email Corpus (501,513 emails) • I ran this test on a 6 node Hadoop cluster • Grab your own copy of this dataset at: http://goo.gl/fSleC
  50. A Sample Input Doc

    {
      "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
      "body" : "Here is our forecast\n\n ",
      "subFolder" : "allen-p/_sent_mail",
      "mailbox" : "maildir",
      "filename" : "1.",
      "headers" : {
        "X-cc" : "",
        "From" : "phillip.allen@enron.com",
        "Subject" : "",
        "X-Folder" : "\\Phillip_Allen_Jan2002_1\\Allen, Phillip K.\\'Sent Mail",
        "Content-Transfer-Encoding" : "7bit",
        "X-bcc" : "",
        "To" : "tim.belden@enron.com",
        "X-Origin" : "Allen-P",
        "X-FileName" : "pallen (Non-Privileged).pst",
        "X-From" : "Phillip K Allen",
        "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
        "X-To" : "Tim Belden ",
        "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
        "Content-Type" : "text/plain; charset=us-ascii",
        "Mime-Version" : "1.0"
      }
    }
  51. Setting up Hadoop Streaming

    • Install the Python support module on each Hadoop node:

        $ sudo pip install pymongo_hadoop

    • Build (or download) the Streaming module for the Hadoop adapter:

        $ git clone http://github.com/mongodb/mongo-hadoop.git
        $ ./sbt mongo-hadoop-streaming/assembly
  52. Mapper Code

    #!/usr/bin/env python
    import sys
    sys.path.append(".")
    from pymongo_hadoop import BSONMapper

    def mapper(documents):
        i = 0
        for doc in documents:
            i = i + 1
            if 'headers' in doc and 'To' in doc['headers'] and 'From' in doc['headers']:
                from_field = doc['headers']['From']
                to_field = doc['headers']['To']
                recips = [x.strip() for x in to_field.split(',')]
                for r in recips:
                    yield {'_id': {'f': from_field, 't': r}, 'count': 1}

    BSONMapper(mapper)
    print >> sys.stderr, "Done Mapping."
  53. Reducer Code

    #!/usr/bin/env python
    import sys
    sys.path.append(".")
    from pymongo_hadoop import BSONReducer

    def reducer(key, values):
        print >> sys.stderr, "Processing from/to %s" % str(key)
        _count = 0
        for v in values:
            _count += v['count']
        return {'_id': key, 'count': _count}

    BSONReducer(reducer)
  54. Running the MapReduce

    hadoop jar mongo-hadoop-streaming-assembly-1.0.0.jar \
        -mapper /home/ec2-user/enron_map.py \
        -reducer /home/ec2-user/enron_reduce.py \
        -inputURI mongodb://test_mongodb:27020/enron_mail.messages \
        -outputURI mongodb://test_mongodb:27020/enron_mail.sender_map
  55. Results!

    mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "15126-1267@m2.innovyx.com" }, "count" : 1 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "2586207@www4.imakenews.com" }, "count" : 1 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "40enron@enron.com" }, "count" : 2 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..davis@enron.com" }, "count" : 2 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..hughes@enron.com" }, "count" : 4 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..lindholm@enron.com" }, "count" : 1 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..schroeder@enron.com" }, "count" : 1 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..shankman@enron.com" }, "count" : 2 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "aaron.berutti@enron.com" }, "count" : 1 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "adnan.patel@enron.com" }, "count" : 2 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "adriana.wynn@enron.com" }, "count" : 1 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "adventurehf@pdq.net" }, "count" : 3 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "aeplager@yahoo.com" }, "count" : 1 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "agatha.tran@enron.com" }, "count" : 3 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "ahampshire-cowan@howard.edu" }, "count" : 4 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "ahaws@austin.rr.com" }, "count" : 1 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "alberto.gude@enron.com" }, "count" : 6 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "alfredo@dvinci.net" }, "count" : 3 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "amanda.day@enron.com" }, "count" : 2 }
    { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "amepya2@hotmail.com" }, "count" : 1 }
    has more
  56. Parallelism is Good The Input Data was split into 44

    pieces for parallel processing... ... coincidentally, there were exactly 44 chunks on my sharded setup. Even with an unsharded collection, MongoHadoop can calculate splits!
  57. We aren’t restricted to Python •For Mongo-Hadoop 1.0, Streaming only

    shipped Python support •Currently in git master, and due to be released with 1.1 is support for two additional languages • Ruby (Tyler Brock - @tylerbrock) • Node.JS (Mike O’Brien - @mpobrien) •The same Enron MapReduce job can be accomplished with either of these languages as well
  58. Ruby + Mongo-Hadoop Streaming •As there isn’t an official release

    for Ruby support yet, you’ll need to build the gem by hand from git • Like with Python, make sure you install this gem on each of your Hadoop nodes • Once the gem is built & installed, you’ll have access to the mongo-hadoop module from Ruby
  59. Enron Map from Ruby

    #!/usr/bin/env ruby
    require 'mongo-hadoop'

    MongoHadoop.map do |document|
      if document.has_key?('headers')
        headers = document['headers']
        if ['To', 'From'].all? { |header| headers.has_key?(header) }
          to_field = headers['To']
          from_field = headers['From']
          recipients = to_field.split(',').map { |recipient| recipient.strip }
          recipients.map { |recipient| {:_id => {:f => from_field, :t => recipient}, :count => 1} }
        end
      end
    end
  60. Enron Reduce from Ruby

    #!/usr/bin/env ruby
    require 'mongo-hadoop'

    MongoHadoop.reduce do |key, values|
      count = values.reduce(0) { |sum, current| sum + current['count'] }
      { :_id => key, :count => count }
    end
  61. Running the Ruby MapReduce

    hadoop jar mongo-hadoop-streaming-assembly-1.0.0.jar \
        -mapper examples/enron/enron_map.rb \
        -reducer examples/enron/enron_reduce.rb \
        -inputURI mongodb://127.0.0.1/enron_mail.messages \
        -outputURI mongodb://127.0.0.1/enron_mail.output
  62. Node.JS + Mongo-Hadoop Streaming •As there isn’t an official release

    for Node.JS support yet, you’ll need to build the Node module by hand from git • Like with Python, make sure you install this module on each of your Hadoop nodes • Once the module is built & installed, you’ll have access to the node_mongo_hadoop module from Node.JS
  63. Enron Map from Node.JS

    #!/usr/bin/env node
    var node_mongo_hadoop = require('node_mongo_hadoop')

    var trimString = function(str){
      return String(str).replace(/^\s+|\s+$/g, '');
    }

    function mapFunc(doc, callback){
      if(doc.headers && doc.headers.From && doc.headers.To){
        var from_field = doc['headers']['From']
        var to_field = doc['headers']['To']
        to_field.split(',').forEach(function(to){
          callback( {'_id': {'f': from_field, 't': trimString(to)}, 'count': 1} )
        });
      }
    }

    node_mongo_hadoop.MapBSONStream(mapFunc);
  64. Enron Reduce from Node.JS

    #!/usr/bin/env node
    var node_mongo_hadoop = require('node_mongo_hadoop')

    function reduceFunc(key, values, callback){
      var count = 0;
      values.forEach(function(v){ count += v.count });
      callback( {'_id': key, 'count': count} );
    }

    node_mongo_hadoop.ReduceBSONStream(reduceFunc);
  65. Running the Node.JS MapReduce

    hadoop jar mongo-hadoop-streaming-assembly-1.0.0.jar \
        -mapper examples/enron/enron_map.js \
        -reducer examples/enron/enron_reduce.js \
        -inputURI mongodb://127.0.0.1/enron_mail.messages \
        -outputURI mongodb://127.0.0.1/enron_mail.output
  66. [Joining the Hive mind]

  67. Hive + MongoDB • Over the past weekend, I ended

    up with a few spare hours and started playing with a frequently requested feature • Hive is a Hadoop-based data warehousing system, providing a SQL-like language (dubbed “QL”) • Designed for large datasets stored on HDFS • Lots of SQL-like facilities such as data summarization, aggregation and analysis; all compile down to Hadoop MapReduce tasks • Custom user-defined functions can even replace inefficient Hive queries with raw MapReduce • Many users have requested support for this with MongoDB data
  68. Sticking BSON in the Hive • Step 1 involved teaching

    Hive to read MongoDB Backup files - essentially, raw BSON • While there are some APIs that we can use to talk directly to MongoDB, we haven’t explored that yet • With this code, it is possible to load a .bson file (typically produced by mongodump) directly into Hive and query it • No conversion needed to a “native” Hive format - BSON is read directly • Still needs some polish and tweaking, but this is now slated to be included in the upcoming 1.1 release
  69. Loading BSON into Hive • As Hive emulates a Relational

    Database, tables need a schema (we’re evaluating ways to ‘infer’ schemas to make this more automatic) • Let’s load some MongoDB collections into Hive and play with the data!
  70. Loading BSON into Hive

  71. Loading BSON into Hive • We have BSON Files to

    load, now we need to instruct Hive about their Schemas ...
  72. Loading BSON into Hive

  73. Defining Hive Schemas

    • We’ve given some instructions to Hive about the structure as well as storage of our MongoDB files. Let’s look at “scores” closer:

        CREATE TABLE scores (
          student int,
          name string,
          score int
        )
        ROW FORMAT SERDE "com.mongodb.hadoop.hive.BSONSerde"
        STORED AS
          INPUTFORMAT "com.mongodb.hadoop.hive.input.BSONFileInputFormat"
          OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
        LOCATION "/Users/brendan/code/mongodb/mongo-hadoop/hive/demo/meta/scores";

    • The first line defines the structure – with a column for ‘student’, ‘name’ and ‘score’, each having a SQL-like datatype.
    • ROW FORMAT SERDE instructs Hive to use a SerDe of ‘BSONSerde’.
    • In Hive, a SerDe is a special codec that explains how to read and write (serialize and deserialize) a custom data format containing Hive rows.
    • We also need to tell Hive to use an INPUTFORMAT of ‘BSONFileInputFormat’, which tells it how to read BSON files off of disk into individual rows (the SerDe is instructions for how to turn individual lines of BSON into a Hive-friendly format).
    • Finally, we specify where Hive should store the metadata, etc. with LOCATION.
  74. Loading Data to Hive

    • Finally, we need to load data into the Hive table from our raw BSON file:

        hive> LOAD DATA LOCAL INPATH "dump/training/scores.bson" INTO TABLE scores;

    • Now we can query!
  75. Querying Hive

  76. Querying Hive • Most standard SQL-like queries work - though

    I’m not going to enumerate the ins and outs of HiveQL today • What we can do with Hive that we can’t with MongoDB ... is joins • In addition to the scores data, I also created a collection of student ids and randomly generated names. Let’s look at joining these to our scores in Hive
  77. Joins from BSON + Hive

  78. Joins from BSON + Hive

    hive> SELECT u.firstName, u.lastName, u.sex, s.name, s.score FROM
        > scores s JOIN students u ON u.studentID = s.student
        > ORDER BY s.score DESC;

    DELPHIA DOUIN Female exam 99
    DOMINIQUE SUAZO Male essay 99
    ETTIE BETZIG Female exam 99
    ADOLFO PIRONE Male exam 99
    IVORY NETTERS Male essay 99
    RAFAEL HURLES Male essay 99
    KRISTEN VALLERO Female exam 99
    CONNIE KNAPPER Female quiz 99
    JEANNA DIVELY Female exam 99
    TRISTAN SEGAL Male exam 99
    WILTON TRULOVE Male essay 99
    THAO OTSMAN Female essay 99
    CLARENCE STITZ Male quiz 99
    LUIS GUAMAN Male exam 99
    WILLARD RUSSAK Male quiz 99
    MARCOS HOELLER Male quiz 99
    TED BOTTCHER Male essay 99
    LAKEISHA NAGAMINE Female essay 99
    ALLEN HITT Male exam 99
    MADELINE DAWKINS Female essay 99
  79. This is just the beginning...

  80. Looking Forward • Mongo Hadoop Connector 1.0.0 is released and

    available • Docs: http://api.mongodb.org/hadoop/ • Downloads & Code: http://github.com/mongodb/mongo-hadoop
  81. Looking Forward •Lots More Coming; 1.1.0 expected in Summer 2012

    • Support for reading from Multiple Input Collections (“MultiMongo”) • Static BSON Support... Read from and Write to Mongo Backup files! • S3 / HDFS stored, mongodump format • Great for big offline batch jobs (this is how Foursquare does it) • Pig input (Read from MongoDB into Pig) • Performance improvements (e.g. pipelining BSON for streaming) • Future: Expanded Ecosystem support (Cascading, Oozie, Mahout, etc)
  82. Looking Forward • We are looking to grow our integration

    with Big Data • Not only Hadoop, but other data processing systems our users want such as Storm, Disco and Spark. • Initial Disco support (Nokia’s Python MapReduce framework) is almost complete; look for it this summer • If you have other data processing toolchains you’d like to see integration with, let us know!
  83. http://linkd.in/joinmongo @mongodb facebook.com/mongodb Did I Mention We’re Hiring?

    http://www.10gen.com/careers (Jobs of all sorts, all over the world!) [Download the Hadoop Connector] http://github.com/mongodb/mongo-hadoop [Docs] http://api.mongodb.org/hadoop/ *Contact Me* brendan@10gen.com (twitter: @rit)