Slide 1

Taming The Elephant In The Room with MongoDB + Hadoop Integration
Brendan McAdams, 10gen, Inc.
[email protected] / @rit

Slide 2

Big Data at a Glance
• Big Data can be gigabytes, terabytes, petabytes or exabytes
• An ideal big data system scales up and down across various data sizes – while providing a uniform view
• Major concerns:
  • Can I read & write this data efficiently at different scales?
  • Can I run calculations on large portions of this data?
[Diagram: large dataset, primary key as “username”]

Slide 3

Storing & Scaling Big Data
MongoDB and Hadoop

Slide 4

Big Data at a Glance
• Systems like the Google File System (which inspired Hadoop’s HDFS) and MongoDB’s sharding handle the scale problem by chunking
• Break data up into smaller chunks, spread across many data nodes
• Each data node contains many chunks
• If a chunk gets too large or a node becomes overloaded, data can be rebalanced
[Diagram: large dataset, primary key as “username”, chunks a b c d e f g h ... s t u v w x y z spread across nodes]
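The chunking idea above can be sketched in a few lines of plain Python. This is an illustrative model only – the chunk table, node names, and `route` function are hypothetical, not MongoDB's actual implementation – but it shows how a sorted table of chunk lower bounds routes any shard-key value to the node holding it:

```python
import bisect

# Hypothetical chunk table: each chunk covers [lower_bound, next_lower_bound)
# of the shard key ("username"), and lives on a data node.
chunks = [
    ("a", "node1"), ("e", "node1"),
    ("i", "node2"), ("m", "node3"),
    ("q", "node4"), ("u", "node4"),
]

def route(key):
    """Find the node holding the chunk whose key range contains `key`."""
    bounds = [lower for lower, _ in chunks]
    # bisect_right - 1 gives the last chunk whose lower bound <= key
    i = bisect.bisect_right(bounds, key) - 1
    return chunks[max(i, 0)][1]

print(route("brendan"))  # falls in the ["a", "e") chunk -> node1
print(route("ziggy"))    # falls in the ["u", max) chunk -> node4
```

Because the table is sorted by lower bound, a binary search is all that is needed to locate any key's chunk.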

Slide 5

Big Data at a Glance
[Diagram: large dataset, primary key as “username”, keys a b c d e f g h ... s t u v w x y z]

Slide 6

Big Data at a Glance
MongoDB sharding (as well as HDFS) breaks data into chunks (~64 MB)
[Diagram: large dataset, primary key as “username”, keys a–z grouped into chunks]

Slide 7

Scaling
Representing data as chunks allows many levels of scale across n data nodes
[Diagram: chunks a–z distributed across Data Nodes 1–4, 25% of chunks each]

Slide 8

Scaling
The set of chunks can be evenly distributed across n data nodes
[Diagram: chunks a–z distributed across Data Nodes 1–5]

Slide 9

Add Nodes: Chunk Rebalancing
The goal is equilibrium – an equal distribution. As nodes are added (or even removed), chunks can be redistributed for balance.
[Diagram: chunks a–z redistributed across Data Nodes 1–5]
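The rebalancing described above can be sketched as a tiny greedy loop. Again this is an illustrative model – the chunk names and the "move one chunk at a time from busiest to idlest" policy are simplifying assumptions, not the balancer's real algorithm:

```python
# Hypothetical chunk-to-node assignment, deliberately imbalanced.
assignment = {
    "node1": ["a-e", "e-i", "i-m", "m-q"],
    "node2": ["q-u"],
    "node3": ["u-max"],
}

def rebalance(assignment):
    """Move chunks from the most-loaded to the least-loaded node until
    chunk counts differ by at most 1 (the equilibrium goal)."""
    while True:
        most = max(assignment, key=lambda n: len(assignment[n]))
        least = min(assignment, key=lambda n: len(assignment[n]))
        if len(assignment[most]) - len(assignment[least]) <= 1:
            return assignment
        # Migrate one chunk toward the under-loaded node
        assignment[least].append(assignment[most].pop())

balanced = rebalance(assignment)
print({node: len(chunks) for node, chunks in balanced.items()})
# each node ends up holding 2 chunks
```

The same loop handles node removal: empty a node's list back into the pool and the remaining nodes converge to a new equilibrium.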

Slide 10

Writes Routed to Appropriate Chunk
[Diagram: Data Nodes 1–5 holding chunks a–z]

Slide 11

Writes Routed to Appropriate Chunk
Writes are efficiently routed to the appropriate node & chunk
[Diagram: write to key “ziggy” routed to the node holding the z chunk]

Slide 12

Chunk Splitting & Balancing
If a chunk gets too large (default in MongoDB: 64 MB per chunk), it is split into two new chunks
[Diagram: write to key “ziggy” landing in the z chunk across Data Nodes 1–5]

Slide 13

Chunk Splitting & Balancing
If a chunk gets too large (default in MongoDB: 64 MB per chunk), it is split into two new chunks
[Diagram: the z chunk growing across Data Nodes 1–5]

Slide 15

Chunk Splitting & Balancing
If a chunk gets too large (default in MongoDB: 64 MB per chunk), it is split into two new chunks
[Diagram: the z chunk splitting into z1 and z2]

Slide 17

Chunk Splitting & Balancing
Each new part of the z chunk (left & right) now contains half of the keys
[Diagram: z1 and z2 chunks across Data Nodes 1–5]

Slide 18

Chunk Splitting & Balancing
As chunks continue to grow and split, they can be rebalanced to keep an equal share of data on each server.
[Diagram: chunks, including z1 and z2, balanced across Data Nodes 1–5]

Slide 19

Reads with Key Routed Efficiently
Reading a single value by primary key: the read is routed efficiently to the specific chunk containing that key
[Diagram: read of key “xavier” across Data Nodes 1–5]

Slide 20

Reads with Key Routed Efficiently
Reading a single value by primary key: the read is routed efficiently to the specific chunk containing that key
[Diagram: read of key “xavier” routed to the x chunk]

Slide 21

Reads with Key Routed Efficiently
Reading multiple values by primary key: reads of keys “T”→“X” are routed efficiently to the specific chunks in that range
[Diagram: range read spanning the t u v w x chunks across Data Nodes 1–5]
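Range reads extend the single-key routing idea: instead of one chunk, find every chunk overlapping the requested range. A minimal sketch, using a hypothetical chunk table (not MongoDB's real metadata format):

```python
import bisect

# Hypothetical chunk table sorted by lower bound; values are data nodes.
chunks = [("a", "node1"), ("i", "node2"), ("q", "node3"), ("u", "node4")]
bounds = [lower for lower, _ in chunks]

def route_range(start, end):
    """Return the nodes holding chunks that overlap [start, end]."""
    first = max(bisect.bisect_right(bounds, start) - 1, 0)
    last = max(bisect.bisect_right(bounds, end) - 1, 0)
    return [node for _, node in chunks[first:last + 1]]

print(route_range("t", "x"))  # only the overlapping chunks are touched
```

Nodes whose chunks fall entirely outside the range are never contacted, which is what makes range scans by shard key efficient.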

Slide 22

Processing Big Data
MongoDB and Hadoop

Slide 23

Processing Scalable Big Data
• Just as we must be able to scale our storage of data (from gigabytes through exabytes and beyond), we must be able to process it.
• We had two questions, one of which we’ve answered...
  • Can I read & write this data efficiently at different scales?
  • Can I run calculations on large portions of this data?

Slide 25

Processing Scalable Big Data
• The answer to calculating big data is much the same as storing it
• We need to break our data into bite-sized pieces
• Build functions which can be composed together repeatedly on partitions of our data
• Process portions of the data across multiple calculation nodes
• Aggregate the results into a final set of results

Slide 26

Processing Scalable Big Data
• These pieces are not chunks – rather, the individual data points that make up each chunk
• Chunks make useful data transfer units for processing as well
  • Transfer chunks as “Input Splits” to calculation nodes, allowing for scalable parallel processing
• The most common application of these techniques is MapReduce
  • Based on a Google whitepaper; works with two primary functions – map and reduce – to calculate against large datasets

Slide 27

MapReduce to Calculate Big Data
• MapReduce is designed to effectively process data at varying scales
• Composable function units can be reused repeatedly for scaled results
• MongoDB supports MapReduce with JavaScript
  • Limitations on its scalability
• In addition to the HDFS storage component, Hadoop is built around MapReduce for calculation
• MongoDB can be integrated to MapReduce data on Hadoop
  • No HDFS storage needed – data moves directly between MongoDB and Hadoop’s MapReduce engine

Slide 28

MapReduce to Calculate Big Data
• MapReduce is made up of a series of phases, the primary of which are:
  • Map
  • Shuffle
  • Reduce
• Let’s look at a typical MapReduce job
  • Email records
  • Count the # of times a particular user has received email
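The three phases above can be sketched in plain Python for the email-count job that follows. This is an illustrative, single-process model – the record layout and helper names are mine, not the connector's API – but each section corresponds to one phase:

```python
from itertools import groupby

emails = [
    {"to": "tyler", "from": "brendan"},
    {"to": "brendan", "from": "tyler"},
    {"to": "mike", "from": "brendan"},
    {"to": "brendan", "from": "mike"},
    {"to": "mike", "from": "tyler"},
    {"to": "tyler", "from": "mike"},
]

# Map: emit a (key, value) pair per record -- the key groups, the value counts
mapped = [(doc["to"], {"count": 1}) for doc in emails]

# Shuffle: group like keys together (M/R frameworks do this automatically)
mapped.sort(key=lambda kv: kv[0])
grouped = {k: [v for _, v in kvs]
           for k, kvs in groupby(mapped, key=lambda kv: kv[0])}

# Reduce: flatten each key's list of values into a single result
results = {k: {"count": sum(v["count"] for v in vs)}
           for k, vs in grouped.items()}
print(results)  # each recipient received 2 emails
```

In a real cluster the map and reduce sections run in parallel on many nodes, and the shuffle moves data between them; the logic per key is exactly this simple.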

Slide 29

MapReducing Email
to: tyler, from: brendan, subject: Ruby Support
to: brendan, from: tyler, subject: Re: Ruby Support
to: mike, from: brendan, subject: Node Support
to: brendan, from: mike, subject: Re: Node Support
to: mike, from: tyler, subject: COBOL Support
to: tyler, from: mike, subject: Re: COBOL Support (WTF?)

Slide 30

Map Step
The map function breaks each document into a key (grouping) & a value, calling emit(k, v):
key: tyler, value: {count: 1}
key: brendan, value: {count: 1}
key: mike, value: {count: 1}
key: brendan, value: {count: 1}
key: mike, value: {count: 1}
key: tyler, value: {count: 1}

Slide 31

Group/Shuffle Step
Group like keys together, creating an array of their distinct values (done automatically by M/R frameworks):
key: tyler, value: {count: 1}
key: brendan, value: {count: 1}
key: mike, value: {count: 1}
key: brendan, value: {count: 1}
key: mike, value: {count: 1}
key: tyler, value: {count: 1}

Slide 32

Group/Shuffle Step
Group like keys together, creating an array of their distinct values (done automatically by M/R frameworks):
key: brendan, values: [{count: 1}, {count: 1}]
key: mike, values: [{count: 1}, {count: 1}]
key: tyler, values: [{count: 1}, {count: 1}]

Slide 33

Reduce Step
For each key, the reduce function flattens the list of values to a single result (aggregate the values, return the result):
key: brendan, values: [{count: 1}, {count: 1}] → key: brendan, value: {count: 2}
key: mike, values: [{count: 1}, {count: 1}] → key: mike, value: {count: 2}
key: tyler, values: [{count: 1}, {count: 1}] → key: tyler, value: {count: 2}

Slide 34

Processing Scalable Big Data
• MapReduce provides an effective system for calculating and processing our large datasets (from gigabytes through exabytes and beyond)
• MapReduce is supported in many places, including MongoDB & Hadoop
• We have effective answers for both of our concerns:
  • Can I read & write this data efficiently at different scales?
  • Can I run calculations on large portions of this data?

Slide 36

Integrating MongoDB + Hadoop

Slide 37

Separation of Concerns
• Data storage and data processing are often separate concerns
• MongoDB has limited ability to aggregate and process large datasets (JavaScript parallelism – alleviated somewhat by the new Aggregation Framework)
• Hadoop is built for scalable processing of large datasets

Slide 38

MapReducing in MongoDB – Single Server
Only one MapReduce thread is available
[Diagram: a single mongod holding the whole dataset, primary key as “username”]

Slide 39

MapReducing in MongoDB – Sharding
One MapReduce thread per shard (no per-chunk parallelism)
Architecturally, the number of processing nodes is limited to our number of data storage nodes.
[Diagram: Data Nodes 1–5 holding chunks a–z, one MapReduce thread on each shard]

Slide 40

The Right Tool for the Job
• JavaScript isn’t always the ideal language for many types of calculations
  • Slow
  • Limited datatypes
  • No access to the complex analytics libraries available on the JVM
• Rich, powerful ecosystem of tools on the JVM + Hadoop
  • Hadoop has machine learning, ETL, and many other tools which are much more flexible than the processing tools in MongoDB

Slide 41

Being a Good Neighbor
• Integration with customers’ existing stacks & toolchains is crucial
• Many users & customers already have Hadoop in their stacks
  • They want us to “play nicely” with their existing toolchains
  • Different groups in companies may mandate that all data be processable in Hadoop

Slide 42

Capabilities

Slide 43

Introducing the MongoDB Hadoop Connector
• Recently, we released v1.0.0 of this integration: the MongoDB Hadoop Connector
• Read/write between MongoDB + Hadoop (core MapReduce) in Java
• Write Pig (ETL) jobs’ output to MongoDB
• Write MapReduce jobs in Python via Hadoop Streaming
• Collect massive amounts of logging output into MongoDB via Flume

Slide 44

Hadoop Connector Capabilities
• Split large datasets into smaller chunks (“Input Splits”) for parallel Hadoop processing
  • Without splits, only one mapper can run
• The connector can split both sharded & unsharded collections
  • Sharded: read individual chunks from the config server into Hadoop
  • Unsharded: create splits, similar to how sharding chunks are calculated
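The unsharded case above – cutting a collection into splits much as the balancer picks chunk boundaries – can be sketched as follows. The function name and the "sample sorted keys, cut at even intervals" strategy are illustrative assumptions, not the connector's exact algorithm:

```python
# Hypothetical split calculation for an unsharded collection: walk the
# sorted shard-key values and cut them into roughly equal "input splits".
def calculate_splits(sorted_keys, num_splits):
    """Return (lower, upper) key ranges; an upper of None means max key."""
    step = max(len(sorted_keys) // num_splits, 1)
    boundaries = sorted_keys[::step][:num_splits]
    splits = []
    for i, lower in enumerate(boundaries):
        upper = boundaries[i + 1] if i + 1 < len(boundaries) else None
        splits.append((lower, upper))
    return splits

keys = ["alice", "bob", "carol", "dave", "erin", "frank", "grace", "heidi"]
print(calculate_splits(keys, 4))
# [('alice', 'carol'), ('carol', 'erin'), ('erin', 'grace'), ('grace', None)]
```

Each resulting key range becomes one mapper's slice of the collection, so the number of concurrent mappers is no longer tied to the number of storage nodes.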

Slide 45

MapReducing MongoDB + Hadoop – Single Server
Each Hadoop node runs a processing task per core
[Diagram: a single mongod, primary key as “username”, with splits a–z fanned out to Hadoop tasks]

Slide 46

MapReducing MongoDB + Hadoop – Sharding
Each Hadoop node runs a processing task per core
[Diagram: Data Nodes 1–5 with chunks a–z fanned out to Hadoop tasks]

Slide 47

Parallel Processing of Splits
• Ship “Splits” to mappers as hostname, database, collection, & query
• Each mapper reads the relevant documents in
• Parallel processing for high performance
• Speaks BSON between all layers!
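To make the bullet points concrete, a split shipped to a mapper might look like the dictionary below. The field names and the `fake_fetch` stand-in are hypothetical illustrations (a real mapper would issue the query through a MongoDB driver), but the shape – host, database, collection, and a range query – matches the description above:

```python
# Hypothetical shape of an "input split" as shipped to a Hadoop mapper:
# enough information for the mapper to read just its slice of documents.
split = {
    "host": "shard1.example.com:27017",
    "database": "enron_mail",
    "collection": "messages",
    "query": {"_id": {"$gte": "alice", "$lt": "carol"}},
}

def read_split(split, fetch):
    """Mapper-side sketch: `fetch` stands in for a driver call that runs
    split["query"] against split["collection"] on split["host"]."""
    return list(fetch(split["host"], split["database"],
                      split["collection"], split["query"]))

# Fake driver call, for illustration only
def fake_fetch(host, db, coll, query):
    docs = [{"_id": "alice"}, {"_id": "bob"}, {"_id": "dave"}]
    lo = query["_id"]["$gte"]
    hi = query["_id"]["$lt"]
    return (d for d in docs if lo <= d["_id"] < hi)

print(read_split(split, fake_fetch))  # only alice and bob fall in this split
```

Because every mapper holds a non-overlapping query range, all of them can read from MongoDB in parallel without coordinating with each other.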

Slide 48

MongoDB Hadoop Connector In Action

Slide 49

Python Streaming
• The Hadoop Streaming interface is much easier to demo (it’s also my favorite feature, and was the hardest to implement)
  • Java gets a bit... “verbose” on slides versus Python
  • Java Hadoop + MongoDB integrates cleanly, though, for those inclined
• Map functions get an initial key of type Object and a value of type BSONObject
  • These represent _id and the full document, respectively
• Processing 1.75 gigabytes of the Enron Email Corpus (501,513 emails)
  • I ran this test on a 6-node Hadoop cluster
  • Grab your own copy of this dataset at: http://goo.gl/fSleC

Slide 50

A Sample Input Doc
{
  "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
  "body" : "Here is our forecast\n\n ",
  "subFolder" : "allen-p/_sent_mail",
  "mailbox" : "maildir",
  "filename" : "1.",
  "headers" : {
    "X-cc" : "",
    "From" : "[email protected]",
    "Subject" : "",
    "X-Folder" : "\\Phillip_Allen_Jan2002_1\\Allen, Phillip K.\\'Sent Mail",
    "Content-Transfer-Encoding" : "7bit",
    "X-bcc" : "",
    "To" : "[email protected]",
    "X-Origin" : "Allen-P",
    "X-FileName" : "pallen (Non-Privileged).pst",
    "X-From" : "Phillip K Allen",
    "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
    "X-To" : "Tim Belden ",
    "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
    "Content-Type" : "text/plain; charset=us-ascii",
    "Mime-Version" : "1.0"
  }
}

Slide 51

Setting up Hadoop Streaming
• Install the Python support module on each Hadoop node:
    $ sudo pip install pymongo_hadoop
• Build (or download) the Streaming module for the Hadoop adapter:
    $ git clone http://github.com/mongodb/mongo-hadoop.git
    $ ./sbt mongo-hadoop-streaming/assembly

Slide 52

Mapper Code
#!/usr/bin/env python
import sys
sys.path.append(".")

from pymongo_hadoop import BSONMapper

def mapper(documents):
    for doc in documents:
        if 'headers' in doc and 'To' in doc['headers'] and 'From' in doc['headers']:
            from_field = doc['headers']['From']
            to_field = doc['headers']['To']
            recips = [x.strip() for x in to_field.split(',')]
            for r in recips:
                yield {'_id': {'f': from_field, 't': r}, 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."

Slide 53

Reducer Code
#!/usr/bin/env python
import sys
sys.path.append(".")

from pymongo_hadoop import BSONReducer

def reducer(key, values):
    print >> sys.stderr, "Processing from/to %s" % str(key)
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'count': _count}

BSONReducer(reducer)

Slide 54

Running the MapReduce
hadoop jar mongo-hadoop-streaming-assembly-1.0.0.jar \
    -mapper /home/ec2-user/enron_map.py \
    -reducer /home/ec2-user/enron_reduce.py \
    -inputURI mongodb://test_mongodb:27020/enron_mail.messages \
    -outputURI mongodb://test_mongodb:27020/enron_mail.sender_map

Slide 55

Results!
mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 4 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 4 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 6 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 3 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
has more

Slide 56

Parallelism is Good
The input data was split into 44 pieces for parallel processing...
...coincidentally, there were exactly 44 chunks on my sharded setup.
Even with an unsharded collection, Mongo-Hadoop can calculate splits!

Slide 57

We aren’t restricted to Python
• For Mongo-Hadoop 1.0, Streaming only shipped with Python support
• Currently in git master, and due to be released with 1.1, is support for two additional languages:
  • Ruby (Tyler Brock – @tylerbrock)
  • Node.JS (Mike O’Brien – @mpobrien)
• The same Enron MapReduce job can be accomplished in either of these languages as well

Slide 58

Ruby + Mongo-Hadoop Streaming
• As there isn’t an official release of the Ruby support yet, you’ll need to build the gem by hand out of git
• As with Python, make sure you install this gem on each of your Hadoop nodes
• Once the gem is built & installed, you’ll have access to the mongo-hadoop module from Ruby

Slide 59

Enron Map from Ruby
#!/usr/bin/env ruby
require 'mongo-hadoop'

MongoHadoop.map do |document|
  if document.has_key?('headers')
    headers = document['headers']
    if ['To', 'From'].all? { |header| headers.has_key?(header) }
      to_field = headers['To']
      from_field = headers['From']
      recipients = to_field.split(',').map { |recipient| recipient.strip }
      recipients.map { |recipient| { :_id => { :f => from_field, :t => recipient }, :count => 1 } }
    end
  end
end

Slide 60

Enron Reduce from Ruby
#!/usr/bin/env ruby
require 'mongo-hadoop'

MongoHadoop.reduce do |key, values|
  count = values.reduce(0) { |sum, current| sum + current['count'] }
  { :_id => key, :count => count }
end

Slide 61

Running the Ruby MapReduce
hadoop jar mongo-hadoop-streaming-assembly-1.0.0.jar \
    -mapper examples/enron/enron_map.rb \
    -reducer examples/enron/enron_reduce.rb \
    -inputURI mongodb://127.0.0.1/enron_mail.messages \
    -outputURI mongodb://127.0.0.1/enron_mail.output

Slide 62

Node.JS + Mongo-Hadoop Streaming
• As there isn’t an official release of the Node.JS support yet, you’ll need to build the Node module by hand out of git
• As with Python and Ruby, make sure you install this module on each of your Hadoop nodes
• Once the module is built & installed, you’ll have access to the node_mongo_hadoop module from Node.JS

Slide 63

Enron Map from Node.JS
#!/usr/bin/env node
var node_mongo_hadoop = require('node_mongo_hadoop')

var trimString = function(str) {
  return String(str).replace(/^\s+|\s+$/g, '');
}

function mapFunc(doc, callback) {
  if (doc.headers && doc.headers.From && doc.headers.To) {
    var from_field = doc['headers']['From']
    var to_field = doc['headers']['To']
    to_field.split(',').forEach(function(to) {
      callback({'_id': {'f': from_field, 't': trimString(to)}, 'count': 1})
    });
  }
}

node_mongo_hadoop.MapBSONStream(mapFunc);

Slide 64

Enron Reduce from Node.JS
#!/usr/bin/env node
var node_mongo_hadoop = require('node_mongo_hadoop')

function reduceFunc(key, values, callback) {
  var count = 0;
  values.forEach(function(v) { count += v.count });
  callback({'_id': key, 'count': count});
}

node_mongo_hadoop.ReduceBSONStream(reduceFunc);

Slide 65

Running the Node.JS MapReduce
hadoop jar mongo-hadoop-streaming-assembly-1.0.0.jar \
    -mapper examples/enron/enron_map.js \
    -reducer examples/enron/enron_reduce.js \
    -inputURI mongodb://127.0.0.1/enron_mail.messages \
    -outputURI mongodb://127.0.0.1/enron_mail.output

Slide 66

[Joining the Hive mind]

Slide 67

Hive + MongoDB
• Over the past weekend, I ended up with a few spare hours and started playing with a frequently requested feature
• Hive is a Hadoop-based data warehousing system, providing a SQL-like language (dubbed “QL”)
  • Designed for large datasets stored on HDFS
  • Lots of SQL-like facilities such as data summarization, aggregation and analysis; all compile down to Hadoop MapReduce tasks
  • Custom user-defined functions can even replace inefficient Hive queries with raw MapReduce
• Many users have requested support for this with MongoDB data

Slide 68

Sticking BSON in the Hive
• Step 1 involved teaching Hive to read MongoDB backup files – essentially, raw BSON
  • While there are some APIs we could use to talk directly to MongoDB, we haven’t explored that yet
• With this code, it is possible to load a .bson file (typically produced by mongodump) directly into Hive and query it
  • No conversion to a “native” Hive format is needed – BSON is read directly
• Still needs some polish and tweaking, but this is now slated to be included in the upcoming 1.1 release

Slide 69

Loading BSON into Hive
• As Hive emulates a relational database, tables need a schema (we’re evaluating ways to ‘infer’ schema to make this more automatic)
• Let’s load some MongoDB collections into Hive and play with the data!

Slide 70

Loading BSON into Hive

Slide 71

Loading BSON into Hive
• We have BSON files to load; now we need to instruct Hive about their schemas...

Slide 72

Loading BSON into Hive

Slide 73

Defining Hive Schemas
• We’ve given Hive some instructions about the structure as well as the storage of our MongoDB files. Let’s look at “scores” closer:

CREATE TABLE scores (
    student int,
    name string,
    score int
)
ROW FORMAT SERDE "com.mongodb.hadoop.hive.BSONSerde"
STORED AS
    INPUTFORMAT "com.mongodb.hadoop.hive.input.BSONFileInputFormat"
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION "/Users/brendan/code/mongodb/mongo-hadoop/hive/demo/meta/scores";

• The first line defines the structure – with columns for ‘student’, ‘name’ and ‘score’, each having a SQL-like datatype
• ROW FORMAT SERDE instructs Hive to use a SerDe of ‘BSONSerde’
  • In Hive, a SerDe is a special codec that explains how to read and write (serialize and deserialize) a custom data format containing Hive rows
• We also need to tell Hive to use an INPUTFORMAT of ‘BSONFileInputFormat’, which tells it how to read BSON files off of disk into individual rows (the SerDe is the instructions for how to turn individual lines of BSON into a Hive-friendly format)
• Finally, we specify where Hive should store the metadata, etc. with LOCATION

Slide 74

Loading Data to Hive
• Finally, we need to load data into the Hive table from our raw BSON file:

hive> LOAD DATA LOCAL INPATH "dump/training/scores.bson" INTO TABLE scores;

• Now we can query!

Slide 75

Querying Hive

Slide 76

Querying Hive
• Most standard SQL-like queries work – though I’m not going to enumerate the ins and outs of HiveQL today
• What we can do with Hive that we can’t with MongoDB... is joins
• In addition to the scores data, I also created a collection of student IDs and randomly generated names. Let’s look at joining these to our scores in Hive

Slide 77

Joins from BSON + Hive

Slide 78

Joins from BSON + Hive
hive> SELECT u.firstName, u.lastName, u.sex, s.name, s.score FROM
    > scores s JOIN students u ON u.studentID = s.student
    > ORDER BY s.score DESC;
DELPHIA DOUIN Female exam 99
DOMINIQUE SUAZO Male essay 99
ETTIE BETZIG Female exam 99
ADOLFO PIRONE Male exam 99
IVORY NETTERS Male essay 99
RAFAEL HURLES Male essay 99
KRISTEN VALLERO Female exam 99
CONNIE KNAPPER Female quiz 99
JEANNA DIVELY Female exam 99
TRISTAN SEGAL Male exam 99
WILTON TRULOVE Male essay 99
THAO OTSMAN Female essay 99
CLARENCE STITZ Male quiz 99
LUIS GUAMAN Male exam 99
WILLARD RUSSAK Male quiz 99
MARCOS HOELLER Male quiz 99
TED BOTTCHER Male essay 99
LAKEISHA NAGAMINE Female essay 99
ALLEN HITT Male exam 99
MADELINE DAWKINS Female essay 99

Slide 79

This is just the beginning...

Slide 80

Looking Forward
• Mongo Hadoop Connector 1.0.0 is released and available
• Docs: http://api.mongodb.org/hadoop/
• Downloads & Code: http://github.com/mongodb/mongo-hadoop

Slide 81

Looking Forward
• Lots more coming; 1.1.0 expected in Summer 2012
• Support for reading from multiple input collections (“MultiMongo”)
• Static BSON support... read from and write to Mongo backup files!
  • S3 / HDFS stored, mongodump format
  • Great for big offline batch jobs (this is how Foursquare does it)
• Pig input (read from MongoDB into Pig)
• Performance improvements (e.g. pipelining BSON for streaming)
• Future: expanded ecosystem support (Cascading, Oozie, Mahout, etc.)

Slide 82

Looking Forward
• We are looking to grow our integration with Big Data
  • Not only Hadoop, but other data processing systems our users want, such as Storm, Disco and Spark
  • Initial Disco support (Nokia’s Python MapReduce framework) is almost complete; look for it this summer
• If you have other data processing toolchains you’d like to see integration with, let us know!

Slide 83

http://linkd.in/joinmongo
@mongodb
facebook.com/mongodb
Did I Mention We’re Hiring? http://www.10gen.com/careers (jobs of all sorts, all over the world!)
[Download the Hadoop Connector] http://github.com/mongodb/mongo-hadoop
[Docs] http://api.mongodb.org/hadoop/
*Contact Me* [email protected] (twitter: @rit)