Slide 1

Slide 1 text

An Introduction to MapReduce with MongoDB
Russell Smith
Friday, 20 April 12

Slide 2

Slide 2 text

/usr/bin/whoami
• Russell Smith
• Consultant for UKD1 Limited
• I specialise in helping companies going through rapid growth:
• Code, architecture, infrastructure, devops, sysops, capacity planning, etc.
• <3 Gearman, MongoDB, Neo4j, MySQL, Riak, Kohana, PHP, Debian, Puppet, etc.

Slide 3

Slide 3 text

What is MongoDB?
• A scalable, high-performance, open source, document-oriented database
• Stores JSON-like documents
• Indexable on any attribute (like MySQL)
• Built-in MapReduce

Slide 4

Slide 4 text

Requirements
• A running MongoDB server: http://www.mongodb.org/downloads
• Basic knowledge of MongoDB
• Basic JavaScript

Slide 5

Slide 5 text

What is MapReduce?
• Allows aggregating data in parallel
• Some built-in aggregation functions exist: distinct, count
• If you need to do something more, use either a query or MapReduce

Slide 6

Slide 6 text

How does it work?
• You write two functions
• You write them in JavaScript (currently)
• Map function: called once per document; emits a key and a value
• Reduce function: called per key emitted, with an array of that key's values; it may run more than once for the same key
• Optional finalize function: lets you post-process (for example, round) the reduced data
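The flow above can be sketched in plain JavaScript. This is only an in-memory illustration, not how the server implements it: `runMapReduce` is a hypothetical helper, the three documents are made-up sample data, and `emit` is passed to the map function as an argument here, whereas MongoDB provides it as a global.

```javascript
// Hypothetical in-memory stand-in for the MapReduce engine (illustration only).
function runMapReduce(docs, map, reduce, finalize) {
    var groups = {}; // key -> array of emitted values

    function emit(key, value) {
        if (!Object.prototype.hasOwnProperty.call(groups, key)) {
            groups[key] = [];
        }
        groups[key].push(value);
    }

    // Map phase: the current document is bound to `this`.
    docs.forEach(function (doc) {
        map.call(doc, emit);
    });

    // Reduce phase: a key with a single value skips reduce, as in MongoDB.
    var out = {};
    Object.keys(groups).forEach(function (key) {
        var values = groups[key];
        var reduced = values.length === 1 ? values[0] : reduce(key, values);
        out[key] = finalize ? finalize(key, reduced) : reduced;
    });
    return out;
}

// Made-up sample documents.
var docs = [
    { LCA_CASE_EMPLOYER_STATE: "TX" },
    { LCA_CASE_EMPLOYER_STATE: "TX" },
    { LCA_CASE_EMPLOYER_STATE: "NY" }
];

var mapFn = function (emit) {
    emit(this.LCA_CASE_EMPLOYER_STATE, 1);
};

var reduceFn = function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i];
    }
    return total;
};

var result = runMapReduce(docs, mapFn, reduceFn);
console.log(result);
```

Run under Node.js, this groups the emitted values by key before reducing, which is the part the server does for you.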

Slide 7

Slide 7 text

Some example data
• I downloaded the H1B (US temporary work visa) data: http://www.flcdatacenter.com/CaseH1B.aspx
• Imported the CSV data using the mongoimport command
• Total imported documents: ~335k

Slide 8

Slide 8 text

What do the documents look like?
• LCA_CASE_EMPLOYER_STATE
• STATUS
• LCA_CASE_SUBMIT / Decision_Date
• LCA_CASE_WAGE_RATE_FROM

{
  "_id" : ObjectId("4db7c981e243a6e23725570f"),
  "LCA_CASE_NUMBER" : "I-200-09132-243675",
  "STATUS" : "CERTIFIED",
  "LCA_CASE_SUBMIT" : "7/14/2010 9:06:36",
  "VISA_CLASS" : "H-1B",
  "LCA_CASE_EMPLOYMENT_START_DATE" : "12/15/2010 0:00:00",
  "LCA_CASE_EMPLOYMENT_END_DATE" : "12/15/2013 0:00:00",
  "LCA_CASE_EMPLOYER_NAME" : "BRITISH SCHOOL OF AMERICA, LLC",
  "LCA_CASE_EMPLOYER_ADDRESS" : "4211 WATONGA BLVD.",
  "LCA_CASE_EMPLOYER_CITY" : "HOUSTON",
  "LCA_CASE_EMPLOYER_STATE" : "TX",
  "LCA_CASE_EMPLOYER_POSTAL_CODE" : 77092,
  "LCA_CASE_SOC_CODE" : "25-2022.00",
  "LCA_CASE_SOC_NAME" : "Middle School Teachers, Except Special and Vocatio",
  "LCA_CASE_JOB_TITLE" : "MIDDLE SCHOOL TEACHER/IB COORDINATOR",
  "LCA_CASE_WAGE_RATE_FROM" : 51577.63,
  "LCA_CASE_WAGE_RATE_UNIT" : "Year",
  "FULL_TIME_POS" : "Y",
  "TOTAL_WORKERS" : 1,
  "LCA_CASE_WORKLOC1_CITY" : "HOUSTON",
  "LCA_CASE_WORKLOC1_STATE" : "TX",
  "PW_1" : 47827,
  "PW_UNIT_1" : "Year",
  "PW_SOURCE_1" : "OES",
  "OTHER_WAGE_SOURCE_1" : "OFLC ONLINE DATA CENTER",
  "YR_SOURCE_PUB_1" : 2010,
  "LCA_CASE_NAICS_CODE" : 611110,
  "Decision_Date" : "7/20/2010 0:00:00\r"
}

Slide 9

Slide 9 text

What can we do with the data?
• Work out:
  • Applications per state
  • Applications by status per state
  • Average time from submission to decision, by status
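The slides do not work through the third item, so here is a rough sketch of the date handling it would need. The dates in the sample document are strings such as "7/14/2010 9:06:36", and Decision_Date carries a trailing "\r" from the CSV import. `parseLcaDate` and `daysToDecision` are hypothetical helper names, and treating the timestamps as UTC is an assumption made only to keep the arithmetic simple.

```javascript
// Parse "M/D/YYYY H:MM:SS" (optionally followed by whitespace such as "\r")
// into UTC milliseconds. Returns null when the string does not match.
function parseLcaDate(s) {
    var m = /^(\d{1,2})\/(\d{1,2})\/(\d{4}) (\d{1,2}):(\d{2}):(\d{2})\s*$/.exec(s);
    if (!m) {
        return null;
    }
    // Assumption: timestamps are treated as UTC.
    return Date.UTC(+m[3], +m[1] - 1, +m[2], +m[4], +m[5], +m[6]);
}

// Days between submission and decision for one document, or null if unparseable.
function daysToDecision(doc) {
    var submitted = parseLcaDate(doc.LCA_CASE_SUBMIT);
    var decided = parseLcaDate(doc.Decision_Date);
    if (submitted === null || decided === null) {
        return null;
    }
    return (decided - submitted) / (24 * 60 * 60 * 1000);
}

// Field values taken from the sample document shown earlier.
var sample = {
    STATUS: "CERTIFIED",
    LCA_CASE_SUBMIT: "7/14/2010 9:06:36",
    Decision_Date: "7/20/2010 0:00:00\r"
};

var days = daysToDecision(sample);
console.log(sample.STATUS, days);
```

A map function could emit this value keyed by STATUS, skipping documents where the helper returns null.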

Slide 10

Slide 10 text

Applications by State
• Key will be LCA_CASE_EMPLOYER_STATE
• Assume (wrongly) one person per document

Slide 11

Slide 11 text

Map
• this is equal to the current document
• Emit a value of 1, as we are assuming a single H1B application per document

m = function () {
    emit(this.LCA_CASE_EMPLOYER_STATE, 1);
}

Slide 12

Slide 12 text

Reduce
• Return a value: the length of the array
• This works only while each value in the array is 1, i.e. as long as reduce runs once per key

r = function (k, v_arr) {
    return v_arr.length;
}
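One caveat with counting by array length: MongoDB may call reduce more than once for the same key, feeding an earlier result back in as one of the values. The sketch below imitates that with a hypothetical `reduceInBatches` helper (not a MongoDB function) and compares the length-based reduce with one that sums its input.

```javascript
// Reduce from the slide: counts by array length.
var countByLength = function (k, v_arr) {
    return v_arr.length;
};

// Alternative: sums the values, so earlier partial counts are carried forward.
var countBySum = function (k, v_arr) {
    var total = 0;
    for (var i = 0; i < v_arr.length; i++) {
        total += v_arr[i];
    }
    return total;
};

// Hypothetical helper: reduce in batches, then reduce the partial results again.
function reduceInBatches(reduce, key, values, batchSize) {
    var partials = [];
    for (var i = 0; i < values.length; i += batchSize) {
        partials.push(reduce(key, values.slice(i, i + batchSize)));
    }
    return reduce(key, partials);
}

var ones = [1, 1, 1, 1, 1]; // five applications emitted for one state

var viaLength = reduceInBatches(countByLength, "TX", ones, 2);
var viaSum = reduceInBatches(countBySum, "TX", ones, 2);

console.log("length-based:", viaLength, "sum-based:", viaSum);
```

In this simulation the length-based version counts the three partial results rather than the five applications, while the summing version is unaffected by batching.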

Slide 13

Slide 13 text

Executing
• This will execute the map/reduce
• Output goes to a collection named workers_by_state

db.text2010.mapReduce(m, r, {out: 'workers_by_state', keeptemp: true, verbose: true})

Slide 14

Slide 14 text

Result

{ "_id" : "NEW YORK", "value" : 512 }
{ "_id" : "IOWA", "value" : 15 }
{ "_id" : "KANSAS", "value" : 54 }
...

Slide 15

Slide 15 text

A more complex Map!
• The last example assumed one worker per document... which is wrong.
• We now emit the number of workers on each application, keyed by state

m = function () {
    emit(this.LCA_CASE_EMPLOYER_STATE, this.TOTAL_WORKERS);
}

Slide 16

Slide 16 text

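Reduce
• As the array now contains values other than 1, we have to iterate over it
• This is standard JavaScript

r = function (k, v_arr) {
    var total = 0;
    var len = v_arr.length;
    // Sum the worker counts emitted for this state
    for (var i = 0; i < len; i++) {
        total += v_arr[i];
    }
    return total;
}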

Slide 17

Slide 17 text

Average wage by VISA class and application status
• Assumptions:
  • People work ~40 hour weeks
  • Weekly wages are paid every week rather than only the weeks worked
  • 'Select Pay Range' seems to be the default option...

m = function () {
    var k = this.VISA_CLASS + ' ' + this.STATUS;
    // Normalise every wage to a yearly figure
    switch (this.LCA_CASE_WAGE_RATE_UNIT) {
        case 'Year': emit(k, this.LCA_CASE_WAGE_RATE_FROM); break;
        case 'Month': emit(k, this.LCA_CASE_WAGE_RATE_FROM * 12); break;
        case 'Bi-Weekly': emit(k, this.LCA_CASE_WAGE_RATE_FROM * 26); break;
        case 'Week': emit(k, this.LCA_CASE_WAGE_RATE_FROM * 52); break;
        case 'Hour': emit(k, this.LCA_CASE_WAGE_RATE_FROM * 40 * 52); break;
        default: emit(k, 0);
    }
}
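The same normalisation can be written as a lookup table, which keeps the assumptions (40-hour weeks, 52 paid weeks) in one place. This is just an alternative sketch: `PERIODS_PER_YEAR` and `annualWage` are names invented here, and unknown units such as 'Select Pay Range' still map to 0 as in the switch above.

```javascript
// Pay periods per year for each LCA_CASE_WAGE_RATE_UNIT value used on the slide.
var PERIODS_PER_YEAR = {
    'Year': 1,
    'Month': 12,
    'Bi-Weekly': 26,
    'Week': 52,
    'Hour': 40 * 52 // assumes ~40 hour weeks, every week of the year
};

// Hypothetical helper: convert a wage in the given unit to a yearly figure.
// Unknown units yield 0, mirroring the default branch of the switch.
function annualWage(unit, rate) {
    if (!Object.prototype.hasOwnProperty.call(PERIODS_PER_YEAR, unit)) {
        return 0;
    }
    return rate * PERIODS_PER_YEAR[unit];
}

console.log(annualWage('Hour', 25));
console.log(annualWage('Select Pay Range', 10));
```

Inside a map function this would be called as emit(k, annualWage(this.LCA_CASE_WAGE_RATE_UNIT, this.LCA_CASE_WAGE_RATE_FROM)). Note that emitting 0 for unknown units pulls the average down; skipping those documents is another option.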

Slide 18

Slide 18 text

Reduce
• Work out the average for each key
• Add each of the elements up
• Average them

r = function (k, v_arr) {
    var tot = 0;
    var len = v_arr.length;
    for (var i = 0; i < len; i++) {
        tot += v_arr[i];
    }
    return tot / len;
}

Slide 19

Slide 19 text

Finalize
• A finalize function may be run after reduction
• Called a single time per object
• The finalize function takes a key and a value, and returns a finalized value
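Finalize is a natural home for the division in an average. Averaging inside reduce, as on the previous slide, can drift if reduce runs more than once for a key, because it ends up averaging averages. A common alternative, sketched here under the assumption that map emits { sum: wage, count: 1 } instead of a bare number, is to carry the sum and count through reduce and divide once in finalize. The wages 10, 20 and 60 are made-up values.

```javascript
// Reduce: add up sums and counts. The output has the same shape as the input,
// so it stays correct if it is fed back into another reduce call.
var r = function (k, v_arr) {
    var out = { sum: 0, count: 0 };
    for (var i = 0; i < v_arr.length; i++) {
        out.sum += v_arr[i].sum;
        out.count += v_arr[i].count;
    }
    return out;
};

// Finalize: called once per key, turns the totals into an average.
var f = function (k, v) {
    return v.count > 0 ? v.sum / v.count : 0;
};

// Made-up emitted values for one key.
var emitted = [10, 20, 60].map(function (wage) {
    return { sum: wage, count: 1 };
});

// Imitate a re-reduce: reduce the first two values, then combine with the third.
var partial = r("H-1B CERTIFIED", emitted.slice(0, 2));
var combined = r("H-1B CERTIFIED", [partial, emitted[2]]);
var average = f("H-1B CERTIFIED", combined);

console.log(average);
```

With the server, the finalize function would be supplied through the finalize option of mapReduce.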

Slide 20

Slide 20 text

Options
• Persist the output
• Filter input documents
• Sort input documents
• JavaScript scope: allows you to pass in extra variables (cannot be changed at runtime?)
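As a rough illustration, these options are passed in the third argument to mapReduce, using the out, query, sort and scope fields. The collection name, filter and scope variable below are made up for the example, and the mapReduce call itself is left as a comment because it needs a running server and the m and r functions from the earlier slides.

```javascript
// Sketch of an options document; the specific values are illustrative assumptions.
var options = {
    out: 'certified_workers_by_state',     // persist output (hypothetical collection name)
    query: { STATUS: 'CERTIFIED' },        // filter input documents
    sort: { LCA_CASE_EMPLOYER_STATE: 1 },  // sort input documents
    scope: { hoursPerWeek: 40 }            // extra variable visible to map/reduce/finalize
};

// In the mongo shell this would be used as:
// db.text2010.mapReduce(m, r, options);

console.log(Object.keys(options).join(', '));
```

A variable passed through scope can then be read by name inside the functions, for example hoursPerWeek in place of the literal 40 in the wage map.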

Slide 21

Slide 21 text

Current limitations / Watch for
• Single threaded per node (which sucks): https://jira.mongodb.org/browse/SERVER-463
• Language is restricted to JavaScript (which sucks): https://jira.mongodb.org/browse/SERVER-699
• Does not use secondaries in replica sets
• From 1.7.3 on, you can reduce into an existing collection

Slide 22

Slide 22 text

...
• Doesn't allow creation of full documents (which can be a pain for permanent MapReduce collections if using libraries): https://jira.mongodb.org/browse/SERVER-2517
• Slow: roughly 20-30x slower than Hadoop with 1.8: https://jira.mongodb.org/browse/SERVER-3055

Slide 23

Slide 23 text

Using MongoDB with Hadoop
• https://github.com/mongodb/mongo-hadoop
• Open source
• Requires knowledge of Java
• Working input and output adapters for MongoDB are provided
• Alpha quality from what I can tell

Slide 24

Slide 24 text

The future

Slide 25

Slide 25 text

1.9 / 2.0
• V8 is replacing SpiderMonkey
• Recent Hadoop provider
• Sharded output collections
• Improved yielding (concurrency)

Slide 26

Slide 26 text

> 2.0
• Multi-threaded
• Alternative languages: https://jira.mongodb.org/browse/SERVER-699
• ~2.2: native aggregation framework
• JS-only mode is faster for lighter jobs: https://jira.mongodb.org/browse/SERVER-2976

Slide 27

Slide 27 text

Further reading
• I’ve only touched on the details, but this should be enough to get you interested in / started with MongoDB MapReduce. Some of the missing stuff:
• Finalize functions: http://bit.ly/gEfKOr
• Some more examples: http://bit.ly/ig1Yfj