MongoDB - Map Reduce

Russell Smith

April 20, 2012

Transcript

  1. /usr/bin/whoami
    • Russell Smith
    • Consultant for UKD1 Limited
    • I specialise in helping companies going through rapid growth:
    • Code, architecture, infrastructure, devops, sysops, capacity planning, etc.
    • <3 Gearman, MongoDB, Neo4j, MySQL, Riak, Kohana, PHP, Debian, Puppet, etc.
    Friday, 20 April 12
  2. What is MongoDB
    • A scalable, high-performance, open source, document-oriented database
    • Stores JSON-like documents
    • Indexable on any attribute (like MySQL)
    • Built-in MapReduce
  3. What is Map Reduce
    • Allows aggregating data in parallel
    • Some built-in aggregation functions exist: distinct, count
    • If you need to do something more, either query and aggregate in your application, or use MapReduce
  4. How does it work?
    • You write two functions
    • You write them in JavaScript (currently)
    • Map function: called once per document; emits a key + a value
    • Reduce function: called per emitted key, with an array of the values for that key (it may be called more than once for the same key)
    • Optional finalize function, allowing a final tidy-up of the reduced data
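To make the flow of the three functions concrete, here is a minimal in-memory sketch in plain JavaScript. It is an illustration only, not how MongoDB implements mapReduce: `runMapReduce`, the module-level `emit` stand-in and the sample documents are all hypothetical names made up for this example.

```javascript
let emit; // stand-in for the emit() that MongoDB provides to map functions

// Hypothetical helper: runs map, groups values by key, then reduce and finalize.
function runMapReduce(docs, map, reduce, finalize) {
  const groups = new Map(); // key -> array of emitted values
  emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  for (const doc of docs) {
    map.call(doc); // `this` inside map is the current document
  }
  const results = [];
  for (const [key, values] of groups) {
    // MongoDB skips reduce for keys with a single value; mirror that here.
    let value = values.length === 1 ? values[0] : reduce(key, values);
    if (finalize) value = finalize(key, value);
    results.push({ _id: key, value: value });
  }
  return results;
}

// Made-up sample documents, not taken from the H1B data set.
const docs = [{ state: "TX" }, { state: "NY" }, { state: "TX" }];
const map = function () { emit(this.state, 1); };
const reduce = function (key, values) {
  let total = 0;
  for (const v of values) total += v;
  return total;
};
const results = runMapReduce(docs, map, reduce);
console.log(results);
```

The sketch runs everything in one pass and in one process; a real engine is free to split the reduce work into batches, which is why the reduce output has to be usable as reduce input.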
  5. Some example data
    • I downloaded the H1B (US temporary work visa) data: http://www.flcdatacenter.com/CaseH1B.aspx
    • Imported the CSV data using the mongoimport command
    • Total imported documents: ~335k
  6. What do the documents look like?
    • LCA_CASE_EMPLOYER_STATE
    • STATUS
    • LCA_CASE_SUBMIT / Decision_Date
    • LCA_CASE_WAGE_RATE_FROM
        {
            "_id" : ObjectId("4db7c981e243a6e23725570f"),
            "LCA_CASE_NUMBER" : "I-200-09132-243675",
            "STATUS" : "CERTIFIED",
            "LCA_CASE_SUBMIT" : "7/14/2010 9:06:36",
            "VISA_CLASS" : "H-1B",
            "LCA_CASE_EMPLOYMENT_START_DATE" : "12/15/2010 0:00:00",
            "LCA_CASE_EMPLOYMENT_END_DATE" : "12/15/2013 0:00:00",
            "LCA_CASE_EMPLOYER_NAME" : "BRITISH SCHOOL OF AMERICA, LLC",
            "LCA_CASE_EMPLOYER_ADDRESS" : "4211 WATONGA BLVD.",
            "LCA_CASE_EMPLOYER_CITY" : "HOUSTON",
            "LCA_CASE_EMPLOYER_STATE" : "TX",
            "LCA_CASE_EMPLOYER_POSTAL_CODE" : 77092,
            "LCA_CASE_SOC_CODE" : "25-2022.00",
            "LCA_CASE_SOC_NAME" : "Middle School Teachers, Except Special and Vocatio",
            "LCA_CASE_JOB_TITLE" : "MIDDLE SCHOOL TEACHER/IB COORDINATOR",
            "LCA_CASE_WAGE_RATE_FROM" : 51577.63,
            "LCA_CASE_WAGE_RATE_UNIT" : "Year",
            "FULL_TIME_POS" : "Y",
            "TOTAL_WORKERS" : 1,
            "LCA_CASE_WORKLOC1_CITY" : "HOUSTON",
            "LCA_CASE_WORKLOC1_STATE" : "TX",
            "PW_1" : 47827,
            "PW_UNIT_1" : "Year",
            "PW_SOURCE_1" : "OES",
            "OTHER_WAGE_SOURCE_1" : "OFLC ONLINE DATA CENTER",
            "YR_SOURCE_PUB_1" : 2010,
            "LCA_CASE_NAICS_CODE" : 611110,
            "Decision_Date" : "7/20/2010 0:00:00\r"
        }
  7. What can we do with the data?
    • Work out:
    • Applications per state
    • Applications by status per state
    • Average time from submission to decision, by status
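The last item needs the two date strings turned into a duration before anything can be averaged. The following is a rough sketch of a helper a map function could use. `parseCaseDate` and `daysToDecision` are hypothetical names, the month/day/year order is inferred from the sample document, and the timestamps are treated as UTC purely as an assumption, since the data carries no time zone.

```javascript
// Assumption: dates look like "7/14/2010 9:06:36" (month/day/year), possibly
// with trailing whitespace such as the "\r" left over from the CSV import.
function parseCaseDate(text) {
  const m = /^(\d{1,2})\/(\d{1,2})\/(\d{4}) (\d{1,2}):(\d{2}):(\d{2})$/.exec(text.trim());
  if (m === null) return null; // unparseable value
  // Assumption: treat the timestamp as UTC, as no time zone is given.
  return Date.UTC(+m[3], +m[1] - 1, +m[2], +m[4], +m[5], +m[6]);
}

// Days between submission and decision, or null if either date is unusable.
function daysToDecision(doc) {
  const submitted = parseCaseDate(doc.LCA_CASE_SUBMIT);
  const decided = parseCaseDate(doc.Decision_Date);
  if (submitted === null || decided === null) return null;
  return (decided - submitted) / (24 * 60 * 60 * 1000);
}

// The two date fields from the sample document in this deck.
const sample = {
  STATUS: "CERTIFIED",
  LCA_CASE_SUBMIT: "7/14/2010 9:06:36",
  Decision_Date: "7/20/2010 0:00:00\r"
};
const days = daysToDecision(sample);
console.log(sample.STATUS, days.toFixed(2), "days");
```

Inside MongoDB the helper would have to be defined within the map function (or otherwise made available to it), and the map would emit the number of days keyed on `this.STATUS`.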
  8. Applications by State
    • Key will be LCA_CASE_EMPLOYER_STATE
    • Assume (wrongly) one person per document
  9. Map
    • `this` is the current document
    • Emit a value of 1, as we are assuming a single H1B application per document
        m = function () {
            emit(this.LCA_CASE_EMPLOYER_STATE, 1);
        }
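  10. Reduce
    • Return a value: the length of the array
    • This works as long as each value in the array is 1
    • Caveat: that only holds if reduce runs once per key; MongoDB may feed earlier reduce output back into reduce, so summing the values is the safer choice
        r = function (k, v_arr) {
            return v_arr.length;
        }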
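Because reduce can be handed its own earlier output, counting by array length can go wrong. The sketch below simulates one possible batching in plain JavaScript; `reduceInBatches` and the batch size are made up for the illustration and say nothing about how MongoDB actually splits the work.

```javascript
// The length-based reduce from the slide, next to a sum-based alternative.
const countByLength = function (k, v_arr) { return v_arr.length; };
const countBySum = function (k, v_arr) {
  let total = 0;
  for (let i = 0; i < v_arr.length; i++) total += v_arr[i];
  return total;
};

// Hypothetical helper: reduce the values in batches, then reduce the partial
// results again, which is one way an engine may choose to split the work.
function reduceInBatches(reduceFn, key, values, batchSize) {
  const partials = [];
  for (let i = 0; i < values.length; i += batchSize) {
    partials.push(reduceFn(key, values.slice(i, i + batchSize)));
  }
  return reduceFn(key, partials);
}

const ones = [1, 1, 1, 1, 1]; // five documents emitted 1 for the same key
const byLength = reduceInBatches(countByLength, "TX", ones, 3);
const bySum = reduceInBatches(countBySum, "TX", ones, 3);
console.log(byLength, bySum); // 2 5
```

The length version counts the two partial results rather than the five documents; the sum version gives the same answer however the values are batched.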
  11. Executing
    • This will execute the map/reduce
    • Output goes to a collection named workers_by_state
        db.text2010.mapReduce(m, r, {out: 'workers_by_state', keeptemp: true, verbose: true})
  12. Result
        { "_id" : "NEW YORK", "value" : 512 }
        { "_id" : "IOWA", "value" : 15 }
        { "_id" : "KANSAS", "value" : 54 }
        ...
  13. A more complex Map!
    • The last example assumed one worker per document, which is wrong
    • We now emit the document's TOTAL_WORKERS as the value for each state
        m = function () {
            emit(this.LCA_CASE_EMPLOYER_STATE, this.TOTAL_WORKERS);
        }
  14. Reduce
    • As the array now contains values other than 1, we have to iterate over it
    • This is standard JavaScript
        r = function (k, v_arr) {
            var total = 0;
            var len = v_arr.length;
            // Sum every emitted worker count for this key
            for (var i = 0; i < len; i++) {
                total = total + v_arr[i];
            }
            return total;
        }
  15. VISA Class by Application Status by Average wage
    • Assumptions:
    • People work ~40 hour weeks
    • Weekly wages are paid every week rather than only the weeks worked
    • 'Select Pay Range' seems to be the default option...
        m = function () {
            var k = this.VISA_CLASS + ' ' + this.STATUS;
            // Normalise every wage to a yearly figure
            switch (this.LCA_CASE_WAGE_RATE_UNIT) {
                case 'Year':
                    emit(k, this.LCA_CASE_WAGE_RATE_FROM);
                    break;
                case 'Month':
                    emit(k, this.LCA_CASE_WAGE_RATE_FROM * 12);
                    break;
                case 'Bi-Weekly':
                    emit(k, this.LCA_CASE_WAGE_RATE_FROM * 26);
                    break;
                case 'Week':
                    emit(k, this.LCA_CASE_WAGE_RATE_FROM * 52);
                    break;
                case 'Hour':
                    emit(k, this.LCA_CASE_WAGE_RATE_FROM * 40 * 52);
                    break;
                default:
                    emit(k, 0);
            }
        }
  16. Reduce
    • Work out the average for each key
    • Add each of the elements up
    • Average them
    • Caveat: if reduce is called more than once for a key, this becomes an average of averages, which is not the overall average
        r = function (k, v_arr) {
            var tot = 0;
            var len = v_arr.length;
            for (var i = 0; i < len; i++) {
                tot += v_arr[i];
            }
            return tot / len;
        }
  17. Finalize
    • A finalize function may be run after reduction
    • Called a single time per output object (once per key)
    • The finalize function takes a key and a value, and returns a finalized value
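Finalize is a natural home for the division in an average. One common pattern, sketched here under assumptions, is to have map emit `{sum, count}` pairs, let reduce add them up, and divide only once in finalize. The names `reduceWage` and `finalizeWage` and the three wage figures are made up for the example.

```javascript
// Reduce returns the same {sum, count} shape that map would emit, so its
// output can safely be passed back in as input.
const reduceWage = function (k, v_arr) {
  const out = { sum: 0, count: 0 };
  for (let i = 0; i < v_arr.length; i++) {
    out.sum += v_arr[i].sum;
    out.count += v_arr[i].count;
  }
  return out;
};

// Finalize runs once per key, after all reducing is done.
const finalizeWage = function (k, v) {
  return v.count > 0 ? v.sum / v.count : 0;
};

// Made-up wages; map would emit {sum: wage, count: 1} for each document.
const emitted = [10, 20, 60].map(function (w) { return { sum: w, count: 1 }; });

// Simulate a re-reduce: the first two values are reduced, then combined
// with the third in a second reduce call.
const partial = reduceWage("H-1B CERTIFIED", emitted.slice(0, 2));
const combined = reduceWage("H-1B CERTIFIED", [partial, emitted[2]]);
const average = finalizeWage("H-1B CERTIFIED", combined);
console.log(average); // 30

// For comparison, averaging the partial average with the third value:
const averageOfAverages = ((10 + 20) / 2 + 60) / 2;
console.log(averageOfAverages); // 37.5
```

Carrying the count alongside the sum keeps the result independent of how many times reduce is invoked.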
  18. Options
    • Persist the output
    • Filtering input documents
    • Sorting input documents
    • JavaScript scope: allows you to pass in extra variables (cannot be changed at runtime?)
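As a rough sketch, these options map onto fields of the third argument to `mapReduce`. The option names `out`, `query`, `sort`, `scope` and `finalize` are real; the output collection name, the filter, the `hoursPerWeek` variable and the finalize body are assumptions made up for the illustration. The shell call itself is left as a comment because it needs a running MongoDB.

```javascript
// Hypothetical options for a run over the H1B collection.
const options = {
  out: "wages_by_status",                 // persist the output (assumed collection name)
  query: { STATUS: "CERTIFIED" },         // filter the input documents
  sort: { LCA_CASE_EMPLOYER_STATE: 1 },   // sort the input documents
  scope: { hoursPerWeek: 40 },            // extra variable visible to map/reduce/finalize
  finalize: function (k, v) {             // optional final pass per key
    return Math.round(v);
  }
};

// In the mongo shell, with m and r defined as on the earlier slides:
//   db.text2010.mapReduce(m, r, options)
console.log(Object.keys(options).join(", "));
```

With `scope` in place, a map function could refer to `hoursPerWeek` instead of hard-coding 40.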
  19. Current limitations / Watch for
    • Single threaded per node (which sucks) https://jira.mongodb.org/browse/SERVER-463
    • Language is restricted to JavaScript (which sucks) https://jira.mongodb.org/browse/SERVER-699
    • Does not use secondaries in replica sets
    • From 1.7.3 on, you can reduce into an existing collection
  20. ...
    • Doesn't allow creation of full documents (which can be a pain for permanent MapReduce collections if using libraries) https://jira.mongodb.org/browse/SERVER-2517
    • Slow: ~20-30x slower than Hadoop with 1.8 https://jira.mongodb.org/browse/SERVER-3055
  21. Using MongoDB with Hadoop
    • https://github.com/mongodb/mongo-hadoop
    • Open source
    • Requires knowledge of Java
    • Working Input and Output adapters for MongoDB are provided
    • Alpha quality from what I can tell
  22. 1.9 / 2.0
    • V8 is replacing SpiderMonkey
    • Recent Hadoop provider
    • Sharded output collections
    • Improved yielding (concurrency)
  23. > 2.0
    • Multi-threaded
    • Alternative languages https://jira.mongodb.org/browse/SERVER-699
    • ~2.2 native aggregation framework
    • JS-only mode is faster for lighter jobs https://jira.mongodb.org/browse/SERVER-2976
  24. Further reading
    • I've only touched on the details, but this should be enough to get you interested in / started with MongoDB Map Reduce. Some of the missing stuff:
    • Finalize functions - http://bit.ly/gEfKOr
    • Some more examples - http://bit.ly/ig1Yfj