MongoDB - Map Reduce

Russell Smith

April 20, 2012

Transcript

  1. An Introduction to
    MapReduce with MongoDB
    Russell Smith
    Friday, 20 April 12

  2. /usr/bin/whoami
    • Russell Smith
    • Consultant for UKD1 Limited
    • I specialise in helping companies going through rapid growth:
    • Code, architecture, infrastructure, devops, sysops, capacity planning, etc
    • <3 Gearman, MongoDB, Neo4j, MySQL, Riak, Kohana, PHP, Debian, Puppet, etc...

  3. What is MongoDB
    • A scalable, high-performance, open source, document-oriented
    database.
    • Stores JSON-like documents
    • Indexable on any attribute (like MySQL)
    • Built-in MapReduce

  4. Requirements
    • A running MongoDB server
    http://www.mongodb.org/downloads
    • Basic knowledge of MongoDB
    • Basic JavaScript

  5. What is Map Reduce
    • Allows aggregating data in parallel
    • Some built-in aggregation functions exist:
    distinct, count
    • If you need to do something more, either query and aggregate client-side, or use MapReduce

  6. How does it work?
    • You write two functions
    • You write them in JavaScript (currently)
    • Map function:
    Called once per document; emits zero or more key/value pairs
    • Reduce function:
    Called per emitted key with an array of values; it may be called more than once for the same key, so its output must be usable as one of its inputs
    • Optional finalize function for post-processing the reduced data (e.g. rounding)
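
To make the flow above concrete, here is a toy in-memory runner in plain JavaScript. It is only a sketch of the idea, not how MongoDB implements it: runMapReduce is a made-up helper, and emit is passed to map as an argument here, whereas in MongoDB it is a global.

```javascript
// Toy MapReduce runner (illustration only; runMapReduce is a hypothetical helper).
function runMapReduce(docs, map, reduce) {
  var groups = {};                          // key -> array of emitted values
  function emit(key, value) {
    (groups[key] = groups[key] || []).push(value);
  }
  docs.forEach(function (doc) {
    map.call(doc, emit);                    // `this` is the current document
  });
  var out = {};
  Object.keys(groups).forEach(function (key) {
    out[key] = reduce(key, groups[key]);    // one reduce call per key in this toy version
  });
  return out;
}

// Made-up sample documents using field names from the H1B data set.
var docs = [
  { LCA_CASE_EMPLOYER_STATE: 'TX', TOTAL_WORKERS: 1 },
  { LCA_CASE_EMPLOYER_STATE: 'TX', TOTAL_WORKERS: 3 },
  { LCA_CASE_EMPLOYER_STATE: 'NY', TOTAL_WORKERS: 2 }
];

var result = runMapReduce(
  docs,
  function (emit) { emit(this.LCA_CASE_EMPLOYER_STATE, this.TOTAL_WORKERS); },
  function (key, values) {
    return values.reduce(function (a, b) { return a + b; }, 0);
  }
);

console.log(result);
```

The real server groups, batches and may re-reduce values, which this toy version skips.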

  7. Some example data
    • I downloaded the H1B (US temporary work visa) data
    http://www.flcdatacenter.com/CaseH1B.aspx
    • Imported the CSV data using mongoimport command
    • Total imported documents ~335k

  8. What do the documents look like?
    • LCA_CASE_EMPLOYER_STATE
    • STATUS
    • LCA_CASE_SUBMIT / Decision_Date
    • LCA_CASE_WAGE_RATE_FROM
    {
        "_id" : ObjectId("4db7c981e243a6e23725570f"),
        "LCA_CASE_NUMBER" : "I-200-09132-243675",
        "STATUS" : "CERTIFIED",
        "LCA_CASE_SUBMIT" : "7/14/2010 9:06:36",
        "VISA_CLASS" : "H-1B",
        "LCA_CASE_EMPLOYMENT_START_DATE" : "12/15/2010 0:00:00",
        "LCA_CASE_EMPLOYMENT_END_DATE" : "12/15/2013 0:00:00",
        "LCA_CASE_EMPLOYER_NAME" : "BRITISH SCHOOL OF AMERICA, LLC",
        "LCA_CASE_EMPLOYER_ADDRESS" : "4211 WATONGA BLVD.",
        "LCA_CASE_EMPLOYER_CITY" : "HOUSTON",
        "LCA_CASE_EMPLOYER_STATE" : "TX",
        "LCA_CASE_EMPLOYER_POSTAL_CODE" : 77092,
        "LCA_CASE_SOC_CODE" : "25-2022.00",
        "LCA_CASE_SOC_NAME" : "Middle School Teachers, Except Special and Vocatio",
        "LCA_CASE_JOB_TITLE" : "MIDDLE SCHOOL TEACHER/IB COORDINATOR",
        "LCA_CASE_WAGE_RATE_FROM" : 51577.63,
        "LCA_CASE_WAGE_RATE_UNIT" : "Year",
        "FULL_TIME_POS" : "Y",
        "TOTAL_WORKERS" : 1,
        "LCA_CASE_WORKLOC1_CITY" : "HOUSTON",
        "LCA_CASE_WORKLOC1_STATE" : "TX",
        "PW_1" : 47827,
        "PW_UNIT_1" : "Year",
        "PW_SOURCE_1" : "OES",
        "OTHER_WAGE_SOURCE_1" : "OFLC ONLINE DATA CENTER",
        "YR_SOURCE_PUB_1" : 2010,
        "LCA_CASE_NAICS_CODE" : 611110,
        "Decision_Date" : "7/20/2010 0:00:00\r"
    }

  9. What can we do with the data?
    • Work out the:
    • Applications per state
    • Applications by status per state
    • Average time from submission to decision, by status
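
The time-from-submission-to-decision idea needs the date strings turned into numbers first. A minimal sketch, assuming every document uses the "M/D/YYYY H:MM:SS" format seen in the sample document; parseCaseDate and daysToDecision are hypothetical helper names, and the times are treated as UTC purely for simplicity.

```javascript
// Parse "M/D/YYYY H:MM:SS" into milliseconds since the epoch (treated as UTC).
// trim() drops the stray "\r" that the CSV import left on Decision_Date.
function parseCaseDate(text) {
  var m = /^(\d{1,2})\/(\d{1,2})\/(\d{4}) (\d{1,2}):(\d{2}):(\d{2})/.exec(String(text).trim());
  if (!m) { return null; }
  return Date.UTC(+m[3], +m[1] - 1, +m[2], +m[4], +m[5], +m[6]);
}

// Days between submission and decision, or null if either date is unparseable.
function daysToDecision(doc) {
  var submitted = parseCaseDate(doc.LCA_CASE_SUBMIT);
  var decided = parseCaseDate(doc.Decision_Date);
  if (submitted === null || decided === null) { return null; }
  return (decided - submitted) / (24 * 60 * 60 * 1000);
}

// Values copied from the sample document.
var sample = {
  LCA_CASE_SUBMIT: '7/14/2010 9:06:36',
  Decision_Date: '7/20/2010 0:00:00\r'
};
var days = daysToDecision(sample);
console.log(days.toFixed(2) + ' days');
```

A map function could then emit the STATUS as the key and this number as the value.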

  10. Applications by State
    • Key will be LCA_CASE_EMPLOYER_STATE
    • Assume (wrongly) one person per document

  11. Map
    • this refers to the current document
    • emit a value of 1, as we are assuming a
    single H1B app per document
    m = function () {
        emit(this.LCA_CASE_EMPLOYER_STATE, 1);
    }

  12. Reduce
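    • Return a value: the sum of the array
    • Every emitted value is 1, but MongoDB can feed earlier reduce results back in for the same key, so sum the values rather than returning the array length
    r = function (k, v_arr) {
        // Array.sum is a helper provided by MongoDB's JavaScript environment
        return Array.sum(v_arr);
    }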

  13. Executing
    • This will execute the map/reduce
    • Output goes to a collection named
    workers_by_state
    db.text2010.mapReduce(m, r, {
        out: 'workers_by_state',
        keeptemp: true,
        verbose: true
    })

  14. Result
    { "_id" : "NEW YORK", "value" : 512 }
    { "_id" : "IOWA", "value" : 15 }
    { "_id" : "KANSAS", "value" : 54 }
    ...

  15. A more complex Map!
    • The last example assumed one worker
    per document...which is wrong.
    • We now emit each document's TOTAL_WORKERS value, keyed by state
    m = function () {
        emit(this.LCA_CASE_EMPLOYER_STATE, this.TOTAL_WORKERS);
    }

  16. Reduce
    • As the array now contains values other
    than 1, we have to iterate over it
    • This is standard JavaScript
    r = function (k, v_arr) {
        var total = 0;
        var len = v_arr.length;
        for (var i = 0; i < len; i++) {
            total = total + v_arr[i];
        }
        return total;
    }
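
One thing worth checking for any reducer: MongoDB may call reduce several times for the same key and pass earlier results back in, so the output has to be a valid input. A small sketch in plain JavaScript, with made-up values, comparing a summing reducer against one that returns the array length:

```javascript
// Summing reducer: safe to re-reduce.
function sumReduce(key, values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}

// Length-based reducer: only correct if it is called exactly once per key.
function lengthReduce(key, values) {
  return values.length;
}

// Simulate reduce running on two batches, then once more on the partial results.
function reduceInTwoBatches(reduce, key, values) {
  var first = reduce(key, values.slice(0, 3));
  var second = reduce(key, values.slice(3));
  return reduce(key, [first, second]);
}

var workers = [1, 4, 2, 1, 3];            // made-up TOTAL_WORKERS values for one state
var ones = [1, 1, 1, 1, 1];               // one emitted 1 per document

var sumOnePass = sumReduce('TX', workers);
var sumTwoPass = reduceInTwoBatches(sumReduce, 'TX', workers);
var lenOnePass = lengthReduce('TX', ones);
var lenTwoPass = reduceInTwoBatches(lengthReduce, 'TX', ones);

console.log('sum:', sumOnePass, sumTwoPass);
console.log('length:', lenOnePass, lenTwoPass);
```

The summing reducer gives the same answer either way; the length-based one does not, which is why summing is the safer habit even when every value is 1.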

  17. Average wage by visa class and application status
    • Assumptions:
    • People work ~40 hour weeks
    • Weekly wages are paid every week
    rather than only the weeks worked
    • 'Select Pay Range' seems to be the
    default option...
    m = function () {
        // Key combines visa class and status, e.g. 'H-1B CERTIFIED'
        var k = this.VISA_CLASS + ' ' + this.STATUS;
        // Convert every wage to a yearly figure
        switch (this.LCA_CASE_WAGE_RATE_UNIT) {
            case 'Year':
                emit(k, this.LCA_CASE_WAGE_RATE_FROM);
                break;
            case 'Month':
                emit(k, this.LCA_CASE_WAGE_RATE_FROM * 12);
                break;
            case 'Bi-Weekly':
                emit(k, this.LCA_CASE_WAGE_RATE_FROM * 26);
                break;
            case 'Week':
                emit(k, this.LCA_CASE_WAGE_RATE_FROM * 52);
                break;
            case 'Hour':
                emit(k, this.LCA_CASE_WAGE_RATE_FROM * 40 * 52);
                break;
            default:
                // Unknown unit, e.g. 'Select Pay Range'
                emit(k, 0);
        }
    }
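
The unit conversion can also be written as a lookup table, which is easier to test outside MongoDB. A sketch under the same assumptions as the slide (40 hour weeks, paid every week); PERIODS_PER_YEAR and annualWage are hypothetical names, and unlike the map function above this version returns null for an unknown unit so the caller can decide whether to skip it.

```javascript
// Pay periods per year for each wage unit (same multipliers as the map function).
var PERIODS_PER_YEAR = {
  'Year': 1,
  'Month': 12,
  'Bi-Weekly': 26,
  'Week': 52,
  'Hour': 40 * 52
};

// Convert a wage to a yearly figure, or return null for an unrecognised unit.
function annualWage(rate, unit) {
  if (!Object.prototype.hasOwnProperty.call(PERIODS_PER_YEAR, unit)) {
    return null;
  }
  return rate * PERIODS_PER_YEAR[unit];
}

console.log(annualWage(25, 'Hour'));
console.log(annualWage(3000, 'Month'));
console.log(annualWage(1000, 'Select Pay Range'));
```

Skipping unknown units instead of emitting 0 would keep them from dragging the average down, at the cost of ignoring those documents.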

  18. Reduce
    • Work out the average for each key
    • Add each of the elements up
    • Average them
    r = function (k, v_arr) {
        var tot = 0;
        var len = v_arr.length;
        for (var i = 0; i < len; i++) {
            tot += v_arr[i];
        }
        return tot / len;
    }
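
A caveat on averaging inside reduce: if MongoDB re-reduces partial results for a key, you end up with an average of averages, which is not the true average. One common way around this (my suggestion, not something from the original job) is to carry a sum and a count through reduce and only divide in finalize. A sketch in plain JavaScript with made-up wages; reduceAvg and finalizeAvg are hypothetical names.

```javascript
// Each mapped value has the shape { sum: wage, count: 1 }.
function reduceAvg(key, values) {
  var out = { sum: 0, count: 0 };
  values.forEach(function (v) {
    out.sum += v.sum;
    out.count += v.count;
  });
  return out;                              // same shape as the inputs, so re-reduce is safe
}

// Divide once, at the very end.
function finalizeAvg(key, value) {
  return value.count ? value.sum / value.count : 0;
}

// Naive reducer that averages directly, for comparison.
function naiveAvg(key, values) {
  var tot = 0;
  for (var i = 0; i < values.length; i++) { tot += values[i]; }
  return tot / values.length;
}

var wages = [40000, 60000, 80000, 100000]; // made-up yearly wages for one key
var mapped = wages.map(function (w) { return { sum: w, count: 1 }; });

// Reduce the first three values, then re-reduce that partial result with the fourth.
var partial = reduceAvg('H-1B CERTIFIED', mapped.slice(0, 3));
var safeAverage = finalizeAvg('H-1B CERTIFIED', reduceAvg('H-1B CERTIFIED', [partial, mapped[3]]));

var naivePartial = naiveAvg('H-1B CERTIFIED', wages.slice(0, 3));
var naiveAverage = naiveAvg('H-1B CERTIFIED', [naivePartial, wages[3]]);

console.log(safeAverage, naiveAverage);
```

The sum/count version lands on the true average of the four wages; the naive version drifts because the last value gets as much weight as the first three combined.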

  19. Finalize
    • A finalize function may be run after reduction.
    • Called once per output key, after all reduce calls for that key
    • The finalize function takes a key and a value, and returns a finalized
    value.
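
As a small illustration, a finalize function could round an averaged wage to two decimal places. This is a sketch with a made-up input value; finalizeRound is a hypothetical name.

```javascript
// Round the reduced value for a key to two decimal places.
function finalizeRound(key, value) {
  return Math.round(value * 100) / 100;
}

var rounded = finalizeRound('H-1B CERTIFIED', 51577.6349);
console.log(rounded);
```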

  20. Options
    • Persist the output
    • Filtering input documents
    • Sorting input documents
    • JavaScript scope: lets you pass in extra variables (cannot be
    changed at runtime?)
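
Put together, an options object covering the items above might look like the sketch below. The option names (out, query, sort, scope) are the ones mapReduce accepts; the filter, sort field and scope variable are made-up values for illustration.

```javascript
// Options for a mapReduce call; all values here are illustrative assumptions.
var options = {
  out: { replace: 'workers_by_state' },        // persist the output in a named collection
  query: { STATUS: 'CERTIFIED' },              // only feed matching documents to map
  sort: { LCA_CASE_EMPLOYER_STATE: 1 },        // order of the input documents
  scope: { HOURS_PER_WEEK: 40 }                // extra variable visible inside map/reduce/finalize
};

// In the mongo shell this would be passed as the third argument:
//   db.text2010.mapReduce(m, r, options)
console.log(JSON.stringify(options, null, 2));
```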

  21. Current limitations / Watch for
    • Single threaded per node (which sucks)
    https://jira.mongodb.org/browse/SERVER-463
    • Language is restricted to JavaScript (which sucks)
    https://jira.mongodb.org/browse/SERVER-699
    • Does not use secondaries in replica sets
    • From 1.7.3 on, you can reduce into an existing collection

  22. ...
    • Doesn't allow creation of full documents (which can be a pain for
    permanent MapReduce collections if using libraries)
    https://jira.mongodb.org/browse/SERVER-2517
    • Slow: roughly 20-30x slower than Hadoop with 1.8
    https://jira.mongodb.org/browse/SERVER-3055

  23. Using MongoDB with Hadoop
    • https://github.com/mongodb/mongo-hadoop
    • Open source
    • Requires knowledge of Java
    • Working Input and Output adapters for MongoDB are provided
    • Alpha quality from what I can tell

  24. The future

  25. 1.9 / 2.0
    • V8 is replacing SpiderMonkey
    • Recent Hadoop provider
    • Sharded output collections
    • Improved yielding (concurrency)

  26. > 2.0
    • Multi-threaded
    • Alternative languages
    https://jira.mongodb.org/browse/SERVER-699
    • ~2.2 native aggregation framework
    • JS-only mode is faster for lighter jobs
    https://jira.mongodb.org/browse/SERVER-2976

  27. Further reading
    • I've only touched on the details, but this should be enough to get you
    interested in / started with MongoDB MapReduce. Some of the missing
    stuff:
    • Finalize functions - http://bit.ly/gEfKOr
    • Some more examples - http://bit.ly/ig1Yfj