
An Introduction to MapReduce (IPC2009 2009-11-15)

Workshop given at International PHP Conference 2009 in Karlsruhe, Germany.

David Zuelke

November 15, 2009

Transcript

  1. • /cloudera-training-0.3.2/
     • VMware for Windows, Linux (i386 or x86_64) or Mac OS from /vmware/ if you don’t have it.
     • For Fusion, go to vmware.com and get an evaluation key.
     • /php/ PLEASE COPY FROM THE HD
  2. SOME NUMBERS
     • Google
       • Data processed per month: 400 PB (in 2007!)
       • Average job size: 180 GB
     • Facebook
       • New data per day:
         • 200 GB (March 2008)
         • 2 TB (April 2009)
         • 4 TB (October 2009)
  3. Is data lost? Will other nodes in the grid have to re-start? How do you coordinate this?
  4. BASIC PRINCIPLE: MAPPER
     • A Mapper reads records and emits <key, value> pairs
     • Example: Apache access.log
       • Each line is a record
       • Extract client IP address and number of bytes transferred
       • Emit IP address as key, number of bytes as value
     • For hourly rotating logs, the job can be split across 24 nodes*
     * In practice, it’s a lot smarter than that
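The extraction step above (client IP and bytes from one access.log record) could look like this in PHP. The helper name parse_apache_log() matches the deck's later pseudocode slide, but its body is not shown anywhere in the deck, so this regex and the 'ip'/'bytes' field names are a sketch assuming Apache common log format:

```php
<?php
// Hypothetical body for the parse_apache_log() helper from the pseudocode:
// pull the client IP and the response size out of one access.log line.
// The regex and field names are assumptions, not from the slides.
function parse_apache_log($line_text)
{
    // Apache common log format: IP identd user [date] "request" status bytes
    $pattern = '/^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (\d+|-)/';
    if (!preg_match($pattern, $line_text, $m)) {
        return null; // not a well-formed log line
    }
    return [
        'ip'    => $m[1],
        'bytes' => $m[2] === '-' ? 0 : (int) $m[2], // "-" means no body sent
    ];
}

$parts = parse_apache_log(
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271'
);
echo $parts['ip'], ' ', $parts['bytes'], "\n"; // 212.122.174.13 18271
```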
  5. BASIC PRINCIPLE: REDUCER
     • A Reducer is given a key and all values for this specific key
     • Even if there are many Mappers on many computers, the results are aggregated before they are handed to Reducers
     • Example: Apache access.log
       • The Reducer is called once for each client IP (that’s our key), with a list of values (transferred bytes)
       • We simply sum up the bytes to get the total traffic per IP!
  6. EXAMPLE OF MAPPED INPUT

     IP              Bytes
     212.122.174.13  18271
     212.122.174.13  191726
     212.122.174.13  198
     74.119.8.111    91272
     74.119.8.111    8371
     212.122.174.13  43
  7. REDUCER WILL RECEIVE THIS

     IP              Bytes
     212.122.174.13  18271
     212.122.174.13  191726
     212.122.174.13  198
     212.122.174.13  43
     74.119.8.111    91272
     74.119.8.111    8371
  8. PSEUDOCODE

     function map($line_number, $line_text) {
         $parts = parse_apache_log($line_text);
         emit($parts['ip'], $parts['bytes']);
     }

     function reduce($key, $values) {
         $bytes = array_sum($values);
         emit($key, $bytes);
     }

     Input (access.log):
     212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271
     212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726
     212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198
     74.119.8.111 - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 91272
     74.119.8.111 - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 8371
     212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 43

     Output:
     212.122.174.13 210238
     74.119.8.111 99643
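The pseudocode slide can be simulated end-to-end in a single PHP process: map over every log line, group the emitted pairs by key (the "shuffle" Hadoop performs between the two phases), then reduce each group. The emit() collector and the driver loops are assumptions for illustration only; a real Hadoop job distributes all of these steps across nodes:

```php
<?php
// Minimal single-process simulation of the map -> shuffle -> reduce flow.
// emit() appends to a shared buffer; the driver loops stand in for Hadoop.

$emitted = [];

function emit($key, $value)
{
    global $emitted;
    $emitted[] = [$key, $value];
}

function map($line_number, $line_text)
{
    // inline stand-in for parse_apache_log(): the IP is the first field,
    // the byte count is the last field of the line
    preg_match('/^(\S+) .* (\d+)$/', $line_text, $m);
    emit($m[1], (int) $m[2]);
}

function reduce($key, $values)
{
    emit($key, array_sum($values));
}

$log = [
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271',
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726',
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198',
    '74.119.8.111 - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 91272',
    '74.119.8.111 - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 8371',
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 43',
];

// map phase: one call per record
foreach ($log as $i => $line) {
    map($i, $line);
}

// shuffle: group all emitted values by key
$grouped = [];
foreach ($emitted as [$key, $value]) {
    $grouped[$key][] = $value;
}
$emitted = [];

// reduce phase: one call per distinct key
foreach ($grouped as $key => $values) {
    reduce($key, $values);
}

foreach ($emitted as [$ip, $bytes]) {
    echo $ip, ' ', $bytes, "\n";
}
// prints:
// 212.122.174.13 210238
// 74.119.8.111 99643
```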
  9. HADOOP AT FACEBOOK
     • Predominantly used in combination with Hive (~95%)
     • 4800 cores with 12 TB of storage per node
     • Per day:
       • 4 TB of new data (compressed)
       • 135 TB of data scanned (compressed)
     • 7500+ Hive jobs per day, ~80k compute hours
     http://www.slideshare.net/cloudera/hw09-rethinking-the-data-warehouse-with-hadoop-and-hive
  10. HADOOP AT YAHOO!
      • Over 25,000 computers with over 100,000 CPUs
      • Biggest cluster:
        • 4000 nodes
        • 2x4 CPU cores each
        • 16 GB RAM each
      • Over 40% of jobs run using Pig
      http://wiki.apache.org/hadoop/PoweredBy
  11. THANK YOU!
      • http://hadoop.apache.org/ is the Hadoop project website
      • http://www.cloudera.com/hadoop-training has useful resources
      • Send me an e-mail: [email protected]
      • Follow @dzuelke on Twitter
      • Slides will be on SlideShare