Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Large-Scale Data Processing with Hadoop and PHP (PHPConJapan2011 2011-09-10)

David Zuelke
September 10, 2011

Large-Scale Data Processing with Hadoop and PHP (PHPConJapan2011 2011-09-10)

David Zuelke

September 10, 2011
Tweet

More Decks by David Zuelke

Other Decks in Programming

Transcript

  1. SOME NUMBERS • Facebook • New data per day: •

    200 GB (March 2008) • 2 TB (April 2009) • 4 TB (October 2009) • 12 TB (March 2010) • Google • Data processed per month: 400 PB (in 2007!) • Average job size: 180 GB
  2. is data lost? will other nodes in the grid have

    to re-start? how do you coordinate this?
  3. BASIC PRINCIPLE: MAPPER • A Mapper reads records and emits

    <key, value> pairs • Example: Apache access.log • Each line is a record • Extract client IP address and number of bytes transferred • Emit IP address as key, number of bytes as value • For hourly rotating logs, the job can be split across 24 nodes* * In pratice, it’s a lot smarter than that
  4. BASIC PRINCIPLE: REDUCER • A Reducer is given a key

    and all values for this specific key • Even if there are many Mappers on many computers; the results are aggregated before they are handed to Reducers • Example: Apache access.log • The Reducer is called once for each client IP (that’s our key), with a list of values (transferred bytes) • We simply sum up the bytes to get the total traffic per IP!
  5. EXAMPLE OF MAPPED INPUT IP Bytes 212.122.174.13 18271 212.122.174.13 191726

    212.122.174.13 198 74.119.8.111 91272 74.119.8.111 8371 212.122.174.13 43
  6. REDUCER WILL RECEIVE THIS IP Bytes 212.122.174.13 18271 212.122.174.13 191726

    212.122.174.13 198 212.122.174.13 43 74.119.8.111 91272 74.119.8.111 8371
  7. PSEUDOCODE function  map($line_number,  $line_text)  {    $parts  =  parse_apache_log($line_text);  

     emit($parts['ip'],  $parts['bytes']); } function  reduce($key,  $values)  {    $bytes  =  array_sum($values);    emit($key,  $bytes); } 212.122.174.13  210238 74.119.8.111      99643 212.122.174.13  -­‐  -­‐  [30/Oct/2009:18:14:32  +0100]  "GET  /foo  HTTP/1.1"  200  18271 212.122.174.13  -­‐  -­‐  [30/Oct/2009:18:14:32  +0100]  "GET  /bar  HTTP/1.1"  200  191726 212.122.174.13  -­‐  -­‐  [30/Oct/2009:18:14:32  +0100]  "GET  /baz  HTTP/1.1"  200  198 74.119.8.111      -­‐  -­‐  [30/Oct/2009:18:14:32  +0100]  "GET  /egg  HTTP/1.1"  200  43 74.119.8.111      -­‐  -­‐  [30/Oct/2009:18:14:32  +0100]  "GET  /moo  HTTP/1.1"  200  91272 212.122.174.13  -­‐  -­‐  [30/Oct/2009:18:14:32  +0100]  "GET  /yay  HTTP/1.1"  200  8371
  8. The name my kid gave a stuffed yellow elephant. Short,

    relatively easy to spell and pronounce, meaningless and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term. Doug Cutting
  9. HADOOP AT FACEBOOK (I) • Predominantly used in combination with

    Hive (~95%) • 8400 cores with ~12.5 PB of total storage • 8 cores, 12 TB storage and 32 GB RAM per node • 1x Gigabit Ethernet for each server in a rack • 4x Gigabit Ethernet from rack switch to core http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop Hadoop is aware of racks and locality of nodes
  10. HADOOP AT FACEBOOK (II) • Daily stats: • 25 TB

    logged by Scribe • 135 TB of compressed data scanned • 7500+ Hive jobs • ~80k compute hours • New data per day: • I/08: 200 GB • II/09: 2 TB (compressed) • III/09: 4 TB (compressed) • I/10: 12 TB (compressed) http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop
  11. HADOOP AT YAHOO! • Over 25,000 computers with over 100,000

    CPUs • Biggest Cluster: • 4000 Nodes • 2x4 CPU cores each • 16 GB RAM each • Over 40% of jobs run using Pig http://wiki.apache.org/hadoop/PoweredBy
  12. OTHER NOTABLE USERS • Twitter (storage, logging, analysis. Heavy users

    of Pig) • Rackspace (log analysis; data pumped into Lucene/Solr) • LinkedIn (friend suggestions) • Last.fm (charts, log analysis, A/B testing) • The New York Times (converted 4 TB of scans using EC2)
  13. BASIC RULES • Uses Input Formats to split up your

    data into single records • You can optimize using combiners to reduce locally on a node • Only possible in some cases, e.g. for max(), but not avg() • You can control partitioning of map output yourself • Rarely useful, the default partitioner (key hash) is enough • And a million other things that really don’t matter right now ;)
  14. HDFS • Stores data in blocks (default block size: 64

    MB) • Designed for very large data sets • Designed for streaming rather than random reads • Write-once, read-many (although appending is possible) • Capable of compression and other cool things
  15. HDFS CONCEPTS • Large blocks minimize amount of seeks, maximize

    throughput • Blocks are stored redundantly (3 replicas as default) • Aware of infrastructure characteristics (nodes, racks, ...) • Datanodes hold blocks • Namenode holds the metadata Critical component for an HDFS cluster (HA, SPOF)
  16. HADOOPHP • A little framework to help with writing mapred

    jobs in PHP • Takes care of input splitting, can do basic decoding et cetera • Automatically detects and handles Hadoop settings such as key length or field separators • Packages jobs as one .phar archive to ease deployment • Also creates a ready-to-rock shell script to invoke the job
  17. RESOURCES • http://www.cloudera.com/developers/learn-hadoop/ • Tom White: Hadoop. The Definitive Guide.

    O’Reilly, 2009 • http://www.cloudera.com/hadoop/ • Cloudera Distribution for Hadoop is easy to install and has all the stuff included: Hadoop, Hive, Flume, Sqoop, Oozie, …