
Big Data Analytics (with Hadoop and PHP) (DPC2013 2013-06-06)

Workshop presented at Dutch PHP Conference 2013 in Amsterdam, The Netherlands.


David Zuelke

June 06, 2013

Transcript

  1. BIG DATA ANALYTICS (WITH HADOOP AND PHP)

  2. David Zülke

  3. David Zuelke

  4. None
  5. http://en.wikipedia.org/wiki/File:München_Panorama.JPG

  6. Lead Architect

  7. None
  8. @dzuelke

  9. PROLOGUE The Big Data Challenge

  10. we want to process data

  11. how much data exactly?

  12. SOME NUMBERS
    • Facebook, ingest per day:
      • I/08: 200 GB
      • II/09: 2 TB compressed
      • I/10: 12 TB compressed
      • III/12: 500 TB
    • Google:
      • Data processed per month: 400 PB (in 2007!)
      • Average job size: 180 GB
  13. what if you have that much data?

  14. what if Google’s 180 GB per job is all the data you have?
  15. “No Problemo”, you say?

  16. reading 180 GB sequentially off a disk will take ~45 minutes (at a typical ~70 MB/s sustained read for a 2013-era disk, 180,000 MB ÷ 70 MB/s ≈ 43 minutes)
  17. and you only have 16 to 64 GB of RAM per computer
  18. so you can't process everything at once

  19. general rule of modern computers:

  20. data can be processed much faster than it can be read
  21. solution: parallelize your I/O

  22. but now you need to coordinate what you’re doing

  23. and that’s hard

  24. how do you avoid overloading your network?

  25. what if a node dies?

  26. is data lost? will other nodes in the grid have to re-start? how do you coordinate this?
  27. CHAPTER ONE Batch Processing of Big Data

  28. Hadoop is now the industry standard

  29. I wouldn’t bother with anything else

  30. ENTER: OUR HERO Introducing MapReduce

  31. in the olden days, the workload was distributed across a grid
  32. and the data was shipped around between nodes

  33. or even stored centrally on something like a SAN

  34. which was fine for small amounts of information

  35. but today, on the web, we have big data

  36. I/O bottleneck

  37. along came a Google publication in 2004

  38. MapReduce: Simplified Data Processing on Large Clusters http://labs.google.com/papers/mapreduce.html

  39. now the data is distributed

  40. computing happens on the nodes where the data already is

  41. processes are isolated and don’t communicate (share-nothing)

  42. BASIC MAPREDUCE FLOW
    1. A Mapper reads records and emits <key, value> pairs
      1. Input could be a web server log, with each line as a record
    2. A Reducer is given a key and all values for this specific key
      1. Even if there are many Mappers on many computers, the results are aggregated before they are handed to Reducers
    * In practice, it’s a lot smarter than that
  43. EXAMPLE OF MAPPED INPUT
    IP              Bytes
    212.122.174.13  18271
    212.122.174.13  191726
    212.122.174.13  198
    74.119.8.111    91272
    74.119.8.111    8371
    212.122.174.13  43
  44. REDUCER WILL RECEIVE THIS
    IP              Bytes
    212.122.174.13  18271
    212.122.174.13  191726
    212.122.174.13  198
    212.122.174.13  43
    74.119.8.111    91272
    74.119.8.111    8371
  45. AFTER REDUCTION
    IP              Bytes
    212.122.174.13  210238
    74.119.8.111    99643

  46. PSEUDOCODE
    function map($line_number, $line_text) {
        $parts = parse_apache_log($line_text);
        emit($parts['ip'], $parts['bytes']);
    }

    function reduce($key, $values) {
        $bytes = array_sum($values);
        emit($key, $bytes);
    }

    Input (Apache access log):
    212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271
    212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726
    212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198
    74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 91272
    74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 8371
    212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 43

    Output:
    212.122.174.13  210238
    74.119.8.111    99643
  47. A YELLOW ELEPHANT Introducing Apache Hadoop

  48. None
  49. Hadoop is a MapReduce framework

  50. it allows us to focus on writing Mappers, Reducers etc.

  51. and it works extremely well

  52. how well exactly?

  53. HADOOP AT FACEBOOK
    • Predominantly used in combination with Hive (~95%)
    • Largest cluster holds over 100 PB of data
    • Typically 8 cores, 12 TB storage and 32 GB RAM per node
    • 1x Gigabit Ethernet for each server in a rack
    • 4x Gigabit Ethernet from rack switch to core
    Hadoop is aware of racks and locality of nodes
  54. HADOOP AT YAHOO!
    • Over 25,000 computers with over 100,000 CPUs
    • Biggest cluster:
      • 4000 nodes
      • 2x4 CPU cores each
      • 16 GB RAM each
    • Over 40% of jobs run using Pig
    http://wiki.apache.org/hadoop/PoweredBy
  55. OTHER NOTABLE USERS
    • Twitter (storage, logging, analysis; heavy users of Pig)
    • Rackspace (log analysis; data pumped into Lucene/Solr)
    • LinkedIn (contact suggestions)
    • Last.fm (charts, log analysis, A/B testing)
    • The New York Times (converted 4 TB of scans using EC2)
  56. JOB PROCESSING How Hadoop Works

  57. Just like I already described! It’s MapReduce! \o/

  58. BASIC RULES
    • Uses Input Formats to split up your data into single records
    • You can optimize using combiners to reduce locally on a node
      • Only possible in some cases, e.g. for max(), but not avg() (see the sketch below)
    • You can control partitioning of map output yourself
      • Rarely useful, the default partitioner (key hash) is enough
    • And a million other things that really don’t matter right now ;)
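  To make the combiner caveat above concrete, here is a minimal sketch in the style of the slide 46 pseudocode (emit() and the function name are illustrative, not Hadoop API): a max() combiner can simply reuse the reducer logic, while pre-averaging in a combiner would change the result.

    // Hypothetical combiner, reusing the emit() convention from the slide 46 pseudocode.
    // Safe for max(): max(max(a, b), max(c)) === max(a, b, c), so combining locally
    // on each node before the shuffle does not change the final answer.
    function combine_max($key, $values) {
        emit($key, max($values));
    }

    // Not safe for avg(): avg(avg(1, 2), 3) = avg(1.5, 3) = 2.25, but avg(1, 2, 3) = 2;
    // averaging partial averages loses the counts, so avg() cannot reuse its reducer as a combiner.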
  59. HDFS Hadoop Distributed File System

  60. HDFS
    • Stores data in blocks (default block size: 64 MB)
    • Designed for very large data sets
    • Designed for streaming rather than random reads
    • Write-once, read-many (although appending is possible)
    • Capable of compression and other cool things
  61. HDFS CONCEPTS
    • Large blocks minimize amount of seeks, maximize throughput
    • Blocks are stored redundantly (3 replicas as default; see the quick calculation below)
    • Aware of infrastructure characteristics (nodes, racks, ...)
    • Datanodes hold blocks
    • Namenode holds the metadata
      • Critical component for an HDFS cluster (HA, SPOF)
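  A quick back-of-the-envelope using the defaults from slides 60 and 61 (my numbers, purely to illustrate the trade-off): a 1 GB file is split into 1024 MB ÷ 64 MB = 16 blocks; with 3 replicas each, the cluster stores 48 block copies, i.e. roughly 3 GB of raw disk for 1 GB of data, spread across datanodes and racks so that losing a single node (or even a whole rack) loses no data.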
  62. OTHER EXECUTION MODELS Why Even Write Code?

  63. HADOOP FRAMEWORKS AND ECOSYSTEM
    • Apache Hive: SQL-like syntax
    • Apache Pig: data flow language
    • Cascading: Java abstraction layer
    • Scalding (Scala)
    • Apache Mahout: Machine Learning toolkit
    • Apache HBase: BigTable-like database
    • Apache Nutch: search engine
    • Cloudera Impala: realtime queries (no MR)
  64. STREAMING Hadoop Won’t Force Us To Use Java

  65. Hadoop Streaming can use any script as Mapper or Reducer

  66. many configuration options (parsers, formats, combining, …)

  67. it works using STDIN and STDOUT

  68. Mappers are streamed the records (usually by line: <line>\n) and emit key/value pairs: <key>\t<value>\n
  69. Reducers are streamed key/value pairs: <keyA>\t<value1>\n <keyA>\t<value2>\n <keyA>\t<value3>\n <keyB>\t<value4>\n

  70. Caution: no separate Reducer processes per key (but keys are sorted)
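  To make slides 65 through 70 concrete, here is a minimal sketch of the bytes-per-IP job from slide 46 as a pair of Hadoop Streaming scripts. The file names mapper.php and reducer.php and the crude log-parsing regex are my own illustrative choices, not part of the original deck.

    #!/usr/bin/env php
    <?php
    // mapper.php: reads raw log lines from STDIN, writes "<ip>\t<bytes>\n" to STDOUT.
    while (($line = fgets(STDIN)) !== false) {
        // Crude Apache access log parse: first field is the IP, last field the byte count.
        // A real job would use a proper log parser instead of this regex.
        if (preg_match('/^(\S+) .* (\d+)$/', trim($line), $m)) {
            echo $m[1], "\t", $m[2], "\n";
        }
    }

    #!/usr/bin/env php
    <?php
    // reducer.php: STDIN arrives sorted by key, but one reducer process can see many
    // different keys (slide 70), so the script has to detect key changes itself.
    $currentIp = null;
    $bytes = 0;
    while (($line = fgets(STDIN)) !== false) {
        list($ip, $value) = explode("\t", rtrim($line, "\n"), 2);
        if ($ip !== $currentIp) {
            if ($currentIp !== null) {
                echo $currentIp, "\t", $bytes, "\n"; // flush the previous key
            }
            $currentIp = $ip;
            $bytes = 0;
        }
        $bytes += (int) $value;
    }
    if ($currentIp !== null) {
        echo $currentIp, "\t", $bytes, "\n"; // flush the last key
    }

  A pair of scripts like this is typically launched through the hadoop-streaming jar with -input, -output, -mapper, -reducer and -file options; HadooPHP (slide 73) packages and generates exactly this kind of boilerplate.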
  71. STREAMING LIBRARIES
    • Dumbo or Hadoopy for Python: https://github.com/klbostee/dumbo, https://github.com/bwhite/hadoopy
    • Wukong for Ruby: https://github.com/mrflip/wukong
    • HadooPHP for PHP: https://github.com/dzuelke/HadooPHP
  72. STREAMING WITH PHP Introducing HadooPHP

  73. HADOOPHP
    • A little framework to help with writing mapred jobs in PHP
    • Takes care of input splitting, can do basic decoding et cetera
    • Automatically detects and handles Hadoop settings such as key length or field separators
    • Packages jobs as one .phar archive to ease deployment
    • Also creates a ready-to-rock shell script to invoke the job
  74. written by

  75. None
  76. CHAPTER TWO Real-Time Big Data

  77. sometimes, you can’t wait a few hours

  78. Twitter’s trending topics, fraud warning systems, ...

  79. batch processing won’t cut it

  80. None
  81. TWITTER STORM
    • Often called “the Hadoop for Real-Time”
    • Central Nimbus service coordinates execution w/ ZooKeeper
    • A Storm cluster runs Topologies, processing continuously
    • Spouts produce streams: unbounded sequences of tuples
    • Bolts consume input streams, process, output again
    • Topologies can consist of many steps for complex tasks
  82. TWITTER STORM
    • Bolts can be written in other languages
      • Uses STDIN/STDOUT like Hadoop Streaming, plus JSON
    • Storm can provide transactions for topologies and guarantee processing of messages
    • Architecture allows for non-stream-processing applications
      • e.g. Distributed RPC
  83. CLOUDERA IMPALA
    • Implementation of a Dremel/BigQuery-like system on Hadoop
    • Uses Hadoop v2 YARN infrastructure for distributed work
    • No MapReduce, no job setup overhead
    • Query data in HDFS or HBase
    • Hive-compatible interface
    • Potential game changer for its performance characteristics
  84. HAMMER TIME! Let the Hacking Begin :)

  85. The End

  86. THANK YOU! This was http://joind.in/8435 by @dzuelke. Questions: dz@europeanmedia.com