Big Data Analytics (with Hadoop and PHP) (DPC2013 2013-06-06)

Workshop presented at Dutch PHP Conference 2013 in Amsterdam, The Netherlands.

David Zuelke

June 06, 2013

Transcript

  1. BIG DATA ANALYTICS
    (WITH HADOOP AND PHP)

  2. David Zülke

  3. David Zuelke

  4. http://en.wikipedia.org/wiki/File:München_Panorama.JPG

  5. Lead Architect

  6. PROLOGUE
    The Big Data Challenge

  7. we want to process data

  8. how much data exactly?

  9. SOME NUMBERS
    • Facebook, ingest per day:
      • Q1/2008: 200 GB
      • Q2/2009: 2 TB compressed
      • Q1/2010: 12 TB compressed
      • Q3/2012: 500 TB
    • Google:
      • Data processed per month: 400 PB (in 2007!)
      • Average job size: 180 GB

  10. what if you have that much data?

  11. what if Google’s 180 GB per job is all the data you have?

  12. “No Problemo”, you say?

  13. reading 180 GB sequentially off a disk will take ~45 minutes
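    (Back-of-the-envelope, assuming a single spinning disk sustaining roughly 70 MB/s of
    sequential reads: 180 GB ≈ 184,320 MB, and 184,320 MB ÷ 70 MB/s ≈ 2,630 s ≈ 44 minutes,
    which is where the ~45 minutes above comes from.)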

  14. and you only have 16 to 64 GB of RAM per computer

  15. so you can't process everything at once

  16. general rule of modern computers:

  17. data can be processed much faster than it can be read

  18. solution: parallelize your I/O

  19. but now you need to coordinate what you’re doing

  20. and that’s hard

  21. how do you avoid overloading your network?

  22. what if a node dies?

  23. is data lost?
    will other nodes in the grid have to re-start?
    how do you coordinate this?

  24. CHAPTER ONE
    Batch Processing of Big Data

  25. Hadoop is now the industry standard

  26. I wouldn’t bother with anything else

  27. ENTER: OUR HERO
    Introducing MapReduce

  28. in the olden days, the workload was distributed across a grid

  29. and the data was shipped around between nodes

  30. or even stored centrally on something like an SAN

  31. which was fine for small amounts of information

  32. but today, on the web, we have big data

  33. I/O bottleneck

  34. along came a Google publication in 2004

  35. MapReduce: Simplified Data Processing on Large Clusters
    http://labs.google.com/papers/mapreduce.html

  36. now the data is distributed

  37. computing happens on the nodes where the data already is

  38. processes are isolated and don’t communicate (share-nothing)

  39. BASIC MAPREDUCE FLOW
    1. A Mapper reads records and emits <key, value> pairs
       • Input could be a web server log, with each line as a record
    2. A Reducer is given a key and all values for this specific key
       • Even if there are many Mappers on many computers, the results
         are aggregated before they are handed to Reducers
    * In practice, it's a lot smarter than that

  40. EXAMPLE OF MAPPED INPUT
    IP                Bytes
    212.122.174.13    18271
    212.122.174.13    191726
    212.122.174.13    198
    74.119.8.111      91272
    74.119.8.111      8371
    212.122.174.13    43

  41. REDUCER WILL RECEIVE THIS
    IP                Bytes
    212.122.174.13    18271
    212.122.174.13    191726
    212.122.174.13    198
    212.122.174.13    43
    74.119.8.111      91272
    74.119.8.111      8371

  42. AFTER REDUCTION
    IP                Bytes
    212.122.174.13    210238
    74.119.8.111      99643

  43. PSEUDOCODE
    function map($line_number, $line_text) {
        $parts = parse_apache_log($line_text);
        emit($parts['ip'], $parts['bytes']);
    }

    function reduce($key, $values) {
        $bytes = array_sum($values);
        emit($key, $bytes);
    }

    Input (Apache access log):
    212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271
    212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726
    212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198
    74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 8371
    74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 91272
    212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 43

    Output after reduction:
    212.122.174.13  210238
    74.119.8.111    99643

  44. A YELLOW ELEPHANT
    Introducing Apache Hadoop

  45. Hadoop is a MapReduce framework

  46. it allows us to focus on writing Mappers, Reducers etc.

  47. and it works extremely well

  48. how well exactly?

  49. HADOOP AT FACEBOOK
    • Predominantly used in combination with Hive (~95%)
    • Largest cluster holds over 100 PB of data
    • Typically 8 cores, 12 TB storage and 32 GB RAM per node
    • 1x Gigabit Ethernet for each server in a rack
    • 4x Gigabit Ethernet from rack switch to core
    Hadoop is aware of racks and locality of nodes

  50. HADOOP AT YAHOO!
    • Over 25,000 computers with over 100,000 CPUs
    • Biggest Cluster:
    • 4000 Nodes
    • 2x4 CPU cores each
    • 16 GB RAM each
    • Over 40% of jobs run using Pig
    http://wiki.apache.org/hadoop/PoweredBy

  51. OTHER NOTABLE USERS
    • Twitter (storage, logging, analysis. Heavy users of Pig)
    • Rackspace (log analysis; data pumped into Lucene/Solr)
    • LinkedIn (contact suggestions)
    • Last.fm (charts, log analysis, A/B testing)
    • The New York Times (converted 4 TB of scans using EC2)

  52. JOB PROCESSING
    How Hadoop Works

  53. Just like I already described! It’s MapReduce!
    \o/

  54. BASIC RULES
    • Uses Input Formats to split up your data into single records
    • You can optimize using combiners to reduce locally on a node
      • Only possible in some cases, e.g. for max(), but not avg() (see the note below)
    • You can control partitioning of map output yourself
      • Rarely useful; the default partitioner (key hash) is enough
    • And a million other things that really don’t matter right now ;)
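    A quick illustration of the max() vs. avg() point above: max(max(1, 2), max(3)) = max(1, 2, 3) = 3,
    so taking a local max per node is safe, but avg(avg(1, 2), avg(3)) = avg(1.5, 3) = 2.25, while
    avg(1, 2, 3) = 2. A combiner is only safe when partial results can be merged without changing the
    answer; to make averages combinable, emit (sum, count) pairs and divide in the Reducer.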

  55. HDFS
    Hadoop Distributed File System

  56. HDFS
    • Stores data in blocks (default block size: 64 MB)
    • Designed for very large data sets
    • Designed for streaming rather than random reads
    • Write-once, read-many (although appending is possible)
    • Capable of compression and other cool things

  57. HDFS CONCEPTS
    • Large blocks minimize the number of seeks and maximize throughput
    • Blocks are stored redundantly (3 replicas by default; see the example below)
    • Aware of infrastructure characteristics (nodes, racks, ...)
    • Datanodes hold blocks
    • Namenode holds the metadata
      Critical component of an HDFS cluster: a SPOF unless run in an HA setup
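    A worked example using the defaults above (64 MB blocks, 3 replicas): a 200 MB file is split
    into 64 + 64 + 64 + 8 MB, i.e. 4 blocks; with 3 replicas each, the Datanodes hold 12 block
    copies (~600 MB of raw storage), while the Namenode only keeps the metadata mapping the file
    name to those block locations.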

  58. OTHER EXECUTION MODELS
    Why Even Write Code?

  59. HADOOP FRAMEWORKS AND ECOSYSTEM
    • Apache Hive: SQL-like syntax (example below)
    • Apache Pig: data flow language
    • Cascading: Java abstraction layer
    • Scalding: Scala abstraction layer
    • Apache Mahout: machine learning toolkit
    • Apache HBase: BigTable-like database
    • Apache Nutch: search engine
    • Cloudera Impala: realtime queries (no MapReduce)
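    To make the Hive entry concrete: the access-log job from the pseudocode slide could be written,
    assuming a hypothetical table access_log with columns ip and bytes, roughly as
    SELECT ip, SUM(bytes) FROM access_log GROUP BY ip;
    Hive compiles such a query into MapReduce jobs behind the scenes.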

  60. STREAMING
    Hadoop Won’t Force Us To Use Java

  61. Hadoop Streaming can use any script as Mapper or Reducer

  62. many configuration options (parsers, formats, combining, …)

  63. it works using STDIN and STDOUT

  64. Mappers are streamed the records
    (usually by line: \n)
    and emit key/value pairs: <key>\t<value>\n

  65. Reducers are streamed key/value pairs:
    <key1>\t<value1>\n
    <key1>\t<value2>\n
    <key2>\t<value3>\n
    <key2>\t<value4>\n

  66. Caution: no separate Reducer processes per key
    (but keys are sorted)
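    To make the streaming protocol concrete, here is a minimal, hand-rolled sketch of the
    byte-count job as two plain PHP scripts (illustration code, not HadooPHP; the file names
    mapper.php and reducer.php are just assumptions):

    mapper.php

    #!/usr/bin/env php
    <?php
    // Read raw Apache log lines from STDIN and emit one "ip<TAB>bytes" record per line.
    while (($line = fgets(STDIN)) !== false) {
        $fields = preg_split('/\s+/', trim($line));
        if (count($fields) < 2) {
            continue; // skip malformed lines
        }
        // naive parsing: first field is the client IP, last field is the byte count
        echo $fields[0], "\t", (int) end($fields), "\n";
    }

    reducer.php

    #!/usr/bin/env php
    <?php
    // Keys arrive sorted, but one reducer process may see many different keys (see the
    // caution above), so we sum values until the key changes and then flush the total.
    $currentKey = null;
    $sum = 0;
    while (($line = fgets(STDIN)) !== false) {
        $parts = explode("\t", rtrim($line, "\n"), 2);
        if (count($parts) !== 2) {
            continue; // ignore lines without a key/value separator
        }
        list($key, $value) = $parts;
        if ($currentKey !== null && $key !== $currentKey) {
            echo $currentKey, "\t", $sum, "\n"; // flush the previous key
            $sum = 0;
        }
        $currentKey = $key;
        $sum += (int) $value;
    }
    if ($currentKey !== null) {
        echo $currentKey, "\t", $sum, "\n"; // flush the last key
    }

    Both scripts can be tested locally without a cluster, with sort standing in for the shuffle:
    cat access.log | php mapper.php | sort | php reducer.php
    On a cluster they would be handed to the Hadoop Streaming jar via its -mapper, -reducer and
    -file options (the jar's exact path depends on the Hadoop distribution).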

  67. STREAMING LIBRARIES
    • Dumbo or Hadoopy for Python
    https://github.com/klbostee/dumbo
    https://github.com/bwhite/hadoopy
    • Wukong for Ruby
    https://github.com/mrflip/wukong
    • HadooPHP for PHP
    https://github.com/dzuelke/HadooPHP

  68. STREAMING WITH PHP
    Introducing HadooPHP

  69. HADOOPHP
    • A little framework to help with writing mapred jobs in PHP
    • Takes care of input splitting, can do basic decoding et cetera
    • Automatically detects and handles Hadoop settings such as
    key length or field separators
    • Packages jobs as one .phar archive to ease deployment
    • Also creates a ready-to-rock shell script to invoke the job

  70. CHAPTER TWO
    Real-Time Big Data

  71. sometimes, you can’t wait a few hours

  72. Twitter’s trending topics, fraud warning systems, ...

  73. batch processing won’t cut it

  74. TWITTER STORM
    • Often called “the Hadoop for Real-Time”
    • Central Nimbus service coordinates execution w/ ZooKeeper
    • A Storm cluster runs Topologies, processing continuously
    • Spouts produce streams: unbounded sequences of tuples
    • Bolts consume input streams, process, output again
    • Topologies can consist of many steps for complex tasks

  75. TWITTER STORM
    • Bolts can be written in other languages
    • Uses STDIN/STDOUT like Hadoop Streaming, plus JSON (sketch below)
    • Storm can provide transactions for topologies and guarantee
      processing of messages
    • Architecture allows for non-stream-processing applications
      • e.g. Distributed RPC
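    As a rough illustration of that bullet (a deliberately simplified sketch, not Storm's actual
    multilang wire format, which adds a handshake, "end" delimiters and ack/fail commands), a bolt
    in PHP boils down to a loop that reads JSON tuples from STDIN and writes JSON emit commands
    to STDOUT:

    #!/usr/bin/env php
    <?php
    // Simplified bolt loop: one JSON document per line in, one per line out.
    // The message shape ("tuple", "command") only mirrors the general idea.
    while (($line = fgets(STDIN)) !== false) {
        $msg = json_decode($line, true);
        if (!is_array($msg) || !isset($msg['tuple'])) {
            continue; // ignore anything that isn't a tuple message
        }
        // example processing step: uppercase the first field of the incoming tuple
        $out = array('command' => 'emit', 'tuple' => array(strtoupper((string) $msg['tuple'][0])));
        echo json_encode($out), "\n";
    }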

  76. CLOUDERA IMPALA
    • Implementation of a Dremel/BigQuery-like system on Hadoop
    • Uses Hadoop v2 YARN infrastructure for distributed work
    • No MapReduce, no job setup overhead
    • Query data in HDFS or HBase
    • Hive-compatible interface
    • Potential game changer for its performance characteristics

  77. HAMMER TIME!
    Let the Hacking Begin :)

  78. THANK YOU!
    This was
    http://joind.in/8435
    by @dzuelke.
    Questions: [email protected]
