Big Data Collection, Storage and Analytics (SFLiveBerlin2012 2012-11-22)

David Zuelke
November 22, 2012


Presentation given at Symfony Live Berlin 2012 in Berlin, Germany.





  2. David Zülke

  3. David Zuelke


  6. Founder

  8. Lead Developer

  10. @dzuelke

  11. PROLOGUE The Big Data Challenge

  12. we want to process data

  13. how much data exactly?

  14. SOME NUMBERS
    • Facebook, ingest per day:
      • I/08: 200 GB
      • II/09: 2 TB compressed
      • I/10: 12 TB compressed
      • III/12: 500 TB
    • Google
      • Data processed per month: 400 PB (in 2007!)
      • Average job size: 180 GB
  15. what if you have that much data?

  16. what if Google’s 180 GB per job is all the data you have?
  17. “No Problemo”, you say?

  18. reading 180 GB sequentially off a disk will take ~45

  19. and you only have 16 to 64 GB of RAM per computer
  20. so you can't process everything at once

  21. general rule of modern computers:

  22. data can be processed much faster than it can be

  23. solution: parallelize your I/O
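
A back-of-the-envelope sketch of why parallel I/O helps. The 70 MB/s sustained sequential throughput is an assumption for illustration, not a figure from the slides, and the function name is hypothetical. At that rate a single disk needs roughly 43 minutes for 180 GB; spreading the data evenly over more disks divides the time accordingly.

```python
def sequential_read_seconds(size_gb: float, throughput_mb_s: float, disks: int = 1) -> float:
    """Seconds needed to read size_gb when the data is spread evenly over `disks` drives.

    Assumes perfect parallelism and no coordination or network overhead.
    """
    size_mb = size_gb * 1000  # decimal units, as disk vendors use them
    return size_mb / (throughput_mb_s * disks)


if __name__ == "__main__":
    # 70 MB/s per disk is an assumed value, not taken from the talk.
    for disks in (1, 10, 100):
        minutes = sequential_read_seconds(180, 70, disks) / 60
        print(f"{disks:>3} disk(s): {minutes:6.1f} min")
```

The model ignores exactly the problems the following slides raise: coordination, network load and node failure.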

  24. but now you need to coordinate what you’re doing

  25. and that’s hard

  26. how do you avoid overloading your network?

  27. what if a node dies?

  28. is data lost? will other nodes in the grid have to re-start? how do you coordinate this?
  29. CHAPTER ONE Batch Processing of Big Data

  30. Hadoop is now the industry standard

  31. I wouldn’t bother with anything else

  32. ENTER: OUR HERO Introducing MapReduce

  33. in the olden days, the workload was distributed across a

  34. and the data was shipped around between nodes

  35. or even stored centrally on something like an SAN

  36. which was fine for small amounts of information

  37. but today, on the web, we have big data

  38. I/O bottleneck

  39. along came a Google publication in 2004

  40. MapReduce: Simplified Data Processing on Large Clusters

  41. now the data is distributed

  42. computing happens on the nodes where the data already is

  43. processes are isolated and don’t communicate (share-nothing)

  44. BASIC MAPREDUCE FLOW
    1. A Mapper reads records and emits <key, value> pairs
      1. Input could be a web server log, with each line as a record
    2. A Reducer is given a key and all values for this specific key
      1. Even if there are many Mappers on many computers, the results are aggregated before they are handed to Reducers
    * In practice, it’s a lot smarter than that
  45. EXAMPLE OF MAPPED INPUT
    IP    Bytes
          18271
          191726
          198
          91272
          8371
          43
  46. REDUCER WILL RECEIVE THIS
    IP    Bytes
          18271
          191726
          198
          43

          91272
          8371
  47. AFTER REDUCTION
    IP    Bytes
          210238
          99643

  48. PSEUDOCODE

    function map($line_number, $line_text) {
      $parts = parse_apache_log($line_text);
      emit($parts['ip'], $parts['bytes']);
    }

    function reduce($key, $values) {
      $bytes = array_sum($values);
      emit($key, $bytes);
    }

    Input:
    - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271
    - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726
    - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198
    - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 43
    - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 91272
    - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 8371

    Output:
    210238
    99643
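
The pseudocode can be tried on a single machine with a small simulation. This is a minimal sketch, not Hadoop's implementation: the "shuffle" is a plain dictionary, and the function names are made up for the example. The IP addresses are placeholders, because the transcript lost the originals. Which request belongs to which client is inferred from the grouping on slide 46.

```python
import re
from collections import defaultdict

# Placeholder IPs (the real ones are missing from the transcript).
LOG = """\
192.0.2.1 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271
192.0.2.1 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726
192.0.2.1 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198
192.0.2.1 - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 43
192.0.2.2 - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 91272
192.0.2.2 - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 8371
"""

# Matches only the simple log layout shown above: ip, two fields, [time], "request", status, bytes.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (\d+)$')


def map_line(line_number, line_text):
    """Mapper: emit (ip, bytes) for each parsable log line."""
    match = LINE_RE.match(line_text)
    if match:
        yield match.group(1), int(match.group(2))


def reduce_key(key, values):
    """Reducer: sum all byte counts seen for one IP."""
    yield key, sum(values)


def run(lines):
    """Map every line, group values by key (the 'shuffle'), then reduce each group."""
    groups = defaultdict(list)
    for number, line in enumerate(lines):
        for key, value in map_line(number, line):
            groups[key].append(value)
    result = {}
    for key in sorted(groups):
        for out_key, out_value in reduce_key(key, groups[key]):
            result[out_key] = out_value
    return result


if __name__ == "__main__":
    for ip, total in run(LOG.splitlines()).items():
        print(ip, total)
```

With this assignment of requests to clients, the totals match the 210238 and 99643 shown on slide 47.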
  49. A YELLOW ELEPHANT Introducing Apache Hadoop

  51. Hadoop is a MapReduce framework

  52. it allows us to focus on writing Mappers, Reducers etc.

  53. and it works extremely well

  54. how well exactly?

  55. HADOOP AT FACEBOOK
    • Predominantly used in combination with Hive (~95%)
    • Largest cluster holds over 100 PB of data
    • Typically 8 cores, 12 TB storage and 32 GB RAM per node
    • 1x Gigabit Ethernet for each server in a rack
    • 4x Gigabit Ethernet from rack switch to core
    Hadoop is aware of racks and locality of nodes
  56. HADOOP AT YAHOO!
    • Over 25,000 computers with over 100,000 CPUs
    • Biggest Cluster:
      • 4000 Nodes
      • 2x4 CPU cores each
      • 16 GB RAM each
    • Over 40% of jobs run using Pig
  57. OTHER NOTABLE USERS
    • Twitter (storage, logging, analysis. Heavy users of Pig)
    • Rackspace (log analysis; data pumped into Lucene/Solr)
    • LinkedIn (contact suggestions)
    • (charts, log analysis, A/B testing)
    • The New York Times (converted 4 TB of scans using EC2)
  58. OTHER EXECUTION MODELS Why Even Write Code?

  59. HADOOP FRAMEWORKS AND ECOSYSTEM
    • Apache Hive: SQL-like syntax
    • Apache Pig: Data flow language
    • Cascading: Java abstraction layer
    • Scalding (Scala)
    • Apache Mahout: Machine Learning toolkit
    • Apache HBase: BigTable-like database
    • Apache Nutch: Search engine
    • Cloudera Impala: Realtime queries (no MR)
  60. CHAPTER TWO Real-Time Big Data

  61. sometimes, you can’t wait a few hours

  62. Twitter’s trending topics, fraud warning systems, ...

  63. batch processing won’t cut it

  65. TWITTER STORM
    • Often called “the Hadoop for Real-Time”
    • Central Nimbus service coordinates execution w/ ZooKeeper
    • A Storm cluster runs Topologies, processing continuously
    • Spouts produce streams: unbounded sequences of tuples
    • Bolts consume input streams, process, output again
    • Topologies can consist of many steps for complex tasks
  66. TWITTER STORM
    • Bolts can be written in other languages
      • Uses STDIN/STDOUT like Hadoop Streaming, plus JSON
    • Storm can provide transactions for topologies and guarantee processing of messages
    • Architecture allows for non-stream processing applications
      • e.g. Distributed RPC
  67. CLOUDERA IMPALA
    • Implementation of a Dremel/BigQuery-like system on Hadoop
    • Uses Hadoop v2 YARN infrastructure for distributed work
    • No MapReduce, no job setup overhead
    • Query data in HDFS or HBase
    • Hive compatible interface
    • Potential game changer for its performance characteristics
  68. DEMO Hadoop Streaming & PHP in Action

  69. STREAMING WITH PHP Introducing HadooPHP

  70. HADOOPHP
    • A little framework to help with writing mapred jobs in PHP
    • Takes care of input splitting, can do basic decoding et cetera
    • Automatically detects and handles Hadoop settings such as key length or field separators
    • Packages jobs as one .phar archive to ease deployment
    • Also creates a ready-to-rock shell script to invoke the job
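
For comparison, here is roughly what a bare Hadoop Streaming job has to do by hand when no helper framework is involved. This sketch is written in Python rather than PHP and is not HadooPHP's code. It assumes the default Streaming convention of one record per line with key and value separated by a tab, and that the reducer's input arrives sorted by key. The byte count is taken to be the last field of each log line.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'ip<TAB>bytes' for every log line whose last field is a number."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        ip, size = fields[0], fields[-1]
        if size.isdigit():
            yield f"{ip}\t{size}"


def reducer(lines):
    """Sum values per key; relies on the input being sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for ip, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{ip}\t{sum(int(value) for _, value in group)}"


# Runs only when a role is given on the command line: "map" or "reduce".
if __name__ == "__main__" and len(sys.argv) > 1:
    step = mapper if sys.argv[1] == "map" else reducer
    for record in step(sys.stdin):
        print(record)
```

Locally, the same flow can be approximated by piping the mapper's output through `sort` into the reducer, which stands in for the shuffle phase.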
  71. written by

  73. EPILOGUE Things to Keep in Mind

  74. DATA ACQUISITION
    • Batch loading
      • Log files into HDFS
      • *SQL to Hive via Sqoop
    • Streaming
      • Facebook Scribe
      • Apache Flume
      • Apache Chukwa
      • Apache Kafka
  75. WANT TO KEEP IT SIMPLE?
    • Measure Anything, Measure Everything measure-everything/
    • StatsD receives counter or timer values via UDP
      • StatsD::increment("grue.dinners");
    • Periodically flushes information to Graphite
    • But you need to know what you want to know!
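
What a call like `StatsD::increment("grue.dinners")` puts on the wire is a single small UDP datagram. The sketch below sends one by hand. It assumes StatsD's plain-text counter format `name:value|c` and the conventional default port 8125. The helper name and the default host are made up for the example.

```python
import socket


def statsd_increment(metric: str, host: str = "127.0.0.1", port: int = 8125, value: int = 1) -> bytes:
    """Send one counter increment to a StatsD daemon and return the payload sent.

    UDP is fire-and-forget: there is no reply, and nothing fails if no daemon is listening.
    """
    payload = f"{metric}:{value}|c".encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload


if __name__ == "__main__":
    print(statsd_increment("grue.dinners"))
```

Because the sender never waits for an answer, instrumenting application code this way costs very little, which is what makes the "measure everything" approach practical.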
  76. OPS MONITORING
    • Flume and Chukwa have sources for everything: MySQL status, Kernel I/O, FastCGI statistics, ...
    • Build a flow into HDFS store for persistence
    • Impala queries for fast checks on service outages etc.
    • Correlate with Storm flow results to find problems
    • Use cloud-based notifications to produce SMS/email alerts
  77. just kidding

  78. just use Nagios/Icinga for that

  79. The End

  80. Questions?

  81. THANK YOU! This was by @dzuelke Send me questions: