Big Data Collection, Storage and Analytics (SFLiveBerlin2012 2012-11-22)

David Zuelke
November 22, 2012

Presentation given at Symfony Live Berlin 2012 in Berlin, Germany.

Transcript

  1. BIG DATA COLLECTION,
    STORAGE AND ANALYTICS

  2. David Zülke

  3. David Zuelke

  4. (image slide)

  5. http://en.wikipedia.org/wiki/File:München_Panorama.JPG

  6. Founder

  7. (image slide)

  8. Lead Developer

  9. (image slide)

  10. @dzuelke

  11. PROLOGUE
    The Big Data Challenge

  12. we want to process data

  13. how much data exactly?

  14. SOME NUMBERS
    • Facebook, ingest per day:
      • I/08: 200 GB
      • II/09: 2 TB compressed
      • I/10: 12 TB compressed
      • III/12: 500 TB
    • Google:
      • Data processed per month: 400 PB (in 2007!)
      • Average job size: 180 GB

  15. what if you have that much data?

  16. what if Google’s 180 GB per job is all the data you have?

  17. “No Problemo”, you say?

  18. reading 180 GB sequentially off a disk will take ~45 minutes
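    (a back-of-the-envelope check: that figure implies a sustained sequential
    read rate of roughly 180,000 MB ÷ 2,700 s ≈ 67 MB/s, a conservative but
    plausible number for a single spinning disk)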

  19. and you only have 16 to 64 GB of RAM per computer

  20. so you can't process everything at once

  21. general rule of modern computers:

  22. data can be processed much faster than it can be read

  23. solution: parallelize your I/O

  24. but now you need to coordinate what you’re doing

  25. and that’s hard

  26. how do you avoid overloading your network?

  27. what if a node dies?

  28. is data lost?
    will other nodes in the grid have to re-start?
    how do you coordinate this?

  29. CHAPTER ONE
    Batch Processing of Big Data

  30. Hadoop is now the industry standard

  31. I wouldn’t bother with anything else

  32. ENTER: OUR HERO
    Introducing MapReduce

  33. in the olden days, the workload was distributed across a grid

  34. and the data was shipped around between nodes

  35. or even stored centrally on something like a SAN

  36. which was fine for small amounts of information

  37. but today, on the web, we have big data

  38. I/O bottleneck

  39. along came a Google publication in 2004

  40. MapReduce: Simplified Data Processing on Large Clusters
    http://labs.google.com/papers/mapreduce.html

  41. now the data is distributed

  42. computing happens on the nodes where the data already is

  43. processes are isolated and don’t communicate (share-nothing)

  44. BASIC MAPREDUCE FLOW
    1. A Mapper reads records and emits key/value pairs
       • Input could be a web server log, with each line as a record
    2. A Reducer is given a key and all values for that specific key
       • Even if there are many Mappers on many computers, the results
         are aggregated before they are handed to Reducers
    * In practice, it’s a lot smarter than that

  45. EXAMPLE OF MAPPED INPUT
    IP Bytes
    212.122.174.13 18271
    212.122.174.13 191726
    212.122.174.13 198
    74.119.8.111 91272
    74.119.8.111 8371
    212.122.174.13 43

  46. REDUCER WILL RECEIVE THIS
    IP              Bytes
    212.122.174.13  18271, 191726, 198, 43
    74.119.8.111    91272, 8371

  47. AFTER REDUCTION
    IP Bytes
    212.122.174.13 210238
    74.119.8.111 99643

  48. PSEUDOCODE
    function map($line_number, $line_text) {
      $parts = parse_apache_log($line_text);
      emit($parts['ip'], $parts['bytes']);
    }

    function reduce($key, $values) {
      $bytes = array_sum($values);
      emit($key, $bytes);
    }

    Sample input (Apache access log):
    212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271
    212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726
    212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198
    74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 91272
    74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 8371
    212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 43

    Reduced output:
    212.122.174.13  210238
    74.119.8.111     99643

  49. A YELLOW ELEPHANT
    Introducing Apache Hadoop

  50. (image slide)

  51. Hadoop is a MapReduce framework

  52. it allows us to focus on writing Mappers, Reducers etc.

  53. and it works extremely well

  54. how well exactly?

  55. HADOOP AT FACEBOOK
    • Predominantly used in combination with Hive (~95%)
    • Largest cluster holds over 100 PB of data
    • Typically 8 cores, 12 TB storage and 32 GB RAM per node
    • 1x Gigabit Ethernet for each server in a rack
    • 4x Gigabit Ethernet from rack switch to core
    Hadoop is aware of racks and locality of nodes

  56. HADOOP AT YAHOO!
    • Over 25,000 computers with over 100,000 CPUs
    • Biggest Cluster:
    • 4000 Nodes
    • 2x4 CPU cores each
    • 16 GB RAM each
    • Over 40% of jobs run using Pig
    http://wiki.apache.org/hadoop/PoweredBy

  57. OTHER NOTABLE USERS
    • Twitter (storage, logging, analysis; heavy users of Pig)
    • Rackspace (log analysis; data pumped into Lucene/Solr)
    • LinkedIn (contact suggestions)
    • Last.fm (charts, log analysis, A/B testing)
    • The New York Times (converted 4 TB of scans using EC2)

  58. OTHER EXECUTION MODELS
    Why Even Write Code?

  59. HADOOP FRAMEWORKS AND ECOSYSTEM
    • Apache Hive: SQL-like syntax
    • Apache Pig: data flow language
    • Cascading: Java abstraction layer
    • Scalding: Scala API for Cascading
    • Apache Mahout: machine learning toolkit
    • Apache HBase: BigTable-like database
    • Apache Nutch: search engine
    • Cloudera Impala: real-time queries (no MapReduce)

  60. CHAPTER TWO
    Real-Time Big Data

  61. sometimes, you can’t wait a few hours

  62. Twitter’s trending topics, fraud warning systems, ...

  63. batch processing won’t cut it

  64. (image slide)

  65. TWITTER STORM
    • Often called “the Hadoop for Real-Time”
    • Central Nimbus service coordinates execution w/ ZooKeeper
    • A Storm cluster runs Topologies, processing continuously
    • Spouts produce streams: unbounded sequences of tuples
    • Bolts consume input streams, process, output again
    • Topologies can consist of many steps for complex tasks

  66. TWITTER STORM
    • Bolts can be written in other languages
    • Uses STDIN/STDOUT like Hadoop Streaming, plus JSON (see the sketch below)
    • Storm can provide transactions for topologies and guarantee processing of messages
    • Architecture allows for non-stream-processing applications
      • e.g. Distributed RPC
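    To make the STDIN/STDOUT-plus-JSON idea concrete, here is a minimal
    bolt-style worker sketched in PHP. It only illustrates the general
    pattern, not the actual Storm multilang protocol (which adds a handshake,
    "end"-terminated message framing, acking and more); the tuple fields are
    assumptions for the bytes-per-IP example.

    #!/usr/bin/env php
    <?php
    // Illustrative only: read one JSON-encoded tuple per line from STDIN,
    // transform it, and emit a JSON-encoded result tuple on STDOUT.
    while (($line = fgets(STDIN)) !== false) {
        $tuple = json_decode(trim($line), true);
        if (!is_array($tuple) || !isset($tuple['ip'], $tuple['bytes'])) {
            continue; // skip anything that is not a well-formed tuple
        }
        $out = array(
            'ip'    => $tuple['ip'],
            'bytes' => (int) $tuple['bytes'],
        );
        echo json_encode($out), "\n";
        fflush(STDOUT); // push results downstream immediately
    }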

  67. CLOUDERA IMPALA
    • Implementation of a Dremel/BigQuery-like system on Hadoop
    • Uses Hadoop v2 YARN infrastructure for distributed work
    • No MapReduce, no job setup overhead
    • Query data in HDFS or HBase
    • Hive compatible interface
    • Potential game changer for its performance characteristics

  68. DEMO
    Hadoop Streaming & PHP in Action

  69. STREAMING WITH PHP
    Introducing HadooPHP

  70. HADOOPHP
    • A little framework to help with writing mapred jobs in PHP
    • Takes care of input splitting, can do basic decoding et cetera
    • Automatically detects and handles Hadoop settings such as
    key length or field separators
    • Packages jobs as one .phar archive to ease deployment (see the sketch below)
    • Also creates a ready-to-rock shell script to invoke the job
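    HadooPHP’s own build step is not shown in the deck; as a rough idea of
    what packaging a job into a single .phar means in plain PHP, a sketch
    (directory layout, file names and stub are made up for illustration, and
    phar.readonly must be disabled to write archives):

    <?php
    // build.php -- bundle a job directory into one self-contained archive.
    // Run with: php -d phar.readonly=0 build.php
    $phar = new Phar('job.phar');
    $phar->buildFromDirectory(__DIR__ . '/src');        // mapper, reducer, libraries
    $phar->setStub(Phar::createDefaultStub('run.php')); // entry point inside the archive
    echo "wrote job.phar\n";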

  71. written by

  72. (image slide)

  73. EPILOGUE
    Things to Keep in Mind

  74. DATA ACQUISITION
    • Batch loading
    • Log files into HDFS
    • *SQL to Hive via Sqoop
    • Streaming
    • Facebook Scribe
    • Apache Flume
    • Apache Chukwa
    • Apache Kafka

  75. WANT TO KEEP IT SIMPLE?
    • Measure Anything, Measure Everything
    http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
    • StatsD receives counter or timer values via UDP (see the sketch below)
    • StatsD::increment("grue.dinners");
    • Periodically flushes information to Graphite
    • But you need to know what you want to know!
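    Under the hood such an increment is a single fire-and-forget UDP datagram
    in StatsD’s simple "name:value|type" text format; a minimal PHP sketch
    (host, port and metric name are illustrative):

    <?php
    // Send one counter increment to a StatsD daemon over UDP.
    // "grue.dinners:1|c" means: bump the counter "grue.dinners" by 1.
    $socket = @fsockopen('udp://127.0.0.1', 8125, $errno, $errstr, 1);
    if ($socket !== false) {
        fwrite($socket, "grue.dinners:1|c");
        fclose($socket);
    }
    // If the daemon is unreachable the metric is simply dropped, which is
    // what makes this safe to call from hot code paths.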

  76. OPS MONITORING
    • Flume and Chukwa have sources for everything:
    MySQL status, Kernel I/O, FastCGI statistics, ...
    • Build a flow into HDFS store for persistence
    • Impala queries for fast checks on service outages etc
    • Correlate with Storm flow results to find problems
    • Use cloud-based notifications to produce SMS/email alerts

  77. just kidding

  78. just use Nagios/Icinga for that

  79. The End

  80. Questions?

  81. THANK YOU!
    This was http://joind.in/7564
    by @dzuelke
    Send me questions:
    [email protected]
