BASIC MAPREDUCE FLOW
• A Mapper reads records and emits (key, value) pairs
  • Input could be a web server log, with each line as a record
• A Reducer is given a key and all values for this specific key
  • Even if there are many Mappers on many computers, the results are aggregated before they are handed to Reducers
• In practice, it’s a lot smarter than that
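The flow above can be sketched as a Hadoop Streaming word count in Python. This is a minimal illustration, not production code; the file name and invocation in the comments are assumptions:

```python
# wordcount.py -- mapper and reducer in one file for brevity.
# With Hadoop Streaming this would run as something like:
#   hadoop jar hadoop-streaming.jar \
#     -mapper "wordcount.py map" -reducer "wordcount.py reduce" ...
import sys

def mapper(records):
    """Read records (e.g. log lines), emit tab-separated (key, 1) pairs."""
    for record in records:
        for word in record.split():
            yield f"{word}\t1"

def reducer(lines):
    """Hadoop sorts mapper output by key, so all values for one key
    arrive consecutively; sum them up and emit one line per key."""
    current, total = None, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current and current is not None:
            yield f"{current}\t{total}"
            total = 0
        current = key
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__" and len(sys.argv) > 1:
    stage = mapper if sys.argv[1] == "map" else reducer
    for out in stage(sys.stdin):
        print(out)
```

The shuffle-and-sort step between map and reduce is what lets the reducer see all values for a key together, even when mappers ran on many machines.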
HADOOP AT FACEBOOK
• Predominantly used in combination with Hive (~95%)
• Largest cluster holds over 100 PB of data
• Typically 8 cores, 12 TB storage and 32 GB RAM per node
• 1x Gigabit Ethernet for each server in a rack
• 4x Gigabit Ethernet from rack switch to core
Hadoop is aware of racks and locality of nodes
HADOOP AT YAHOO!
• Over 25,000 computers with over 100,000 CPUs
• Biggest Cluster:
  • 4,000 nodes
  • 2x4 CPU cores each
  • 16 GB RAM each
• Over 40% of jobs run using Pig
http://wiki.apache.org/hadoop/PoweredBy
OTHER NOTABLE USERS
• Twitter (storage, logging, analysis; heavy users of Pig)
• Rackspace (log analysis; data pumped into Lucene/Solr)
• LinkedIn (contact suggestions)
• Last.fm (charts, log analysis, A/B testing)
• The New York Times (converted 4 TB of scans using EC2)
TWITTER STORM
• Often called “the Hadoop for Real-Time”
• Central Nimbus service coordinates execution w/ ZooKeeper
• A Storm cluster runs Topologies, processing continuously
• Spouts produce streams: unbounded sequences of tuples
• Bolts consume input streams, process them, output again
• Topologies can consist of many steps for complex tasks
TWITTER STORM
• Bolts can be written in other languages
• Uses STDIN/STDOUT like Hadoop Streaming, plus JSON
• Storm can provide transactions for topologies and guarantee processing of messages
• Architecture allows for non-stream processing applications
  • e.g. Distributed RPC
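The STDIN/STDOUT-plus-JSON mechanism can be sketched as below. This is a deliberately simplified illustration of the idea, not the full Storm multi-lang protocol (the real protocol adds an initial handshake, heartbeats, and an end-of-message delimiter, usually via a helper library); the tuple shape here is an assumption:

```python
# Sketch of a multi-lang Storm bolt: JSON tuples arrive on STDIN,
# "emit" and "ack" commands go back out as JSON on STDOUT.
import json
import sys

def process_tuple(tup):
    """Example bolt logic: split a sentence tuple into word tuples."""
    sentence = tup["tuple"][0]
    return [{"command": "emit", "tuple": [word]} for word in sentence.split()]

def run(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        tup = json.loads(line)
        for out in process_tuple(tup):
            stdout.write(json.dumps(out) + "\n")
        # ack the input tuple so Storm can guarantee it was processed
        stdout.write(json.dumps({"command": "ack", "id": tup.get("id")}) + "\n")
```

The explicit ack at the end of each tuple is what feeds Storm's guaranteed-processing machinery: unacked tuples are replayed from the spout.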
CLOUDERA IMPALA
• Implementation of a Dremel/BigQuery-like system on Hadoop
• Uses Hadoop v2 YARN infrastructure for distributed work
• No MapReduce, no job setup overhead
• Queries data in HDFS or HBase
• Hive-compatible interface
• Potential game changer for its performance characteristics
HADOOPHP
• A little framework to help with writing MapReduce jobs in PHP
• Takes care of input splitting, can do basic decoding, et cetera
• Automatically detects and handles Hadoop settings such as key length or field separators
• Packages jobs as one .phar archive to ease deployment
• Also creates a ready-to-rock shell script to invoke the job
WANT TO KEEP IT SIMPLE?
• Measure Anything, Measure Everything
  http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
• StatsD receives counter or timer values via UDP
  • StatsD::increment("grue.dinners");
• Periodically flushes information to Graphite
• But you need to know what you want to know!
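The StatsD wire format is simple enough to show in a few lines: each metric is a plain-text datagram of the form `bucket:value|type`, fired over UDP. A minimal Python sketch (host and port are assumptions; 8125 is the conventional StatsD port):

```python
# Minimal StatsD-style client: metrics are fire-and-forget UDP datagrams
# in the wire format "bucket:value|type".
import socket

def format_metric(bucket, value, metric_type):
    """Build a StatsD wire-format line, e.g. 'grue.dinners:1|c'."""
    return f"{bucket}:{value}|{metric_type}"

def increment(bucket, host="localhost", port=8125):
    """Python equivalent of the StatsD::increment("grue.dinners") call above."""
    payload = format_metric(bucket, 1, "c")  # "c" = counter; "ms" = timer
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload.encode("ascii"), (host, port))
    sock.close()
```

Because UDP is fire-and-forget, instrumented application code never blocks on the metrics pipeline, which is what makes "measure everything" cheap.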
OPS MONITORING
• Flume and Chukwa have sources for everything: MySQL status, kernel I/O, FastCGI statistics, ...
• Build a flow into an HDFS store for persistence
• Impala queries for fast checks on service outages etc.
• Correlate with Storm flow results to find problems
• Use cloud-based notifications to produce SMS/email alerts