value> pairs
1. Input could be a web server log, with each line as a record
2. A Reducer is given a key and all values for this specific key
   1. Even if there are many Mappers on many computers, the results are aggregated before they are handed to Reducers
* In practice, it’s a lot smarter than that
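The map, group-by-key and reduce steps can be sketched on a single machine. This is a minimal illustration in plain Python, not Hadoop code: the sample log lines and the names `mapper`, `shuffle`, `reducer` and `run_job` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical sample input: one web server log line per record.
LOG_LINES = [
    '10.0.0.1 - - "GET /index.html" 200',
    '10.0.0.2 - - "GET /about.html" 200',
    '10.0.0.1 - - "GET /missing" 404',
]


def mapper(line):
    """Emit one <status code, 1> pair per log line."""
    status = line.rsplit(" ", 1)[-1]
    yield status, 1


def shuffle(pairs):
    """Collect all values per key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Receive one key together with all values for that key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in sorted(shuffle(pairs).items()))
```

Calling `run_job(LOG_LINES)` counts requests per status code. On a real cluster the mappers run on many nodes and the grouping happens across the network, but each reducer still sees one key with all of its values.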
(~95%)
• Largest cluster holds over 100 PB of data
• Typically 8 cores, 12 TB storage and 32 GB RAM per node
• 1x Gigabit Ethernet for each server in a rack
• 4x Gigabit Ethernet from rack switch to core
Hadoop is aware of racks and locality of nodes
of Pig)
• Rackspace (log analysis; data pumped into Lucene/Solr)
• LinkedIn (contact suggestions)
• Last.fm (charts, log analysis, A/B testing)
• The New York Times (converted 4 TB of scans using EC2)
Central Nimbus service coordinates execution w/ ZooKeeper
• A Storm cluster runs Topologies, processing continuously
• Spouts produce streams: unbounded sequences of tuples
• Bolts consume input streams, process, output again
• Topologies can consist of many steps for complex tasks
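The spout and bolt roles can be sketched with Python generators. This is only an analogy for how data flows through a topology, not the Storm API: the names `sentence_spout`, `split_bolt`, `count_bolt` and `topology` are hypothetical, and a real cluster distributes these steps across many workers.

```python
import itertools
from collections import Counter


def sentence_spout():
    """Spout: an unbounded stream of tuples (an endless cycle of sentences)."""
    sentences = ["the cow jumped", "the moon"]
    for sentence in itertools.cycle(sentences):
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consume sentence tuples and emit one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keep running counts and emit (word, count) for every input tuple."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def topology():
    """Wire the spout into two bolts; the result is itself an unbounded stream."""
    return count_bolt(split_bolt(sentence_spout()))


# The stream never ends, so take a bounded slice for inspection.
first_six = list(itertools.islice(topology(), 6))
```

Because the spout never finishes, the topology processes continuously. The `islice` call exists only so the sketch terminates.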
• Uses STDIN/STDOUT like Hadoop Streaming, plus JSON
• Storm can provide transactions for topologies and guarantee processing of messages
• Architecture allows for applications other than stream processing
• e.g. Distributed RPC
Hadoop
• Uses Hadoop v2 YARN infrastructure for distributed work
• No MapReduce, no job setup overhead
• Query data in HDFS or HBase
• Hive-compatible interface
• Potential game changer for its performance characteristics
jobs in PHP
• Takes care of input splitting, can do basic decoding et cetera
• Automatically detects and handles Hadoop settings such as key length or field separators
• Packages jobs as one .phar archive to ease deployment
• Also creates a ready-to-rock shell script to invoke the job
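To show what such a framework hides from the job author, here is a hand-written Hadoop Streaming style reducer. It is a sketch in Python, not the PHP framework's API: it assumes the common streaming convention of sorted `key<TAB>value` lines on STDIN, and the names `parse`, `reduce_stream` and `main` are hypothetical.

```python
import itertools
import sys


def parse(lines, separator="\t"):
    """Split each input line into a (key, value) pair at the first separator."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition(separator)
        yield key, value


def reduce_stream(lines, separator="\t"):
    """Sum integer values per key; the input must already be sorted by key."""
    pairs = parse(lines, separator)
    for key, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}{separator}{total}"


def main():
    """Entry point when used as a streaming script: read STDIN, write STDOUT."""
    for line in reduce_stream(sys.stdin):
        print(line)
```

The separator is a parameter here because a job's field separator can be changed in its settings. Detecting that automatically is the kind of work the framework takes over.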
http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
• StatsD receives counter or timer values via UDP
• StatsD::increment("grue.dinners");
• Periodically flushes information to Graphite
• But you need to know what you want to know!
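What a call like `StatsD::increment("grue.dinners")` puts on the wire can be sketched as follows. The plain-text formats `name:value|c` for counters and `name:value|ms` for timers, and the default port 8125, follow the StatsD convention. The helper names, the `grue.digest` metric and the localhost address are assumptions for this example.

```python
import socket


def counter_payload(name, delta=1):
    """Build a StatsD counter datagram, e.g. b'grue.dinners:1|c'."""
    return f"{name}:{delta}|c".encode("ascii")


def timer_payload(name, millis):
    """Build a StatsD timer datagram with a duration in milliseconds."""
    return f"{name}:{millis}|ms".encode("ascii")


def send(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget: UDP needs no connection and gets no reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

A typical use would be `send(counter_payload("grue.dinners"))`. Since UDP is connectionless, the call does not wait for a reply from the daemon, which keeps instrumentation cheap.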