Big Data Collection, Storage and Analytics (SFLiveBerlin2012 2012-11-22)

David Zuelke
November 22, 2012

Presentation given at Symfony Live Berlin 2012 in Berlin, Germany.

Transcript

  1. BIG DATA COLLECTION,
    STORAGE AND ANALYTICS

  2. David Zülke

  3. David Zuelke

  4. (image slide)

  5. http://en.wikipedia.org/wiki/File:München_Panorama.JPG

  6. Founder

  7. (image slide)

  8. Lead Developer

  9. (image slide)

  10. @dzuelke

  11. PROLOGUE
    The Big Data Challenge

  12. we want to process data

  13. how much data exactly?

  14. SOME NUMBERS
    • Facebook, ingest per day:
      • I/08: 200 GB
      • II/09: 2 TB compressed
      • I/10: 12 TB compressed
      • III/12: 500 TB
    • Google:
      • Data processed per month: 400 PB (in 2007!)
      • Average job size: 180 GB

  15. what if you have that much data?

  16. what if Google’s 180 GB per job is all the data you have?

  17. “No Problemo”, you say?

  18. reading 180 GB sequentially off a disk will take ~45 minutes
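    (a back-of-the-envelope check: that figure implies a sustained sequential
    read rate of roughly 180,000 MB ÷ 2,700 s ≈ 67 MB/s, a conservative but
    plausible number for a single spinning disk)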

  19. and you only have 16 to 64 GB of RAM per computer

  20. so you can't process everything at once

  21. general rule of modern computers:

  22. data can be processed much faster than it can be read

  23. solution: parallelize your I/O

  24. but now you need to coordinate what you’re doing

  25. and that’s hard

  26. how do you avoid overloading your network?

  27. what if a node dies?

  28. is data lost?
    will other nodes in the grid have to re-start?
    how do you coordinate this?

  29. CHAPTER ONE
    Batch Processing of Big Data

  30. Hadoop is now the industry standard

  31. I wouldn’t bother with anything else

  32. ENTER: OUR HERO
    Introducing MapReduce

  33. in the olden days, the workload was distributed across a grid

  34. and the data was shipped around between nodes

  35. or even stored centrally on something like a SAN

  36. which was fine for small amounts of information

  37. but today, on the web, we have big data

  38. I/O bottleneck

  39. along came a Google publication in 2004

  40. MapReduce: Simplified Data Processing on Large Clusters
    http://labs.google.com/papers/mapreduce.html

  41. now the data is distributed

  42. computing happens on the nodes where the data already is

  43. processes are isolated and don’t communicate (share-nothing)

  44. BASIC MAPREDUCE FLOW
    1. A Mapper reads records and emits key/value pairs
       • Input could be a web server log, with each line as a record
    2. A Reducer is given a key and all values for that specific key
       • Even if there are many Mappers on many computers, the results
         are aggregated before they are handed to Reducers
    * In practice, it’s a lot smarter than that

  45. EXAMPLE OF MAPPED INPUT
    IP Bytes
    212.122.174.13 18271
    212.122.174.13 191726
    212.122.174.13 198
    74.119.8.111 91272
    74.119.8.111 8371
    212.122.174.13 43

  46. REDUCER WILL RECEIVE THIS
    IP              Bytes
    212.122.174.13  18271, 191726, 198, 43
    74.119.8.111    91272, 8371

  47. AFTER REDUCTION
    IP Bytes
    212.122.174.13 210238
    74.119.8.111 99643

  48. PSEUDOCODE
    function map($line_number, $line_text) {
      $parts = parse_apache_log($line_text);
      emit($parts['ip'], $parts['bytes']);
    }

    function reduce($key, $values) {
      $bytes = array_sum($values);
      emit($key, $bytes);
    }

    Sample input (Apache access log):
    212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271
    212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726
    212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198
    74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 91272
    74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 8371
    212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 43

    Reduced output:
    212.122.174.13  210238
    74.119.8.111     99643

  49. A YELLOW ELEPHANT
    Introducing Apache Hadoop

  50. (image slide)

  51. Hadoop is a MapReduce framework

  52. it allows us to focus on writing Mappers, Reducers etc.

  53. and it works extremely well

  54. how well exactly?

  55. HADOOP AT FACEBOOK
    • Predominantly used in combination with Hive (~95%)
    • Largest cluster holds over 100 PB of data
    • Typically 8 cores, 12 TB storage and 32 GB RAM per node
    • 1x Gigabit Ethernet for each server in a rack
    • 4x Gigabit Ethernet from rack switch to core
    Hadoop is aware of racks and locality of nodes

  56. HADOOP AT YAHOO!
    • Over 25,000 computers with over 100,000 CPUs
    • Biggest Cluster:
    • 4000 Nodes
    • 2x4 CPU cores each
    • 16 GB RAM each
    • Over 40% of jobs run using Pig
    http://wiki.apache.org/hadoop/PoweredBy

  57. OTHER NOTABLE USERS
    • Twitter (storage, logging, analysis; heavy users of Pig)
    • Rackspace (log analysis; data pumped into Lucene/Solr)
    • LinkedIn (contact suggestions)
    • Last.fm (charts, log analysis, A/B testing)
    • The New York Times (converted 4 TB of scans using EC2)

  58. OTHER EXECUTION MODELS
    Why Even Write Code?

  59. HADOOP FRAMEWORKS AND ECOSYSTEM
    • Apache Hive: SQL-like syntax
    • Apache Pig: data flow language
    • Cascading: Java abstraction layer
    • Scalding: Scala API for Cascading
    • Apache Mahout: machine learning toolkit
    • Apache HBase: BigTable-like database
    • Apache Nutch: search engine
    • Cloudera Impala: real-time queries (no MapReduce)

  60. CHAPTER TWO
    Real-Time Big Data

  61. sometimes, you can’t wait a few hours

  62. Twitter’s trending topics, fraud warning systems, ...

  63. batch processing won’t cut it

  64. (image slide)

  65. TWITTER STORM
    • Often called “the Hadoop for Real-Time”
    • Central Nimbus service coordinates execution w/ ZooKeeper
    • A Storm cluster runs Topologies, processing continuously
    • Spouts produce streams: unbounded sequences of tuples
    • Bolts consume input streams, process, output again
    • Topologies can consist of many steps for complex tasks

  66. TWITTER STORM
    • Bolts can be written in other languages
    • Uses STDIN/STDOUT like Hadoop Streaming, plus JSON (see the sketch below)
    • Storm can provide transactions for topologies and guarantee processing of messages
    • Architecture allows for non-stream-processing applications
      • e.g. Distributed RPC
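    To make the STDIN/STDOUT-plus-JSON idea concrete, here is a minimal
    bolt-style worker sketched in PHP. It only illustrates the general
    pattern, not the actual Storm multilang protocol (which adds a handshake,
    "end"-terminated message framing, acking and more); the tuple fields are
    assumptions for the bytes-per-IP example.

    #!/usr/bin/env php
    <?php
    // Illustrative only: read one JSON-encoded tuple per line from STDIN,
    // transform it, and emit a JSON-encoded result tuple on STDOUT.
    while (($line = fgets(STDIN)) !== false) {
        $tuple = json_decode(trim($line), true);
        if (!is_array($tuple) || !isset($tuple['ip'], $tuple['bytes'])) {
            continue; // skip anything that is not a well-formed tuple
        }
        $out = array(
            'ip'    => $tuple['ip'],
            'bytes' => (int) $tuple['bytes'],
        );
        echo json_encode($out), "\n";
        fflush(STDOUT); // push results downstream immediately
    }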

  67. CLOUDERA IMPALA
    • Implementation of a Dremel/BigQuery-like system on Hadoop
    • Uses Hadoop v2 YARN infrastructure for distributed work
    • No MapReduce, no job setup overhead
    • Query data in HDFS or HBase
    • Hive compatible interface
    • Potential game changer for its performance characteristics

  68. DEMO
    Hadoop Streaming & PHP in Action

  69. STREAMING WITH PHP
    Introducing HadooPHP

  70. HADOOPHP
    • A little framework to help with writing mapred jobs in PHP
    • Takes care of input splitting, can do basic decoding et cetera
    • Automatically detects and handles Hadoop settings such as
    key length or field separators
    • Packages jobs as one .phar archive to ease deployment (see the sketch below)
    • Also creates a ready-to-rock shell script to invoke the job
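    HadooPHP’s own build step is not shown in the deck; as a rough idea of
    what packaging a job into a single .phar means in plain PHP, a sketch
    (directory layout, file names and stub are made up for illustration, and
    phar.readonly must be disabled to write archives):

    <?php
    // build.php -- bundle a job directory into one self-contained archive.
    // Run with: php -d phar.readonly=0 build.php
    $phar = new Phar('job.phar');
    $phar->buildFromDirectory(__DIR__ . '/src');        // mapper, reducer, libraries
    $phar->setStub(Phar::createDefaultStub('run.php')); // entry point inside the archive
    echo "wrote job.phar\n";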

  71. written by

  72. (image slide)

  73. EPILOGUE
    Things to Keep in Mind

  74. DATA ACQUISITION
    • Batch loading
    • Log files into HDFS
    • *SQL to Hive via Sqoop
    • Streaming
    • Facebook Scribe
    • Apache Flume
    • Apache Chukwa
    • Apache Kafka

  75. WANT TO KEEP IT SIMPLE?
    • Measure Anything, Measure Everything
    http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
    • StatsD receives counter or timer values via UDP (see the sketch below)
    • StatsD::increment("grue.dinners");
    • Periodically flushes information to Graphite
    • But you need to know what you want to know!
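    Under the hood such an increment is a single fire-and-forget UDP datagram
    in StatsD’s simple "name:value|type" text format; a minimal PHP sketch
    (host, port and metric name are illustrative):

    <?php
    // Send one counter increment to a StatsD daemon over UDP.
    // "grue.dinners:1|c" means: bump the counter "grue.dinners" by 1.
    $socket = @fsockopen('udp://127.0.0.1', 8125, $errno, $errstr, 1);
    if ($socket !== false) {
        fwrite($socket, "grue.dinners:1|c");
        fclose($socket);
    }
    // If the daemon is unreachable the metric is simply dropped, which is
    // what makes this safe to call from hot code paths.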

  76. OPS MONITORING
    • Flume and Chukwa have sources for everything:
    MySQL status, Kernel I/O, FastCGI statistics, ...
    • Build a flow into HDFS store for persistence
    • Impala queries for fast checks on service outages etc
    • Correlate with Storm flow results to find problems
    • Use cloud-based notifications to produce SMS/email alerts

  77. just kidding

  78. just use Nagios/Icinga for that

  79. The End

  80. Questions?

  81. THANK YOU!
    This was http://joind.in/7564
    by @dzuelke
    Send me questions:
    [email protected]
