Sky - Hadoop & Distributed Computing

Hadoop & Distributed Computing
Federico Cargnelutti / BSkyB

Distributed computing uses software to divide pieces of a program among several computers.
One project in particular has proven that the concept works extremely well.

SETI@Home: Search for Extra-Terrestrial Intelligence
• Prove the viability of the distributed grid computing concept (succeeded)
• Detect intelligent life outside Earth (failed)

Distributed Computing
What problem are we trying to solve?

Counts of all the distinct words
• in a file?
• in a directory?
• on the Web?
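For a single file, the question above fits in a few lines on one machine. The following is a minimal sketch in Python, assuming whitespace-separated words; the helper name count_words is hypothetical. It is the directory and Web cases that outgrow this approach.

```python
from collections import Counter

def count_words(text):
    """Return a Counter mapping each distinct word to its number of occurrences."""
    # Splitting on whitespace is a simplifying assumption; real text would
    # also need punctuation and case handling.
    return Counter(text.split())

counts = count_words("in the file in the directory")
print(counts["in"], counts["the"], len(counts))  # 2 2 4
```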

We need to process 100TB datasets
• On 1 node:
  o Scanning @ 50MB/s = 23 days
• On 1000 node cluster:
  o Scanning @ 50MB/s = 33 min
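The figures above follow from simple division. As a rough check, the sketch below assumes decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes), an even split of the data across nodes, and no coordination overhead; the function name scan_seconds is made up for illustration.

```python
def scan_seconds(dataset_bytes, bytes_per_second, nodes=1):
    """Seconds to scan a dataset split evenly across `nodes` machines."""
    return dataset_bytes / (bytes_per_second * nodes)

TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

one_node = scan_seconds(100 * TB, 50 * MB)
cluster = scan_seconds(100 * TB, 50 * MB, nodes=1000)

print(f"1 node: {one_node / 86400:.1f} days")     # 1 node: 23.1 days
print(f"1000 nodes: {cluster / 60:.1f} minutes")  # 1000 nodes: 33.3 minutes
```

Real clusters rarely scale perfectly linearly, so the 33 minute figure is best read as a lower bound.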

We need a framework for distribution

We need a new paradigm

Hadoop is an open-source Java framework for running applications on large clusters of commodity hardware

Scalable: Hadoop can reliably store and process petabytes of data.
Economical: Hadoop distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
Efficient: Hadoop can process the distributed data in parallel on the nodes where the data is located.
Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.

Hadoop Components
Hadoop Distributed File System (HDFS)
• Java, Shell, C and HTTP APIs
Hadoop MapReduce
• Java and Streaming APIs
Hadoop on Demand
• Tools to manage dynamic setup and teardown of Hadoop nodes

Other Tools
HBase: Table storage on top of HDFS, modeled after Google's Bigtable
Pig: Language for dataflow programming
Hive: SQL interface to structured data stored in HDFS

Hadoop MapReduce
• Mappers and Reducers are allocated
• Code is shipped to nodes
• Mappers and Reducers are run on same machines as DataNodes
• Two major daemons: JobTracker and TaskTracker

Hadoop MapReduce
JobTracker
• Long-lived master daemon which distributes tasks
• Maintains a job history of job execution statistics
TaskTrackers
• Long-lived client daemon which executes Map and Reduce tasks

Hadoop MapReduce
• Set up a multi-node Hadoop cluster using the Hadoop Distributed File System (HDFS)
• Create a hierarchical HDFS with directories and files.
• Use Hadoop API to store a large text file.
• Create a MapReduce application.

Map
• Mapper takes input key/value pair
• Does something to its input
• Emits intermediate key/value pair
• One call per input record
• Fully data-parallel

Map
(in, 1) (in, 1) (sunt, 1) (in, 1) (elit, 1) (sed, 1) (eiusmod, 1)
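The map step for word count can be sketched in Python as a generator that emits one (word, 1) pair per word, matching the pairs shown above. This is an illustration only, not the Hadoop Java API; the function name map_words and the sample input line are assumptions.

```python
def map_words(line):
    """Map step: emit an intermediate (word, 1) pair for every word in one input record."""
    for word in line.split():
        yield (word, 1)

pairs = list(map_words("sunt in culpa"))
print(pairs)  # [('sunt', 1), ('in', 1), ('culpa', 1)]
```

Because each call depends only on its own input record, many such calls can run at the same time on different nodes.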

Reduce
• Input is the list of all intermediate values for a given key
• Reducer aggregates list of intermediate values
• Returns a final key/value pair for output

Reduce
(irure, 1) (in, 3) (ea, 1) (enim, 1) (eu, 1) (Duis, 1) (dolore, 2)
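To show how counts such as (in, 3) come about, the sketch below groups intermediate pairs by key and then sums each group. It is a toy model in Python rather than Hadoop code: the names shuffle and reduce_counts and the sample pairs are assumptions, and in a real job the grouping is done by the framework between the map and reduce phases.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group intermediate values by key before they reach the reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a final (key, total) pair."""
    return (key, sum(values))

intermediate = [("in", 1), ("dolore", 1), ("in", 1), ("ea", 1), ("dolore", 1), ("in", 1)]
result = [reduce_counts(k, v) for k, v in sorted(shuffle(intermediate).items())]
print(result)  # [('dolore', 2), ('ea', 1), ('in', 3)]
```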

Who is using it?
Adobe - Use for data storage and processing - 30 nodes
Facebook - Use for reporting and analytics - 320 nodes
FOX - Use for log analysis and data mining - 140 nodes
Last.fm - Use for chart calculation and log analysis - 27 nodes
New York Times - Use for large scale image conversion - 100 nodes
Yahoo! - Use for Ad systems and Web search - 10,000 nodes

Use Cases
• Video and Image processing
• Log analysis
• Spam/BOT analysis
• Behavioral analytics (CRM)
• Sequential pattern analysis (e.g. understanding long-term customer buying behavior for cross selling and target marketing)

Recommended Hardware
Commodity servers
• 1 RU
• 2 x 4 core CPU
• 4-8GB of RAM using ECC memory
• 4 x 1TB SATA drives
• 1-5TB external storage
Typically arranged in 2 level architecture
• 30/40 nodes per rack
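These figures give a rough idea of storage per rack. The sketch below counts only the internal SATA drives and assumes a replication factor of 3, which is the HDFS default but is not stated in the slides; the function name rack_capacity_tb is hypothetical, and space for the operating system and intermediate data is ignored.

```python
def rack_capacity_tb(nodes_per_rack, drives_per_node=4, tb_per_drive=1, replication=3):
    """Return (raw, usable) terabytes for one rack of internal SATA drives."""
    raw = nodes_per_rack * drives_per_node * tb_per_drive
    # Each block is stored `replication` times, so usable space shrinks accordingly.
    return raw, raw / replication

for nodes in (30, 40):
    raw, usable = rack_capacity_tb(nodes)
    print(f"{nodes} nodes: {raw} TB raw, about {usable:.0f} TB usable")
```

Under these assumptions a rack holds 120 to 160TB raw, or roughly 40 to 53TB of replicated data.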

Challenges
• No version and dependency management.
• Configuration: more than 150 parameters.
• No security against accidents. User identification added after Last.fm deleted a filesystem by accident.
• HDFS is primarily designed for streaming access of large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file.
• Steep learning curve. According to Facebook, using Hadoop was not easy for end users, especially for the ones who were not familiar with MapReduce.
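The cost of small files can be illustrated with a back-of-the-envelope model: one seek per file plus the time to stream the bytes. The 10 ms seek time and 50MB/s throughput below are assumed values chosen for illustration, the function name read_seconds is hypothetical, and the model ignores network hops between datanodes, so it understates the real penalty.

```python
def read_seconds(total_bytes, n_files, seek_seconds=0.010, bytes_per_second=50 * 10**6):
    """Rough read time: one seek per file plus sequential transfer of all bytes."""
    return n_files * seek_seconds + total_bytes / bytes_per_second

GB = 10**9
print(read_seconds(GB, n_files=1))        # one 1 GB file
print(read_seconds(GB, n_files=100_000))  # the same data as 100,000 files of 10 KB
```

With these assumptions the single large file takes about 20 seconds, while the same data split into 100,000 small files takes about 17 minutes, almost all of it spent seeking.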

Questions?
Images:
http://www.flickr.com/photos/labguest/3509303134
http://www.flickr.com/photos/tantrum_dan/3546852841

Federico Cargnelutti
