Slide 1

Slide 1 text

Federico Cargnelutti / BSkyB & Distributed Computing

Slide 2

Slide 2 text

Distributed Computing

Slide 3

Slide 3 text

SETI@Home
Search for Extra-Terrestrial Intelligence
•  Prove the viability of distributed grid computing

Slide 4

Slide 4 text

What problem are we trying to solve?
Distributed Computing

Slide 5

Slide 5 text

Counts of all the distinct words:
•  in a file?
•  in a directory?
•  on the Web?

Slide 6

Slide 6 text

We need to process 100TB datasets
•  On 1 node:
   o  Scanning @ 50MB/s = 23 days
•  On a 1000-node cluster:
   o  Scanning @ 50MB/s = 33 min
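The arithmetic behind those figures: 100 TB at 50 MB/s is 100,000,000 MB / 50 MB/s = 2,000,000 seconds, roughly 23 days on a single node; split evenly across 1,000 nodes that drops to about 2,000 seconds, roughly 33 minutes.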

Slide 7

Slide 7 text

We need a framework for distributed computing

Slide 8

Slide 8 text

We need a new paradigm

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

Hadoop is an open-source Java framework for running applications on large clusters of commodity hardware.

Slide 11

Slide 11 text

Scalable
Hadoop can reliably store and process petabytes of data.
Economical
Hadoop distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
Efficient
Hadoop can process the distributed data in parallel on the nodes where the data is located.
Reliable
Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.

Slide 12

Slide 12 text

Hadoop Components
Hadoop Distributed File System (HDFS)
•  Java, Shell, C and HTTP APIs
Hadoop MapReduce
•  Java and Streaming APIs
Hadoop on Demand
•  Tools to manage dynamic setup and teardown of Hadoop nodes

Slide 13

Slide 13 text

Other Tools
HBase
Table storage on top of HDFS, modeled after Google's Bigtable
Pig
Language for dataflow programming
Hive
SQL interface to structured data stored in HDFS

Slide 14

Slide 14 text

Hadoop MapReduce
•  Mappers and Reducers are allocated
•  Code is shipped to nodes
•  Mappers and Reducers are run on the same machines as DataNodes
•  Two major daemons: JobTracker and TaskTracker

Slide 15

Slide 15 text

JobTracker
•  Long-lived master daemon which distributes tasks
•  Maintains a history of job execution

Slide 16

Slide 16 text

•  Set up a multi-node Hadoop cluster using the Hadoop Distributed File System (HDFS)
•  Create a hierarchical HDFS with directories and files.
•  Use the Hadoop API to store a large text file (see the sketch below).
•  Create a MapReduce application.
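As a rough illustration of the "store a large text file" step, here is a minimal sketch using the HDFS Java FileSystem API; the local and HDFS paths and the class name are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Copies a local text file into HDFS through the Java FileSystem API.
    public class HdfsUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);        // handle to the configured HDFS
            Path local = new Path("lorem.txt");          // hypothetical local file
            Path remote = new Path("/data/lorem.txt");   // hypothetical HDFS destination
            fs.copyFromLocalFile(local, remote);
            System.out.println("Stored " + fs.getFileStatus(remote).getLen() + " bytes in HDFS");
        }
    }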

Slide 17

Slide 17 text

Map
•  Mapper takes an input key/value pair
•  Does something to its input
•  Emits an intermediate key/value pair
•  One call per input record
•  Fully data-parallel
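To make the Map step concrete, here is a minimal word-count Mapper sketch against the standard org.apache.hadoop.mapreduce Java API; the class name WordCountMapper is made up for illustration.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Called once per input record; emits an intermediate (word, 1) pair per token.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // intermediate key/value pair
            }
        }
    }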

Slide 18

Slide 18 text

Map
(in, 1)  (in, 1)  (sunt, 1)  (in, 1)  (elit, 1)  (sed, 1)  (eiusmod, 1)

Slide 19

Slide 19 text

Reduce
•  Input is the full list of intermediate values for a given key
•  Reducer aggregates the list of intermediate values
•  Returns a final key/value pair for output
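The matching Reduce side, again a sketch assuming the org.apache.hadoop.mapreduce API; WordCountReducer is a hypothetical class name.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Called once per distinct key with the full list of intermediate values.
    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();                        // aggregate intermediate values
            }
            context.write(word, new IntWritable(sum));     // final (word, total) pair
        }
    }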

Slide 20

Slide 20 text

Reduce
(irure, 1)  (in, 3)  (ea, 1)  (enim, 1)  (eu, 1)  (Duis, 1)  (dolore, 2)
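Putting the two halves together, a minimal driver that configures and submits the job; it is a sketch assuming the hypothetical WordCountMapper and WordCountReducer classes above, with input and output paths taken from the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Configures and submits the job; the JobTracker then schedules map and
    // reduce tasks on TaskTrackers, preferably where the input blocks live.
    public class WordCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");     // Job.getInstance(conf) in later releases
            job.setJarByClass(WordCount.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input path
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, it would typically be launched with something like: hadoop jar wordcount.jar WordCount /data/lorem.txt /data/wordcount-out (paths are hypothetical).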

Slide 21

Slide 21 text

Adobe
-  Use for data storage and processing
-  30 nodes
Facebook
-  Use for reporting and analytics

Slide 22

Slide 22 text

•  Video and image processing
•  Log analysis
•  Spam/bot analysis
•  Behavioral analytics

Slide 23

Slide 23 text

Recommended Hardware
Commodity servers
•  1 RU
•  2 x 4-core CPUs
•  4-8GB of RAM using ECC memory
•  4 x 1TB SATA drives
•  1-5TB external storage
Typically arranged in a 2-level architecture
•  30/40 nodes per rack

Slide 24

Slide 24 text

•  No version and dependency management.
•  Configuration

Slide 25

Slide 25 text

Images:
http://www.flickr.com/photos/labguest/3509303134
http://www.flickr.com/photos/tantrum_dan/3546852841
Questions?