Slide 1

Slide 1 text

Federico Cargnelutti / BSkyB & Distributed Computing

Slide 2

Slide 2 text

Distributed Computing

Slide 3

Slide 3 text

SETI@Home
Search for Extra-Terrestrial Intelligence
•  Prove the viability of distributed grid computing

Slide 4

Slide 4 text

What problem are we trying to solve?
Distributed Computing

Slide 5

Slide 5 text

Counts of all the distinct words:
•  in a file?
•  in a directory?
•  on the Web?

Slide 6

Slide 6 text

We need to process 100TB datasets
•  On 1 node:
   o  Scanning @ 50MB/s = 23 days
•  On a 1000-node cluster:
   o  Scanning @ 50MB/s = 33 min
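The arithmetic behind those figures: 100 TB at 50 MB/s is 100,000,000 MB / 50 MB/s = 2,000,000 seconds, roughly 23 days on a single node; split evenly across 1,000 nodes that drops to about 2,000 seconds, roughly 33 minutes.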

Slide 7

Slide 7 text

We need a framework for distributed computing

Slide 8

Slide 8 text

We need a new paradigm

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

Hadoop is an open-source Java framework for running applications on large clusters of commodity hardware.

Slide 11

Slide 11 text

Scalable
Hadoop can reliably store and process petabytes of data.
Economical
Hadoop distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
Efficient
Hadoop can process the distributed data in parallel on the nodes where the data is located.
Reliable
Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.

Slide 12

Slide 12 text

Hadoop Components
Hadoop Distributed File System (HDFS)
•  Java, Shell, C and HTTP APIs
Hadoop MapReduce
•  Java and Streaming APIs
Hadoop on Demand
•  Tools to manage dynamic setup and teardown of Hadoop nodes

Slide 13

Slide 13 text

Other Tools
HBase
Table storage on top of HDFS, modeled after Google's Bigtable
Pig
Language for dataflow programming
Hive
SQL interface to structured data stored in HDFS

Slide 14

Slide 14 text

Hadoop MapReduce
•  Mappers and Reducers are allocated
•  Code is shipped to nodes
•  Mappers and Reducers are run on the same machines as DataNodes
•  Two major daemons: JobTracker and TaskTracker

Slide 15

Slide 15 text

JobTracker
•  Long-lived master daemon which distributes tasks
•  Maintains a history of job execution

Slide 16

Slide 16 text

•  Set up a multi-node Hadoop cluster using the Hadoop Distributed File System (HDFS)
•  Create a hierarchical HDFS with directories and files.
•  Use the Hadoop API to store a large text file (see the sketch below).
•  Create a MapReduce application.
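As a rough illustration of the "store a large text file" step, here is a minimal sketch using the HDFS Java FileSystem API; the local and HDFS paths and the class name are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Copies a local text file into HDFS through the Java FileSystem API.
    public class HdfsUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);        // handle to the configured HDFS
            Path local = new Path("lorem.txt");          // hypothetical local file
            Path remote = new Path("/data/lorem.txt");   // hypothetical HDFS destination
            fs.copyFromLocalFile(local, remote);
            System.out.println("Stored " + fs.getFileStatus(remote).getLen() + " bytes in HDFS");
        }
    }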

Slide 17

Slide 17 text

Map
•  Mapper takes an input key/value pair
•  Does something to its input
•  Emits an intermediate key/value pair
•  One call per input record
•  Fully data-parallel
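To make the Map step concrete, here is a minimal word-count Mapper sketch against the standard org.apache.hadoop.mapreduce Java API; the class name WordCountMapper is made up for illustration.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Called once per input record; emits an intermediate (word, 1) pair per token.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // intermediate key/value pair
            }
        }
    }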

Slide 18

Slide 18 text

Map
(in, 1)  (in, 1)  (sunt, 1)  (in, 1)  (elit, 1)  (sed, 1)  (eiusmod, 1)

Slide 19

Slide 19 text

Reduce
•  Input is the full list of intermediate values for a given key
•  Reducer aggregates the list of intermediate values
•  Returns a final key/value pair for output
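The matching Reduce side, again a sketch assuming the org.apache.hadoop.mapreduce API; WordCountReducer is a hypothetical class name.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Called once per distinct key with the full list of intermediate values.
    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();                        // aggregate intermediate values
            }
            context.write(word, new IntWritable(sum));     // final (word, total) pair
        }
    }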

Slide 20

Slide 20 text

Reduce
(irure, 1)  (in, 3)  (ea, 1)  (enim, 1)  (eu, 1)  (Duis, 1)  (dolore, 2)
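Putting the two halves together, a minimal driver that configures and submits the job; it is a sketch assuming the hypothetical WordCountMapper and WordCountReducer classes above, with input and output paths taken from the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Configures and submits the job; the JobTracker then schedules map and
    // reduce tasks on TaskTrackers, preferably where the input blocks live.
    public class WordCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");     // Job.getInstance(conf) in later releases
            job.setJarByClass(WordCount.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input path
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, it would typically be launched with something like: hadoop jar wordcount.jar WordCount /data/lorem.txt /data/wordcount-out (paths are hypothetical).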

Slide 21

Slide 21 text

Adobe
-  Use for data storage and processing
-  30 nodes
Facebook
-  Use for reporting and analytics

Slide 22

Slide 22 text

•  Video and image processing
•  Log analysis
•  Spam/bot analysis
•  Behavioral analytics

Slide 23

Slide 23 text

Recommended Hardware
Commodity servers
•  1 RU
•  2 x 4-core CPUs
•  4-8GB of RAM using ECC memory
•  4 x 1TB SATA drives
•  1-5TB external storage
Typically arranged in a 2-level architecture
•  30/40 nodes per rack

Slide 24

Slide 24 text

•  No version and dependency management.
•  Configuration

Slide 25

Slide 25 text

Images:
http://www.flickr.com/photos/labguest/3509303134
http://www.flickr.com/photos/tantrum_dan/3546852841
Questions?