
Sky - Hadoop & Distributed Computing


Federico Cargnelutti

August 04, 2009

Transcript

  1. Federico Cargnelutti / BSkyB
     Hadoop & Distributed Computing

  2. Distributed computing splits pieces of a program among several
     computers. One project in particular has proven that the concept
     works extremely well.

  3. SETI@home
     Search for Extra-Terrestrial Intelligence
     •  Prove the viability of the distributed grid computing concept (succeeded)
     •  Detect intelligent life outside Earth (failed)

  4. What problem are we trying to solve?
     Distributed Computing

  5. Counts of all the distinct words
     •  in a file?
     •  in a directory?
     •  on the Web?

  6. We  need  to  process  100TB  datasets  
    •  On  1  node:  
    o  Scanning  @  50MB/s  =  23  days  
    •  On  1000  node  cluster:  
    o  Scanning  @  50MB/s  =  33  min  
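    (The arithmetic: 100 TB at 50 MB/s is 2,000,000 seconds, roughly
    23 days; split across 1,000 nodes, each node scans about 100 GB
    in 2,000 seconds, roughly 33 minutes.)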


  7. We need a framework for distributed computing

  8. We  need  a  new  paradigm
     



  10. Hadoop is an open-source Java framework for running
      applications on large clusters of commodity hardware.

  11. Scalable  
    Hadoop  can  reliably  store  and  process  petabytes  of  data.  
    Economical  
    Hadoop  distributes  the  data  and  processing  across  clusters  of  commonly  
    available  computers.  These  clusters  can  number  into  the  thousands  of  
    nodes.  
    Efficient  
    Hadoop  can  process  the  distributed  data  in  parallel  on  the  nodes  where  
    the  data  is  located.    
    Reliable
    Hadoop automatically maintains multiple copies of data and
    automatically redeploys computing tasks based on failures.

  12. Hadoop Components
      Hadoop Distributed File System (HDFS)
      •  Java, Shell, C and HTTP APIs
      Hadoop MapReduce
      •  Java and Streaming APIs
      Hadoop on Demand
      •  Tools to manage dynamic setup and teardown of Hadoop
      nodes

  13. HBase  
    Table storage on top of HDFS, modeled after Google's Bigtable
    Pig  
    Language  for  dataflow  programming  
    Hive  
    SQL  interface  to  structured  data  stored  in  HDFS  
    Other  Tools  


  14. •  Mappers  and  Reducers  are  allocated  
    •  Code  is  shipped  to  nodes    
    •  Mappers and Reducers are run on the same machines
    as DataNodes
    •  Two  major  daemons:  JobTracker  and  TaskTracker    
    Hadoop  MapReduce
     


  15. JobTracker  
    •   Long-­‐lived  master  daemon  which  distributes  tasks    
    •  Maintains a job history of job execution
    TaskTrackers
    •  Long-­‐lived  client  daemon  which  executes  Map  and  
    Reduce  tasks    
    Hadoop  MapReduce
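    A minimal sketch of how a job reaches these daemons, using the
    classic org.apache.hadoop.mapred API of the era (this is the stock
    word-count driver pattern, not BSkyB code; the Map and Reduce
    classes are sketched under slides 17 and 19 below):

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.*;

        public class WordCount {
          public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");

            conf.setOutputKeyClass(Text.class);          // final key type
            conf.setOutputValueClass(IntWritable.class); // final value type

            conf.setMapperClass(Map.class);
            conf.setReducerClass(Reduce.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // Submits the job to the JobTracker, which ships the code
            // to TaskTrackers and schedules Map/Reduce tasks on them.
            JobClient.runJob(conf);
          }
        }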
     


  16. •  Set up a multi-node Hadoop cluster using the Hadoop
      Distributed File System (HDFS).
      •  Create a hierarchical HDFS with directories and files.
      •  Use the Hadoop API to store a large text file (see the
      sketch below).
      •  Create a MapReduce application.
      Hadoop MapReduce
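    A minimal sketch of the "store a large text file" step through the
    HDFS Java API; the directory and file paths below are hypothetical
    placeholders:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsPut {
          public static void main(String[] args) throws Exception {
            // Picks up the cluster's fs.default.name from the Hadoop
            // configuration files on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Create a directory hierarchy, then copy a large local
            // text file into it.
            Path dir = new Path("/user/demo/books");
            fs.mkdirs(dir);
            fs.copyFromLocalFile(new Path("/tmp/lorem.txt"),
                                 new Path(dir, "lorem.txt"));
            fs.close();
          }
        }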

  17. •  Mapper  takes  input  key/value  pair  
    •  Does  something  to  its  input  
    •  Emits  intermediate  key/value  pair    
    •  One  call  per  input  record  
    •  Fully  data-­‐parallel  
    Map  
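    A sketch of such a Mapper in the classic org.apache.hadoop.mapred
    API (essentially the canonical word-count mapper, matching the
    (word, 1) pairs on the next slide):

        import java.io.IOException;
        import java.util.StringTokenizer;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapred.*;

        public class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable one = new IntWritable(1);
          private final Text word = new Text();

          // Called once per input record: key is the line's byte
          // offset, value is the line of text.
          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> output,
                          Reporter reporter) throws IOException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              output.collect(word, one);  // emit intermediate (word, 1)
            }
          }
        }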


  18. (in,  1)    
    (in,  1)    
    (sunt,  1)    
    (in,  1)    
    (elit,  1)    
    (sed,  1)    
    (eiusmod,  1)    
    Map  


  19. •  Input is the list of all intermediate values for a given key
    •  Reducer  aggregates  list  of  intermediate  values    
    •  Returns  a  final  key/value  pair  for  output  
    Reduce  
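    A matching Reducer sketch in the same classic API: it sums the
    intermediate counts for each word and emits the final (word, count)
    pair:

        import java.io.IOException;
        import java.util.Iterator;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapred.*;

        public class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
          // Called once per distinct key, with an iterator over every
          // intermediate value emitted for that key.
          public void reduce(Text key, Iterator<IntWritable> values,
                             OutputCollector<Text, IntWritable> output,
                             Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
              sum += values.next().get();              // aggregate the 1s
            }
            output.collect(key, new IntWritable(sum)); // final (word, count)
          }
        }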


  20. (irure,  1)    
    (in,  3)    
    (ea,  1)    
    (enim,  1)    
    (eu,  1)    
    (Duis,  1)    
    (dolore,  2)    
    Reduce  

  21. Adobe
      -  Use for data storage and processing
      -  30 nodes
      Facebook
      -  Use for reporting
      -  320 nodes
      FOX
      -  Use for log analysis and data mining
      -  140 nodes
      Last.fm
      -  Use for chart calculation
      -  27 nodes
      New York Times
      -  Use for large-scale image conversion
      -  100 nodes
      Yahoo!
      -  Use for Ad systems and Web search
      -  10,000 nodes
      Who is using it?

  22. •  Video and Image processing
      •  Log analysis
      •  Spam/BOT analysis
      •  Behavioral analytics
      •  Sequential analysis of customer buying behavior for cross
      selling and target marketing
      Use Cases

  23. Commodity  servers  
    •  1  RU  
    •  2  x  4  core  CPU  
    •  4-­‐8GB  of  RAM  using  ECC  memory  
    •  4  x  1TB  SATA  drives    
    •  1-­‐5TB  external  storage  
    Typically arranged in a 2-level architecture
    •  30/40  nodes  per  rack    
    Recommended  Hardware  


  24. •  No version and dependency management.
      •  Configuration management is manual.
      •  No security against accidents. User identification was
      introduced after Last.fm deleted a filesystem by accident.
      •  HDFS is primarily designed for streaming access of large files.
      Reading through small files normally causes lots of seeks and lots
      of hopping from datanode to datanode to retrieve each small
      file.
      •  Steep learning curve. According to Facebook, using Hadoop was
      not easy for end users, especially for the ones who were not
      familiar with MapReduce.
      Challenges

  25. Images:
      http://www.flickr.com/photos/labguest/3509303134
      http://www.flickr.com/photos/tantrum_dan/3546852841
      Questions?