An Introduction To Hadoop

Shankar
June 06, 2011

Transcript

1. - State of the Data
   - What is Hadoop
   - Hadoop Ecosystem
   - References
2. - Data driven businesses
   - Businesses have been collecting information all the time
   - Mine more == Collect more (and vice versa)
   - Challenges
     - Application complexities
     - Data growth
     - Infrastructure
     - Economics
   - Need of the day
4. - Applications
     - Searches, message posts, comments, emails, blogs, photos, video clips, product listings
     - ERP, CRM, databases, internal applications, customer/consumer-facing products
     - Mobile
   - Context
     - Web, customers, products, business systems, processes, services
   - Support systems
     - CRM, SOA, recommendation systems/processes, data warehouses, business intelligence, BPM
6. - Drivers
     - ROI
     - Customer retention
     - Product affinity
     - Market trends
     - Research analysis
     - Customer/consumer analytics
   - Process
     - Clustering
     - Classification
     - Build relationships
     - Regression
   - Types
     - Structured
     - Semi-structured
     - Unstructured
8. - Complex applications
     - Data integration is a good but complex problem to solve
   - Data growth
     - Growth is exponential
   - Infrastructure
     - Availability
     - Unscalable hardware
   - Economics
     - Managing high data volume comes at a price
     - Failures are very costly
9. - System that can handle high-volume data
   - System that can perform complex operations
   - Scalable
   - Robust
   - Highly available
   - Fault tolerant
   - Cheap
10. - Top-level Apache project
    - Open source
    - Inspired by Google's white papers on Map/Reduce (MR) and the Google File System (GFS)
    - Originally developed to support the Apache Nutch search engine
    - Software framework written in Java
    - Designed
      - For sophisticated analysis
      - To deal with structured and unstructured complex data
11. - Runs on commodity hardware
    - Shared-nothing architecture
    - Scale hardware whenever you want
    - System compensates for hardware scaling and issues (if any)
    - Runs large-scale, high-volume data processes
    - Scales well with complex analysis jobs
    - Handles failures
    - Ideal for consolidating data from both new and legacy data sources
    - Value to the business
12. - HDFS: Hadoop Distributed File System
    - Map/Reduce: software framework for clustered, distributed data processing
    - ZooKeeper: distributed coordination service
    - Avro: data serialization
    - Chukwa: data collection system for monitoring distributed systems
    - HBase: data storage for large distributed tables
    - Hive: data warehousing infrastructure
    - Pig: high-level query language
13. - Master/slave architecture
    - Runs on commodity hardware
    - Fault tolerant
    - Handles large volumes of data
    - Provides high throughput
    - Streaming data access
    - Simple file coherency model
    - Portable to heterogeneous hardware and software
    - Robust
      - Handles disk failures, replication (and re-replication)
      - Performs cluster rebalancing, data integrity checks
14. - Name node
      - File system operations
      - Maps data blocks to data nodes
    - Data node
      - Processes reads/writes
      - Handles data blocks
      - Replication
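The name-node/data-node split above can be sketched as a toy model (plain Python, purely illustrative; the names and sizes here are invented and bear no relation to Hadoop's real Java API): the name node holds only metadata mapping each file's blocks to the data nodes that store them, while the data nodes hold the actual bytes and replication lets reads survive a node failure.

```python
# Toy model of the HDFS master/slave split (illustrative sketch, not Hadoop's API).
BLOCK_SIZE = 8       # bytes per block for the demo (real HDFS blocks are tens of MB)
REPLICATION = 3      # copies kept of every block

class DataNode:
    """Slave: stores raw block contents and serves reads/writes."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}                    # block_id -> bytes

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]

class NameNode:
    """Master: tracks which data nodes hold each block of each file."""
    def __init__(self, data_nodes):
        self.data_nodes = data_nodes
        self.block_map = {}                 # filename -> [(block_id, [node_ids])]
        self._next_block = 0

    def write_file(self, name, data):
        """Split data into blocks and place each on REPLICATION data nodes."""
        entries = []
        for off in range(0, len(data), BLOCK_SIZE):
            chunk = data[off:off + BLOCK_SIZE]
            block_id = self._next_block
            self._next_block += 1
            # round-robin placement across data nodes
            targets = [self.data_nodes[(block_id + i) % len(self.data_nodes)]
                       for i in range(REPLICATION)]
            for node in targets:
                node.write_block(block_id, chunk)
            entries.append((block_id, [n.node_id for n in targets]))
        self.block_map[name] = entries

    def read_file(self, name, dead=()):
        """Reassemble a file, skipping data nodes that have failed."""
        out = b""
        for block_id, node_ids in self.block_map[name]:
            live = [n for n in node_ids if n not in dead]
            node = next(d for d in self.data_nodes if d.node_id == live[0])
            out += node.read_block(block_id)
        return out

nodes = [DataNode(i) for i in range(4)]
nn = NameNode(nodes)
nn.write_file("log.txt", b"hadoop handles failures gracefully")
# With one data node down, every block still has a live replica to read from.
print(nn.read_file("log.txt", dead={0}))
```

Note that the client-visible data path in real HDFS goes directly to data nodes; the name node only answers metadata queries, which is what keeps the master from becoming an I/O bottleneck.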
15. - Tagged by a job
    - Splits the input data-set into separate chunks
    - Processed by map tasks, in parallel
    - Sorts the output of the maps
    - Processed by reduce tasks, in parallel
    - Typically stored and processed in a file system
    - Framework takes care of
      - Scheduling tasks
      - Monitoring
      - Re-executing failed tasks
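The split, map, sort/shuffle, reduce flow above can be sketched in plain Python (a single-process simulation of the programming model, not Hadoop's Java API, and the function names here are invented for the demo); word counting is the canonical example.

```python
from itertools import groupby
from operator import itemgetter

# Single-process simulation of the MapReduce flow:
# split -> map (parallelizable) -> sort/shuffle -> reduce (parallelizable).
# Real Hadoop runs the map and reduce tasks as distributed processes over HDFS.

def map_task(chunk):
    """Emit intermediate (key, value) pairs: one ('word', 1) per word."""
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_task(key, values):
    """Combine all intermediate values for one key."""
    return (key, sum(values))

def run_job(input_data, n_splits=3):
    # 1. Split the input data-set into separate chunks.
    lines = input_data.splitlines()
    chunks = [" ".join(lines[i::n_splits]) for i in range(n_splits)]
    # 2. Map phase: each chunk is processed independently.
    mapped = [pair for chunk in chunks for pair in map_task(chunk)]
    # 3. Sort/shuffle: order map output so equal keys are adjacent.
    mapped.sort(key=itemgetter(0))
    # 4. Reduce phase: each key group is processed independently.
    return dict(reduce_task(k, [v for _, v in group])
                for k, group in groupby(mapped, key=itemgetter(0)))

counts = run_job("hadoop scales\nhadoop is fault tolerant\nhadoop is cheap")
print(counts["hadoop"])   # -> 3
```

The framework responsibilities listed on the slide (scheduling, monitoring, re-executing failed tasks) are exactly what this sketch leaves out: in Hadoop the user supplies only the map and reduce functions, and the framework handles everything between them.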