

An Introduction To Hadoop

Shankar

June 06, 2011

Transcript

1. • State of the Data
   • What is Hadoop
   • Hadoop Ecosystem
   • References
2. • Data-driven businesses
   • Businesses have been collecting information all the time
   • Mine more == Collect more (and vice-versa)
   • Challenges
     • Application complexities
     • Data growth
     • Infrastructure
     • Economics
   • Need of the day
3. • Data-driven businesses
   • Businesses have been collecting information all the time
   • Mine more == Collect more (and vice-versa)
   • Challenges
     • Application complexities
     • Data growth
     • Infrastructure
     • Economics
4. • Applications
     • Searches, message posts, comments, emails, blogs, photos, video clips, product listings
     • ERP, CRM, databases, internal applications, customer/consumer-facing products
     • Mobile
   • Context
     • Web, customers, products, business systems, processes, services
   • Support systems
     • CRM, SOA, recommendation systems/processes, data warehouses, business intelligence, BPM
5. • Data-driven businesses
   • Businesses have been collecting information all the time
   • Mine more == Collect more (and vice-versa)
   • Challenges
     • Application complexities
     • Data growth
     • Infrastructure
     • Economics
6. • Drivers
     • ROI
     • Customer retention
     • Product affinity
     • Market trends
     • Research analysis
     • Customer/consumer analytics
   • Process
     • Clustering
     • Classification
     • Building relationships
     • Regression
   • Types
     • Structured
     • Semi-structured
     • Unstructured
7. • Data-driven businesses
   • Businesses have been collecting information all the time
   • Mine more == Collect more (and vice-versa)
   • Challenges
     • Application complexities
     • Data growth
     • Infrastructure
     • Economics
8. • Complex applications
     • Data integration is a valuable but complex problem to solve
   • Data growth
     • Growth is exponential
   • Infrastructure
     • Availability
     • Unscalable hardware
   • Economics
     • Managing high data volumes comes at a price
     • Failures are very costly
9. • A system that can handle high-volume data
   • A system that can perform complex operations
   • Scalable
   • Robust
   • Highly available
   • Fault tolerant
   • Cheap
10. • Top-level Apache project
    • Open source
    • Inspired by Google's white papers on Map/Reduce (MR) and the Google File System (GFS)
    • Originally developed to support the Apache Nutch search engine
    • Software framework written in Java
    • Designed
      • For sophisticated analysis
      • To deal with structured and unstructured complex data
11. • Runs on commodity hardware
    • Shared-nothing architecture
    • Scale hardware whenever you want
      • The system compensates for hardware scaling and issues (if any)
    • Runs large-scale, high-volume data processes
    • Scales well with complex analysis jobs
    • Handles failures
    • Ideal for consolidating data from both new and legacy data sources
    • Value to the business
12. • HDFS        Hadoop Distributed File System
    • Map/Reduce  Software framework for clustered, distributed data processing
    • ZooKeeper   Distributed coordination service
    • Avro        Data serialization
    • Chukwa      Data collection system for monitoring distributed systems
    • HBase       Data storage for distributed large tables
    • Hive        Data warehousing infrastructure
    • Pig         High-level query language
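For a concrete taste of one ecosystem component, here is a minimal sketch (not from the deck) of the HBase Java client of that era writing and reading a single cell. The "users" table, "info" column family, and row key are hypothetical, and the table is assumed to already exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath to locate the cluster
    Configuration conf = HBaseConfiguration.create();
    // Hypothetical table "users" with column family "info"
    HTable table = new HTable(conf, "users");

    // Write one cell: row "user1", column info:email
    Put put = new Put(Bytes.toBytes("user1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("email"),
            Bytes.toBytes("user1@example.com"));
    table.put(put);

    // Read it back
    Result result = table.get(new Get(Bytes.toBytes("user1")));
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"))));

    table.close();
  }
}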
13. • Master/slave architecture
    • Runs on commodity hardware
    • Fault tolerant
    • Handles large volumes of data
    • Provides high throughput
    • Streaming data access
    • Simple file-coherency model
    • Portable to heterogeneous hardware and software
    • Robust
      • Handles disk failures, replication (and re-replication)
      • Performs cluster rebalancing and data-integrity checks
14. Name node
    • File system namespace operations
    • Maps blocks to data nodes
    Data node
    • Processes read/write requests
    • Handles data blocks
    • Replication
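To make the split concrete: a client asks the name node to resolve a path, then streams block data to and from the data nodes. Below is a minimal sketch (not in the original slides) using the org.apache.hadoop.fs.FileSystem API; the path and contents are placeholders, and a core-site.xml pointing at the cluster is assumed to be on the classpath:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
  public static void main(String[] args) throws Exception {
    // Connects to the file system named in core-site.xml (fs.default.name)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write a small file; the name node assigns blocks,
    // the data nodes store (and replicate) them
    Path path = new Path("/tmp/hello.txt");
    FSDataOutputStream out = fs.create(path);
    out.write("hello hdfs\n".getBytes("UTF-8"));
    out.close();

    // Read it back; block data is streamed from the data nodes
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
    System.out.println(in.readLine());
    in.close();
  }
}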
15. • Work is submitted as a job
    • The input data-set is split into separate chunks
    • Chunks are processed by map tasks, in parallel
    • The framework sorts the output of the maps
    • Sorted map output is processed by reduce tasks, in parallel
    • Input and output are typically stored in a file system
    • The framework takes care of
      • Scheduling tasks
      • Monitoring them
      • Re-executing failed tasks
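The canonical example of this flow is WordCount, sketched below in the org.apache.hadoop.mapreduce API of the time (adapted from the example that ships with Hadoop). Map tasks emit (word, 1) pairs in parallel, the framework sorts and groups them by key, and reduce tasks sum the counts; input and output paths are taken from the command line:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: emit (word, 1) for every word in the input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each word
  // (the framework has already sorted and grouped the map output by key)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Setting the reducer class as the combiner lets partial sums be computed map-side before the shuffle; scheduling, monitoring, and re-execution of failed tasks are left entirely to the framework, as the slide notes.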