
Introduction To Hadoop




Marc Cluet

June 18, 2013


Transcript

  1. What we’ll cover
     - Understand Hadoop components
     - Understand the different technologies involved
     - Embrace Big Data!
     Lynx Consultants © 2013
  2. What is Big Data?
     - SQL has a limited ability to process changing data
       - SQL schemas are the truth; the data needs to fit them
  7. What is Big Data?
     - Big Data is the solution!
       - Data can be truly dynamic
       - Designed to handle terabytes of data
       - Designed for fault tolerance and securing data
       - Designed around exploiting hardware to the fullest
       - Designed around Map/Reduce
  10. What is Hadoop?
      - Hadoop is one of the big players for Big Data
        - Developed as an open source implementation of the ideas in Google's GFS and MapReduce papers (Google BigTable is the model for HBase)
        - Mainly developed at Yahoo!
        - Current companies behind it: Hortonworks and Cloudera
  11. What are the features of Hadoop?
      - HDFS – Hadoop Distributed File System
        - HDFS is a filesystem distributed across many nodes
        - Keeps several copies of your data (default: 3)
        - If a node goes down, HDFS re-replicates its blocks on the remaining nodes so every block gets back to the full number of copies
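The replication behaviour described on this slide can be sketched in a few lines of Python. This is only a toy model, not HDFS code: the function names (`place_blocks`, `handle_node_failure`), the node names and the random placement are all assumptions made for illustration, and real HDFS uses a rack-aware placement policy.

```python
import random

REPLICATION = 3  # default number of copies, as stated on the slide


def place_blocks(blocks, nodes, replication=REPLICATION, seed=0):
    """Assign each block to `replication` distinct nodes (toy placement)."""
    rng = random.Random(seed)
    return {block: set(rng.sample(sorted(nodes), replication)) for block in blocks}


def handle_node_failure(placement, nodes, failed, replication=REPLICATION, seed=0):
    """Forget the failed node and copy under-replicated blocks to other live nodes."""
    rng = random.Random(seed)
    live = set(nodes) - {failed}
    repaired = {}
    for block, holders in placement.items():
        holders = holders - {failed}
        candidates = sorted(live - holders)
        # Copy to as many new nodes as needed (or as many as are available).
        missing = min(replication - len(holders), len(candidates))
        repaired[block] = holders | set(rng.sample(candidates, missing))
    return repaired


# Hypothetical five-node cluster holding three blocks.
nodes = {"node1", "node2", "node3", "node4", "node5"}
placement = place_blocks(["blk_1", "blk_2", "blk_3"], nodes)
after = handle_node_failure(placement, nodes, failed="node2")
```

After the simulated failure every block is again held by three live nodes, which is the property the slide is describing.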
  12. What are the features of Hadoop?
      - HDFS – Hadoop Distributed File System
  13. What are the features of Hadoop?
      - HDFS – Hadoop Distributed File System
      - HBase – Hadoop NoSQL Database
        - Schemaless key-value storage
        - All data exportable in JSON
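To make "schemaless key-value storage" concrete, here is a minimal sketch that uses a plain Python dictionary as a stand-in for a table. It does not use the HBase client API; the `put` helper, the row keys and the `family:qualifier` column names are hypothetical examples.

```python
import json


def put(table, row_key, column, value):
    """Store one cell; rows are created on demand and need not share columns."""
    table.setdefault(row_key, {})[column] = value


table = {}
put(table, "user1", "info:name", "Ada")
put(table, "user1", "info:email", "ada@example.com")
put(table, "user2", "info:name", "Bob")
put(table, "user2", "stats:logins", 7)  # a column that user1 does not have

# Because rows are just nested key-value pairs, exporting them as JSON is direct.
exported = json.dumps(table, sort_keys=True)
```

No schema was declared up front: each row simply carries whichever columns were written to it.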
  14. What are the features of Hadoop?
      - HDFS – Hadoop Distributed File System
      - HBase – Hadoop NoSQL Database
  15. What are the features of Hadoop?
      - HDFS – Hadoop Distributed File System
      - HBase – Hadoop NoSQL Database
      - Map/Reduce – The key to it all
        - This was invented by Google
        - Given a dataset, we Map every record that matches a criterion to a key and a value
        - Then we Reduce the values for each key to a result
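The Map and Reduce steps on this slide can be illustrated with a single-process Python sketch. It is not Hadoop's Java API; the helper names (`map_phase`, `shuffle`, `reduce_phase`, `error_mapper`) and the sample log lines are made up for the example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records, mapper):
    """Run the mapper over every record and collect the (key, value) pairs it emits."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group emitted values by key, as the framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Fold each key's values down to a single result."""
    return {key: reduce(reducer, values) for key, values in groups.items()}


def error_mapper(line):
    """Emit (host, 1) only for lines that match the criterion: level is ERROR."""
    host, level, _message = line.split(" ", 2)
    if level == "ERROR":
        yield host, 1


logs = [
    "web1 ERROR disk full",
    "web2 INFO started",
    "web1 ERROR timeout",
    "web3 ERROR disk full",
]
counts = reduce_phase(shuffle(map_phase(logs, error_mapper)), lambda a, b: a + b)
# counts == {"web1": 2, "web3": 1}
```

On a real cluster the map calls run in parallel on the nodes that hold the data, and the grouping step moves values across the network; the logic per record and per key stays this small.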
  16. What are the features of Hadoop?
      - Map/Reduce – The key to it all
  17. What are the features of Hadoop?
      - HDFS – Hadoop Distributed File System
      - HBase – Hadoop NoSQL Database
      - Map/Reduce – The key to it all
      - Hive – SQL for NoSQL
        - Hive provides a SQL-like language called HiveQL
        - Provides a good entry point for SQL users :)
  18. What are the features of Hadoop?
      - HDFS – Hadoop Distributed File System
      - HBase – Hadoop NoSQL Database
      - Map/Reduce – The key to it all
      - Hive – SQL for NoSQL
      - Pig – Map/Reduce made easy
        - Produces results from scripts written in a small data-flow language (Pig Latin)
        - Reinvents SQL, in a way
  19. What are the features of Hadoop?
      - HDFS – Hadoop Distributed File System
      - HBase – Hadoop NoSQL Database
      - Map/Reduce – The key to it all
      - Hive – SQL for NoSQL
      - Pig – Map/Reduce made easy
      - Flume – Fault-tolerant transport
  20. What are the features of Hadoop?
      - Flume
        - Is divided into Sources, Channels and Sinks
        - Can have several of each, which makes it fault tolerant
        - Many sources!
          - Avro, Exec, JMS, Syslog, HTTP, NetCat, Your Own (Java)
  21. What are the features of Hadoop?
      - Flume
        - Is divided into Sources, Channels and Sinks
        - Can have several of each, which makes it fault tolerant
        - Many sources!
        - Many channels!
          - Memory, File, Your Own (Java)
  22. What are the features of Hadoop?
      - Flume
        - Is divided into Sources, Channels and Sinks
        - Can have several of each, which makes it fault tolerant
        - Many sources!
        - Many channels!
        - Many sinks!
          - Avro, HDFS, Logger, IRC, File, HBase, ElasticSearch, S3, Community sinks, Your Own (Java)
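The source, channel and sink split, and why having more than one sink helps with fault tolerance, can be sketched as follows. This is an illustration in Python only: Flume itself is configured through properties files and extended in Java, and the classes `MemoryChannel` and `ListSink` and the `drain` function are hypothetical names, not Flume's API.

```python
from collections import deque


class MemoryChannel:
    """Buffers events between a source and the sinks."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        return self._events.popleft() if self._events else None

    def rollback(self, event):
        # Put an undelivered event back at the front so it is retried first.
        self._events.appendleft(event)


class ListSink:
    """Collects events in a list; can be told to fail its first few writes."""

    def __init__(self, fail_first=0):
        self.delivered = []
        self._failures_left = fail_first

    def write(self, event):
        if self._failures_left > 0:
            self._failures_left -= 1
            raise IOError("sink unavailable")
        self.delivered.append(event)


def drain(channel, sinks):
    """Deliver each event to the first sink that accepts it.

    Returns False, leaving the event on the channel, if every sink fails.
    """
    while (event := channel.take()) is not None:
        for sink in sinks:
            try:
                sink.write(event)
                break
            except IOError:
                continue
        else:
            channel.rollback(event)
            return False
    return True


# The "source" here is just a loop that puts log lines on the channel.
channel = MemoryChannel()
for line in ["a", "b", "c"]:
    channel.put(line)

primary, backup = ListSink(fail_first=2), ListSink()
ok = drain(channel, [primary, backup])
# primary.delivered == ["c"], backup.delivered == ["a", "b"]
```

The channel is what decouples the two ends: the source keeps writing while a sink is down, and nothing is dropped as long as one sink eventually accepts the event.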
  23. What Hadoop looks like in a DC
      - Components
        - Primary Namenode
        - Secondary Namenode
        - Data Node
  24. What Hadoop looks like in a DC
      - Components
        - Primary Namenode
          - Controls the whole cluster, knows where the data resides
          - Runs the job tracker to keep track of Map/Reduce jobs
          - Biggest single point of failure; shadowing it is a potential option
        - Secondary Namenode
        - Data Node
  25. What Hadoop looks like in a DC
      - Components
        - Primary Namenode
        - Secondary Namenode
          - Performs secondary housekeeping operations: it periodically merges the Namenode's edit log into the filesystem image (it is not a hot standby)
        - Data Node
  26. What Hadoop looks like in a DC
      - Components
        - Primary Namenode
        - Secondary Namenode
        - Data Node
          - Stores all the information
          - Runs the Map/Reduce tasks
  27. What Hadoop looks like in a DC
      - Components
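The division of labour between the Namenode and the Data Nodes can be sketched like this. It is a toy model under stated assumptions, not Hadoop code: the class and function names, the tiny block size, the round-robin placement and the replication factor of 2 are all chosen just to keep the example short.

```python
class DataNode:
    """Stores block contents; knows nothing about files."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Holds metadata only: which blocks form a file and where each block lives."""

    def __init__(self, datanode_names, replication=2):
        self.names = sorted(datanode_names)
        self.replication = replication
        self.files = {}      # path -> ordered list of block ids
        self.locations = {}  # block id -> list of DataNode names

    def allocate(self, path, n_blocks):
        plan = []
        for k in range(n_blocks):
            block_id = f"{path}#{k}"
            # Toy round-robin placement on `replication` distinct nodes.
            targets = [self.names[(k + r) % len(self.names)]
                       for r in range(self.replication)]
            self.locations[block_id] = targets
            plan.append((block_id, targets))
        self.files[path] = [block_id for block_id, _ in plan]
        return plan

    def locate(self, path):
        return [(b, self.locations[b]) for b in self.files[path]]


def write_file(namenode, datanodes, path, data, block_size=4):
    """Client side: ask the NameNode where to write, then send blocks to DataNodes."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for (block_id, targets), chunk in zip(namenode.allocate(path, len(chunks)), chunks):
        for target in targets:
            datanodes[target].store(block_id, chunk)


def read_file(namenode, datanodes, path, down=frozenset()):
    """Client side: look up block locations, then read each block from a live holder."""
    parts = []
    for block_id, holders in namenode.locate(path):
        holder = next(name for name in holders if name not in down)
        parts.append(datanodes[holder].read(block_id))
    return "".join(parts)


datanodes = {name: DataNode(name) for name in ("dn1", "dn2", "dn3")}
namenode = NameNode(datanodes)
write_file(namenode, datanodes, "/logs/a.txt", "hello hadoop")
text = read_file(namenode, datanodes, "/logs/a.txt", down={"dn1"})
# text == "hello hadoop"
```

The read still succeeds with one Data Node marked as down because every block has a second copy, while losing the `NameNode` object would lose the only record of where the blocks are, which is why the slides call it the biggest point of failure.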