Querying and scripting in Hadoop (by Kailashnath Kutti)

Michael Isvy

June 09, 2014


  1. Agenda
     • Evolving data processing patterns
     • Apache Hadoop
     • Shortcomings of Hadoop
     • Apache Hive
     • Apache Pig
     • Other utilities of interest
     • Trends
     • Q&A
  2. Evolving data processing patterns
     • Data lake
     • Data hub
     • Extended data warehouse
  3. Hadoop, the common element
     • Open-source Apache project out of Yahoo! in 2006
     • Distributed fault-tolerant data storage and batch processing
     • Linear scalability on commodity hardware
     Hadoop can do real time too
  4. What has changed
     [Diagram labels: Analytics, Business apps, Data; Agility, Profitability]
  5. Agility
     • Turnaround time to generate actionable insights from data
     • Data processing speed
     • Data ingestion speed
  6. MR sequence: Split → Map → Shuffle → Reduce
     Split:   "Hadoop is fun" | "I love Hadoop" | "Pig is more fun"
     Map:     (Hadoop, 1) (Is, 1) (Fun, 1) | (I, 1) (Love, 1) (Hadoop, 1) | (Pig, 1) (Is, 1) (More, 1) (Fun, 1)
     Shuffle: (Hadoop, {1,1}) (Is, {1,1}) (Fun, {1,1}) (I, 1) (Love, 1) (Pig, 1) (More, 1)
     Reduce:  (Hadoop, 2) (Is, 2) (Fun, 2) (I, 1) (Love, 1) (Pig, 1) (More, 1)
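
The four phases on this slide can be traced in plain Java without a cluster. The sketch below is an illustration only: the class and method names (`MrSequenceSketch`, `map`, `shuffle`, `reduce`, `run`) are made up for this example, no Hadoop classes are involved, and tokens are lowercased so that "Is" and "is" count as one word, which is an assumption about how the slide groups them.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java walk-through of the split/map/shuffle/reduce phases; not Hadoop code.
class MrSequenceSketch {

    // Map phase: emit a (word, 1) pair for every token in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : split.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token.toLowerCase(), 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group the emitted values by key, e.g. hadoop -> [1, 1].
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the grouped values for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Runs all phases over a list of input splits (one line per split here).
    static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String split : splits) {
            pairs.addAll(map(split));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("Hadoop is fun", "I love Hadoop", "Pig is more fun");
        // Prints {fun=2, hadoop=2, i=1, is=2, love=1, more=1, pig=1}
        System.out.println(run(splits));
    }
}
```

On a real cluster the map calls run in parallel on different nodes and the shuffle moves data over the network; here everything happens in one process, which is why the sketch only shows the data flow and not the scalability.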
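  7. Code for word count

     import java.io.IOException;
     import java.util.StringTokenizer;
     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.LongWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Job;
     import org.apache.hadoop.mapreduce.Mapper;
     import org.apache.hadoop.mapreduce.Reducer;
     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

     // Mapper: emits (word, 1) for every token in an input line.
     public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
         private final static IntWritable ONE = new IntWritable(1);
         private Text word = new Text();

         @Override
         public void map(LongWritable key, Text value, Context context)
                 throws IOException, InterruptedException {
             String line = value.toString();
             StringTokenizer tokenizer = new StringTokenizer(line);
             while (tokenizer.hasMoreTokens()) {
                 word.set(tokenizer.nextToken());
                 context.write(word, ONE);
             }
         }
     }

     // Reducer: sums the counts collected for each word.
     public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
         @Override
         public void reduce(Text key, Iterable<IntWritable> values, Context context)
                 throws IOException, InterruptedException {
             int sum = 0;
             for (IntWritable val : values) {
                 sum += val.get();
             }
             context.write(key, new IntWritable(sum));
         }
     }

     // Driver (body of main): wires the mapper and reducer into a job.
     Configuration conf = new Configuration();
     Job job = Job.getInstance(conf, "wordcount");
     job.setJarByClass(WordMapper.class);
     job.setMapperClass(WordMapper.class);
     job.setReducerClass(IntSumReducer.class);
     job.setOutputKeyClass(Text.class);
     job.setOutputValueClass(IntWritable.class);
     FileInputFormat.addInputPath(job, new Path(args[0]));
     FileOutputFormat.setOutputPath(job, new Path(args[1]));
     job.waitForCompletion(true);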
  8. Apache Pig makes it easy
     1. High-level open-source data flow language on Hadoop, started by Yahoo
     2. Pig Latin, a simple language for data manipulation that is compiled into MapReduce jobs
     3. Simplifies joining data and chaining jobs together
     4. Faster development cycle

     words = LOAD '/data/input/Novel.csv' USING PigStorage('\t') AS (word:chararray, count:int);
     sorted_words = ORDER words BY count DESC;
     first_words = LIMIT sorted_words 10;
     DUMP first_words;
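
For readers who think in Java, the four Pig Latin statements above correspond roughly to the in-memory steps below. This is a sketch under stated assumptions, not how Pig executes the script: `TopWordsSketch`, `Row`, `load` and `top` are names invented for the example, the sample rows are hypothetical, and everything runs in one process instead of as MapReduce jobs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// In-memory stand-in for the Pig script; no Pig or Hadoop classes are used.
class TopWordsSketch {

    // One row of the (word:chararray, count:int) relation.
    static final class Row {
        final String word;
        final int count;

        Row(String word, int count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return word + "\t" + count;
        }
    }

    // LOAD ... USING PigStorage('\t'): split each line on the tab character.
    static List<Row> load(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            rows.add(new Row(fields[0], Integer.parseInt(fields[1].trim())));
        }
        return rows;
    }

    // ORDER ... BY count DESC followed by LIMIT n.
    static List<Row> top(List<Row> rows, int n) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingInt((Row r) -> r.count).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        // Hypothetical sample rows standing in for /data/input/Novel.csv.
        List<String> lines = Arrays.asList("the\t120", "love\t42", "hamlet\t77");
        // DUMP: print the selected rows.
        for (Row row : top(load(lines), 10)) {
            System.out.println(row);
        }
    }
}
```

The Pig version stays four lines long regardless of input size, which is the point the slide is making; the Java version only works while the data fits in memory.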
  9. What happens when a Pig script runs
     [Diagram: execution flow starting at LOAD; the remaining steps were not captured in the transcript]

  10. Pig makes it easy – Process web server log file
      Step 1 – 'Load' a terabyte-sized file

      Pig Latin:
      weblog = LOAD '/data/TerabyteWebLog.csv' USING PigStorage('\t')
               AS (hostname:chararray, date:chararray, url:chararray);

      SQL:
      select * from TerabyteWebLog;

      MR code:
      Too big to fit in this space!
  11. Pig makes it easy – Process web server log file
      Step 2 – Find number of users on June 05 2014

      Pig Latin:
      usersFrom10T212 = FILTER weblog BY DATE_EXTRACT_DD(date) == '05/JUN/2014';

      SQL:
      select * from users
      where to_date(date, 'dd/mon/yyyy') = '05/JUN/2014';

      MR code:
      Too big to fit in this space!
  12. Pig makes it easy – Process web server log file
      Step 3 – Store results into a file

      Pig Latin:
      STORE usersFrom10T212 INTO '/data/output/usersFrom10T212';

      "SQL":
      $ hive -e "select * from table where id > 10" > ~/sample_output.txt

      MR code:
      Too big to fit in this space!
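
To make the three steps concrete, here is a single-machine sketch of load, filter and store in plain Java. It is not the MapReduce code the slides refer to. `WeblogPipelineSketch` and its methods are names invented for this example, the sample rows are hypothetical, and the filter assumes the date field begins with the day in `dd/MON/yyyy` form, which the slides do not confirm.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Local, single-machine stand-in for the three Pig steps; not MapReduce code.
class WeblogPipelineSketch {

    // Step 1 (LOAD): read tab-separated rows of hostname, date, url.
    static List<String[]> load(Path input) throws IOException {
        List<String[]> rows = new ArrayList<>();
        for (String line : Files.readAllLines(input, StandardCharsets.UTF_8)) {
            if (!line.isEmpty()) {
                rows.add(line.split("\t", -1));
            }
        }
        return rows;
    }

    // Step 2 (FILTER): keep rows whose date field starts with the given day.
    // Assumption: the date column looks like "05/JUN/2014:10:00:00".
    static List<String[]> filterByDay(List<String[]> rows, String day) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            if (row.length >= 3 && row[1].startsWith(day)) {
                kept.add(row);
            }
        }
        return kept;
    }

    // Step 3 (STORE): write the surviving rows back out, tab-separated.
    static void store(List<String[]> rows, Path output) throws IOException {
        List<String> lines = new ArrayList<>();
        for (String[] row : rows) {
            lines.add(String.join("\t", row));
        }
        Files.write(output, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data in temporary local files; the slides use HDFS paths.
        Path input = Files.createTempFile("weblog", ".tsv");
        Path output = Files.createTempFile("users", ".tsv");
        Files.write(input, Arrays.asList(
                "host-a\t05/JUN/2014:10:00:00\t/index.html",
                "host-b\t06/JUN/2014:11:30:00\t/about.html"), StandardCharsets.UTF_8);
        store(filterByDay(load(input), "05/JUN/2014"), output);
        System.out.println(Files.readAllLines(output, StandardCharsets.UTF_8));
    }
}
```

This version reads the whole file into memory, so it would not cope with the terabyte-sized input the slides have in mind; distributing that work is what Pig delegates to MapReduce.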
  13. Pig is widely used for
      • Rapid prototyping of algorithms
      • Simple Extract, Transform and Load (ETL) pipelines
      • Correlating unstructured with structured datasets
      • Building analytical models
      • Click stream data processing pipelines
      • Cleansing data
      • Calculating common aggregates
      • Loading data into an enterprise data warehouse
  14. Pig also ...
      • Causes out-of-memory errors (reducer)
      • Sometimes does not understand the difference between null and ""
      • Nested FOREACH and scoping
      • Date management UDFs can be improved
  15. Apache Hive
      • A data warehouse infrastructure built on top of Hadoop for providing data summarization, ad-hoc queries, and analysis.
      • Key building principles: SQL is a familiar language; extensibility (types, functions, formats, scripts); performance

      create external table wordcounts (word string, count int)
        row format delimited fields terminated by '\t'
        location '/hdfs/warehouse/HamletNovel.csv';

      select * from wordcounts order by count desc limit 10;

      select SUM(count) from wordcounts where word like 'love%';
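
As a rough illustration of what the last query computes, the sketch below applies the same filter and sum to an in-memory map. `LovePrefixSumSketch` and `sumForPrefix` are names invented for this example, the sample rows are hypothetical, and the map assumes one row per word, which the table definition does not guarantee.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// In-memory stand-in for the wordcounts table; no Hive classes are used.
class LovePrefixSumSketch {

    // Mirrors: select SUM(count) from wordcounts where word like 'love%'
    static long sumForPrefix(Map<String, Integer> wordcounts, String prefix) {
        long sum = 0;
        for (Map.Entry<String, Integer> row : wordcounts.entrySet()) {
            // LIKE 'love%' matches any word that starts with "love".
            if (row.getKey().startsWith(prefix)) {
                sum += row.getValue();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical rows; the real table reads from the HDFS location above.
        Map<String, Integer> wordcounts = new LinkedHashMap<>();
        wordcounts.put("love", 42);
        wordcounts.put("loved", 7);
        wordcounts.put("lovely", 3);
        wordcounts.put("hate", 5);
        // Prints 52
        System.out.println(sumForPrefix(wordcounts, "love"));
    }
}
```

One difference worth noting: when no row matches, this method returns 0, whereas a SQL `SUM` over an empty set is generally NULL.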
  16. What happens when a Hive query runs

      create external table wordcounts (word string, count int)
        row format delimited fields terminated by '\t'
        location '/hdfs/warehouse/HamletNovel.csv';

      [Diagram labels: SELECT, GROUP BY, JOIN, UNION; Map 1, Reduce 1, Reduce 2; ORDER BY]
  17. Hive is not meant for
      • An OLTP application
      • Low-latency database access
      • Transactional database (ACID)
      • Row-level inserts, updates or deletes
      And ...
      • Out-of-memory errors in reducers can be hard to fix
      • Not many options for debugging
      • Like Pig, has trouble understanding the difference between null and ""
  18. Hive vs RDBMS - Differences
      Feature           | Hive / HiveQL                            | RDBMS / SQL
      Latency           | Minutes                                  | Sub-seconds
      Transactions      | Not supported                            | Supported
      Row-level inserts | Bulk data can be appended or overwritten | Supported
  19. Hive vs RDBMS - Differences
      Feature | Hive / HiveQL | RDBMS / SQL
      Constraints (primary key, foreign key ..) | Not supported | Supported
      Data types | Simple: integral, float, boolean. Complex: string, array, map, struct. Date type not supported | Integral, float, text and binary strings, temporal
      Updates | INSERT OVERWRITE TABLE (populates whole table or partition) | UPDATE, INSERT, DELETE
  20. Cascading
      • Java library to simplify complex MapReduce jobs
      • Can address some of the limitations of Pig
      • Easy to create data pipelines
  21. Apache Storm
      Distributed real-time computation (stream processing) system
      • Real-time analytics
      • Online machine learning
      • Continuous computation
      • ETL etc.
  22. Spring XD
      Spring XD is a distributed system for
      • Data ingestion
      • Real-time analytics
      • Batch processing
      • Data export
  23. Other commercial/open-source packages
      • Pivotal HAWQ
      • IBM BigSQL
      • Apache Drill
      • Impala
      • Apache Stinger project