Querying and scripting in Hadoop (by Kailashnath Kutti)

June 09, 2014

  1. Agenda
     • Evolving data processing patterns
     • Apache Hadoop
     • Shortcomings of Hadoop
     • Apache Hive
     • Apache Pig
     • Other utilities of interest
     • Trends
     • Q&A
  2. Evolving data processing patterns
     • Data lake
     • Data hub
     • Extended data warehouse
  3. Hadoop, the common element
     • Open-source Apache project out of Yahoo! in 2006
     • Distributed, fault-tolerant data storage and batch processing
     • Linear scalability on commodity hardware
     Hadoop can do real time too
  4. What has changed
     [Diagram labels: Data, Analytics, Business apps; Agility, Profitability]
  5. Agility
     • Turnaround time to generate actionable insights from data
     • Data processing speed
     • Data ingestion speed
  6. MR sequence: Split, Map, Shuffle, Reduce
     Split:    "Hadoop is fun" | "I love Hadoop" | "Pig is more fun"
     Map:      (Hadoop, 1) (Is, 1) (Fun, 1) | (I, 1) (Love, 1) (Hadoop, 1) | (Pig, 1) (Is, 1) (More, 1) (Fun, 1)
     Shuffle:  (Hadoop, {1,1}) (Is, {1,1}) (Fun, {1,1}) (I, 1) (Love, 1) (Pig, 1) (More, 1)
     Reduce:   (Hadoop, 2) (Is, 2) (Fun, 2) (I, 1) (Love, 1) (Pig, 1) (More, 1)
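The split, map, shuffle and reduce stages traced above can be imitated in plain Java without a cluster. This is only a minimal sketch, not Hadoop code: the class name MapReduceSketch and its methods are hypothetical, each input line stands in for one split, and lower-casing tokens is an assumption made so that differently capitalised words share a key.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of the split, map, shuffle and reduce stages (no Hadoop involved). */
class MapReduceSketch {

    /** Map stage: emit a (word, 1) pair for every token in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                // Assumption: lower-case the token so "Is" and "is" share one key.
                pairs.add(new AbstractMap.SimpleEntry<>(token.toLowerCase(), 1));
            }
        }
        return pairs;
    }

    /** Shuffle stage: group every emitted value under its key. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce stage: sum the grouped values for each key. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    /** Runs all stages; each input line plays the role of one split. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Hadoop is fun", "I love Hadoop", "Pig is more fun");
        // Prints {fun=2, hadoop=2, i=1, is=2, love=1, more=1, pig=1}
        System.out.println(wordCount(lines));
    }
}
```

In a real job the shuffle is done by the framework between the map and reduce tasks; it is written out here only to make the grouping step visible.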
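  7. Code for word count

     import java.io.IOException;
     import java.util.StringTokenizer;
     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.LongWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Job;
     import org.apache.hadoop.mapreduce.Mapper;
     import org.apache.hadoop.mapreduce.Reducer;
     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

     // Mapper: emits (word, 1) for every token of an input line.
     public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
         private final static IntWritable ONE = new IntWritable(1);
         private Text word = new Text();

         @Override
         public void map(LongWritable key, Text value, Context context)
                 throws IOException, InterruptedException {
             StringTokenizer tokenizer = new StringTokenizer(value.toString());
             while (tokenizer.hasMoreTokens()) {
                 word.set(tokenizer.nextToken());
                 context.write(word, ONE);
             }
         }
     }

     // Reducer: sums the counts collected for each word.
     public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
         @Override
         public void reduce(Text key, Iterable<IntWritable> values, Context context)
                 throws IOException, InterruptedException {
             int sum = 0;
             for (IntWritable val : values) {
                 sum += val.get();
             }
             context.write(key, new IntWritable(sum));
         }
     }

     // Driver: these statements belong in a main(String[] args) method.
     Configuration conf = new Configuration();
     Job job = Job.getInstance(conf, "wordcount");
     job.setJarByClass(WordMapper.class);
     job.setMapperClass(WordMapper.class);
     job.setReducerClass(IntSumReducer.class);
     job.setOutputKeyClass(Text.class);
     job.setOutputValueClass(IntWritable.class);
     FileInputFormat.addInputPath(job, new Path(args[0]));
     FileOutputFormat.setOutputPath(job, new Path(args[1]));
     job.waitForCompletion(true);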
  8. Apache Pig makes it easy
     1. High-level, open-source data flow language on Hadoop, started by Yahoo.
     2. Pig Latin, a simple language for data manipulation that is compiled into MapReduce jobs
     3. Simplifies joining data and chaining jobs together
     4. Faster development cycle

         words = LOAD '/data/input/Novel.csv' USING PigStorage('\t') AS (word:chararray, count:int);
         sorted_words = ORDER words BY count DESC;
         first_words = LIMIT sorted_words 10;
         DUMP first_words;
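To make clear what the four Pig Latin statements compute, the same load, sort and limit steps can be written in plain Java. This is a rough sketch for comparison only, not what Pig generates: the class TopWordsSketch, its methods and the sample rows are all hypothetical, and the input is assumed to be tab-separated "word, count" lines as declared in the AS clause.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Plain-Java imitation of LOAD, ORDER ... DESC and LIMIT (no Pig involved). */
class TopWordsSketch {

    /** One (word, count) row, mirroring the schema in the AS clause. */
    static final class WordCount {
        final String word;
        final int count;

        WordCount(String word, int count) {
            this.word = word;
            this.count = count;
        }
    }

    /** LOAD step: split each tab-separated line into a (word, count) row. */
    static List<WordCount> load(List<String> lines) {
        List<WordCount> rows = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            rows.add(new WordCount(fields[0], Integer.parseInt(fields[1].trim())));
        }
        return rows;
    }

    /** ORDER BY count DESC followed by LIMIT n. */
    static List<WordCount> top(List<WordCount> rows, int n) {
        return rows.stream()
                .sorted(Comparator.comparingInt((WordCount r) -> r.count).reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample rows, not taken from any real dataset.
        List<String> lines = Arrays.asList("love\t7", "the\t42", "sword\t3");
        for (WordCount row : top(load(lines), 2)) {
            System.out.println(row.word + "\t" + row.count);
        }
    }
}
```

The Java version holds everything in memory on one machine, whereas the Pig script is compiled into MapReduce jobs that run across the cluster.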
  9. What happens when a Pig script runs
     [Diagram; only the label LOAD survives]

  10. Pig makes it easy – Process web server log file
      Step 1 – Load a terabyte-sized file

      Pig Latin:
          weblog = LOAD '/data/TerabyteWebLog.csv' USING PigStorage('\t')
                   AS (hostname:chararray, date:chararray, url:chararray);

      SQL:
          select * from TerabyteWebLog;

      MR code:
          Too big to fit in this space!
  11. Pig makes it easy – Process web server log file
      Step 2 – Find number of users on June 05, 2014

      Pig Latin:
          usersFrom10T212 = FILTER weblog BY DATE_EXTRACT_DD(date) == '05/JUN/2014';

      SQL:
          select * from users
          where to_date(date, 'dd/mon/yyyy') = '05/JUN/2014';

      MR code:
          Too big to fit in this space!
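For comparison, the filter step can be sketched in plain Java. Several things here are assumptions rather than facts from the slides: the class WeblogFilterSketch and its methods are hypothetical, the date field is assumed to look like 05/JUN/2014, distinct hostnames are used as a stand-in for "users", and the sample rows are made up. DATE_EXTRACT_DD is not reproduced; the sketch parses the date field directly.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Plain-Java imitation of the FILTER step over tab-separated log lines. */
class WeblogFilterSketch {

    // Assumed date layout, e.g. 05/JUN/2014; month names match case-insensitively.
    static final DateTimeFormatter DAY = new DateTimeFormatterBuilder()
            .parseCaseInsensitive()
            .appendPattern("dd/MMM/yyyy")
            .toFormatter(Locale.ENGLISH);

    /** Keeps rows (hostname, date, url) whose date field falls on the given day. */
    static List<String[]> filterByDay(List<String> lines, LocalDate day) {
        List<String[]> kept = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            // Rows without exactly three fields are skipped; a malformed date throws.
            if (fields.length == 3 && LocalDate.parse(fields[1], DAY).equals(day)) {
                kept.add(fields);
            }
        }
        return kept;
    }

    /** Counts distinct hostnames among the kept rows. */
    static int distinctHosts(List<String[]> rows) {
        Set<String> hosts = new HashSet<>();
        for (String[] row : rows) {
            hosts.add(row[0]);
        }
        return hosts.size();
    }

    public static void main(String[] args) {
        // Made-up sample rows for illustration.
        List<String> lines = Arrays.asList(
                "host-a\t05/JUN/2014\t/index.html",
                "host-b\t05/Jun/2014\t/about.html",
                "host-a\t05/JUN/2014\t/cart.html",
                "host-c\t06/JUN/2014\t/index.html");
        List<String[]> kept = filterByDay(lines, LocalDate.of(2014, 6, 5));
        System.out.println(kept.size() + " rows, " + distinctHosts(kept) + " hosts");
    }
}
```

Parsing the field into a LocalDate, rather than comparing strings, avoids mismatches caused by differences in capitalisation of the month name.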
  12. Pig makes it easy – Process web server log file
      Step 3 – Store results into a file

      Pig Latin:
          STORE usersFrom10T212 INTO '/data/output/usersFrom10T212';

      "SQL":
          $ hive -e "select * from table where id > 10" > ~/sample_output.txt

      MR code:
          Too big to fit in this space!
  13. Pig is widely used for
      • Rapid prototyping of algorithms
      • Simple Extract, Transform and Load (ETL) pipelines
      • Correlating unstructured with structured datasets
      • Building analytical models
      • Clickstream data processing pipelines
      • Cleansing data
      • Calculating common aggregates
      • Loading data into an enterprise data warehouse
  14. Pig also …
      • Causes Out Of Memory errors (Reducer)
      • Sometimes doesn't understand the difference between null and ""
      • Nested FOREACH and scoping
      • Date management UDFs can be improved
  15. Apache Hive
      • A data warehouse infrastructure built on top of Hadoop for providing data summarization, ad-hoc queries, and analysis.
      • Key building principles: SQL is a familiar language; extensibility (types, functions, formats, scripts); performance

          create external table wordcounts (word string, count int)
          row format delimited fields terminated by '\t'
          location '/hdfs/warehouse/HamletNovel.csv';

          select * from wordcounts order by count desc limit 10;

          select SUM(count) from wordcounts where word like 'love%';
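The last query above sums the counts of every word that starts with "love". A plain-Java equivalent, given only as an illustrative sketch, is shown below; the class PrefixSumSketch, its method and the sample rows are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java imitation of SUM(count) ... WHERE word LIKE 'prefix%'. */
class PrefixSumSketch {

    /** Adds up the counts of all words that start with the given prefix (case-sensitive). */
    static long sumForPrefix(Map<String, Integer> wordCounts, String prefix) {
        long total = 0;
        for (Map.Entry<String, Integer> entry : wordCounts.entrySet()) {
            if (entry.getKey().startsWith(prefix)) {
                total += entry.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up sample rows for illustration.
        Map<String, Integer> wordCounts = new LinkedHashMap<>();
        wordCounts.put("love", 7);
        wordCounts.put("loved", 2);
        wordCounts.put("lover", 1);
        wordCounts.put("glove", 5); // does not match 'love%'
        System.out.println(sumForPrefix(wordCounts, "love")); // prints 10
    }
}
```

One difference to keep in mind: when no row matches, SQL's SUM yields NULL, while this sketch returns 0.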
  16. What happens when a Hive query runs

          create external table wordcounts (word string, count int)
          row format delimited fields terminated by '\t'
          location '/hdfs/warehouse/HamletNovel.csv';

      [Diagram labels: SELECT, GROUP BY, JOIN, UNION, ORDER BY; Map 1, Reduce 1, Reduce 2]
  17. Hive is not meant for
      • An OLTP application
      • Low-latency database access
      • Transactional database (ACID)
      • Row-level inserts, updates or deletes
      • Out Of Memory errors in Reducers can be hard to fix
      • Not many options for debugging
      • Like Pig, has trouble understanding the difference between null and ""
      And …
  18. Hive vs RDBMS - Differences

      Feature            | Hive / HiveQL                             | RDBMS / SQL
      Latency            | Minutes                                   | Sub-seconds
      Transactions       | Not supported                             | Supported
      Row-level inserts  | Bulk data can be appended or overwritten  | Supported
  19. Hive vs RDBMS - Differences

      Feature                    | Hive / HiveQL                      | RDBMS / SQL
      Constraints (primary key,  | Not supported                      | Supported
        foreign key, ...)        |                                    |
      Data types                 | Simple: integral, float, boolean,  | Integral, float, text and
                                 |   string                           |   binary strings, temporal
                                 | Complex: array, map, struct        |
                                 | Date type not supported            |
      Updates                    | INSERT OVERWRITE TABLE (populates  | UPDATE, INSERT, DELETE
                                 |   whole table or partition)        |
  20. Cascading
      • Java library to simplify complex MapReduce jobs
      • Can address some of the limitations of Pig
      • Easy to create data pipelines
  21. Apache Storm
      Distributed real-time computation (stream processing) system
      • Real-time analytics
      • Online machine learning
      • Continuous computation
      • ETL etc.
  22. Spring XD
      Spring XD is a distributed system for
      • Data ingestion
      • Real-time analytics
      • Batch processing
      • Data export
  23. Other commercial/open-source packages
      • Pivotal HAWQ
      • IBM BigSQL
      • Apache Drill
      • Impala
      • Apache Stinger project