Querying and scripting in Hadoop (by Kailashnath Kutti)

Michael Isvy

June 09, 2014

Transcript

  1. Spring User Group
    June 2014

  2. Agenda
    • Evolving data processing patterns
    • Apache Hadoop
    • Shortcomings of Hadoop
    • Apache Hive
    • Apache Pig
    • Other utilities of interest
    • Trends
    • Q&A

  3. Evolving data processing patterns
    • Data lake
    • Data hub
    • Extended data warehouse

  4. High level architecture building blocks
    [Diagram labels: Big Data, Fast Data, Data warehouse, ?]

  5. Hadoop, the common element
    • Open-source Apache project out of Yahoo! in 2006
    • Distributed fault-tolerant data storage and batch processing
    • Linear scalability on commodity hardware

    Hadoop can do real time too

  6. What has changed
    [Diagram labels: Analytics, Business apps, Data; Agility, Profitability]

  7. Agility
    • Turnaround time to generate actionable insights from data
    • Data processing speed
    • Data ingestion speed

  8. MR sequence: Split, Map, Shuffle, Reduce
    Split:
      Hadoop is fun
      I love Hadoop
      Pig is more fun
    Map:
      Hadoop, 1   Is, 1   Fun, 1
      I, 1   Love, 1   Hadoop, 1
      Pig, 1   Is, 1   More, 1   Fun, 1
    Shuffle:
      Hadoop, {1,1}   Is, {1,1}   Fun, {1,1}   I, 1   Love, 1   Pig, 1   More, 1
    Reduce:
      Hadoop, 2   Is, 2   Fun, 2   I, 1   Love, 1   Pig, 1   More, 1
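
The four phases on this slide can be traced in plain Java. This is only a single-JVM sketch of the idea, not Hadoop code: the class name MrSequenceSketch and its methods are made up for this example, and unlike a real cluster it neither sorts keys nor distributes work across machines.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical walk-through of Split -> Map -> Shuffle -> Reduce.
class MrSequenceSketch {

    // Map phase: emit a (word, 1) pair for every token in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : split.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group emitted values by key, in first-seen order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the grouped values for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Runs all phases; each list element plays the role of one split.
    static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String split : splits) {
            emitted.addAll(map(split));
        }
        return reduce(shuffle(emitted));
    }

    public static void main(String[] args) {
        List<String> splits =
            Arrays.asList("Hadoop is fun", "I love Hadoop", "Pig is more fun");
        // prints {Hadoop=2, is=2, fun=2, I=1, love=1, Pig=1, more=1}
        System.out.println(run(splits));
    }
}
```

Each method corresponds to one column of the slide, so the intermediate values can be printed to compare against the Map and Shuffle columns.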

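  9. Code for word count
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Mapper: emits (word, 1) for every token of each input line.
    // (Each public class goes in its own .java file.)
    public class WordMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable ONE = new IntWritable(1);
      private Text word = new Text();
      @Override
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          context.write(word, ONE);
        }
      }
    }

    // Reducer: sums the counts emitted for each word.
    public class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      public void reduce(Text key, Iterable<IntWritable> values,
          Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }

    // Driver: these statements belong in a main(String[] args) method
    // declared with "throws Exception".
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "wordcount");
    job.setJarByClass(WordMapper.class);
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);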

  10. MapReduce is not so agile!

  11. Apache Pig makes it easy
    1. High-level open source data flow language on Hadoop, started by Yahoo.
    2. Pig Latin, a simple language for data manipulation that is compiled into MapReduce jobs
    3. Simplifies joining data and chaining jobs together
    4. Faster development cycle
    words = LOAD '/data/input/Novel.csv' USING PigStorage('\t') AS
    (word:chararray, count:int);
    sorted_words = ORDER words BY count DESC;
    first_words = LIMIT sorted_words 10;
    DUMP first_words;
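
For readers more at home in Java, the Pig script above can be read as the following in-memory sketch. It is an illustration of what LOAD, ORDER and LIMIT compute, not of how Pig executes them: the class TopWordsSketch, the WordCount row type and the sample rows are assumptions made for this example, and the data is assumed to fit in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory counterpart of the Pig script.
class TopWordsSketch {

    // One row of the (word:chararray, count:int) schema from the AS clause.
    static final class WordCount {
        final String word;
        final int count;

        WordCount(String word, int count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return "(" + word + "," + count + ")";
        }
    }

    // LOAD ... USING PigStorage('\t'): parse "word<TAB>count" lines.
    static List<WordCount> load(List<String> lines) {
        List<WordCount> rows = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            rows.add(new WordCount(fields[0], Integer.parseInt(fields[1].trim())));
        }
        return rows;
    }

    // ORDER words BY count DESC, then LIMIT to the first n rows.
    static List<WordCount> top(List<WordCount> words, int n) {
        List<WordCount> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.comparingInt((WordCount w) -> w.count).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        // Sample rows invented for the example.
        List<String> lines = Arrays.asList("love\t3", "hamlet\t7", "the\t12");
        // DUMP first_words;
        for (WordCount row : top(load(lines), 10)) {
            System.out.println(row);
        }
    }
}
```

The Pig version is shorter mainly because parsing, sorting at scale and job chaining are handled by the runtime.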

  12. What happens when a Pig script runs
    Pig Latin operators: LOAD, FILTER, LOAD, JOIN, GROUP, FOREACH, STORE
    MapReduce stages: Map, Reduce, Map, Reduce
    Physical operators: FILTER, LOCAL REARRANGE, PACKAGE, FOREACH, LOCAL REARRANGE, PACKAGE, FOREACH

  13. Pig makes it easy - Process web server log file
    Step 1 - 'Load' a terabyte-sized file
    Pig Latin:
      weblog = LOAD '/data/TerabyteWebLog.csv'
               USING PigStorage('\t')
               AS (hostname:chararray, date:chararray, url:chararray);
    SQL:
      select * from TerabyteWebLog;
    MR code:
      Too big to fit in this space!

  14. Pig makes it easy - Process web server log file
    Step 2 - Find number of users on June 05 2014
    Pig Latin:
      usersFrom10T212 = FILTER weblog BY
      DATE_EXTRACT_DD(date) == '05/JUN/2014';
    SQL:
      select * from users
      where to_date(date, 'dd/mon/yyyy') = '05/JUN/2014';
    MR code:
      Too big to fit in this space!

  15. Pig makes it easy - Process web server log file
    Step 3 - Store results into a file
    Pig Latin:
      STORE usersFrom10T212 INTO '/data/output/usersFrom10T212';
    "SQL":
      $ hive -e "select * from table where id > 10" > ~/sample_output.txt
    MR code:
      Too big to fit in this space!
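
To make the three steps concrete without a cluster, here is a small single-machine sketch in plain Java. It is not the MapReduce code the slides allude to: the class WebLogSketch and its method names are hypothetical, the whole log is assumed to fit in memory, and the date column is assumed to begin with the day in dd/MON/yyyy form.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical single-machine version of the load / filter / store steps.
class WebLogSketch {

    // Step 1 (LOAD): split tab-separated lines into {hostname, date, url}.
    static List<String[]> load(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split("\t", -1);
            if (fields.length == 3) { // assumption: malformed lines are skipped
                rows.add(fields);
            }
        }
        return rows;
    }

    // Step 2 (FILTER): keep rows whose date column starts with the given day.
    static List<String[]> filterByDay(List<String[]> rows, String day) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            if (row[1].startsWith(day)) {
                kept.add(row);
            }
        }
        return kept;
    }

    // Step 3 (STORE): write the rows back out as tab-separated lines.
    static void store(List<String[]> rows, Path output) {
        List<String> lines = new ArrayList<>();
        for (String[] row : rows) {
            lines.add(String.join("\t", row));
        }
        try {
            Files.write(output, lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Sample log lines invented for the example.
        List<String> lines = Arrays.asList(
            "a.example\t05/JUN/2014\t/index",
            "b.example\t06/JUN/2014\t/home");
        List<String[]> hits = filterByDay(load(lines), "05/JUN/2014");
        System.out.println(hits.size() + " matching row(s)");
    }
}
```

Calling store(hits, path) with a java.nio.file.Path of your choosing would complete step 3. On a terabyte-sized file this approach would not scale, which is the gap Pig fills by compiling the same three statements into distributed jobs.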

  16. Pig is widely used for
    • Rapid prototyping of algorithms
    • Simple Extract, Transform and Load (ETL) pipelines
    • Correlation of unstructured with structured datasets
    • Building analytical models
    • Click stream data processing pipelines
    • Cleansing data
    • Calculating common aggregates
    • Loading data into an Enterprise Data Warehouse

  17. Pig also ...
    • Causes Out Of Memory errors (Reducer)
    • Sometimes doesn't understand the difference between Null and ""
    • Nested FOREACH and scoping
    • Date management UDFs can be improved

  18. Apache Hive
    • A data warehouse infrastructure built on top of Hadoop for providing data summarization, ad-hoc queries, and analysis.
    • Key building principles: SQL is a familiar language; extensibility (types, functions, formats, scripts); performance
    create external table wordcounts (word string, count int) row
    format delimited fields terminated by '\t' location '/hdfs/
    warehouse/HamletNovel.csv';

    select * from wordcounts order by count desc limit 10;

    select SUM(count) from wordcounts where word like 'love%';
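
As a rough illustration of what the last query computes, the following plain-Java sketch sums the counts of all words that start with a prefix. The class HiveSumSketch and the sample rows are invented for this example, and it assumes each word appears once in the table. Two differences are worth noting: the sketch returns 0 when nothing matches, whereas SQL SUM over no rows yields NULL, and startsWith is case-sensitive.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory counterpart of:
//   select SUM(count) from wordcounts where word like 'love%';
class HiveSumSketch {

    // Sums the counts of every word that begins with the given prefix.
    static long sumWithPrefix(Map<String, Integer> wordcounts, String prefix) {
        long total = 0;
        for (Map.Entry<String, Integer> row : wordcounts.entrySet()) {
            if (row.getKey().startsWith(prefix)) {
                total += row.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Sample rows invented for the example.
        Map<String, Integer> wordcounts = new LinkedHashMap<>();
        wordcounts.put("love", 3);
        wordcounts.put("lovely", 2);
        wordcounts.put("glove", 5);
        wordcounts.put("hamlet", 7);
        System.out.println(sumWithPrefix(wordcounts, "love")); // prints 5
    }
}
```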

  19. What happens when a Hive query runs
    create external table wordcounts (word string,
    count int) row format delimited fields terminated by
    '\t' location '/hdfs/warehouse/HamletNovel.csv';

    [Diagram labels: SELECT, GROUP BY, JOIN, UNION; Map 1, Reduce 1, Reduce 2; ORDER BY]

  20. Hive is not meant for
    • An OLTP application
    • Low latency database access
    • Transactional database (ACID)
    • Row level inserts, updates or deletes
    • Out of Memory errors in Reducers can be hard to fix
    • Not many options for debugging
    • Like Pig, struggles with the difference between Null and ""
    And ...

  21. Hive vs RDBMS - Differences
    Feature | Hive / HiveQL | RDBMS / SQL
    Latency | Minutes | Sub-seconds
    Transactions | Not supported | Supported
    Row-level inserts | Bulk data can be appended or overwritten | Supported

  22. Hive vs RDBMS - Differences
    Feature | Hive / HiveQL | RDBMS / SQL
    Constraints (Primary Key, Foreign Key ..) | Not supported | Supported
    Data types | Simple: integral, float, boolean. Complex: string, array, map, struct. Date type not supported | Integral, float, text and binary strings, temporal
    Updates | INSERT OVERWRITE TABLE (populates whole table or partition) | UPDATE, INSERT, DELETE

  23. Cascading
    • Java library to simplify complex MapReduce jobs
    • Can address some of the limitations of Pig
    • Easy to create data pipelines

  24. Apache Storm
    Distributed real-time computation (stream processing) system
    • Real-time analytics
    • Online machine learning
    • Continuous computation
    • ETL etc.

  25. Spring XD
    Spring XD is a distributed system for
    • data ingestion
    • real-time analytics
    • batch processing
    • data export

  26. Other commercial/OS packages
    • Pivotal HAWQ
    • IBM BigSQL
    • Apache Drill
    • Impala
    • Apache Stinger project
