Slide 1

Slide 1 text

Spring User Group
June 2014

Slide 2

Slide 2 text

Agenda
• Evolving data processing patterns
• Apache Hadoop
• Shortcomings of Hadoop
• Apache Hive
• Apache Pig
• Other utilities of interest
• Trends
• Q&A

Slide 3

Slide 3 text

Evolving data processing patterns
• Data lake
• Data hub
• Extended data warehouse

Slide 4

Slide 4 text

High-level architecture building blocks
[Diagram labels: Big Data, Fast Data, Data warehouse, ?]

Slide 5

Slide 5 text

Hadoop, the common element
• Open-source Apache project out of Yahoo! in 2006
• Distributed fault-tolerant data storage and batch processing
• Linear scalability on commodity hardware

Hadoop can do real time too

Slide 6

Slide 6 text

What has changed
[Diagram labels: Agility; Analytics, Business, Apps, Data; Agility, Profitability]

Slide 7

Slide 7 text

Agility
• Turnaround time to generate actionable insights from data
• Data processing speed
• Data ingestion speed

Slide 8

Slide 8 text

MR sequence: Split, Map, Shuffle, Reduce

Split
• Hadoop is fun
• I love Hadoop
• Pig is more fun

Map
• Hadoop, 1; Is, 1; Fun, 1
• I, 1; Love, 1; Hadoop, 1
• Pig, 1; Is, 1; More, 1; Fun, 1

Shuffle
• Hadoop, {1,1}; Is, {1,1}; Fun, {1,1}; I, 1; Love, 1; Pig, 1; More, 1

Reduce
• Hadoop, 2; Is, 2; Fun, 2; I, 1; Love, 1; Pig, 1; More, 1
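The four stages can be traced in plain Java without a cluster. The following is a minimal sketch, not Hadoop code: the class name MrSequenceSketch and its methods are hypothetical, and words are lower-cased here as an assumption so that "Hadoop" and "hadoop" would count as one key.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Plain-Java simulation of Split, Map, Shuffle, Reduce (hypothetical names, no Hadoop). */
final class MrSequenceSketch {

    /** Map: emit a (word, 1) pair for every whitespace-separated token of one split. */
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : split.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty split produces no pairs
            }
            // Assumption: keys are lower-cased; the slide shows them capitalized.
            pairs.add(new AbstractMap.SimpleEntry<>(token.toLowerCase(Locale.ROOT), 1));
        }
        return pairs;
    }

    /** Shuffle: group all emitted values by key, e.g. hadoop -> [1, 1]. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce: sum the grouped values of each key. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    /** Runs map over every split, then shuffle, then reduce. */
    static Map<String, Integer> wordCount(List<String> splits) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String split : splits) {
            pairs.addAll(map(split));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("Hadoop is fun", "I love Hadoop", "Pig is more fun");
        System.out.println(wordCount(splits));
    }
}
```

Run on the three sample sentences, the sketch yields the same counts as the Reduce column: 2 for hadoop, is and fun, and 1 for every other word. In a real job the shuffle is done by the framework between the map and reduce tasks; only map and reduce are user code.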

Slide 9

Slide 9 text

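Code for word count

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token of an input line.
  public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: configures the job and waits for it to finish.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}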

Slide 10

Slide 10 text

MapReduce is not so agile!

Slide 11

Slide 11 text

Apache Pig makes it easy
1. High-level open source data flow language on Hadoop, started by Yahoo.
2. Pig Latin, a simple language for data manipulation that is compiled into MapReduce jobs
3. Simplifies joining data and chaining jobs together
4. Faster development cycle

words = LOAD '/data/input/Novel.csv' USING PigStorage('\t') AS (word:chararray, count:int);
sorted_words = ORDER words BY count DESC;
first_words = LIMIT sorted_words 10;
DUMP first_words;
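For readers more at home in Java, the Pig Latin script above computes roughly the following on a single machine. This is a hedged sketch of the script's meaning only, not of how Pig executes it: the class TopWordsSketch, its methods and the sample rows are hypothetical, and the input is assumed to be well-formed "word<TAB>count" lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Single-machine sketch of LOAD, ORDER ... DESC, LIMIT (hypothetical names). */
final class TopWordsSketch {

    /** One row of the (word:chararray, count:int) relation. */
    static final class WordCount {
        final String word;
        final int count;

        WordCount(String word, int count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return "(" + word + "," + count + ")";
        }
    }

    /** LOAD: parse tab-separated lines; assumes every line has exactly two valid fields. */
    static List<WordCount> load(List<String> lines) {
        List<WordCount> rows = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            rows.add(new WordCount(fields[0], Integer.parseInt(fields[1].trim())));
        }
        return rows;
    }

    /** ORDER BY count DESC, then LIMIT n. The input list is left unmodified. */
    static List<WordCount> topByCount(List<WordCount> rows, int n) {
        List<WordCount> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingInt((WordCount row) -> row.count).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        // Made-up sample rows standing in for the file named in the script.
        List<WordCount> words = load(Arrays.asList("love\t3", "the\t10", "hamlet\t7"));
        System.out.println(topByCount(words, 2)); // DUMP equivalent
    }
}
```

The four Pig Latin statements replace all of this plus the job wiring, which is the point of the slide.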

Slide 12

Slide 12 text

What happens when a Pig script runs
[Diagram labels]
Script operators: LOAD, FILTER, LOAD, JOIN, GROUP, FOREACH, STORE
Map and Reduce stages (Map, Reduce, Map, Reduce): FILTER, LOCAL REARRANGE, PACKAGE, FOREACH, LOCAL REARRANGE, PACKAGE, FOREACH

Slide 13

Slide 13 text

Pig makes it easy: process a web server log file
Step 1: 'Load' a terabyte-sized file

Pig Latin
weblog = LOAD '/data/TerabyteWebLog.csv' USING PigStorage('\t') AS (hostname:chararray, date:chararray, url:chararray);

SQL
select * from TerabyteWebLog;

MR code
Too big to fit in this space!

Slide 14

Slide 14 text

Pig makes it easy: process a web server log file
Step 2: Find the number of users on June 05, 2014

Pig Latin
usersFrom10T212 = FILTER weblog BY DATE_EXTRACT_DD(date) == '05/JUN/2014';

SQL
select * from users where to_date(date, 'dd/mon/yyyy') = '05/JUN/2014';

MR code
Too big to fit in this space!

Slide 15

Slide 15 text

Pig makes it easy: process a web server log file
Step 3: Store results into a file

Pig Latin
STORE usersFrom10T212 INTO '/data/output/usersFrom10T212';

"SQL"
$ hive -e "select * from table where id > 10" > ~/sample_output.txt

MR code
Too big to fit in this space!
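The three steps (load, filter, store) can be sketched in plain Java to make the data flow concrete. This is an illustration on a small local file under stated assumptions, not a way to process a terabyte: it reads everything into memory, the class WebLogSketch and its methods are hypothetical, and the date field is assumed to already hold a day string such as 05/JUN/2014.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Local, in-memory sketch of LOAD, FILTER and STORE (hypothetical names). */
final class WebLogSketch {

    /** Step 1, LOAD: read tab-separated rows of hostname, date, url; skip malformed lines. */
    static List<String[]> load(Path input) throws IOException {
        List<String[]> rows = new ArrayList<>();
        for (String line : Files.readAllLines(input, StandardCharsets.UTF_8)) {
            String[] fields = line.split("\t", -1); // -1 keeps trailing empty fields
            if (fields.length == 3) {
                rows.add(fields);
            }
        }
        return rows;
    }

    /** Step 2, FILTER: keep rows whose date field equals the given day (assumed format). */
    static List<String[]> filterByDate(List<String[]> rows, String day) {
        List<String[]> matches = new ArrayList<>();
        for (String[] row : rows) {
            if (row[1].equalsIgnoreCase(day)) {
                matches.add(row);
            }
        }
        return matches;
    }

    /** Step 3, STORE: write the remaining rows back out, tab-separated. */
    static void store(List<String[]> rows, Path output) throws IOException {
        List<String> lines = new ArrayList<>();
        for (String[] row : rows) {
            lines.add(String.join("\t", row));
        }
        Files.write(output, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Temporary files with made-up rows stand in for the paths on the slides.
        Path input = Files.createTempFile("weblog", ".tsv");
        Files.write(input, Arrays.asList(
                "host1\t05/JUN/2014\t/index.html",
                "host2\t06/JUN/2014\t/about.html"), StandardCharsets.UTF_8);
        Path output = Files.createTempFile("usersFrom", ".tsv");
        store(filterByDate(load(input), "05/JUN/2014"), output);
        System.out.println(Files.readAllLines(output, StandardCharsets.UTF_8));
    }
}
```

Each Pig Latin statement on the three slides corresponds to one method here; Pig additionally distributes the work across the cluster, which this sketch does not attempt.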

Slide 16

Slide 16 text

Pig is widely used for
• Rapid prototyping of algorithms
• Simple Extract, Transform and Load (ETL) pipelines
• Correlating unstructured with structured datasets
• Building analytical models
• Clickstream data processing pipelines
• Cleansing data
• Calculating common aggregates
• Loading data into an Enterprise Data Warehouse

Slide 17

Slide 17 text

Pig also...
• Causes Out Of Memory errors (Reducer)
• Sometimes does not understand the difference between null and ""
• Nested FOREACH and scoping
• Date management UDFs can be improved

Slide 18

Slide 18 text

Apache Hive
• A data warehouse infrastructure built on top of Hadoop for providing data summarization, ad-hoc queries, and analysis.
• Key building principles: SQL is a familiar language; extensibility (types, functions, formats, scripts); performance

create external table wordcounts (word string, count int)
  row format delimited fields terminated by '\t'
  location '/hdfs/warehouse/HamletNovel.csv';

select * from wordcounts order by count desc limit 10;

select SUM(count) from wordcounts where word like 'love%';
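To spell out what the last query asks for, here is a small plain-Java sketch of the same aggregation. It only illustrates the query's meaning and says nothing about how Hive runs it; the class HiveQuerySketch, its method and the sample rows are hypothetical, and the map-based table assumes each word appears in one row.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** In-memory sketch of: select SUM(count) from wordcounts where word like 'love%'. */
final class HiveQuerySketch {

    /** Sums the counts of all words starting with the prefix (LIKE 'prefix%' is a prefix match). */
    static long sumCountsWithPrefix(Map<String, Integer> wordCounts, String prefix) {
        long sum = 0;
        for (Map.Entry<String, Integer> row : wordCounts.entrySet()) {
            if (row.getKey().startsWith(prefix)) {
                sum += row.getValue();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up rows standing in for the wordcounts table.
        Map<String, Integer> wordcounts = new LinkedHashMap<>();
        wordcounts.put("love", 3);
        wordcounts.put("loved", 2);
        wordcounts.put("lovely", 1);
        wordcounts.put("hamlet", 7);
        System.out.println(sumCountsWithPrefix(wordcounts, "love"));
    }
}
```

With the sample rows the sum covers love, loved and lovely (3 + 2 + 1) and skips hamlet.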

Slide 19

Slide 19 text

What happens when a Hive query runs

create external table wordcounts (word string, count int)
  row format delimited fields terminated by '\t'
  location '/hdfs/warehouse/HamletNovel.csv';

[Diagram labels: SELECT; GROUP BY; JOIN, UNION; Map 1; Reduce 1; Reduce 2; ORDER BY]

Slide 20

Slide 20 text

Hive is not meant for
• An OLTP application
• Low-latency database access
• Transactional database (ACID)
• Row-level inserts, updates or deletes

And...
• Out Of Memory errors in Reducers can be hard to fix
• Not many options for debugging
• Like Pig, has trouble understanding the difference between null and ""

Slide 21

Slide 21 text

Hive vs RDBMS: Differences

Feature | Hive / HiveQL | RDBMS / SQL
Latency | Minutes | Sub-seconds
Transactions | Not supported | Supported
Row-level inserts | Bulk data can be appended or overwritten | Supported

Slide 22

Slide 22 text

Hive vs RDBMS: Differences

Feature | Hive / HiveQL | RDBMS / SQL
Constraints (Primary Key, Foreign Key ..) | Not supported | Supported
Data types | Simple: integral, float, boolean. Complex: string, array, map, struct. Date type not supported | Integral, float, text and binary strings, temporal
Updates | INSERT OVERWRITE TABLE (populates whole table or partition) | UPDATE, INSERT, DELETE

Slide 23

Slide 23 text

Cascading
• Java library to simplify complex MapReduce jobs
• Can address some of the limitations of Pig
• Easy to create data pipelines

Slide 24

Slide 24 text

Apache Storm
Distributed real-time computation (stream processing) system
• Real-time analytics
• Online machine learning
• Continuous computation
• ETL etc.

Slide 25

Slide 25 text

Spring XD
Spring XD is a distributed system for
• data ingestion
• real-time analytics
• batch processing
• data export

Slide 26

Slide 26 text

Other commercial/OS packages
• Pivotal HAWQ
• IBM BigSQL
• Apache Drill
• Impala
• Apache Stinger project

Slide 27

Slide 27 text

Q&A