Querying and scripting in Hadoop (by Kailashnath Kutti)

Slide 1

Slide 1 text

Spring User Group June 2014

Slide 2

Slide 2 text

Agenda
•  Evolving data processing patterns
•  Apache Hadoop
•  Shortcomings of Hadoop
•  Apache Hive
•  Apache Pig
•  Other utilities of interest
•  Trends
•  Q&A

Slide 3

Slide 3 text

Evolving data processing patterns
•  Data lake
•  Data hub
•  Extended data warehouse

Slide 4

Slide 4 text

High-level architecture building blocks
[Diagram labels: Big Data, Fast Data, Data warehouse, ?]

Slide 5

Slide 5 text

Hadoop, the common element
•  Open-source Apache project out of Yahoo! in 2006
•  Distributed fault-tolerant data storage and batch processing
•  Linear scalability on commodity hardware
Hadoop can do real time too

Slide 6

Slide 6 text

What has changed
[Diagram labels: Analytics, Business apps, Data; Agility, Profitability]

Slide 7

Slide 7 text

Agility
•  Turnaround time to generate actionable insights from data
•  Data processing speed
•  Data ingestion speed

Slide 8

Slide 8 text

MR sequence: Split -> Map -> Shuffle -> Reduce

Split:   "Hadoop is fun" | "I love Hadoop" | "Pig is more fun"
Map:     (Hadoop, 1) (Is, 1) (Fun, 1) | (I, 1) (Love, 1) (Hadoop, 1) | (Pig, 1) (Is, 1) (More, 1) (Fun, 1)
Shuffle: Hadoop, {1,1}; Is, {1,1}; Fun, {1,1}; I, 1; Love, 1; Pig, 1; More, 1
Reduce:  Hadoop, 2; Is, 2; Fun, 2; I, 1; Love, 1; Pig, 1; More, 1
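The four phases above can be imitated on a single machine in plain Java. This is only a sketch for intuition, not how Hadoop executes a job: the class and method names (MapReduceSketch, map, shuffle, reduce, wordCount) are made up for illustration, and words are lower-cased as an assumption so that "Hadoop" and "hadoop" fall under one key.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine imitation of split -> map -> shuffle -> reduce.
class MapReduceSketch {

    // Map phase: emit a (word, 1) pair for every token in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : split.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token.toLowerCase(), 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key, e.g. hadoop -> [1, 1].
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the grouped values for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Runs the phases in order over a list of input splits.
    static Map<String, Integer> wordCount(List<String> splits) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String split : splits) {
            pairs.addAll(map(split));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("Hadoop is fun", "I love Hadoop", "Pig is more fun");
        System.out.println(wordCount(splits));
    }
}
```

In a real cluster each split is mapped on a different node and the shuffle moves data over the network; here everything happens in one JVM, which is why the sketch stays so short.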

Slide 9

Slide 9 text

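Code for word count

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in an input line.
    public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: wires the mapper and reducer into a job and runs it.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}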

Slide 10

Slide 10 text

MapReduce is not so agile!

Slide 11

Slide 11 text

Apache Pig makes it easy
1. High-level open source data flow language on Hadoop, started by Yahoo.
2. Pig Latin, a simple language for data manipulation that is compiled into MapReduce jobs
3. Simplifies joining data and chaining jobs together
4. Faster development cycle

words = LOAD '/data/input/Novel.csv' USING PigStorage('\t') AS (word:chararray, count:int);
sorted_words = ORDER words BY count DESC;
first_words = LIMIT sorted_words 10;
DUMP first_words;
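For readers who think in Java, the ORDER ... DESC followed by LIMIT in the Pig script above corresponds roughly to the in-memory sketch below. It is an illustration only: Pig compiles the script into MapReduce jobs rather than sorting in one JVM, and the names TopWordsSketch, WordCount, parse and topN are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical in-memory equivalent of: LOAD ... ; ORDER BY count DESC; LIMIT n.
class TopWordsSketch {

    // One row of the (word, count) relation declared in the AS clause.
    static final class WordCount {
        final String word;
        final int count;

        WordCount(String word, int count) {
            this.word = word;
            this.count = count;
        }
    }

    // Parses one tab-delimited line, mirroring PigStorage('\t').
    static WordCount parse(String line) {
        String[] fields = line.split("\t");
        return new WordCount(fields[0], Integer.parseInt(fields[1].trim()));
    }

    // Sorts rows by count, highest first, and keeps the first n.
    static List<WordCount> topN(List<String> lines, int n) {
        return lines.stream()
                .map(TopWordsSketch::parse)
                .sorted(Comparator.comparingInt((WordCount w) -> w.count).reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hadoop\t2", "pig\t1", "fun\t3");
        for (WordCount w : topN(lines, 2)) {
            System.out.println(w.word + "\t" + w.count);
        }
    }
}
```

The four Pig statements replace the parsing, sorting and truncation code here, and they also scale beyond what fits in memory, which is the point the slide is making.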

Slide 12

Slide 12 text

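What happens when a Pig script runs
[Diagram: the script's logical operators (LOAD, FILTER, LOAD, JOIN, GROUP, FOREACH, STORE) are compiled into two Map/Reduce jobs made of physical operators (FILTER, LOCAL REARRANGE, PACKAGE, FOREACH; LOCAL REARRANGE, PACKAGE, FOREACH)]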

Slide 13

Slide 13 text

Pig makes it easy: process a web server log file
Step 1: "Load" a terabyte-sized file

Pig Latin
weblog = LOAD '/data/TerabyteWebLog.csv' USING PigStorage('\t') AS (hostname:chararray, date:chararray, url:chararray);

SQL
select * from TerabyteWebLog;

MR code
Too big to fit in this space!

Slide 14

Slide 14 text

Pig makes it easy: process a web server log file
Step 2: Find the number of users on June 05 2014

Pig Latin
usersFrom10T212 = FILTER weblog BY DATE_EXTRACT_DD(date) == '05/JUN/2014';

SQL
select * from users where to_date(date, 'dd/mon/yyyy') = '05/JUN/2014';

MR code
Too big to fit in this space!
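The FILTER step above keeps the matching rows; getting from there to a "number of users" still needs a count. A minimal Java sketch of both steps follows, under loudly stated assumptions: rows are tab-delimited (hostname, date, url) as in Step 1, the date column starts with a day such as 05/Jun/2014 (the real log format is not given in the slides), and a "user" is taken to mean a distinct hostname. The names WeblogFilterSketch and distinctHostsOn are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical in-memory version of the FILTER step plus a distinct-host count.
class WeblogFilterSketch {

    // Counts distinct hostnames among rows whose date column starts with
    // the given day, compared case-insensitively (e.g. "05/JUN/2014").
    static long distinctHostsOn(List<String> rows, String day) {
        String wanted = day.toUpperCase(Locale.ROOT);
        Set<String> hosts = rows.stream()
                .map(row -> row.split("\t"))
                .filter(fields -> fields.length >= 3) // skip malformed rows
                .filter(fields -> fields[1].toUpperCase(Locale.ROOT).startsWith(wanted))
                .map(fields -> fields[0])
                .collect(Collectors.toSet());
        return hosts.size();
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "host1\t05/Jun/2014:10:00:00\t/a",
                "host1\t05/Jun/2014:11:00:00\t/b",
                "host2\t05/Jun/2014:12:00:00\t/a",
                "host3\t06/Jun/2014:09:00:00\t/a");
        System.out.println(distinctHostsOn(rows, "05/JUN/2014"));
    }
}
```

On a terabyte-sized log this single-JVM approach would not be practical, which is where the Pig script earns its keep.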

Slide 15

Slide 15 text

Pig makes it easy: process a web server log file
Step 3: Store results into a file

Pig Latin
STORE usersFrom10T212 INTO '/data/output/usersFrom10T212';

"SQL"
$ hive -e "select * from table where id > 10" > ~/sample_output.txt

MR code
Too big to fit in this space!

Slide 16

Slide 16 text

Pig is widely used for
•  Rapid prototyping of algorithms
•  Simple Extract, Transform and Load (ETL) pipelines
•  Correlating unstructured with structured datasets
•  Building analytical models
•  Click stream data processing pipelines
•  Cleansing data
•  Calculating common aggregates
•  Loading data into an enterprise data warehouse

Slide 17

Slide 17 text

Pig also...
•  Causes Out Of Memory errors (Reducer)
•  Sometimes doesn't understand the difference between Null and ""
•  Nested FOREACH and scoping
•  Date management UDFs can be improved

Slide 18

Slide 18 text

Apache Hive
•  A data warehouse infrastructure built on top of Hadoop for providing data summarization, ad-hoc queries, and analysis.
•  Key building principles: SQL is a familiar language; extensibility (types, functions, formats, scripts); performance

create external table wordcounts (word string, count int) row format delimited fields terminated by '\t' location '/hdfs/warehouse/HamletNovel.csv';
select * from wordcounts order by count desc limit 10;
select SUM(count) from wordcounts where word like 'love%';
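To make the last query above concrete: LIKE 'love%' matches every word that begins with "love", and SUM adds up the counts of those rows. A small Java sketch of the same computation is shown below. It is illustrative only (Hive runs the query as MapReduce jobs over HDFS data), the names LikeSumSketch and sumCountsWithPrefix are hypothetical, and unlike SQL's SUM it returns 0 rather than NULL when nothing matches.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory version of:
//   select SUM(count) from wordcounts where word like 'love%';
class LikeSumSketch {

    // Sums the counts of all words that start with the given prefix.
    static long sumCountsWithPrefix(Map<String, Integer> wordCounts, String prefix) {
        long sum = 0;
        for (Map.Entry<String, Integer> entry : wordCounts.entrySet()) {
            if (entry.getKey().startsWith(prefix)) {
                sum += entry.getValue();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordCounts = new LinkedHashMap<>();
        wordCounts.put("love", 5);
        wordCounts.put("lovely", 2);
        wordCounts.put("glove", 7); // contains "love" but does not start with it
        wordCounts.put("hate", 1);
        System.out.println(sumCountsWithPrefix(wordCounts, "love"));
    }
}
```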

Slide 19

Slide 19 text

What happens when a Hive query runs

create external table wordcounts (word string, count int) row format delimited fields terminated by '\t' location '/hdfs/warehouse/HamletNovel.csv';

[Diagram: query operators (SELECT, GROUP BY, JOIN, UNION, ORDER BY) mapped onto execution stages (Map 1, Reduce 1, Reduce 2)]

Slide 20

Slide 20 text

Hive is not meant for
•  An OLTP application
•  Low-latency database access
•  Transactional database (ACID)
•  Row-level inserts, updates or deletes
And...
•  Out of Memory errors in reducers can be hard to fix
•  Not many options for debugging
•  Like Pig, has trouble with the difference between Null and ""

Slide 21

Slide 21 text

Hive vs RDBMS - Differences

Feature | Hive / HiveQL | RDBMS / SQL
Latency | Minutes | Sub-seconds
Transactions | Not supported | Supported
Row-level inserts | Bulk data can be appended or overwritten | Supported

Slide 22

Slide 22 text

Hive vs RDBMS - Differences

Feature | Hive / HiveQL | RDBMS / SQL
Constraints (primary key, foreign key, ...) | Not supported | Supported
Data types | Simple: integral, float, boolean. Complex: string, array, map, struct. Date type not supported | Integral, float, text and binary strings, temporal
Updates | INSERT OVERWRITE TABLE (populates whole table or partition) | UPDATE, INSERT, DELETE

Slide 23

Slide 23 text

Cascading
•  Java library to simplify complex MapReduce jobs
•  Can address some of the limitations of Pig
•  Easy to create data pipelines

Slide 24

Slide 24 text

Apache Storm
Distributed real-time computation (stream processing) system
•  Real-time analytics
•  Online machine learning
•  Continuous computation
•  ETL etc.

Slide 25

Slide 25 text

Spring XD
Spring XD is a distributed system for
•  data ingestion
•  real-time analytics
•  batch processing
•  data export

Slide 26

Slide 26 text

Other commercial/open source packages
•  Pivotal HAWQ
•  IBM BigSQL
•  Apache Drill
•  Impala
•  Apache Stinger project

Slide 27

Slide 27 text

Q&A