Querying and scripting in Hadoop (by Kailashnath Kutti)

Spring User Group June 2014

Agenda
• Evolving data processing patterns
• Apache Hadoop
• Shortcomings of Hadoop
• Apache Hive
• Apache Pig
• Other utilities of interest
• Trends
• Q&A

Evolving data processing patterns
• Data lake
• Data hub
• Extended data warehouse

High-level architecture building blocks
• Big Data
• Fast Data
• Data warehouse?

Hadoop, the common element
• Open-source Apache project out of Yahoo! in 2006
• Distributed fault-tolerant data storage and batch processing
• Linear scalability on commodity hardware
Hadoop can do real time too

What has changed
• Agility
• Profitability
Diagram labels: Analytics, Business Apps, Data

Agility
• Turnaround time to generate actionable insights from data
• Data processing speed
• Data ingestion speed

MR sequence: Split, Map, Shuffle, Reduce
Split:
    Hadoop is fun
    I love Hadoop
    Pig is more fun
Map:
    (Hadoop, 1) (is, 1) (fun, 1)
    (I, 1) (love, 1) (Hadoop, 1)
    (Pig, 1) (is, 1) (more, 1) (fun, 1)
Shuffle:
    Hadoop, {1,1}   is, {1,1}   fun, {1,1}   I, 1   love, 1   Pig, 1   more, 1
Reduce:
    Hadoop, 2   is, 2   fun, 2   I, 1   love, 1   Pig, 1   more, 1
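
The four stages above can be traced by hand in plain Java. This is only a sketch of the idea, not Hadoop code: it uses no Hadoop classes, runs in a single JVM, and the class and method names (MrSequenceSketch, map, shuffle, reduce, run) are made up for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory walk-through of Split -> Map -> Shuffle -> Reduce.
class MrSequenceSketch {

    // Map: emit a (word, 1) pair for every token of one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : split.trim().split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
        }
        return pairs;
    }

    // Shuffle: group the emitted values by key, e.g. Hadoop -> [1, 1].
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: sum the grouped values of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Each list element plays the role of one split.
    static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String split : splits) {
            pairs.addAll(map(split));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("Hadoop is fun", "I love Hadoop", "Pig is more fun");
        // prints {Hadoop=2, is=2, fun=2, I=1, love=1, Pig=1, more=1}
        System.out.println(run(splits));
    }
}
```

In a real cluster the map calls run in parallel on different nodes and the shuffle moves data over the network; the sketch only mirrors the data flow.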

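Code for word count

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Each public class goes in its own .java file.

    // Mapper: emits (word, 1) for every token of an input line.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: wires mapper and reducer into a job and runs it.
    public class WordCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "wordcount");
            job.setJarByClass(WordCountMapper.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.waitForCompletion(true);
        }
    }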

MapReduce is not so agile!

Apache Pig makes it easy
High-level open source data flow language on Hadoop, started by Yahoo.
1. Pig Latin, a simple language for data manipulation that is compiled into MapReduce jobs
2. Simplifies joining data and chaining jobs together
3. Faster development cycle

    words = LOAD '/data/input/Novel.csv' USING PigStorage('\t') AS (word:chararray, count:int);
    sorted_words = ORDER words BY count DESC;
    first_words = LIMIT sorted_words 10;
    DUMP first_words;
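
For comparison, here is roughly what the four-line Pig script above asks for, written as a plain-Java sketch over an in-memory list. It assumes tab-separated "word, count" lines as in the script; the class TopWordsSketch and its methods are hypothetical names, and nothing here touches Hadoop or HDFS.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical single-JVM equivalent of LOAD / ORDER BY / LIMIT / DUMP.
class TopWordsSketch {

    static final class WordCount {
        final String word;
        final int count;

        WordCount(String word, int count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return "(" + word + "," + count + ")";
        }
    }

    // LOAD ... USING PigStorage('\t') AS (word:chararray, count:int)
    static List<WordCount> load(List<String> lines) {
        return lines.stream()
                .map(line -> line.split("\t"))
                .map(fields -> new WordCount(fields[0], Integer.parseInt(fields[1])))
                .collect(Collectors.toList());
    }

    // ORDER ... BY count DESC followed by LIMIT n
    static List<WordCount> top(List<WordCount> words, int n) {
        return words.stream()
                .sorted(Comparator.comparingInt((WordCount w) -> w.count).reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample rows standing in for the input file.
        List<String> lines = Arrays.asList("love\t3", "hamlet\t7", "the\t12");
        // DUMP: print the result to the console.
        top(load(lines), 10).forEach(System.out::println);
    }
}
```

The Java version only works while the data fits in memory on one machine; the point of Pig is that the same few lines are compiled into MapReduce jobs that scale out.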

What happens when a Pig script runs
Script operators: LOAD, FILTER, LOAD, JOIN, GROUP, FOREACH, STORE
MapReduce jobs: Map, Reduce, Map, Reduce
Operators inside the jobs: FILTER, LOCAL REARRANGE, PACKAGE, FOREACH, LOCAL REARRANGE, PACKAGE, FOREACH

Pig makes it easy: process a web server log file
Step 1: 'Load' a terabyte-sized file
Pig Latin:
    weblog = LOAD '/data/TerabyteWebLog.csv' USING PigStorage('\t') AS (hostname:chararray, date:chararray, url:chararray);
SQL:
    select * from TerabyteWebLog;
MR code: Too big to fit in this space!

Step 2: Find number of users on June 05 2014
Pig Latin:
    usersFrom10T212 = FILTER weblog BY DATE_EXTRACT_DD(date) == '05/JUN/2014';
SQL:
    select * from users where to_date(date,'dd/mon/yyyy')='05/JUN/2014';
MR code: Too big to fit in this space!

Step 3: Store results into a file
Pig Latin:
    STORE usersFrom10T212 INTO '/data/output/usersFrom10T212';
"SQL":
    $ hive -e "select * from table where id > 10" > ~/sample_output.txt
MR code: Too big to fit in this space!
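
To make the load, filter and store steps concrete, here is a small plain-Java sketch of the same pipeline on a handful of in-memory lines. It assumes the log is tab-separated (hostname, date, url) and that the date field already holds a dd/MON/yyyy string; the class WebLogSketch, its method names and the sample rows are invented for illustration and are not part of Pig or Hadoop.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical single-machine version of the three Pig steps.
class WebLogSketch {

    // Step 1 (LOAD): split each line into hostname, date, url; skip malformed lines.
    static List<String[]> load(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length == 3) {
                rows.add(fields);
            }
        }
        return rows;
    }

    // Step 2 (FILTER): keep the rows whose date field matches the given day.
    static List<String[]> filterByDate(List<String[]> rows, String day) {
        List<String[]> hits = new ArrayList<>();
        for (String[] row : rows) {
            if (row[1].equalsIgnoreCase(day)) {
                hits.add(row);
            }
        }
        return hits;
    }

    // Step 3 (STORE): write the rows back out as tab-separated lines.
    static void store(List<String[]> rows, Path out) throws IOException {
        List<String> lines = new ArrayList<>();
        for (String[] row : rows) {
            lines.add(String.join("\t", row));
        }
        Files.write(out, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data standing in for the terabyte-sized log.
        List<String> lines = Arrays.asList(
                "host1\t05/JUN/2014\t/index.html",
                "host2\t06/JUN/2014\t/about.html");
        List<String[]> hits = filterByDate(load(lines), "05/JUN/2014");
        Path out = Files.createTempFile("usersFrom10T212", ".tsv");
        store(hits, out);
        System.out.println(hits.size() + " row(s) written to " + out);
    }
}
```

The sketch reads everything into memory, which is exactly what stops working at terabyte scale and why the three Pig statements are the shorter route.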

Pig is widely used for
• Rapid prototyping of algorithms
• Simple Extract, Transform and Load (ETL) pipelines
• Correlation between unstructured and structured datasets
• Building analytical models
• Click stream data processing pipelines
• Cleansing data
• Calculating common aggregates
• Loading data into an Enterprise Data Warehouse

Pig also...
• Causes Out Of Memory errors (reducer)
• Sometimes does not understand the difference between null and ""
• Nested FOREACH and scoping
• Date management UDFs can be improved

Apache Hive
• A data warehouse infrastructure built on top of Hadoop for providing data summarization, ad-hoc queries, and analysis.
• Key building principles: SQL is a familiar language; extensibility (types, functions, formats, scripts); performance

    create external table wordcounts (word string, count int)
      row format delimited fields terminated by '\t'
      location '/hdfs/warehouse/HamletNovel.csv';
    select * from wordcounts order by count desc limit 10;
    select SUM(count) from wordcounts where word like 'love%';

What happens when a Hive query runs

    create external table wordcounts (word string, count int)
      row format delimited fields terminated by '\t'
      location '/hdfs/warehouse/HamletNovel.csv';

Diagram labels: SELECT; GROUP BY; JOIN, UNION; Map 1; Reduce 1; Reduce 2; ORDER BY

Hive is not meant for
• An OLTP application
• Low-latency database access
• Transactional database (ACID)
• Row-level inserts, updates or deletes
• Out of Memory errors in reducers can be hard to fix
• Not many options for debugging
• Like Pig, has trouble understanding the difference between null and ""
And...

Hive vs RDBMS: Differences
Feature | Hive / HiveQL | RDBMS / SQL
Latency | Minutes | Sub-seconds
Transactions | Not supported | Supported
Row-level inserts | Bulk data can be appended or overwritten | Supported

Hive vs RDBMS: Differences
Feature | Hive / HiveQL | RDBMS / SQL
Constraints (primary key, foreign key, ...) | Not supported | Supported
Data types | Simple: integral, float, boolean. Complex: string, array, map, struct. Date type not supported | Integral, float, text and binary strings, temporal
Updates | INSERT OVERWRITE TABLE (populates whole table or partition) | UPDATE, INSERT, DELETE

Cascading
• Java library to simplify complex MapReduce jobs
• Can address some of the limitations of Pig
• Easy to create data pipelines

Apache Storm
Distributed real-time computation (stream processing) system
• Real-time analytics
• Online machine learning
• Continuous computation
• ETL etc.

Spring XD
Spring XD is a distributed system for
• data ingestion
• real-time analytics
• batch processing
• data export

Other commercial/OS packages
• Pivotal HAWQ
• IBM BigSQL
• Apache Drill
• Impala
• Apache Stinger project


Michael Isvy


Q&A