of Yahoo! in 2006
• Distributed, fault-tolerant data storage and batch processing
• Linear scalability on commodity hardware
Hadoop can do real time too
WordCount data flow

Input:
  Hadoop is fun
  I love Hadoop
  Pig is more fun

Map output (word, 1):
  (Hadoop, 1) (is, 1) (fun, 1)
  (I, 1) (love, 1) (Hadoop, 1)
  (Pig, 1) (is, 1) (more, 1) (fun, 1)

Shuffle and sort (values grouped by word):
  (Hadoop, {1,1}) (is, {1,1}) (fun, {1,1}) (I, {1}) (love, {1}) (Pig, {1}) (more, {1})

Reduce output (word, sum):
  (Hadoop, 2) (is, 2) (fun, 2) (I, 1) (love, 1) (Pig, 1) (more, 1)
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Mapper: emits (word, 1) for every token in each input line
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sums the 1s emitted for each word
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

// Driver (typically the body of a main(String[] args) method): configure and submit the job
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setJarByClass(WordCountMapper.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
data flow language on Hadoop, started by Yahoo.
1. Pig Latin, a simple language for data manipulation that is compiled into MapReduce jobs
2. Simplifies joining data and chaining jobs together
3. Faster development cycle (see the word-count sketch after the script below)

words = LOAD '/data/input/Novel.csv' USING PigStorage('\t') AS (word:chararray, count:int);
sorted_words = ORDER words BY count DESC;
first_words = LIMIT sorted_words 10;
DUMP first_words;
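To make the development-cycle point concrete, here is a minimal Pig Latin sketch of the same WordCount that the Java job above computes. It assumes a plain-text input file, here called /data/input/Novel.txt (a hypothetical path; Novel.csv above already holds precomputed counts), and the relation names lines, words, grouped, and counts are illustrative.

-- Sketch: WordCount in Pig Latin (hypothetical input path and relation names)
lines   = LOAD '/data/input/Novel.txt' AS (line:chararray);        -- read each line as one field
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;  -- split lines into one word per record
grouped = GROUP words BY word;                                     -- gather identical words together
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS count;  -- sum occurrences per word
DUMP counts;

A handful of Pig Latin statements here do roughly the work of the mapper, reducer, and driver classes in the Java version.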
Step 1 – Load a terabyte-sized file
Pig Latin:
weblog = LOAD '/data/TerabyteWebLog.csv' USING PigStorage('\t') AS (hostname:chararray, date:chararray, url:chararray);
SQL:
select * from TerabyteWebLog;
MR code: too big to fit in this space!
Step 2 – Find the number of users on June 05, 2014
Pig Latin:
usersFrom10T212 = FILTER weblog BY DATE_EXTRACT_DD(date) == '05/JUN/2014';
SQL:
select * from users where to_date(date, 'dd/mon/yyyy') = '05/JUN/2014';
MR code: too big to fit in this space!
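The FILTER above only isolates the matching rows; to get the actual number of users, a count still has to be computed. A minimal sketch of that last step in Pig Latin, reusing the usersFrom10T212 relation from the slide (the relation names all_users and user_count are illustrative):

-- Sketch: count the rows that survived the filter
all_users  = GROUP usersFrom10T212 ALL;                          -- collapse everything into one group
user_count = FOREACH all_users GENERATE COUNT(usersFrom10T212);  -- number of matching records
DUMP user_count;

If "number of users" means distinct hosts, a DISTINCT on hostname would be applied before the count.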
Step 3 – Store results into a file
Pig Latin:
STORE usersFrom10T212 INTO '/data/output/usersFrom10T212';
"SQL":
$ hive -e "select * from table where id > 10" > ~/sample_output.txt
MR code: too big to fit in this space!
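Putting the three steps together, the whole job is one short script. This is a sketch assembled from the snippets above (paths and relation names as on the slides; DATE_EXTRACT_DD is carried over from the slide and assumed to be a UDF available on the cluster), illustrating how Pig chains the load, filter, and store into a single pipeline.

-- Sketch: steps 1-3 chained into one Pig Latin script
weblog          = LOAD '/data/TerabyteWebLog.csv' USING PigStorage('\t')
                      AS (hostname:chararray, date:chararray, url:chararray);
usersFrom10T212 = FILTER weblog BY DATE_EXTRACT_DD(date) == '05/JUN/2014';  -- assumes a registered UDF
STORE usersFrom10T212 INTO '/data/output/usersFrom10T212';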
Algorithms
• Simple Extract, Transform and Load (ETL) pipelines
• Correlation between unstructured and structured datasets
• Build analytical models
• Click-stream data processing pipelines
• Cleanse data
• Calculate common aggregates (one example is sketched after this list)
• Load data into an Enterprise Data Warehouse
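As one concrete instance of the "common aggregates" bullet above, here is a hedged Pig Latin sketch that counts hits per URL in the weblog relation loaded earlier; the relation names by_url, hits_per_url, and top_pages are illustrative, not from the slides.

-- Sketch: a typical click-stream aggregate - page hits per URL
by_url       = GROUP weblog BY url;
hits_per_url = FOREACH by_url GENERATE group AS url, COUNT(weblog) AS hits;
top_pages    = ORDER hits_per_url BY hits DESC;
STORE top_pages INTO '/data/output/hits_per_url';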
top of Hadoop for providing data summarization, ad-hoc queries, and analysis.
• Key building principles – SQL is a familiar language; extensibility in types, functions, formats, and scripts; performance

create external table wordcounts (word string, count int)
  row format delimited fields terminated by '\t'
  location '/hdfs/warehouse/HamletNovel.csv';

select * from wordcounts order by count desc limit 10;

select SUM(count) from wordcounts where word like 'love%';
[Diagram: the create external table statement above shown being compiled – HiveQL clauses such as SELECT, GROUP BY, JOIN, UNION, and ORDER BY are turned into a chain of MapReduce stages (Map 1, Reduce 1, Reduce 2).]
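To see this compilation for yourself, Hive's EXPLAIN statement prints the stage plan for a query. A sketch against the wordcounts table above (the query itself is illustrative):

-- Sketch: inspect how Hive turns a GROUP BY + ORDER BY query into MapReduce stages
EXPLAIN
SELECT word, SUM(count) AS total
FROM wordcounts
GROUP BY word
ORDER BY total DESC;

The plan typically shows one MapReduce job for the GROUP BY aggregation and a second for the ORDER BY.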
• Low-latency database access
• Transactional database workloads (ACID)
• Row-level inserts, updates, or deletes
• Out-of-memory errors in reducers can be hard to fix
• Not many options for debugging
• Like Pig, understanding the difference between NULL and "" can be tricky (see the sketch below)
And…
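A hedged HiveQL sketch of the NULL-versus-empty-string pitfall mentioned above, using a hypothetical table weblog_raw with a string column referrer (neither name is from the slides):

-- Sketch: NULL and '' are different values and are matched differently
SELECT COUNT(*) FROM weblog_raw WHERE referrer IS NULL;   -- rows where the field is missing
SELECT COUNT(*) FROM weblog_raw WHERE referrer = '';      -- rows where the field is an empty string
-- A predicate like referrer = '' never matches NULLs, and referrer IS NULL never matches ''.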
                    HiveQL                                       RDBMS / SQL
Latency             Minutes                                      Sub-seconds
Transactions        Not supported                                Supported
Row-level inserts   Bulk data can be appended or overwritten     Supported