Apache™ Pig
An Introduction
Moty Michaely
June, 2015
Outline
What is Apache™ Pig
Motivation
Hands On
Use Cases
Questions
What Is
Apache™ Pig
“High-level platform for
creating MapReduce programs
used with Hadoop.”
(Wikipedia)
Apache™ Pig is
High Level Scripting Language
- Pig Latin
Hadoop MapReduce/Tez Compiler
- Supports MapReduce and Tez
Commonly Used
- Netflix, Xplenty, eBay, Yahoo, Wix...
Open Source
- Backed by the community
Why Pig (the name)?
Pigs Eat Anything
- Relational, Nested, Unstructured
- Files, Key/Value stores, Databases
Pigs Live Anywhere
- Not tied to a particular framework
Pigs Are Domestic Animals
- Easily controlled
- Integration of user code
Pigs Fly
- Faster development
- Improved performance
Apache™ Pig architecture
Recap: Apache Pig
- Pig Latin scripting language
- MR/Tez Compiler
Motivation for
Apache™ Pig
Motivation for Pig
Increase productivity
- 10 lines of Pig Latin ≈ 200 lines of Java
- 4 hours of Java ≈ 15 minutes of Pig Latin
Open to non-Java developers
- It’s like SQL
Optimization opportunities
- No need to tune Hadoop for your needs
- Execution plan, optimizer
Extensibility
- User defined functions
- Integration with Python, Ruby and JS
Pig Latin (For Apache™ Pig)
“A high-level language that
allows you to write data
processing and analysis
programs.”
Pig Latin For Apache™ Pig
Relations
- A relation (table) is a bag
- A bag is a collection of tuples
- A tuple (row) is an ordered set of fields
- A field is a piece of data
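
The four definitions above can be mirrored with plain Python values. This is only an illustrative sketch, not Pig code: the sample names and data are hypothetical, and a Python list stands in for a bag because a bag may hold duplicate tuples.

```python
# Illustrative only: Pig's data model mirrored with plain Python values.
# All names and sample data below are hypothetical.

# A field is a piece of data.
name_field = "alice"

# A tuple (row) is an ordered set of fields.
row = ("alice", 31)

# A bag is a collection of tuples; duplicates are allowed,
# so a list is used here rather than a set.
bag = [("alice", 31), ("bob", 27), ("alice", 31)]

# A relation (table) is a bag.
relation = bag

# Fields can be addressed by position, as with $0 and $1 in Pig Latin.
first_fields = [t[0] for t in relation]
print(first_fields)
```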
Running Pig
Pig supports two execution modes
Interactive Mode
$ cd /path/to/pig/bin/
$ pig
grunt> a = LOAD '/path/to/file';
grunt> DUMP a;
Batch Mode
$ cd /path/to/pig/bin/
$ pig -f /path/to/pig/file.pig
Word Count Problem
“Given text files, return how
often words occur”
Word Count in MR
Word Count in MR (Mapper)
Word Count in MR (Reducer)
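
The mapper and reducer code shown on these slides is not reproduced in the text. As a rough stand-in, the sketch below walks through the same map, shuffle and reduce phases in plain Python rather than with Hadoop's Java API. The function names and sample input are hypothetical, and the shuffle step is done by hand here, whereas a MapReduce framework would normally perform it.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle phase: collect the emitted values under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(word, ones):
    """Reduce phase: sum the counts collected for one word."""
    return (word, sum(ones))


def word_count(lines):
    """Run the three phases in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(word, ones) for word, ones in shuffle(pairs).items())


sample_lines = ["pigs eat anything", "pigs fly"]  # hypothetical input
result = word_count(sample_lines)
print(result)
```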
Word Count in Pig
-- Word Count Script (wordcount.pig)
text = LOAD 'word_count_text.txt';
words = FOREACH text GENERATE
FLATTEN(TOKENIZE((chararray)$0)) AS word;
grouped_words = GROUP words BY word;
counts = FOREACH grouped_words GENERATE
group AS word, COUNT(words) AS count;
STORE counts INTO 'wordcount';
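
To make the intermediate relations concrete, here is a small Python sketch that follows the script statement by statement. It is an approximation under stated assumptions: the input lines are hypothetical, and `str.split()` is a simplification of how TOKENIZE handles delimiters.

```python
from itertools import groupby

# Hypothetical input lines standing in for word_count_text.txt.
# Each line is a tuple with a single field, addressed as $0 in the script.
text = [("pigs eat anything",), ("pigs fly",)]

# FOREACH text GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word:
# split each line into words, then flatten so each word is its own tuple.
words = [(word,) for (line,) in text for word in line.split()]

# GROUP words BY word: one tuple per distinct word, holding the key
# (named "group" in Pig) and a bag of the matching tuples.
keyed = sorted(words, key=lambda t: t[0])
grouped_words = [
    (key, list(bag)) for key, bag in groupby(keyed, key=lambda t: t[0])
]

# FOREACH grouped_words GENERATE group AS word, COUNT(words) AS count:
# count the tuples in each group's bag.
counts = [(group, len(bag)) for group, bag in grouped_words]
print(counts)
```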
Word Count in Pig - Sorted
-- Word Count Script (wordcount.pig)
text = LOAD 'word_count_text.txt';
words = FOREACH text GENERATE
FLATTEN(TOKENIZE((chararray)$0)) AS word;
grouped_words = GROUP words BY word;
counts = FOREACH grouped_words GENERATE
group AS word, COUNT(words) AS count;
sorted_counts = ORDER counts BY count DESC, word ASC;
STORE sorted_counts INTO 'wordcount_sorted';
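
The two-key ordering in the ORDER statement (count descending, then word ascending) can be sketched in Python as below. The sample tuples are hypothetical stand-ins for the counts relation.

```python
# Hypothetical (word, count) tuples standing in for the counts relation.
counts = [("fly", 1), ("pigs", 2), ("eat", 1), ("anything", 1)]

# ORDER counts BY count DESC, word ASC:
# negate the count for descending order, keep the word ascending
# so that ties on count are broken alphabetically.
sorted_counts = sorted(counts, key=lambda t: (-t[1], t[0]))
print(sorted_counts)
```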
Word Count - MR vs. Pig
MR: 63 lines of code
Pig: 5 lines of code
-- Word Count Script (wordcount.pig)
text = LOAD 'word_count_text.txt';
words = FOREACH text GENERATE
FLATTEN(TOKENIZE((chararray)$0)) AS word;
grouped_words = GROUP words BY word;
counts = FOREACH grouped_words GENERATE
group AS word, COUNT(words) AS count;
STORE counts INTO 'wordcount';
Recap: Hands On
- Word Count
- Execution Plan
- Optimization
Use Cases Of
Apache™ Pig
“80% of the work in any data
project is in cleaning the data.”
(DJ Patil, Data Jujitsu)
Pig is great for
Web log processing
Data processing for web search platforms
Ad hoc queries across large data sets
Rapid prototyping of algorithms for
processing large data sets
Questions
Resources
Apache Pig Philosophy
Apache Pig 0.14 Documentation
Word Count Example
Introduction to Apache Tez
Pig (Language)
Pig for dummies
Pig Latin (Language Game)
Xplenty
Data Jujitsu: The art of turning data into product
Pig Cheat Sheet