Learn Apache Map Reduce

Topics to Discuss today Map Reduce Map Reduce use cases
Traditional Way Map Reduce Process Anatomy of a Map Reduce program Map Reduce Way Revisit de-identification architecture Why Map Reduce Session Input Splits Relation between Input Splits & HDFS blocks Map Reduce Flow Distributed Cache Overview of Map Reduce Combiner & Partitioner Input Formats

What is Map Reduce Provides a Parallel Programming model. Moves
Computation to where data is Handles scheduling, Fault tolerance Status Reporting and monitoring

Map Reduce Use Cases Weather Forecasting Health Care Oil &
Gas Industry Map reduce is targeted towards cases where we need to process large datasets with parallel, distributed algorithm on a cluster setup. Let’s say, you need to process huge amount of data. You can process individual records of the data set serially through one software program but if you want the processing faster. One of the things you could do is to break the large chunk of data into smaller chunks and process them in parallel.

Sample Weather Data ftp://ftp.ncdc.noaa.gov/pub/data/uscrn/products/daily01/

Architecture Store Deidentified CSV file into HDFS Taking DB dump
in CSV format and ingest into HDFS Read CSV file from HDFS Deidentify columns based on configurations HDFS Matches Map Task 1 Map Task 2 . . Map Task 1 Map Task 2 . .

Mapper Class : For Running Mapper Logic Reducer Class :
For Running Reducer Logic Driver : For submitting the actual job consisting of Mapper and Reducer class Basic Things Required to write Mapreduce

Traditional Way grep cat Split Data Split Data Split Data
Split Data Matches Matches Matches Matches All Matches Very Big Data

Map Reduce Process The Overall MapReduce Word Count Process Dear
Bear River Bear, 2 Car, 3 Deer, 2 River, 2 Dear Bear River Car Car River Dear Car Bear Car Car River Dear Car Bear K1, V1 Bear, (1,1) Car, (1,1,1) Deer, (1,1) K2, List(V2) River, (1,1) Bear, 2 Car, 3 Deer, 2 River, 2 List(K3, V3) Dear, 1 Bear, 1 River, 1 Car, 1 Car, 1 River, 1 Dear, 1 Car, 1 Bear, 1 List (K2, V2)

Map Reduce Example Word Count Example reads text files and
counts how often words occur. The input is text files and the output is text files. Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value with the word and sum.

Anatomy of a MapReduce Program Value Key Map Reduce Map
(K1, V1) List (K1, V1) Reduce (K2, List (V2)) List (K3, V3)

Result Very Big Data Partitioning Function MapReduce Way R E
D U C E M A P

Revisit De-identification Architecture Store Deidentified CSV file into HDFS Taking
DB dump in CSV format and ingest into HDFS Read CSV file from HDFS Deidentify columns based on configurations HDFS Matches Map Task 1 Map Task 2 . . Map Task 1 Map Task 2 . .

Why MapReduce ? Two biggest Advantages: Taking processing to the
data Processing data in parallel Map Reduce is a programming model for processing and generating large data sets. For processing parallel problems across huge datasets using a large number of nodes. Map-reduce jobs are automatically parallelized. Rack Rack Rack Data Center Map Task HDFS Block a b c Node Node Node

Input Splits HDFS Blocks Input Splits Physical Division Logical Division
Input Data

Relation between Input Splits and HDFS Blocks Logical records do
not fit neatly into the HDFS blocks. Logical records are lines that cross the boundary of the blocks. First split contains line 5 although it spans across blocks. 1 2 3 4 5 6 7 8 9 10 11 Block Boundary Block Boundary Block Boundary Block Boundary File Lines Split Split Split

MapReduce Flow A file is given as input to the
mapreduce program. Input data is distributed to nodes. Each node is send to the mapper Mapper emits the intermediate output in form of key value pair. Shuffling and sorting is done such that the output of the same keys is grouped together. The output is send to the reducer for processing Reducer emits the final output in the form of key-value pair. Node 1 Node 2 Reduce Reduce Node 1 Node 2 Map Map Input Data

Distributed Cache Use Case: Your map reduce application requires extra
information while execution. For Example: While retrieving keywords from a file, you want to use a file consisting of stop words Distributed Cache facility provides caching of the files (text, jar, zip, tar. gz, tgz etc) per JOB Map Reduce framework will copy the files to the slave nodes before executing the task on that node While using distributed cache: Files should be present in HDFS Distributed Cache facility tracks the modification time stamp of the file, so it should not be modified externally

Overview of MapReduce Combiners Partitioners Map Reduce Complete view of
MapReduce, illustrating combiners and partitioners in addition to Mappers and Reducers Combiners can Combiners can be viewed as ‘mini- reducers’ in the Map phase. Partitioners determine which reducer is responsible for a particular key.

Combiner -Local Reduce Local Reduce Runs immediately after Mapper task
is over . Why it is important? Speed up the execution . Minimizes the data transfer over the net. Loosen the burden on reducer . Where it can be used? Operation should be associative ([a () b = b () a]. For example: sum, multiplication . But you cannot use if you are calculating average. Can use the existing reducer as combiner. Specify the combiner class inside the driver program Job.setCombinerClass (WordCountReducer.class). The reducer logic will run on the intermediate key values generated just after the Mapper task is over. After that partitioning, shuffling, sorting takes place and the keys are send to the reducer. Perform a “Local Reduce Mini- Reducers Before we distribute the mapper results Passed workload further to the Reducers Combiners

Combiner (D, 1) (B, 1) (C, 1) (D, 1) (E,
1) (B, 1) D B C D E B (B, 1) (C, 1) (D, 2) (E, 1) (A, [2]) (B, [2,1]) (C, [1,1]) (D, [2,2]) (E, [1]) (A, 2) (B, 3) (C, 2) (D, 4) (B, 1) (D, 1) (A, 1) (A, 1) (C, 1) (D, 1) B D A A C D (D, 2) (A, 2) (C, 1) (B, 1) (E, 1) Combiner Combiner Reducer Mapper Mapper Shuffle Block 1 Block 2

Partitioner—Redirecting Output from Mapper Using Partitioner: Determine which reducer/ partition
one record should go to. Given the key, value and number of partitions, return an integer. Partition: (K2, V2, #Partitions) a integer Reducer Reducer Reducer Map Map Map Partitioner Partitioner Partitioner

Input Formats File Input Format Base class for other input
formats Text Input Format Key Value Input Format N Line Input Format Sequence File Input Format Multiple Inputs This class splits the large files into input splits i.e. files greater than the HDFS block size Sequence Files Text Files Key Value Text Input Format Sequence File Input Format

Text Input Format Efficient for processing unstructured text Default input
format KEY : Offset of a line with in a file VALUE : Entire Line as the value Key Value Text Input Format Scenario: Each line could be a key value par separated by delimiter Default delimiter is tab KEY: First word till the delimiter VALUE: Entire line after the delimiter Custom Delimiter: key.value.separator.in.input.line Text and Key Value Input Format

Scenario: Loading the data from various data sources Oracle DB,
DB2, MySql Server etc Approach: Creating a small file in which data source connections are given in a separate line Accordingly input split is created if a specify the number of lines to mapper =1 Each mapper will take one data source at a time and connect to the data source and load the data from the source to HDFS Text input format and key value input format receives variable number of line Based on input split size Length of line Useful if mapper wants to receive fixed number of lines of input N Line Input Format

Sequence File Input Format Support for binary files Stores sequences
of binary key value pair Supports compression as part of the format Can store any types Sequence File As Text Input Format Converts to Text object Sequence File As Binary Input Format Encapsulates Byte Writable objects Multiple Inputs Use Case: Same data source is generating same file but in different format One is tab separated Another is sequence file Use the input format as Multiple Input Sequence File and Multiple Input Format

Learn Apache Map Reduce

Learn Apache Map Reduce

StratApps

More Decks by StratApps

Other Decks in Education

Featured

Transcript

Learn Apache Map Reduce

Topics to Discuss today Map Reduce Map Reduce use cases

What is Map Reduce Provides a Parallel Programming model. Moves

Map Reduce Use Cases Weather Forecasting Health Care Oil &

Sample Weather Data ftp://ftp.ncdc.noaa.gov/pub/data/uscrn/products/daily01/

Architecture Store Deidentified CSV file into HDFS Taking DB dump

Mapper Class : For Running Mapper Logic Reducer Class :

Traditional Way grep cat Split Data Split Data Split Data

Map Reduce Process The Overall MapReduce Word Count Process Dear

Map Reduce Example Word Count Example reads text files and

Anatomy of a MapReduce Program Value Key Map Reduce Map

Result Very Big Data Partitioning Function MapReduce Way R E

Revisit De-identification Architecture Store Deidentified CSV file into HDFS Taking

Why MapReduce ? Two biggest Advantages: Taking processing to the

Input Splits HDFS Blocks Input Splits Physical Division Logical Division

Relation between Input Splits and HDFS Blocks Logical records do

MapReduce Flow A file is given as input to the

Distributed Cache Use Case: Your map reduce application requires extra

Overview of MapReduce Combiners Partitioners Map Reduce Complete view of

Combiner -Local Reduce Local Reduce Runs immediately after Mapper task

Combiner (D, 1) (B, 1) (C, 1) (D, 1) (E,

Partitioner—Redirecting Output from Mapper Using Partitioner: Determine which reducer/ partition

Input Formats File Input Format Base class for other input

Text Input Format Efficient for processing unstructured text Default input

Scenario: Loading the data from various data sources Oracle DB,

Sequence File Input Format Support for binary files Stores sequences