An introduction to Hadoop and MapReduce

This presentation explains what Hadoop is, how it works, and what the MapReduce programming model is.

Vedaj JP

March 23, 2013

Transcript

  1–5. big data ...and the little hadoop

     Notes: Ways of working with Hadoop: 1. Java MapReduce: the most flexibility and performance, but a tedious development cycle (the "assembly language" of Hadoop). 2. Streaming MapReduce (and the related Pipes interface): lets you develop in a programming language of your choice, but with slightly lower performance and less flexibility than native Java MapReduce. 3. Crunch: a Java library for multi-stage MapReduce pipelines (modeled after Google's FlumeJava). 4. Pig Latin: a high-level language from Yahoo!, suited to batch data-flow workloads. 5. Hive: a SQL interpreter from Facebook; it also includes a metastore mapping files to their schemas and associated SerDes. 6. Oozie: a workflow engine whose jobs are defined in an XML process-definition language, enabling workflows composed of any of the above.
  6–9. % whoami

     Vedaj JP, BE Information Science & Engineering, NMAMIT, Nitte. I'm also accused of being a... g33k / musician / 'occasional' blogger / "social responsibility" freak / hack3r / philosopher (when jobless). * twitter/diaspora: @vedaj * fb.com/vedaj.jp * about.me/vedaj.jp * vedaj.posterous.com
  10–11. what is big data?

     Notes: A bunch of data? An industry? An expertise? A trend? A cliché? It's a buzzword, but generally associated with the problem of data sets too big to manage with traditional SQL databases. A parallel development has been the NoSQL movement, which is good at handling semi-structured data, scaling, and so on. Hadoop, the subject of this talk, is a flexible infrastructure for large-scale computation and data processing on a network of commodity hardware: completely written in Java, open source, distributed under the Apache license, and made up of Hadoop Common, HDFS and MapReduce.
  12–14. "In information technology, big data is a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools." (en.wikipedia.org/wiki/Big_data)
  15–17. [Chart: data generated per day by the top 10 data generators of the world, 2007: LOC, CIA, Amazon, YouTube, ChoicePoint, Sprint, Google, AT&T, NERSC, Climate; y-axis 0 to 6,000 terabytes]

     Notes: 2008: Google processes 20 PB a day. 2009: Facebook has 2.5 PB of user data, plus 15 TB/day. 2009: eBay has 6.5 PB of user data, plus 50 TB/day. 2011: Yahoo! has 180-200 PB of data. 2012: Facebook ingests 500 TB/day.
  18–22. • 1000 Genomes Project (200 TB and growing...) • Twitter (over ~7 TB/day) • Facebook (over ~10 TB/day) • Google (over ~20 PB/day)

     Notes: Bioinformatics data: from about 3.3 billion base pairs in a human genome to a huge number of protein sequences and the analysis of their behaviour. The internet: web logs, Facebook, Twitter, maps, blogs, etc. Financial applications: analysing volumes of data for trends and other deeper knowledge. Health care: huge amounts of patient, drug and treatment data. The universe: the Hubble Ultra Deep Field image shows hundreds of galaxies, each with billions of stars.
  23. how do we scale data? divide & conquer

     [Diagram: the "Work" is partitioned into w1, w2, w3 and handed to three workers; their results r1, r2, r3 are combined into the "Result". Partition, then combine.]
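     To make the partition/combine idea concrete, here is a minimal single-machine sketch in plain Java (not from the deck): the work is split into chunks, each chunk is summed by a worker thread, and the partial results are combined. Hadoop applies the same pattern across many machines; the class name and the toy "work" are made up for illustration.

        import java.util.ArrayList;
        import java.util.List;
        import java.util.concurrent.*;

        public class DivideAndConquer {
            public static void main(String[] args) throws Exception {
                int[] work = new int[1_000_000];
                for (int i = 0; i < work.length; i++) work[i] = i;

                int workers = 4;
                int chunk = work.length / workers;
                ExecutorService pool = Executors.newFixedThreadPool(workers);
                List<Future<Long>> partials = new ArrayList<>();

                // Partition: give each worker its own slice of the input.
                for (int w = 0; w < workers; w++) {
                    final int start = w * chunk;
                    final int end = (w == workers - 1) ? work.length : start + chunk;
                    partials.add(pool.submit(() -> {
                        long sum = 0;
                        for (int i = start; i < end; i++) sum += work[i];
                        return sum;                      // partial result r_i
                    }));
                }

                // Combine: merge the partial results into the final answer.
                long result = 0;
                for (Future<Long> p : partials) result += p.get();
                pool.shutdown();
                System.out.println("sum = " + result);
            }
        }
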
  24–25. parallel processing is complicated

     Notes: How do we assign tasks to workers? What if we have more tasks than slots? What happens when tasks fail? How do you handle distributed synchronization?
  26–31. data storage is not trivial: massive data volumes; reliably storing PBs of data is challenging; disk/hardware/network failures; as the number of machines grows, the probability of failure also grows

     Notes: For example, with 1,000 hosts, each with 10 disks, and a disk lifetime of 3 years, how many failures per day should we expect? 10,000 disks divided by (3 x 365) days is about 9, so roughly 9 disks will fail every day.
  32. hadoop (photo: cluster of machines running Hadoop at Yahoo!; credit: Yahoo!)

     Notes: Hadoop was created by Doug Cutting and Michael J. Cafarella. Doug, who was working at Yahoo! at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project. Hadoop includes Hadoop Common, which provides access to the filesystems supported by Hadoop; the Hadoop Common package contains the JAR files and scripts needed to start Hadoop, as well as source code, documentation and a contribution section with projects from the Hadoop community. For effective scheduling of work, every Hadoop-compatible filesystem should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node is. Hadoop applications can use this information to run work on the node where the data is and, failing that, on the same rack/switch, reducing backbone traffic. HDFS uses this when replicating data, trying to keep different copies of the data on different racks. The goal is to reduce the impact of a rack power outage or switch failure, so that even if those events occur, the data may still be readable.
  33–35. what does it provide?

     Notes: Redundant, fault-tolerant data storage; a parallel computation framework; job coordination. Rather than banging away at one huge block of data with a single machine, Hadoop breaks big data up into multiple parts so that each part can be processed and analysed in parallel.
  36–37. Hadoop is an open-source implementation based on GFS and MapReduce from Google: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (2003), "The Google File System"; Jeffrey Dean and Sanjay Ghemawat (2004), "MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004.
  38–40. hadoop stack: MapReduce (distributed programming framework), Pig (data flow), Hive (SQL), Cascading (Java), HBase (columnar database), all on top of HDFS (the Hadoop Distributed File System)

     Notes: HBase is an open-source, non-relational, distributed database modelled after Google's BigTable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS, providing BigTable-like capabilities for Hadoop; that is, it provides a fault-tolerant way of storing large quantities of sparse data. HBase features compression, in-memory operation, and Bloom filters on a per-column basis, as outlined in the original BigTable paper. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API as well as through REST, Avro or Thrift gateway APIs. HBase is not a direct replacement for a classic SQL database, although its performance has recently improved and it now serves several data-driven websites, including Facebook's Messaging Platform.
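     As a concrete illustration of the HBase layer in the stack, here is a minimal sketch using HBase's Java client API from roughly this era (HTable / Put / Get). The table name "users", the column family "info" and the row contents are made up for illustration, not taken from the deck.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.util.Bytes;

        public class HBaseHello {
            public static void main(String[] args) throws Exception {
                Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
                HTable table = new HTable(conf, "users");            // hypothetical table

                // Write one cell: row "row1", column family "info", qualifier "name".
                Put put = new Put(Bytes.toBytes("row1"));
                put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("vedaj"));
                table.put(put);

                // Read it back.
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(value));

                table.close();
            }
        }
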
  41–48. HDFS, the Hadoop Distributed File System: • a distributed file system • redundant storage • designed to reliably store data using commodity hardware • designed to expect hardware failures • intended for large files • designed for batch inserts

     Notes: HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. A Hadoop instance typically has a single NameNode, and a cluster of DataNodes forms the HDFS cluster (not every node has to run a DataNode). Each DataNode serves up blocks of data over the network using a block protocol specific to HDFS. The filesystem uses TCP/IP for communication, and clients use RPC to talk to the nodes. HDFS stores large files (ideally sized as a multiple of 64 MB) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on the hosts. With the default replication value of 3, data is stored on three nodes: two on the same rack and one on a different rack. DataNodes can talk to each other to rebalance data, move copies around, and keep replication high. HDFS is not fully POSIX-compliant, because the requirements for a POSIX filesystem differ from the goals of a Hadoop application; the trade-off of not being fully POSIX-compliant is increased data-throughput performance. HDFS was designed to handle very large files.
  49–50. files & blocks

     Notes: Files are stored as a collection of blocks. Blocks are 64 MB chunks of a file (configurable). Blocks are replicated on 3 nodes (configurable). The NameNode (NN) manages metadata about files and blocks. The SecondaryNameNode (SNN) holds a backup of the NN data. DataNodes (DN) store and serve blocks.
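     The block size and replication factor mentioned above can be inspected per file through Hadoop's Java FileSystem API. A minimal sketch, assuming a cluster reachable through the default configuration and a hypothetical file path (the cluster-wide defaults were set by dfs.block.size and dfs.replication in the Hadoop 1.x era of this talk):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.BlockLocation;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class ListBlocks {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();              // reads core-site.xml / hdfs-site.xml
                FileSystem fs = FileSystem.get(conf);
                Path file = new Path("/user/vedaj/input/big.log");     // hypothetical file

                FileStatus status = fs.getFileStatus(file);
                System.out.println("block size  : " + status.getBlockSize());
                System.out.println("replication : " + status.getReplication());

                // Which DataNodes hold each block of the file?
                BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
                for (BlockLocation b : blocks) {
                    System.out.println("offset " + b.getOffset() + " -> "
                            + java.util.Arrays.toString(b.getHosts()));
                }
                fs.close();
            }
        }
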
  51–53. replication: multiple copies of a block are stored. Replication strategy: copy #1 on another node on the same rack, copy #2 on another node on a different rack.

     Notes: Designed for system-to-system interaction, not for user-to-system interaction. The chunk servers replicate the data automatically.
  54. HDFS: writes

     [Diagram: a client writes a file block by block; the NameNode (master) directs each block to DataNodes (slave nodes) across rack #1 and rack #2. Write path for a single block shown; the client writes multiple blocks in parallel.]
  55. HDFS: reads

     [Diagram: the client asks the NameNode (master) for a file's block locations, reads block 1, block 2, ..., block N from the DataNodes (slave nodes) in parallel, and re-assembles them into the file.]
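     A minimal sketch of the client side of these two paths using the FileSystem API: writing a small file to HDFS and reading it back. The path is hypothetical; block splitting, replication and re-assembly happen behind the create/open calls.

        import java.io.BufferedReader;
        import java.io.InputStreamReader;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsReadWrite {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);
                Path path = new Path("/user/vedaj/hello.txt");   // hypothetical path

                // Write: the client streams data out; HDFS splits it into blocks and replicates them.
                FSDataOutputStream out = fs.create(path, true);  // true = overwrite if it exists
                out.writeBytes("hello hdfs\n");
                out.close();

                // Read: the client fetches blocks from the DataNodes and re-assembles the file.
                FSDataInputStream in = fs.open(path);
                BufferedReader reader = new BufferedReader(new InputStreamReader(in));
                System.out.println(reader.readLine());
                reader.close();
                fs.close();
            }
        }
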
  56–57. what about datanode failures?

     Notes: DNs check in with the NN to report their health. Upon a failure, the NN orders DNs to re-replicate the under-replicated blocks.
  58–59. mapreduce. "Map" step: the master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node. "Reduce" step: the master node then collects the answers to all the sub-problems and combines them in some way to form the output, the answer to the problem it was originally trying to solve.

     Notes: MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. MapReduce is typically used to do distributed computing on clusters of computers. The model is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms. MapReduce libraries have been written in many programming languages; a popular free implementation is Apache Hadoop.
  60. mapreduce is... • a programming model for expressing distributed computations at a massive scale • an execution framework for organizing and performing such computations
  61. typical large data problem, MapReduce-style (Dean and Ghemawat, OSDI 2004): • iterate over a large number of records • extract something of interest from each • shuffle and sort intermediate results • aggregate intermediate results • generate final output
  62. mapreduce paradigm • Implement two functions: Map(k1, v1) -> list(k2, v2) and Reduce(k2, list(v2)) -> list(v3) • The framework handles everything else* • Values with the same key go to the same reducer
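     A minimal word-count sketch of the two functions using Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce): here k1 is a byte offset, v1 a line of text, k2 a word, and v2/v3 counts. Class and field names are illustrative, not from the deck.

        import java.io.IOException;

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        public class WordCount {

            // Map(k1, v1) -> list(k2, v2): (byte offset, line) -> (word, 1)
            public static class TokenizerMapper
                    extends Mapper<LongWritable, Text, Text, IntWritable> {
                private static final IntWritable ONE = new IntWritable(1);
                private final Text word = new Text();

                @Override
                protected void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                    for (String token : value.toString().split("\\s+")) {
                        if (token.isEmpty()) continue;
                        word.set(token);
                        context.write(word, ONE);   // emit (word, 1)
                    }
                }
            }

            // Reduce(k2, list(v2)) -> list(v3): (word, [1, 1, ...]) -> (word, count)
            public static class SumReducer
                    extends Reducer<Text, IntWritable, Text, IntWritable> {
                @Override
                protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                        throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable v : values) sum += v.get();
                    context.write(key, new IntWritable(sum));
                }
            }
        }
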
  63. mapreduce flow

     [Diagram: mappers consume (k1,v1)...(k6,v6) and emit intermediate pairs such as (a,1), (b,2), (c,3), (c,6), (a,5), (c,2), (b,7), (c,8); the shuffle-and-sort phase aggregates values by key, giving a:[1,5], b:[2,7], c:[2,3,6,8]; reducers then produce (r1,s1), (r2,s2), (r3,s3).]
  64. mapreduce paradigm, part 2. There's more! • Partitioners decide which key goes to which reducer: partition(k', numPartitions) -> partNumber. They divide the key space into chunks, one per reducer; the default is hash-based. • Combiners can combine mapper output before it is sent to the reducers, and have the same shape as a reducer: Reduce(k2, list(v2)) -> list(v3).
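     A minimal sketch of a custom partitioner with the signature above, written against Hadoop's Java API; it simply reproduces the default hash-based behaviour. A combiner is registered like a reducer, via job.setCombinerClass(...), as shown in the driver sketch that follows slides 68–72 below.

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Partitioner;

        // partition(k', numPartitions) -> partNumber, hash-based like the default HashPartitioner.
        public class WordPartitioner extends Partitioner<Text, IntWritable> {
            @Override
            public int getPartition(Text key, IntWritable value, int numPartitions) {
                // Mask off the sign bit so the partition number is never negative.
                return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
            }
        }
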
  65–66. mapreduce flow

     Notes: Reduce starts only after all mappers complete. Mapper output gets written to disk, though intermediate data can be copied to reducers sooner. A reducer gets its keys in sorted order, but keys are not sorted across reducers, so a global sort requires a single reducer or smart partitioning.
  67. mapreduce flow (with combiners and partitioners)

     [Diagram: the same flow as before, but each mapper's output first passes through a combiner, so for example (c,3) and (c,6) become (c,9), and then a partitioner, before the shuffle-and-sort phase delivers the aggregated values by key to the reducers.]
  68–72. • Job: a user-submitted map and reduce implementation to apply to a data set • Task: a single mapper or reducer task • Failed tasks get retried automatically • Tasks run local to their data, ideally • JobTracker (JT) manages job submission and task delegation • TaskTrackers (TT) ask for work and execute tasks (a driver sketch follows below)
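     A minimal driver sketch showing how a job is submitted: it wires together the mapper, combiner, partitioner and reducer sketched earlier and hands the job to the framework (the JobTracker in this architecture) with waitForCompletion. Paths come from the command line; the class names are the illustrative ones introduced above.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCountDriver {
            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "word count");
                job.setJarByClass(WordCountDriver.class);

                job.setMapperClass(WordCount.TokenizerMapper.class);
                job.setCombinerClass(WordCount.SumReducer.class);   // combiner: a map-side mini-reduce
                job.setPartitionerClass(WordPartitioner.class);     // which key goes to which reducer
                job.setReducerClass(WordCount.SumReducer.class);
                job.setNumReduceTasks(3);

                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);

                FileInputFormat.addInputPath(job, new Path(args[0]));    // input in HDFS
                FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir must not exist yet

                // Submit the job and wait; the framework retries failed tasks automatically.
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }
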
  73. mapreduce architecture

     [Diagram: a client submits a Job to the JobTracker on the master; TaskTrackers on the slave nodes ask the JobTracker for work and run the individual Tasks.]
  74–75. when tasks fail...

     Notes: Tasks will fail. The JT retries a failed task up to N attempts; after N failed attempts, the job fails. Some tasks are slower than others. Speculative execution means the JT starts multiple copies of the same task: the first one to complete wins, and the others are killed.
  76–77. mapreduce data locality

     Notes: Move computation to the data. Moving data between nodes has a cost, so MapReduce tries to schedule tasks on the nodes that hold the data; when that is not possible, the TT has to fetch the data from a DN.
  78. mapreduce is good for... • embarrassingly parallel algorithms • summing, grouping, filtering, joining • off-line batch jobs on massive data sets • analyzing an entire large dataset
  79. mapreduce is ok for... • iterative jobs (e.g., graph algorithms), where each iteration must read/write data to disk, so the IO and latency cost of an iteration is high
  80. mapreduce is not good for... • jobs that need shared state or coordination (tasks are shared-nothing; shared state requires a scalable state store) • low-latency jobs • jobs on small datasets • finding individual records
  81. hadoop combined architecture

     [Diagram: the master runs the JobTracker and NameNode, with a SecondaryNameNode as backup; each slave node runs both a TaskTracker and a DataNode.]
  82. running hadoop: multiple options • on your local machine (standalone or pseudo-distributed) • locally with a virtual machine • on the cloud (e.g., Amazon EC2) • in your own datacenter
  83. cloudera VM • a virtual machine with Hadoop and related technologies pre-loaded • a great tool for learning Hadoop • eases the pain of downloading/installing • pre-loaded with sample data and jobs • documented tutorials • VM: https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+for+CDH4 • Tutorial: https://ccp.cloudera.com/display/SUPPORT/Hadoop+Tutorial
  84. "Let's take off to a new era of innovation"