An introduction to Hadoop and MapReduce

This presentation explains what Hadoop is, how it works, and what the MapReduce programming model is.

Vedaj JP

March 23, 2013

Transcript

  1–5. big data ...and the little hadoop

     Notes: Ways of working with Hadoop: 1. Java MapReduce: the most flexibility and performance, but a tedious development cycle (the "assembly language" of Hadoop). 2. Streaming MapReduce (and the related Pipes interface): lets you develop in a programming language of your choice, but with slightly lower performance and less flexibility than native Java MapReduce. 3. Crunch: a Java library for multi-stage MapReduce pipelines (modeled after Google's FlumeJava). 4. Pig Latin: a high-level language from Yahoo!, suited to batch data-flow workloads. 5. Hive: a SQL interpreter from Facebook; it also includes a metastore mapping files to their schemas and associated SerDes. 6. Oozie: a workflow engine whose jobs are defined in an XML process-definition language, enabling workflows composed of any of the above.
  6–9. % whoami

     Vedaj JP, BE Information Science & Engineering, NMAMIT, Nitte. I'm also accused of being a... g33k / musician / 'occasional' blogger / "social responsibility" freak / hack3r / philosopher (when jobless). * twitter/diaspora: @vedaj * fb.com/vedaj.jp * about.me/vedaj.jp * vedaj.posterous.com
  10–11. what is big data?

     Notes: A bunch of data? An industry? An expertise? A trend? A cliché? It's a buzzword, but generally associated with the problem of data sets too big to manage with traditional SQL databases. A parallel development has been the NoSQL movement, which is good at handling semi-structured data, scaling, and so on. Hadoop, the subject of this talk, is a flexible infrastructure for large-scale computation and data processing on a network of commodity hardware: completely written in Java, open source, distributed under the Apache license, and made up of Hadoop Common, HDFS and MapReduce.
  12–14. "In information technology, big data is a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools." (en.wikipedia.org/wiki/Big_data)
  15–17. [Chart: data generated per day by the top 10 data generators of the world, 2007: LOC, CIA, Amazon, YouTube, ChoicePoint, Sprint, Google, AT&T, NERSC, Climate; y-axis 0 to 6,000 terabytes]

     Notes: 2008: Google processes 20 PB a day. 2009: Facebook has 2.5 PB of user data, plus 15 TB/day. 2009: eBay has 6.5 PB of user data, plus 50 TB/day. 2011: Yahoo! has 180-200 PB of data. 2012: Facebook ingests 500 TB/day.
  18–22. • 1000 Genomes Project (200 TB and growing...) • Twitter (over ~7 TB/day) • Facebook (over ~10 TB/day) • Google (over ~20 PB/day)

     Notes: Bioinformatics data: from about 3.3 billion base pairs in a human genome to a huge number of protein sequences and the analysis of their behaviour. The internet: web logs, Facebook, Twitter, maps, blogs, etc. Financial applications: analysing volumes of data for trends and other deeper knowledge. Health care: huge amounts of patient, drug and treatment data. The universe: the Hubble Ultra Deep Field image shows hundreds of galaxies, each with billions of stars.
  23. how do we scale data? divide & conquer

     [Diagram: the "Work" is partitioned into w1, w2, w3 and handed to three workers; their results r1, r2, r3 are combined into the "Result". Partition, then combine.]
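     To make the partition/combine idea concrete, here is a minimal single-machine sketch in plain Java (not from the deck): the work is split into chunks, each chunk is summed by a worker thread, and the partial results are combined. Hadoop applies the same pattern across many machines; the class name and the toy "work" are made up for illustration.

        import java.util.ArrayList;
        import java.util.List;
        import java.util.concurrent.*;

        public class DivideAndConquer {
            public static void main(String[] args) throws Exception {
                int[] work = new int[1_000_000];
                for (int i = 0; i < work.length; i++) work[i] = i;

                int workers = 4;
                int chunk = work.length / workers;
                ExecutorService pool = Executors.newFixedThreadPool(workers);
                List<Future<Long>> partials = new ArrayList<>();

                // Partition: give each worker its own slice of the input.
                for (int w = 0; w < workers; w++) {
                    final int start = w * chunk;
                    final int end = (w == workers - 1) ? work.length : start + chunk;
                    partials.add(pool.submit(() -> {
                        long sum = 0;
                        for (int i = start; i < end; i++) sum += work[i];
                        return sum;                      // partial result r_i
                    }));
                }

                // Combine: merge the partial results into the final answer.
                long result = 0;
                for (Future<Long> p : partials) result += p.get();
                pool.shutdown();
                System.out.println("sum = " + result);
            }
        }
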
  24–25. parallel processing is complicated

     Notes: How do we assign tasks to workers? What if we have more tasks than slots? What happens when tasks fail? How do you handle distributed synchronization?
  26–31. data storage is not trivial: massive data volumes; reliably storing PBs of data is challenging; disk/hardware/network failures; as the number of machines grows, the probability of failure also grows

     Notes: For example, with 1,000 hosts, each with 10 disks, and a disk lifetime of 3 years, how many failures per day should we expect? 10,000 disks divided by (3 x 365) days is about 9, so roughly 9 disks will fail every day.
  32. hadoop (photo: cluster of machines running Hadoop at Yahoo!; credit: Yahoo!)

     Notes: Hadoop was created by Doug Cutting and Michael J. Cafarella. Doug, who was working at Yahoo! at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project. Hadoop includes Hadoop Common, which provides access to the filesystems supported by Hadoop; the Hadoop Common package contains the JAR files and scripts needed to start Hadoop, as well as source code, documentation and a contribution section with projects from the Hadoop community. For effective scheduling of work, every Hadoop-compatible filesystem should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node is. Hadoop applications can use this information to run work on the node where the data is and, failing that, on the same rack/switch, reducing backbone traffic. HDFS uses this when replicating data, trying to keep different copies of the data on different racks. The goal is to reduce the impact of a rack power outage or switch failure, so that even if those events occur, the data may still be readable.
  33–35. what does it provide?

     Notes: Redundant, fault-tolerant data storage; a parallel computation framework; job coordination. Rather than banging away at one huge block of data with a single machine, Hadoop breaks big data up into multiple parts so that each part can be processed and analysed in parallel.
  36–37. Hadoop is an open-source implementation based on GFS and MapReduce from Google: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (2003), "The Google File System"; Jeffrey Dean and Sanjay Ghemawat (2004), "MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004.
  38–40. hadoop stack: MapReduce (distributed programming framework), Pig (data flow), Hive (SQL), Cascading (Java), HBase (columnar database), all on top of HDFS (the Hadoop Distributed File System)

     Notes: HBase is an open-source, non-relational, distributed database modelled after Google's BigTable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS, providing BigTable-like capabilities for Hadoop; that is, it provides a fault-tolerant way of storing large quantities of sparse data. HBase features compression, in-memory operation, and Bloom filters on a per-column basis, as outlined in the original BigTable paper. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API as well as through REST, Avro or Thrift gateway APIs. HBase is not a direct replacement for a classic SQL database, although its performance has recently improved and it now serves several data-driven websites, including Facebook's Messaging Platform.
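     As a concrete illustration of the HBase layer in the stack, here is a minimal sketch using HBase's Java client API from roughly this era (HTable / Put / Get). The table name "users", the column family "info" and the row contents are made up for illustration, not taken from the deck.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.util.Bytes;

        public class HBaseHello {
            public static void main(String[] args) throws Exception {
                Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
                HTable table = new HTable(conf, "users");            // hypothetical table

                // Write one cell: row "row1", column family "info", qualifier "name".
                Put put = new Put(Bytes.toBytes("row1"));
                put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("vedaj"));
                table.put(put);

                // Read it back.
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(value));

                table.close();
            }
        }
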
  41–48. HDFS, the Hadoop Distributed File System: • a distributed file system • redundant storage • designed to reliably store data using commodity hardware • designed to expect hardware failures • intended for large files • designed for batch inserts

     Notes: HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. A Hadoop instance typically has a single NameNode, and a cluster of DataNodes forms the HDFS cluster (not every node has to run a DataNode). Each DataNode serves up blocks of data over the network using a block protocol specific to HDFS. The filesystem uses TCP/IP for communication, and clients use RPC to talk to the nodes. HDFS stores large files (ideally sized as a multiple of 64 MB) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on the hosts. With the default replication value of 3, data is stored on three nodes: two on the same rack and one on a different rack. DataNodes can talk to each other to rebalance data, move copies around, and keep replication high. HDFS is not fully POSIX-compliant, because the requirements for a POSIX filesystem differ from the goals of a Hadoop application; the trade-off of not being fully POSIX-compliant is increased data-throughput performance. HDFS was designed to handle very large files.
  49–50. files & blocks

     Notes: Files are stored as a collection of blocks. Blocks are 64 MB chunks of a file (configurable). Blocks are replicated on 3 nodes (configurable). The NameNode (NN) manages metadata about files and blocks. The SecondaryNameNode (SNN) holds a backup of the NN data. DataNodes (DN) store and serve blocks.
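     The block size and replication factor mentioned above can be inspected per file through Hadoop's Java FileSystem API. A minimal sketch, assuming a cluster reachable through the default configuration and a hypothetical file path (the cluster-wide defaults were set by dfs.block.size and dfs.replication in the Hadoop 1.x era of this talk):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.BlockLocation;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class ListBlocks {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();              // reads core-site.xml / hdfs-site.xml
                FileSystem fs = FileSystem.get(conf);
                Path file = new Path("/user/vedaj/input/big.log");     // hypothetical file

                FileStatus status = fs.getFileStatus(file);
                System.out.println("block size  : " + status.getBlockSize());
                System.out.println("replication : " + status.getReplication());

                // Which DataNodes hold each block of the file?
                BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
                for (BlockLocation b : blocks) {
                    System.out.println("offset " + b.getOffset() + " -> "
                            + java.util.Arrays.toString(b.getHosts()));
                }
                fs.close();
            }
        }
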
  51–53. replication: multiple copies of a block are stored. Replication strategy: copy #1 on another node on the same rack, copy #2 on another node on a different rack.

     Notes: Designed for system-to-system interaction, not for user-to-system interaction. The chunk servers replicate the data automatically.
  54. HDFS: writes

     [Diagram: a client writes a file block by block; the NameNode (master) directs each block to DataNodes (slave nodes) across rack #1 and rack #2. Write path for a single block shown; the client writes multiple blocks in parallel.]
  55. HDFS: reads

     [Diagram: the client asks the NameNode (master) for a file's block locations, reads block 1, block 2, ..., block N from the DataNodes (slave nodes) in parallel, and re-assembles them into the file.]
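     A minimal sketch of the client side of these two paths using the FileSystem API: writing a small file to HDFS and reading it back. The path is hypothetical; block splitting, replication and re-assembly happen behind the create/open calls.

        import java.io.BufferedReader;
        import java.io.InputStreamReader;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsReadWrite {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);
                Path path = new Path("/user/vedaj/hello.txt");   // hypothetical path

                // Write: the client streams data out; HDFS splits it into blocks and replicates them.
                FSDataOutputStream out = fs.create(path, true);  // true = overwrite if it exists
                out.writeBytes("hello hdfs\n");
                out.close();

                // Read: the client fetches blocks from the DataNodes and re-assembles the file.
                FSDataInputStream in = fs.open(path);
                BufferedReader reader = new BufferedReader(new InputStreamReader(in));
                System.out.println(reader.readLine());
                reader.close();
                fs.close();
            }
        }
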
  56–57. what about datanode failures?

     Notes: DNs check in with the NN to report their health. Upon a failure, the NN orders DNs to re-replicate the under-replicated blocks.
  58–59. mapreduce. "Map" step: the master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node. "Reduce" step: the master node then collects the answers to all the sub-problems and combines them in some way to form the output, the answer to the problem it was originally trying to solve.

     Notes: MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. MapReduce is typically used to do distributed computing on clusters of computers. The model is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms. MapReduce libraries have been written in many programming languages; a popular free implementation is Apache Hadoop.
  60. mapreduce is... • a programming model for expressing distributed computations at a massive scale • an execution framework for organizing and performing such computations
  61. typical large data problem, MapReduce-style (Dean and Ghemawat, OSDI 2004): • iterate over a large number of records • extract something of interest from each • shuffle and sort intermediate results • aggregate intermediate results • generate final output
  62. mapreduce paradigm • Implement two functions: Map(k1, v1) -> list(k2, v2) and Reduce(k2, list(v2)) -> list(v3) • The framework handles everything else* • Values with the same key go to the same reducer
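     A minimal word-count sketch of the two functions using Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce): here k1 is a byte offset, v1 a line of text, k2 a word, and v2/v3 counts. Class and field names are illustrative, not from the deck.

        import java.io.IOException;

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        public class WordCount {

            // Map(k1, v1) -> list(k2, v2): (byte offset, line) -> (word, 1)
            public static class TokenizerMapper
                    extends Mapper<LongWritable, Text, Text, IntWritable> {
                private static final IntWritable ONE = new IntWritable(1);
                private final Text word = new Text();

                @Override
                protected void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                    for (String token : value.toString().split("\\s+")) {
                        if (token.isEmpty()) continue;
                        word.set(token);
                        context.write(word, ONE);   // emit (word, 1)
                    }
                }
            }

            // Reduce(k2, list(v2)) -> list(v3): (word, [1, 1, ...]) -> (word, count)
            public static class SumReducer
                    extends Reducer<Text, IntWritable, Text, IntWritable> {
                @Override
                protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                        throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable v : values) sum += v.get();
                    context.write(key, new IntWritable(sum));
                }
            }
        }
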
  63. mapreduce flow

     [Diagram: mappers consume (k1,v1)...(k6,v6) and emit intermediate pairs such as (a,1), (b,2), (c,3), (c,6), (a,5), (c,2), (b,7), (c,8); the shuffle-and-sort phase aggregates values by key, giving a:[1,5], b:[2,7], c:[2,3,6,8]; reducers then produce (r1,s1), (r2,s2), (r3,s3).]
  64. mapreduce paradigm, part 2. There's more! • Partitioners decide which key goes to which reducer: partition(k', numPartitions) -> partNumber. They divide the key space into chunks, one per reducer; the default is hash-based. • Combiners can combine mapper output before it is sent to the reducers, and have the same shape as a reducer: Reduce(k2, list(v2)) -> list(v3).
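     A minimal sketch of a custom partitioner with the signature above, written against Hadoop's Java API; it simply reproduces the default hash-based behaviour. A combiner is registered like a reducer, via job.setCombinerClass(...), as shown in the driver sketch that follows slides 68–72 below.

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Partitioner;

        // partition(k', numPartitions) -> partNumber, hash-based like the default HashPartitioner.
        public class WordPartitioner extends Partitioner<Text, IntWritable> {
            @Override
            public int getPartition(Text key, IntWritable value, int numPartitions) {
                // Mask off the sign bit so the partition number is never negative.
                return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
            }
        }
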
  65–66. mapreduce flow

     Notes: Reduce starts only after all mappers complete. Mapper output gets written to disk, though intermediate data can be copied to reducers sooner. A reducer gets its keys in sorted order, but keys are not sorted across reducers, so a global sort requires a single reducer or smart partitioning.
  67. mapreduce flow (with combiners and partitioners)

     [Diagram: the same flow as before, but each mapper's output first passes through a combiner, so for example (c,3) and (c,6) become (c,9), and then a partitioner, before the shuffle-and-sort phase delivers the aggregated values by key to the reducers.]
  68–72. • Job: a user-submitted map and reduce implementation to apply to a data set • Task: a single mapper or reducer task • Failed tasks get retried automatically • Tasks run local to their data, ideally • JobTracker (JT) manages job submission and task delegation • TaskTrackers (TT) ask for work and execute tasks (a driver sketch follows below)
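     A minimal driver sketch showing how a job is submitted: it wires together the mapper, combiner, partitioner and reducer sketched earlier and hands the job to the framework (the JobTracker in this architecture) with waitForCompletion. Paths come from the command line; the class names are the illustrative ones introduced above.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCountDriver {
            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "word count");
                job.setJarByClass(WordCountDriver.class);

                job.setMapperClass(WordCount.TokenizerMapper.class);
                job.setCombinerClass(WordCount.SumReducer.class);   // combiner: a map-side mini-reduce
                job.setPartitionerClass(WordPartitioner.class);     // which key goes to which reducer
                job.setReducerClass(WordCount.SumReducer.class);
                job.setNumReduceTasks(3);

                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);

                FileInputFormat.addInputPath(job, new Path(args[0]));    // input in HDFS
                FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir must not exist yet

                // Submit the job and wait; the framework retries failed tasks automatically.
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }
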
  73. mapreduce architecture

     [Diagram: a client submits a Job to the JobTracker on the master; TaskTrackers on the slave nodes ask the JobTracker for work and run the individual Tasks.]
  74–75. when tasks fail...

     Notes: Tasks will fail. The JT retries a failed task up to N attempts; after N failed attempts, the job fails. Some tasks are slower than others. Speculative execution means the JT starts multiple copies of the same task: the first one to complete wins, and the others are killed.
  76–77. mapreduce data locality

     Notes: Move computation to the data. Moving data between nodes has a cost, so MapReduce tries to schedule tasks on the nodes that hold the data; when that is not possible, the TT has to fetch the data from a DN.
  78. mapreduce is good for... • embarrassingly parallel algorithms • summing, grouping, filtering, joining • off-line batch jobs on massive data sets • analyzing an entire large dataset
  79. mapreduce is ok for... • iterative jobs (e.g., graph algorithms), where each iteration must read/write data to disk, so the IO and latency cost of an iteration is high
  80. mapreduce is not good for... • jobs that need shared state or coordination (tasks are shared-nothing; shared state requires a scalable state store) • low-latency jobs • jobs on small datasets • finding individual records
  81. hadoop combined architecture

     [Diagram: the master runs the JobTracker and NameNode, with a SecondaryNameNode as backup; each slave node runs both a TaskTracker and a DataNode.]
  82. running hadoop: multiple options • on your local machine (standalone or pseudo-distributed) • locally with a virtual machine • on the cloud (e.g., Amazon EC2) • in your own datacenter
  83. cloudera VM • a virtual machine with Hadoop and related technologies pre-loaded • a great tool for learning Hadoop • eases the pain of downloading/installing • pre-loaded with sample data and jobs • documented tutorials • VM: https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+for+CDH4 • Tutorial: https://ccp.cloudera.com/display/SUPPORT/Hadoop+Tutorial
  84. "Let's take off to a new era of innovation"