Slide 1

Slide 1 text

Learn Apache Hadoop

Slide 2

Slide 2 text

Topics to Discuss Today
Hadoop Modes
Terminal Commands
Web UI URLs
Sample Cluster Configuration
Hadoop Configuration Files
DD for Each Component
NameNode Recovery
Sample Example List in Hadoop
Running the Teragen Example
Dump of an MR Job
Data Loading Techniques: Hadoop Copy Commands, Flume, Sqoop
Data Analysis Techniques: Pig, Hive
Heads-Up Session

Slide 3

Slide 3 text

Hadoop Modes
Hadoop can run in any of the following three modes:
Standalone (or Local) Mode: No daemons; everything runs in a single JVM. Suitable for running MapReduce programs during development. Has no DFS.
Pseudo-Distributed Mode: Hadoop daemons run on the local machine.
Fully Distributed Mode: Hadoop daemons run on a cluster of machines.

Slide 4

Slide 4 text

Terminal Commands

Slide 5

Slide 5 text

Terminal Commands
Listing of files present on HDFS
Listing of files present in the bin directory
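A minimal sketch of the two listings referred to above, assuming a Hadoop 1.x installation with HADOOP_HOME set; the exact paths are illustrative:

# List files present on HDFS (root and the current user's home directory)
hadoop fs -ls /
hadoop fs -ls /user/$USER

# List files present in the bin directory of the local Hadoop installation
ls -l $HADOOP_HOME/bin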

Slide 6

Slide 6 text

Web UI URLs
NameNode Status: http://localhost:50070/dfshealth.jsp
JobTracker Status: http://localhost:50030/jobtracker.jsp
TaskTracker Status: http://localhost:50060/tasktracker.jsp
Data Block Scanner Report: http://localhost:50075/blockScannerReport

Slide 7

Slide 7 text

Sample Cluster Configuration
Master node:
NameNode: http://master:50070/
JobTracker: http://master:50030/
Slave nodes (four in this example), each running:
TaskTracker
DataNode

Slide 8

Slide 8 text

Hadoop Configuration Files
Filename | Format | Description
hadoop-env.sh | Bash script | Environment variables that are used in the scripts to run Hadoop.
core-site.xml | Hadoop configuration XML | Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml | Hadoop configuration XML | Configuration settings for the HDFS daemons: the namenode, the secondary namenode and the datanodes.
mapred-site.xml | Hadoop configuration XML | Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers.
masters | Plain text | A list of machines (one per line) that each run a secondary namenode.
slaves | Plain text | A list of machines (one per line) that each run a datanode and a tasktracker.
hadoop-metrics.properties | Java properties | Properties for controlling how metrics are published in Hadoop.
log4j.properties | Java properties | Properties for system log files, the namenode audit log and the task log for the tasktracker child process.

Slide 9

Slide 9 text

Hadoop Configuration Files

Slide 10

Slide 10 text

DD for Each Component
Core: core-site.xml
HDFS: hdfs-site.xml
MapReduce: mapred-site.xml

Slide 11

Slide 11 text

HDFS vs. GFS
GFS (Google) file structure:
Files are divided into 64 MB chunks.
Chunks are replicated (default replication of 3).
Chunks are divided into 64 KB blocks, each with a 32-bit checksum.
HDFS (Hadoop) file structure:
Files are divided into 128 MB blocks.
The datanode stores each block replica as two files: one for the data, one for the checksum and generation stamp.

Slide 12

Slide 12 text

core-site.xml and hdfs-site.xml
core-site.xml: fs.default.name = hdfs://localhost:8020
hdfs-site.xml: dfs.replication = 1
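A minimal sketch of the two files for a pseudo-distributed setup; the property names and values come from the slide, the surrounding XML boilerplate is the standard Hadoop 1.x configuration format:

<?xml version="1.0"?>
<!-- core-site.xml: default filesystem URI for all HDFS requests -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- hdfs-site.xml: one replica, since there is only one datanode -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>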

Slide 13

Slide 13 text

Defining HDFS Details in hdfs-site.xml
Property | Example Value | Description | Default
dfs.data.dir | /disk1/hdfs/data, /disk2/hdfs/data | A list of directories where the datanode stores blocks. Each block is stored in only one of these directories. | ${hadoop.tmp.dir}/dfs/data
fs.checkpoint.dir | /disk1/hdfs/namesecondary, /disk2/hdfs/namesecondary | A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list. | ${hadoop.tmp.dir}/dfs/namesecondary
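A hedged sketch of how those two properties would appear inside hdfs-site.xml, using the example directory lists from the table above:

<!-- hdfs-site.xml excerpt: spread datanode blocks and checkpoints across two disks -->
<property>
  <name>dfs.data.dir</name>
  <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary</value>
</property>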

Slide 14

Slide 14 text

mapred-site.xml
mapred.job.tracker = localhost:8021
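A minimal sketch of the corresponding file, again in the standard Hadoop 1.x XML format:

<?xml version="1.0"?>
<!-- mapred-site.xml: where the JobTracker RPC server listens -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>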

Slide 15

Slide 15 text

Defining mapred-site.xml
Property | Example Value | Description
mapred.job.tracker | localhost:8021 | The hostname and port that the jobtracker RPC server runs on. If set to the default value of "local", the jobtracker runs in-process on demand when you run a MapReduce job.
mapred.local.dir | ${hadoop.tmp.dir}/mapred/local | A list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends.
mapred.system.dir | ${hadoop.tmp.dir}/mapred/system | The directory, relative to fs.default.name, where MapReduce stores shared files during a job run.
mapred.tasktracker.map.tasks.maximum / mapred.tasktracker.reduce.tasks.maximum | 2 | The number of map/reduce tasks that may be run on a tasktracker at any one time.
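A hedged sketch of the tasktracker slot properties from the last row, written out as the two separate property names they expand to:

<!-- mapred-site.xml excerpt: two concurrent map slots and two reduce slots per tasktracker -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>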

Slide 16

Slide 16 text

All Properties
http://hadoop.apache.org/docs/r1.1.2/core-default.html
http://hadoop.apache.org/docs/r1.1.2/mapred-default.html
http://hadoop.apache.org/docs/r1.1.2/hdfs-default.html

Slide 17

Slide 17 text

Slaves and Masters
Two files are used by the startup and shutdown commands:
slaves: contains a list of hosts, one per line, that are to host DataNode and TaskTracker servers.
masters: contains a list of hosts, one per line, that are to host secondary NameNode servers.
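A sketch of what the two plain-text files might contain for the four-slave cluster pictured earlier; the hostnames are illustrative:

# conf/masters  (hosts that run a secondary namenode)
master

# conf/slaves   (hosts that run a datanode and a tasktracker)
slave1
slave2
slave3
slave4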

Slide 18

Slide 18 text

Per-Process Runtime Environment (hadoop-env.sh)
Set the JAVA_HOME parameter so every daemon starts with the correct JVM.
This file also offers a way to provide custom parameters for each of the servers.
hadoop-env.sh is sourced by all of the Hadoop Core scripts provided in the conf/ directory of the installation.
Examples of environment variables that you can specify:
export HADOOP_DATANODE_HEAPSIZE="128"
export HADOOP_TASKTRACKER_HEAPSIZE="512"

Slide 19

Slide 19 text

hadoop-env.sh Sample

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-7-sun-1.7.0.45

# Extra Java runtime options. Empty by default.
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true ${HADOOP_OPTS}"
…
…
# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER

Slide 20

Slide 20 text

Reporting (hadoop-metrics.properties)
This file controls how Hadoop reports its metrics. The default is not to report.
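A hedged sketch of what an entry in this file can look like: the NullContext class is the "do not report" default in the stock Hadoop 1.x metrics configuration, and switching a section to FileContext (shown commented out) writes that metrics group to a local file. Treat the paths and period as illustrative:

# No reporting for the dfs metrics group (the default behaviour)
dfs.class=org.apache.hadoop.metrics.spi.NullContext

# Example of enabling file-based reporting instead:
# dfs.class=org.apache.hadoop.metrics.file.FileContext
# dfs.period=10
# dfs.fileName=/tmp/dfsmetrics.log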

Slide 21

Slide 21 text

Critical Properties
fs.default.name: points to the default URI for all filesystem requests in Hadoop.
hadoop.tmp.dir: used as the base for temporary directories, both locally and in HDFS.
mapred.job.tracker: the host and port where the MapReduce jobtracker runs. If set to "local", jobs are run in-process as a single map and reduce task.
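fs.default.name and mapred.job.tracker were shown in the earlier XML sketches; a hedged sketch of how hadoop.tmp.dir is typically set in core-site.xml (the path is illustrative):

<!-- core-site.xml excerpt: base directory for dfs.data.dir, fs.checkpoint.dir, mapred.local.dir, etc. -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/lib/hadoop/tmp</value>
</property>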

Slide 22

Slide 22 text

Network Requirements
Hadoop Core uses SSH to launch the server processes on the slave nodes.
It requires a password-less SSH connection between the master and all the slave and secondary machines.
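A common way to set up the password-less SSH the slide calls for, assuming the same user account exists on every node; a sketch rather than a hardened procedure:

# On the master, generate a key pair with an empty passphrase
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

# Copy the public key to each slave/secondary machine (repeat per host)
ssh-copy-id user@slave1

# Verify that no password prompt appears
ssh user@slave1 hostname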

Slide 23

Slide 23 text

NameNode Recovery
1. Shut down the secondary NameNode.
2. Copy secondary:fs.checkpoint.dir to namenode:dfs.name.dir.
3. Copy secondary:fs.checkpoint.edits to namenode:dfs.name.edits.dir.
4. When the copy completes, start the NameNode and restart the secondary NameNode.
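A hedged sketch of those steps as shell commands; the directory paths and hostnames are illustrative and must match the fs.checkpoint.* and dfs.name.* values in your own configuration:

# 1. Stop the secondary namenode (Hadoop 1.x daemon script)
hadoop-daemon.sh stop secondarynamenode

# 2-3. Copy the checkpoint image and edits onto the namenode host
scp -r /disk1/hdfs/namesecondary/* namenode:/disk1/hdfs/name/

# 4. Bring the namenode back, then restart the secondary
hadoop-daemon.sh start namenode
hadoop-daemon.sh start secondarynamenode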

Slide 24

Slide 24 text

Sample Examples List
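The slide presumably shows the programs bundled in the examples jar; running the jar with no program name prints that list. The jar name varies by release, so the one below is an assumption:

# Print the list of bundled example programs (jar name depends on the release)
hadoop jar $HADOOP_HOME/hadoop-examples-1.1.2.jar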

Slide 25

Slide 25 text

Running the Teragen Example
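A hedged sketch of a teragen invocation; the row count, output path and jar name are illustrative:

# Generate 1,000,000 rows of 100 bytes each into an HDFS output directory
hadoop jar $HADOOP_HOME/hadoop-examples-1.1.2.jar teragen 1000000 /user/$USER/teragen-out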

Slide 26

Slide 26 text

Checking the Output
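One way to inspect the result of the teragen run, assuming the illustrative output path from the previous sketch:

# List the generated part files and sample the first records
hadoop fs -ls /user/$USER/teragen-out
hadoop fs -cat /user/$USER/teragen-out/part-00000 | head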

Slide 27

Slide 27 text

Checking the Output

Slide 28

Slide 28 text

Data Loading Techniques and Data Analysis
Data Loading (into HDFS):
Using Hadoop copy commands
Using Sqoop
Using Flume
Data Analysis:
Using Pig
Using Hive
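A hedged sketch of the simplest of these, the Hadoop copy commands; the local file and HDFS paths are illustrative (Sqoop, Flume, Pig and Hive have their own CLIs and are covered separately):

# Copy a local file into HDFS
hadoop fs -put /tmp/sales.csv /user/$USER/input/

# Copy data back out of HDFS to the local filesystem
hadoop fs -get /user/$USER/output/part-00000 /tmp/result.txt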
