Learn Apache Hadoop

57b84427e92074c1f6f33a358994d91b?s=47 StratApps
February 14, 2014

Learn Apache Hadoop

Hadoop Modes
Terminal Commands
Web UI Url’s
Sample Cluster Configuration
Hadoop Configuration Files
DD for each component
Name Node Recovery
Sample example list in Hadoop
Running Teragen Example
Dump of MR job

57b84427e92074c1f6f33a358994d91b?s=128

StratApps

February 14, 2014
Tweet

Transcript

  1. 2.

    Topics to Discuss today Hadoop Modes Terminal Commands Web UI

    Url’s Sample Cluster Configuration Hadoop Configuration Files DD for each component Name Node Recovery Sample example list in Hadoop Running Teragen Example Dump of MR job Data Loading Techniques Using Hadoop Copy Commands FLUME SQOOP Data Analysis Techniques PIG HIVE Heads Up Session
  2. 3.

    Hadoop Modes Hadoop can run in any of the following

    three modes: Standalone Mode Fully Distributed Mode Pseudo Distributed Mode Standalone (or Local) Mode No daemons, everything runs in a single JVM. Suitable for running MapReduce programs during development. Has no DFS. Pseudo-Distributed Mode Hadoop daemons run on the local machine. Fully Distributed Mode Hadoop daemons run on a cluster of machines.
  3. 6.

    Web UI URLs Name Node Status : Job Tracker Status

    : Task Tracker Status : Data Block Scanner Report : http://localhost:50070/dfshealth.jsp http://localhost:50030/jobtracker.jsp http://localhost:50060/tasktracker.jsp http://localhost:50075/blockScannerReport
  4. 7.

    Sample Cluster Configuration Master NameNode: http://master:50070/ JobTracker: http://master:50030/ Slave Node

    Task Tracker Data Node Slave Node Task Tracker Data Node Slave Node Task Tracker Data Node Slave Node Task Tracker Data Node
  5. 8.

    Configuration Filenames Format Description hadoop-env.sh Base Script Environment variables that

    are used in the scripts to run Hadoop. core-site.xml Hadoop Configuration XML Configuration settings for Hadoop Core such as I/O settings that are common to HDFS and MapReduce. hdfs-site.xml Hadoop Configuration XML Configuration settings for HDFS daemons, the namenode, the secondary namenode and the data nodes. mapred-site.xml Hadoop Configuration XML Configuration settings for MapReduce daemons : the job-tracker and the task-trackers. masters Plain Text A list of machines (one per line) that each run a secondary namenode. slave Plain Text A list of machines (one per line) that each run a datanode and a task-tracker. hadoop- metric.properties Java Properties Properties for controlling how metrics are published in Hadoop. log4j.properties Java Properties Properties for system log files, the namenode audit log and the task log for the task-tracker child process. Hadoop Configuration Files
  6. 11.

    MapReducing HDFS/GFS GFS-File Structure: Divide into 64 MB chucks Chucks

    replicated. (Default 3 replication). Chucks divided into 64KB blocks. Each block has a 34 bit checksum. HDFS-File Structure: Divided into 128 MB blocks Name node holds block replica as 2 files. One for Data One for checksum and generation Stamp. HDFS Vs. GFS Hadoop Google
  7. 12.

    core-site.xml and hdfs-site.xml hdfs-site.xml core-site.xml <?xml version -"1.0"?> <?xml version

    -"1.0"?> <!--hdfs-site.xml--> <!--core-site.xml--> <configuration> <configuration> <property> <property> <name>dfs.replication</name> <name>fs.default.name</name> <value>1</value> <value>http://localhost:8020</value> </property> </property> </configuration> </configuration>
  8. 13.

    Defining HDFS Details In hdfs-site.xml Property Value Description dfs.data.dir <value>

    /disk1/hdfs/data, /disk2/hdfs/data </value> A list of directories where the datanode stores blocks. Each block is stored in only one of these directories. ${hadoop.tmp.dir}/dfs/data fs.checkpoint.dir <value> /disk1/hdfs/namesecondary, /disk2/hdfs/namesecondary </value> A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list ${hadoop.tmp.dir}/dfs/name secondary
  9. 15.

    Property Value Description mapred.job.tracker <value> localhost:8021 </value> The hostname and

    the port that the jobtracker RPC server runs on. If set to the default value of local, then the jobtracker runs in-process on demand when you run a MapReduce job. mapred.local.dir ${hadoop.tmp.dir}/mapred/local A list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends. mapred.system.dir ${hadoop.tmp.dir}/mapred/system The hostname and the port that the jobtracker RPC server runs on. If set to the default value of local, then the jobtracker runs in-process on demand when you run a MapReduce job. mapred.tasktracker. map/reducer .tasks.maximum 2 The number of map/reducer tasks that may be run on a tasktracker at any one time Defining mapred-site.xml
  10. 17.

    Slaves and Masters Two files are used by the startup

    and shutdown commands: Slaves Contains a list of hosts, one per line, that are to host DataNode and TaskTracker servers. Masters Contains a list of hosts, one per line, that are to host Secondary NameNode servers.
  11. 18.

    Set parameter JAVA_HOME This file also offers a way to

    provide custom parameters for each of the servers. Hadoop-env.sh is sourced by all of the Hadoop Core scripts provided in the conf/directory of the installation. Examples of environment variables that you can specify: Export: HADOOP_DATANODE_HEAPSIZE ="128" Export : HADOOP_TASKTRACKER_HEAPSI ZE="512" Per-Process Runtime Environment hadoop-env.sh JVM
  12. 19.

    # Set Hadoop-specific environment variables here. # The only required

    environment variable is JAVA_HOME. All others are # optional. When running a distributed configuration it is best to # set JAVA_HOME in this file, so that it is correctly defined on # remote nodes. # The java implementation to use. Required. export JAVA_HOME=/usr/lib/jvm/java-7-sun-1.7.0.45 # Extra Java runtime options. Empty by default. export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true ${HADOOP_OPTS}" ….. ….. # A string representing this instance of hadoop. $USER by default. export HADOOP_IDENT_STRING=$USER hadoop.env-sh Sample
  13. 20.
  14. 21.

    Critical Properties fs.default.name hadoop.tmp.dir mapred.job.tracker fs.default.name: It points to the

    default URI for all file system requests in Hadoop. Hadoop.tmp.dir hadoop.tmp.dir is used as the base for temporary directories locally, and also in HDFS Mapred.job.tracker The host and port of the MapReduce job tracker where it runs. If "local", then jobs are run in-process as a single map and reduce task.
  15. 22.

    Network Requirements Uses Shell (SSH) to launch the server processes

    on the slave nodes Requires password-less SSH connection between the master and all the slaves and secondary machines Hadoop Core
  16. 23.

    NameNode Recovery Shut down the secondary NameNode. secondary:fs.checkpoint.dir -> Namenode:dfs.name.dir

    secondary:fs.checkpoint.edits -> Namenode:dfs.name.edits.dir When the copy completes, start the NameNode and restart the secondary NameNode. 1 2 3 4
  17. 28.

    Data Loading Techniques and Data Analysis Using Pig Using Hive

    Using Hadoop Copy Commands Using Sqoop Using Flume Data Loading Data Analysis HDFS
  18. 29.