Hadoop Operations

Marc Cluet – Lynx Consultants
How Hadoop Works

What we’ll cover
• Understand Hadoop in detail
• See how Hadoop works operationally
• Be able to start asking the right questions of your data

Lynx Consultants © 2013

Hadoop Distributions
• Cloudera CDH
• Hortonworks
• MapR

Hadoop Components
• HDFS
• HBase
• MapRed
• YARN

Hadoop Components
• HDFS
  ◦ Hadoop Distributed File System
  ◦ Everything else sits on top of it
  ◦ Keeps 3 copies of every block by default
• HBase
• MapRed
• YARN
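
The three-copy default has a direct capacity cost. The sketch below is illustrative arithmetic only: the 128 MB block size is an assumption (the default differs between Hadoop versions, so check the block size setting on your own cluster), and the function names are made up for this example.

```python
DEFAULT_REPLICATION = 3               # from the slide: 3 copies of every block
ASSUMED_BLOCK_SIZE = 128 * 1024 ** 2  # assumption: 128 MB blocks, not taken from the deck


def block_count(file_bytes, block_size=ASSUMED_BLOCK_SIZE):
    """Number of blocks a file of this size is split into (ceiling division)."""
    return -(-file_bytes // block_size)


def raw_bytes(file_bytes, replication=DEFAULT_REPLICATION):
    """Raw disk consumed across the cluster once every block is replicated."""
    return file_bytes * replication


one_gib = 1024 ** 3
print(block_count(one_gib), "blocks,", raw_bytes(one_gib) // one_gib, "GiB of raw disk")
```

In other words, usable HDFS capacity is roughly the total datanode disk divided by the replication factor.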

Hadoop Components
• HDFS
• HBase
  ◦ Hadoop schemaless database
  ◦ Key-value store
  ◦ Sits on top of HDFS
• MapRed
• YARN

Hadoop Component Breakdown
• All these components are organised into
  ◦ client/server roles
  ◦ master/slave roles
• We will now go through the breakdown of each individual component

Hadoop Components Breakdown
• HDFS
  ◦ Master Namenode
    ▪ Keeps track of all file allocation on the datanodes
    ▪ Rebalances data if one of the datanodes goes down
    ▪ Is rack aware
  ◦ Secondary Namenode
    ▪ Does housekeeping for the namenode (periodically merges its edit log into the filesystem image); it is not a hot standby
    ▪ Not necessarily a different server from the namenode
  ◦ Datanode
    ▪ Stores the data
    ▪ Best run on non-RAID disks for extra I/O speed
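
To make the "keeps track of allocation and rebalances" idea concrete, here is a toy model of the bookkeeping. It is a sketch only, not the namenode's actual algorithm: the block map, the datanode names and the function `under_replicated` are all hypothetical.

```python
def under_replicated(block_map, live_datanodes, replication=3):
    """Return {block_id: missing_copies} for blocks with too few live replicas.

    block_map maps a block id to the datanodes believed to hold a copy.
    """
    live = set(live_datanodes)
    missing = {}
    for block_id, holders in block_map.items():
        live_copies = len(set(holders) & live)
        if live_copies < replication:
            missing[block_id] = replication - live_copies
    return missing


# Hypothetical cluster state: dn4 has just gone down.
blocks = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}
print(under_replicated(blocks, live_datanodes=["dn1", "dn2", "dn3"]))
```

Blocks reported here are the ones that would need a new copy made on a surviving datanode.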

Hadoop Components Breakdown
• HDFS
  ◦ How to access
    ▪ A client can connect with the hadoop client to hdfs://namenode:8020
    ▪ Supports all the basic Unix file commands
  ◦ Configuration files
    ▪ /etc/hadoop/conf/core-site.xml
      Defines the major configuration, such as the HDFS namenode and default parameters
    ▪ /etc/hadoop/conf/hdfs-site.xml
      Defines configuration specific to the namenode or datanode, such as file locations
    ▪ /etc/hadoop/conf/slaves
      Defines the list of servers that are available in this cluster
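
The Hadoop site files share one simple XML layout: a `configuration` root holding `property` elements with a `name` and a `value`. A minimal sketch of generating a core-site.xml follows. The helper `build_site_xml` is hypothetical, and the property name `fs.defaultFS` is the one used by Hadoop 2 (older releases call it `fs.default.name`), so treat it as an assumption for your version.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Render a dict of settings in the <configuration>/<property> layout."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Namenode URI as given on the slide; the property name is version dependent.
core_site = build_site_xml({"fs.defaultFS": "hdfs://namenode:8020"})
print(core_site)
```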

Hadoop Components Breakdown
• HBase
  ◦ Master
    ▪ Controls the HBase cluster, knows where the data is allocated, and provides a client listening socket using Thrift and/or a RESTful API
  ◦ Regionserver
    ▪ An HBase node; it stores part of the data in its regions, which is roughly equivalent to sharding
  ◦ Thrift / REST
    ▪ Interface to connect to HBase
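
The sharding comparison can be sketched as a range lookup: regions cover contiguous, sorted ranges of row keys, and each region is served by one regionserver. The start keys and server names below are hypothetical, and this is an illustration of the idea rather than HBase's own lookup code.

```python
import bisect

# Hypothetical layout: each region starts at a row key and runs to the next start key.
REGION_START_KEYS = ["", "g", "n", "t"]   # "" means "from the very first key"
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs1"]


def region_server_for(row_key):
    """Find the regionserver whose region contains row_key."""
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_SERVERS[index]


for key in ("apple", "monkey", "zebra"):
    print(key, "->", region_server_for(key))
```

One server can hold several regions, which is why "rs1" appears twice in the assumed layout.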

Hadoop Components Breakdown
• HBase
  ◦ How to access
    ▪ Through the HBase client (using Thrift)
    ▪ Through the RESTful API
  ◦ Configuration files
    ▪ /etc/hbase/conf/hbase-site.xml
      Defines all the basic configuration for accessing HBase
    ▪ /etc/hbase/conf/hbase-policy.xml
      Defines all the security (ACL) and all the HBase memory tweaks
    ▪ /etc/hbase/conf/regionservers
      Lists all the regionservers available to this cluster

Hadoop Components Breakdown
• MapRed
  ◦ JobTracker
    ▪ Creates the Map/Reduce jobs
    ▪ Stores all the intermediate data
    ▪ Keeps track of all the previous results through the HistoryServer
  ◦ TaskTracker
    ▪ Executes the tasks related to the Map/Reduce job
    ▪ Very CPU and memory intensive
    ▪ Stores intermediate results, which are then pushed to the JobTracker
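
For readers new to the model, the map, shuffle and reduce steps that these daemons coordinate can be imitated in a single process. This is a word-count sketch in plain Python, not Hadoop API code; every function name here is made up for the illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["Hadoop stores data", "hadoop processes data"]))
```

On a real cluster the map and reduce calls run as separate tasks on many TaskTrackers; only the shape of the computation is the same.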

Hadoop Components Breakdown
• MapRed
  ◦ How to access
    ▪ Through the Hadoop client
    ▪ Through any MapRed client such as Pig or Hive
    ▪ Through your own Java code
  ◦ Configuration files
    ▪ /etc/hadoop/conf/mapred-site.xml
      Defines how to contact this MapRed cluster
    ▪ /etc/hadoop/conf/mapred-queue-acls.xml
      Defines the ACL structure for accessing MapRed; normally not necessary
    ▪ /etc/hadoop/conf/slaves
      Defines the list of TaskTrackers in this cluster
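
When checking how a client will contact the cluster, it can help to read a site file back into a dictionary. The sketch below parses the standard `configuration`/`property` layout. The sample content is invented: `mapred.job.tracker` is the classic (MRv1) property name, while the hostname and port 8021 are assumptions, not values from the deck.

```python
import xml.etree.ElementTree as ET

# Invented sample; on a real node you would read /etc/hadoop/conf/mapred-site.xml.
SAMPLE_MAPRED_SITE = """<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:8021</value>
  </property>
</configuration>"""


def parse_site_xml(xml_text):
    """Return the {name: value} pairs found in a Hadoop site file."""
    root = ET.fromstring(xml_text)
    return {prop.findtext("name"): prop.findtext("value")
            for prop in root.findall("property")}


settings = parse_site_xml(SAMPLE_MAPRED_SITE)
print(settings["mapred.job.tracker"])
```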

Hadoop Components Breakdown
• YARN
  ◦ Same structure as MapRed (lives on top of it)
  ◦ Configuration files
    ▪ /etc/hadoop/conf/yarn-site.xml
      All required configuration for YARN

Hadoop Cluster Breakdown
• Namenode Server
  ◦ HDFS Namenode
  ◦ HBase Master
• Secondary Namenode Server
  ◦ HDFS Secondary Namenode
• JobTracker Server
  ◦ MapRed JobTracker
  ◦ MapRed History Server

Hadoop Cluster Breakdown
• Datanode Server
  ◦ HDFS Datanode
  ◦ HBase RegionServer
  ◦ MapRed TaskTracker

Hadoop Hardware Requirements
• Namenode Server
  ◦ Redundant power supplies
  ◦ RAID1 drives
  ◦ Enough memory (16 GB)
• Secondary Namenode Server
  ◦ Almost none

Hadoop Hardware Requirements
• JobTracker Server
  ◦ Redundant power supplies
  ◦ RAID1 drives
  ◦ Enough memory (16 GB)
• Datanode Server
  ◦ Lots of cheap disk (no RAID)
  ◦ Lots of memory (32 GB)
  ◦ Lots of CPU

Hadoop Default Ports
• HDFS
  ◦ 8020: HDFS Namenode
  ◦ 50010: HDFS Datanode FS transfer
• MapRed
  ◦ No defaults
• HBase
  ◦ 60010: Master
  ◦ 60020: Regionserver
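
A quick operational use of this list is checking whether a daemon is listening. The sketch below uses only the port numbers given on the slide; the dictionary keys and the `port_open` helper are hypothetical names, and a successful TCP connect only shows that something is listening, not that the service is healthy.

```python
import socket

# Port numbers as listed on the slide; the key names are made up for this example.
DEFAULT_PORTS = {
    "hdfs-namenode": 8020,
    "hdfs-datanode-transfer": 50010,
    "hbase-master": 60010,
    "hbase-regionserver": 60020,
}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, `port_open("namenode", DEFAULT_PORTS["hdfs-namenode"])` would test the namenode address used earlier in the deck.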

Flume
• Transports streams of data from point A to point B
• Source
  ◦ Where the data is read from
• Channel
  ◦ How the data is buffered
• Sink
  ◦ Where the data is written
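
The source, channel and sink roles can be imitated in a few lines. This is a toy in-process model, not Flume code: the function names and the capacity value are assumptions, and a bounded queue stands in for the channel.

```python
import queue


def source(lines):
    """Source: where the data is read from (here, an iterable of text lines)."""
    for line in lines:
        yield line.rstrip("\n")


def drain(channel, sink):
    """Hand every buffered event to the sink."""
    while not channel.empty():
        sink(channel.get())


def run_agent(lines, sink, capacity=100):
    """Move events from the source through a bounded channel into the sink."""
    channel = queue.Queue(maxsize=capacity)   # channel: how the data is buffered
    for event in source(lines):
        if channel.full():
            drain(channel, sink)              # make room before buffering more
        channel.put(event)
    drain(channel, sink)                      # flush whatever is left


received = []
run_agent(["a\n", "b\n", "c\n"], received.append, capacity=2)
print(received)
```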

Flume
• Flume is fault tolerant
• Sources keep a pointer to their position
  ◦ With some exceptions, but most sources are in a known state
• Channels can be fault tolerant
  ◦ A channel written to disk can recover from where it left off
• Sinks can be redundant
  ◦ More than one sink for the same data
  ◦ Data is serialised and deduplicated using Avro

Flume
• Configuration files
  ◦ /etc/flume-ng/conf/flume.conf
    ▪ Defines the agent configuration with source, channel, sink
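
A flume.conf is a plain properties file in which every key is prefixed with the agent name. The sketch below renders a minimal one; the agent and component names (a1, r1, c1, k1), the netcat source, memory channel, logger sink and port 44444 are placeholder choices for illustration, not settings taken from the deck.

```python
def render_flume_conf(agent, source, channel, sink):
    """Render a minimal single-source, single-channel, single-sink agent."""
    lines = [
        # Declare the components that belong to this agent.
        f"{agent}.sources = {source}",
        f"{agent}.channels = {channel}",
        f"{agent}.sinks = {sink}",
        # Source: a netcat listener (placeholder type, bind address and port).
        f"{agent}.sources.{source}.type = netcat",
        f"{agent}.sources.{source}.bind = localhost",
        f"{agent}.sources.{source}.port = 44444",
        # Channel: an in-memory buffer.
        f"{agent}.channels.{channel}.type = memory",
        # Sink: write events to the log.
        f"{agent}.sinks.{sink}.type = logger",
        # Wiring: a source lists its channels, a sink names exactly one channel.
        f"{agent}.sources.{source}.channels = {channel}",
        f"{agent}.sinks.{sink}.channel = {channel}",
    ]
    return "\n".join(lines) + "\n"


conf = render_flume_conf("a1", "r1", "c1", "k1")
print(conf)
```

Swapping the memory channel for a disk-backed one is what gives the recovery behaviour described on the fault-tolerance slide.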

Hadoop References
• Hadoop
  ◦ http://hadoop.apache.org/docs/stable/cluster_setup.html
  ◦ http://rc.cloudera.com/cdh/4/hadoop/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html
  ◦ http://pig.apache.org/docs/r0.7.0/setup.html
  ◦ http://wiki.apache.org/hadoop/NameNodeFailover
• HBase
  ◦ http://hbase.apache.org/book/book.html
• Flume
  ◦ http://archive.cloudera.com/cdh4/cdh/4/flume-ng/FlumeUserGuide.html

Slides with no transcribed text: Hadoop HDFS Workflow; Hadoop MapRed Workflow (two slides); Flume (two slides); Hadoop Recommended Reads; Questions?