Hadoop Operations - Speaker Deck

Slide 1

Slide 1 text

Marc Cluet – Lynx Consultants
How Hadoop Works

Slide 2

Slide 2 text

Slide 3

Slide 3 text

Slide 4

Slide 4 text

Slide 5

Slide 5 text

Slide 6

Slide 6 text

Slide 7

Slide 7 text

Slide 8

Slide 8 text

Slide 9

Slide 9 text

Slide 10

Slide 10 text

Hadoop Components Breakdown
- HDFS
  - Master Namenode
    - Keeps track of all file allocation on Datanodes
    - Rebalances data if one of the Datanodes goes down
    - Is rack aware
  - Secondary Namenode
    - Does cleanup services for the Namenode
    - Does not necessarily need a different server from the Namenode
  - Datanode
    - Stores the data
    - Non-RAID disks are preferable, for extra I/O speed
Lynx Consultants © 2013

Slide 11

Slide 11 text

Hadoop Components Breakdown
- HDFS
  - How to access
    - Client can connect with the hadoop client to hdfs://namenode:8020
    - Supports the basic Unix file commands (such as ls, cat, mkdir and rm)
  - Configuration files
    - /etc/hadoop/conf/core-site.xml
      - Defines major configuration, such as the HDFS namenode and default parameters
    - /etc/hadoop/conf/hdfs-site.xml
      - Defines configuration specific to the namenode or datanode, such as file locations
    - /etc/hadoop/conf/slaves
      - Defines the list of servers that are available in this cluster
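The `*-site.xml` files listed above share one layout: a `configuration` root holding `property` entries, each with a `name` and a `value`. As a rough sketch (not the deck's own tooling), the Python below renders a minimal core-site.xml that points clients at the namenode URL from the slide. The helper name `build_site_xml` is hypothetical, and `fs.defaultFS` is the Hadoop 2 property name (older releases used `fs.default.name`), so check it against the version in use.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Render a dict of settings in Hadoop's *-site.xml property layout."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Assumption: host "namenode" and port 8020 are taken from the slide's URL.
core_site = build_site_xml({"fs.defaultFS": "hdfs://namenode:8020"})
print(core_site)
```

The same helper would serve for hdfs-site.xml or mapred-site.xml, since only the property names differ.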

Slide 12

Slide 12 text

Hadoop Components Breakdown
- HBase
  - Master
    - Controls the HBase cluster, knows where the data is allocated, and provides a client listening socket using Thrift and/or a RESTful API
  - Regionserver
    - HBase node; stores part of the data in one of the regions, which is roughly equivalent to sharding
  - Thrift / REST
    - Interface to connect to HBase
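To make the sharding comparison concrete, here is a toy sketch of range-based lookup: each region covers a contiguous range of row keys, and a row is served by whichever region's range contains its key. The start keys and server names (`rs1`, `rs2`, `rs3`) are made up for illustration, and this is not how the HBase client is actually implemented.

```python
import bisect

# Hypothetical region layout: each region starts at a row key and runs up to
# the next start key. The server names are invented for this example.
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs1"]


def locate_region_server(row_key):
    """Return the server holding the region whose key range contains row_key."""
    # bisect_right finds the first start key greater than row_key; the region
    # just before that position is the one that owns the key.
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_SERVERS[index]


for key in ["apple", "hadoop", "zebra"]:
    print(key, "->", locate_region_server(key))
```

One regionserver can hold several regions, which is why `rs1` appears twice in the hypothetical layout.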

Slide 13

Slide 13 text

Hadoop Components Breakdown
- HBase
  - How to access
    - Through the HBase client (using Thrift)
    - Through the RESTful API
  - Configuration files
    - /etc/hbase/conf/hbase-site.xml
      - Defines all the basic configuration for accessing HBase
    - /etc/hbase/conf/hbase-policy.xml
      - Defines all the security (ACL) and all the HBase memory tweaks
    - /etc/hbase/conf/regionservers
      - Lists all the regionservers available to this cluster

Slide 14

Slide 14 text

Hadoop Components Breakdown
- MapRed
  - JobTracker
    - Creates the Map/Reduce jobs
    - Stores all the intermediate data
    - Keeps track of all the previous results through the HistoryServer
  - TaskTracker
    - Executes tasks related to the Map/Reduce job
    - Very CPU and memory intensive
    - Stores intermediate results, which are then pushed to the JobTracker
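For readers new to the model, the map, group and reduce steps that the trackers coordinate can be sketched in a single process. This is only an illustration of the programming model, using word counting as the usual example; it involves no Hadoop API, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


lines = ["Hadoop stores data", "Hadoop processes data"]
intermediate = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)
```

On a real cluster the map and reduce calls run as separate tasks on different nodes, and the grouped intermediate data is what moves between them.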

Slide 15

Slide 15 text

Hadoop Components Breakdown
- MapRed
  - How to access
    - Through the Hadoop client
    - Through any MapRed client like Pig or Hive
    - Own Java code
  - Configuration files
    - /etc/hadoop/conf/mapred-site.xml
      - Defines how to contact this MapRed cluster
    - /etc/hadoop/conf/mapred-queue-acls.xml
      - Defines the ACL structure for accessing MapRed, normally not necessary
    - /etc/hadoop/conf/slaves
      - Defines the list of TaskTrackers in this cluster

Slide 16

Slide 16 text

Hadoop Components Breakdown
- YARN
  - Same structure as MapRed (lives on top of it)
  - Configuration files
    - /etc/hadoop/conf/yarn-site.xml
      - All required configuration for YARN

Slide 17

Slide 17 text

Hadoop Cluster Breakdown
- Namenode Server
  - HDFS Namenode
  - HBase Master
- Secondary Namenode Server
  - HDFS Secondary Namenode
- JobTracker Server
  - MapRed JobTracker
  - MapRed History Server

Slide 18

Slide 18 text

Slide 19

Slide 19 text

Slide 20

Slide 20 text

Hadoop Hardware Requirements
- JobTracker Server
  - Redundant power supplies
  - RAID1 drives
  - Enough memory (16 GB)
- Datanode Server
  - Lots of cheap disk (no RAID)
  - Lots of memory (32 GB)
  - Lots of CPU

Slide 21

Slide 21 text

Slide 22

Slide 22 text

Slide 23

Slide 23 text

Slide 24

Slide 24 text

Slide 25

Slide 25 text

Flume
- Transports streams of data from point A to point B
- Source
  - Where the data is read from
- Channel
  - How the data is buffered
- Sink
  - Where the data is written
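The source, channel and sink roles can be pictured with a small in-process model. The sketch below is only a toy under simplifying assumptions (one source, one in-memory channel, one sink, no failures); the class and function names are hypothetical and are not Flume's API.

```python
from collections import deque


class MemoryChannel:
    """Buffers events between a source and a sink (toy, in-memory only)."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        # Return the oldest buffered event, or None when the channel is empty.
        return self._events.popleft() if self._events else None


def run_source(lines, channel):
    """Source: read events from somewhere (here, a list) into the channel."""
    for line in lines:
        channel.put(line)


def run_sink(channel, destination):
    """Sink: drain the channel and write each event to the destination."""
    event = channel.take()
    while event is not None:
        destination.append(event)
        event = channel.take()


channel = MemoryChannel()
delivered = []
run_source(["event-1", "event-2", "event-3"], channel)
run_sink(channel, delivered)
print(delivered)
```

Because the source and sink only ever talk to the channel, either end can be swapped without touching the other, which is the point of the three-part split.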

Slide 26

Slide 26 text

Flume
- Flume is fault tolerant
- Sources keep a pointer to their position
  - With some exceptions, but most sources are in a known state
- Channels can be fault tolerant
  - A channel written to disk can recover from where it left off
- Sinks can be redundant
  - More than one sink for the same data
  - Data is serialised and deduplicated using Avro

Slide 27

Slide 27 text

Slide 28

Slide 28 text

Slide 29

Slide 29 text

Slide 30

Slide 30 text

Slide 31

Slide 31 text

Hadoop References
- Hadoop
  - http://hadoop.apache.org/docs/stable/cluster_setup.html
  - http://rc.cloudera.com/cdh/4/hadoop/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html
  - http://pig.apache.org/docs/r0.7.0/setup.html
  - http://wiki.apache.org/hadoop/NameNodeFailover
- HBase
  - http://hbase.apache.org/book/book.html
- Flume
  - http://archive.cloudera.com/cdh4/cdh/4/flume-ng/FlumeUserGuide.html

Slide 32

Slide 32 text