Slide 1

Hadoop Operations
JaxLUG, 2013
Ovidiu Dimulescu @odimulescu speakerdeck.com/odimulescu

Slide 2

About @odimulescu
• Working on the Web since 1997
• Organizer for JaxMUG.com
• Co-Organizer for Jax Big Data meetup

Slide 3

Agenda
• Background
• Architecture 1.0 vs 2.0
• Installation
• Security
• Monitoring
• Demo
• Questions?

Slide 4

What is Hadoop?
• Apache Hadoop is an open-source Java software framework for running data-intensive applications on large clusters of commodity hardware
• Created by Doug Cutting (creator of Lucene & Nutch)
• Named after Doug’s son’s toy elephant

Slide 5

What and how is Hadoop solving?
Processing diverse large datasets in practical time at low cost
• Consolidates data in a distributed file system
• Moves computation to data rather than data to computation
• Simplifies the programming model

Slide 6

Why does it matter?
• Volume, Velocity, Variety and Value
• Datasets do not fit on local HDDs, let alone RAM
• Scaling up
  ‣ Is expensive (licensing, hardware, etc.)
  ‣ Has a ceiling (physical, technical, etc.)

Slide 7

Why does it matter?
• Scanning 10TB at a sustained transfer rate of 75MB/s takes ~2 days on 1 node and ~5 hrs on a 10-node cluster
• Low $/TB for commodity drives
• Low-end servers are multicore capable
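A quick sketch of the arithmetic behind those scan figures (not from the deck): it models only the raw sequential read time, so the slide's rounder numbers, which absorb real-world seek and coordination overhead, come out somewhat higher.

// Back-of-the-envelope scan time: raw sequential read only, no seek or coordination overhead.
public class ScanTime {
  public static void main(String[] args) {
    double datasetBytes = 10e12;          // 10 TB
    double throughputBytesPerSec = 75e6;  // 75 MB/s sustained per drive
    for (int nodes : new int[] {1, 10}) {
      double hours = datasetBytes / (throughputBytesPerSec * nodes) / 3600;
      System.out.printf("%2d node(s): %.1f hours (%.1f days)%n", nodes, hours, hours / 24);
    }
  }
}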

Slide 8

Architecture – Core Components
HDFS - Distributed filesystem designed for low-cost storage and high-bandwidth access across the cluster.
MapReduce - Programming model for processing and generating large data sets.
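To make the MapReduce programming model concrete, here is the canonical word-count sketch against the org.apache.hadoop.mapreduce API (illustrative code, not taken from the deck): the map phase emits (word, 1) pairs and the reduce phase sums the counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}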

Slide 9

HDFS - Design
• Files are stored as blocks (64MB default size)
• Configurable data replication (3x, Rack Aware*)
• Fault tolerant, expects HW failures
• HUGE files, expects streaming, not low latency
• Mostly WORM
• Not POSIX compliant
• Not mountable OOTB*
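As an illustration of how the block size and replication factor above are tuned, a minimal client-side sketch using the Configuration and FileSystem APIs; the property names are the Hadoop 1.x ones, and in practice these usually live in hdfs-site.xml rather than code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsDefaultsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();           // reads core-site.xml / hdfs-site.xml
    conf.setLong("dfs.block.size", 128L * 1024 * 1024); // 128MB blocks instead of the 64MB default (1.x property name)
    conf.setInt("dfs.replication", 2);                  // 2 replicas instead of the default 3

    // Assumes fs.default.name points at an HDFS cluster, so the defaults below come from the overrides.
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Default block size: " + fs.getDefaultBlockSize());
    System.out.println("Default replication: " + fs.getDefaultReplication());
  }
}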

Slide 10

Architecture - HDFS
NameNode (NN) - Master
• Filesystem metadata
• Files R/W control
• Blocks replication
DataNode (DN) - Slaves
• Blocks R/W per clients
• Replicates blocks per master
• Notifies master about block-ids
Read path: the client asks the NN for a file, the NN returns the DNs that hold its blocks, and the client reads the data directly from those DNs.
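A minimal client-side sketch of that read path using the FileSystem API (illustrative code, not from the deck): the NameNode is consulted only for metadata and block locations, while the bytes stream straight from the DataNodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // fs.default.name points at the NameNode
    FileSystem fs = FileSystem.get(conf);
    // args[0] is an HDFS path, e.g. /user/odimulescu/input.txt
    try (FSDataInputStream in = fs.open(new Path(args[0]));
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}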

Slide 11

HDFS - Fault tolerance
• DataNode
  ‣ Uses CRC32 checksums to detect corruption
  ‣ Data is replicated on other nodes (3x)*
• NameNode
  ‣ fsimage - last snapshot
  ‣ edits - change log since the last snapshot
  ‣ Checkpoint Node
  ‣ Backup NameNode
  ‣ Failover is manual*

Slide 12

Architecture - MapReduce
JobTracker (JT) - Master
• Accepts MR jobs submitted by clients
• Assigns MR tasks to TaskTrackers
• Monitors tasks and TaskTracker status, re-executes tasks upon failure
• Speculative execution
TaskTracker (TT) - Slaves
• Runs MR tasks received from the JobTracker
• Manages storage and transmission of intermediate output
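A minimal driver sketch showing what "client submits a job" looks like in code, reusing the word-count Mapper/Reducer from the earlier sketch (class names are illustrative, not from the deck): the driver packages the job and hands it to the JobTracker, which schedules tasks onto TaskTrackers and re-runs them on failure.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // mapred.job.tracker points at the JobTracker
    Job job = new Job(conf, "word count");      // Job.getInstance(conf, name) in newer releases
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);   // blocks and reports task progress
  }
}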

Slide 13

Hadoop - Core Architecture
[Diagram: JobTracker (JOBS API) and NameNode (HDFS) as masters; each worker node pairs a TaskTracker with a DataNode]
* Mini OS: Filesystem & Scheduler

Slide 14

Hadoop 2.0 - HDFS Architecture
• Distributed Namespace
• Multiple Block Pools

Slide 15

Hadoop 2.0 - YARN Architecture

Slide 16

MapReduce - Clients
Java - Native:
  hadoop jar jar_path main_class input_path output_path
C++ - Pipes framework:
  hadoop pipes -input path_in -output path_out -program exec_program
Any language - Streaming:
  hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out
Also: Pig Latin, Hive HQL, C via JNI

Slide 17

Hadoop - Ecosystem
Storage: HDFS, HBase
Data Processing: MapReduce
Data Access: Pig, Hive, Sqoop
Management: ZooKeeper, Chukwa
Also in the ecosystem: MPI, Giraph, Hama, Impala, Ambari, HUE, Flume, Mahout

Slide 18

Installation - Platforms
Production: Linux (official)
Development: Linux, OSX, Windows via Cygwin, *Nix

Slide 19

Installation - Versions
Public numbering
• 1.0.x - current stable version
• 1.1.x - current beta version for the 1.x branch
• 2.x - current alpha version
Development numbering
• 0.20.x aka 1.x - CDH 3 & HDP 1
• 0.23.x aka 2.x - CDH 4 & HDP 2 (alpha)

Slide 20

Installation - For toying
Option 1 - Official project releases
  hadoop.apache.org/common/releases.html
Option 2 - Demo VM from vendors
  • Cloudera
  • Hortonworks
  • Greenplum
  • MapR
Option 3 - Cloud
  • Amazon’s EMR
  • Hadoop on Azure

Slide 21

Installation - For real
Vendor distributions
• Cloudera CDH
• Hortonworks HDP
• Greenplum GPHD
• MapR M3, M5 or M7
Hosted solutions
• AWS EMR
• Hadoop on Azure
Use virtualization - VMware Serengeti *

Slide 22

Security - Simple Mode
• Use in a trusted environment
  ‣ Identity comes from the euid of the client process
  ‣ MapReduce tasks run as the TaskTracker user
  ‣ The user that starts the NameNode is the super-user
• Reasonable protection against accidental misuse
• Simple to set up
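A small sketch of where identity comes from in simple mode, using the UserGroupInformation API (the class name is illustrative): the cluster simply trusts whatever user the client process reports, i.e. the local OS account that launched the JVM.

import org.apache.hadoop.security.UserGroupInformation;

public class WhoAmI {
  public static void main(String[] args) throws Exception {
    // In simple mode this resolves to the local OS user of the client process.
    UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
    System.out.println("Hadoop sees me as: " + ugi.getUserName());
    System.out.println("Kerberos security enabled: " + UserGroupInformation.isSecurityEnabled());
  }
}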

Slide 23

Security - Secure Mode
• Kerberos based
• Use for tight granular access
  ‣ Identity comes from a Kerberos principal
  ‣ MapReduce tasks run as the Kerberos principal
• Use a dedicated MIT KDC
• Hook it to your primary KDC (AD, etc.)
• Significant setup effort (users, groups and Kerberos keys on all nodes, etc.)
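A client-side sketch of what secure mode changes: authentication switches to Kerberos and the process logs in from a keytab instead of relying on the local OS user. The principal, keytab path and host are placeholders, and in a real cluster these properties are set in core-site.xml rather than in code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureLogin {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("hadoop.security.authentication", "kerberos"); // "simple" is the default
    conf.set("hadoop.security.authorization", "true");      // enable service-level authorization
    UserGroupInformation.setConfiguration(conf);

    // Principal and keytab path are placeholders for illustration only.
    UserGroupInformation.loginUserFromKeytab("hdfs/node1.example.com@EXAMPLE.COM",
                                             "/etc/hadoop/conf/hdfs.keytab");
    System.out.println("Logged in as: " + UserGroupInformation.getCurrentUser());
  }
}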

Slide 24

Monitoring
Built-in
• JMX
• REST
• No SNMP support
Other
• Cloudera Manager (free up to 50 nodes)
• Ambari - free, for RPM-based systems (RH, CentOS)
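As a taste of the built-in JMX/REST monitoring, a small sketch that polls the NameNode's JMX JSON servlet over HTTP; the host name, port (50070 is the 1.x web UI default) and queried bean are illustrative.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class JmxProbe {
  public static void main(String[] args) throws Exception {
    // Pull NameNode metrics as JSON; host and bean query are placeholders.
    URL url = new URL("http://namenode.example.com:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);   // JSON with capacity, live/dead DataNodes, etc.
      }
    }
  }
}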

Slide 25

Demo

Slide 26

References
• Hadoop Operations, by Eric Sammer
• Hadoop Security, by Hortonworks Blog
• HDFS Federation, by Suresh Srinivas
• Hadoop 2.0 New Features, by VertiCloud Inc
• MapReduce in Simple Terms, by Saliya Ekanayake
• Hadoop Architecture, by Phillipe Julio

Slide 27

Questions?
Ovidiu Dimulescu @odimulescu speakerdeck.com/odimulescu