Hadoop Operations

Hadoop is a flexible framework for large-scale computation and data processing, designed to scale out on commodity hardware. This talk covers the operational challenges around provisioning, high availability, monitoring, and security in Hadoop 1.0 and 2.0.

Ovidiu Dimulescu

January 16, 2013

Transcript

  1. Hadoop Operations
    JaxLUG, 2013
    Ovidiu Dimulescu
    @odimulescu
    speakerdeck.com/odimulescu

  2. About @odimulescu
    • Working on the Web since 1997

    • Organizer for JaxMUG.com
    • Co-Organizer for Jax Big Data meetup

  3. Agenda
    • Background
    • Architecture 1.0 vs 2.0
    • Installation
    • Security
    • Monitoring
    • Demo
    • Questions?

  4. What is Hadoop?
    • Apache Hadoop is an open source Java software framework for running data-intensive applications on large clusters of commodity hardware
    • Created by Doug Cutting (Lucene & Nutch creator)
    • Named after Doug’s son’s toy elephant

  5. What is Hadoop solving, and how?
    Processing diverse large datasets in practical time at low cost
    • Consolidates data in a distributed file system
    • Moves computation to data rather than data to computation
    • Simplifies programming model
    [Diagram: eight boxes labeled CPU]

  6. Why does it matter?
    • Volume, Velocity, Variety and Value
    • Datasets do not fit on local HDDs, let alone in RAM
    • Scaling up
    ‣ Is expensive (licensing, hardware, etc.)
    ‣ Has a ceiling (physical, technical, etc.)

  7. Why does it matter?
    • Scanning 10TB at a sustained transfer rate of 75MB/s takes
    ~2 days on 1 node
    ~5 hrs on a 10-node cluster
    • Low $/TB for commodity drives
    • Low-end servers are multicore capable
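
The scan-time estimate on this slide can be reproduced with a few lines of arithmetic. The sketch below is an idealized calculation, assuming decimal units (1 TB = 1,000,000 MB), a perfectly even split of the data across nodes, and no coordination overhead; the function name `scan_hours` is made up for illustration. Under those assumptions it gives roughly 37 hours for one node and 3.7 hours for ten, so the slide's "~2 days" and "~5 hrs" read as rounded-up figures.

```python
def scan_hours(dataset_tb: float, mb_per_sec: float, nodes: int = 1) -> float:
    """Idealized hours needed to scan a dataset split evenly across nodes."""
    total_mb = dataset_tb * 1_000_000          # decimal units: 1 TB = 1,000,000 MB
    seconds = total_mb / (mb_per_sec * nodes)  # every node scans its share in parallel
    return seconds / 3600


if __name__ == "__main__":
    for node_count in (1, 10):
        hours = scan_hours(10, 75, node_count)
        print(f"{node_count:>2} node(s): {hours:.1f} h")
```

The point of the slide survives the rounding: the time falls linearly with the number of nodes, which is the case for scaling out.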

  8. Architecture – Core Components
    HDFS
    Distributed filesystem designed for low cost storage and high bandwidth access across the cluster.
    Map-Reduce
    Programming model for processing and generating large data sets.
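
To make the Map-Reduce programming model concrete, here is a minimal single-process sketch of a word count. It is not Hadoop code: `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical names, and the in-memory grouping stands in for the shuffle that a real cluster performs across machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values seen for one key into a total."""
    return word, sum(counts)


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Run map, group values by key (the shuffle), then reduce each group."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    lines = ["big data big cluster", "data node"]
    print(run_mapreduce(lines, map_words, reduce_counts))
```

The programmer only writes the two small functions; splitting the input, grouping by key, and running the steps in parallel are the framework's job.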

  9. HDFS - Design
    • Files are stored as blocks (64MB default size)
    • Configurable data replication (3x, Rack Aware*)
    • Fault Tolerant, Expects HW failures
    • HUGE files, Expects Streaming not Low Latency
    • Mostly WORM
    • Not POSIX compliant
    • Not mountable OOTB*
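
As a rough illustration of what the block size and replication defaults mean for capacity planning, the sketch below computes the block count and raw disk footprint of a single file. It assumes the 64MB block size and 3x replication from this slide and that a final partial block only occupies its actual size on disk; the function names are invented for the example.

```python
import math

BLOCK_SIZE_MB = 64   # default block size quoted on the slide
REPLICATION = 3      # default replication factor quoted on the slide


def block_count(file_mb: float, block_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of blocks a file is split into (the last one may be partial)."""
    return math.ceil(file_mb / block_mb)


def raw_storage_mb(file_mb: float, replication: int = REPLICATION) -> float:
    """Raw disk used across the cluster once every block is replicated."""
    return file_mb * replication


if __name__ == "__main__":
    size_mb = 1024  # a 1 GB file
    blocks = block_count(size_mb)
    print(f"{blocks} blocks, {blocks * REPLICATION} block replicas, "
          f"{raw_storage_mb(size_mb):.0f} MB raw")
```

Each block is also an entry in the NameNode's metadata, which is one reason the design favors a few huge files over many small ones.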

  10. Architecture - HDFS
    [Diagram: HDFS, with the Namenode (NN) above Datanode 1, Datanode 2 through Datanode N]
    Namenode - Master
    • Filesystem metadata
    • File R/W control
    • Block replication
    Datanode - Slaves
    • Block R/W for clients
    • Replicates blocks as directed by the master
    • Notifies master about block-ids
    Client asks NN for file
    NN returns DNs that have it
    Client asks DN for data
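
The three-step read path on this slide (ask the NameNode, get DataNode locations, fetch from a DataNode) can be sketched as a toy in-memory model. Everything here is hypothetical and simplified for illustration: the class and method names are not the HDFS API, and a real client also weighs replica distance and handles failures.

```python
from typing import Dict, List, Tuple


class DataNode:
    """Stores block contents; knows nothing about files."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[str, bytes] = {}

    def store(self, block_id: str, data: bytes) -> None:
        self.blocks[block_id] = data

    def read(self, block_id: str) -> bytes:
        return self.blocks[block_id]


class NameNode:
    """Holds metadata only: file -> block ids, block id -> DataNodes."""

    def __init__(self) -> None:
        self.files: Dict[str, List[str]] = {}
        self.locations: Dict[str, List[DataNode]] = {}

    def add_block(self, path: str, block_id: str, datanodes: List[DataNode]) -> None:
        self.files.setdefault(path, []).append(block_id)
        self.locations[block_id] = list(datanodes)

    def get_block_locations(self, path: str) -> List[Tuple[str, List[DataNode]]]:
        return [(block_id, self.locations[block_id]) for block_id in self.files[path]]


def read_file(namenode: NameNode, path: str) -> bytes:
    """Client side: ask the NameNode where blocks live, then read from DataNodes."""
    chunks = []
    for block_id, datanodes in namenode.get_block_locations(path):
        chunks.append(datanodes[0].read(block_id))  # toy choice: first replica
    return b"".join(chunks)


if __name__ == "__main__":
    dn1, dn2 = DataNode("dn1"), DataNode("dn2")
    nn = NameNode()
    for block_id, data in (("blk_1", b"hello "), ("blk_2", b"world")):
        for dn in (dn1, dn2):
            dn.store(block_id, data)
        nn.add_block("/demo.txt", block_id, [dn1, dn2])
    print(read_file(nn, "/demo.txt"))
```

Note that file data never passes through the NameNode in this model; it only answers the metadata question, which is what keeps a single master from becoming the bandwidth bottleneck.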

  11. HDFS - Fault tolerance
    • DataNode
    ‣ Uses CRC32 to detect corruption
    ‣ Data is replicated on other nodes (3x)*
    • NameNode
    ‣ fsimage - last snapshot
    ‣ edits - log of changes since last snapshot
    ‣ Checkpoint Node
    ‣ Backup NameNode
    ‣ Failover is manual*
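
The DataNode side of this slide combines two ideas: a CRC32 checksum exposes a damaged copy, and replication supplies another copy to fall back on. The sketch below shows that combination using Python's standard `zlib.crc32`. It is an illustration only: the helper names are invented, and it checksums one whole buffer, whereas HDFS checksums data in smaller chunks.

```python
import zlib
from typing import Iterable, Optional


def crc32_of(data: bytes) -> int:
    """CRC32 of a buffer as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


def read_first_intact(replicas: Iterable[bytes], expected_crc: int) -> Optional[bytes]:
    """Return the first replica whose checksum matches, or None if all are bad."""
    for data in replicas:
        if crc32_of(data) == expected_crc:
            return data
    return None


if __name__ == "__main__":
    original = b"block-0001 payload"
    stored_crc = crc32_of(original)         # recorded when the block was written
    corrupted = b"block-0001 pay1oad"       # one byte flipped on a bad disk
    survivor = read_first_intact([corrupted, original], stored_crc)
    print("recovered intact copy:", survivor == original)
```

A checksum alone only detects the problem; it is the extra replicas that make recovery possible.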

  12. Architecture - MapReduce
    [Diagram: a client starts a job through the Jobs API on the JobTracker (JT), which sits above TaskTracker 1, TaskTracker 2 through TaskTracker N]
    JobTracker - Master
    • Accepts MR jobs submitted by clients
    • Assigns MR tasks to TaskTrackers
    • Monitors tasks and TaskTracker status, re-executes tasks upon failure
    • Speculative execution
    TaskTracker - Slaves
    • Runs MR tasks received from JobTracker
    • Manages storage and transmission of intermediate output

  13. Hadoop - Core Architecture
    [Diagram: JobTracker (Jobs API) and NameNode (HDFS) as masters; worker nodes shown as pairs: TaskTracker 1 / DataNode 1, TaskTracker 2 / DataNode 2, through TaskTracker N / DataNode N]
    * Mini OS: Filesystem & Scheduler

  14. Hadoop 2.0 - HDFS Architecture
    • Distributed Namespace
    • Multiple Block Pools

  15. Hadoop 2.0 - YARN Architecture

  16. MapReduce - Clients
    Java - Native
    hadoop jar jar_path main_class input_path output_path
    C++ - Pipes framework
    hadoop pipes -input path_in -output path_out -program exec_program
    Any – Streaming
    hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out
    Pig Latin, Hive HQL, C via JNI
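
For the Streaming option, the mapper and reducer are ordinary programs that read lines on standard input and write tab-separated key/value lines on standard output. The sketch below is one possible word-count pair in a single script; the file name `wordcount.py` and the `map`/`reduce` argument convention are assumptions made for this example, not part of Hadoop.

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Mapper: emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines: Iterable[str]) -> Iterator[str]:
    """Reducer: sum counts per word; input must arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical convention: first argument selects the role.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for output_line in step(sys.stdin):
        print(output_line)
```

With the slide's command this would be passed as `-mapper "python wordcount.py map" -reducer "python wordcount.py reduce"`, assuming Python is installed on the worker nodes and the script has been made available to them. The same pair can be tried locally with a shell pipeline that sorts the mapper output before the reducer, which imitates the sort Hadoop performs between the two phases.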

  17. Hadoop - Ecosystem
    Storage: HDFS, HBase
    Data Processing: MapReduce
    Management: ZooKeeper, Chukwa
    Data Access: Pig, Sqoop
    Also shown: Hive, MPI, Giraph, Hama, Impala, Ambari, HUE, Flume, Mahout

  18. Installation - Platforms
    Production
    Linux – Official
    Development
    Linux
    OSX
    Windows via Cygwin
    *Nix

  19. Installation - Versions
    Public Numbering
    1.0.x - current stable version
    1.1.x - current beta version for 1.x branch
    2.x - current alpha version
    Development Numbering
    0.20.x aka 1.x - CDH 3 & HDP 1
    0.23.x aka 2.x - CDH 4 & HDP 2 (alpha)

  20. Installation - For toying
    Option 1 - Official project releases
    hadoop.apache.org/common/releases.html
    Option 2 - Demo VM from vendors
    • Cloudera
    • Hortonworks
    • Greenplum
    • MapR
    Option 3 - Cloud
    • Amazon’s EMR
    • Hadoop on Azure

  21. Installation - For real
    Vendor distributions
    • Cloudera CDH
    • Hortonworks HDP
    • Greenplum GPHD
    • MapR M3, M5 or M7
    Hosted solutions
    • AWS EMR
    • Hadoop on Azure
    Use Virtualization - VMware Serengeti *

  22. Security - Simple Mode
    • Use in a trusted environment
    ‣ Identity comes from euid of the client process
    ‣ MapReduce tasks run as the TaskTracker user
    ‣ User that starts the NameNode is super-user
    • Reasonable protection against accidental misuse
    • Simple to set up
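
The first sub-point, identity taken from the effective uid of the client process, explains why simple mode only suits a trusted environment. The sketch below shows how a Unix process resolves its own effective uid to a user name with the standard `os` and `pwd` modules. It illustrates the idea rather than Hadoop's actual client code, and the function name is made up.

```python
import os
import pwd


def client_identity() -> str:
    """User name for this process's effective uid (Unix only)."""
    euid = os.geteuid()
    try:
        return pwd.getpwuid(euid).pw_name
    except KeyError:
        # No passwd entry for this uid (common in containers): fall back to the number.
        return str(euid)


if __name__ == "__main__":
    print("this process would present itself as:", client_identity())
```

Because the name is asserted by the client machine rather than verified by the cluster, anyone who controls a client host can present any user name. That is the gap Kerberos-based secure mode closes.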

  23. Security - Secure Mode
    • Kerberos based
    • Use for tight, granular access control
    ‣ Identity comes from Kerberos Principal
    ‣ MapReduce tasks run as Kerberos Principal
    • Use a dedicated MIT KDC
    • Hook it to your primary KDC (AD, etc.)
    • Significant setup effort (users, groups and Kerberos keys on all nodes, etc.)

  24. Monitoring
    Built-in
    • JMX
    • REST
    • No SNMP support
    Other
    Cloudera Manager (Free up to 50 nodes)
    Ambari - Free, RPM based systems (RH, CentOS)
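
Since there is no SNMP support, a common approach is to poll the built-in JMX data over HTTP and feed it to an existing monitoring system. The sketch below assumes the daemon's web UI serves its JMX beans as a JSON document with a top-level `beans` list under a `/jmx` path; verify that for your Hadoop version. The host name, port, bean name, and attribute names used here are placeholders, not real Hadoop metric names.

```python
import json
from typing import Any, Dict, List, Optional
from urllib.request import urlopen

# Hypothetical address; replace with your daemon's web UI host and port.
NAMENODE_URL = "http://namenode.example.com:50070"


def fetch_beans(base_url: str, timeout: float = 5.0) -> List[Dict[str, Any]]:
    """Download the JMX JSON document and return its list of beans."""
    with urlopen(f"{base_url}/jmx", timeout=timeout) as response:
        return json.load(response)["beans"]


def find_bean(beans: List[Dict[str, Any]], name: str) -> Optional[Dict[str, Any]]:
    """Return the bean with the given name, or None if it is absent."""
    for bean in beans:
        if bean.get("name") == name:
            return bean
    return None


def used_fraction(bean: Dict[str, Any], used_key: str, total_key: str) -> float:
    """Ratio of two numeric attributes, guarding against a zero total."""
    total = bean[total_key]
    return bean[used_key] / total if total else 0.0


if __name__ == "__main__":
    # Made-up payload in the assumed shape, so the example runs without a cluster.
    sample = '{"beans": [{"name": "example:service=Demo", "Used": 25, "Total": 100}]}'
    demo = find_bean(json.loads(sample)["beans"], "example:service=Demo")
    print(f"used: {used_fraction(demo, 'Used', 'Total'):.0%}")
```

Against a live cluster, `fetch_beans(NAMENODE_URL)` would replace the sample payload, and the resulting ratio could be handed to whatever alerting tool is already in place.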

  25. References
    Hadoop Operations, by Eric Sammer
    Hadoop Security, by Hortonworks Blog
    HDFS Federation, by Suresh Srinivas
    Hadoop 2.0 New Features, by VertiCloud Inc
    MapReduce in Simple Terms, by Saliya Ekanayake
    Hadoop Architecture, by Philippe Julio

  26. Questions?
    Ovidiu Dimulescu
    @odimulescu
    speakerdeck.com/odimulescu
