Hadoop on Azure - One elephant went out to play

Hadoop makes data storage and processing at scale available as a lower-cost, open solution. If you ever wanted to get your feet wet but found the elephant intimidating, fear no more.

We will explore several integration considerations from a Windows application perspective, such as accessing HDFS content, writing streaming jobs, and using the .NET SDK, as well as HDInsight on premises or on Azure.

Ovidiu Dimulescu

March 16, 2013

Transcript

  1. One elephant went out to play, Azure way
    Orlando Code Camp, 2013
    Ovidiu Dimulescu
    @odimulescu
    speakerdeck.com/odimulescu

  2. Agenda
    • Overview
    • Installation
    • Azure story
    • .Net Integration
    • MapReduce
    • Q & A

  3. About @odimulescu
    • Working on the Web since 1997
    • Organizer for JaxMUG.com
    • Co-Organizer for Jax Big Data meetup

  4. What is Hadoop?
    Apache Hadoop is an open source framework
    for running data-intensive applications on large
    clusters of commodity hardware

  5. What is it solving, and how?
    Processing diverse large datasets in practical time at low cost
    • Consolidates data in a distributed file system
    • Moves computation to data rather than data to computation
    • Simplifies the programming model
    [Diagram: eight boxes labeled CPU]

  6. Why does it matter?
    • Volume - Datasets outgrow local HDDs, let alone RAM
    • Velocity - Data grows at a tremendous pace
    • Variety - Data is heterogeneous
    • Value
      - Scaling up is expensive (licensing, CPUs, disks, fabric, etc.)
      - Scaling up has a ceiling (physical, technical, etc.)

  7. Why does it matter?
    [Chart: data types, complex vs. structured, with slices labeled 20% and 80%]
    Complex Data
    • Images, Video
    • Logs
    • Documents
    • Call records
    • Sensor data
    • Mail archives
    Structured Data
    • User Profiles
    • CRM
    • HR Records
    * Chart Source: IDC White Paper

  8. Use cases
    • ETL
    • Pattern Recognition
    • Recommendation Engines
    • Prediction Models
    • Log Processing
    • Data “sandbox”

  9. Who uses it?

  10. Who supports it?

  11. When not to use it?
    • Not a database replacement
    • Not a data warehouse; it complements one
    • Not for interactive reporting
    • Not a general purpose storage mechanism
    • Not for problems that are not parallelizable in a shared-nothing fashion *

  12. Architecture – Core Components
    HDFS
    Distributed filesystem designed for low cost storage
    and high bandwidth access across the cluster.
    MapReduce
    Simpler programming model for processing and
    generating large data sets.
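
The MapReduce programming model named on this slide can be illustrated with a minimal, single-process sketch. It is not Hadoop code: the function names (`map_words`, `reduce_counts`, `run_job`) are hypothetical, and the "shuffle" is just an in-memory dictionary standing in for what a cluster does across machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce phase: sum all the counts emitted for a single word."""
    return word, sum(counts)


def run_job(
    lines: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Run map, then group values by key (the shuffle), then reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Each key is reduced independently, which is what makes the
    # model easy to spread over many machines.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))
```

Calling `run_job(["one elephant went out to play", "one elephant"], map_words, reduce_counts)` returns a word-to-count dictionary; on a real cluster the mapper and reducer would run as separate tasks on different nodes.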

  13. Architecture - HDFS
    [Diagram: Namenode (NN) with Datanode 1, Datanode 2 ... Datanode N]
    Namenode - Master
    • Filesystem metadata
    • File R/W control
    • Block replication
    Datanode - Slaves
    • Block R/W for clients
    • Replicates blocks as directed by the master
    • Notifies master about block-ids
    Client asks NN for a file
    NN returns the DNs that have it
    Client asks a DN for the data
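
The read path on this slide (client asks the Namenode, the Namenode returns Datanodes, the client reads from a Datanode) can be sketched as a toy in-memory model. This is an illustration only, not the HDFS client API: the classes, the `get_block_locations` and `read_file` names, and the block ids are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DataNode:
    """Holds block contents; knows nothing about file names."""
    name: str
    blocks: Dict[str, bytes] = field(default_factory=dict)

    def read_block(self, block_id: str) -> bytes:
        return self.blocks[block_id]


@dataclass
class NameNode:
    """Holds metadata only: which blocks make up a file, and where they live."""
    files: Dict[str, List[str]] = field(default_factory=dict)
    locations: Dict[str, List[str]] = field(default_factory=dict)

    def get_block_locations(self, path: str) -> List[Tuple[str, List[str]]]:
        return [(block_id, self.locations[block_id]) for block_id in self.files[path]]


def read_file(path: str, namenode: NameNode, datanodes: Dict[str, DataNode]) -> bytes:
    """Ask the namenode where the blocks are, then read each from a datanode."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        for name in holders:
            node = datanodes.get(name)  # None models a datanode that is down
            if node is not None and block_id in node.blocks:
                data += node.read_block(block_id)
                break
        else:
            raise IOError(f"no live replica for block {block_id}")
    return data
```

The point of the split is that file data never flows through the Namenode; it only answers the metadata question, and replicas let a read succeed when one Datanode is missing.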

  14. Architecture - MapReduce
    [Diagram: client starts a job through the Jobs API; JobTracker (JT) with TaskTracker 1, TaskTracker 2 ... TaskTracker N]
    JobTracker - Master
    • Accepts MR jobs submitted by clients
    • Assigns MR tasks to TaskTrackers
    • Monitors tasks and TaskTracker status, re-executes tasks upon failure
    • Speculative execution
    TaskTracker - Slaves
    • Runs MR tasks received from JobTracker
    • Manages storage and transmission of intermediate output

  15. Architecture - Core Hadoop
    [Diagram: JobTracker (Jobs API) and NameNode (HDFS), with TaskTracker 1 / DataNode 1, TaskTracker 2 / DataNode 2 ... TaskTracker N / DataNode N]
    * Mini OS: Filesystem & Scheduler

  16. Hadoop - Ecosystem
    Storage: HDFS, HBase
    Data Processing: MapReduce
    Management: ZooKeeper, Chukwa
    Data Access: Pig, Hive
    Also shown: Mahout, Giraph, Hama, Ambari, HUE, Sqoop, Stinger, Impala
  17. Hadoop - Ecosystem
    Storage: HDFS, HBase
    Data Processing: MapReduce
    Management: ZooKeeper, Chukwa
    Data Access: Pig, Impala, Hive
    Also shown: Mahout, Giraph, Hama, Sqoop, Ambari, HUE, Stinger
  18. Installation - Platform Notes
    Production
    • Linux – Official
    Development
    • Linux
    • OSX
    • Windows via Cygwin *
    • Other Unixes

  19. Installation
    1. Download & configure single-node cluster
    hadoop.apache.org/common/releases.html
    2. Download a demo VM
    Cloudera, Hortonworks, MapR, etc.
    3. Download MS HDInsight Server
    4. Cloud: Amazon EMR, Azure HDInsight Service

  20. Hadoop - Azure Story
    Name: Windows Azure HDInsight Service
    Where: Hadoop on Azure dot com
    Status: Public Preview
    * On-premise: Microsoft HDInsight Server

  21. Hadoop - Azure Story

  22. Hadoop - Azure Story

  23. Hadoop - Azure Story

  24. Hadoop - Azure Story

  25. Hadoop - Azure Story

  26. HDFS - .Net access
    Microsoft Distribution of Hadoop
    C library for HDFS file access
    Hadoop .Net HDFS File Access
    Managed C++ Solution

  27. HDFS - .Net access

  28. Hadoop .Net SDK
    hadoopsdk.codeplex.com
    • MapReduce
    • LINQ to Hive
    • WebHDFS Client
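
A WebHDFS client is a wrapper around Hadoop's WebHDFS REST interface, whose URLs take the form `http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>`. The sketch below only builds such URLs and is not the .Net SDK's API; the helper name `webhdfs_url`, the host name, and the port in the usage note are assumptions for illustration.

```python
from typing import Dict, Optional
from urllib.parse import quote, urlencode


def webhdfs_url(
    host: str,
    port: int,
    path: str,
    op: str,
    params: Optional[Dict[str, str]] = None,
) -> str:
    """Build a WebHDFS REST URL for an absolute HDFS path and an operation."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    # The operation goes first, followed by any extra query parameters.
    query = urlencode({"op": op, **(params or {})})
    # quote() escapes spaces and other unsafe characters but keeps "/".
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"
```

For example, `webhdfs_url("namenode.example", 50070, "/user/demo/input.txt", "OPEN", {"user.name": "demo"})` yields a URL that an HTTP GET could use to read the file, assuming WebHDFS is enabled on that cluster and the Namenode listens on that port.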

  29. Hadoop Integration
    ODBC Driver
    • Excel PowerPivot
    • Other BI tools
    Connector for Hadoop
    • Import / Export via SQOOP

  30. slideshare.net/esaliya/mapreduce-in-simple-terms
    by Saliya Ekanayake

  31. MapReduce - Clients
    Java - Native
    hadoop jar jar_path main_class input_path output_path
    C++ - Pipes framework
    hadoop pipes -input path_in -output path_out -program exec_program
    Any – Streaming
    hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out
    Pig Latin, Hive HQL, C via JNI
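
Because Streaming accepts any program that reads lines on stdin and writes tab-separated key/value lines on stdout, a word-count mapper and reducer can be sketched in a few lines. This is a minimal sketch, not the C# code from the following slides; the function names and the `map`/`reduce` command-line argument are hypothetical. It relies on Streaming delivering the reducer's input sorted by key.

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator, List, TextIO


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(lines: Iterable[str]) -> Iterator[str]:
    """Sum counts per word; assumes input lines arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def main(argv: List[str], stdin: TextIO = sys.stdin, stdout: TextIO = sys.stdout) -> None:
    """Run the mapper when called with 'map', otherwise the reducer."""
    step = mapper if argv[1:] == ["map"] else reducer
    for out in step(stdin):
        stdout.write(out + "\n")
```

Saved as a script (say, a hypothetical `wordcount.py` that ends with `main(sys.argv)`), the same file could serve as both `map_prog` and `reduce_prog` in the streaming command above, provided the script and a Python interpreter are available on the cluster nodes.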

  32. C# - Streaming - Mapper

  33. C# - Streaming - Reducer

  34. C# - .Net SDK Mapper & Reducer

  35. C# - .Net SDK Driver Class

  36. C# - .Net SDK Driver Class
    MRRunner -dll WordFrequency.dll -- input output
    MRRunner -dll WordFrequency.dll -class WordFrequency -- input output

  37. C# - .Net SDK Debugging

  38. References
    Hadoop at Yahoo!, by Y! Developer Network
    MapReduce in Simple Terms, by Saliya Ekanayake
    Hadoop on Azure, Getting Started
    Hadoop .Net SDK
    .Net HDFS File Access
    SQL Server Connector for Hadoop

  39. Questions?
    Ovidiu Dimulescu
    @odimulescu
    speakerdeck.com/odimulescu
