Hadoop on Azure - One elephant went out to play

One elephant went out to play, Azure way
Orlando Code Camp, 2013
Ovidiu Dimulescu @odimulescu
speakerdeck.com/odimulescu

Agenda
• Overview
• Installation
• Azure story
• .Net Integration
• MapReduce
• Q & A

About @odimulescu
• Working on the Web since 1997
• Organizer for JaxMUG.com
• Co-Organizer for Jax Big Data meetup

What is Hadoop?
Apache Hadoop is an open source framework for running data-intensive applications on large clusters of commodity hardware.

What is it solving, and how?
Processing diverse large datasets in practical time at low cost
• Consolidates data in a distributed file system
• Moves computation to data rather than data to computation
• Simplifies the programming model

Why does it matter?
• Volume - Datasets outgrow local HDDs, let alone RAM
• Velocity - Data grows at a tremendous pace
• Variety - Data is heterogeneous
• Value
  - Scaling up is expensive (licensing, CPUs, disks, fabric, etc.)
  - Scaling up has a ceiling (physical, technical, etc.)

Why does it matter?
[Chart: data types, complex vs. structured, with slices labeled 20% and 80%]
• Complex data: images, video, logs, documents, call records, sensor data, mail archives
• Structured data: user profiles, CRM, HR records
* Chart source: IDC White Paper

Use cases
• ETL
• Pattern Recognition
• Recommendation Engines
• Prediction Models
• Log Processing
• Data “sandbox”

Who uses it?

Who supports it?

When not to use?
• Not a database replacement
• Not a data warehouse; it complements one
• Not for interactive reporting
• Not a general purpose storage mechanism
• Not for problems that are not parallelizable in a shared-nothing fashion

Architecture - Core Components
HDFS: Distributed filesystem designed for low cost storage and high bandwidth access across the cluster.
MapReduce: Simpler programming model for processing and generating large data sets.

Architecture - HDFS
[Diagram: Namenode (NN) with Datanode 1, Datanode 2, ... Datanode N]
Namenode - Master
• Filesystem metadata
• Files R/W control
• Blocks replication
Datanode - Slaves
• Blocks R/W per clients
• Replicates blocks per master
• Notifies master about block-ids
Reading a file:
1. Client asks NN for file
2. NN returns DNs that have it
3. Client asks DN for data
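
The three-step read path on this slide can be sketched as a toy, in-memory model. This is only an illustration under stated assumptions: the class and function names (`NameNode`, `DataNode`, `put`, `get`) are hypothetical, the block size and replication factor are shrunk for readability, and block placement is plain round-robin rather than HDFS's real rack-aware policy.

```python
from itertools import cycle

BLOCK_SIZE = 8    # bytes; tiny on purpose (real HDFS blocks are tens of MB or more)
REPLICATION = 2   # assumption for this toy; the usual HDFS default is 3


class DataNode:
    """Slave: stores raw blocks keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Master: holds metadata only, never file contents."""

    def __init__(self, datanodes):
        self.file_blocks = {}       # path -> ordered list of block ids
        self.block_locations = {}   # block id -> DataNodes holding a replica
        self._next_node = cycle(datanodes)

    def allocate(self, path, num_blocks):
        block_ids = []
        for i in range(num_blocks):
            block_id = f"{path}#blk{i}"
            # Round-robin placement (hypothetical; real HDFS is rack-aware).
            self.block_locations[block_id] = [
                next(self._next_node) for _ in range(REPLICATION)
            ]
            block_ids.append(block_id)
        self.file_blocks[path] = block_ids
        return block_ids

    def locate(self, path):
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


def put(namenode, path, data):
    """Split data into blocks and store every replica."""
    chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for block_id, chunk in zip(namenode.allocate(path, len(chunks)), chunks):
        for datanode in namenode.block_locations[block_id]:
            datanode.write_block(block_id, chunk)


def get(namenode, path):
    """1. ask the NN for the file, 2. NN returns DNs, 3. ask a DN for data."""
    parts = []
    for block_id, datanodes in namenode.locate(path):
        parts.append(datanodes[0].read_block(block_id))
    return b"".join(parts)
```

Note that the NameNode object never touches file bytes; the client talks to the datanodes directly, which is the point of the slide's read path.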

Architecture - MapReduce
[Diagram: a client starts a job through the Jobs API; JobTracker (JT) with TaskTracker 1, TaskTracker 2, ... TaskTracker N]
JobTracker - Master
• Accepts MR jobs submitted by clients
• Assigns MR tasks to TaskTrackers
• Monitors tasks and TaskTracker status, re-executes tasks upon failure
• Speculative execution
TaskTracker - Slaves
• Runs MR tasks received from JobTracker
• Manages storage and transmission of intermediate output
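
The "re-executes tasks upon failure" responsibility can be sketched roughly as follows. Everything here is a simplification for illustration: trackers are modeled as plain callables, `run_job`, `healthy_tracker`, and `failing_tracker` are hypothetical names, and the attempt limit is an assumed value, not a statement about any particular Hadoop configuration.

```python
def healthy_tracker(task):
    """A tracker that runs the task and returns its result."""
    return task()


def failing_tracker(task):
    """A tracker that is down; every task handed to it fails."""
    raise RuntimeError("tracker lost")


def run_job(tasks, trackers, max_attempts=4):
    """Assign each task to a tracker; on failure, retry on the next tracker.

    max_attempts is an assumed limit for this sketch.
    """
    results = {}
    for task_id, task in tasks.items():
        for attempt in range(max_attempts):
            tracker = trackers[attempt % len(trackers)]
            try:
                results[task_id] = tracker(task)
                break                     # task succeeded, move on
            except RuntimeError:
                continue                  # re-execute on another tracker
        else:
            raise RuntimeError(f"task {task_id} failed {max_attempts} times")
    return results
```

Because map and reduce tasks are meant to be side-effect free, re-running one on a different node is safe, which is what makes this simple retry strategy workable.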

Architecture - Core Hadoop
[Diagram: JobTracker (Jobs API), NameNode (HDFS), and nodes 1, 2, ... N, each pairing a TaskTracker with a DataNode]
* Mini OS: Filesystem & Scheduler

Hadoop - Ecosystem
[Diagram: ecosystem projects grouped under Storage, Data Processing, Management, and Data Access]
HDFS, HBase, MapReduce, ZooKeeper, Chukwa, Pig, Hive, Mahout, Giraph, Hama, Ambari, HUE, Sqoop, Stinger, Impala


Installation - Platform Notes
Production
• Linux - Official
Development
• Linux
• OSX
• Windows via Cygwin
* Other Unixes

Installation
1. Download & configure a single-node cluster: hadoop.apache.org/common/releases.html
2. Download a demo VM: Cloudera, Hortonworks, MapR, etc.
3. Download MS HDInsight Server
4. Cloud: Amazon EMR, Azure HDInsight Service

Hadoop - Azure Story
Name: Windows Azure HDInsight Service
Where: Hadoop on Azure dot com
Status: Public Preview
* On-premise: Microsoft HDInsight Server

Hadoop - Azure Story

HDFS - .Net access
• Microsoft Distribution of Hadoop
• C library for HDFS file access
• Hadoop .Net HDFS File Access
• Managed C++ Solution

HDFS - .Net access

Hadoop .Net SDK
hadoopsdk.codeplex.com
• MapReduce
• LINQ to Hive
• WebHDFS Client
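
For context on the WebHDFS item: WebHDFS is Hadoop's REST interface to HDFS, with URLs of the form `http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>`. The sketch below only builds such URLs and makes no network call. The helper name `webhdfs_url`, the host name, the port, and the user name are illustrative assumptions, and nothing here describes how the .Net SDK client is implemented.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, params=None):
    """Build a WebHDFS REST URL for an absolute HDFS path.

    `op` is a WebHDFS operation name such as OPEN, LISTSTATUS,
    GETFILESTATUS, or MKDIRS; `params` holds extra query parameters.
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, **(params or {})})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical cluster address and user, for illustration only.
print(webhdfs_url("namenode.example", 50070, "/user/demo/input.txt", "OPEN",
                  {"user.name": "demo"}))
print(webhdfs_url("namenode.example", 50070, "/user/demo", "LISTSTATUS"))
```

OPEN and LISTSTATUS are issued as HTTP GET requests; operations that modify the filesystem, such as MKDIRS, use PUT.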

Hadoop Integration
• ODBC Driver: Excel, PowerPivot, other BI tools
• Connector for Hadoop: Import / Export via SQOOP

slideshare.net/esaliya/mapreduce-in-simple-terms by Saliya Ekanayake
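
As a companion to the referenced deck, the map, shuffle, and reduce phases can be mimicked in a single process. This is a minimal sketch of the programming model only, using word count as the example; the function names are illustrative, and nothing here is distributed or uses Hadoop itself.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count(["one elephant went out to play", "one elephant"]))
```

In a real cluster the map calls run in parallel next to the data, the framework performs the shuffle over the network, and each reducer sees one key group at a time.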

MapReduce - Clients
• Java - Native
  hadoop jar jar_path main_class input_path output_path
• C++ - Pipes framework
  hadoop pipes -input path_in -output path_out -program exec_program
• Any - Streaming
  hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out
• Pig Latin, Hive HQL, C via JNI
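
Since Streaming accepts any executable that reads stdin and writes stdout, the `map_prog` and `reduce_prog` placeholders in the streaming command can be sketched in Python. This is an illustrative word count, not the C# code shown in the talk; the script name `wordcount.py` and the `map` / `reduce` argument convention are assumptions. Streaming passes tab-separated `key<TAB>value` lines and delivers the reducer's input sorted by key.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(lines):
    """Sum the counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention: "wordcount.py map" or "wordcount.py reduce".
if __name__ == "__main__" and len(sys.argv) > 1:
    step = mapper if sys.argv[1] == "map" else reducer
    for output_line in step(sys.stdin):
        print(output_line)
```

The same script can be checked locally without a cluster, with `sort` standing in for the shuffle: `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`.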

C# - Streaming - Mapper

C# - Streaming - Reducer

C# - .Net SDK Mapper & Reducer

C# - .Net SDK Driver Class

C# - .Net SDK Driver Class
MRRunner -dll WordFrequency.dll -- input output
MRRunner -dll WordFrequency.dll -class WordFrequency -- input output

C# - .Net SDK Debugging

References
• Hadoop at Yahoo!, by Y! Developer Network
• MapReduce in Simple Terms, by Saliya Ekanayake
• Hadoop on Azure, Getting Started
• Hadoop .Net SDK
• .Net HDFS File Access
• SQL Server Connector for Hadoop

Questions?
Ovidiu Dimulescu @odimulescu
speakerdeck.com/odimulescu
