Slide 1

Slide 1 text

One elephant went out to play, Azure way
Orlando Code Camp, 2013
Ovidiu Dimulescu
@odimulescu
speakerdeck.com/odimulescu

Slide 2

Slide 2 text

Agenda
• Overview
• Installation
• Azure story
• .Net Integration
• MapReduce
• Q & A

Slide 3

Slide 3 text

About @odimulescu
• Working on the Web since 1997
• Organizer for JaxMUG.com
• Co-Organizer for Jax Big Data meetup

Slide 4

Slide 4 text

What is Hadoop?
Apache Hadoop is an open source framework for running data-intensive applications on large clusters of commodity hardware.

Slide 5

Slide 5 text

What is it solving, and how?
Processing diverse large datasets in practical time at low cost
• Consolidates data in a distributed file system
• Moves computation to data rather than data to computation
• Simplifies programming model
[Diagram: eight CPUs]

Slide 6

Slide 6 text

Why does it matter?
• Volume - Datasets outgrow local HDDs, let alone RAM
• Velocity - Data grows at tremendous pace
• Variety - Data is heterogeneous
• Value
  - Scaling up is expensive (licensing, CPUs, disks, fabric, etc.)
  - Scaling up has a ceiling (physical, technical, etc.)

Slide 7

Slide 7 text

Why does it matter?
[Chart: data types by share, 20% and 80%; legend: Complex, Structured]
• Complex Data: Images, Video, Logs, Documents, Call records, Sensor data, Mail archives
• Structured Data: User Profiles, CRM, HR Records
* Chart Source: IDC White Paper

Slide 8

Slide 8 text

Use cases
• ETL
• Pattern Recognition
• Recommendation Engines
• Prediction Models
• Log Processing
• Data “sandbox”

Slide 9

Slide 9 text

Who uses it?

Slide 10

Slide 10 text

Who supports it?

Slide 11

Slide 11 text

When not to use?
• Not a database replacement
• Not a data warehouse; it complements one
• Not for interactive reporting
• Not a general purpose storage mechanism
• Not for problems that cannot be parallelized in a shared-nothing fashion

Slide 12

Slide 12 text

Architecture - Core Components
• HDFS: Distributed filesystem designed for low cost storage and high bandwidth access across the cluster.
• MapReduce: Simpler programming model for processing and generating large data sets.
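To make the MapReduce programming model concrete, here is a minimal in-memory sketch of its three steps (map, shuffle, reduce) applied to word counting. It is not Hadoop code: the function names are made up for illustration, and everything runs in one process, whereas Hadoop spreads the same steps across the cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values by key, the step the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the three steps over a small in-memory input."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["one elephant went out to play", "one elephant"]))
```

The programmer only writes the map and reduce functions; grouping by key, distribution, and retries are the framework's job, which is what "simpler programming model" refers to.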

Slide 13

Slide 13 text

Architecture - HDFS
Namenode - Master
• Filesystem metadata
• Files R/W control
• Blocks replication
Datanode - Slaves
• Blocks R/W per clients
• Replicates blocks per master
• Notifies master about block-ids
Read flow
1. Client asks NN for file
2. NN returns the DNs that have it
3. Client asks DN for data
[Diagram: HDFS with Namenode (NN) and Datanode 1, Datanode 2, ... Datanode N]
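The read flow on this slide can be sketched as a toy model. The classes and helpers below (`NameNode`, `DataNode`, `write_file`, `read_file`) are hypothetical and live in one process; the tiny block size and the round-robin replica placement are assumptions chosen for illustration, not how HDFS actually places blocks.

```python
from typing import Dict, List, Tuple


class DataNode:
    """Toy datanode: stores block contents keyed by block id."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[str, bytes] = {}

    def read_block(self, block_id: str) -> bytes:
        return self.blocks[block_id]


class NameNode:
    """Toy namenode: holds metadata only, never file contents."""

    def __init__(self) -> None:
        self.file_blocks: Dict[str, List[str]] = {}           # path -> ordered block ids
        self.block_locations: Dict[str, List[DataNode]] = {}  # block id -> replicas

    def locate(self, path: str) -> List[Tuple[str, List[DataNode]]]:
        """Return each block of the file with the datanodes that hold it."""
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


def write_file(namenode: NameNode, datanodes: List[DataNode], path: str,
               data: bytes, block_size: int = 4, replication: int = 2) -> None:
    """Split data into blocks and place replicas round-robin (illustrative only)."""
    block_ids = []
    for index, start in enumerate(range(0, len(data), block_size)):
        block_id = f"{path}#{index}"
        targets = [datanodes[(index + r) % len(datanodes)] for r in range(replication)]
        for datanode in targets:
            datanode.blocks[block_id] = data[start:start + block_size]
        namenode.block_locations[block_id] = targets
        block_ids.append(block_id)
    namenode.file_blocks[path] = block_ids


def read_file(namenode: NameNode, path: str) -> bytes:
    """Steps 1-2: ask the namenode where blocks live. Step 3: read from a datanode."""
    data = b""
    for block_id, replicas in namenode.locate(path):
        data += replicas[0].read_block(block_id)
    return data


if __name__ == "__main__":
    nodes = [DataNode(f"dn{i}") for i in range(3)]
    nn = NameNode()
    write_file(nn, nodes, "/logs/a.txt", b"hello hadoop")
    print(read_file(nn, "/logs/a.txt"))
```

The point to notice is that file bytes never pass through the namenode: it only answers "which datanodes hold which blocks", and the client then talks to the datanodes directly.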

Slide 14

Slide 14 text

Architecture - MapReduce
JobTracker - Master
• Accepts MR jobs submitted by clients
• Assigns MR tasks to TaskTrackers
• Monitors tasks and TaskTracker status, re-executes tasks upon failure
• Speculative execution
TaskTracker - Slaves
• Runs MR tasks received from JobTracker
• Manages storage and transmission of intermediate output
[Diagram: Client starts a job through the Jobs API on the JobTracker (JT), which drives TaskTracker 1, TaskTracker 2, ... TaskTracker N]
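As a rough illustration of "assigns tasks and re-executes them upon failure", the sketch below models a JobTracker as a loop over a queue of tasks. Everything here is hypothetical: trackers are plain functions, assignment is simple round-robin, a failure is a `RuntimeError`, and speculative execution is not modeled.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


def healthy(task: str) -> str:
    """Stand-in for a TaskTracker that completes every task."""
    return f"{task}:done"


def flaky(task: str) -> str:
    """Stand-in for a TaskTracker that fails every task."""
    raise RuntimeError(f"tracker lost while running {task}")


def run_job(tasks: List[str],
            trackers: List[Callable[[str], str]],
            max_attempts: int = 3) -> Dict[str, str]:
    """Assign tasks round-robin to trackers; re-queue a task when it fails."""
    pending: Deque[Tuple[str, int]] = deque((task, 1) for task in tasks)
    results: Dict[str, str] = {}
    next_tracker = 0
    while pending:
        task, attempt = pending.popleft()
        tracker = trackers[next_tracker % len(trackers)]
        next_tracker += 1
        try:
            results[task] = tracker(task)
        except RuntimeError:
            if attempt >= max_attempts:
                raise  # give up: the whole job fails
            pending.append((task, attempt + 1))  # re-execute later on the next tracker
    return results


if __name__ == "__main__":
    print(run_job(["map-0", "map-1", "map-2"], [flaky, healthy]))
```

Because tasks are independent and side-effect free, re-running one elsewhere is safe, which is what lets the framework hide node failures from the job author.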

Slide 15

Slide 15 text

Architecture - Core Hadoop
[Diagram: the JobTracker (Jobs API) and NameNode (HDFS) as masters; each worker node 1..N pairs a TaskTracker with a DataNode]
* Mini OS: Filesystem & Scheduler

Slide 16

Slide 16 text

Hadoop - Ecosystem
[Diagram] Layers: Storage, Data Processing, Management, Data Access
Projects: HDFS, HBase, MapReduce, ZooKeeper, Chukwa, Pig, Hive, Mahout, Giraph, Hama, Ambari, HUE, Sqoop, Stinger, Impala

Slide 17

Slide 17 text

Hadoop - Ecosystem
[Diagram] Layers: Storage, Data Processing, Management, Data Access
Projects: HDFS, HBase, MapReduce, ZooKeeper, Chukwa, Pig, Impala, Hive, Mahout, Giraph, Hama, Sqoop, Ambari, HUE, Stinger

Slide 18

Slide 18 text

Installation - Platform Notes
Production
• Linux - Official
Development
• Linux
• OSX
• Windows via Cygwin
* Other Unixes

Slide 19

Slide 19 text

Installation
1. Download & configure single-node cluster
   hadoop.apache.org/common/releases.html
2. Download a demo VM
   Cloudera, Hortonworks, MapR, etc.
3. Download MS HDInsight Server
4. Cloud: Amazon EMR, Azure HDInsight Service

Slide 20

Slide 20 text

Hadoop - Azure Story
Name: Windows Azure HDInsight Service
Where: Hadoop on Azure dot com
Status: Public Preview
* On-premises: Microsoft HDInsight Server

Slide 21

Slide 21 text

Hadoop - Azure Story

Slide 22

Slide 22 text

Hadoop - Azure Story

Slide 23

Slide 23 text

Hadoop - Azure Story

Slide 24

Slide 24 text

Hadoop - Azure Story

Slide 25

Slide 25 text

Hadoop - Azure Story

Slide 26

Slide 26 text

HDFS - .Net access
• Microsoft Distribution of Hadoop
• C library for HDFS file access
• Hadoop .Net HDFS File Access
• Managed C++ Solution

Slide 27

Slide 27 text

HDFS - .Net access

Slide 28

Slide 28 text

Hadoop .Net SDK
hadoopsdk.codeplex.com
• MapReduce
• LINQ to Hive
• WebHDFS Client
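WebHDFS is Hadoop's REST interface to HDFS, so any WebHDFS client ultimately issues HTTP requests of the form `http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>`. The sketch below only builds such URLs; it is not the .Net SDK's API. The helper name, host name, port, and user are assumptions for illustration.

```python
from typing import Dict, Optional
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, port: int, path: str, op: str,
                params: Optional[Dict[str, str]] = None) -> str:
    """Build a WebHDFS REST URL for an absolute HDFS path and an operation name."""
    query = urlencode({"op": op, **(params or {})})
    # quote() keeps '/' intact and escapes characters such as spaces
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


if __name__ == "__main__":
    # Hypothetical namenode address; adjust host and port to the actual cluster.
    print(webhdfs_url("namenode.example.com", 50070, "/user/demo/input.txt", "OPEN"))
    print(webhdfs_url("namenode.example.com", 50070, "/user/demo", "LISTSTATUS",
                      {"user.name": "demo"}))
```

`OPEN` reads a file and `LISTSTATUS` lists a directory; fetching the resulting URL with any HTTP client returns the file bytes or a JSON listing.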

Slide 29

Slide 29 text

Hadoop Integration
ODBC Driver
• Excel
• PowerPivot
• Other BI tools
Connector for Hadoop
• Import / Export via SQOOP

Slide 30

Slide 30 text

slideshare.net/esaliya/mapreduce-in-simple-terms by Saliya Ekanayake

Slide 31

Slide 31 text

MapReduce - Clients
• Java - Native
  hadoop jar jar_path main_class input_path output_path
• C++ - Pipes framework
  hadoop pipes -input path_in -output path_out -program exec_program
• Any - Streaming
  hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out
• Pig Latin, Hive HQL, C via JNI
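Streaming works with any executable because the contract is just text over stdin and stdout: the mapper prints `key<TAB>value` lines, the framework sorts them by key, and the reducer reads the sorted lines. A minimal word-count sketch of that contract follows; it is not the code from the talk, and the script name `wc.py` and its `map` / `reduce` argument are assumptions for illustration.

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts per word; relies on the input arriving sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines if line.strip())
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Run as "python wc.py map" or "python wc.py reduce"; stdin is only read
# when a role is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

The same pipeline can be tried locally with `cat input.txt | python wc.py map | sort | python wc.py reduce`. On a cluster, the Streaming command listed above would take `"python wc.py map"` as `map_prog` and `"python wc.py reduce"` as `reduce_prog`, with the script shipped to the nodes (for example via the `-file` option).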

Slide 32

Slide 32 text

C# - Streaming - Mapper

Slide 33

Slide 33 text

C# - Streaming - Reducer

Slide 34

Slide 34 text

C# - .Net SDK Mapper & Reducer

Slide 35

Slide 35 text

C# - .Net SDK Driver Class

Slide 36

Slide 36 text

C# - .Net SDK Driver Class
MRRunner -dll WordFrequency.dll -- input output
MRRunner -dll WordFrequency.dll -class WordFrequency -- input output

Slide 37

Slide 37 text

C# - .Net SDK Debugging

Slide 38

Slide 38 text

References
• Hadoop at Yahoo!, by Y! Developer Network
• MapReduce in Simple Terms, by Saliya Ekanayake
• Hadoop on Azure, Getting Started
• Hadoop .Net SDK
• .Net HDFS File Access
• SQL Server Connector for Hadoop

Slide 39

Slide 39 text

Questions?
Ovidiu Dimulescu
@odimulescu
speakerdeck.com/odimulescu