Slide 1

Introduction to Hadoop and Big Data

Slide 2

Who am I
• Joe Alex – Software Architect / Data Scientist
  – Loves to code in Java and Scala
  – Areas of interest: Big Data, Data Analytics, Machine Learning, Hadoop, Cassandra
  – Currently working as Team Lead for the Managed Security Services Portal at Verizon

Slide 3

New kind of data
• Social – messages, posts, blogs, photos, videos, maps, graphs, friends
• Machine – sensors, firewalls, routers, logs, metrics, health monitoring, cell phones, credit card transactions

Slide 4

New kind of data
• Volume – massive, TB to PB
  – Convert 350 billion annual meter readings to better predict power consumption
  – Turn 12 terabytes of Tweets created each day into improved product sentiment analysis
• Types – structured, semi-structured, and unstructured
  – Text, audio, video, click streams, logs, machine data
  – Monitor hundreds of live video feeds from surveillance cameras to target points of interest
• Velocity (time sensitive) – ideally processed as it is streaming (real-time, near-real-time, batch)
  – Scrutinize 5 million trade events created each day to identify potential fraud
  – Analyze 500 million daily call detail records in real time to predict customer churn faster

Slide 5

What is Big Data about
• We are drowning in a sea of data; sometimes we throw a lot of it away
• Still, we can't make much sense of it
• We consider data a cost
• But data is an opportunity
• This is what Big Data is about
  – New insights
  – New business

Slide 6

Big Data Domains
• Digital marketing
• Data discovery – patterns, trends
• Fraud detection
• Machine-generated data analytics – remote device insight, sensing, location-based intelligence
• Social
• Data retention

Slide 7

Big Data Architecture
• Traditional – high availability, RDBMS, structured data
• Big Data – high scalability/availability/flexibility, compute and storage on the same nodes, structured/semi-structured/unstructured data

Slide 8

How to tackle Big Data
• Layered architecture
  – Speed layer
  – Batch layer

Slide 9

Apache Hadoop
• Open source project under the Apache Software Foundation
• Based on papers published by Google
  – MapReduce: http://research.google.com/archive/mapreduce.html
  – GFS: http://research.google.com/archive/gfs.html

Slide 10

Reliability
• "Failure is the defining difference between distributed and local programming"
  – Ken Arnold, CORBA Designer

Slide 11

Why Hadoop
• Data processed by Google every month: 400 PB… in 2007
• Average job size: 180 GB
• Time 180 GB of data would take to read sequentially off a single disk drive: 45 minutes
• Solution: parallel reads
  – 1 HDD = 75 MB/sec
  – 1,000 HDDs = 75 GB/sec (far more acceptable)
• Data access speed is the bottleneck
• We can process data very quickly, but we can only read/write it very slowly

Slide 12

Core Components
• Hadoop consists of two core components
  – The Hadoop Distributed File System (HDFS)
  – MapReduce
• There are many other projects based around core Hadoop
  – Often referred to as the "Hadoop Ecosystem"
  – Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.
• A set of machines running HDFS and MapReduce is known as a Hadoop cluster
• Individual machines are known as nodes; a cluster can have as few as one node or as many as several thousand
• More nodes = better performance

Slide 13

System Requirements
• System should support partial failure
  – Failure of one part of the system should result in a graceful decline in performance, not a full halt
• System should support data recoverability
  – If components fail, their workload should be picked up by still-functioning units
• System should support individual recoverability
  – Nodes that fail and restart should be able to rejoin the group activity without a full group restart

Slide 14

System Requirements (cont'd)
• System should be consistent
  – Concurrent operations or partial internal failures should not cause the results of the job to change
• System should be scalable
  – Adding increased load to a system should not cause outright failure; instead, it should result in a graceful decline
• Increasing resources should support a proportional increase in load capacity

Slide 15

Hadoop's radical approach
• Hadoop provides a radical approach to these issues:
  – Nodes talk to each other as little as possible – probably never
  – This is known as a "shared nothing" architecture
  – Programmers should not write code that explicitly communicates between nodes
• Data is spread throughout the machines in the cluster
  – Data distribution happens when data is loaded onto the cluster
• Instead of bringing the data to the processors, Hadoop brings the processing to the data

Slide 16

Hadoop's radical approach
• Batch oriented
• Data locality (code is shipped around)
• Heavy parallelization
• Process management
• Append-only files
• Express your computation in MapReduce and get parallelism and scalability for free

Slide 17

Core Hadoop Daemons
• Each node in a Hadoop installation runs one or more daemons executing MapReduce code or HDFS commands. Each daemon's responsibilities in the cluster are:
  – NameNode: manages HDFS and communicates with every DataNode daemon in the cluster
  – JobTracker: dispatches jobs and assigns splits to mappers or reducers as each stage completes
  – TaskTracker: executes tasks sent by the JobTracker and reports status
  – DataNode: manages HDFS content on the node and reports status to the NameNode

Slide 18

Config files
• hadoop-env.sh — environment settings, JVM configuration, etc.
• core-site.xml — site-wide configuration
• hdfs-site.xml — HDFS block size, NameNode and DataNode directories
• mapred-site.xml — total MapReduce tasks, JobTracker address
• masters, slaves files — NameNode, JobTracker, DataNode, and TaskTracker addresses, as appropriate
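
To make the roles of these files concrete, here is a minimal sketch of what core-site.xml and hdfs-site.xml might contain on a Hadoop 1.x cluster. The hostname is hypothetical, and exact property names and defaults vary between Hadoop versions and distributions.

    <!-- core-site.xml: point clients and daemons at the NameNode (hostname is hypothetical) -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://namenode.example.com:8020</value>
      </property>
    </configuration>

    <!-- hdfs-site.xml: replication factor and block size (64 MB, expressed in bytes) -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
      <property>
        <name>dfs.block.size</name>
        <value>67108864</value>
      </property>
    </configuration>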

Slide 19

HDFS: Hadoop Distributed File System
• Based on Google's GFS (Google File System)
• Provides redundant storage of massive amounts of data
  – Using cheap, unreliable computers
• At load time, data is distributed across all nodes
  – Provides for efficient MapReduce processing

Slide 20

HDFS Assumptions
• High component failure rates
  – Inexpensive components fail all the time
• "Modest" number of HUGE files
  – Just a few million
  – Each file likely to be 100 MB or larger
  – Multi-gigabyte files typical
• Large streaming reads
  – Not random access
• High sustained throughput should be favored over low latency

Slide 21

HDFS Features
• Operates 'on top of' an existing filesystem
• Files are stored as 'blocks'
  – Much larger than for most filesystems
  – Default is 64 MB
• Provides reliability through replication
  – Each block is replicated across three or more DataNodes
• A single NameNode stores metadata and coordinates access
  – Provides simple, centralized management
• No data caching
  – Would provide little benefit due to large datasets and streaming reads
• Familiar interface, but a customized API
  – Simplify the problem and focus on distributed applications
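
To illustrate the "familiar interface" point, the sketch below reads a file from HDFS through the Java FileSystem API (org.apache.hadoop.fs). The path is hypothetical and error handling is omitted; this is an illustrative sketch rather than code from the presentation.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCat {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // handle to the configured filesystem (HDFS)
        Path path = new Path("/user/demo/input.txt");    // hypothetical HDFS path
        FSDataInputStream in = fs.open(path);            // streams the file's blocks from DataNodes
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {     // large streaming read, no random access
          System.out.println(line);
        }
        reader.close();
        fs.close();
      }
    }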

Slide 22

HDFS Block diagram

Slide 23

MapReduce
• MapReduce is a method for distributing a task across multiple nodes in the Hadoop cluster
• Consists of two phases: Map, and then Reduce
  – Between the two is a stage known as the shuffle and sort
• Each Map task operates on a discrete portion of the overall dataset – typically one HDFS block of data
• After all Maps are complete, the MapReduce system distributes the intermediate data to the nodes which perform the Reduce phase

Slide 24

Features of MapReduce
• Automatic parallelization and distribution
• Fault tolerance
• Status and monitoring tools
• A clean abstraction for programmers
  – MapReduce programs are usually written in Java
  – Can be written in any scripting language using Hadoop Streaming
  – All of Hadoop is written in Java
• MapReduce abstracts all the "housekeeping" away from the developer
  – The developer can concentrate simply on writing the Map and Reduce functions

Slide 25

MapReduce example
• Map

    // assume input is a set of text files; k is a line offset, v is the line for that offset
    let map(k, v) =
      for each word in v:
        emit(word, 1)

• Reduce

    // k is a word, vals is a list of 1s
    let reduce(k, vals) =
      emit(k, vals.length())

Slide 26

MapReduce High Level

Slide 27

Map Process
• map(in_key, in_value) -> (out_key, out_value) list

Slide 28

Reduce Process
• reduce(out_key, out_value list) -> (final_key, final_value) list

Slide 29

MapReduce

Slide 30

Streaming API
• Many organizations have developers skilled in languages other than Java
  – Perl, Ruby, Python, etc.
• The Streaming API allows developers to use any language they wish to write Mappers and Reducers
  – As long as the language can read from standard input and write to standard output
• Advantages of the Streaming API:
  – No need for non-Java coders to learn Java
  – Fast development time
  – Ability to use existing code libraries
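
As a rough illustration of how a streaming job is launched: the mapper and reducer are supplied as executables that read from stdin and write to stdout. The jar location and the script names below are assumptions that vary by installation and version.

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -input /user/demo/input \
        -output /user/demo/output \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py -file reducer.py    # ship the scripts to the cluster nodes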

Slide 31

Job Driver
• Driver code:

    JobConf conf = new JobConf(WordCount.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(WordMapper.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);
    conf.setReducerClass(SumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    JobClient.runJob(conf);

• The driver is submitted to the Hadoop cluster for processing, along with the rest of the code in a .jar file.

Slide 32

Mapper
• The basic Java code implementation for the mapper has the form:

    public class WordMapper extends MapReduceBase implements Mapper {
      public void map(LongWritable key, Text value,
                      OutputCollector collector, Reporter reporter) throws IOException {
        /* implementation here */
      }
    }

• The implementation itself uses standard Java text manipulation tools; you can use regular expressions, scanners, whatever is necessary (one possible implementation is sketched below).
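
Here is one possible way the body could be filled in for the word-count example, using the classic org.apache.hadoop.mapred API shown above. This is an illustrative sketch, not necessarily the presenter's original code.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class WordMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> collector, Reporter reporter)
          throws IOException {
        // Split the input line into words and emit (word, 1) for each one
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          collector.collect(word, ONE);
        }
      }
    }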

Slide 33

Reducer
• Reducer code:

    public class SumReducer extends MapReduceBase implements Reducer {
      public void reduce(Text key, Iterator values,
                         OutputCollector collector, Reporter reporter) throws IOException {
        /* implementation */
      }
    }

• The reducer iterates over the values generated for each key in the previous step and sums up the occurrences of the word (see the sketch below).
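
A corresponding sketch of the summing reducer, again using the classic API (illustrative, not necessarily the original code):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> collector, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();                  // each value is a 1 emitted by the mapper
        }
        collector.collect(key, new IntWritable(sum));  // emit (word, total count)
      }
    }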

Slide 34

Input/Output Formats
• Input formats
  – KeyValueTextInputFormat — each line represents a key and a value, delimited by a separator
  – TextInputFormat — the key is the byte offset, the value is the text itself for each line
  – SequenceFileInputFormat — raw format of serialized key/value pairs
• Output formats
  – Specify the final output
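
With the classic JobConf-based API used in the driver slide, the formats would be selected roughly like this (TextInputFormat and TextOutputFormat are the defaults if nothing is set):

    // Input format: controls how input files are split into (key, value) pairs
    conf.setInputFormat(KeyValueTextInputFormat.class);    // key and value per line, separator-delimited
    // conf.setInputFormat(TextInputFormat.class);         // key = byte offset, value = line text (default)
    // conf.setInputFormat(SequenceFileInputFormat.class); // serialized key/value pairs

    // Output format: controls how the final (key, value) pairs are written
    conf.setOutputFormat(TextOutputFormat.class);          // plain text, key<TAB>value per line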

Slide 35

Hadoop Ecosystem: Hive
• Hive
  – SQL-based data warehousing app
  – Data analysts are far more familiar with SQL than with Java
  – Hive allows users to query data using HiveQL, a language very similar to standard SQL
  – Hive turns HiveQL queries into standard MapReduce jobs, automatically runs the jobs, and displays the results to the user
  – Note that Hive is not an RDBMS
    • Results take many seconds, minutes, or even hours to be produced
    • It is not possible to modify the data using HiveQL
  – Provides features for analyzing very large data sets
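
To give a flavor of HiveQL, a query like the one below (the table and columns are hypothetical) is compiled by Hive into one or more MapReduce jobs behind the scenes:

    -- Count requests per status code for one day of (hypothetical) web logs
    SELECT status, COUNT(*) AS hits
    FROM web_logs
    WHERE log_date = '2013-01-01'
    GROUP BY status
    ORDER BY hits DESC;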

Slide 36

Hadoop Ecosystem: Pig
• Pig
  – Data-flow oriented language
  – Pig can be used as an alternative to writing MapReduce jobs in Java (or some other language)
  – Provides a scripting language known as Pig Latin
  – Abstracts MapReduce details away from the user
  – Made up of a set of operations that are applied to the input data to produce output
  – Fairly easy to write complex tasks such as joins of multiple datasets
  – Under the covers, Pig Latin scripts are converted to MapReduce jobs
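
As an illustration, the classic word count takes only a few lines of Pig Latin (the paths are hypothetical); Pig compiles this script into MapReduce jobs:

    -- Load lines of text, split them into words, and count each word
    lines  = LOAD '/user/demo/input' AS (line:chararray);
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS total;
    STORE counts INTO '/user/demo/wordcount';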

Slide 37

Hadoop Ecosystem: HBase
• HBase
  – Distributed, sparse, column-oriented datastore
    • Distributed: designed to use multiple machines to store and serve data
    • Sparse: each row may or may not have values for all columns
    • Column-oriented: data is stored grouped by column rather than by row; columns are grouped into 'column families', which define which columns are physically stored together
  – Leverages HDFS
  – Modeled after Google's BigTable datastore
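
A minimal sketch of the (pre-1.0) HBase Java client API, writing and then reading a single cell; the table name, column family, and values are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();    // reads hbase-site.xml from the classpath
        HTable table = new HTable(conf, "webtable");          // hypothetical table name

        // Write one cell: row "row1", column family "cf", qualifier "url"
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("url"), Bytes.toBytes("http://example.com"));
        table.put(put);

        // Read the cell back
        Get get = new Get(Bytes.toBytes("row1"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("url"));
        System.out.println(Bytes.toString(value));

        table.close();
      }
    }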

Slide 38

Hadoop Ecosystem: Others
• Flume
  – Flume is a distributed, reliable, available service for efficiently moving large amounts of data as it is produced
  – Ideally suited to gathering logs from multiple systems and inserting them into HDFS as they are generated
• Sqoop
  – Sqoop is "the SQL-to-Hadoop database import tool"
  – Designed to import data from an RDBMS into Hadoop
  – Can also send data the other way, from Hadoop to an RDBMS
  – Uses JDBC to connect to the RDBMS
• Oozie
  – Workflow scheduler for chaining Hadoop jobs into dataflows

Slide 39

Hadoop Ecosystem: Others
• ZooKeeper
  – Distributed consensus engine
  – Provides well-defined concurrent access semantics:
    • Leader election
    • Service discovery
    • Distributed locking / mutual exclusion
• Avro
  – Serialization and RPC framework
• Mahout
  – Machine learning library

Slide 40

Next Gen
• Storm – distributed real-time computation; makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing
• Spark – an open source cluster computing system that aims to make data analytics fast
• Impala – low-latency, near real-time queries on Hadoop data

Slide 41

Questions
• Twitter: @joealex
• Email: [email protected]