
Intro to Big Data using Hadoop

Sergejus
April 21, 2012


Transcript

  1. Information is powerful… but it is how we use it that will define us
  2. Big Data (globally)
     – creates over 30 billion pieces of content per day
     – stores 30 petabytes of data
     – produces over 90 million tweets per day
  3. Big Data (our example)
     – logs over 300 gigabytes of transactions per day
     – stores more than 1.5 terabytes of aggregated data
  4. Big Data Challenges
     – sorting 10 TB on a single node takes ~2.5 days; a 100-node cluster does it in ~35 minutes
     – the speedup is roughly linear: 2.5 days ≈ 3,600 minutes, and 3,600 / 100 ≈ 36 minutes
  5. Big Data Challenges
     – “Fat” servers imply high cost: use cheap commodity nodes instead
     – a large number of cheap nodes implies frequent failures: leverage automatic fault-tolerance
  6. Big Data Challenges
     We need a new data-parallel programming model for clusters of commodity machines
  7. MapReduce
     – published in 2004 by Google: "MapReduce: Simplified Data Processing on Large Clusters"
     – popularized by the Apache Hadoop project: used by Yahoo!, Facebook, Twitter, Amazon, …
  8. Word Count Example (diagram: Input → Map → Shuffle & Sort → Reduce → Output)
     Input lines: "the quick brown fox", "the fox ate the mouse", "how now brown cow"
     Output: the, 3; brown, 2; fox, 2; how, 1; now, 1; quick, 1; ate, 1; mouse, 1; cow, 1
  9. Word Count Example (diagram, continued: the Map stage)
     Each mapper emits (word, 1) for every word in its input line:
     "the quick brown fox" → the, 1; quick, 1; brown, 1; fox, 1
     "the fox ate the mouse" → the, 1; fox, 1; ate, 1; the, 1; mouse, 1
     "how now brown cow" → how, 1; now, 1; brown, 1; cow, 1
  10. Word Count Example (diagram, continued: Shuffle & Sort and Reduce)
      Shuffle & Sort groups the pairs by key: the, [1,1,1]; brown, [1,1]; fox, [1,1]; how, [1]; now, [1]; quick, [1]; ate, [1]; mouse, [1]; cow, [1]
      Each reducer sums the grouped values: the, 3; brown, 2; fox, 2; how, 1; now, 1; quick, 1; ate, 1; mouse, 1; cow, 1
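      A minimal Java sketch of the two phases above, written against the Hadoop MapReduce API; the class names WordCountMapper and WordCountReducer are illustrative, not taken from the deck:

      import java.io.IOException;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;

      // Map phase: emit (word, 1) for every word in the input line.
      class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          protected void map(LongWritable offset, Text line, Context context)
                  throws IOException, InterruptedException {
              for (String token : line.toString().split("[^a-zA-Z]+")) {
                  if (!token.isEmpty()) {
                      word.set(token.toLowerCase());
                      context.write(word, ONE);
                  }
              }
          }
      }

      // Reduce phase: sum the 1s that the shuffle grouped under each word.
      class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable count : counts) {
                  sum += count.get();
              }
              context.write(word, new IntWritable(sum));
          }
      }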
  11. Hadoop Overview
      – open source implementation of the Google MapReduce paper and the Google File System (GFS) paper
      – first release in 2008 by Yahoo!, with wide adoption by Facebook, Twitter, Amazon, etc.
  12. Hadoop Core (HDFS)
      Layer diagram: MapReduce (Job Scheduling / Execution System) on top of the Hadoop Distributed File System (HDFS)
      • Name Node stores file metadata
      • files split into 64 MB blocks
      • blocks replicated across 3 Data Nodes
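      A minimal sketch of writing a file into HDFS with the Java FileSystem API; the path is hypothetical, and the block size and replication settings simply mirror the defaults named on the slide (64 MB blocks, 3 replicas):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsWriteSketch {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              conf.setLong("dfs.block.size", 64L * 1024 * 1024); // 64 MB blocks (slide default)
              conf.setInt("dfs.replication", 3);                 // 3 replicas per block (slide default)

              FileSystem fs = FileSystem.get(conf);              // metadata operations go to the Name Node
              Path path = new Path("/example/count/input.txt");  // hypothetical path
              FSDataOutputStream out = fs.create(path);          // block contents are streamed to Data Nodes
              out.writeBytes("the quick brown fox\n");
              out.close();
              fs.close();
          }
      }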
  13. Hadoop Core (HDFS)
      Diagram: MapReduce (Job Scheduling / Execution System) on top of HDFS, with the Name Node and Data Nodes inside the HDFS layer
  14. Hadoop Core (MapReduce)
      • Job Tracker distributes tasks and handles failures
      • tasks are assigned based on data locality
      • Task Trackers can execute multiple tasks
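      A minimal sketch of the job submission side, assuming the WordCountMapper and WordCountReducer classes from the earlier sketch and hypothetical input/output paths; once submitted, the Job Tracker schedules the map tasks on Task Trackers, preferring nodes that already hold the input blocks (data locality):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCountDriver {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              Job job = new Job(conf, "word count"); // Hadoop 1.x style; later releases use Job.getInstance(conf)
              job.setJarByClass(WordCountDriver.class);

              job.setMapperClass(WordCountMapper.class);
              job.setReducerClass(WordCountReducer.class);
              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(IntWritable.class);

              FileInputFormat.addInputPath(job, new Path("/example/count/input"));    // hypothetical paths
              FileOutputFormat.setOutputPath(job, new Path("/example/count/output"));

              // waitForCompletion submits the job to the Job Tracker and polls until it finishes.
              System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
      }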
  15. Hadoop Core (MapReduce)
      Diagram: the Job Tracker and Task Trackers in the MapReduce layer, alongside the Name Node and Data Nodes in the HDFS layer
  16. Hadoop Ecosystem
      Diagram: components built around HDFS and MapReduce, including HBase, Pig (ETL), Hive (BI), Sqoop (RDBMS), Avro, and Zookeeper
  17. JavaScript MapReduce

      // map: emit (word, 1) for every alphabetic word in the input value
      var map = function (key, value, context) {
          var words = value.split(/[^a-zA-Z]/);
          for (var i = 0; i < words.length; i++) {
              if (words[i] !== "") {
                  context.write(words[i].toLowerCase(), 1);
              }
          }
      };

      // reduce: sum the counts grouped under each word
      var reduce = function (key, values, context) {
          var sum = 0;
          while (values.hasNext()) {
              sum += parseInt(values.next());
          }
          context.write(key, sum);
      };
  18. Pig

      words = LOAD '/example/count' AS (word: chararray, count: int);
      popular_words = ORDER words BY count DESC;
      top_popular_words = LIMIT popular_words 10;
      DUMP top_popular_words;
  19. Hive

      CREATE EXTERNAL TABLE WordCount (
          word string,
          count int
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE
      LOCATION "/example/count";

      SELECT * FROM WordCount ORDER BY count DESC LIMIT 10;