Slide 1

Slide 1 text

Intro to Big Data using Hadoop
Sergejus Barinovas
sergejus.blogas.lt
fb.com/ITishnikai
@sergejusb

Slide 2

Slide 2 text

Information is powerful… but it is how we use it that will define us

Slide 3

Slide 3 text

Data Explosion (picture from Big Data Integration)
relational, text, audio, video, images

Slide 4

Slide 4 text

Big Data (globally)
– creates over 30 billion pieces of content per day
– stores 30 petabytes of data
– produces over 90 million tweets per day

Slide 5

Slide 5 text

Big Data (our example)
– logs over 300 gigabytes of transactions per day
– stores more than 1.5 terabytes of aggregated data

Slide 6

Slide 6 text

4 Vs of Big Data
– volume
– velocity
– variety
– variability

Slide 7

Slide 7 text

Big Data Challenges
Sort 10 TB
– on 1 node = 2.5 days
– on a 100-node cluster = 35 mins
( log ) → (log )
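A quick check of what the two sort times imply, as a sketch only. The inputs are the figures quoted on the slide; the variable names are made up for illustration.

```javascript
// Sort times for 10 TB as quoted on the slide.
const singleNodeMinutes = 2.5 * 24 * 60; // 2.5 days expressed in minutes
const clusterMinutes = 35;               // 100-node cluster
const nodes = 100;

// Speedup: how many times faster the cluster finishes the same sort.
const speedup = singleNodeMinutes / clusterMinutes;

// Efficiency: speedup relative to the number of nodes used.
const efficiency = speedup / nodes;

console.log(`speedup: ${speedup.toFixed(1)}x`);
console.log(`efficiency: ${(efficiency * 100).toFixed(0)}%`);
```

With these figures the speedup comes out at roughly 100x, so the quoted cluster scales about linearly with the number of nodes.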

Slide 8

Slide 8 text

Big Data Challenges
“Fat” servers imply high cost
– use cheap commodity nodes instead
A large number of cheap nodes implies frequent failures
– leverage automatic fault tolerance

Slide 9

Slide 9 text

Big Data Challenges
We need a new data-parallel programming model for clusters of commodity machines

Slide 10

Slide 10 text

MapReduce to the rescue!

Slide 11

Slide 11 text

MapReduce
Published in 2004 by Google
– MapReduce: Simplified Data Processing on Large Clusters
Popularized by the Apache Hadoop project
– used by Yahoo!, Facebook, Twitter, Amazon, …

Slide 12

Slide 12 text

MapReduce
map: (k1, v1) → list(k2, v2)
reduce: (k2, list(v2)) → list(k3, v3)

Slide 13

Slide 13 text

Word Count Example
Stages: Input → Map → Shuffle & Sort → Reduce → Output (3 map tasks, 2 reduce tasks)
Input
– the quick brown fox
– the fox ate the mouse
– how now brown cow
Output
– reducer 1: the, 3; brown, 2; fox, 2; how, 1; now, 1
– reducer 2: quick, 1; ate, 1; mouse, 1; cow, 1

Slide 14

Slide 14 text

Word Count Example
Stages: Input → Map → Shuffle & Sort → Reduce → Output (3 map tasks, 2 reduce tasks)
Map output
– the quick brown fox → the, 1; quick, 1; brown, 1; fox, 1
– the fox ate the mouse → the, 1; fox, 1; ate, 1; the, 1; mouse, 1
– how now brown cow → how, 1; now, 1; brown, 1; cow, 1
Shuffle & Sort (pairs delivered to each reducer)
– reducer 1: the, 1; brown, 1; fox, 1; the, 1; fox, 1; the, 1; how, 1; now, 1; brown, 1
– reducer 2: quick, 1; ate, 1; mouse, 1; cow, 1

Slide 15

Slide 15 text

Word Count Example
Stages: Input → Map → Shuffle & Sort → Reduce → Output (3 map tasks, 2 reduce tasks)
Input
– the quick brown fox
– the fox ate the mouse
– how now brown cow
Reduce input (values grouped by key)
– reducer 1: the, [1,1,1]; brown, [1,1]; fox, [1,1]; how, [1]; now, [1]
– reducer 2: quick, [1]; ate, [1]; mouse, [1]; cow, [1]
Output
– reducer 1: the, 3; brown, 2; fox, 2; how, 1; now, 1
– reducer 2: quick, 1; ate, 1; mouse, 1; cow, 1
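The word count walk-through can be imitated in a single process. This is only a sketch of the three stages: the function names are hypothetical, everything runs in memory, and nothing is partitioned across separate map or reduce tasks as it would be on a cluster.

```javascript
// Input lines from the slide.
const input = [
  "the quick brown fox",
  "the fox ate the mouse",
  "how now brown cow",
];

// Map: emit a (word, 1) pair for every word in a line.
function mapLine(line) {
  return line.split(/\s+/).filter(Boolean).map((word) => [word, 1]);
}

// Shuffle & sort: collect the values of each key, then order by key.
function shuffle(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return [...groups.entries()].sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
}

// Reduce: sum the list of ones collected for a word.
function reduceGroup([word, ones]) {
  return [word, ones.reduce((sum, n) => sum + n, 0)];
}

const mapped = input.flatMap(mapLine);   // e.g. ["the", 1], ["quick", 1], ...
const grouped = shuffle(mapped);         // e.g. ["the", [1, 1, 1]], ...
const counts = grouped.map(reduceGroup); // e.g. ["the", 3], ...

console.log(counts);
```

Each stage only sees the output of the previous one, which is what allows a real cluster to run many map and reduce tasks side by side.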

Slide 16

Slide 16 text

MapReduce philosophy
– hide complexity
– make it scalable
– make it cheap

Slide 17

Slide 17 text

MapReduce popularized by the Apache Hadoop project

Slide 18

Slide 18 text

Hadoop Overview
Open source implementation of
– Google MapReduce paper
– Google File System (GFS) paper
First release in 2008 by Yahoo!
– wide adoption by Facebook, Twitter, Amazon, etc.

Slide 19

Slide 19 text

Hadoop Core MapReduce (Job Scheduling / Execution System) Hadoop Distributed File System (HDFS)

Slide 20

Slide 20 text

Hadoop Core (HDFS)
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
• Name Node stores file metadata
• files split into 64 MB blocks
• blocks replicated across 3 Data Nodes
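To make the block and replication figures concrete, here is a small back-of-the-envelope sketch. The 64 MB block size and the 3 replicas come from the slide; the helper name and the 1 GB example file are assumptions for illustration.

```javascript
const BLOCK_SIZE_MB = 64; // block size from the slide
const REPLICATION = 3;    // replicas per block from the slide

// Hypothetical helper: how a file of the given size would be laid out.
function hdfsFootprint(fileSizeMb) {
  const blocks = Math.ceil(fileSizeMb / BLOCK_SIZE_MB);
  return {
    blocks,                                 // entries the Name Node tracks
    blockReplicas: blocks * REPLICATION,    // copies spread over Data Nodes
    rawStorageMb: fileSizeMb * REPLICATION, // disk space used across the cluster
  };
}

// A 1 GB (1024 MB) file: 16 blocks, 48 block replicas, 3072 MB of raw storage.
console.log(hdfsFootprint(1024));
```

The trade-off is visible in the last figure: every byte stored costs three bytes of disk, which is the price paid for tolerating Data Node failures.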

Slide 21

Slide 21 text

Hadoop Core (HDFS) MapReduce (Job Scheduling / Execution System) Hadoop Distributed File System (HDFS) Name Node Data Node

Slide 22

Slide 22 text

Hadoop Core (MapReduce)
• Job Tracker distributes tasks and handles failures
• tasks are assigned based on data locality
• Task Trackers can execute multiple tasks
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
Name Node Data Node Job Tracker
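A toy illustration of assigning tasks by data locality. This is a greatly simplified greedy sketch, not how the Job Tracker is actually implemented; the node names, block names, slot counts, and the `assignTasks` function are all hypothetical.

```javascript
// Hypothetical cluster state: which nodes hold a replica of each block,
// and how many free task slots each Task Tracker has.
const blockLocations = {
  "block-1": ["node-a", "node-b", "node-c"],
  "block-2": ["node-b", "node-c", "node-d"],
  "block-3": ["node-a", "node-c", "node-d"],
};
const freeSlots = { "node-a": 1, "node-b": 1, "node-c": 0, "node-d": 2 };

// Assign one map task per block, preferring a node that already stores it.
function assignTasks(locations, slots) {
  const remaining = { ...slots };
  const assignments = {};
  for (const [block, replicas] of Object.entries(locations)) {
    // First choice: a replica holder with a free slot (data-local task).
    let node = replicas.find((n) => remaining[n] > 0);
    let local = true;
    if (node === undefined) {
      // Fallback: any node with a free slot; the block must travel over the network.
      node = Object.keys(remaining).find((n) => remaining[n] > 0);
      local = false;
    }
    if (node === undefined) continue; // no capacity left, the task has to wait
    remaining[node] -= 1;
    assignments[block] = { node, local };
  }
  return assignments;
}

console.log(assignTasks(blockLocations, freeSlots));
```

Moving the computation to the data keeps large blocks off the network; only when no replica holder has a free slot does the sketch fall back to a remote node.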

Slide 23

Slide 23 text

Hadoop Core (MapReduce) MapReduce (Job Scheduling / Execution System) Hadoop Distributed File System (HDFS) Name Node Data Node Job Tracker Task Tracker

Slide 24

Slide 24 text

Hadoop Core (Job submission) Name Node Data Node Job Tracker Task Tracker Client

Slide 25

Slide 25 text

Hadoop Ecosystem
Hadoop Distributed File System (HDFS)
MapReduce (Job Scheduling / Execution System)
HBase
Pig (ETL)
Hive (BI)
Sqoop (RDBMS)
Avro
ZooKeeper

Slide 26

Slide 26 text

JavaScript MapReduce

// Map: split the input line on non-letters and emit (word, 1) for each word.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// Reduce: sum all counts emitted for one word.
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
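The functions above depend on a `context` object and a `values` iterator supplied by the runtime. To try them without a cluster, one option is a small local harness like the sketch below. `makeContext`, `makeIterator`, and `runLocally` are hypothetical stand-ins that only mimic the `write`, `hasNext`, and `next` calls used on the slide; they are not part of any Hadoop API.

```javascript
// map and reduce, following the slide.
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};
var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};

// Hypothetical stand-in for the context: records every written pair.
function makeContext() {
  var pairs = [];
  return { pairs: pairs, write: function (k, v) { pairs.push([k, v]); } };
}

// Hypothetical stand-in for the values iterator passed to reduce.
function makeIterator(items) {
  var i = 0;
  return {
    hasNext: function () { return i < items.length; },
    next: function () { return items[i++]; },
  };
}

// Run map over each line, group the pairs by key, then run reduce per key.
function runLocally(lines) {
  var mapContext = makeContext();
  lines.forEach(function (line, offset) { map(offset, line, mapContext); });

  var groups = Object.create(null); // no prototype, so any word is a safe key
  mapContext.pairs.forEach(function (pair) {
    (groups[pair[0]] = groups[pair[0]] || []).push(pair[1]);
  });

  var reduceContext = makeContext();
  Object.keys(groups).sort().forEach(function (key) {
    reduce(key, makeIterator(groups[key]), reduceContext);
  });
  return reduceContext.pairs;
}

console.log(runLocally(["The quick brown fox", "the fox ate the mouse"]));
```

Because `map` lower-cases each word, "The" and "the" end up under the same key before `reduce` sums them.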

Slide 27

Slide 27 text

Pig

words = LOAD '/example/count' AS (
    word: chararray,
    count: int
);
popular_words = ORDER words BY count DESC;
top_popular_words = LIMIT popular_words 10;
DUMP top_popular_words;

Slide 28

Slide 28 text

Hive

CREATE EXTERNAL TABLE WordCount (
    word string,
    count int
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION "/example/count";

SELECT * FROM WordCount
ORDER BY count DESC
LIMIT 10;
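For comparison, the same "top words" question can be written out by hand. The sketch below assumes tab-delimited `word<TAB>count` rows, matching the table definition above; the sample rows and the `topWords` function are made up for illustration.

```javascript
// Hypothetical sample of tab-delimited rows, as a word-count job might write them.
const rows = ["the\t3", "quick\t1", "brown\t2", "fox\t2"];

// Parse each row, order by count descending, and keep the first `limit` rows.
function topWords(lines, limit) {
  return lines
    .map((line) => {
      const [word, count] = line.split("\t");
      return { word, count: parseInt(count, 10) };
    })
    .sort((a, b) => b.count - a.count) // ORDER BY count DESC
    .slice(0, limit);                  // LIMIT
}

console.log(topWords(rows, 2));
```

Pig and Hive express these steps declaratively and turn them into MapReduce jobs, so the parsing, sorting, and limiting do not have to be coded by hand.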

Slide 29

Slide 29 text

Demo Hadoop in the Cloud Über Demo

Slide 30

Slide 30 text

Thanks! Questions?