MapReduce Dre

An Introduction to MapReduce
Gregory Ditzler
Drexel University, Department of Electrical & Computer Engineering
EESI Group Meeting (February 2015)

Modern Computing & Data Analytics
[Figure: data pipeline with the stages Instrumentation & Data Collection, Storage, Grid Computing, and Data Analytics; annotations: Mostly Append, Archiving, Data Explore, High Fidelity]

Challenges
“Mo Data, Mo Problems” – The Notorious B.I.G. (?)
- There are some massive sources of data: Google, Skype, Yahoo!, Facebook, Twitter.
- Classical data mining & machine learning run on a single machine.
- How can we easily and efficiently distribute computation?
- In general, performing distributed computing is not always straightforward. Case in point: Proteus.
- MapReduce was adapted from Google’s computation & data manipulation model.
Motivation
- Google: 20+ billion webpages × 20 kB = 400+ TB, and one computer reads 30-35 MB/sec from disk.
- That means it would take a single machine more than 4 months just to read the web's data from disk!
- And even longer to do something with it!
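The motivation numbers above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the helper name `read_time_days` is made up for illustration, and it assumes decimal units (1 kB = 10^3 bytes, 1 TB = 10^12 bytes) and one machine reading sequentially.

```python
SECONDS_PER_DAY = 86400

def read_time_days(total_bytes, bytes_per_second):
    """Days needed for one machine to read total_bytes at a fixed disk rate."""
    return total_bytes / bytes_per_second / SECONDS_PER_DAY

pages = 20e9             # 20 billion webpages
page_size_bytes = 20e3   # 20 kB per page
total_bytes = pages * page_size_bytes
print("total data: %.0f TB" % (total_bytes / 1e12))

for rate in (30e6, 35e6):  # 30 and 35 MB/sec
    days = read_time_days(total_bytes, rate)
    print("at %.0f MB/sec: %.0f days (about %.1f months)" % (rate / 1e6, days, days / 30))
```

At these rates the read alone takes between four and five months, which is the point of the slide: no single disk is fast enough, so the reads have to be spread over many machines.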

Architecture of a Cluster
- 2-10 Gbps backbone between racks
- 1 Gbps between any pair of nodes in a rack
- Each rack contains 16-64 nodes
[Figure: racks connected through switches; every node in a rack has its own memory, CPU, and disk]

The Essence of the Problem
The central problem with working on big data is that you cannot bring the data to the computation. Hadoop via MapReduce brings the computation to the data!

Apache Hadoop

What is Hadoop?
Apache Hadoop
- Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of servers.
- The servers are not specialized; they are commercial off-the-shelf (COTS) machines.
- Think of Hadoop as the operating system for big data.
- It works with one machine... and it works with thousands of machines! Extending a Hadoop cluster is easy.
- Core of any OS: (i) the ability to store files, and (ii) the ability to run applications.
The Cornerstones of Hadoop
- YARN (Yet Another Resource Negotiator): assigns CPU, memory, and storage to applications running on a Hadoop cluster. It's a bit like a scheduler, and YARN allows the use of applications other than MapReduce.
- HDFS (Hadoop Distributed File System): a file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system.

Why Hadoop?
Apache Hadoop
- Scalable: nodes can be added as needed, and there is no need to change data formats or how distributed programs are written.
- Cost effective: Hadoop is free! There is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all of the data.
- Flexible: Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources.
- Fault tolerant: when you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.

Distributed File Systems (DFS)
Why Should We Distribute the File System?
- Copying data over a network is extremely slow and can be a huge bottleneck during computation.
- Idea: instead of bringing the data to the computation, bring the computation to the data.
- MapReduce addresses these problems by using a DFS and a new programming model.
- If nodes fail, how do we store data persistently? With a distributed file system!
- Data are rarely updated in place. Reads and appends are far more common.
What Makes Up a Hadoop Cluster?
- Chunk servers: files are split into contiguous chunks (e.g., 16-64 MB) and each chunk is replicated (e.g., 2x or 3x). Replicas are stored in different racks.
- Master node (NameNode in Hadoop’s HDFS): stores metadata about where files are stored.
- Client library for file access: talks to the master to find chunk servers, then connects directly to the chunk servers to access data.
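To make the chunk-server idea concrete, here is a toy sketch of splitting a file into fixed-size chunks and spreading each chunk's replicas over different racks. The function names and the round-robin placement rule are assumptions chosen for illustration; this is not how HDFS actually decides placement.

```python
def split_into_chunks(file_size, chunk_size):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    return [(offset, min(chunk_size, file_size - offset))
            for offset in range(0, file_size, chunk_size)]

def place_replicas(num_chunks, racks, replication=3):
    """Assign each chunk to `replication` distinct racks (toy round-robin rule)."""
    if replication > len(racks):
        raise ValueError("need at least as many racks as replicas")
    return {chunk: [racks[(chunk + r) % len(racks)] for r in range(replication)]
            for chunk in range(num_chunks)}

MB = 2 ** 20
chunks = split_into_chunks(150 * MB, 64 * MB)   # a 150 MB file in 64 MB chunks
placement = place_replicas(len(chunks), ["rack0", "rack1", "rack2", "rack3"], replication=2)
for chunk_id, (offset, length) in enumerate(chunks):
    print("chunk %d: offset=%d length=%d racks=%s" % (chunk_id, offset, length, placement[chunk_id]))
```

Because every chunk lives on more than one rack, losing a node (or a whole rack) still leaves a readable copy, and the master only needs to remember the chunk-to-rack table.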

[Figure: File A and File B are each split into chunks (A1, A2, A3 and B1, B2, B3), and every chunk is replicated across the chunk servers]

The MapReduce Computational and Data Management Model
MapReduce
MapReduce addresses the challenges of cluster computing:
- Store data redundantly on multiple nodes for persistence and availability.
- Move the computation close to the data to minimize data movement.
- Provide a simple programming model that hides the complexity of distributed computing.
Example
- We have a huge text document that cannot be loaded into memory.
- Goal: count the number of times each distinct word appears in the file.
Applications
- Count k-mers for a potentially large k.
- Analyze web server logs to find popular URLs.
- Build language models.

MapReduce Example
- Case 1: The file is too large for memory, but all of the (word, count) pairs fit into memory. Build a hash table by sweeping through the file and incrementing a counter in the hash table every time we come across a word.
- Case 2: Not even the (word, count) pairs fit into memory.
    $ words document.txt | sort | uniq -c
  Here words stands for a program that prints the words of the file one per line. This pipeline captures the essence of MapReduce.
Map
- Scan the input file one record at a time.
- Extract something you care about from each record (the key).
Group by key
- Sort and shuffle.
Reduce
- Aggregate, summarize, filter, or transform.
- Write the result.
The Map and Reduce functions need to be changed to fit the problem you’re solving.
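The three stages can be sketched in a single process for the word-count problem. This is only an in-memory illustration of the data flow; the function names are hypothetical, and a real MapReduce job would run the map and reduce steps on different machines with the group-by-key step handled by the framework.

```python
from collections import defaultdict

def map_words(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield (word, 1)

def group_by_key(pairs):
    """Group by key: collect all values per key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_counts(key, values):
    """Reduce: aggregate the values of one key into a single count."""
    return (key, sum(values))

def word_count(lines):
    pairs = (pair for line in lines for pair in map_words(line))
    return [reduce_counts(key, values) for key, values in group_by_key(pairs).items()]

for word, count in word_count(["the cat sat", "the cat ran"]):
    print(word, count)
```

Only `map_words` and `reduce_counts` are specific to word counting; swapping those two functions is what adapts the same skeleton to a different problem.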

MapReduce on (part of) The Great Gatsby
The Raw Text
In my younger and more vulnerable years my father gave me some advice that I’ve been turning over in my mind ever since. “Whenever you feel like criticizing any one,” he told me, “just remember that all the people in this world haven’t had the advantages that you’ve had.” He didn’t say any more, but we’ve always been unusually communicative in a reserved way, and I understood that he meant a great deal more than that. In consequence, I’m inclined to reserve all judgments, a habit that has opened up many curious natures to me and also made me the victim of not a few veteran bores.
Map (first pairs emitted)
in 1, my 1, younger 1, and 1, more 1, vulnerable 1, years 1, my 1, father 1, gave 1, me 1, some 1, advice 1, that 1, i've 1, been 1, turning 1, over 1, in 1, my 1, mind 1, ever 1, since. 1, "whenever 1, you 1, feel 1, like 1, criticizing 1, any 1
Group by Key (first pairs after sorting)
a 1, a 1, a 1, a 1, a 1, abnormal 1, about 1, advantages 1, advice 1, all 1, all 1, also 1, always 1, and 1, and 1, and 1, and 1, and 1, any 1, any 1, appears 1, attach 1, been 1, been 1
Reduce (first results)
1 1, a 5, abnormal 1, about 1, advantages 1, advice 1, all 2, also 1, always 1, and 5, any 2, appears 1, attach 1, been 2, bores. 1, but 1, came 1, chapter 1, college 1, communicative 1, consequence 1, criticizing 1, curious 1, deal 1

Least Mean Squares (LMS) Algorithm
If we define a decision rule to be
$$h_\theta(x) = \sum_{i=0}^{n} \theta_i x_i$$
and a cost function to be
$$J(\theta) = \frac{1}{2} \sum_{j=1}^{m} \left(h_\theta(x^{(j)}) - y^{(j)}\right)^2,$$
we can use the Widrow-Hoff learning rule to find $\theta$, which is given by:
$$\theta_i \leftarrow \theta_i + \eta \sum_{j=1}^{m} \left(y^{(j)} - h_\theta(x^{(j)})\right) x_i^{(j)}$$
- Time consuming for large $m$.
- Map: compute the error for each point and its contribution to the update (key).
- Reduce: sum the returned values.
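One batch LMS update can be sketched as a map over the data points followed by a sum. The sketch below runs in a single process, uses hypothetical function names, and takes the per-point error as y - h(x) so that adding the summed contributions descends the cost J; the toy data set and learning rate are assumptions chosen so the iteration converges.

```python
def predict(theta, x):
    """Decision rule h_theta(x) = sum_i theta_i * x_i (x[0] is the constant 1)."""
    return sum(t * xi for t, xi in zip(theta, x))

def map_point(theta, x, y):
    """Map: one point's contribution to the update, (y - h(x)) * x_i for every i."""
    error = y - predict(theta, x)
    return [error * xi for xi in x]

def reduce_sum(contributions):
    """Reduce: add the per-point contributions coordinate by coordinate."""
    return [sum(column) for column in zip(*contributions)]

def lms_step(theta, X, Y, eta):
    total = reduce_sum([map_point(theta, x, y) for x, y in zip(X, Y)])
    return [t + eta * g for t, g in zip(theta, total)]

# Toy data generated from y = 1 + 2 * x, with a leading 1 for the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
Y = [1.0, 3.0, 5.0]
theta = [0.0, 0.0]
for _ in range(500):
    theta = lms_step(theta, X, Y, eta=0.1)
print(theta)
```

The map step touches each point independently, so the m points can be split across machines; only the short vector of summed contributions has to travel to the reducer.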

Hadoop MapReduce Environment
The MapReduce environment takes care of:
- Partitioning the input data
- Scheduling the program’s execution across a (potentially large) set of machines
- Performing the group-by-key step
- Handling node failures
- Managing required inter-machine communication
Users need to choose M map tasks and R reduce tasks. Rule of thumb: M > # of nodes in the cluster and R < M.
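One common way to decide which of the R reduce tasks receives a key is to hash the key and take it modulo R, so that every pair with the same key ends up at the same reducer. The sketch below illustrates that idea only; the function names are hypothetical, and `zlib.crc32` is used simply because it gives the same hash on every run, not because any particular framework uses it.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Reduce task index for a key: a stable hash of the key, modulo R."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers):
    """Route every (key, value) pair to the bucket of its reduce task."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return dict(buckets)

pairs = [("and", 1), ("a", 1), ("and", 1), ("been", 1), ("a", 1)]
for reducer, bucket in sorted(shuffle(pairs, num_reducers=3).items()):
    print("reducer %d gets %s" % (reducer, bucket))
```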

Failures within MapReduce
Mapper node failure
- Map tasks at the node that were completed or in progress are all reset to idle, because their output lives on the failed node's local disk.
- The idle tasks are eventually rescheduled on other workers.
Reducer node failure
- Only the in-progress tasks are reset to idle. Completed tasks on the worker have already been saved to the distributed file system, not the local file system.
- Idle reduce tasks are restarted on other workers.
Master node failure
- The MapReduce task is aborted and the client is notified.
Remember, node failures are rare!

The NPFS Algorithm [1]
[Figure, Map: the dataset $D$ is sampled into $D_1, D_2, \ldots, D_n$ and the feature selection algorithm $A(D_l, k)$ is run on each sample. Run $l$ produces the binary column $X_{:,l}$ of a matrix $X$ with one row per feature and one column per run.]
[Figure, Reduce & Inference: feature $j$ is relevant if $\sum_i X_{j,i} > \zeta_{\text{crit}}$.]
$$\Lambda(Z) = \frac{P(T(Z) \mid H_1)}{P(T(Z) \mid H_0)} \underset{H_0}{\overset{H_1}{\gtrless}} \zeta_{\text{crit}} \;\rightarrow\; \frac{\binom{n}{z} p_1^{z} (1 - p_1)^{n-z}}{\binom{n}{z} p_0^{z} (1 - p_0)^{n-z}} \underset{H_0}{\overset{H_1}{\gtrless}} \zeta_{\text{crit}}$$
$$\alpha = P(T(Z) > \zeta_{\text{crit}} \mid H_0)$$
[1] G. Ditzler, R. Polikar, and G. Rosen, “A bootstrap based Neyman-Pearson test for identifying variable importance,” IEEE Transactions on Neural Networks and Learning Systems, 2014.

The NPFS Algorithm
NPFS Pseudo Code
1. Run a FS algorithm $A$ on $n$ independently sampled data sets. Form a matrix $X \in \{0, 1\}^{K \times n}$, where $\{X\}_{il}$ is the Bernoulli random variable for feature $i$ on trial $l$.
2. Compute $\zeta_{\text{crit}}$ using equation (1), which requires $n$, $p_0$, and the Binomial inverse cumulative distribution function.
$$P(z > \zeta_{\text{crit}} \mid H_0) = 1 - \underbrace{P(z \le \zeta_{\text{crit}} \mid H_0)}_{\text{cumulative distribution function}} = \alpha \qquad (1)$$
3. Let $\{z\}_i = \sum_{l=1}^{n} \{X\}_{il}$. If $\{z\}_i > \zeta_{\text{crit}}$, then feature $i$ belongs in the relevant set; otherwise the feature is deemed non-relevant.
Concentration Inequality on $|\hat{p} - p|$ (Hoeffding’s bound)
If $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$, then for any $\epsilon > 0$, we have
$$P(|\hat{p} - p| \ge \epsilon) \le 2e^{-2n\epsilon^2},$$
where $\hat{p} = \frac{1}{n} Z_n$ and $Z_n = \sum_{l=1}^{n} X_l$.
How would we translate this into MapReduce code?
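Steps 2 and 3 of the pseudo code, together with the Hoeffding bound, can be sketched with the standard library alone. The function names are hypothetical, and because the Binomial distribution is discrete the sketch makes one assumption that the slide does not state: it takes $\zeta_{\text{crit}}$ to be the smallest integer whose tail probability under $H_0$ does not exceed $\alpha$, since equality in equation (1) is generally not attainable.

```python
from math import comb, exp

def binomial_cdf(k, n, p):
    """P(z <= k) for z ~ Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k + 1))

def zeta_crit(n, p0, alpha):
    """Smallest integer zeta with P(z > zeta | H0) <= alpha (assumed convention)."""
    for zeta in range(n + 1):
        if 1.0 - binomial_cdf(zeta, n, p0) <= alpha:
            return zeta
    return n

def relevant_features(X, p0, alpha):
    """Indices i whose row sum z_i exceeds zeta_crit; X has one row per feature."""
    threshold = zeta_crit(len(X[0]), p0, alpha)
    return [i for i, row in enumerate(X) if sum(row) > threshold]

def hoeffding_bound(n, eps):
    """Upper bound on P(|p_hat - p| >= eps) after n Bernoulli trials."""
    return 2.0 * exp(-2.0 * n * eps ** 2)

# Toy selection matrix: 3 features (rows) observed over n = 10 runs (columns).
X = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
]
print(zeta_crit(10, 0.5, 0.05))
print(relevant_features(X, p0=0.5, alpha=0.05))
print(hoeffding_bound(100, 0.1))
```

In a MapReduce setting each column of `X` would come from one mapper, and the row sums in `relevant_features` are what the reducer computes.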

NPFS in MapReduce might look something like

    # mapper.py
    # Assumes the local module base_selection exposes a callable named `base`
    # that returns one 0/1 indicator per feature.
    import sys
    from base_selection import base

    data, labels = [], []
    for line in sys.stdin:
        # each record: the label, then tab-separated feature values
        data_line = [float(x) for x in line.strip().split("\t")]
        labels.append(data_line[0])
        data.append(data_line[1:])

    sel_feat = base(data, labels)  # sel_feat[n] is 1 or 0
    for n, val in enumerate(sel_feat):
        # emit key<TAB>value, the format that reducer.py parses
        print("feature%d\t%d" % (n, val))

NPFS in MapReduce might look something like

    # reducer.py
    import sys

    current_feature, current_count = None, 0
    for line in sys.stdin:
        # input arrives sorted by key, so equal features are adjacent
        feature, count = line.strip().split("\t", 1)
        count = int(count)
        if current_feature == feature:
            current_count += count
        else:
            # a new key starts: flush the previous one (nothing to flush at the start)
            if current_feature is not None:
                print("%s\t%s" % (current_feature, current_count))
            current_count = count
            current_feature = feature

    # do not forget to output the last feature if needed!
    if current_feature is not None:
        print("%s\t%s" % (current_feature, current_count))
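Before submitting a streaming job, the reduce logic can be exercised locally on a few hand-written lines. The sketch below is an alternative formulation using `itertools.groupby` rather than the reducer from the slide; the function name and the sample mapper output are made up for illustration, and the call to `sorted` stands in for the sort that Hadoop performs between the map and reduce phases.

```python
from itertools import groupby

def reduce_sorted(lines):
    """Sum counts per key from key<TAB>count lines that are already sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    results = []
    for feature, group in groupby(parsed, key=lambda kv: kv[0]):
        results.append((feature, sum(int(count) for _, count in group)))
    return results

# Pretend output of two mappers, each voting on two features.
mapper_output = ["feature0\t1", "feature1\t0", "feature0\t1", "feature1\t1"]
for feature, total in reduce_sorted(sorted(mapper_output)):
    print("%s\t%d" % (feature, total))
```

`groupby` only merges adjacent equal keys, which is why the input has to be sorted first; that is the same assumption the streaming reducer relies on.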

NPFS in MapReduce might look something like

    # at the command line on the Hadoop cluster
    $ bin/hadoop dfs -copyFromLocal ~/massivefile.txt /user/gregd/
    $ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
        -file /home/gregd/mapper.py \
        -mapper /home/gregd/mapper.py \
        -file /home/gregd/reducer.py \
        -reducer /home/gregd/reducer.py \
        -input /user/gregd/massivefile.txt \
        -output /home/gregd/output/

Apache Mahout

Mahout
Apache’s Machine Learning Libraries using MapReduce
- Mahout is a collection of machine learning algorithms, most of which use MapReduce and run on Hadoop.
- What’s there? Collaborative filters, classification (LR, NB, RF, HMM & MLP), clustering, dimensionality reduction, topic models, and more!
- Apache has many more Hadoop tools: http://projects.apache.org/indexes/category.html

Spark (https://spark.apache.org/)
What’s the deal with Spark?
- Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
- Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
- Write applications quickly in Java, Scala, or Python.
- Spark runs on Hadoop, Mesos, standalone, or in the cloud.
Applications that can run on Spark
- MLlib: machine learning library that fits into Spark’s APIs and interoperates with NumPy in Python.
- Spark SQL: lets you query structured data as a distributed dataset (RDD) in Spark.
- GraphX: unifies Extract, Transform and Load (ETL), exploratory analysis, and iterative graph computation within a single system.
- Spark Streaming: lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.

Amazon Elastic MapReduce
- Amazon EMR uses Hadoop or Spark to distribute your data and processing across a resizable cluster of Amazon EC2 instances.
- A web GUI lets you configure the cluster and transfer data to the cloud using AWS.
- Works with code written in Perl, Python, R, PHP, C++, Pig, Java, or Hive.
- The cluster can be shut down after the program has completed to avoid wasting $$.
- The idea is that Amazon takes care of the management of the Hadoop cluster and you spend your time analyzing the data.
- Amazon EMR allows a user to choose nodes optimized for different levels of computation, memory, or features, such as a GPU.
- A 10-node Hadoop cluster can be used for approximately $0.15 per hour.

References and Getting More Information
References
- J. Leskovec, A. Rajaraman, and J. D. Ullman, “Mining of Massive Datasets,” Cambridge University Press, 2nd Ed., 2014.
- J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Symposium on Operating System Design and Implementation, 2004.
- Coursera, “Mining Massive Datasets,” 2015.

That's all folks! Questions?
