
Map-Reduce patterns


The dominance of relational databases and SQL has shaped how developers think about storing and analyzing data. Unfortunately, for large data sets a relational database can be insufficient. Luckily, we aren't doomed: thanks to non-relational Big Data tools such as Apache Hadoop, which combines storage and computation, we can analyze giant data sets in an easy and effective fashion. It requires switching one's mindset from SQL to Map-Reduce and becoming familiar with a few patterns that help solve the typical problems we used to tackle with SQL.

Wojciech Sznapka

June 10, 2014


Transcript

  1. Map-Reduce
    design patterns
    4Developers – Warsaw, 7th April 2014
    Wojciech Sznapka


  2. About the speaker
    Doing software since 2004.
    Head of Development at Cherry Poland.
    Loves sophisticated architectures and implementing
    them.
    Data science and sensors geek after hours.


  3. About the talk
    Definitions
    Tools
    Design patterns
    Real-life use cases
    Conclusion


  4. Definitions


  5. Map-Reduce
    A programming model for processing large data sets
    with a parallel, distributed algorithm on a cluster.
    Developed by Google, primarily for indexing.
    Source: Wikipedia


  6. Mapper
    Converts chunks of input into intermediate key-value
    pairs.


  7. Reducer
    Transforms intermediate key-value aggregates into
    any number of output pairs.
    The reducer receives all values for a given key
    emitted by the mappers.


  8. Shuffle & sort
    A transparent stage between the map and reduce phases,
    handled entirely by the MR framework.
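
    To make these three stages concrete, here is a minimal, framework-free
    word-count sketch in plain Python (the two-line input is hypothetical);
    the shuffle & sort stage is simulated by grouping values per key:

    from collections import defaultdict

    def mapfn(key, line):
        for word in line.split():
            yield word, 1          # intermediate key-value pairs

    def reducefn(word, counts):
        return sum(counts)         # receives all values for one key

    lines = {0: 'map reduce map', 1: 'reduce'}

    # shuffle & sort: group intermediate values by key
    # (a real MR framework does this transparently)
    groups = defaultdict(list)
    for key, line in lines.items():
        for word, one in mapfn(key, line):
            groups[word].append(one)

    print dict((word, reducefn(word, counts))
               for word, counts in groups.items())
    # prints: {'map': 2, 'reduce': 2}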


  9. Key features
    1. Distributed
    2. Fault-tolerant
    3. Suitable for big data


  10. Big data?
    Usually an amount of data that crashes Excel or makes
    your DBA cry.
    Mainly unstructured, flat files (CSV) split by some
    criterion.


  11. Tools


  12. Apache Hadoop
    The dominant platform for Map-Reduce engineering. Shipped with
    HDFS, a distributed file system which acts as input and output for
    MR jobs.
    The Streaming API allows writing MR functions in any language
    (Python, PHP), not only Java.
    The Map-Reduce ecosystem has produced many tools on top of it
    (Mahout, Hive, Pig).
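
    To illustrate the Streaming API: a mapper is just a script that reads
    raw lines from stdin and writes tab-separated key-value pairs to stdout.
    A minimal sketch (a hypothetical word-count mapper, not from the talk):

    #!/usr/bin/env python
    # Hadoop Streaming mapper sketch: lines in on stdin,
    # tab-separated key-value pairs out on stdout.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print '%s\t%d' % (word, 1)

    Hadoop shuffles the emitted pairs by key and pipes them, sorted, into
    an analogous reducer script.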


  13. MongoDB
    Offers a Map-Reduce engine that operates on stored
    documents and works similarly to normal queries.
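
    A minimal sketch of driving that engine from Python, assuming a local
    mongod, a hypothetical 'events' collection, and a PyMongo release old
    enough to still expose Collection.map_reduce (it was removed in PyMongo 4):

    # Count events per type with MongoDB's Map-Reduce engine.
    # The 'events' collection and its 'type' field are assumptions.
    from pymongo import MongoClient
    from bson.code import Code

    db = MongoClient().test
    mapper = Code("function () { emit(this.type, 1); }")
    reducer = Code("function (key, values) { return Array.sum(values); }")

    result = db.events.map_reduce(mapper, reducer, "event_counts")
    for doc in result.find():
        print doc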


  14. R
    R is a programming language and environment for
    statistical computing.
    There are R packages that implement the Map-Reduce
    model.


  15. Lightweight implementations
    Octo.py
    Mincemeat.py ← used for examples in this talk
    ...


  16. Design patterns


  17. Filtering patterns
    ●Keep or throw away a record, without modifying it.
    ●Equivalent to the SQL WHERE condition.
    ●Extremely useful in data science for cleaning a data
    set before computations.


  18. Filtering example
    Given CSV files containing timestamped temperatures for
    a given sensor (encoded in the file name), clean the data
    sets by removing values greater than WEIRD_TEMPERATURE
    (45 degrees).


  19. Filtering example (1/4)
    def mapfn(index, input_path):
        f = open(input_path, 'r')
        WEIRD_TEMPERATURE = 45.0
        for line in f:
            try:
                date, temperature = line.strip().split(',')
                temperature = float(temperature)
                if temperature < WEIRD_TEMPERATURE:
                    yield input_path, line
            except ValueError:
                pass  # for CSV headers or dirty data
        f.close()


  20. Filtering example (2/4)
    def reducefn(input_path, filtered_lines):
        output = 'output/filter/%s' % input_path.split('/')[-1]
        f = open(output, 'w+')
        f.write(''.join(filtered_lines))
        f.close()
        return output


  21. Filtering example (3/4)
    import mincemeat
    import glob
    s = mincemeat.Server()
    s.datasource = dict(enumerate(glob.glob('data/*.csv')))
    s.mapfn = mapfn
    s.reducefn = reducefn
    results = s.run_server(password="4dev")
    print results


  22. Filtering example (4/4)
    $ ./filter.py
    $ mincemeat.py -p 4dev localhost
    $ wc -l data/*
    59201 data/sensor-office.csv
    59201 data/sensor-outdoor-west.csv
    118402 total
    $ wc -l output/filter/*
    36221 output/filter/sensor-office.csv
    54068 output/filter/sensor-outdoor-west.csv
    90289 total


  23. Generating Top – N list
    ●Each mapper looks for the top values in its chunk of
    data; the reducer does the same for the mappers' results.
    ●Similar to sorting and limiting a result set in an SQL
    database.
    ●Very efficient: the big data set is split into chunks
    and everything runs in parallel.


  24. Top – N example
    Across all sensor data sets (in distinct CSV files), find
    the 5 highest detected temperatures.


  25. TOP-N example (1/3)
    def mapfn(index, input_path):
        TOP_N = 5
        f = open(input_path, 'r')
        temperatures = []
        for line in f:
            try:
                date, temperature = line.strip().split(',')
                temperatures.append(float(temperature))
            except ValueError:
                pass  # for CSV headers or dirty data
        f.close()
        temperatures = list(set(temperatures))  # make it unique
        temperatures.sort()
        for temperature in temperatures[-TOP_N:]:
            yield 1, temperature


  26. TOP-N example (2/3)
    def reducefn(index, top_temperatures):
        TOP_N = 5
        top_temperatures.sort()
        return top_temperatures[-TOP_N:]


  27. TOP-N example (3/3)
    import mincemeat
    import glob
    s = mincemeat.Server()
    s.datasource = dict(enumerate(glob.glob('output/filter/*.csv')))
    s.mapfn = mapfn
    s.reducefn = reducefn
    results = s.run_server(password="4dev")
    print results
    {1: [44.1, 44.3, 44.4, 44.7, 44.8]}


  28. Counting
    ●Counts occurrences of some value or pattern.
    ●Emits the searched value as a key and 1 as a value.
    ●Can be optimized by pre-counting values in the mapper,
    as shown in the sketch below.
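
    A minimal sketch of that pre-counting optimization (a hand-rolled
    combiner), assuming the same gzipped access-log input as the example
    on the following slides; the mapper sums matches locally and emits a
    single pre-aggregated pair per file, so far fewer intermediate pairs
    cross the shuffle & sort stage:

    def mapfn(index, input_path):
        import csv
        import gzip
        count = 0
        with gzip.open(input_path, 'r') as f:
            for row in csv.reader(f, delimiter=' '):
                if 'MSIE' in row[9]:  # field with index 9 is the user agent
                    count += 1
        yield 'MSIE', count  # one pair per file instead of one per request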


  29. Counting example
    Find the number of requests made by Internet Explorer
    family browsers.
    As data sets, use gzipped daily Apache access logs.


  30. Counting example (1/3)
    def mapfn(index, input_path):
        import csv
        import gzip
        with gzip.open(input_path, 'r') as f:
            reader = csv.reader(f, delimiter=' ')
            for row in reader:
                if 'MSIE' in row[9]:  # field with index 9 is the user agent
                    yield 'MSIE', 1


  31. Counting example (2/3)
    def reducefn(browser_family, counts):
        return sum(counts)


  32. Counting example (3/3)
    import mincemeat
    import glob
    s = mincemeat.Server()
    s.datasource = dict(enumerate(glob.glob('logs/*.gz')))
    s.mapfn = mapfn
    s.reducefn = reducefn
    results = s.run_server(password="4dev")
    print results
    {'MSIE': 7499}


  33. Numerical summarizations
    ●Because of the programmatic nature of Map-Reduce, all
    kinds of calculations can be applied in the reducer.
    ●The mapper groups data by some criterion; the reducer
    does all the math.
    ●Very helpful for reporting.


  34. Summarizations example
    For each sensor, calculate the mean, stdev, min and max
    temperatures on a monthly basis.


  35. Summarizations example (1/3)
    def mapfn(index, input_path):
        import datetime
        f = open(input_path, 'r')
        sensor_name = input_path.split('/')[-1].split('.')[0]
        for line in f:
            try:
                date, temperature = line.strip().split(',')
                temperature = float(temperature)
                date = datetime.datetime.strptime(date, "%Y-%m-%dT%H:%M:%SZ")
                yield '%s-%d-%d' % (sensor_name, date.year, date.month), temperature
            except ValueError:
                pass  # for CSV headers or dirty data
        f.close()


  36. Summarizations example (2/3)
    def reducefn(key, values):
        import numpy
        return {
            'mean': round(numpy.mean(values), 2),
            'stdev': round(numpy.std(values), 2),
            'min': round(numpy.min(values), 2),
            'max': round(numpy.max(values), 2)
        }


  37. Summarizations example (3/3)
    import mincemeat
    import glob
    s = mincemeat.Server()
    s.datasource = dict(enumerate(glob.glob('data/*.csv')))
    s.mapfn = mapfn
    s.reducefn = reducefn
    results = s.run_server(password="4dev")
    print results
    {'sensor-office-2013-7': {'mean': 24.32, 'max': 32.1, 'min': 20.0,
    'stdev': 1.65}, [...]


  38. Indexing
    ●The famous inverted index.
    ●It's like the index of terms at the end of a book.
    ●Can map terms to URLs, files, or wherever they came
    from.
    ●The mapper extracts terms from the input and yields them
    one by one together with their origin (a URL or file
    location, for example).
    ●Remember to exclude stop words.


  39. Indexing example
    For a given set of text files (fragments of Pan Tadeusz),
    create an inverted index and figure out in which files the
    word 'Wojski' occurs. Exclude common Polish stop words.


  40. Indexing example (1/3)
    def mapfn(index, input_path):
        import string
        # list is way longer, shortened for brevity
        stop_words = ['a', 'aby', 'ach', 'acz', 'aczkolwiek']
        f = open(input_path, 'r')
        text = f.read()
        f.close()
        punctuation = set(string.punctuation)
        text = ''.join(ch for ch in text if ch not in punctuation)
        for word in set(text.lower().split()):  # split on any whitespace
            if word not in stop_words:
                yield word, input_path


  41. Indexing example (2/3)
    def reducefn(key, values):
        return values


  42. Indexing example (3/3)
    # -*- coding: utf-8 -*-
    import mincemeat
    import glob
    s = mincemeat.Server()
    s.datasource = dict(enumerate(glob.glob('lorem-ipsum/*.txt')))
    s.mapfn = mapfn
    s.reducefn = reducefn
    index = s.run_server(password="4dev")
    print index['wojski']
    # prints: ['lorem-ipsum/part4.txt', 'lorem-ipsum/part2.txt']


  43. Combining datasets
    ●Joins different types of data sets.
    ●All data sets go to the mapper.
    ●The mapper recognizes the type of input (based on column
    count or data characteristics).
    ●The mapper emits a common key (like a foreign key in SQL
    databases) and the data to join.
    ●The reducer joins them and returns the result; a sketch
    follows below.
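
    A minimal sketch of such a reduce-side join, with two hypothetical CSV
    data sets: users.csv rows look like 'id,name' and orders.csv rows look
    like 'user_id,amount'. The mapper tags each record with its type so the
    reducer can match them up under the shared key:

    def mapfn(index, input_path):
        is_users = 'users' in input_path  # recognize the type of input
        for line in open(input_path, 'r'):
            fields = line.strip().split(',')
            if is_users:
                yield fields[0], ('user', fields[1])   # key = user id
            else:
                yield fields[0], ('order', fields[1])  # key = foreign key

    def reducefn(user_id, values):
        names = [v for (kind, v) in values if kind == 'user']
        orders = [v for (kind, v) in values if kind == 'order']
        return {'name': names[0] if names else None, 'orders': orders}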


  44. … and many more
    ●Graph processing,
    ●Sorting,
    ●Distinct values,
    ●Bucketing,
    ●Cross-correlation,
    ●Processing binary data (PDF, images, video),
    ●All kinds of data processing for machine learning
    (recommendations, clustering, classification).


  45. Real-life use cases


  46. NY Times public domain articles
    digitization
    “As part of eliminating TimesSelect, The New York Times has decided to make all the public domain articles
    from 1851–1922 available free of charge. These articles are all in the form of images scanned from the original
    paper. In fact from 1851–1980, all 11 million articles are available as images in PDF format. To generate a PDF
    version of the article takes quite a bit of work — each article is actually composed of numerous smaller TIFF
    images that need to be scaled and glued together in a coherent fashion.”
    “I was ready to deploy Hadoop. […] I churned through all 11 million articles in just under 24 hours using 100 EC2
    instances, and generated another 1.5TB of data to store in S3. ”
    source: http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/


  47. PageRank by Google
    “PageRank is an algorithm used by Google Search to rank websites in their search engine results. PageRank
    was named after Larry Page,[1] one of the founders of Google. PageRank is a way of measuring the importance
    of website pages. According to Google:
    PageRank works by counting the number and quality of links to a page to determine a rough estimate of how
    important the website is. The underlying assumption is that more important websites are likely to receive
    more links from other websites.”
    Google used the Map-Reduce model to calculate PageRank values for pages stored on GFS (the ancestor of HDFS).
    Source: http://pl.wikipedia.org/wiki/PageRank
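
    A minimal sketch of one PageRank iteration expressed as map and reduce,
    assuming a tiny hypothetical in-memory link graph and the simplified
    formula rank = 0.15 + 0.85 * sum(contributions): each mapper spreads a
    page's current rank across its outgoing links, and the reducer sums the
    contributions arriving at each page.

    from collections import defaultdict

    def mapfn(page, links_and_rank):
        links, rank = links_and_rank
        for target in links:            # spread this page's rank
            yield target, rank / len(links)
        yield page, 0.0                 # keep pages with no inbound links

    def reducefn(page, contributions):
        return 0.15 + 0.85 * sum(contributions)

    graph = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}
    ranks = dict((page, 1.0) for page in graph)

    # one iteration, with the shuffle stage simulated in-process
    grouped = defaultdict(list)
    for page in graph:
        for target, share in mapfn(page, (graph[page], ranks[page])):
            grouped[target].append(share)
    ranks = dict((p, reducefn(p, vals)) for p, vals in grouped.items())
    print ranks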


  48. Google Maps data generation
    Locating roads connected to a given intersection, rendering map tiles, finding POIs for an address.
    Lots of interesting algorithms implemented with Map-Reduce.
    Source: https://courses.cs.washington.edu/courses/cse490h/11wi/CSE490H_files/mapr-design.pdf
    Source: http://web.cs.wpi.edu/~cs4513/d08/OtherStuff/MapReduce-TeamC_files/v3_document.htm


  49. Conclusion


  50. The Map-Reduce approach allows implementing most
    SQL database operations, as long as the data set contains
    the data needed to perform them.


  51. Both the map and reduce steps are high-level language
    functions, so their usage is limited only by the developer's
    imagination, which makes the model very flexible.


  52. The Map-Reduce approach is scalable and well suited
    to huge amounts of data.


  53. Thank you!


  54. Contact
    [email protected]
    https://twitter.com/sznapka
    http://sznapka.pl


  55. Join me at Cherry Poland
    If you're a Senior PHP developer,
    you're excited about high load and non-CRUD business logic,
    and you like working in an international team,
    we should talk!
    #mongo #php #python #zf2 #angularJS #redis #memcached #rabbitMQ
    #scalability #datawarehouse #amazon #redshift #continuous-delivery #scrum


  56. Sources
    https://docs.google.com/a/knowlabs.com/document/d/1OV91FO5FejjbvUw2XMi6J3emkgnrnlFB7nnharOttHg/view
    http://cecs.wright.edu/~tkprasad/courses/cs707/ProgrammingHadoop.pdf
    http://web.cs.wpi.edu/~cs4513/d08/OtherStuff/MapReduce-TeamC_files/v3_document.htm
    https://courses.cs.washington.edu/courses/cse490h/11wi/CSE490H_files/mapr-design.pdf
    http://courses.cs.washington.edu/courses/cse524/08wi/slides/mapreduce-lec.pdf
    http://www.umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf
    http://www.josemalvarez.es/web/wp-content/uploads/2013/04/MapReduce.png
    http://www.webforefront.com/bigdata/mapreduceintro.html
    http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/
    http://pages.cs.wisc.edu/~remzi/Classes/537/Fall2008/Notes/mapreduce.pdf
    http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
