Slide 1

Map-Reduce design patterns 4Developers – Warsaw, 7th April 2014 Wojciech Sznapka

Slide 2

About the speaker Doing software since 2004. Head of Development at Cherry Poland. Loves sophisticated architectures and implementing them. Data science and sensors geek after hours.

Slide 3

About the talk ● Definitions ● Tools ● Design patterns ● Real-life use cases ● Conclusion

Slide 4

Definitions

Slide 5

Map-Reduce A programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Developed by Google, primarily for indexing. Source: Wikipedia

Slide 6

Mapper Converts chunks of input into intermediate key-value pairs.

Slide 7

Reducer Transforms intermediate key-value aggregates into any number of output pairs. The reducer receives all values emitted by the mappers for a given key.

Slide 8

Shuffle & sort A transparent stage between the map and reduce phases, handled entirely by the MR framework.
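
To make the three stages concrete, here is a minimal single-process sketch of the model (illustrative only, no MR framework involved; the word-count task and all names are made up for this example): the mapper emits intermediate pairs, the shuffle & sort step groups them by key, and the reducer folds each group.

from collections import defaultdict

def mapfn(key, text):
    # map phase: emit (word, 1) for every word in the input chunk
    for word in text.lower().split():
        yield word, 1

def reducefn(key, values):
    # reduce phase: fold all values emitted for a single key
    return sum(values)

chunks = {'doc1': 'map reduce map', 'doc2': 'reduce shuffle sort'}

# "shuffle & sort": group intermediate pairs by key
grouped = defaultdict(list)
for key, chunk in chunks.items():
    for word, one in mapfn(key, chunk):
        grouped[word].append(one)

results = dict((word, reducefn(word, ones)) for word, ones in grouped.items())
print results  # {'shuffle': 1, 'sort': 1, 'reduce': 2, 'map': 2} (order may vary)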

Slide 9

Key features 1. Distributed 2. Fault-tolerant 3. Suitable for big data

Slide 10

Big data? Usually the amount of data that crashes Excel or makes your DBA cry. Mainly unstructured, flat files (CSV) split by some criterion.

Slide 11

Tools

Slide 12

Apache Hadoop The dominant platform for Map-Reduce engineering. Shipped with HDFS, a distributed file system which acts as input and output for MR jobs. The Streaming API allows writing MR functions in any language (Python, PHP), not only Java. The Map-Reduce ecosystem has produced many tools on top of it (Mahout, Hive, Pig).
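
As an illustration of the Streaming API, a word-count job could be two plain Python scripts that talk to Hadoop through stdin/stdout with tab-separated key-value pairs. This is only a sketch: the file names and HDFS paths are hypothetical, and the exact location of the streaming jar depends on the Hadoop distribution.

# mapper.py - reads raw text lines from stdin, emits "word<TAB>1"
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print '%s\t1' % word

# reducer.py - Hadoop feeds it lines sorted by key, so a running per-key sum is enough
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.strip().split('\t')
    if word != current_word and current_word is not None:
        print '%s\t%d' % (current_word, count)
        count = 0
    current_word = word
    count += int(value)
if current_word is not None:
    print '%s\t%d' % (current_word, count)

$ hadoop jar hadoop-streaming.jar \
    -input /data/books -output /data/wordcount \
    -mapper 'python mapper.py' -reducer 'python reducer.py' \
    -file mapper.py -file reducer.py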

Slide 13

MongoDB Offers a Map-Reduce engine that operates on stored documents and works similarly to normal queries.
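
For illustration only, such a job could be submitted from Python roughly like this. The sketch assumes an older pymongo release where Collection.map_reduce is still available (it was removed in pymongo 4) and a hypothetical 'readings' collection with a 'sensor' field; the map and reduce functions themselves are JavaScript executed by the server.

from pymongo import MongoClient
from bson.code import Code

db = MongoClient()['demo']

# both functions run server-side as JavaScript
mapper = Code("function () { emit(this.sensor, 1); }")
reducer = Code("function (key, values) { return Array.sum(values); }")

# counts readings per sensor, stores the result in the 'readings_per_sensor' collection
out = db.readings.map_reduce(mapper, reducer, "readings_per_sensor")
for doc in out.find():
    print doc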

Slide 14

R R is a programming language and environment for statistical computing. There are R packages that implement the Map-Reduce model.

Slide 15

Lightweight implementations Octo.py Mincemeat.py ← used for examples in this talk ...

Slide 16

Design patterns

Slide 17

Filtering patterns ● Keep or throw away a record, without modifying it. ● Equivalent to an SQL WHERE condition. ● Extremely useful in data science for cleaning a data set before computations.

Slide 18

Filtering example Given CSV files containing temperatures and timestamps for a given sensor (encoded in the file name), clean the datasets by removing values greater than WEIRD_TEMPERATURE (45 degrees).

Slide 19

Filtering example (1/4)

def mapfn(index, input_path):
    f = open(input_path, 'r')
    WEIRD_TEMPERATURE = 45.0
    for line in f:
        try:
            date, temperature = line.strip().split(',')
            temperature = float(temperature)
            if temperature < WEIRD_TEMPERATURE:
                yield input_path, line
        except ValueError:
            pass  # for CSV headers or dirty data
    f.close()

Slide 20

Filtering example (2/4)

def reducefn(input_path, filtered_lines):
    output = 'output/filter/%s' % input_path.split('/')[-1]
    f = open(output, 'w+')
    f.write(''.join(filtered_lines))
    f.close()
    return output

Slide 21

Filtering example (3/4)

import mincemeat
import glob

s = mincemeat.Server()
s.datasource = dict(enumerate(glob.glob('data/*.csv')))
s.mapfn = mapfn
s.reducefn = reducefn
results = s.run_server(password="4dev")
print results

Slide 22

Filtering example (4/4)

$ ./filter.py
$ mincemeat.py -p 4dev localhost
$ wc -l data/*
   59201 data/sensor-office.csv
   59201 data/sensor-outdoor-west.csv
  118402 total
$ wc -l output/filter/*
   36221 output/filter/sensor-office.csv
   54068 output/filter/sensor-outdoor-west.csv
   90289 total

Slide 23

Generating a Top-N list ● Each mapper looks for the TOP values in its chunk of data; the reducer does the same over the mappers' results. ● Similar to sorting and limiting a result set in an SQL database. ● Very efficient: the big dataset is split into chunks and everything runs in parallel.

Slide 24

Top-N example Across all sensor datasets (in distinct CSV files), find the 5 highest detected temperatures.

Slide 25

TOP-N example (1/3)

def mapfn(index, input_path):
    TOP_N = 5
    f = open(input_path, 'r')
    temperatures = []
    for line in f:
        try:
            date, temperature = line.strip().split(',')
            temperatures.append(float(temperature))
        except ValueError:
            pass  # for CSV headers or dirty data
    f.close()
    temperatures = list(set(temperatures))  # make it unique
    temperatures.sort()
    for temperature in temperatures[-TOP_N:]:
        yield 1, temperature

Slide 26

TOP-N example (2/3)

def reducefn(index, top_temperatures):
    TOP_N = 5
    top_temperatures.sort()
    return top_temperatures[-TOP_N:]

Slide 27

TOP-N example (3/3)

import mincemeat
import glob

s = mincemeat.Server()
s.datasource = dict(enumerate(glob.glob('output/filter/*.csv')))
s.mapfn = mapfn
s.reducefn = reducefn
results = s.run_server(password="4dev")
print results

{1: [44.1, 44.3, 44.4, 44.7, 44.8]}

Slide 28

Counting ● Counts occurrences of some value or pattern. ● Emits the searched value as a key and 1 as a value. ● Can be optimized to pre-count values in the mapper (see the sketch after the example).

Slide 29

Counting example Find the number of requests made by Internet Explorer family browsers. As datasets, use gzipped daily Apache access logs.

Slide 30

Counting example (1/3)

def mapfn(index, input_path):
    import csv
    import gzip

    with gzip.open(input_path, 'r') as f:
        reader = csv.reader(f, delimiter=' ')
        for row in reader:
            if 'MSIE' in row[9]:  # row with index 9 is user agent
                yield 'MSIE', 1

Slide 31

Counting example (2/3)

def reducefn(browser_family, counts):
    return sum(counts)

Slide 32

Counting example (3/3)

import mincemeat
import glob

s = mincemeat.Server()
s.datasource = dict(enumerate(glob.glob('logs/*.gz')))
s.mapfn = mapfn
s.reducefn = reducefn
results = s.run_server(password="4dev")
print results

{'MSIE': 7499}
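
The "pre-count values in the mapper" optimization mentioned on the Counting slide could look roughly like the sketch below: instead of emitting ('MSIE', 1) for every matching request, the mapper keeps a local counter per log file and emits a single pre-aggregated pair, which shrinks the intermediate data to shuffle. The reducer stays unchanged, because sum(counts) works equally well on partial counts.

def mapfn(index, input_path):
    import csv
    import gzip

    count = 0
    with gzip.open(input_path, 'r') as f:
        reader = csv.reader(f, delimiter=' ')
        for row in reader:
            if 'MSIE' in row[9]:  # index 9 is the user agent field
                count += 1

    # one pre-aggregated pair per file instead of one pair per request
    if count:
        yield 'MSIE', count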

Slide 33

Numerical summarizations ● Because of the programmatic nature of Map-Reduce, all kinds of calculations can be applied in the reducer. ● The mapper groups data by some criterion; the reducer does all the math. ● Very helpful for reporting.

Slide 34

Summarizations example For each sensor, calculate the mean, stdev, min and max temperatures on a monthly basis.

Slide 35

Summarizations example (1/3)

def mapfn(index, input_path):
    import datetime

    f = open(input_path, 'r')
    sensor_name = input_path.split('/')[-1].split('.')[0]
    for line in f:
        try:
            date, temperature = line.strip().split(',')
            temperature = float(temperature)
            date = datetime.datetime.strptime(date, "%Y-%m-%dT%H:%M:%SZ")
            yield '%s-%d-%d' % (sensor_name, date.year, date.month), temperature
        except ValueError:
            pass  # for CSV headers or dirty data
    f.close()

Slide 36

Summarizations example (2/3)

def reducefn(key, values):
    import numpy

    return {
        'mean': round(numpy.mean(values), 2),
        'stdev': round(numpy.std(values), 2),
        'min': round(numpy.min(values), 2),
        'max': round(numpy.max(values), 2)
    }

Slide 37

Summarizations example (3/3)

import mincemeat
import glob

s = mincemeat.Server()
s.datasource = dict(enumerate(glob.glob('data/*.csv')))
s.mapfn = mapfn
s.reducefn = reducefn
results = s.run_server(password="4dev")
print results

{'sensor-office-2013-7': {'mean': 24.32, 'max': 32.1, 'min': 20.0, 'stdev': 1.65}, [...]

Slide 38

Indexing ● The famous inverted index. ● It's like the index of terms at the end of a book. ● Can map terms to URLs, files, or wherever they came from. ● The mapper extracts terms from the input and yields them one by one together with their origin (a URL or file location, for example). ● Remember to exclude stop-words.

Slide 39

Indexing example For a given set of text files (Pan Tadeusz fragments), create an inverted index and figure out in which files the word 'Wojski' occurs. Exclude common Polish stop-words.

Slide 40

Indexing example (1/3)

def mapfn(index, input_path):
    import string

    # list is way longer, shortened for brevity
    stop_words = ['a', 'aby', 'ach', 'acz', 'aczkolwiek']
    f = open(input_path, 'r')
    text = f.read()
    f.close()
    punctuation = set(string.punctuation)
    text = ''.join(ch for ch in text if ch not in punctuation)
    for word in set(text.lower().split(' ')):
        if word not in stop_words:
            yield word, input_path

Slide 41

Indexing example (2/3)

def reducefn(key, values):
    return values

Slide 42

Indexing example (3/3)

# -*- coding: utf-8 -*-
import mincemeat
import glob

s = mincemeat.Server()
s.datasource = dict(enumerate(glob.glob('lorem-ipsum/*.txt')))
s.mapfn = mapfn
s.reducefn = reducefn
index = s.run_server(password="4dev")
print index['wojski']
# prints: ['lorem-ipsum/part4.txt', 'lorem-ipsum/part2.txt']

Slide 43

Combining datasets ● Joining different types of data sets. ● All data sets go to the mapper. ● The mapper recognizes the type of input (based on column count or data characteristics). ● The mapper emits a common key (like a foreign key in SQL databases) and the data to join. ● The reducer joins them and returns the result (see the sketch below).
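
There is no join example among the demos in this talk, but a reduce-side join in the same mincemeat style might look roughly like the sketch below. The two record types (a hypothetical sensors.csv with sensor_id,location rows and readings files with sensor_id,date,temperature rows), their column layouts, and the trick of distinguishing them by column count are all assumptions made for illustration.

def mapfn(index, input_path):
    # both kinds of files go through the same mapper;
    # the column count tells which type of record this is
    f = open(input_path, 'r')
    for line in f:
        fields = line.strip().split(',')
        if len(fields) == 2:      # sensors.csv: sensor_id, location
            yield fields[0], ('sensor', fields[1])
        elif len(fields) == 3:    # readings: sensor_id, date, temperature
            yield fields[0], ('reading', fields[1], fields[2])
    f.close()

def reducefn(sensor_id, values):
    # all records sharing the common key meet here; glue them together
    location = None
    readings = []
    for value in values:
        if value[0] == 'sensor':
            location = value[1]
        else:
            readings.append(value[1:])
    return [(location, date, temperature) for date, temperature in readings]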

Slide 44

… and many more ● Graph processing, ● Sorting, ● Distinct values, ● Bucketing, ● Cross-correlation, ● Processing binary data (PDF, images, video), ● All kinds of data processing for machine learning (recommendations, clustering, classifications).

Slide 45

Real-life use cases

Slide 46

NY Times public domain articles digitization “As part of eliminating TimesSelect, The New York Times has decided to make all the public domain articles from 1851–1922 available free of charge. These articles are all in the form of images scanned from the original paper. In fact from 1851–1980, all 11 million articles are available as images in PDF format. To generate a PDF version of the article takes quite a bit of work — each article is actually composed of numerous smaller TIFF images that need to be scaled and glued together in a coherent fashion.” “I was ready to deploy Hadoop. […] I churned through all 11 million articles in just under 24 hours using 100 EC2 instances, and generated another 1.5TB of data to store in S3.” Source: http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/

Slide 47

PageRank by Google “PageRank is an algorithm used by Google Search to rank websites in their search engine results. PageRank was named after Larry Page, one of the founders of Google. PageRank is a way of measuring the importance of website pages. According to Google: PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.” Google utilized the Map-Reduce model to calculate PageRank values for pages stored on GFS (the ancestor of HDFS). Source: http://pl.wikipedia.org/wiki/PageRank

Slide 48

Google Maps data generation Locating roads connected to a given intersection, rendering map tiles, finding POIs for an address. Lots of interesting algorithms implemented with Map-Reduce. Source: https://courses.cs.washington.edu/courses/cse490h/11wi/CSE490H_files/mapr-design.pdf Source: http://web.cs.wpi.edu/~cs4513/d08/OtherStuff/MapReduce-TeamC_files/v3_document.htm

Slide 49

Conclusion

Slide 50

The Map-Reduce approach allows implementing most SQL database operations, as long as the dataset contains the data we need to perform them.

Slide 51

Both the map and reduce steps are high-level language functions, so their usage is limited only by the developer's imagination, which makes the model very flexible.

Slide 52

The Map-Reduce approach is scalable and well suited to huge amounts of data.

Slide 53

Thank you!

Slide 54

Contact [email protected] https://twitter.com/sznapka http://sznapka.pl

Slide 55

Join me at Cherry Poland If you're a Senior PHP developer, you're excited about high load and non-CRUD business logic, and you like working in an international team, we should talk! #mongo #php #python #zf2 #angularJS #redis #memcached #rabbitMQ #scalability #datawarehouse #amazon #redshift #continuous-delivery #scrum

Slide 56

Sources
https://docs.google.com/a/knowlabs.com/document/d/1OV91FO5FejjbvUw2XMi6J3emkgnrnlFB7nnharOttHg/view
http://cecs.wright.edu/~tkprasad/courses/cs707/ProgrammingHadoop.pdf
http://web.cs.wpi.edu/~cs4513/d08/OtherStuff/MapReduce-TeamC_files/v3_document.htm
https://courses.cs.washington.edu/courses/cse490h/11wi/CSE490H_files/mapr-design.pdf
http://courses.cs.washington.edu/courses/cse524/08wi/slides/mapreduce-lec.pdf
http://www.umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf
http://www.josemalvarez.es/web/wp-content/uploads/2013/04/MapReduce.png
http://www.webforefront.com/bigdata/mapreduceintro.html
http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/
http://pages.cs.wisc.edu/~remzi/Classes/537/Fall2008/Notes/mapreduce.pdf
http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/