
Map-Reduce patterns

The dominance of relational databases and SQL has shaped how developers think about operating on and analyzing data in terms of familiar patterns. Unfortunately, for large data sets a relational database can be insufficient. Luckily, we aren't doomed. Thanks to non-relational Big Data tools such as Apache Hadoop, which combines storage and computation, we can analyze giant data sets in an easy and effective fashion. It requires a shift in mindset from SQL to Map-Reduce, and familiarity with a few patterns that help solve the typical problems we used to tackle with SQL.

Wojciech Sznapka

June 10, 2014

Transcript

  1. About the speaker

    Doing software since 2004. Head of Development at Cherry Poland. Loves sophisticated architectures and implementing them. Data science and sensors geek after hours.
  2. Map-Reduce

    A programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Developed by Google, primarily for indexing. Source: Wikipedia
  3. Reducer

    Transforms intermediate key-value aggregates into any number of output pairs. The reducer receives all values for a given key emitted by the mapper.
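    The canonical illustration of this contract is word counting. Below is a minimal sketch (my addition, not from the slides) in the style of the mincemeat.py examples used later in this deck: the mapper emits a (word, 1) pair per word, and the reducer sums all values it receives for one word.

    def mapfn(key, text):
        # mapper: emit (word, 1) for every word in the input chunk
        for word in text.split():
            yield word, 1

    def reducefn(word, counts):
        # reducer: receives all values emitted for this key
        return sum(counts)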
  4. Big data?

    Usually an amount of data that crashes Excel or makes your DBA cry. Mainly unstructured flat files (CSV) split by some criterion.
  5. Apache Hadoop

    The dominant platform for Map-Reduce engineering. It ships with HDFS, a distributed file system that acts as input and output for MR jobs. The Streaming API allows writing MR functions in any language (Python, PHP), not only Java. The Map-Reduce ecosystem has produced many tools on top of it (Mahout, Hive, Pig).
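    As a rough illustration of the Streaming API (my addition, not from the slides; the exact jar location depends on the Hadoop version and installation), a job with Python mapper and reducer scripts can be launched along these lines:

    $ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input /data/input -output /data/output \
        -mapper mapper.py -reducer reducer.py \
        -file mapper.py -file reducer.py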
  6. R

    R is a programming language and environment for statistical computing. There are R packages that implement the Map-Reduce model.
  7. Filtering patterns

    • Keep or throw away a record, without modifying it.
    • Equivalent to the SQL WHERE clause.
    • Extremely useful in data science for cleaning a data set before computations.
  8. Filtering example

    For CSV files containing temperatures and timestamps for a given sensor (the sensor name is the file name), clean the data sets by removing values greater than WEIRD_TEMPERATURE (45 degrees).
  9. Filtering example (1/4)

    def mapfn(index, input_path):
        f = open(input_path, 'r')
        WEIRD_TEMPERATURE = 45.0
        for line in f:
            try:
                date, temperature = line.strip().split(',')
                temperature = float(temperature)
                if temperature < WEIRD_TEMPERATURE:
                    yield input_path, line
            except ValueError:
                pass  # for CSV headers or dirty data
        f.close()
  10. Filtering example (2/4)

    def reducefn(input_path, filtered_lines):
        output = 'output/filter/%s' % input_path.split('/')[-1]
        f = open(output, 'w+')
        f.write(''.join(filtered_lines))
        f.close()
        return output
  11. Filtering example (3/4)

    import mincemeat
    import glob

    s = mincemeat.Server()
    s.datasource = dict(enumerate(glob.glob('data/*.csv')))
    s.mapfn = mapfn
    s.reducefn = reducefn
    results = s.run_server(password="4dev")
    print results
  12. Filtering example (4/4)

    $ ./filter.py
    $ mincemeat.py -p 4dev localhost
    $ wc -l data/*
       59201 data/sensor-office.csv
       59201 data/sensor-outdoor-west.csv
      118402 total
    $ wc -l output/filter/*
       36221 output/filter/sensor-office.csv
       54068 output/filter/sensor-outdoor-west.csv
       90289 total
  13. Generating a Top-N list

    • Each mapper looks for the top values in its chunk of data; the reducer does the same with the mappers' results.
    • Similar to sorting and limiting a result set in an SQL database.
    • Very efficient: the big data set is split into chunks and everything runs in parallel.
  14. Top-N example

    Across all sensor data sets (in distinct CSV files), find the 5 highest detected temperatures.
  15. TOP-N example (1/3)

    def mapfn(index, input_path):
        TOP_N = 5
        f = open(input_path, 'r')
        temperatures = []
        for line in f:
            try:
                date, temperature = line.strip().split(',')
                temperatures.append(float(temperature))
            except ValueError:
                pass  # for CSV headers or dirty data
        f.close()
        temperatures = list(set(temperatures))  # make it unique
        temperatures.sort()
        for temperature in temperatures[-TOP_N:]:
            yield 1, temperature
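    The (2/3) reducer slide is missing from this transcript. Below is a minimal sketch consistent with the mapper above and the output on the next slide: every mapper emits under the shared key 1, so a single reducer call sees all per-chunk candidates and keeps the overall top five.

    def reducefn(key, temperatures):
        TOP_N = 5
        temperatures = list(set(temperatures))  # de-duplicate across chunks
        temperatures.sort()
        return temperatures[-TOP_N:]  # the five highest overall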
  16. TOP-N example (3/3)

    import mincemeat
    import glob

    s = mincemeat.Server()
    s.datasource = dict(enumerate(glob.glob('output/filter/*.csv')))
    s.mapfn = mapfn
    s.reducefn = reducefn
    results = s.run_server(password="4dev")
    print results
    # {1: [44.1, 44.3, 44.4, 44.7, 44.8]}
  17. Counting

    • Counts occurrences of some value or pattern.
    • Emits the searched value as the key and 1 as the value.
    • Can be optimized by pre-counting values in the mapper (see the sketch after the example below).
  18. Counting example

    Find the number of requests made by Internet Explorer family browsers. As the data sets, use gzipped daily Apache access logs.
  19. Counting example (1/3)

    def mapfn(index, input_path):
        import csv
        import gzip
        with gzip.open(input_path, 'r') as f:
            reader = csv.reader(f, delimiter=' ')
            for row in reader:
                if 'MSIE' in row[9]:  # row with index 9 is user agent
                    yield 'MSIE', 1
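    The (2/3) reducer slide is missing from this transcript. Given that the mapper emits ('MSIE', 1) per matching request and the next slide prints {'MSIE': 7499}, the reducer presumably just sums the ones:

    def reducefn(key, values):
        return sum(values)  # total number of MSIE requests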
  20. Counting example (3/3)

    import mincemeat
    import glob

    s = mincemeat.Server()
    s.datasource = dict(enumerate(glob.glob('logs/*.gz')))
    s.mapfn = mapfn
    s.reducefn = reducefn
    results = s.run_server(password="4dev")
    print results
    # {'MSIE': 7499}
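    A sketch of the pre-count optimization mentioned on slide 17 (my illustration, not from the slides): instead of emitting one ('MSIE', 1) pair per matching row, the mapper counts locally and emits a single partial count per input file, shrinking the intermediate data. The summing reducer stays unchanged.

    def mapfn(index, input_path):
        import csv
        import gzip
        count = 0
        with gzip.open(input_path, 'r') as f:
            reader = csv.reader(f, delimiter=' ')
            for row in reader:
                if 'MSIE' in row[9]:  # row with index 9 is user agent
                    count += 1
        yield 'MSIE', count  # one partial count per file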
  21. Numerical summarizations

    • Because of the programmatic nature of Map-Reduce, any kind of calculation can be applied in the reducer.
    • The mapper groups data by some criterion; the reducer does all the math.
    • Very helpful for reporting.
  22. Summarizations example (1/3)

    def mapfn(index, input_path):
        import datetime
        f = open(input_path, 'r')
        sensor_name = input_path.split('/')[-1].split('.')[0]
        for line in f:
            try:
                date, temperature = line.strip().split(',')
                temperature = float(temperature)
                date = datetime.datetime.strptime(date, "%Y-%m-%dT%H:%M:%SZ")
                yield '%s-%d-%d' % (sensor_name, date.year, date.month), temperature
            except ValueError:
                pass  # for CSV headers or dirty data
        f.close()
  23. Summarizations example (2/3)

    def reducefn(key, values):
        import numpy
        return {
            'mean': round(numpy.mean(values), 2),
            'stdev': round(numpy.std(values), 2),
            'min': round(numpy.min(values), 2),
            'max': round(numpy.max(values), 2)
        }
  24. Summarizations example (3/3)

    import mincemeat
    import glob

    s = mincemeat.Server()
    s.datasource = dict(enumerate(glob.glob('data/*.csv')))
    s.mapfn = mapfn
    s.reducefn = reducefn
    results = s.run_server(password="4dev")
    print results
    # {'sensor-office-2013-7': {'mean': 24.32, 'max': 32.1, 'min': 20.0, 'stdev': 1.65}, [...]
  25. Indexing

    • The famous inverted index.
    • It's like the index of terms at the end of a book.
    • Can map terms to URLs, files, wherever they came from.
    • The mapper extracts terms from the input and yields them one by one, together with their origin (a URL or a file location, for example).
    • Remember to exclude stop words.
  26. Indexing example

    For a given set of text files (Pan Tadeusz fragments), create an inverted index and figure out in which files the word 'Wojski' occurs. Exclude common Polish stop words.
  27. Indexing example (1/3)

    def mapfn(index, input_path):
        import string
        # list is way longer, shortened for brevity
        stop_words = ['a', 'aby', 'ach', 'acz', 'aczkolwiek']
        f = open(input_path, 'r')
        text = f.read()
        f.close()
        punctuation = set(string.punctuation)
        text = ''.join(ch for ch in text if ch not in punctuation)
        for word in set(text.lower().split(' ')):
            if word not in stop_words:
                yield word, input_path
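    The (2/3) reducer slide is missing from this transcript. Since the mapper already emits each word at most once per file, a plausible minimal reducer just collects the unique file paths per word:

    def reducefn(word, paths):
        return list(set(paths))  # unique files containing the word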
  28. Indexing example (3/3)

    # -*- coding: utf-8 -*-
    import mincemeat
    import glob

    s = mincemeat.Server()
    s.datasource = dict(enumerate(glob.glob('lorem-ipsum/*.txt')))
    s.mapfn = mapfn
    s.reducefn = reducefn
    index = s.run_server(password="4dev")
    print index['wojski']
    # prints: ['lorem-ipsum/part4.txt', 'lorem-ipsum/part2.txt']
  29. Combining datasets

    • Joining different types of data sets.
    • All data sets go to the mapper.
    • The mapper recognizes the type of input (based on column count or data characteristics).
    • The mapper emits a common key (like a foreign key in SQL databases) and the data to join.
    • The reducer joins them and returns the result (see the sketch below).
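    A minimal reduce-side join following the recipe above (my illustration, not from the slides, with assumed inputs: a 3-column users.csv and a 4-column orders.csv sharing user_id as the join key):

    def mapfn(index, input_path):
        f = open(input_path, 'r')
        for line in f:
            fields = line.strip().split(',')
            if len(fields) == 3:    # users.csv: user_id,name,city
                yield fields[0], ('user', fields)
            elif len(fields) == 4:  # orders.csv: order_id,user_id,item,price
                yield fields[1], ('order', fields)
        f.close()

    def reducefn(user_id, records):
        users = [r for kind, r in records if kind == 'user']
        orders = [r for kind, r in records if kind == 'order']
        # join: pair the user's attributes with each of their orders
        return [(users[0], order) for order in orders] if users else []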
  30. … and many more

    • Graph processing,
    • Sorting,
    • Distinct values,
    • Bucketing,
    • Cross-correlation,
    • Processing binary data (PDF, images, video),
    • All kinds of data processing for machine learning (recommendations, clustering, classification).
  31. NY Times public domain articles digitization

    “As part of eliminating TimesSelect, The New York Times has decided to make all the public domain articles from 1851–1922 available free of charge. These articles are all in the form of images scanned from the original paper. In fact from 1851–1980, all 11 million articles are available as images in PDF format. To generate a PDF version of the article takes quite a bit of work — each article is actually composed of numerous smaller TIFF images that need to be scaled and glued together in a coherent fashion.”

    “I was ready to deploy Hadoop. […] I churned through all 11 million articles in just under 24 hours using 100 EC2 instances, and generated another 1.5TB of data to store in S3.”

    Source: http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/
  32. PageRank by Google

    “PageRank is an algorithm used by Google Search to rank websites in their search engine results. PageRank was named after Larry Page,[1] one of the founders of Google. PageRank is a way of measuring the importance of website pages. According to Google: PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.”

    Google utilized the Map-Reduce idea to calculate PageRank values for pages stored on GFS (the ancestor of HDFS).

    Source: http://pl.wikipedia.org/wiki/PageRank
  33. Google Maps data generation

    Locating roads connected to a given intersection, rendering map tiles, finding POIs for an address. Lots of interesting algorithms implemented with Map-Reduce.

    Source: https://courses.cs.washington.edu/courses/cse490h/11wi/CSE490H_files/mapr-design.pdf
    Source: http://web.cs.wpi.edu/~cs4513/d08/OtherStuff/MapReduce-TeamC_files/v3_document.htm
  34. The Map-Reduce approach makes it possible to implement most SQL database operations, as long as the data set contains the data needed for them.
  35. Both the map and reduce steps are high-level language functions, so their usage is limited only by the developer's imagination, which makes the model very flexible.
  36. Join me at Cherry Poland

    If you're a senior PHP developer, you're excited about high load and non-CRUD business logic, and you like working in an international team, we should talk! #mongo #php #python #zf2 #angularJS #redis #memcached #rabbitMQ #scalability #datawarehouse #amazon #redshift #continous-delivery #scrum