
Map-Reduce patterns


The dominance of relational databases and SQL has shaped how developers think about storing and analyzing data. Unfortunately, for large data sets a relational database can be insufficient. Luckily, we aren't doomed: thanks to non-relational Big Data tools such as Apache Hadoop, which combines storage and computation, we can analyze giant data sets in an easy and effective fashion. It requires switching one's mindset from SQL to Map-Reduce and becoming familiar with a few patterns that help solve the typical problems we used to tackle with SQL.

Wojciech Sznapka

June 10, 2014


Transcript

  1. Map-Reduce
    design patterns
    4Developers – Warsaw, 7th April 2014
    Wojciech Sznapka


  2. About the speaker
    Doing software since 2004.
    Head of Development at Cherry Poland.
    Loves sophisticated architectures and implementing
    them.
    Data science and sensors geek after hours.


  3. About the talk
    Definitions
    Tools
    Design patterns
    Real-life use cases
    Conclusion


  4. Definitions


  5. Map-Reduce
    A programming model for processing large data sets
    with a parallel, distributed algorithm on a cluster.
    Developed by Google, primarily for indexing.
    Source: Wikipedia


  6. Mapper
    Converts chunks of input into intermediate key-value
    pairs.


  7. Reducer
    Transforms intermediate key-value aggregates into
    any number of output pairs.
    The reducer receives all values for a given key
    emitted by the mappers.


  8. Shuffle & sort
    A transparent stage between the map and reduce phases,
    handled entirely by the MR framework.
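
    To make these three stages concrete, here is a minimal, framework-free
    word-count sketch in plain Python (the two-line input is hypothetical);
    the shuffle & sort stage is simulated by grouping values per key:

    from collections import defaultdict

    def mapfn(key, line):
        for word in line.split():
            yield word, 1          # intermediate key-value pairs

    def reducefn(word, counts):
        return sum(counts)         # receives all values for one key

    lines = {0: 'map reduce map', 1: 'reduce'}

    # shuffle & sort: group intermediate values by key
    # (a real MR framework does this transparently)
    groups = defaultdict(list)
    for key, line in lines.items():
        for word, one in mapfn(key, line):
            groups[word].append(one)

    print dict((word, reducefn(word, counts))
               for word, counts in groups.items())
    # prints: {'map': 2, 'reduce': 2}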


  9. Key features
    1. Distributed
    2. Fault-tolerant
    3. Suitable for big data


  10. Big data?
    Usually an amount of data that crashes Excel or makes
    your DBA cry.
    Mainly unstructured, flat files (CSV) split by some
    criterion.


  11. Tools


  12. Apache Hadoop
    The dominant platform for Map-Reduce engineering. Shipped with
    HDFS, a distributed file system which acts as input and output for
    MR jobs.
    The Streaming API allows writing MR functions in any language
    (Python, PHP), not only Java.
    The Map-Reduce ecosystem has produced many tools on top of it
    (Mahout, Hive, Pig).
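
    To illustrate the Streaming API: a mapper is just a script that reads
    raw lines from stdin and writes tab-separated key-value pairs to stdout.
    A minimal sketch (a hypothetical word-count mapper, not from the talk):

    #!/usr/bin/env python
    # Hadoop Streaming mapper sketch: lines in on stdin,
    # tab-separated key-value pairs out on stdout.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print '%s\t%d' % (word, 1)

    Hadoop shuffles the emitted pairs by key and pipes them, sorted, into
    an analogous reducer script.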


  13. MongoDB
    Offers a Map-Reduce engine that operates on stored
    documents and works similarly to normal queries.
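
    A minimal sketch of driving that engine from Python, assuming a local
    mongod, a hypothetical 'events' collection, and a PyMongo release old
    enough to still expose Collection.map_reduce (it was removed in PyMongo 4):

    # Count events per type with MongoDB's Map-Reduce engine.
    # The 'events' collection and its 'type' field are assumptions.
    from pymongo import MongoClient
    from bson.code import Code

    db = MongoClient().test
    mapper = Code("function () { emit(this.type, 1); }")
    reducer = Code("function (key, values) { return Array.sum(values); }")

    result = db.events.map_reduce(mapper, reducer, "event_counts")
    for doc in result.find():
        print doc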


  14. R
    R is a programming language and environment for
    statistical computing.
    There are R packages that implement the Map-Reduce
    model.


  15. Lightweight implementations
    Octo.py
    Mincemeat.py ← used for examples in this talk
    ...


  16. Design patterns


  17. Filtering patterns
    ●Keep or throw away a record, without modifying it.
    ●Equivalent to the SQL WHERE condition.
    ●Extremely useful in data science for cleaning a data
    set before computations.


  18. Filtering example
    Given CSV files containing timestamped temperatures for
    a given sensor (encoded in the file name), clean the data
    sets by removing values greater than WEIRD_TEMPERATURE
    (45 degrees).


  19. Filtering example (1/4)
    def mapfn(index, input_path):
        f = open(input_path, 'r')
        WEIRD_TEMPERATURE = 45.0
        for line in f:
            try:
                date, temperature = line.strip().split(',')
                temperature = float(temperature)
                if temperature < WEIRD_TEMPERATURE:
                    yield input_path, line
            except ValueError:
                pass  # for CSV headers or dirty data
        f.close()


  20. Filtering example (2/4)
    def reducefn(input_path, filtered_lines):
        output = 'output/filter/%s' % input_path.split('/')[-1]
        f = open(output, 'w+')
        f.write(''.join(filtered_lines))
        f.close()
        return output


  21. Filtering example (3/4)
    import mincemeat
    import glob
    s = mincemeat.Server()
    s.datasource = dict(enumerate(glob.glob('data/*.csv')))
    s.mapfn = mapfn
    s.reducefn = reducefn
    results = s.run_server(password="4dev")
    print results


  22. Filtering example (4/4)
    $ ./filter.py
    $ mincemeat.py -p 4dev localhost
    $ wc -l data/*
    59201 data/sensor-office.csv
    59201 data/sensor-outdoor-west.csv
    118402 total
    $ wc -l output/filter/*
    36221 output/filter/sensor-office.csv
    54068 output/filter/sensor-outdoor-west.csv
    90289 total


  23. Generating Top – N list
    ●Each mapper looks for the top values in its chunk of
    data; the reducer does the same for the mappers' results.
    ●Similar to sorting and limiting a result set in an SQL
    database.
    ●Very efficient: the big data set is split into chunks
    and everything runs in parallel.


  24. Top – N example
    Across all sensor data sets (in distinct CSV files), find
    the 5 highest detected temperatures.


  25. TOP-N example (1/3)
    def mapfn(index, input_path):
        TOP_N = 5
        f = open(input_path, 'r')
        temperatures = []
        for line in f:
            try:
                date, temperature = line.strip().split(',')
                temperatures.append(float(temperature))
            except ValueError:
                pass  # for CSV headers or dirty data
        f.close()
        temperatures = list(set(temperatures))  # make it unique
        temperatures.sort()
        for temperature in temperatures[-TOP_N:]:
            yield 1, temperature


  26. TOP-N example (2/3)
    def reducefn(index, top_temperatures):
        TOP_N = 5
        top_temperatures.sort()
        return top_temperatures[-TOP_N:]


  27. TOP-N example (3/3)
    import mincemeat
    import glob
    s = mincemeat.Server()
    s.datasource = dict(enumerate(glob.glob('output/filter/*.csv')))
    s.mapfn = mapfn
    s.reducefn = reducefn
    results = s.run_server(password="4dev")
    print results
    {1: [44.1, 44.3, 44.4, 44.7, 44.8]}


  28. Counting
    ●Counts occurrences of some value or pattern.
    ●Emits the searched value as a key and 1 as a value.
    ●Can be optimized by pre-counting values in the mapper,
    as shown in the sketch below.
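
    A minimal sketch of that pre-counting optimization (a hand-rolled
    combiner), assuming the same gzipped access-log input as the example
    on the following slides; the mapper sums matches locally and emits a
    single pre-aggregated pair per file, so far fewer intermediate pairs
    cross the shuffle & sort stage:

    def mapfn(index, input_path):
        import csv
        import gzip
        count = 0
        with gzip.open(input_path, 'r') as f:
            for row in csv.reader(f, delimiter=' '):
                if 'MSIE' in row[9]:  # field with index 9 is the user agent
                    count += 1
        yield 'MSIE', count  # one pair per file instead of one per request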


  29. Counting example
    Find the number of requests made by Internet Explorer
    family browsers.
    As data sets, use gzipped daily Apache access logs.


  30. Counting example (1/3)
    def mapfn(index, input_path):
        import csv
        import gzip
        with gzip.open(input_path, 'r') as f:
            reader = csv.reader(f, delimiter=' ')
            for row in reader:
                if 'MSIE' in row[9]:  # field with index 9 is the user agent
                    yield 'MSIE', 1


  31. Counting example (2/3)
    def reducefn(browser_family, counts):
        return sum(counts)


  32. Counting example (3/3)
    import mincemeat
    import glob
    s = mincemeat.Server()
    s.datasource = dict(enumerate(glob.glob('logs/*.gz')))
    s.mapfn = mapfn
    s.reducefn = reducefn
    results = s.run_server(password="4dev")
    print results
    {'MSIE': 7499}


  33. Numerical summarizations
    ●Because of the programmatic nature of Map-Reduce, all
    kinds of calculations can be applied in the reducer.
    ●The mapper groups data by some criterion; the reducer
    does all the math.
    ●Very helpful for reporting.


  34. Summarizations example
    For each sensor, calculate the mean, stdev, min and max
    temperatures on a monthly basis.


  35. Summarizations example (1/3)
    def mapfn(index, input_path):
        import datetime
        f = open(input_path, 'r')
        sensor_name = input_path.split('/')[-1].split('.')[0]
        for line in f:
            try:
                date, temperature = line.strip().split(',')
                temperature = float(temperature)
                date = datetime.datetime.strptime(date, "%Y-%m-%dT%H:%M:%SZ")
                yield '%s-%d-%d' % (sensor_name, date.year, date.month), temperature
            except ValueError:
                pass  # for CSV headers or dirty data
        f.close()


  36. Summarizations example (2/3)
    def reducefn(key, values):
        import numpy
        return {
            'mean': round(numpy.mean(values), 2),
            'stdev': round(numpy.std(values), 2),
            'min': round(numpy.min(values), 2),
            'max': round(numpy.max(values), 2)
        }


  37. Summarizations example (3/3)
    import mincemeat
    import glob
    s = mincemeat.Server()
    s.datasource = dict(enumerate(glob.glob('data/*.csv')))
    s.mapfn = mapfn
    s.reducefn = reducefn
    results = s.run_server(password="4dev")
    print results
    {'sensor-office-2013-7': {'mean': 24.32, 'max': 32.1, 'min': 20.0,
    'stdev': 1.65}, [...]


  38. Indexing
    ●The famous inverted index.
    ●It's like the index of terms at the end of a book.
    ●Can map terms to URLs, files, or wherever they came
    from.
    ●The mapper extracts terms from the input and yields them
    one by one together with their origin (a URL or file
    location, for example).
    ●Remember to exclude stop words.


  39. Indexing example
    For a given set of text files (fragments of Pan Tadeusz),
    create an inverted index and figure out in which files the
    word 'Wojski' occurs. Exclude common Polish stop words.


  40. Indexing example (1/3)
    def mapfn(index, input_path):
        import string
        # list is way longer, shortened for brevity
        stop_words = ['a', 'aby', 'ach', 'acz', 'aczkolwiek']
        f = open(input_path, 'r')
        text = f.read()
        f.close()
        punctuation = set(string.punctuation)
        text = ''.join(ch for ch in text if ch not in punctuation)
        for word in set(text.lower().split()):  # split on any whitespace
            if word not in stop_words:
                yield word, input_path


  41. Indexing example (2/3)
    def reducefn(key, values):
        return values


  42. Indexing example (3/3)
    # -*- coding: utf-8 -*-
    import mincemeat
    import glob
    s = mincemeat.Server()
    s.datasource = dict(enumerate(glob.glob('lorem-ipsum/*.txt')))
    s.mapfn = mapfn
    s.reducefn = reducefn
    index = s.run_server(password="4dev")
    print index['wojski']
    # prints: ['lorem-ipsum/part4.txt', 'lorem-ipsum/part2.txt']


  43. Combining datasets
    ●Joins different types of data sets.
    ●All data sets go to the mapper.
    ●The mapper recognizes the type of input (based on column
    count or data characteristics).
    ●The mapper emits a common key (like a foreign key in SQL
    databases) and the data to join.
    ●The reducer joins them and returns the result; a sketch
    follows below.
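
    A minimal sketch of such a reduce-side join, with two hypothetical CSV
    data sets: users.csv rows look like 'id,name' and orders.csv rows look
    like 'user_id,amount'. The mapper tags each record with its type so the
    reducer can match them up under the shared key:

    def mapfn(index, input_path):
        is_users = 'users' in input_path  # recognize the type of input
        for line in open(input_path, 'r'):
            fields = line.strip().split(',')
            if is_users:
                yield fields[0], ('user', fields[1])   # key = user id
            else:
                yield fields[0], ('order', fields[1])  # key = foreign key

    def reducefn(user_id, values):
        names = [v for (kind, v) in values if kind == 'user']
        orders = [v for (kind, v) in values if kind == 'order']
        return {'name': names[0] if names else None, 'orders': orders}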


  44. … and many more
    ●Graph processing,
    ●Sorting,
    ●Distinct values,
    ●Bucketing,
    ●Cross-correlation,
    ●Processing binary data (PDF, images, video),
    ●All kinds of data processing for machine learning
    (recommendations, clustering, classification).


  45. Real-life use cases


  46. NY Times public domain articles
    digitization
    “As part of eliminating TimesSelect, The New York Times has decided to make all the public domain articles
    from 1851–1922 available free of charge. These articles are all in the form of images scanned from the original
    paper. In fact from 1851–1980, all 11 million articles are available as images in PDF format. To generate a PDF
    version of the article takes quite a bit of work — each article is actually composed of numerous smaller TIFF
    images that need to be scaled and glued together in a coherent fashion.”
    “I was ready to deploy Hadoop. […] I churned through all 11 million articles in just under 24 hours using 100 EC2
    instances, and generated another 1.5TB of data to store in S3. ”
    source: http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/


  47. PageRank by Google
    “PageRank is an algorithm used by Google Search to rank websites in their search engine results. PageRank
    was named after Larry Page,[1] one of the founders of Google. PageRank is a way of measuring the importance
    of website pages. According to Google:
    PageRank works by counting the number and quality of links to a page to determine a rough estimate of how
    important the website is. The underlying assumption is that more important websites are likely to receive
    more links from other websites.”
    Google used the Map-Reduce model to calculate PageRank values for pages stored on GFS (the ancestor of HDFS).
    Source: http://pl.wikipedia.org/wiki/PageRank
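
    A minimal sketch of one PageRank iteration expressed as map and reduce,
    assuming a tiny hypothetical in-memory link graph and the simplified
    formula rank = 0.15 + 0.85 * sum(contributions): each mapper spreads a
    page's current rank across its outgoing links, and the reducer sums the
    contributions arriving at each page.

    from collections import defaultdict

    def mapfn(page, links_and_rank):
        links, rank = links_and_rank
        for target in links:            # spread this page's rank
            yield target, rank / len(links)
        yield page, 0.0                 # keep pages with no inbound links

    def reducefn(page, contributions):
        return 0.15 + 0.85 * sum(contributions)

    graph = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}
    ranks = dict((page, 1.0) for page in graph)

    # one iteration, with the shuffle stage simulated in-process
    grouped = defaultdict(list)
    for page in graph:
        for target, share in mapfn(page, (graph[page], ranks[page])):
            grouped[target].append(share)
    ranks = dict((p, reducefn(p, vals)) for p, vals in grouped.items())
    print ranks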


  48. Google Maps data generation
    Locating roads connected to a given intersection, rendering map tiles, finding POIs for an address.
    Lots of interesting algorithms implemented with Map-Reduce.
    Source: https://courses.cs.washington.edu/courses/cse490h/11wi/CSE490H_files/mapr-design.pdf
    Source: http://web.cs.wpi.edu/~cs4513/d08/OtherStuff/MapReduce-TeamC_files/v3_document.htm


  49. Conclusion


  50. The Map-Reduce approach allows implementing most
    SQL database operations, as long as the data set contains
    the data needed to perform them.


  51. Both the map and reduce steps are high-level language
    functions, so their usage is limited only by the developer's
    imagination, which makes the model very flexible.


  52. The Map-Reduce approach is scalable and well suited
    to huge amounts of data.


  53. Thank you!


  54. Contact
    [email protected]
    https://twitter.com/sznapka
    http://sznapka.pl


  55. Join me at Cherry Poland
    If you're a Senior PHP developer,
    you're excited about high load and non-CRUD business logic,
    and you like working in an international team,
    we should talk!
    #mongo #php #python #zf2 #angularJS #redis #memcached #rabbitMQ
    #scalability #datawarehouse #amazon #redshift #continuous-delivery #scrum


  56. Sources
    https://docs.google.com/a/knowlabs.com/document/d/1OV91FO5FejjbvUw2XMi6J3emkgnrnlFB7nnharOttHg/view
    http://cecs.wright.edu/~tkprasad/courses/cs707/ProgrammingHadoop.pdf
    http://web.cs.wpi.edu/~cs4513/d08/OtherStuff/MapReduce-TeamC_files/v3_document.htm
    https://courses.cs.washington.edu/courses/cse490h/11wi/CSE490H_files/mapr-design.pdf
    http://courses.cs.washington.edu/courses/cse524/08wi/slides/mapreduce-lec.pdf
    http://www.umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf
    http://www.josemalvarez.es/web/wp-content/uploads/2013/04/MapReduce.png
    http://www.webforefront.com/bigdata/mapreduceintro.html
    http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/
    http://pages.cs.wisc.edu/~remzi/Classes/537/Fall2008/Notes/mapreduce.pdf
    http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
