
Using Functional Programming for efficient Data Processing and Analysis


A PyCon workshop on Functional Programming
Video: https://www.youtube.com/watch?v=9kDUTJahXBM
Code: https://github.com/reubano/pycon17-tute

Reuben Cummings

May 17, 2017



Transcript

  1. Using Functional Programming for
    efficient Data Processing and Analysis
    PyCon — Portland, Oregon — May 17, 2017
    by Reuben Cummings
    @reubano


  2. Part I


  3. • Managing Director, Nerevu Development
    • Founder of Arusha Coders
    • Author of several popular Python packages
    Who am I?


  4. Hands-on workshop
    (don't be a spectator)
    Image Credit
    www.flickr.com/photos/16210667@N02


  5. what are you looking to get
    out of this workshop?


  6. What is data? Image Credit
    www.flickr.com/photos/147437926@N08


  7. Organization
    structured:
    room  presenter
    1     matt
    3     james
    6     reuben

    unstructured:
    "You can't afford to have security be an
    optional or 'nice-to-have'..."

  8. Storage
    flat:
    type,day
    tutorial,wed
    talk,fri
    poster,sun
    keynote,fri

    binary:
    00103e0 b0e6 04...
    00105f0 e4e7 03...
    0010600 0be8 04...
    00105b0 c4e4 02...
    00106e0 b0e9 04...

  9. Organization vs Storage
    flat
    binary
    structured
    unstructured


  10. How do you process data? Image Credit
    www.flickr.com/photos/sugagaga


  11. E — extract
    T — transform
    L — load


  12. ETL: Extract
    sources Python


  13. ETL: Extract
    sources Python


  14. ETL: Transform
    Python Python


  15. ETL: Transform
    Python Python


  16. ETL: Load
    Python destination


  17. ETL: Load
    Python destination

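The three stages diagrammed above compose into a single lazy pipeline. A minimal stdlib-only sketch (the in-memory CSV source and the 'type' field are illustrative, not from the workshop code):

```python
import csv
import io

# E — extract: pull rows out of a source (here an in-memory CSV file)
def extract(f):
    return csv.DictReader(f)

# T — transform: lazily reshape each row (uppercase the 'type' field)
def transform(rows):
    return ({**row, 'type': row['type'].upper()} for row in rows)

# L — load: push the transformed rows into a destination
def load(rows, out):
    rows = iter(rows)
    first = next(rows)  # peek at one row to learn the field names
    writer = csv.DictWriter(out, fieldnames=list(first))
    writer.writeheader()
    writer.writerow(first)
    writer.writerows(rows)

source = io.StringIO('type,day\ntutorial,wed\ntalk,fri')
dest = io.StringIO()
load(transform(extract(source)), dest)
print(dest.getvalue().splitlines())  # → ['type,day', 'TUTORIAL,wed', 'TALK,fri']
```

Because every stage is a generator, nothing is read or written until `load` starts pulling rows through the pipeline.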

  18. What is functional programming? Image Credit
    www.flickr.com/photos/shonk


  19. Let's make a rectangle!


  20. Rectangle (imperative)

    class Rectangle(object):
        def __init__(self, length, width):
            self.length = length
            self.width = width

        @property
        def area(self):
            return self.length * self.width

        def grow(self, amount):
            self.length *= amount

  21. Rectangle (imperative)
    >>> r = Rectangle(2, 3)
    >>> r.length
    2
    >>> r.area
    6
    >>> r.grow(2)
    >>> r.length
    4
    >>> r.area
    12


  22. Expensive Rectangle (imperative)

    from time import sleep

    class ExpensiveRectangle(Rectangle):
        @property
        def area(self):
            sleep(5)
            return self.length * self.width

  23. Expensive Rectangle (imperative)
    >>> r = ExpensiveRectangle(2, 3)
    >>> r.area
    6
    >>> r.area
    6


  24. Infinite Squares (imperative)

    def sum_area(rects):
        area = 0
        for r in rects:
            area += r.area
        return area

  25. Infinite Squares (imperative)

    >>> from itertools import count
    >>>
    >>> squares = (
    ...     Rectangle(x, x) for x in count(1))
    >>> squares
    <generator object <genexpr> at 0x11233ca40>
    >>> next(squares)
    <__main__.Rectangle at 0x1123a8400>

  26. Infinite Squares (imperative)

    >>> sum_area(squares)
    KeyboardInterrupt          Traceback (most recent call last)
    <ipython-input> in <module>()
    ----> 1 sum_area(squares)

    <ipython-input> in sum_area(rects)
          3
          4     for r in rects:
    ----> 5         area += r.area

  27. Now let's get functional!


  28. Rectangle (functional)

    def make_rect(length, width):
        return (length, width)

    def grow_rect(rect, amount):
        return (rect[0] * amount, rect[1])

    def get_length(rect):
        return rect[0]

    def get_area(rect):
        return rect[0] * rect[1]

  29. Rectangle (functional)

    >>> r = make_rect(2, 3)
    >>> get_length(r)
    2
    >>> get_area(r)
    6
    >>> grow_rect(r, 2)
    (4, 3)
    >>> get_length(r)
    2
    >>> get_area(r)
    6

  30. Rectangle (functional)
    >>> big_r = grow_rect(r, 2)
    >>> get_length(big_r)
    4
    >>> get_area(big_r)
    12


  31. Expensive Rectangle (functional)

    from functools import lru_cache

    @lru_cache()
    def exp_get_area(rect):
        sleep(5)
        return rect[0] * rect[1]

  32. Expensive Rectangle (functional)
    >>> r = make_rect(2, 3)
    >>> exp_get_area(r)
    6
    >>> exp_get_area(r)
    6

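The cache can be seen directly by timing the two calls. This sketch shortens the delay to half a second and uses `perf_counter`; note that `lru_cache` only works here because the functional rectangle is a hashable tuple:

```python
from functools import lru_cache
from time import sleep, perf_counter

@lru_cache()
def exp_get_area(rect):
    sleep(0.5)  # stand-in for an expensive computation
    return rect[0] * rect[1]

r = (2, 3)  # a functional rectangle: just a hashable tuple

start = perf_counter()
exp_get_area(r)  # first call: computed, takes at least 0.5s
first = perf_counter() - start

start = perf_counter()
exp_get_area(r)  # second call: answered from the cache
second = perf_counter() - start

print(second < first)  # → True
```

Trying the same decorator on the imperative `Rectangle` would fail differently: the instance is hashable but mutable, so a cached `area` would go stale after `grow`.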

  33. Infinite Squares (functional)

    def accumulate_area(rects):
        accum = 0
        for r in rects:
            accum += get_area(r)
            yield accum

  34. Infinite Squares (functional)
    >>> from itertools import islice
    >>>
    >>> squares = (
    ... make_rect(x, x) for x in count(1))
    >>>
    >>> area = accumulate_area(squares)
    >>> next(islice(area, 6, 7))
    140
    >>> next(area)
    204


  35. Infinite Squares (functional)
    >>> from itertools import accumulate
    >>>
    >>> squares = (
    ... make_rect(x, x) for x in count(1))
    >>>
    >>> area = accumulate(map(get_area, squares))
    >>> next(islice(area, 6, 7))
    140
    >>> next(area)
    204

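Because both versions expose the running total as a lazy stream, the infinite sequence can also be cut off declaratively. `takewhile` (an aside, not shown in the slides) consumes the stream only while a predicate holds:

```python
from itertools import accumulate, count, takewhile

def make_rect(length, width):
    return (length, width)

def get_area(rect):
    return rect[0] * rect[1]

squares = (make_rect(x, x) for x in count(1))
area = accumulate(map(get_area, squares))

# consume the infinite stream only while the running total stays under 100
bounded = list(takewhile(lambda a: a < 100, area))
print(bounded)  # → [1, 5, 14, 30, 55, 91]
```

This is the payoff of the functional rewrite: `sum_area` hung forever on the same input, while the lazy version terminates as soon as the condition fails.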

  36. Exercise #1 Image Credit: Me


  37. Exercise #1 (Problem)
    x
    y z


  38. Exercise #1 (Problem)
    x × factor
    y h


  39. Exercise #1 (Problem)
    x
    y z
    x × factor
    y h


  40. Exercise #1 (Problem)
    (z ÷ h)
    z
    h


  41. Exercise #1 (Problem)
    x
    y z
    x × factor
    y h
    ratio =


  42. Exercise #1 (Problem)
    z = √(x² + y²)
    ratio = function1(x, y, factor)
    hyp = function2(rectangle)


  43. Exercise #1 (Problem)
    z = √(x² + y²)
    x
    y z
    x × factor
    y h
    ratio = function1(x, y, factor)
    hyp = function2(rectangle)
    >>> get_ratio(1, 2, 2)
    0.7905694150420948


  44. Exercise #1 (Solution)

    from math import sqrt, pow

    def get_hyp(rect):
        sum_s = sum(pow(r, 2) for r in rect)
        return sqrt(sum_s)

    def get_ratio(length, width, factor=1):
        rect = make_rect(length, width)
        big_rect = grow_rect(rect, factor)
        return get_hyp(rect) / get_hyp(big_rect)

  45. Exercise #1 (Solution)
    >>> get_ratio(1, 2, 2)
    0.7905694150420948
    >>> get_ratio(1, 2, 3)
    0.6201736729460423
    >>> get_ratio(3, 4, 2)
    0.6933752452815365
    >>> get_ratio(3, 4, 3)
    0.5076730825668095


  46. Part II


  47. You might not need pandas Image Credit
    www.flickr.com/photos/harlequeen


  48. Obtaining data


  49. csv data
    >>> from csv import DictReader
    >>> from io import StringIO
    >>>
    >>> csv_str = 'Type,Day\ntutorial,wed\ntalk,fri'
    >>> csv_str += '\nposter,sun'
    >>> f = StringIO(csv_str)
    >>> data = DictReader(f)
    >>> dict(next(data))
    {'Day': 'wed', 'Type': 'tutorial'}


  50. JSON data
    >>> from urllib.request import urlopen
    >>> from ijson import items
    >>>
    >>> json_url = 'https://api.github.com/users'
    >>> f = urlopen(json_url)
    >>> data = items(f, 'item')
    >>> next(data)
    {'avatar_url': 'https://avatars3.githubuserco…',
    'events_url': 'https://api.github.com/users/…',
    'followers_url': 'https://api.github.com/use…',
    'following_url': 'https://api.github.com/use…',


  51. pip install xlrd


  52. xls(x) data
    >>> from urllib.request import urlretrieve
    >>> from xlrd import open_workbook
    >>>
    >>> xl_url = 'https://github.com/reubano/meza'
    >>> xl_url += '/blob/master/data/test/test.xlsx'
    >>> xl_url += '?raw=true'
    >>> xl_path = urlretrieve(xl_url)[0]
    >>> book = open_workbook(xl_path)
    >>> sheet = book.sheet_by_index(0)
    >>> header = sheet.row_values(0)


  53. xls(x) data
    >>> nrows = range(1, sheet.nrows)
    >>> rows = (sheet.row_values(x) for x in nrows)
    >>> data = (
    ... dict(zip(header, row)) for row in rows)
    >>>
    >>> next(data)
    {' ': ' ',
    'Some Date': 30075.0,
    'Some Value': 234.0,
    'Sparse Data': 'Iñtërnâtiônàližætiøn',
    'Unicode Test': 'Ādam'}


  54. Transforming data


  55. grouping data
    >>> import itertools as it
    >>> from operator import itemgetter
    >>>
    >>> records = [
    ... {'item': 'a', 'amount': 200},
    ... {'item': 'b', 'amount': 200},
    ... {'item': 'c', 'amount': 400}]
    >>>
    >>> keyfunc = itemgetter('amount')
    >>> _sorted = sorted(records, key=keyfunc)
    >>> groups = it.groupby(_sorted, keyfunc)


  56. grouping data
    >>> data = ((key, list(g)) for key, g in groups)
    >>> next(data)
    (200, [{'amount': 200, 'item': 'a'},
    {'amount': 200, 'item': 'b'}])

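One detail worth calling out: `it.groupby` only merges *consecutive* items, which is why the records are sorted first. A quick illustration of what happens without the sort (the record values here are a variation on the slide's):

```python
import itertools as it
from operator import itemgetter

records = [
    {'item': 'a', 'amount': 200},
    {'item': 'c', 'amount': 400},
    {'item': 'b', 'amount': 200}]  # the two 200s are not adjacent

keyfunc = itemgetter('amount')

# without sorting, groupby splits the non-adjacent 200s into two groups
unsorted_keys = [k for k, g in it.groupby(records, keyfunc)]
print(unsorted_keys)  # → [200, 400, 200]

# sorting first makes equal keys adjacent, giving the intended grouping
sorted_keys = [k for k, g in it.groupby(sorted(records, key=keyfunc), keyfunc)]
print(sorted_keys)  # → [200, 400]
```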

  57. aggregating data

    >>> key = 'amount'
    >>> value = sum(r.get(key, 0) for r in records)
    >>> {**records[0], key: value}
    {'item': 'a', 'amount': 800}

  58. Storing data


  59. csv files

    >>> from csv import DictWriter
    >>>
    >>> records = [
    ...     {'item': 'a', 'amount': 200},
    ...     {'item': 'b', 'amount': 400}]
    >>>
    >>> header = list(records[0].keys())
    >>> with open('output.csv', 'w') as f:
    ...     w = DictWriter(f, header)
    ...     w.writeheader()
    ...     w.writerows(records)

  60. Introducing meza


  61. pip install meza


  62. Obtaining data


  63. csv data
    >>> from meza.io import read
    >>>
    >>> records = read('output.csv')
    >>> next(records)
    {'amount': '200', 'item': 'a'}


  64. JSON data
    >>> from meza.io import read_json
    >>>
    >>> f = urlopen(json_url)
    >>> records = read_json(f, path='item')
    >>> next(records)
    {'avatar_url': 'https://avatars3.githubuserco…',
    'events_url': 'https://api.github.com/users/…',
    'followers_url': 'https://api.github.com/use…',
    'following_url': 'https://api.github.com/use…',

    }


  65. xlsx data
    >>> from meza.io import read_xls
    >>>
    >>> records = read_xls(xl_path)
    >>> next(records)
    {'Some Date': '1982-05-04',
    'Some Value': '234.0',
    'Sparse Data': 'Iñtërnâtiônàližætiøn',
    'Unicode Test': 'Ādam'}


  66. Transforming data


  67. aggregation

    >>> from meza.process import aggregate
    >>>
    >>> records = [
    ...     {'a': 'item', 'amount': 200},
    ...     {'a': 'item', 'amount': 300},
    ...     {'a': 'item', 'amount': 400}]
    >>>
    >>> aggregate(records, 'amount', sum)
    {'a': 'item', 'amount': 900}

  68. merging
    >>> from meza.process import merge
    >>>
    >>> records = [
    ... {'a': 200}, {'b': 300}, {'c': 400}]
    >>>
    >>> merge(records)
    {'a': 200, 'b': 300, 'c': 400}


  69. grouping
    >>> from meza.process import group
    >>>
    >>> records = [
    ... {'item': 'a', 'amount': 200},
    ... {'item': 'a', 'amount': 200},
    ... {'item': 'b', 'amount': 400}]
    >>>
    >>> groups = group(records, 'item')
    >>> next(groups)


  70. normalization
    >>> from meza.process import normalize
    >>>
    >>> records = [
    ... {
    ... 'color': 'blue', 'setosa': 5,
    ... 'versi': 6
    ... }, {
    ... 'color': 'red', 'setosa': 3,
    ... 'versi': 5
    ... }]


  71. normalization
    >>> kwargs = {
    ... 'data': 'length', 'column':'species',
    ... 'rows': ['setosa', 'versi']}
    >>>
    >>> data = normalize(records, **kwargs)
    >>> next(data)
    {'color': 'blue', 'length': 5, 'species':
    'setosa'}


  72. normalization

    before:
    color  setosa  versi
    blue   5       6
    red    3       5

    after:
    color  length  species
    blue   5       setosa
    blue   6       versi
    red    3       setosa
    red    5       versi

  73. Storing data


  74. csv files
    >>> from meza import convert as cv
    >>> from meza.io import write
    >>>
    >>> records = [
    ... {'item': 'a', 'amount': 200},
    ... {'item': 'b', 'amount': 400}]
    >>>
    >>> csv = cv.records2csv(records)
    >>> write('output.csv', csv)


  75. JSON files
    >>> json = cv.records2json(records)
    >>> write('output.json', json)


  76. Exercise #2 Image Credit: Me


  77. Exercise #2 (Problem)
    • create a list of dicts with keys "factor", "length",
    "width", and "ratio" (for factors 1 - 20)


  78. Exercise #2 (Problem)
    records = [
    {
    'factor': 1, 'length': 2, 'width': 2,
    'ratio': 1.0
    }, {
    'factor': 2, 'length': 2, 'width': 2,
    'ratio': 0.6324…
    }, {
    'factor': 3, 'length': 2, 'width': 2,
    'ratio': 0.4472…}
    ]


  79. Exercise #2 (Problem)
    • create a list of dicts with keys "factor", "length",
    "width", and "ratio" (for factors 1 - 20)
    • group the records by quartiles of the "ratio" value,
    and aggregate each group by the median "ratio"


  80. Exercise #2 (Problem)
    from statistics import median
    from meza.process import group
    records[0]['ratio'] // .25

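The `// .25` hint works because floor division maps each ratio to its quartile bucket. A tiny sketch of the idea (the sample ratios echo the medians from the solution slides; `quartile_key` is a hypothetical name for the hint expression):

```python
def quartile_key(ratio):
    # floor division maps [0, .25) -> 0.0, [.25, .5) -> 1.0,
    # [.5, .75) -> 2.0, [.75, 1) -> 3.0, and exactly 1.0 -> 4.0
    return ratio // .25

print(quartile_key(0.108))  # → 0.0
print(quartile_key(0.343))  # → 1.0
print(quartile_key(0.632))  # → 2.0
print(quartile_key(1.0))    # → 4.0
```

A ratio of exactly 1.0 (factor 1) lands in its own bucket, 4.0, which is why the solution's results table has a key of 4 but no key of 3.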

  81. Exercise #2 (Problem)
    • create a list of dicts with keys "factor", "length",
    "width", and "ratio" (for factors 1 - 20)
    • group the records by quartiles of the "ratio" value,
    and aggregate each group by the median "ratio"
    • write the records out to a csv file (1 row per group)


  82. Exercise #2 (Problem)
    from meza.convert import records2csv
    from meza.io import write
    key median
    0 0.108…
    1 0.343…


  83. Exercise #2 (Solution)

    >>> length = width = 2
    >>> records = [
    ...     {
    ...         'length': length,
    ...         'width': width,
    ...         'factor': f,
    ...         'ratio': get_ratio(length, width, f)
    ...     } for f in range(1, 21)]

  84. Exercise #2 (Solution)

    >>> from statistics import median
    >>> from meza import process as pr
    >>>
    >>> def aggregator(group):
    ...     ratios = (g['ratio'] for g in group)
    ...     return median(ratios)
    >>>
    >>> kwargs = {'aggregator': aggregator}
    >>> gkeyfunc = lambda r: r['ratio'] // .25
    >>> groups = pr.group(
    ...     records, gkeyfunc, **kwargs)

  85. Exercise #2 (Solution)
    >>> from meza import convert as cv
    >>> from meza.io import write
    >>>
    >>> results = [
    ... {'key': k, 'median': g}
    ... for k, g in groups]
    >>>
    >>> csv = cv.records2csv(results)
    >>> write('results.csv', csv)


  86. Exercise #2 (Solution)
    $ csvlook results.csv
    | key | median |
    | --- | ------ |
    | 0 | 0.108… |
    | 1 | 0.343… |
    | 2 | 0.632… |
    | 4 | 1.000… |


  87. Part III


  88. Introducing riko


  89. pip install riko


  90. Obtaining data


  91. Python Events
    Calendar
    https://www.python.org/events/python-
    events/


  92. Python Events
    Calendar
    https://www.python.org/events/python-
    events/


  93. Python Events Calendar
    >>> from riko.collections import SyncPipe
    >>>
    >>> url = 'www.python.org/events/python-events/'
    >>> _xpath = '/html/body/div/div[3]/div/section'
    >>> xpath = '{}/div/div/ul/li'.format(_xpath)
    >>> xconf = {'url': url, 'xpath': xpath}
    >>> kwargs = {'emit': False, 'token_key': None}
    >>> epath = 'h3.a.content'
    >>> lpath = 'p.span.content'
    >>> rrule = [{'field': 'h3'}, {'field': 'p'}]


  94. Python Events Calendar
    >>> flow = (
    ... SyncPipe('xpathfetchpage', conf=xconf)
    ... .subelement(
    ... conf={'path': epath},
    ... assign='event', **kwargs)
    ... .subelement(
    ... conf={'path': lpath},
    ... assign='location', **kwargs)
    ... .rename(conf={'rule': rrule}))


  95. Python Events Calendar
    >>> stream = flow.output
    >>> next(stream)
    {'event': 'PyDataBCN 2017',
    'location': 'Barcelona, Spain'}
    >>> next(stream)
    {'event': 'PyConWEB 2017',
    'location': 'Munich, Germany'}


  96. Transforming data


  97. Python Events Calendar
    >>> dpath = 'p.time.datetime'
    >>> frule = {
    ... 'field': 'date', 'op': 'after',
    ... 'value':'2017-06-01'}
    >>>
    >>> flow = (
    ... SyncPipe('xpathfetchpage', conf=xconf)
    ... .subelement(
    ... conf={'path': epath},
    ... assign='event', **kwargs)


  98. Python Events Calendar
    ... .subelement(
    ... conf={'path': lpath},
    ... assign='location', **kwargs)
    ... .subelement(
    ... conf={'path': dpath},
    ... assign='date', **kwargs)
    ... .rename(conf={'rule': rrule})
    ... .filter(conf={'rule': frule}))


  99. Python Events Calendar
    >>> stream = flow.output
    >>> next(stream)
    {'date': '2017-06-06T00:00:00+00:00',
    'event': 'PyCon Taiwan 2017',
    'location': 'Academia Sinica, 128 Academia
    Road, Section 2, Nankang, Taipei 11529, Taiwan'}


  100. Parallel processing


  101. Python Events Calendar
    >>> from meza.process import merge
    >>> from riko.collections import SyncCollection
    >>>
    >>> _type = 'xpathfetchpage'
    >>> source = {'url': url, 'type': _type}
    >>> xpath2 = '{}/div/ul/li'.format(_xpath)
    >>> sources = [
    ... merge([source, {'xpath': xpath}]),
    ... merge([source, {'xpath': xpath2}])]


  102. Python Events Calendar
    >>> sc = SyncCollection(sources, parallel=True)
    >>> flow = (sc.pipe()
    ... .subelement(
    ... conf={'path': epath},
    ... assign='event', **kwargs)
    ... .rename(conf={'rule': rrule}))
    >>>
    >>> stream = flow.list
    >>> stream[0]
    {'event': 'PyDataBCN 2017'}


  103. Python Events Calendar
    >>> stream[-1]
    {'event': 'PyDays Vienna 2017'}


  104. Exercise #3 Image Credit: Me


  105. Exercise #3 (Problem)
    • fetch the Python jobs rss feed
    • tokenize the "summary" field by newlines ("\n")
    • use "subelement" to extract the location (the first
    "token")
    • filter for jobs located in the U.S.


  106. Exercise #3 (Problem)
    from riko.collections import SyncPipe
    url = 'https://www.python.org/jobs/feed/rss'
    # use the 'fetch', 'tokenizer', 'subelement',
    # and 'filter' pipes


  107. Exercise #3 (Problem)
    • write the 'link', 'location', and 'title' fields of each
    record to a json file


  108. Exercise #3 (Problem)
    from meza.fntools import dfilter
    from meza.convert import records2json
    from meza.io import write


  109. Exercise #3 (Solution)
    >>> from riko.collections import SyncPipe
    >>>
    >>> url = 'https://www.python.org/jobs/feed/rss'
    >>> fetch_conf = {'url': url}
    >>> tconf = {'delimiter': '\n'}
    >>> rule = {
    ... 'field': 'location', 'op': 'contains'}
    >>> vals = ['usa', 'united states']
    >>> frule = [
    ... merge([rule, {'value': v}])
    ... for v in vals]


  110. Exercise #3 (Solution)
    >>> fconf = {'rule': frule, 'combine': 'or'}
    >>> kwargs = {'emit': False, 'token_key': None}
    >>> path = 'location.content.0'
    >>> rrule = [
    ... {'field': 'summary'},
    ... {'field': 'summary_detail'},
    ... {'field': 'author'},
    ... {'field': 'links'}]


  111. Exercise #3 (Solution)
    >>> flow = (SyncPipe('fetch', conf=fetch_conf)
    ... .tokenizer(
    ... conf=tconf, field='summary',
    ... assign='location')
    ... .subelement(
    ... conf={'path': path},
    ... assign='location', **kwargs)
    ... .filter(conf=fconf)
    ... .rename(conf={'rule': rrule}))


  112. Exercise #3 (Solution)
    >>> stream = flow.list
    >>> stream[0]
    {'dc:creator': None,
    'id': 'https://python.org/jobs/2570/',
    'link': 'https://python.org/jobs/2570/',
    'location': 'College Park,MD,USA',
    'title': 'Python Developer - MarketSmart',
    'title_detail': 'Python Developer -
    MarketSmart',
    'y:published': None,
    'y:title': 'Python Developer - MarketSmart'}


  113. Exercise #3 (Solution)
    >>> from meza import convert as cv
    >>> from meza.fntools import dfilter
    >>> from meza.io import write
    >>>
    >>> fields = ['link', 'location', 'title']
    >>> records = [
    ... dfilter(
    ... item, blacklist=fields,
    ... inverse=True)
    ... for item in stream]


  114. Exercise #3 (Solution)
    >>> json = cv.records2json(records)
    >>> write('pyjobs.json', json)
    $ head -n7 pyjobs.json
    [
    {
    "link": "https://python.org/jobs/2570/",
    "location": "College Park,MD,USA",
    "title": "Python Developer - MarketSmart"
    },
    {


  115. Thank you!
    Reuben Cummings
    @reubano


  116. Extra Slides Image Credit
    www.flickr.com/photos/jeremybrooks


  117. Infinite Squares (functional)

    def accumulate_area2(rects, accum=0):
        it = iter(rects)
        try:
            area = get_area(next(it))
        except StopIteration:
            return
        accum += area
        yield accum
        yield from accumulate_area2(it, accum)
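A quick check that this recursive variant produces the same running totals as the earlier loop and `accumulate` versions; `get_area` and the infinite square generator are reproduced inline so the sketch is self-contained:

```python
from itertools import count, islice

def get_area(rect):
    return rect[0] * rect[1]

def accumulate_area2(rects, accum=0):
    it = iter(rects)
    try:
        area = get_area(next(it))
    except StopIteration:
        return  # ends the generator when the input runs dry
    accum += area
    yield accum
    yield from accumulate_area2(it, accum)

squares = ((x, x) for x in count(1))
print(list(islice(accumulate_area2(squares), 7)))  # → [1, 5, 14, 30, 55, 91, 140]
```

Note that each yielded value adds a level of generator delegation, so very deep streams are better served by the loop or `accumulate` versions.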