A Functional Programming Approach To Data Processing In Python

LambdaConf Workshop on Functional Programming with Python

Reuben Cummings

May 26, 2017

Transcript

  1. A FUNCTIONAL PROGRAMMING APPROACH TO
    DATA PROCESSING IN PYTHON
    By Reuben Cummings
    LambdaConf, Boulder, Colorado, May 26, 2017

  2. Reuben Cummings λ @reubano λ #LambdaConf
    Who am I?
    Managing Director, Nerevu Development
    Founder, Arusha Coders
    Author of several popular Python packages

  3. WHAT IS DATA?
    SAY BIG DATA ONE MORE TIME
    I dare you, I double dare you!
    Image Credit: www.emaze.com

  4. Organization
    structured:
      | language | presenter |
      | -------- | --------- |
      | mercury  | alex      |
      | scala    | gleb      |
      | haskell  | michael   |
    unstructured:
      "This session seeks to entertain and teach the developer who is already..."

  5. Storage
    flat/text:
      type,duration
      leap,360
      hop,120
      de novo,60
      inspire,10
    binary:
      00103e0 b0e6 04...
      00105f0 e4e7 03...
      0010600 0be8 04...
      00105b0 c4e4 02...
      00106e0 b0e9 04...

  6. Organization vs Storage
    storage: flat/text, binary
    organization: structured, unstructured

  7. What is data processing good for?

  8. Spotify's Discover Weekly
    playlist of new songs you like
    adapts to user's shifting musical tastes
    handles outliers and seasonality
    Image Credit: www.spotify.com/int/discoverweekly/

  9. What is functional programming good for?

  10. Om
    rapid UI re-renders
    serializable application state
    time travel/undo
    Image Credit: circleci.com

  11. A BRIEF INTRO TO PYTHON
    Ooouuu, stickers!
    Image Credit: www.pythongear.com

  12. Presentation: GitHub repo
    github.com/reubano/lambdaconf-tutorial

  13. Presentation: Jupyter Notebook
    beta.mybinder.org/v2/gh/reubano/lambdaconf-tutorial/master
    (presentation.ipynb)

  14. Naive
    Image Credit: (Alika Seu) www.flickr.com

  15. Reading data

  16. Reading data (naive)
    from urllib.request import urlopen
    from json import loads
    BASE = 'https://api.github.com/search'
    _url1 = '{}/repositories?q={}'
    q = 'data&per_page=100'
    url1 = _url1.format(BASE, q)
    f = urlopen(url1)

  19. GitHub API
    Image Credit: https://api.github.com/search/repositories?q=data

  20. Reading data (naive)
    >>> data = loads(f.read().decode('utf-8'))
    >>> repos = data['items']
    >>> repos[0]['description']
    'Jargon from the functional programming world in simple terms!'
    >>> repos[0]['full_name']
    'hemanth/functional-programming-jargon'
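
The naive approach reads and parses the whole response before any single record can be used. A minimal offline sketch of that behaviour, assuming a small hand-written payload (`SAMPLE_PAYLOAD` and `read_repos` are hypothetical names, and the records are made up) in place of the live GitHub response:

```python
from json import loads

# Hypothetical stand-in for the bytes returned by f.read(); not real API data.
SAMPLE_PAYLOAD = (
    b'{"total_count": 2, "items": ['
    b'{"full_name": "a/one", "watchers": 3}, '
    b'{"full_name": "b/two", "watchers": 5}]}')


def read_repos(payload):
    """Decode and parse the entire payload, then return the list of repos."""
    data = loads(payload.decode('utf-8'))
    return data['items']


repos = read_repos(SAMPLE_PAYLOAD)
print(type(repos).__name__, len(repos))  # list 2
print(repos[0]['full_name'])  # a/one
```

Because `repos` is a plain list, every record is already in memory by the time the first one is inspected.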

  22. Processing data

  23. Processing data (naive)
    def rate(repos):
        rated = []
        for repo in repos:
            rated.append(repo['watchers'] * 2)
        return rated

  27. Processing data (naive)
    >>> rate(repos)[:5]
    [36520, 30174, 28576, 26842, 24092]

  28. Processing infinite data (naive)
    >>> from itertools import count
    >>>
    >>> inf_repos = (
    ...     {'watchers': c} for c in count())
    >>>
    >>> rate(inf_repos)  # never returns; interrupted by hand
    Traceback (most recent call last):
      ...
    KeyboardInterrupt

  31. Processing expensive data (naive)
    from time import sleep
    def exp_rate(repos):
        rated = []
        for repo in repos:
            sleep(5)  # simulate an expensive computation
            rated.append(repo['watchers'] * 2)
        return rated

  35. Processing expensive data (naive)
    >>> exp_rate(repos)[:5]
    [36520, 30174, 28576, 26842, 24092]

  37. Lazy evaluation
    Image Credit: (Mark Turnauckas) www.flickr.com

  38. Iterators
    >>> eager_list = list(range(5))
    >>> eager_list
    [0, 1, 2, 3, 4]
    >>> lazy_list = iter(eager_list)
    >>> lazy_list
    <list_iterator object at 0x...>
    >>> next(lazy_list)
    0

  39. Iterators
    >>> list(lazy_list)
    [1, 2, 3, 4]
    >>> next(lazy_list)
    Traceback (most recent call last):
      ...
    StopIteration
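
The iterator slides can be condensed into one small script. This is a sketch of the same session; the only addition is the two-argument form of `next`, which returns a default instead of raising `StopIteration` once the iterator is exhausted:

```python
eager_list = list(range(5))
lazy_list = iter(eager_list)

first = next(lazy_list)           # consumes 0
rest = list(lazy_list)            # consumes everything that is left
sentinel = next(lazy_list, None)  # exhausted, so the default is returned

print(first, rest, sentinel)  # 0 [1, 2, 3, 4] None
print(eager_list)  # [0, 1, 2, 3, 4]
```

The underlying list is untouched; only the iterator's position moves, and it cannot be rewound.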

  40. Reading data

  41. Reading data (lazy evaluation)
    $ pip install ijson

  42. Reading data (lazy evaluation)
    >>> from ijson import items
    >>>
    >>> f = urlopen(url1)
    >>> repos = items(f, 'items.item')
    >>> repos  # a lazy iterator, not a list
    >>> repo = next(repos)
    >>> repo['full_name']
    'hemanth/functional-programming-jargon'

  43. Processing data

  44. Processing data (lazy evaluation)
    def rate(repos):
        rated = []
        for repo in repos:
            rated.append(repo['watchers'] * 2)
        return rated

  46. Processing data (lazy evaluation)
    def gen_rates(repos):
        for repo in repos:
            yield repo['watchers'] * 2

  47. Processing data (lazy evaluation)
    >>> gen_rates(repos)
    <generator object gen_rates at 0x...>
    >>> rates = gen_rates(repos)
    >>> next(rates)
    36520
    >>> next(rates)
    30174

  48. Processing infinite data (lazy evaluation)
    >>> rates = gen_rates(inf_repos)
    >>> next(rates)
    42220156
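
A self-contained sketch of the generator version running against an infinite stream. It reuses the `gen_rates` definition from the slides; the stream here is freshly created, so its values start at 0 rather than at the number shown on the slide:

```python
from itertools import count, islice


def gen_rates(repos):
    """Yield one rate at a time instead of building a list."""
    for repo in repos:
        yield repo['watchers'] * 2


inf_repos = ({'watchers': c} for c in count())
rates = gen_rates(inf_repos)

first_five = list(islice(rates, 5))  # only five records are ever produced
print(first_five)  # [0, 2, 4, 6, 8]
print(next(rates))  # 10
```

Nothing is computed until a value is requested, which is why the infinite input is no longer a problem.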

  49. Processing expensive data (lazy evaluation)
    def gen_exp_rates(repos):
        for repo in repos:
            sleep(5)
            yield repo['watchers'] * 2

  51. Processing expensive data (lazy evaluation)
    >>> from itertools import islice
    >>>
    >>> rates = gen_exp_rates(repos)
    >>> result = islice(rates, 5)
    >>> list(result)
    [36520, 30174, 28576, 26842, 24092]
    >>> next(rates)
    648
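
To see how much work the lazy version actually does, the sketch below swaps the `sleep(5)` call for a list that records every processed item. The names `calls` and `gen_tracked_rates` and the sample records are assumptions made for illustration:

```python
from itertools import islice

calls = []  # records which repos were actually processed


def gen_tracked_rates(repos):
    for repo in repos:
        calls.append(repo['watchers'])  # stands in for the sleep(5) cost
        yield repo['watchers'] * 2


repos = [{'watchers': n} for n in range(100)]
rates = gen_tracked_rates(repos)
result = list(islice(rates, 5))

print(result)  # [0, 2, 4, 6, 8]
print(len(calls))  # 5
```

Only the five requested items pay the per-item cost, whereas the eager `exp_rate` pays it for all 100 before the slice is taken.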

  54. Grouping data

  55. Grouping data
    >>> f = urlopen(url1)
    >>> repos = items(f, 'items.item')
    >>> repo = next(repos)
    >>> repo.keys()
    dict_keys(['id', 'name', 'full_name', 'owner',
               'private', 'html_url',
               'description', 'fork', 'url',
               'forks_url', 'keys_url', ...])

  56. Grouping data
    >>> repo['has_issues']
    True

  57. Grouping data
    >>> import itertools as it
    >>> from operator import itemgetter
    >>>
    >>> keyfunc = itemgetter('has_issues')
    >>> sorted_repos = sorted(repos, key=keyfunc)
    >>> grouped = it.groupby(
    ...     sorted_repos, keyfunc)
    >>> data = (
    ...     (k, len(list(g))) for k, g in grouped)

  58. Grouping data
    >>> next(data)
    (False, 3)
    >>> next(data)
    (True, 96)
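
The `sorted` call before `groupby` matters: `itertools.groupby` only merges adjacent items that share a key. A small sketch with made-up records shows the difference:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records; only the 'has_issues' field matters here.
repos = [
    {'name': 'a', 'has_issues': True},
    {'name': 'b', 'has_issues': False},
    {'name': 'c', 'has_issues': True},
]
keyfunc = itemgetter('has_issues')

# Without sorting, the two True records land in separate groups.
unsorted_counts = [
    (k, len(list(g))) for k, g in groupby(repos, keyfunc)]

# Sorting first makes equal keys adjacent, so each key appears once.
sorted_counts = [
    (k, len(list(g)))
    for k, g in groupby(sorted(repos, key=keyfunc), keyfunc)]

print(unsorted_counts)  # [(True, 1), (False, 1), (True, 1)]
print(sorted_counts)  # [(False, 1), (True, 2)]
```

Note that `sorted` consumes its whole input, so this step is eager even when the source is a lazy iterator.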

  59. Memoization
    Image Credit: (olho wodzynski) www.flickr.com

  60. Processing data

  61. Processing expensive data (memoization)
    def gen_exp_rates(repos):
        for repo in repos:
            sleep(5)
            yield repo['watchers'] * 2

  62. Processing expensive data (memoization)
    def calc_rate(watchers):
        sleep(5)
        return watchers * 2
    def gen_exp_rates(repos):
        for repo in repos:
            yield calc_rate(repo['watchers'])

  63. Processing expensive data (memoization)
    from functools import lru_cache
    def _calc_rate(watchers):
        sleep(5)
        return watchers * 2
    cacher = lru_cache()
    calc_rate = cacher(_calc_rate)

  64. Processing expensive data (memoization)
    from functools import lru_cache
    @lru_cache()
    def calc_rate(watchers):
        sleep(5)
        return watchers * 2
    def gen_exp_rates(repos):
        for repo in repos:
            yield calc_rate(repo['watchers'])

  65. Processing expensive data (memoization)
    >>> repos = it.repeat({'watchers': 5})
    >>> rates = gen_exp_rates(repos)
    >>> result = islice(rates, 5)
    >>> list(result)
    [10, 10, 10, 10, 10]
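
The effect of `lru_cache` can be checked without waiting on `sleep`. In this sketch a counter (`calls`, an assumption for illustration) replaces the expensive step, and `cache_info()` reports the hits and misses:

```python
from functools import lru_cache
from itertools import islice, repeat

calls = 0  # how many times the "expensive" body really ran


@lru_cache()
def calc_rate(watchers):
    global calls
    calls += 1  # stands in for sleep(5)
    return watchers * 2


def gen_rates(repos):
    for repo in repos:
        yield calc_rate(repo['watchers'])


repos = repeat({'watchers': 5})
result = list(islice(gen_rates(repos), 5))
info = calc_rate.cache_info()

print(result)  # [10, 10, 10, 10, 10]
print(calls)  # 1
print(info.hits, info.misses)  # 4 1
```

The cache is keyed on the function's arguments, which is why the slides pass the hashable `watchers` value rather than the whole `repo` dict.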

  66. EXERCISE #1
    Mount Meru — Arusha, Tanzania
    Image Credit: Reuben Cummings

  67. Exercise #1: Problem
    display the total # of watchers per language
    (ignore repos w/o a language)

  68. Exercise #1: Result
    C# 32
    C++ 63
    HTML 349
    JavaScript 3881
    Jupyter Notebook 5481
    PHP 201
    Python 37007
    R 18

  69. Exercise #1: Data source
    https://api.github.com/search/repositories?q=data

  70. Exercise #1: Jupyter Notebook
    beta.mybinder.org/v2/gh/reubano/lambdaconf-tutorial/master
    (exercises.ipynb)

  71. Exercise #1: Solution
    from urllib.request import urlopen
    from itertools import groupby
    from operator import itemgetter
    from ijson import items
    url2 = '{}/repositories?q=data'.format(BASE)
    f = urlopen(url2)
    repos = items(f, 'items.item')

  72. Exercise #1: Solution
    keyfunc = itemgetter('language')
    cleaned = filter(keyfunc, repos)
    records = sorted(cleaned, key=keyfunc)
    grouped = groupby(records, keyfunc)
    for key, group in grouped:
        cnt = sum(g['watchers'] for g in group)
        print(key, cnt)
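
The same solution can be tried offline. This sketch wraps the slide's steps in a generator (`watchers_per_language`, a hypothetical name) and feeds it a few made-up records instead of the GitHub response:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records, shaped like GitHub search results.
repos = [
    {'language': 'Python', 'watchers': 10},
    {'language': None, 'watchers': 99},
    {'language': 'R', 'watchers': 2},
    {'language': 'Python', 'watchers': 5},
]


def watchers_per_language(repos):
    keyfunc = itemgetter('language')
    cleaned = filter(keyfunc, repos)  # drops repos whose language is falsy
    records = sorted(cleaned, key=keyfunc)
    for key, group in groupby(records, keyfunc):
        yield key, sum(g['watchers'] for g in group)


totals = list(watchers_per_language(repos))
print(totals)  # [('Python', 15), ('R', 2)]
```

Using `filter` with the key function itself works because a missing language is `None`, which is falsy.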

  73. Exercise #1: Solution
    beta.mybinder.org/v2/gh/reubano/lambdaconf-tutorial/master
    (solutions.ipynb)

  74. INTRODUCING MEZA
    Because you might not need Pandas
    Image Credit: github.com/reubano/meza

  75. Meza demo
    $ pip install meza

  76. Meza demo: Jupyter Notebook
    beta.mybinder.org/v2/gh/reubano/lambdaconf-tutorial/master
    (presentation.ipynb)

  77. Reading data

  78. Reading data
    >>> from urllib.request import urlopen
    >>> from meza.io import read_json
    >>>
    >>> f = urlopen(url2)
    >>> records = read_json(f, path='items.item')
    >>> repo = next(records)
    >>> repo['full_name']
    'emberjs/data'

  79. Reading data
    >>> len(list(records))
    29

  80. Reading data
    >>> from io import StringIO
    >>> from meza.io import read_csv
    >>>
    >>> f = StringIO(
    ...     'greeting,location\nhello,world\n')
    >>>
    >>> next(read_csv(f))
    {'greeting': 'hello', 'location': 'world'}
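
The slide shows `read_csv` yielding one dict per row. For comparison, here is a minimal sketch of the same records-as-dicts pattern using only the standard library's `csv.DictReader`; it is a stand-in for illustration, not meza's implementation:

```python
from csv import DictReader
from io import StringIO

f = StringIO('greeting,location\nhello,world\n')
records = DictReader(f)  # an iterator of rows keyed by the header line
row = dict(next(records))

print(row)  # {'greeting': 'hello', 'location': 'world'}
```

Like the meza reader in the slide, `DictReader` hands out rows one at a time rather than loading the whole file.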

  81. Reading data
    >>> from os import path as p
    >>> from meza.io import join
    >>>
    >>> url3 = '{}&page=2'.format(url2)
    >>> files = map(urlopen, [url2, url3])
    >>> records = join(
    ...     *files, ext='json', path='items.item')

  82. Reading data
    >>> repo = next(records)
    >>> repo['full_name']
    'emberjs/data'
    >>> repo['language']
    'JavaScript'
    >>> len(list(records))
    59
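
Joining several sources into one stream of records can be approximated with `itertools.chain`. The sketch below is a standard-library stand-in, not meza's `join`; the two in-memory pages and the `read_items` helper are assumptions made for illustration:

```python
from io import StringIO
from itertools import chain
from json import load

# Hypothetical pages standing in for two paginated API responses.
PAGE_1 = '{"items": [{"full_name": "a/one"}, {"full_name": "b/two"}]}'
PAGE_2 = '{"items": [{"full_name": "c/three"}]}'


def read_items(f):
    """Yield the records stored under the 'items' key of one JSON file."""
    yield from load(f)['items']


files = map(StringIO, [PAGE_1, PAGE_2])
records = chain.from_iterable(map(read_items, files))
names = [r['full_name'] for r in records]

print(names)  # ['a/one', 'b/two', 'c/three']
```

Each page is only opened and parsed when the stream reaches it, although `json.load` still parses a page in full once it does.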

  85. Transforming data

  86. Transforming data
    >>> from meza.process import merge
    >>>
    >>> records = [
    ...     {'a': 200}, {'b': 300}, {'c': 400}]
    >>>
    >>> merge(records)
    {'a': 200, 'b': 300, 'c': 400}

  87. Transforming data
    >>> from meza.process import group
    >>>
    >>> records = [
    ...     {'item': 'a', 'amount': 200},
    ...     {'item': 'a', 'amount': 200},
    ...     {'item': 'b', 'amount': 400}]
    >>>
    >>> grouped = group(records, 'item')

  88. Transforming data
    >>> key, _group = next(grouped)
    >>> key
    'a'
    >>> _group
    [{'amount': 200, 'item': 'a'},
     {'amount': 200, 'item': 'a'}]

  89. Transforming data
    >>> from meza import process as pr
    >>>
    >>> f = urlopen(url2)
    >>> raw = read_json(f, path='items.item')
    >>> fields = [
    ...     'full_name', 'language', 'watchers',
    ...     'score', 'has_wiki']
    >>>
    >>> cut = pr.cut(raw, fields)

  90. Transforming data
    >>> cut
    <generator object ... at 0x10b0410f8>
    >>> cut, preview = pr.peek(cut)
    >>> cut  # still an iterator, not a list
    >>> len(preview)
    5

  91. Transforming data
    >>> preview[0]
    {'full_name': 'substance/data',
     'has_wiki': True,
     'language': 'JavaScript',
     'score': Decimal('72.90926'),
     'watchers': 678}
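
Previewing a lazy stream without losing the previewed records is a common need. The sketch below shows one way to do it with the standard library; it mirrors the `(records, preview)` return shape used in the slides but is an assumption about the idea, not meza's actual code:

```python
from itertools import chain, islice


def peek(iterable, n=5):
    """Return (iterator, preview) without losing the previewed items."""
    iterator = iter(iterable)
    preview = list(islice(iterator, n))
    # Put the previewed items back in front of the untouched remainder.
    return chain(preview, iterator), preview


records = ({'watchers': w} for w in range(8))
records, preview = peek(records)

print(len(preview))  # 5
print(sum(1 for _ in records))  # 8
```

Rebinding `records` to the returned iterator is what keeps the first five items available to later steps.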

  92. Transforming data
    >>> filled = pr.fillempty(
    ...     raw, value='', fields=['language'])
    >>>
    >>> pivoted = pr.pivot(
    ...     filled, 'score', 'language',
    ...     rows=['has_wiki'], op=min)

  93. Transforming data
    >>> next(pivoted)
    {'HTML': Decimal('73.52254'),
     'JavaScript': Decimal('53.48755'),
     'PHP': Decimal('41.3122'),
     'Python': Decimal('42.49319'),
     'has_wiki': False}

  94. Transforming data
    >>> next(pivoted)
    {'': Decimal('44.83392'),
     'C#': Decimal('47.793495'),
     'HTML': Decimal('69.20008'),
     'JavaScript': Decimal('70.15174'),
     'PHP': Decimal('44.251198'),
     'Python': Decimal('45.78215'),
     'R': Decimal('46.23451'),
     'has_wiki': True}

  95. Transforming data (before)
    | full_name  | language   | score  | has_wiki |
    | ---------- | ---------- | ------ | -------- |
    | 'aptnote…' | ''         | 76.11… | True     |
    | 'GSA/dat…' | 'HTML'     | 73.52… | False    |
    | 'substan…' | 'JavaScr…' | 72.83… | True     |
    | 'GoogleT…' | 'JavaScr…' | 70.15… | True     |
    | 'curran/…' | 'HTML'     | 69.20… | True     |

  96. Transforming data (after)
    | has_wiki | ''       | HTML     | JavaScript |
    | -------- | -------- | -------- | ---------- |
    | False    |          | 73.52254 |            |
    | True     | 76.11933 | 69.20008 | 70.15174   |
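
The before/after tables describe a pivot: one output row per `has_wiki` value, one column per language, and a reducing function applied to the scores in each cell. A pure-Python sketch of that idea follows; the function below is a hypothetical illustration with made-up records, not meza's `pivot`:

```python
from collections import defaultdict

# Hypothetical records with rounded, made-up scores.
records = [
    {'language': 'HTML', 'score': 73.5, 'has_wiki': False},
    {'language': 'JavaScript', 'score': 72.8, 'has_wiki': True},
    {'language': 'JavaScript', 'score': 70.2, 'has_wiki': True},
    {'language': 'HTML', 'score': 69.2, 'has_wiki': True},
]


def pivot(records, data, column, rows, op):
    """Group by `rows`, spread `column` values into keys, reduce with `op`."""
    cells = defaultdict(lambda: defaultdict(list))
    for record in records:
        row_key = tuple(record[r] for r in rows)
        cells[row_key][record[column]].append(record[data])
    for row_key in sorted(cells):
        result = dict(zip(rows, row_key))
        for name, values in sorted(cells[row_key].items()):
            result[name] = op(values)
        yield result


pivoted = list(pivot(records, 'score', 'language', ['has_wiki'], min))
print(pivoted[0])  # {'has_wiki': False, 'HTML': 73.5}
print(pivoted[1])  # {'has_wiki': True, 'HTML': 69.2, 'JavaScript': 70.2}
```

This sketch has to read every record before it can emit the first row, since any later record may belong to an earlier group.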

  97. EXERCISE #2
    Image Credit: Reuben Cummings
    Mount Kilimanjaro — Kilimanjaro Region, Tanzania

  98. Exercise #2: Problem
    display the language with the most # of watchers
    per owner_type per has_pages

  99. Exercise #2: Result (partial)
    {'has_pages': True,
     'language': 'JavaScript',
     'owner_type': 'Organization',
     'watchers': 128605}

  100. Exercise #2: Data source
    https://api.github.com/search/repositories?q=data&sort=stars&order=desc

  101. Exercise #2: Hint
    from meza.fntools import flatten
    # and one of the following
    from meza.process import normalize  # this
    from meza.process import aggregate  # or this

  102. Exercise #2: Jupyter Notebook
    beta.mybinder.org/v2/gh/reubano/lambdaconf-tutorial/master
    (exercises.ipynb)

  103. Exercise #2: Solution
    from urllib.request import urlopen
    from operator import itemgetter
    from functools import partial
    from meza import process as pr, fntools as ft
    from meza.io import read_json
    q = 'data&sort=stars&order=desc'
    url4 = '{}/repositories?q={}'.format(BASE, q)
    f = urlopen(url4)

  104. Exercise #2: Solution
    records = read_json(f, path='items.item')
    filled = pr.fillempty(
        records, value='', fields=['language'])
    flat = (dict(ft.flatten(r)) for r in filled)
    args = ('watchers', 'language')
    rows = ['has_pages', 'owner_type']

  105. Exercise #2: Solution
    spun = pr.pivot(
        flat, *args, rows=rows, op=sum)
    spun, preview = pr.peek(spun)

  106. Exercise #2: Solution
    >>> preview[0]
    {'C#': 7675,
     'C++': 55602,
     'Go': 13223,
     'Objective-C': 10556,
     ...
     'has_pages': False,
     'owner_type': 'Organization'}

  107. Exercise #2: Solution
    >>> kw = {'rows': rows, 'invert': True}
    >>> normal = pr.normalize(spun, *args, **kw)
    >>> normal, preview = pr.peek(normal)
    >>> preview[0]
    {'has_pages': False,
     'language': 'Objective-C',
     'owner_type': 'Organization',
     'watchers': 10556}

  108. Exercise #2: Solution
    akeyfunc = itemgetter('watchers')
    gkeyfunc = lambda x: tuple(x[r] for r in rows)
    aggregator = partial(max, key=akeyfunc)
    kwargs = {
        'tupled': False, 'aggregator': aggregator}
    grouped = pr.group(normal, gkeyfunc, **kwargs)

  109. Exercise #2: Solution
    >>> grouped, preview = pr.peek(grouped)
    >>> preview[0]
    {'has_pages': False,
     'language': 'C++',
     'owner_type': 'Organization',
     'watchers': 55602}

  110. Exercise #2: Solution
    sgrouped = sorted(
        grouped, key=akeyfunc, reverse=True)
    for record in sgrouped:
        print(record)

  111. Exercise #2: Result (full)
    | language  | watchers | owner_ty… | has_pages |
    | --------- | -------- | --------- | --------- |
    | 'JavaS…'  | 128605   | 'Organi…' | True      |
    | 'C++'     | 55602    | 'Organi…' | False     |
    | 'Python'  | 54269    | 'User'    | False     |
    | 'Jupyte…' | 12046    | 'User'    | True      |
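
The logic behind the result table is two aggregations: sum the watchers per (has_pages, owner_type, language), then keep the language with the largest total in each (has_pages, owner_type) group. A standard-library sketch with made-up, already-flattened records (all names and numbers are assumptions, and meza is not used here):

```python
from collections import Counter

# Hypothetical, already-flattened records.
records = [
    {'has_pages': True, 'owner_type': 'Organization',
     'language': 'JavaScript', 'watchers': 7},
    {'has_pages': True, 'owner_type': 'Organization',
     'language': 'Python', 'watchers': 9},
    {'has_pages': True, 'owner_type': 'Organization',
     'language': 'JavaScript', 'watchers': 5},
    {'has_pages': False, 'owner_type': 'User',
     'language': 'Python', 'watchers': 4},
]
rows = ['has_pages', 'owner_type']

# Step 1: total watchers per (has_pages, owner_type, language).
totals = Counter()
for r in records:
    key = tuple(r[k] for k in rows) + (r['language'],)
    totals[key] += r['watchers']

# Step 2: keep the language with the largest total in each group.
best = {}
for (*group, language), watchers in totals.items():
    group = tuple(group)
    if group not in best or watchers > best[group]['watchers']:
        best[group] = dict(
            zip(rows, group), language=language, watchers=watchers)

results = sorted(
    best.values(), key=lambda r: r['watchers'], reverse=True)
for record in results:
    print(record)
```

JavaScript wins the first group with 12 watchers even though the single largest record there is Python with 9, which is why the summing step has to come before the `max` step.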

  112. Exercise #2: Solution
    beta.mybinder.org/v2/gh/reubano/lambdaconf-tutorial/master
    (solutions.ipynb)

  113. Thanks!
    Reuben Cummings
    @reubano

  114. Extra Slides

  115. Processing data (lazy evaluation)
    def gen_rates(repos):
        for repo in repos:
            yield repo['watchers'] * 2

  116. Processing data (lazy evaluation)
    def gen_rates(repos):
        return (
            r['watchers'] * 2 for r in repos)
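
The two spellings produce the same values. A short check, with the second version renamed to `gen_rates_expr` (a hypothetical name) so both can live in one script:

```python
def gen_rates(repos):
    # Generator function: the body runs only when values are requested.
    for repo in repos:
        yield repo['watchers'] * 2


def gen_rates_expr(repos):
    # Generator expression: same values, written as a single expression.
    return (r['watchers'] * 2 for r in repos)


repos = [{'watchers': w} for w in (1, 2, 3)]
from_function = list(gen_rates(repos))
from_expression = list(gen_rates_expr(repos))

print(from_function)  # [2, 4, 6]
print(from_expression)  # [2, 4, 6]
```

One difference worth knowing: the generator expression calls `iter()` on `repos` as soon as it is created, while the generator function does not touch `repos` until the first `next()`.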

  117. Exercise #2: Alt. solution
    from urllib.request import urlopen
    from operator import itemgetter
    from functools import partial
    from meza import process as pr, fntools as ft
    from meza.io import read_json
    q = 'data&sort=stars&order=desc'
    url4 = '{}/repositories?q={}'.format(BASE, q)
    f = urlopen(url4)

  118. Exercise #2: Alt. solution
    records = read_json(f, path='items.item')
    filled = pr.fillempty(
        records, value='', fields=['language'])
    flat = (dict(ft.flatten(r)) for r in filled)
    akeyfunc = itemgetter('watchers')

  119. Exercise #2: Alt. solution
    rows = [
        'has_pages', 'owner_type', 'language',
        'watchers']
    def grouper(records, rows, aggregator):
        kwargs = {'aggregator': aggregator}
        key = lambda x: tuple(x[r] for r in rows)
        _grouper = partial(pr.group, tupled=False)
        return _grouper(records, key, **kwargs)

  120. Exercise #2: Alt. solution
    def agg1(records):
        args = (records, 'watchers', sum)
        return pr.aggregate(*args)
    grouped = grouper(flat, rows[:3], agg1)
    agg2 = partial(max, key=akeyfunc)
    regrouped = grouper(grouped, rows[:2], agg2)
    cut = pr.cut(regrouped, rows)

  121. Exercise #2: Alt. solution
    >>> cut, preview = pr.peek(cut)
    >>> preview[0]
    {'has_pages': False,
     'language': 'C++',
     'owner_type': 'Organization',
     'watchers': 55602}

  122. Exercise #2: Alt. solution
    sgrouped = sorted(
        cut, key=akeyfunc, reverse=True)
    for record in sgrouped:
        print(record)