
A Functional Programming Approach To Data Processing In Python

LambdaConf Workshop on Functional Programming with Python

Reuben Cummings

May 26, 2017

Transcript

  1. A FUNCTIONAL PROGRAMMING APPROACH TO DATA PROCESSING IN PYTHON

    By Reuben Cummings
    LambdaConf, Boulder, Colorado, May 26, 2017
  2. Reuben Cummings λ @reubano λ #LambdaConf

    Who am I?
    Managing Director, Nerevu Development
    Founder, Arusha Coders
    Author of several popular Python packages
  3. WHAT IS DATA?

    SAY BIG DATA ONE MORE TIME
    I dare you, I double dare you!
    Image Credit: www.emaze.com
  4. Organization

    structured:

    | language | presenter |
    | -------- | --------- |
    | mercury  | alex      |
    | scala    | gleb      |
    | haskell  | michael   |

    unstructured:

    "This session seeks to entertain and teach the developer who is already..."
  5. Storage

    flat/text:

    type,duration
    leap,360
    hop,120
    de novo,60
    inspire,10

    binary:

    00103e0 b0e6 04...
    00105f0 e4e7 03...
    0010600 0be8 04...
    00105b0 c4e4 02...
    00106e0 b0e9 04...
  6. Spotify's Discover Weekly

    playlist of new songs you like
    adapts to user's shifting musical tastes
    handles outliers and seasonality
    Image Credit: www.spotify.com/int/discoverweekly/
  7. Reading data (naive)

    from urllib.request import urlopen
    from json import loads

    BASE = 'https://api.github.com/search'
    _url1 = '{}/repositories?q={}'
    q = 'data&per_page=100'
    url1 = _url1.format(BASE, q)
    f = urlopen(url1)
  10. Reading data (naive)

    >>> data = loads(f.read().decode('utf-8'))
    >>> repos = data['items']
    >>> repos[0]['description']
    'Jargon from the functional programming world in simple terms!'
    >>> repos[0]['full_name']
    'hemanth/functional-programming-jargon'
  11. Processing data (naive)

    def rate(repos):
        rated = []
        for repo in repos:
            rated.append(repo['watchers'] * 2)
        return rated
  15. Processing data (naive)

    >>> rate(repos)[:5]
    [36520, 30174, 28576, 26842, 24092]
  16. Processing infinite data (naive)

    >>> from itertools import count
    >>>
    >>> inf_repos = (
    ...     {'watchers': c} for c in count())
  17. Processing infinite data (naive)

    >>> from itertools import count
    >>>
    >>> inf_repos = (
    ...     {'watchers': c} for c in count())
    >>>
    >>> rate(inf_repos)
  18. Processing infinite data (naive)

    >>> from itertools import count
    >>>
    >>> inf_repos = (
    ...     {'watchers': c} for c in count())
    >>>
    >>> rate(inf_repos)
    KeyboardInterrupt    Traceback (most recent call last)
    <ipython-input-212-e2ea27b0be2f> in <module>()
  19. Processing expensive data (naive)

    def rate(repos):
        rated = []
        for repo in repos:
            rated.append(repo['watchers'] * 2)
        return rated
  20. Processing expensive data (naive)

    def exp_rate(repos):
        rated = []
        for repo in repos:
            rated.append(repo['watchers'] * 2)
        return rated
  21. Processing expensive data (naive)

    from time import sleep

    def exp_rate(repos):
        rated = []
        for repo in repos:
            rated.append(repo['watchers'] * 2)
        return rated
  22. Processing expensive data (naive)

    from time import sleep

    def exp_rate(repos):
        rated = []
        for repo in repos:
            sleep(5)  # simulate an expensive per-repo computation
            rated.append(repo['watchers'] * 2)
        return rated
  23. Processing expensive data (naive)

    >>> exp_rate(repos)[:5]
    [36520, 30174, 28576, 26842, 24092]
  24. Iterators

    >>> eager_list = list(range(5))
    >>> eager_list
    [0, 1, 2, 3, 4]
    >>> lazy_list = iter(eager_list)
    >>> lazy_list
    <list_iterator at 0x10c2af978>
    >>> next(lazy_list)
    0
  25. Iterators

    >>> list(lazy_list)
    [1, 2, 3, 4]
    >>> next(lazy_list)
    StopIteration    Traceback (most recent call last)
    <ipython-input-68-898b6387b693> in <module>()
    ----> 1 next(lazy_list)
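
The single-pass behavior on the two Iterators slides can be reproduced without a REPL. The sketch below reuses the slide's names; `sentinel` is an added, hypothetical name that shows how passing a default to `next` avoids the `StopIteration` once the iterator is exhausted.

```python
eager_list = list(range(5))
lazy_list = iter(eager_list)

first = next(lazy_list)   # consumes the first element (0)
rest = list(lazy_list)    # consumes everything that is left

# The iterator is now exhausted. A bare next(lazy_list) would raise
# StopIteration; passing a default returns that default instead.
sentinel = next(lazy_list, None)

print(first, rest, sentinel)
print(eager_list)         # the underlying list is untouched
```

The list can be iterated again by calling `iter(eager_list)` a second time; the iterator itself cannot be rewound.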
  26. Reading data (lazy evaluation)

    >>> from ijson import items
    >>>
    >>> f = urlopen(url1)
    >>> repos = items(f, 'items.item')
    >>> repos
    <generator object items at 0x110c70db0>
    >>> repo = next(repos)
    >>> repo['full_name']
    'hemanth/functional-programming-jargon'
  27. Processing data (lazy evaluation)

    def rate(repos):
        rated = []
        for repo in repos:
            rated.append(repo['watchers'] * 2)
        return rated
  28. Processing data (lazy evaluation)

    def gen_rates(repos):
        rated = []
        for repo in repos:
            rated.append(repo['watchers'] * 2)
        return rated
  29. Processing data (lazy evaluation)

    def gen_rates(repos):
        for repo in repos:
            yield repo['watchers'] * 2
  30. Processing data (lazy evaluation)

    >>> gen_rates(repos)
    <generator object gen_rates at 0x160c70db0>
    >>> rates = gen_rates(repos)
    >>> next(rates)
    36520
    >>> next(rates)
    30174
  31. Processing infinite data (lazy evaluation)

    >>> rates = gen_rates(inf_repos)
    >>> next(rates)
    42220156
  32. Processing expensive data (lazy evaluation)

    def gen_exp_rates(repos):
        for repo in repos:
            sleep(5)
            yield repo['watchers'] * 2
  34. Processing expensive data (lazy evaluation)

    >>> from itertools import islice
    >>>
    >>> rates = gen_exp_rates(repos)
    >>> result = islice(rates, 5)
    >>> list(result)
  35. Processing expensive data (lazy evaluation)

    >>> from itertools import islice
    >>>
    >>> rates = gen_exp_rates(repos)
    >>> result = islice(rates, 5)
    >>> list(result)
    [36520, 30174, 28576, 26842, 24092]
  36. Processing expensive data (lazy evaluation)

    >>> from itertools import islice
    >>>
    >>> rates = gen_exp_rates(repos)
    >>> result = islice(rates, 5)
    >>> list(result)
    [36520, 30174, 28576, 26842, 24092]
    >>> next(rates)
    648
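
To make the cost of the lazy version visible without waiting on `sleep(5)`, the sketch below swaps the sleep for a hypothetical `calls` list that records each expensive step, and feeds the generator the infinite `inf_repos` stream from the earlier slides. This is an illustration under those assumptions, not the workshop's own code.

```python
from itertools import count, islice

calls = []  # hypothetical stand-in for sleep(5): records each expensive step


def gen_exp_rates(repos):
    for repo in repos:
        calls.append(repo['watchers'])  # the "expensive" work happens here
        yield repo['watchers'] * 2


inf_repos = ({'watchers': c} for c in count())
rates = gen_exp_rates(inf_repos)

result = list(islice(rates, 5))  # pulls exactly five items, then stops
steps_after_slice = len(calls)

following = next(rates)          # the generator resumes where it left off
steps_after_next = len(calls)

print(result, steps_after_slice, following, steps_after_next)
```

Only the items that are actually requested pay the expensive step, which is why the same function works on an infinite source where the list-building `rate` never returns.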
  37. Grouping data

    >>> f = urlopen(url1)
    >>> repos = items(f, 'items.item')
    >>> repo = next(repos)
    >>> repo.keys()
    dict_keys(['id', 'name', 'full_name', 'owner', 'private', 'html_url', 'description', 'fork', 'url', 'forks_url', 'keys_url', ...])
  38. Grouping data

    >>> import itertools as it
    >>> from operator import itemgetter
    >>>
    >>> keyfunc = itemgetter('has_issues')
    >>> sorted_repos = sorted(repos, key=keyfunc)
    >>> grouped = it.groupby(
    ...     sorted_repos, keyfunc)
    >>> data = (
    ...     (k, len(list(g))) for k, g in grouped)
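
The `sorted` call on the Grouping data slide matters because `itertools.groupby` only merges adjacent items with equal keys. The sketch below shows the difference on a small, made-up list of repos (the `full_name` values are hypothetical).

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample records standing in for the GitHub search results.
repos = [
    {'full_name': 'a/one', 'has_issues': True},
    {'full_name': 'b/two', 'has_issues': False},
    {'full_name': 'c/three', 'has_issues': True},
]
keyfunc = itemgetter('has_issues')

# Without sorting, the two True records are not adjacent and form two groups.
unsorted_counts = [
    (k, len(list(g))) for k, g in groupby(repos, keyfunc)]

# Sorting by the same key first yields one group per distinct key.
sorted_counts = [
    (k, len(list(g)))
    for k, g in groupby(sorted(repos, key=keyfunc), keyfunc)]

print(unsorted_counts)
print(sorted_counts)
```

Note that `sorted` materializes the whole input, so this step gives up laziness in exchange for correct groups.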
  39. Processing expensive data (memoization)

    def gen_exp_rates(repos):
        for repo in repos:
            sleep(5)
            yield repo['watchers'] * 2
  40. Processing expensive data (memoization)

    def calc_rate(watchers):
        sleep(5)
        return watchers * 2

    def gen_exp_rates(repos):
        for repo in repos:
            yield calc_rate(repo['watchers'])
  41. Processing expensive data (memoization)

    from functools import lru_cache

    def _calc_rate(watchers):
        sleep(5)
        return watchers * 2

    cacher = lru_cache()
    calc_rate = cacher(_calc_rate)
  42. Processing expensive data (memoization)

    from functools import lru_cache

    @lru_cache()
    def calc_rate(watchers):
        sleep(5)
        return watchers * 2

    def gen_exp_rates(repos):
        for repo in repos:
            yield calc_rate(repo['watchers'])
  43. Processing expensive data (memoization)

    >>> repos = it.repeat({'watchers': 5})
    >>> rates = gen_exp_rates(repos)
    >>> result = islice(rates, 5)
    >>> list(result)
    [10, 10, 10, 10, 10]
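
One way to confirm that the cached function only does the expensive work once is `cache_info()`, which `functools.lru_cache` attaches to the wrapped function. The sketch below drops the `sleep(5)` so it runs instantly; otherwise it follows the memoization slides.

```python
from functools import lru_cache
from itertools import islice, repeat


@lru_cache()
def calc_rate(watchers):
    # The slides call sleep(5) here; omitted so the example runs instantly.
    return watchers * 2


def gen_exp_rates(repos):
    for repo in repos:
        yield calc_rate(repo['watchers'])


repos = repeat({'watchers': 5})
result = list(islice(gen_exp_rates(repos), 5))
info = calc_rate.cache_info()

print(result)
print(info.hits, info.misses)
```

The cache key is the `watchers` integer rather than the repo dict, which is why the slides factor `calc_rate` out of the generator: dicts are not hashable and could not be cached directly.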
  44. Exercise #1: Problem

    display the total # of watchers per language
    (ignore repos w/o a language)
  45. Exercise #1: Result

    C# 32
    C++ 63
    HTML 349
    JavaScript 3881
    Jupyter Notebook 5481
    PHP 201
    Python 37007
    R 18
  46. Exercise #1: Data source

    https://api.github.com/search/repositories?q=data
  47. Exercise #1: Jupyter Notebook

    beta.mybinder.org/v2/gh/reubano/lambdaconf-tutorial/master
    (exercises.ipynb)
  48. Exercise #1: Solution

    from urllib.request import urlopen
    from itertools import groupby
    from operator import itemgetter
    from ijson import items

    url2 = '{}/repositories?q=data'.format(BASE)
    f = urlopen(url2)
    repos = items(f, 'items.item')
  49. Exercise #1: Solution

    keyfunc = itemgetter('language')
    cleaned = filter(keyfunc, repos)
    records = sorted(cleaned, key=keyfunc)
    grouped = groupby(records, keyfunc)

    for key, group in grouped:
        cnt = sum(g['watchers'] for g in group)
        print(key, cnt)
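
The Exercise #1 solution sorts before grouping, which loads every record into memory. As an alternative (not the workshop's solution), a single pass with `collections.Counter` keeps only one running total per language. The sample records below are hypothetical.

```python
from collections import Counter

# Hypothetical sample records; the real ones come from the GitHub search API.
repos = [
    {'language': 'Python', 'watchers': 10},
    {'language': None, 'watchers': 99},
    {'language': 'R', 'watchers': 3},
    {'language': 'Python', 'watchers': 5},
]

totals = Counter()

for repo in repos:
    # Skip repos without a language, as the exercise asks.
    if repo['language']:
        totals[repo['language']] += repo['watchers']

for language in sorted(totals):
    print(language, totals[language])
```

Because the loop consumes `repos` one item at a time, it also works on a lazy source such as the `ijson` generator used in the solution.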
  50. Meza demo: Jupyter Notebook

    beta.mybinder.org/v2/gh/reubano/lambdaconf-tutorial/master
    (presentation.ipynb)
  51. Reading data

    >>> from urllib.request import urlopen
    >>> from meza.io import read_json
    >>>
    >>> f = urlopen(url2)
    >>> records = read_json(f, path='items.item')
    >>> repo = next(records)
    >>> repo['full_name']
    'emberjs/data'
  52. Reading data

    >>> from io import StringIO
    >>> from meza.io import read_csv
    >>>
    >>> f = StringIO(
    ...     'greeting,location\nhello,world\n')
    >>>
    >>> next(read_csv(f))
    {'greeting': 'hello', 'location': 'world'}
  53. Reading data

    >>> from os import path as p
    >>> from meza.io import join
    >>>
    >>> url3 = '{}&page=2'.format(url2)
    >>> files = map(urlopen, [url2, url3])
    >>> records = join(
    ...     *files, ext='json', path='items.item')
  54. Reading data

    >>> repo = next(records)
    >>> repo['full_name']
    'emberjs/data'
  55. Reading data

    >>> repo = next(records)
    >>> repo['full_name']
    'emberjs/data'
    >>> repo['language']
    'JavaScript'
  56. Reading data

    >>> repo = next(records)
    >>> repo['full_name']
    'emberjs/data'
    >>> repo['language']
    'JavaScript'
    >>> len(list(records))
    59
  57. Transforming data

    >>> from meza.process import merge
    >>>
    >>> records = [
    ...     {'a': 200}, {'b': 300}, {'c': 400}]
    >>>
    >>> merge(records)
    {'a': 200, 'b': 300, 'c': 400}
  58. Transforming data

    >>> from meza.process import group
    >>>
    >>> records = [
    ...     {'item': 'a', 'amount': 200},
    ...     {'item': 'a', 'amount': 200},
    ...     {'item': 'b', 'amount': 400}]
    >>>
    >>> grouped = group(records, 'item')
  59. Transforming data

    >>> key, _group = next(grouped)
    >>> key
    'a'
    >>> _group
    [{'amount': 200, 'item': 'a'},
     {'amount': 200, 'item': 'a'}]
  60. Transforming data

    >>> from meza import process as pr
    >>>
    >>> f = urlopen(url2)
    >>> raw = read_json(f, path='items.item')
    >>> fields = [
    ...     'full_name', 'language', 'watchers',
    ...     'score', 'has_wiki']
    >>>
    >>> cut = pr.cut(raw, fields)
  61. Transforming data

    >>> cut
    <generator object cut.<locals>.<genexpr> at 0x10b0410f8>
    >>> cut, preview = pr.peek(cut)
    >>> cut
    <itertools.chain at 0x10c2ad5f8>
    >>> len(preview)
    5
  62. Transforming data

    >>> preview[0]
    {'full_name': 'substance/data',
     'has_wiki': True,
     'language': 'JavaScript',
     'score': Decimal('72.90926'),
     'watchers': 678}
  63. Transforming data

    >>> filled = pr.fillempty(
    ...     raw, value='', fields=['language'])
    >>>
    >>> pivoted = pr.pivot(
    ...     filled, 'score', 'language',
    ...     rows=['has_wiki'], op=min)
  64. Transforming data

    >>> next(pivoted)
    {'HTML': Decimal('73.52254'),
     'JavaScript': Decimal('53.48755'),
     'PHP': Decimal('41.3122'),
     'Python': Decimal('42.49319'),
     'has_wiki': False}
  65. Transforming data

    >>> next(pivoted)
    {'': Decimal('44.83392'),
     'C#': Decimal('47.793495'),
     'HTML': Decimal('69.20008'),
     'JavaScript': Decimal('70.15174'),
     'PHP': Decimal('44.251198'),
     'Python': Decimal('45.78215'),
     'R': Decimal('46.23451'),
     'has_wiki': True}
  66. Transforming data (before)

    | full_name  | language   | score  | has_wiki |
    | ---------- | ---------- | ------ | -------- |
    | 'aptnote…' | ''         | 76.11… | True     |
    | 'GSA/dat…' | 'HTML'     | 73.52… | False    |
    | 'substan…' | 'JavaScr…' | 72.83… | True     |
    | 'GoogleT…' | 'JavaScr…' | 70.15… | True     |
    | 'curran/…' | 'HTML'     | 69.20… | True     |
  67. Transforming data (after)

    | has_wiki | ''       | HTML     | JavaScript |
    | -------- | -------- | -------- | ---------- |
    | False    |          | 73.52254 |            |
    | True     | 76.11933 | 69.20008 | 70.15174   |
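
The before/after tables describe a pivot: one output row per `has_wiki` value, one column per language, and the minimum score in each cell. The sketch below shows that reshaping with the standard library only. It is not how `meza` implements `pr.pivot`; `pivot_min` and the sample scores are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample records with rounded, made-up scores.
records = [
    {'language': 'HTML', 'score': 73.5, 'has_wiki': False},
    {'language': 'HTML', 'score': 69.2, 'has_wiki': True},
    {'language': 'JavaScript', 'score': 72.8, 'has_wiki': True},
    {'language': 'JavaScript', 'score': 70.1, 'has_wiki': True},
]


def pivot_min(records, data, column, row):
    """Spread `column` values into keys, keeping the min `data` per cell."""
    cells = defaultdict(dict)

    for record in records:
        cell = cells[record[row]]
        key = record[column]
        value = record[data]
        cell[key] = min(cell[key], value) if key in cell else value

    # One output record per distinct row value.
    return [{**values, row: row_value} for row_value, values in cells.items()]


pivoted = pivot_min(records, 'score', 'language', 'has_wiki')

for record in pivoted:
    print(record)
```

Unlike the generator that `pr.pivot` returns on the slides, this version builds the full result in memory before returning it.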
  68. Exercise #2: Problem

    display the language with the most # of watchers
    per owner_type per has_pages
  69. Exercise #2: Result (partial)

    {'has_pages': True,
     'language': 'JavaScript',
     'owner_type': 'Organization',
     'watchers': 128605}
  70. Exercise #2: Data source

    https://api.github.com/search/repositories?q=data&sort=stars&order=desc
  71. Exercise #2: Hint

    from meza.fntools import flatten

    # and one of the following
    from meza.process import normalize  # this
    from meza.process import aggregate  # or this
  72. Exercise #2: Jupyter Notebook

    beta.mybinder.org/v2/gh/reubano/lambdaconf-tutorial/master
    (exercises.ipynb)
  73. Exercise #2: Solution

    from urllib.request import urlopen
    from operator import itemgetter
    from functools import partial
    from meza import process as pr, fntools as ft
    from meza.io import read_json

    q = 'data&sort=stars&order=desc'
    url4 = '{}/repositories?q={}'.format(BASE, q)
    f = urlopen(url4)
  74. Exercise #2: Solution

    records = read_json(f, path='items.item')
    filled = pr.fillempty(
        records, value='', fields=['language'])
    flat = (dict(ft.flatten(r)) for r in filled)
    args = ('watchers', 'language')
    rows = ['has_pages', 'owner_type']
  75. Exercise #2: Solution

    spun = pr.pivot(
        flat, *args, rows=rows, op=sum)
    spun, preview = pr.peek(spun)
  76. Exercise #2: Solution

    >>> preview[0]
    {'C#': 7675,
     'C++': 55602,
     'Go': 13223,
     'Objective-C': 10556,
     …
     'has_pages': False,
     'owner_type': 'Organization'}
  77. Exercise #2: Solution

    >>> kw = {'rows': rows, 'invert': True}
    >>> normal = pr.normalize(spun, *args, **kw)
    >>> normal, preview = pr.peek(normal)
    >>> preview[0]
    {'has_pages': False,
     'language': 'Objective-C',
     'owner_type': 'Organization',
     'watchers': 10556}
  78. Exercise #2: Solution

    akeyfunc = itemgetter('watchers')
    gkeyfunc = lambda x: tuple(x[r] for r in rows)
    aggregator = partial(max, key=akeyfunc)
    kwargs = {
        'tupled': False, 'aggregator': aggregator}
    grouped = pr.group(normal, gkeyfunc, **kwargs)
  79. Exercise #2: Solution

    >>> grouped, preview = pr.peek(grouped)
    >>> preview[0]
    {'has_pages': False,
     'language': 'C++',
     'owner_type': 'Organization',
     'watchers': 55602}
  80. Exercise #2: Solution

    sgrouped = sorted(
        grouped, key=akeyfunc, reverse=True)

    for record in sgrouped:
        print(record)
  81. Exercise #2: Result (full)

    | language  | watchers | owner_ty… | has_pages |
    | --------- | -------- | --------- | --------- |
    | 'JavaS…'  | 128605   | 'Organi…' | True      |
    | 'C++'     | 55602    | 'Organi…' | False     |
    | 'Python'  | 54269    | 'User'    | False     |
    | 'Jupyte…' | 12046    | 'User'    | True      |
  82. Processing data (lazy evaluation)

    def gen_rates(repos):
        for repo in repos:
            yield repo['watchers'] * 2
  83. Processing data (lazy evaluation)

    def gen_rates(repos):
        return (
            r['watchers'] * 2 for r in repos)
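
The two `gen_rates` definitions above are interchangeable from the caller's point of view: both return a lazy iterator over the same values. The sketch below checks this on a small, hypothetical input; the `_loop` and `_expr` suffixes are added only to keep both versions in one module.

```python
def gen_rates_loop(repos):
    # Generator function: the body runs only as values are requested.
    for repo in repos:
        yield repo['watchers'] * 2


def gen_rates_expr(repos):
    # Generator expression: the same lazy behavior, written as one expression.
    return (r['watchers'] * 2 for r in repos)


repos = [{'watchers': 1}, {'watchers': 2}, {'watchers': 3}]

from_loop = list(gen_rates_loop(repos))
from_expr = list(gen_rates_expr(repos))

print(from_loop, from_expr)
```

The expression form returns a generator without a loop body, which makes it easy to chain with other expressions such as the `flat = (...)` line in the Exercise #2 solution.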
  84. Exercise #2: Alt. solution

    from urllib.request import urlopen
    from operator import itemgetter
    from functools import partial
    from meza import process as pr, fntools as ft
    from meza.io import read_json

    q = 'data&sort=stars&order=desc'
    url4 = '{}/repositories?q={}'.format(BASE, q)
    f = urlopen(url4)
  85. Exercise #2: Alt. solution

    records = read_json(f, path='items.item')
    filled = pr.fillempty(
        records, value='', fields=['language'])
    flat = (dict(ft.flatten(r)) for r in filled)
    akeyfunc = itemgetter('watchers')
  86. Exercise #2: Alt. solution

    rows = [
        'has_pages', 'owner_type', 'language', 'watchers']

    def grouper(records, rows, aggregator):
        kwargs = {'aggregator': aggregator}
        key = lambda x: tuple(x[r] for r in rows)
        _grouper = partial(pr.group, tupled=False)
        return _grouper(records, key, **kwargs)
  87. Exercise #2: Alt. solution

    def agg1(records):
        args = (records, 'watchers', sum)
        return pr.aggregate(*args)

    grouped = grouper(flat, rows[:3], agg1)
    agg2 = partial(max, key=akeyfunc)
    regrouped = grouper(grouped, rows[:2], agg2)
    cut = pr.cut(regrouped, rows)
  88. Exercise #2: Alt. solution

    >>> cut, preview = pr.peek(cut)
    >>> preview[0]
    {'has_pages': False,
     'language': 'C++',
     'owner_type': 'Organization',
     'watchers': 55602}
  89. Exercise #2: Alt. solution

    sgrouped = sorted(
        cut, key=akeyfunc, reverse=True)

    for record in sgrouped:
        print(record)
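
Both Exercise #2 solutions follow the same two steps: total the watchers per (has_pages, owner_type, language), then keep the language with the largest total in each (has_pages, owner_type) group. For comparison, the sketch below performs those two steps with the standard library instead of `meza`. The sample records are hypothetical and assume the nested `owner` → `type` field that the solutions flatten into `owner_type`.

```python
from collections import Counter

# Hypothetical sample records shaped like GitHub search results.
repos = [
    {'has_pages': False, 'owner': {'type': 'Organization'},
     'language': 'C++', 'watchers': 5},
    {'has_pages': False, 'owner': {'type': 'Organization'},
     'language': 'C++', 'watchers': 7},
    {'has_pages': False, 'owner': {'type': 'Organization'},
     'language': 'Go', 'watchers': 10},
    {'has_pages': True, 'owner': {'type': 'User'},
     'language': 'Python', 'watchers': 4},
]

# Step 1: total watchers per (has_pages, owner_type, language).
totals = Counter()

for repo in repos:
    key = (repo['has_pages'], repo['owner']['type'], repo['language'])
    totals[key] += repo['watchers']

# Step 2: keep the top language within each (has_pages, owner_type) group.
best = {}

for (has_pages, owner_type, language), watchers in totals.items():
    group = (has_pages, owner_type)

    if group not in best or watchers > best[group]['watchers']:
        best[group] = {
            'has_pages': has_pages, 'language': language,
            'owner_type': owner_type, 'watchers': watchers}

result = sorted(
    best.values(), key=lambda record: record['watchers'], reverse=True)

for record in result:
    print(record)
```

The first loop streams over the input and stores only one counter per distinct key, so it stays cheap even when the source is a lazy reader.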