Slide 1

Slide 1 text

Using Functional Programming for Efficient Data Processing and Analysis — PyCon — Portland, Oregon — May 17, 2017 — by Reuben Cummings @reubano

Slide 2

Slide 2 text

Part I

Slide 3

Slide 3 text

Who am I?
• Managing Director, Nerevu Development
• Founder of Arusha Coders
• Author of several popular Python packages

Slide 4

Slide 4 text

Hands-on workshop (don't be a spectator) Image Credit www.flickr.com/photos/16210667@N02

Slide 5

Slide 5 text

what are you looking to get out of this workshop?

Slide 6

Slide 6 text

What is data? Image Credit www.flickr.com/photos/147437926@N08

Slide 7

Slide 7 text

Organization

structured (a table):

    room | presenter
    1    | matt
    3    | james
    6    | reuben

unstructured (free text):

    You can't afford to have security be an optional or "nice-to-have"...

Slide 8

Slide 8 text

Storage

flat (text):

    type,day
    tutorial,wed
    talk,fri
    poster,sun
    keynote,fri

binary (hex dump):

    00103e0 b0e6 04...
    00105f0 e4e7 03...
    0010600 0be8 04...
    00105b0 c4e4 02...
    00106e0 b0e9 04...

Slide 9

Slide 9 text

Organization vs Storage — organization (structured vs unstructured) and storage (flat vs binary) are independent axes
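The two axes can be sketched with one hypothetical example per combination (these snippets are illustrative, not from the slides):

```python
import struct

# structured + flat: delimited text (CSV)
structured_flat = "type,day\ntutorial,wed\ntalk,fri"

# unstructured + flat: free-form text
unstructured_flat = "You can't afford to have security be optional..."

# structured + binary: a fixed-layout packed record (struct module)
# "<4sB" = little-endian, 4-byte string, unsigned byte
structured_binary = struct.pack("<4sB", b"talk", 5)

# unstructured + binary: opaque bytes (e.g. the start of an image file)
unstructured_binary = bytes([0xB0, 0xE6, 0x04])

# structure means fields can be recovered by position or name
print(structured_flat.splitlines()[1].split(","))  # ['tutorial', 'wed']
print(struct.unpack("<4sB", structured_binary))    # (b'talk', 5)
```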

Slide 10

Slide 10 text

How do you process data? Image Credit www.flickr.com/photos/sugagaga

Slide 11

Slide 11 text

E — extract
T — transform
L — load
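The three stages compose naturally as functions. A minimal sketch (the `extract`/`transform`/`load` names and the uppercasing transform are illustrative, not from the slides):

```python
from csv import DictReader
from io import StringIO

def extract(csv_text):
    """E: pull records out of a source (here, CSV text)."""
    return DictReader(StringIO(csv_text))

def transform(records):
    """T: normalize each record (here, uppercase the 'type' field)."""
    return ({**r, 'type': r['type'].upper()} for r in records)

def load(records):
    """L: write records to a destination (here, just a list)."""
    return list(records)

result = load(transform(extract('type,day\ntutorial,wed\ntalk,fri')))
# result[0] == {'type': 'TUTORIAL', 'day': 'wed'}
```

Because `extract` and `transform` are lazy (iterators), records stream through the pipeline one at a time until `load` consumes them.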

Slide 12

Slide 12 text

ETL: Extract sources Python

Slide 13

Slide 13 text

ETL: Extract sources Python

Slide 14

Slide 14 text

ETL: Transform Python Python

Slide 15

Slide 15 text

ETL: Transform Python Python

Slide 16

Slide 16 text

ETL: Load Python destination

Slide 17

Slide 17 text

ETL: Load Python destination

Slide 18

Slide 18 text

What is functional programming? Image Credit www.flickr.com/photos/shonk

Slide 19

Slide 19 text

Let's make a rectangle!

Slide 20

Slide 20 text

Rectangle (imperative)

    class Rectangle(object):
        def __init__(self, length, width):
            self.length = length
            self.width = width

        @property
        def area(self):
            return self.length * self.width

        def grow(self, amount):
            self.length *= amount

Slide 21

Slide 21 text

Rectangle (imperative)

    >>> r = Rectangle(2, 3)
    >>> r.length
    2
    >>> r.area
    6
    >>> r.grow(2)
    >>> r.length
    4
    >>> r.area
    12

Slide 22

Slide 22 text

Expensive Rectangle (imperative)

    from time import sleep

    class ExpensiveRectangle(Rectangle):
        @property
        def area(self):
            sleep(5)
            return self.length * self.width

Slide 23

Slide 23 text

Expensive Rectangle (imperative)

    >>> r = ExpensiveRectangle(2, 3)
    >>> r.area
    6
    >>> r.area
    6

Slide 24

Slide 24 text

Infinite Squares (imperative)

    def sum_area(rects):
        area = 0

        for r in rects:
            area += r.area

        return area

Slide 25

Slide 25 text

Infinite Squares (imperative)

    >>> from itertools import count
    >>>
    >>> squares = (
    ...     Rectangle(x, x) for x in count(1))
    >>> squares
    <generator object <genexpr> at 0x11233ca40>
    >>> next(squares)
    <__main__.Rectangle at 0x1123a8400>

Slide 26

Slide 26 text

Infinite Squares (imperative)

    >>> sum_area(squares)
    KeyboardInterrupt            Traceback (most recent call last)
    <ipython-input> in <module>()
    ----> 1 sum_area(squares)

    <ipython-input> in sum_area(rects)
          3
          4     for r in rects:
    ----> 5         area += r.area

Slide 27

Slide 27 text

Now let's get functional!

Slide 28

Slide 28 text

Rectangle (functional)

    def make_rect(length, width):
        return (length, width)

    def grow_rect(rect, amount):
        return (rect[0] * amount, rect[1])

    def get_length(rect):
        return rect[0]

    def get_area(rect):
        return rect[0] * rect[1]

Slide 29

Slide 29 text

Rectangle (functional)

    >>> r = make_rect(2, 3)
    >>> get_length(r)
    2
    >>> get_area(r)
    6
    >>> grow_rect(r, 2)
    (4, 3)

Slide 30

Slide 30 text

Rectangle (functional)

    >>> big_r = grow_rect(r, 2)
    >>> get_length(big_r)
    4
    >>> get_area(big_r)
    12

Slide 31

Slide 31 text

Expensive Rectangle (functional)

    from functools import lru_cache

    @lru_cache()
    def exp_get_area(rect):  # works because a tuple rect is hashable
        sleep(5)
        return rect[0] * rect[1]

Slide 32

Slide 32 text

Expensive Rectangle (functional)

    >>> r = make_rect(2, 3)
    >>> exp_get_area(r)
    6
    >>> exp_get_area(r)
    6

Slide 33

Slide 33 text

Infinite Squares (functional)

    def accumulate_area(rects):
        accum = 0

        for r in rects:
            accum += get_area(r)
            yield accum

Slide 34

Slide 34 text

Infinite Squares (functional)

    >>> from itertools import islice
    >>>
    >>> squares = (
    ...     make_rect(x, x) for x in count(1))
    >>>
    >>> area = accumulate_area(squares)
    >>> next(islice(area, 6, 7))
    140
    >>> next(area)
    204

Slide 35

Slide 35 text

Infinite Squares (functional)

    >>> from itertools import accumulate
    >>>
    >>> squares = (
    ...     make_rect(x, x) for x in count(1))
    >>>
    >>> area = accumulate(map(get_area, squares))
    >>> next(islice(area, 6, 7))
    140
    >>> next(area)
    204

Slide 36

Slide 36 text

Exercise #1 Image Credit: Me

Slide 37

Slide 37 text

Exercise #1 (Problem)

[diagram: a right triangle with legs x and y and hypotenuse z]

Slide 38

Slide 38 text

Exercise #1 (Problem)

[diagram: the scaled triangle with legs x × factor and y and hypotenuse h]

Slide 39

Slide 39 text

Exercise #1 (Problem)

[diagram: both triangles — legs x, y with hypotenuse z; legs x × factor, y with hypotenuse h]

Slide 40

Slide 40 text

Exercise #1 (Problem)

[diagram: the two hypotenuses z and h, and their ratio (z ÷ h)]

Slide 41

Slide 41 text

Exercise #1 (Problem)

[diagram: both triangles]

ratio = z ÷ h

Slide 42

Slide 42 text

Exercise #1 (Problem)

    z = √(x² + y²)

    ratio = function1(x, y, factor)
    hyp = function2(rectangle)

Slide 43

Slide 43 text

Exercise #1 (Problem)

    z = √(x² + y²)

    ratio = function1(x, y, factor)
    hyp = function2(rectangle)

    >>> get_ratio(1, 2, 2)
    0.7905694150420948

Slide 44

Slide 44 text

Exercise #1 (Solution)

    from math import sqrt, pow

    def get_hyp(rect):
        sum_s = sum(pow(r, 2) for r in rect)
        return sqrt(sum_s)

    def get_ratio(length, width, factor=1):
        rect = make_rect(length, width)
        big_rect = grow_rect(rect, factor)
        return get_hyp(rect) / get_hyp(big_rect)

Slide 45

Slide 45 text

Exercise #1 (Solution)

    >>> get_ratio(1, 2, 2)
    0.7905694150420948
    >>> get_ratio(1, 2, 3)
    0.6201736729460423
    >>> get_ratio(3, 4, 2)
    0.6933752452815365
    >>> get_ratio(3, 4, 3)
    0.5076730825668095

Slide 46

Slide 46 text

Part II

Slide 47

Slide 47 text

You might not need pandas Image Credit www.flickr.com/photos/harlequeen

Slide 48

Slide 48 text

Obtaining data

Slide 49

Slide 49 text

csv data

    >>> from csv import DictReader
    >>> from io import StringIO
    >>>
    >>> csv_str = 'Type,Day\ntutorial,wed\ntalk,fri'
    >>> csv_str += '\nposter,sun'
    >>> f = StringIO(csv_str)
    >>> data = DictReader(f)
    >>> dict(next(data))
    {'Day': 'wed', 'Type': 'tutorial'}

Slide 50

Slide 50 text

JSON data

    >>> from urllib.request import urlopen
    >>> from ijson import items
    >>>
    >>> json_url = 'https://api.github.com/users'
    >>> f = urlopen(json_url)
    >>> data = items(f, 'item')
    >>> next(data)
    {'avatar_url': 'https://avatars3.githubuserco…',
     'events_url': 'https://api.github.com/users/…',
     'followers_url': 'https://api.github.com/use…',
     'following_url': 'https://api.github.com/use…',

Slide 51

Slide 51 text

pip install xlrd

Slide 52

Slide 52 text

xls(x) data

    >>> from urllib.request import urlretrieve
    >>> from xlrd import open_workbook
    >>>
    >>> xl_url = 'https://github.com/reubano/meza'
    >>> xl_url += '/blob/master/data/test/test.xlsx'
    >>> xl_url += '?raw=true'
    >>> xl_path = urlretrieve(xl_url)[0]
    >>> book = open_workbook(xl_path)
    >>> sheet = book.sheet_by_index(0)
    >>> header = sheet.row_values(0)

Slide 53

Slide 53 text

xls(x) data

    >>> nrows = range(1, sheet.nrows)
    >>> rows = (sheet.row_values(x) for x in nrows)
    >>> data = (
    ...     dict(zip(header, row)) for row in rows)
    >>>
    >>> next(data)
    {' ': ' ',
     'Some Date': 30075.0,
     'Some Value': 234.0,
     'Sparse Data': 'Iñtërnâtiônàližætiøn',
     'Unicode Test': 'Ādam'}

Slide 54

Slide 54 text

Transforming data

Slide 55

Slide 55 text

grouping data

    >>> import itertools as it
    >>> from operator import itemgetter
    >>>
    >>> records = [
    ...     {'item': 'a', 'amount': 200},
    ...     {'item': 'b', 'amount': 200},
    ...     {'item': 'c', 'amount': 400}]
    >>>
    >>> keyfunc = itemgetter('amount')
    >>> _sorted = sorted(records, key=keyfunc)
    >>> groups = it.groupby(_sorted, keyfunc)

Slide 56

Slide 56 text

grouping data

    >>> data = ((key, list(g)) for key, g in groups)
    >>> next(data)
    (200, [{'amount': 200, 'item': 'a'},
           {'amount': 200, 'item': 'b'}])

Slide 57

Slide 57 text

aggregating data

    >>> key = 'amount'
    >>> value = sum(r.get(key, 0) for r in records)
    >>> {**records[0], key: value}
    {'item': 'a', 'amount': 800}

Slide 58

Slide 58 text

Storing data

Slide 59

Slide 59 text

csv files

    >>> from csv import DictWriter
    >>>
    >>> records = [
    ...     {'item': 'a', 'amount': 200},
    ...     {'item': 'b', 'amount': 400}]
    >>>
    >>> header = list(records[0].keys())
    >>> with open('output.csv', 'w') as f:
    ...     w = DictWriter(f, header)
    ...     w.writeheader()
    ...     w.writerows(records)

Slide 60

Slide 60 text

Introducing meza

Slide 61

Slide 61 text

pip install meza

Slide 62

Slide 62 text

Obtaining data

Slide 63

Slide 63 text

csv data

    >>> from meza.io import read
    >>>
    >>> records = read('output.csv')
    >>> next(records)
    {'amount': '200', 'item': 'a'}

Slide 64

Slide 64 text

JSON data

    >>> from meza.io import read_json
    >>>
    >>> f = urlopen(json_url)
    >>> records = read_json(f, path='item')
    >>> next(records)
    {'avatar_url': 'https://avatars3.githubuserco…',
     'events_url': 'https://api.github.com/users/…',
     'followers_url': 'https://api.github.com/use…',
     'following_url': 'https://api.github.com/use…',
     …}

Slide 65

Slide 65 text

xlsx data

    >>> from meza.io import read_xls
    >>>
    >>> records = read_xls(xl_path)
    >>> next(records)
    {'Some Date': '1982-05-04',
     'Some Value': '234.0',
     'Sparse Data': 'Iñtërnâtiônàližætiøn',
     'Unicode Test': 'Ādam'}

Slide 66

Slide 66 text

Transforming data

Slide 67

Slide 67 text

aggregation

    >>> from meza.process import aggregate
    >>>
    >>> records = [
    ...     {'a': 'item', 'amount': 200},
    ...     {'a': 'item', 'amount': 300},
    ...     {'a': 'item', 'amount': 400}]
    >>>
    >>> aggregate(records, 'amount', sum)
    {'a': 'item', 'amount': 900}

Slide 68

Slide 68 text

merging

    >>> from meza.process import merge
    >>>
    >>> records = [
    ...     {'a': 200}, {'b': 300}, {'c': 400}]
    >>>
    >>> merge(records)
    {'a': 200, 'b': 300, 'c': 400}

Slide 69

Slide 69 text

grouping

    >>> from meza.process import group
    >>>
    >>> records = [
    ...     {'item': 'a', 'amount': 200},
    ...     {'item': 'a', 'amount': 200},
    ...     {'item': 'b', 'amount': 400}]
    >>>
    >>> groups = group(records, 'item')
    >>> next(groups)

Slide 70

Slide 70 text

normalization

    >>> from meza.process import normalize
    >>>
    >>> records = [
    ...     {
    ...         'color': 'blue', 'setosa': 5,
    ...         'versi': 6
    ...     }, {
    ...         'color': 'red', 'setosa': 3,
    ...         'versi': 5
    ...     }]

Slide 71

Slide 71 text

normalization

    >>> kwargs = {
    ...     'data': 'length', 'column': 'species',
    ...     'rows': ['setosa', 'versi']}
    >>>
    >>> data = normalize(records, **kwargs)
    >>> next(data)
    {'color': 'blue', 'length': 5, 'species': 'setosa'}

Slide 72

Slide 72 text

normalization

before:

    color | setosa | versi
    ------|--------|------
    blue  | 5      | 6
    red   | 3      | 5

after:

    color | length | species
    ------|--------|--------
    blue  | 5      | setosa
    blue  | 6      | versi
    red   | 3      | setosa
    red   | 5      | versi

Slide 73

Slide 73 text

Storing data

Slide 74

Slide 74 text

csv files

    >>> from meza import convert as cv
    >>> from meza.io import write
    >>>
    >>> records = [
    ...     {'item': 'a', 'amount': 200},
    ...     {'item': 'b', 'amount': 400}]
    >>>
    >>> csv = cv.records2csv(records)
    >>> write('output.csv', csv)

Slide 75

Slide 75 text

JSON files

    >>> json = cv.records2json(records)
    >>> write('output.json', json)

Slide 76

Slide 76 text

Exercise #2 Image Credit: Me

Slide 77

Slide 77 text

Exercise #2 (Problem) • create a list of dicts with keys "factor", "length", "width", and "ratio" (for factors 1 - 20)

Slide 78

Slide 78 text

Exercise #2 (Problem)

    records = [
        {
            'factor': 1, 'length': 2,
            'width': 2, 'ratio': 1.0
        }, {
            'factor': 2, 'length': 2,
            'width': 2, 'ratio': 0.6324…
        }, {
            'factor': 3, 'length': 2,
            'width': 2, 'ratio': 0.4472…}
    ]

Slide 79

Slide 79 text

Exercise #2 (Problem) • create a list of dicts with keys "factor", "length", "width", and "ratio" (for factors 1 - 20) • group the records by quartiles of the "ratio" value, and aggregate each group by the median "ratio"

Slide 80

Slide 80 text

Exercise #2 (Problem)

    from statistics import median
    from meza.process import group

    records[0]['ratio'] // .25

Slide 81

Slide 81 text

Exercise #2 (Problem) • create a list of dicts with keys "factor", "length", "width", and "ratio" (for factors 1 - 20) • group the records by quartiles of the "ratio" value, and aggregate each group by the median "ratio" • write the records out to a csv file (1 row per group)

Slide 82

Slide 82 text

Exercise #2 (Problem)

    from meza.convert import records2csv
    from meza.io import write

    key | median
    ----|-------
    0   | 0.108…
    1   | 0.343…

Slide 83

Slide 83 text

Exercise #2 (Solution)

    >>> length = width = 2
    >>> records = [
    ...     {
    ...         'length': length,
    ...         'width': width,
    ...         'factor': f,
    ...         'ratio': get_ratio(length, width, f)
    ...     }
    ...
    ...     for f in range(1, 21)]

Slide 84

Slide 84 text

Exercise #2 (Solution)

    >>> from statistics import median
    >>> from meza import process as pr
    >>>
    >>> def aggregator(group):
    ...     ratios = (g['ratio'] for g in group)
    ...     return median(ratios)
    >>>
    >>> kwargs = {'aggregator': aggregator}
    >>> gkeyfunc = lambda r: r['ratio'] // .25
    >>> groups = pr.group(
    ...     records, gkeyfunc, **kwargs)

Slide 85

Slide 85 text

Exercise #2 (Solution)

    >>> from meza import convert as cv
    >>> from meza.io import write
    >>>
    >>> results = [
    ...     {'key': k, 'median': g}
    ...     for k, g in groups]
    >>>
    >>> csv = cv.records2csv(results)
    >>> write('results.csv', csv)

Slide 86

Slide 86 text

Exercise #2 (Solution)

    $ csvlook results.csv
    | key | median |
    | --- | ------ |
    |   0 | 0.108… |
    |   1 | 0.343… |
    |   2 | 0.632… |
    |   4 | 1.000… |

Slide 87

Slide 87 text

Part III

Slide 88

Slide 88 text

Introducing riko

Slide 89

Slide 89 text

pip install riko

Slide 90

Slide 90 text

Obtaining data

Slide 91

Slide 91 text

Python Events Calendar https://www.python.org/events/python-events/

Slide 92

Slide 92 text

Python Events Calendar https://www.python.org/events/python-events/

Slide 93

Slide 93 text

Python Events Calendar

    >>> from riko.collections import SyncPipe
    >>>
    >>> url = 'www.python.org/events/python-events/'
    >>> _xpath = '/html/body/div/div[3]/div/section'
    >>> xpath = '{}/div/div/ul/li'.format(_xpath)
    >>> xconf = {'url': url, 'xpath': xpath}
    >>> kwargs = {'emit': False, 'token_key': None}
    >>> epath = 'h3.a.content'
    >>> lpath = 'p.span.content'
    >>> rrule = [{'field': 'h3'}, {'field': 'p'}]

Slide 94

Slide 94 text

Python Events Calendar

    >>> flow = (
    ...     SyncPipe('xpathfetchpage', conf=xconf)
    ...         .subelement(
    ...             conf={'path': epath},
    ...             assign='event', **kwargs)
    ...         .subelement(
    ...             conf={'path': lpath},
    ...             assign='location', **kwargs)
    ...         .rename(conf={'rule': rrule}))

Slide 95

Slide 95 text

Python Events Calendar

    >>> stream = flow.output
    >>> next(stream)
    {'event': 'PyDataBCN 2017',
     'location': 'Barcelona, Spain'}
    >>> next(stream)
    {'event': 'PyConWEB 2017',
     'location': 'Munich, Germany'}

Slide 96

Slide 96 text

Transforming data

Slide 97

Slide 97 text

Python Events Calendar

    >>> dpath = 'p.time.datetime'
    >>> frule = {
    ...     'field': 'date', 'op': 'after',
    ...     'value': '2017-06-01'}
    >>>
    >>> flow = (
    ...     SyncPipe('xpathfetchpage', conf=xconf)
    ...         .subelement(
    ...             conf={'path': epath},
    ...             assign='event', **kwargs)

Slide 98

Slide 98 text

Python Events Calendar

    ...         .subelement(
    ...             conf={'path': lpath},
    ...             assign='location', **kwargs)
    ...         .subelement(
    ...             conf={'path': dpath},
    ...             assign='date', **kwargs)
    ...         .rename(conf={'rule': rrule})
    ...         .filter(conf={'rule': frule}))

Slide 99

Slide 99 text

Python Events Calendar

    >>> stream = flow.output
    >>> next(stream)
    {'date': '2017-06-06T00:00:00+00:00',
     'event': 'PyCon Taiwan 2017',
     'location': 'Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan'}

Slide 100

Slide 100 text

Parallel processing

Slide 101

Slide 101 text

Python Events Calendar

    >>> from meza.process import merge
    >>> from riko.collections import SyncCollection
    >>>
    >>> _type = 'xpathfetchpage'
    >>> source = {'url': url, 'type': _type}
    >>> xpath2 = '{}/div/ul/li'.format(_xpath)
    >>> sources = [
    ...     merge([source, {'xpath': xpath}]),
    ...     merge([source, {'xpath': xpath2}])]

Slide 102

Slide 102 text

Python Events Calendar

    >>> sc = SyncCollection(sources, parallel=True)
    >>> flow = (sc.pipe()
    ...     .subelement(
    ...         conf={'path': epath},
    ...         assign='event', **kwargs)
    ...     .rename(conf={'rule': rrule}))
    >>>
    >>> stream = flow.list
    >>> stream[0]
    {'event': 'PyDataBCN 2017'}

Slide 103

Slide 103 text

Python Events Calendar

    >>> stream[-1]
    {'event': 'PyDays Vienna 2017'}

Slide 104

Slide 104 text

Exercise #3 Image Credit: Me

Slide 105

Slide 105 text

Exercise #3 (Problem)
• fetch the Python jobs rss feed
• tokenize the "summary" field by newlines ("\n")
• use "subelement" to extract the location (the first "token")
• filter for jobs located in the U.S.

Slide 106

Slide 106 text

Exercise #3 (Problem)

    from riko.collections import SyncPipe

    url = 'https://www.python.org/jobs/feed/rss'

    # use the 'fetch', 'tokenizer', 'subelement',
    # and 'filter' pipes

Slide 107

Slide 107 text

Exercise #3 (Problem)
• write the 'link', 'location', and 'title' fields of each record to a json file

Slide 108

Slide 108 text

Exercise #3 (Problem)

    from meza.fntools import dfilter
    from meza.convert import records2json
    from meza.io import write

Slide 109

Slide 109 text

Exercise #3 (Solution)

    >>> from riko.collections import SyncPipe
    >>>
    >>> url = 'https://www.python.org/jobs/feed/rss'
    >>> fetch_conf = {'url': url}
    >>> tconf = {'delimiter': '\n'}
    >>> rule = {
    ...     'field': 'location', 'op': 'contains'}
    >>> vals = ['usa', 'united states']
    >>> frule = [
    ...     merge([rule, {'value': v}])
    ...     for v in vals]

Slide 110

Slide 110 text

Exercise #3 (Solution)

    >>> fconf = {'rule': frule, 'combine': 'or'}
    >>> kwargs = {'emit': False, 'token_key': None}
    >>> path = 'location.content.0'
    >>> rrule = [
    ...     {'field': 'summary'},
    ...     {'field': 'summary_detail'},
    ...     {'field': 'author'},
    ...     {'field': 'links'}]

Slide 111

Slide 111 text

Exercise #3 (Solution)

    >>> flow = (SyncPipe('fetch', conf=fetch_conf)
    ...     .tokenizer(
    ...         conf=tconf, field='summary',
    ...         assign='location')
    ...     .subelement(
    ...         conf={'path': path},
    ...         assign='location', **kwargs)
    ...     .filter(conf=fconf)
    ...     .rename(conf={'rule': rrule}))

Slide 112

Slide 112 text

Exercise #3 (Solution)

    >>> stream = flow.list
    >>> stream[0]
    {'dc:creator': None,
     'id': 'https://python.org/jobs/2570/',
     'link': 'https://python.org/jobs/2570/',
     'location': 'College Park,MD,USA',
     'title': 'Python Developer - MarketSmart',
     'title_detail': 'Python Developer - MarketSmart',
     'y:published': None,
     'y:title': 'Python Developer - MarketSmart'}

Slide 113

Slide 113 text

Exercise #3 (Solution)

    >>> from meza import convert as cv
    >>> from meza.fntools import dfilter
    >>> from meza.io import write
    >>>
    >>> fields = ['link', 'location', 'title']
    >>> records = [
    ...     dfilter(
    ...         item, blacklist=fields,
    ...         inverse=True)
    ...     for item in stream]

Slide 114

Slide 114 text

Exercise #3 (Solution)

    >>> json = cv.records2json(records)
    >>> write('pyjobs.json', json)

    $ head -n7 pyjobs.json
    [
      {
        "link": "https://python.org/jobs/2570/",
        "location": "College Park,MD,USA",
        "title": "Python Developer - MarketSmart"
      },
      {

Slide 115

Slide 115 text

Thank you! Reuben Cummings @reubano

Slide 116

Slide 116 text

Extra Slides Image Credit www.flickr.com/photos/jeremybrooks

Slide 117

Slide 117 text

Infinite Squares (functional)

    def accumulate_area2(rects, accum=0):
        it = iter(rects)

        try:
            area = get_area(next(it))
        except StopIteration:
            return

        accum += area
        yield accum
        yield from accumulate_area2(it, accum)