Slide 1

Data Mining and Processing for fun and profit
PyConZA, Cape Town, SA
Oct 7, 2016
Reuben Cummings
@reubano #PyConZA16

Slide 2

Who am I?
- Managing Director, Nerevu Development
- Lead organizer of Arusha Coders
- Author of several popular packages

Slide 3

Topics & Format
- data and data mining
- code samples and interactive exercises
- hands-on (don't be a spectator)

Slide 4

what is data?

Slide 5

Organization

structured:

country     capital
S Africa    Joburg
Tanzania    Dodoma
Rwanda      Kigali

unstructured:

"O God of all creation. Bless this our land and nation. Justice be our shield..."
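The same facts can be held either way; a minimal sketch (variable names are illustrative, not from the slides) contrasting structured records with the same kind of information locked up in prose:

```python
# structured: every fact sits in a named field, ready to query
records = [
    {'country': 'S Africa', 'capital': 'Joburg'},
    {'country': 'Tanzania', 'capital': 'Dodoma'},
    {'country': 'Rwanda', 'capital': 'Kigali'},
]

# unstructured: the same kind of information buried in free text
text = "The capital of Tanzania is Dodoma."

# querying structured data is a one-liner; prose would need parsing first
capitals = {r['country']: r['capital'] for r in records}
print(capitals['Tanzania'])  # Dodoma
```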

Slide 6

Storage

flat/text:

greeting,loc
hello,world
good bye,moon
welcome,stars
what's up,sky

binary:

00105e0 b0e6 04...
00105f0 e4e7 04...
0010600 0be8 04...
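To make the flat/text vs binary distinction concrete, a small sketch using only the standard library (the fixed 16-byte field width is an arbitrary assumption) that stores the same rows both ways:

```python
import csv
import struct
from io import StringIO

rows = [('hello', 'world'), ('good bye', 'moon')]

# flat/text: human-readable, delimiter-separated lines
buf = StringIO()
writer = csv.writer(buf)
writer.writerow(('greeting', 'loc'))
writer.writerows(rows)
print(buf.getvalue())

# binary: fixed-width packed records; meaningless without the format spec
packed = b''.join(
    struct.pack('16s16s', g.encode(), l.encode()) for g, l in rows)
print(len(packed))  # 2 records x 32 bytes = 64
```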

Slide 7

Organization vs Storage

             structured    unstructured
flat/text    good!
binary                     bad!
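The two axes are independent: structured data can live in either text or binary storage. A quick standard-library illustration (json for structured text, pickle for structured binary):

```python
import json
import pickle

record = {'country': 'Rwanda', 'capital': 'Kigali'}

as_text = json.dumps(record)      # structured organization, flat/text storage
as_binary = pickle.dumps(record)  # structured organization, binary storage

print(as_text)
print(json.loads(as_text) == pickle.loads(as_binary))  # True
```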

Slide 8

what is data mining?

Slide 9

Obtaining Data
- APIs
- websites
- feeds
- databases
- filesystems

Slide 10

Normalizing Data

S Africa    Tanzania    Rwanda
Joburg      Dodoma      Kigali
1961        1961        1962

Slide 11

Normalizing Data

country     capital    independence
S Africa    Joburg     1961
Tanzania    Dodoma     1961
Rwanda      Kigali     1962
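The normalization step above (turning parallel columns into one record per row) can be sketched in plain Python with zip; the field names follow the slide's table:

```python
countries = ['S Africa', 'Tanzania', 'Rwanda']
capitals = ['Joburg', 'Dodoma', 'Kigali']
independence = [1961, 1961, 1962]

# zip the parallel columns back into one record per country
fields = ('country', 'capital', 'independence')
records = [
    dict(zip(fields, row))
    for row in zip(countries, capitals, independence)]
print(records[0])
# {'country': 'S Africa', 'capital': 'Joburg', 'independence': 1961}
```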

Slide 12

Visualizing Data

[line chart: Cumulative Independent African Countries, 1930-1960, y-axis 0-100]

* Note: above data is made up
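The cumulative series behind a chart like this can be computed with itertools.accumulate; the per-decade counts below are made up, like the slide's:

```python
from itertools import accumulate

# made-up number of countries gaining independence per decade
new_per_decade = {1930: 5, 1940: 10, 1950: 25, 1960: 60}

# running total, i.e. the y-values of the cumulative chart
cumulative = dict(zip(new_per_decade, accumulate(new_per_decade.values())))
print(cumulative)  # {1930: 5, 1940: 15, 1950: 40, 1960: 100}
```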

Slide 13

so what!

Slide 14

Mint (mint.com)

Slide 15

Plotly (plot.ly)

Slide 16

Parse.ly (parse.ly)

Slide 17

You might not need pandas

Slide 18

IPython Demo bit.ly/data-pyconza16 (examples)

Slide 19

obtaining json

Slide 20

Code for South Africa (data.code4sa.org)

Slide 21

>>> from urllib.request import urlopen
>>> from ijson import items
>>>
>>> # crime-summary.json in repo
>>> url = 'http://data.code4sa.org/resource/qtx7-xbrs.json'
>>> f = urlopen(url)
>>> data = items(f, 'item')
>>> next(data)
{'station': 'Aberdeen', 'sum_2014_2015': '1153'}

Slide 22

reading csv

Slide 23

>>> from csv import DictReader
>>> from io import open
>>> from os import path as p
>>>
>>> url = p.abspath('filtered-crime-stats.csv')
>>> f = open(url)
>>> data = DictReader(f)
>>> next(data)
{'Crime': 'All theft not mentioned elsewhere',
 'Incidents': '3397',
 'Police Station': 'Durban Central',
 'Province': 'KZN',
 'Year': '2014'}

Slide 24

reading excel

Slide 25

>>> from xlrd import open_workbook
>>>
>>> url = p.abspath('filtered-crime-stats.xlsx')
>>> book = open_workbook(url)
>>> sheet = book.sheet_by_index(0)
>>> sheet.row_values(0)
['Province', 'Police Station', 'Crime', 'Year', 'Incidents']
>>> sheet.row_values(1)
['KZN', 'Durban Central', 'All theft not mentioned elsewhere', 2014.0, 3397.0]

Slide 26

screen scraping

Slide 27

>>> import requests
>>> from bs4 import BeautifulSoup
>>>
>>> url = 'https://github.com/reubano/pyconza-tutorial/raw/master/migrants.html'
>>> r = requests.get(url)
>>> soup = BeautifulSoup(r.text, 'html.parser')
>>>
>>> def get_data(table):
...     for row in table.findAll('tr')[1:]:
...         header = row.findAll('th')
...         td = row.findAll('td')
...         columns = header or td
...         yield [c.getText() for c in columns]

Slide 28

>>> table = soup.find('table')
>>> data = get_data(table)
>>> next(data)
['Mediterranean', '82', '346', ... ]
>>> next(data)
['', 'January', 'February', ... ]

Slide 29

aggregating data

Slide 30

>>> import itertools as it
>>>
>>> records = [
...     {'a': 'item', 'amount': 200},
...     {'a': 'item', 'amount': 300},
...     {'a': 'item', 'amount': 400}]
>>> key = 'amount'
>>> first = records[0]
>>> value = sum(r.get(key, 0) for r in records)
>>> dict(it.chain(first.items(), [(key, value)]))
{'a': 'item', 'amount': 900}

Slide 31

grouping data

Slide 32

>>> import itertools as it
>>> from operator import itemgetter
>>>
>>> records = [
...     {'item': 'a', 'amount': 200},
...     {'item': 'b', 'amount': 200},
...     {'item': 'c', 'amount': 400}]
>>> keyfunc = itemgetter('amount')
>>> sorted_records = sorted(records, key=keyfunc)
>>> grouped = it.groupby(sorted_records, keyfunc)
>>> data = ((key, list(group)) for key, group in grouped)
>>> next(data)
(200, [{'amount': 200, 'item': 'a'}, {'amount': 200, 'item': 'b'}])

Slide 33

exercise #1

Slide 34

IPython Demo bit.ly/data-pyconza16 (exercises)

Slide 35

lowest crime per province

Slide 36

FS ('All theft not mentioned elsewhere', 2940)
GP ('Drug-related crime', 5229)
KZN ('Drug-related crime', 4571)
WC ('Common assault', 2188)

Slide 37

IPython Demo bit.ly/data-pyconza16 (solutions)

Slide 38

>>> from csv import DictReader
>>> from io import open
>>> from os import path as p
>>> from itertools import groupby
>>> from operator import itemgetter
>>>
>>> url = p.abspath('filtered-crime-stats.csv')
>>> f = open(url)
>>> data = DictReader(f)
>>> keyfunc = itemgetter('Province')
>>> records = sorted(data, key=keyfunc)
>>> grouped = groupby(records, keyfunc)

Slide 39

>>> for key, group in grouped:
...     print(key)
...     keyfunc = itemgetter('Crime')
...     sub_records = sorted(group, key=keyfunc)
...     sub_grouped = groupby(sub_records, keyfunc)
...     low_count, low_key = 0, None
...
...     for sub_key, sg in sub_grouped:
...         count = sum(int(s['Incidents']) for s in sg)
...         if not low_count or count < low_count:
...             low_count = count
...             low_key = sub_key
...
...     print((low_key, low_count))

Slide 40

introducing meza github.com/reubano/meza

Slide 41

IPython Demo bit.ly/data-pyconza16 (meza)

Slide 42

reading data

Slide 43

>>> from urllib.request import urlopen
>>> from meza.io import read_json
>>>
>>> # crime-summary
>>> url = 'http://data.code4sa.org/resource/qtx7-xbrs.json'
>>> f = urlopen(url)
>>> records = read_json(f)
>>> next(records)
{'station': 'Aberdeen', 'sum_2014_2015': '1153'}
>>> next(records)
{'station': 'Acornhoek', 'sum_2014_2015': '5047'}

Slide 44

>>> from io import StringIO
>>> from meza.io import read_csv
>>>
>>> f = StringIO('greeting,location\nhello,world\n')
>>> records = read_csv(f)
>>> next(records)
{'greeting': 'hello', 'location': 'world'}

Slide 45

reading more data

Slide 46

>>> from os import path as p
>>> from meza import io
>>>
>>> url = p.abspath('crime-summary.json')
>>> records = io.read(url)
>>> next(records)
{'station': 'Aberdeen', 'sum_2014_2015': '1153'}
>>> url2 = p.abspath('filtered-crime-stats.csv')
>>> records = io.join(url, url2)
>>> next(records)
{'station': 'Aberdeen', 'sum_2014_2015': '1153'}

Slide 47

reading excel

Slide 48

>>> from io import open
>>> from meza.io import read_xls
>>>
>>> url = p.abspath('filtered-crime-stats.xlsx')
>>> records = read_xls(url, sanitize=True)
>>> next(records)
{'crime': 'All theft not mentioned elsewhere',
 'incidents': '3397.0',
 'police_station': 'Durban Central',
 'province': 'KZN',
 'year': '2014.0'}

Slide 49

screen scraping

Slide 50

>>> from meza.io import read_html
>>>
>>> url = p.abspath('migrants.html')
>>> records = read_html(url, sanitize=True)
>>> next(records)
{'': 'Mediterranean',
 'april': '1,244',
 'august': '684',
 'december': '203',
 'february': '346',
 'january': '82',
 'july': '230',
 'june': '\xa010',
 ...
 'total_to_date': '3,760'}

Slide 51

aggregating data

Slide 52

>>> from meza.process import aggregate
>>>
>>> records = [
...     {'a': 'item', 'amount': 200},
...     {'a': 'item', 'amount': 300},
...     {'a': 'item', 'amount': 400}]
>>>
>>> aggregate(records, 'amount', sum)
{'a': 'item', 'amount': 900}

Slide 53

grouping data

Slide 54

>>> from meza.process import group
>>>
>>> records = [
...     {'item': 'a', 'amount': 200},
...     {'item': 'b', 'amount': 200},
...     {'item': 'c', 'amount': 400}]
>>>
>>> grouped = group(records, 'amount')
>>> next(grouped)
(200, [{'amount': 200, 'item': 'a'}, {'amount': 200, 'item': 'b'}])

Slide 55

type casting

Slide 56

>>> from meza.io import read_csv
>>> from meza.process import detect_types, type_cast
>>>
>>> url = p.abspath('filtered-crime-stats.csv')
>>> raw = read_csv(url)
>>> records, result = detect_types(raw)
>>> result['types']
[{'id': 'Incidents', 'type': 'int'},
 {'id': 'Crime', 'type': 'text'},
 {'id': 'Province', 'type': 'text'},
 {'id': 'Year', 'type': 'int'},
 {'id': 'Police Station', 'type': 'text'}]

Slide 57

>>> casted = type_cast(records, **result)
>>> next(casted)
{'Crime': 'All theft not mentioned elsewhere',
 'Incidents': 3397,
 'Police Station': 'Durban Central',
 'Province': 'KZN',
 'Year': 2014}

Slide 58

normalizing data

Slide 59

>>> from meza.process import normalize
>>>
>>> records = [
...     {'color': 'blue', 'setosa': 5, 'versi': 6},
...     {'color': 'red', 'setosa': 5, 'versi': 6}]
>>> kwargs = {
...     'data': 'length',
...     'column': 'species',
...     'rows': ['setosa', 'versi']}
>>> data = normalize(records, **kwargs)
>>> next(data)
{'color': 'blue', 'length': 5, 'species': 'setosa'}

Slide 60

head to head

                pandas       meza
installation    complex      simple
size            large        small
memory usage    high         low
speed           fast         fast*
functions       very many    many
input/output    many         many

Slide 61

exercise #2

Slide 62

IPython Demo bit.ly/data-pyconza16 (exercises)

Slide 63

lowest crime per province

Slide 64

{'Police Station': 'Park Road', 'Incidents': 2940, 'Province': 'FS',
 'Crime': 'All theft not mentioned elsewhere', 'Year': 2014}
{'Police Station': 'Eldorado Park', 'Incidents': 5229, 'Province': 'GP',
 'Crime': 'Drug-related crime', 'Year': 2014}
{'Police Station': 'Durban Central', 'Incidents': 4571, 'Province': 'KZN',
 'Crime': 'Drug-related crime', 'Year': 2014}
{'Police Station': 'Mitchells Plain', 'Incidents': 2188, 'Province': 'WC',
 'Crime': 'Common assault', 'Year': 2014}

Slide 65

IPython Demo bit.ly/data-pyconza16 (solutions)

Slide 66

>>> from meza.io import read_csv
>>> from meza.process import (
...     aggregate, group, detect_types, type_cast)
>>>
>>> url = p.abspath('filtered-crime-stats.csv')
>>> raw = read_csv(url)
>>> records, result = detect_types(raw)
>>> casted = type_cast(records, **result)
>>> grouped = group(casted, 'Province')

Slide 67

>>> for key, _group in grouped:
...     sub_grouped = group(_group, 'Crime')
...     aggs = (
...         aggregate(sg[1], 'Incidents', sum)
...         for sg in sub_grouped)
...
...     keyfunc = itemgetter('Incidents')
...     print(min(aggs, key=keyfunc))

Slide 68

exercise #3

Slide 69

incidents per person per province

Slide 70

Thanks!
Reuben Cummings
reubano@gmail.com
https://github.com/reubano/meza
#pyconza2016 @reubano