Slide 1

Slide 1 text

YOU MIGHT NOT NEED PANDAS DATA DAY MEXICO — MEXICO CITY MARCH 15, 2018 BY REUBEN CUMMINGS @REUBANO

Slide 2

Slide 2 text

WHO AM I? • Managing Director, Nerevu Development • Founder of Arusha Coders • Author of several popular Python packages • @reubano on Twitter and GitHub

Slide 3

Slide 3 text

WHO DOESN’T LOVE THESE CUTE CUDDLY CREATURES? WHY PANDAS? • It’s fast • It’s ubiquitous • It gets the job done Photo credit: Mathias Appel (@mathiasappel)

Slide 4

Slide 4 text

BECAUSE IT ISN’T THAT BAD AFTER ALL WHEN SHOULD YOU USE PANDAS? • Speed is paramount • You already have pandas code available • You don’t want have time to learn anything new

Slide 5

Slide 5 text

WELL, THEY NEVER CLAIMED TO BE GRACEFUL… WHY NOT PANDAS? • It’s complex • It’s large • It likes lots of memory Photo credit: CSBaltimore

Slide 6

Slide 6 text

BECAUSE SOMETIMES IT ISN’T THAT GREAT EITHER WHEN SHOULDN’T YOU USE PANDAS? • You don’t want all those dependencies • You like functional programming • You have tight RAM constraints

Slide 7

Slide 7 text

PANDAS ALTERNATIVES • Pure Python • meza (https://github.com/ reubano/meza) • Other libraries (csvkit, messytables, etc.)

Slide 8

Slide 8 text

LA MARIPOSA MONARCA CHOOSE YOUR OWN ADVENTURE ANALYSIS ANALYSIS Photo credit: Pablo Leautaud (@pleautaud)

Slide 9

Slide 9 text

DISCLAIMER

Slide 10

Slide 10 text

ALL THE THINGS I AM NOT DISCLAIMER •Pandas expert •Statistician •Lepidopterologist (butterfly scientist)

Slide 11

Slide 11 text

MONARCH BUTTERFLY HIBERNATING COLONIES WORLD WILDLIFE FUND MÉXICO

Slide 12

Slide 12 text

Photo credit: Adam Jones (@adam_jones)

Slide 13

Slide 13 text

THE DATA TABLE IS AVAILABLE AS BOTH EXCEL AND HTML FILES HOW WILL YOU OBTAIN THE DATA? Year Occupied Forest (acres) 1993 15.39 1994 19.30 1995 31.16 1996 44.95 1997 14.26

Slide 14

Slide 14 text

THE DATA TABLE IS AVAILABLE AS BOTH EXCEL AND HTML FILES HOW WILL YOU OBTAIN THE DATA? PARSE THE EXCEL FILE OR SCRAPE THE HTML FILE

Slide 15

Slide 15 text

PARSE THE EXCEL FILE (PANDAS) >>> from pandas import ExcelFile >>> >>> book = ExcelFile('data.xlsx') >>> hibernation = book.parse( >>> 'hibernation') >>> >>> hibernation.head() Year Forest Area (acres) 0 1993 15.39 1 1994 19.30 2 1995 31.16 3 1996 44.95 4 1997 14.26

Slide 16

Slide 16 text

PARSE THE EXCEL FILE (NOT PANDAS) >>> from xlrd import open_workbook >>> >>> book = open_workbook('data.xlsx') >>> hibernation = book.sheet_by_name( >>> 'hibernation') >>> >>> hibernation.row_values(0) ['Year', 'Forest Area (acres)'] >>> hibernation.row_values(1) [1993.0, 15.39]

Slide 17

Slide 17 text

>>> from meza.io import read_xls >>> from meza.process import peek >>> >>> hibernation = read_xls( >>> 'data.xlsx', sanitize=True) >>> >>> hibernation, head = peek(hibernation) >>> head[:3] [{'forest_area_acres': '15.39', 'year': '1993.0'}, {'forest_area_acres': '19.3', 'year': '1994.0'}, {'forest_area_acres': '31.16', 'year': '1995.0'}] PARSE THE EXCEL FILE (MEZA)

Slide 18

Slide 18 text

¡FELICIDADES! YOU’VE OBTAINED ESTIMATES FOR THE FOREST AREA OCCUPIED BY BUTTERFLY COLONIES

Slide 19

Slide 19 text

YOU’VE OBTAINED THE FOREST AREA ESTIMATES ¡FELICIDADES! • The declining forest area worries you • You suspect de- forestation • You also suspect pesticides ACRES OF “BUTTERFLY OCCUPIED” FOREST 0 12.5 25 37.5 50 1993 1994 1995 1996 1997

Slide 20

Slide 20 text

YOU HAVE OBTAINED DATA FOR BOTH DEFORESTATION IN MEXICO AND PESTICIDE USAGE IN THE USA WHICH DATA SET WILL YOU INVESTIGATE FURTHER? Year Occupied Forest (acres) Deforestation (ha) Pesticides (millions of lbs) 2003* 23.0 140.4 476.5 2005* 10.0 480.4 488.2 2006 17.0 462.4 485.9 2007 11.4 244.6 503.8 2008 12.5 259.0 516.1 * two year averages

Slide 21

Slide 21 text

INVESTIGATE DEFORESTATION OR INVESTIGATE PESTICIDES WHICH DATA SET WILL YOU INVESTIGATE FURTHER? YOU HAVE OBTAINED DATA FOR BOTH DEFORESTATION IN MEXICO AND PESTICIDE USAGE IN THE USA

Slide 22

Slide 22 text

MONARCH BUTTERFLY RESERVE DEFORESTATION WORLD WILDLIFE FUND MÉXICO

Slide 23

Slide 23 text

INVESTIGATE DEFORESTATION (PANDAS) >>> from pandas import ExcelFile >>> >>> book = ExcelFile(‘data.xlsx’) >>> >>> deforestation = book.parse( >>> 'deforestation') >>> >>> hibernation = book.parse( >>> 'hibernation') >>> >>> df = deforestation.merge( >>> hibernation, on='Year') >>>

Slide 24

Slide 24 text

INVESTIGATE DEFORESTATION (PANDAS) >>> df.head() Year Deforested... Forest Area... 0 2003 140.372293 27.48 1 2005 480.449496 14.60 2 2006 462.360283 16.98 3 2007 244.566159 11.39 4 2008 259.037529 12.50

Slide 25

Slide 25 text

INVESTIGATE DEFORESTATION (PANDAS) >>> X = df['Deforested Area (ha)'] >>> X[:3] 0 140.372293 1 480.449496 2 462.360283 >>> Y = df['Forest Area (acres)'] >>> Y[:3] 0 27.48 1 14.60 2 16.98 >>> cor_coef = X.corr(Y) >>> cor_coef 0.5801873106352113

Slide 26

Slide 26 text

INVESTIGATE DEFORESTATION (NOT PANDAS) >>> from xlrd import open_workbook >>> >>> book = open_workbook('data.xlsx') >>> >>> deforestation = book.sheet_by_name( >>> 'deforestation') >>> >>> hibernation = book.sheet_by_name( >>> 'hibernation') >>> >>> def_years = deforestation.col_values( >>> 0, start_rowx=1) >>> >>> hiber_years = hibernation.col_values( >>> 0, start_rowx=1)

Slide 27

Slide 27 text

INVESTIGATE DEFORESTATION (NOT PANDAS) >>> common = set(def_years).intersection( >>> hiber_years) >>> >>> common {2003.0, 2005.0, 2006.0, 2007.0, 2008.0, 2009.0, 2010.0, 2011.0, 2012.0, 2013.0, 2014.0} >>> drows = deforestation.get_rows() >>> hrows = hibernation.get_rows() >>> >>> X = [ >>> r[2].value for r in drows >>> if r[0].value in common]

Slide 28

Slide 28 text

INVESTIGATE DEFORESTATION (NOT PANDAS) >>> Y = [ >>> r[1].value for r in hrows >>> if r[0].value in common]

Slide 29

Slide 29 text

INVESTIGATE DEFORESTATION (NOT PANDAS) >>> from statistics import mean, pstdev >>> from itertools import starmap >>> from operator import mul >>> >>> >>> def correlation(X, Y): >>> prod = starmap(mul, zip(X, Y)) >>> ave = sum(prod) / len(X) >>> covar = ave - mean(X) * mean(Y) >>> std_prod = pstdev(X) * pstdev(Y) >>> return covar / std_prod >>> >>> cor_coef = correlation(X, Y)

Slide 30

Slide 30 text

INVESTIGATE DEFORESTATION (NOT PANDAS) >>> cor_coef 0.5801873106352112

Slide 31

Slide 31 text

INVESTIGATE DEFORESTATION (MEZA) >>> from meza.io import read_xls >>> from meza.process import tfilter >>> >>> deforestation = read_xls( >>> 'data.xlsx', sheet=1, >>> sanitize=True) >>> >>> hibernation = read_xls( >>> 'data.xlsx', sanitize=True) >>> >>> pred = lambda y: float(y) in common >>> >>> drecords = tfilter( >>> deforestation, 'year', pred)

Slide 32

Slide 32 text

INVESTIGATE DEFORESTATION (MEZA) >>> hrecords = tfilter( >>> hibernation, 'year', pred) >>> >>> X = [ >>> float(r['deforested_area_ha']) >>> for r in drecords] >>> >>> Y = [ >>> float(r['forest_area_acres']) >>> for r in hrecords] >>> >>> cor_coef = correlation(X, Y) >>> cor_coef 0.5801873106352112

Slide 33

Slide 33 text

¡FELICIDADES! YOU’VE OBTAINED THE DEFORESTATION CORRELATIONS

Slide 34

Slide 34 text

¡FELICIDADES! YOU’VE OBTAINED THE DEFORESTATION CORRELATIONS DEFORESTATION VS OCCUPIED FOREST OCCUPIED FOREST (ACRES) 0 7.5 15 22.5 30 DEFORESTED AREA (HA) 0 125 250 375 500

Slide 35

Slide 35 text

YOUR COLLEAGUES WOULD LIKE TO ACCESS YOUR FINDINGS AS EITHER A CSV OR JSON FILE HOW DO YOU WANT TO SAVE YOUR RESULTS? SAVE TO CSV OR SAVE TO JSON

Slide 36

Slide 36 text

SAVE TO A JSON FILE (PANDAS) >>> from pandas import DataFrame >>> >>> metric = 'correlation coefficient' >>> data = { >>> 'metric': [metric], >>> 'value': [cor_coef]} >>> >>> df = DataFrame(data=data) >>> df metric value 0 correlation coefficient 0.311468 >>> df.to_json('results.json')

Slide 37

Slide 37 text

SAVE TO A JSON FILE (NOT PANDAS) >>> from json import dump >>> >>> row = { >>> 'metric': metric, >>> 'value': cor_coef} >>> >>> with open('results.json', 'w') as f: >>> dump([row], f)

Slide 38

Slide 38 text

SAVE TO A JSON FILE (MEZA) >>> from meza.convert import records2json >>> from meza.io import write >>> >>> record = { >>> 'metric': metric, >>> 'value': cor_coef} >>> >>> results = records2json([record]) >>> write('results.json', results) 68

Slide 39

Slide 39 text

¡FELICIDADES! YOU’RE NOW ABLE TO READ, MANIPULATE, AND SAVE DATA. WITHOUT EVER TOUCHING PANDAS!

Slide 40

Slide 40 text

¡GRACIAS!

Slide 41

Slide 41 text

¿PREGUNTAS? REUBEN CUMMINGS @REUBANO

Slide 42

Slide 42 text

EXTRA SLIDES

Slide 43

Slide 43 text

SCRAPE THE HTML FILE (PANDAS) >>> from pandas import read_html >>> >>> with open('hibernation.html') as f: >>> df = read_html(f)[0] >>> hibernation = df[1:] >>> >>> hibernation.head() Year Forest Area (acres) 1 1993 15.39 2 1994 19.3 3 1995 31.16 4 1996 44.95 5 1997 14.26

Slide 44

Slide 44 text

SCRAPE THE HTML FILE (NOT PANDAS) >>> from bs4 import BeautifulSoup >>> >>> with open('hibernation.html') as f: >>> soup = BeautifulSoup(f, 'lxml') >>> trs = soup.table.find_all('tr') >>> >>> def gen_rows(trs): >>> for tr in trs: >>> row = tr.find_all('th') >>> row = row or tr.find_all('td') >>> yield tuple( >>> el.text for el in row)

Slide 45

Slide 45 text

SCRAPE THE HTML FILE (NOT PANDAS) >>> from itertools import islice >>> >>> rows = gen_rows(trs) >>> list(islice(rows, 3)) [('Year', 'Forest Area (acres)'), ('1993', '15.39'), (‘1994’, '19.3')]

Slide 46

Slide 46 text

>>> from meza.io import read_html >>> from meza.process import peek >>> >>> hibernation = read_html( >>> 'hibernation.html', sanitize=True) >>> >>> hibernation, head = peek(hibernation) >>> head[:3] [{'forest_area_acres': '15.39', 'year': '1993'}, {'forest_area_acres': '19.3', 'year': '1994'}, {'forest_area_acres': '31.16', 'year': '1995'}] SCRAPE THE HTML FILE (MEZA)

Slide 47

Slide 47 text

PESTICIDE USE IN U.S. AGRICULTURE USDA

Slide 48

Slide 48 text

INVESTIGATE PESTICIDES (PANDAS) >>> from pandas import ExcelFile >>> >>> book = ExcelFile('data.xlsx') >>> >>> pesticides = book.parse( >>> 'pesticides') >>> >>> hibernation = book.parse( >>> 'hibernation') >>> >>> df = pesticides.merge( >>> hibernation, on='Year')

Slide 49

Slide 49 text

INVESTIGATE PESTICIDES (PANDAS) >>> df.head() Year Total... Forest Area... 0 1993 549.3853 15.39 1 1994 568.4952 19.30 2 1995 541.9101 31.16 3 1996 597.3228 44.95 4 1997 600.5113 14.26

Slide 50

Slide 50 text

INVESTIGATE PESTICIDES (PANDAS) >>> X = df['Total (millions of lbs)'] >>> X[:3] 0 549.3853 1 568.4952 2 541.9101 >>> Y = df['Forest Area (acres)'] >>> Y[:3] 0 15.39 1 19.30 2 31.16 >>> cor_coef = X.corr(Y) >>> cor_coef 0.3114682506273226

Slide 51

Slide 51 text

INVESTIGATE PESTICIDES (NOT PANDAS) >>> from xlrd import open_workbook >>> >>> book = open_workbook('data.xlsx') >>> >>> pesticides = book.sheet_by_name( >>> 'pesticides') >>> >>> hibernation = book.sheet_by_name( >>> 'hibernation') >>> >>> pest_years = pesticides.col_values( >>> 0, start_rowx=1) >>> >>> hiber_years = hibernation.col_values( >>> 0, start_rowx=1)

Slide 52

Slide 52 text

INVESTIGATE PESTICIDES (NOT PANDAS) >>> common = set(pest_years).intersection( >>> hiber_years) >>> >>> common {1993.0, 1994.0, 1995.0, 1996.0, 1997.0, 1998.0, 1999.0, 2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0, …, 2008.0} >>> prows = pesticides.get_rows() >>> hrows = hibernation.get_rows() >>> >>> X = [ >>> r[5].value for r in prows >>> if r[0].value in common]

Slide 53

Slide 53 text

INVESTIGATE PESTICIDES (NOT PANDAS) >>> Y = [ >>> r[1].value for r in hrows >>> if r[0].value in common]

Slide 54

Slide 54 text

INVESTIGATE PESTICIDES (NOT PANDAS) >>> from statistics import mean, pstdev >>> from itertools import starmap >>> from operator import mul >>> >>> >>> def correlation(X, Y): >>> prod = starmap(mul, zip(X, Y)) >>> ave = sum(prod) / len(X) >>> covar = ave - mean(X) * mean(Y) >>> std_prod = pstdev(X) * pstdev(Y) >>> return covar / std_prod >>> >>> cor_coef = correlation(X, Y)

Slide 55

Slide 55 text

INVESTIGATE PESTICIDES (NOT PANDAS) >>> cor_coef 0.3114682506273143

Slide 56

Slide 56 text

INVESTIGATE PESTICIDES (MEZA) >>> from meza.io import read_xls >>> from meza.process import tfilter >>> >>> pesticides = read_xls( >>> 'data.xlsx', sheet=2, >>> sanitize=True) >>> >>> hibernation = read_xls( >>> 'data.xlsx', sanitize=True) >>> >>> pred = lambda y: float(y) in common >>> >>> precords = tfilter( >>> pesticides, 'year', pred)

Slide 57

Slide 57 text

INVESTIGATE PESTICIDES (MEZA) >>> hrecords = tfilter( >>> hibernation, 'year', pred) >>> >>> X = [ >>> float(r['total_millions_of_lbs']) >>> for r in precords] >>> >>> Y = [ >>> float(r['forest_area_acres']) >>> for r in hrecords] >>> >>> cor_coef = correlation(X, Y) >>> cor_coef 0.3114682506273143

Slide 58

Slide 58 text

¡FELICIDADES! YOU’VE OBTAINED THE PESTICIDE CORRELATIONS

Slide 59

Slide 59 text

¡FELICIDADES! YOU’VE OBTAINED THE PESTICIDE CORRELATIONS US PESTICIDE USE VS OCCUPIED FOREST OCCUPIED FOREST (ACRES) 0 12.5 25 37.5 50 PESTICIDE USE (MILLIONS OF LBS) 0 175 350 525 700

Slide 60

Slide 60 text

SAVE TO A CSV FILE (PANDAS) >>> from pandas import DataFrame >>> >>> metric = 'correlation coefficient' >>> data = { >>> 'metric': [metric], >>> 'value': [cor_coef]} >>> >>> df = DataFrame(data=data) >>> df metric value 0 correlation coefficient 0.311468 >>> df.to_csv('results.csv')

Slide 61

Slide 61 text

SAVE TO A CSV FILE (NOT PANDAS) >>> from csv import DictWriter >>> >>> row = { >>> 'metric': metric, >>> 'value': cor_coef} >>> >>> with open('results.csv', 'w') as f: >>> fieldnames = ['metric', 'value'] >>> writer = DictWriter( >>> f, fieldnames=fieldnames) >>> writer.writeheader() >>> writer.writerow(row)

Slide 62

Slide 62 text

SAVE TO A CSV FILE (MEZA) >>> from meza.convert import records2csv >>> from meza.io import write >>> >>> record = { >>> 'metric': metric, >>> 'value': cor_coef} >>> >>> results = records2csv([record]) >>> write('results.csv', results) 58