
You Might Not Need Pandas


While Pandas makes it convenient to perform most types of data analysis, it bears the cost of a complex installation and extra dependencies. In many cases, you can write a bit of utility code and forgo Pandas altogether. In this talk, you will learn pure-Python alternatives to some of these common tasks.

Reuben Cummings

March 15, 2018

Transcript

  1. YOU MIGHT NOT NEED PANDAS
     DATA DAY MEXICO — MEXICO CITY
     MARCH 15, 2018
     BY REUBEN CUMMINGS, @REUBANO
  2. WHO AM I?
     • Managing Director, Nerevu Development
     • Founder of Arusha Coders
     • Author of several popular Python packages
     • @reubano on Twitter and GitHub
  3. WHY PANDAS? WHO DOESN’T LOVE THESE CUTE CUDDLY CREATURES?
     • It’s fast
     • It’s ubiquitous
     • It gets the job done
     Photo credit: Mathias Appel (@mathiasappel)
  4. WHEN SHOULD YOU USE PANDAS? BECAUSE IT ISN’T THAT BAD AFTER ALL
     • Speed is paramount
     • You already have pandas code available
     • You don’t have time to learn anything new
  5. WHY NOT PANDAS? WELL, THEY NEVER CLAIMED TO BE GRACEFUL…
     • It’s complex
     • It’s large
     • It likes lots of memory
     Photo credit: CSBaltimore
  6. WHEN SHOULDN’T YOU USE PANDAS? BECAUSE SOMETIMES IT ISN’T THAT GREAT EITHER
     • You don’t want all those dependencies
     • You like functional programming
     • You have tight RAM constraints
  7. DISCLAIMER: ALL THE THINGS I AM NOT
     • Pandas expert
     • Statistician
     • Lepidopterologist (butterfly scientist)
  8. HOW WILL YOU OBTAIN THE DATA? THE DATA TABLE IS AVAILABLE AS BOTH EXCEL AND HTML FILES

     Year  Occupied Forest (acres)
     1993  15.39
     1994  19.30
     1995  31.16
     1996  44.95
     1997  14.26
  9. HOW WILL YOU OBTAIN THE DATA? THE DATA TABLE IS AVAILABLE AS BOTH EXCEL AND HTML FILES
     PARSE THE EXCEL FILE or SCRAPE THE HTML FILE
  10. PARSE THE EXCEL FILE (PANDAS)

      >>> from pandas import ExcelFile
      >>>
      >>> book = ExcelFile('data.xlsx')
      >>> hibernation = book.parse('hibernation')
      >>>
      >>> hibernation.head()
         Year  Forest Area (acres)
      0  1993                15.39
      1  1994                19.30
      2  1995                31.16
      3  1996                44.95
      4  1997                14.26
  11. PARSE THE EXCEL FILE (NOT PANDAS)

      >>> from xlrd import open_workbook
      >>>
      >>> book = open_workbook('data.xlsx')
      >>> hibernation = book.sheet_by_name('hibernation')
      >>>
      >>> hibernation.row_values(0)
      ['Year', 'Forest Area (acres)']
      >>> hibernation.row_values(1)
      [1993.0, 15.39]
  12. PARSE THE EXCEL FILE (MEZA)

      >>> from meza.io import read_xls
      >>> from meza.process import peek
      >>>
      >>> hibernation = read_xls('data.xlsx', sanitize=True)
      >>>
      >>> hibernation, head = peek(hibernation)
      >>> head[:3]
      [{'forest_area_acres': '15.39', 'year': '1993.0'},
       {'forest_area_acres': '19.3', 'year': '1994.0'},
       {'forest_area_acres': '31.16', 'year': '1995.0'}]
  13. ¡FELICIDADES! YOU’VE OBTAINED THE FOREST AREA ESTIMATES
      • The declining forest area worries you
      • You suspect deforestation
      • You also suspect pesticides
      [Chart: acres of “butterfly occupied” forest, 1993–1997]
  14. WHICH DATA SET WILL YOU INVESTIGATE FURTHER? YOU HAVE OBTAINED DATA FOR BOTH DEFORESTATION IN MEXICO AND PESTICIDE USAGE IN THE USA

      Year   Occupied Forest (acres)   Deforestation (ha)   Pesticides (millions of lbs)
      2003*  23.0                      140.4                476.5
      2005*  10.0                      480.4                488.2
      2006   17.0                      462.4                485.9
      2007   11.4                      244.6                503.8
      2008   12.5                      259.0                516.1
      * two-year averages
  15. WHICH DATA SET WILL YOU INVESTIGATE FURTHER? YOU HAVE OBTAINED DATA FOR BOTH DEFORESTATION IN MEXICO AND PESTICIDE USAGE IN THE USA
      INVESTIGATE DEFORESTATION or INVESTIGATE PESTICIDES
  16. INVESTIGATE DEFORESTATION (PANDAS)

      >>> from pandas import ExcelFile
      >>>
      >>> book = ExcelFile('data.xlsx')
      >>>
      >>> deforestation = book.parse('deforestation')
      >>> hibernation = book.parse('hibernation')
      >>>
      >>> df = deforestation.merge(hibernation, on='Year')
  17. INVESTIGATE DEFORESTATION (PANDAS)

      >>> df.head()
         Year  Deforested...  Forest Area...
      0  2003     140.372293           27.48
      1  2005     480.449496           14.60
      2  2006     462.360283           16.98
      3  2007     244.566159           11.39
      4  2008     259.037529           12.50
  18. INVESTIGATE DEFORESTATION (PANDAS)

      >>> X = df['Deforested Area (ha)']
      >>> X[:3]
      0    140.372293
      1    480.449496
      2    462.360283
      >>> Y = df['Forest Area (acres)']
      >>> Y[:3]
      0    27.48
      1    14.60
      2    16.98
      >>> cor_coef = X.corr(Y)
      >>> cor_coef
      0.5801873106352113
  19. INVESTIGATE DEFORESTATION (NOT PANDAS)

      >>> from xlrd import open_workbook
      >>>
      >>> book = open_workbook('data.xlsx')
      >>>
      >>> deforestation = book.sheet_by_name('deforestation')
      >>> hibernation = book.sheet_by_name('hibernation')
      >>>
      >>> def_years = deforestation.col_values(0, start_rowx=1)
      >>> hiber_years = hibernation.col_values(0, start_rowx=1)
  20. INVESTIGATE DEFORESTATION (NOT PANDAS)

      >>> common = set(def_years).intersection(hiber_years)
      >>> common
      {2003.0, 2005.0, 2006.0, 2007.0, 2008.0, 2009.0, 2010.0,
       2011.0, 2012.0, 2013.0, 2014.0}
      >>> drows = deforestation.get_rows()
      >>> hrows = hibernation.get_rows()
      >>>
      >>> X = [
      ...     r[2].value for r in drows
      ...     if r[0].value in common]
  21. INVESTIGATE DEFORESTATION (NOT PANDAS)

      >>> Y = [
      ...     r[1].value for r in hrows
      ...     if r[0].value in common]
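(Editor's aside, not on the slides: the two list comprehensions above only pair X and Y correctly when both sheets are sorted identically by year. A small dict-based join keyed on year avoids that assumption. A minimal sketch with made-up numbers; the tuple layout is illustrative:)

```python
# Toy stand-ins for the two sheets: (year, value) pairs, deliberately
# NOT in the same order, to show the join does not depend on ordering.
deforestation = [(2003.0, 140.4), (2005.0, 480.4), (2006.0, 462.4)]
hibernation = [(2006.0, 16.98), (2003.0, 27.48), (2005.0, 14.60)]

# Index one table by year, then join the other against it
by_year = {year: acres for year, acres in hibernation}
pairs = [
    (deforested, by_year[year])
    for year, deforested in deforestation
    if year in by_year
]

X = [p[0] for p in pairs]
Y = [p[1] for p in pairs]
print(X)  # [140.4, 480.4, 462.4]
print(Y)  # [27.48, 14.6, 16.98]
```

Each X value is now guaranteed to sit next to the Y value from the same year, regardless of row order in either sheet.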
  22. INVESTIGATE DEFORESTATION (NOT PANDAS)

      >>> from statistics import mean, pstdev
      >>> from itertools import starmap
      >>> from operator import mul
      >>>
      >>> def correlation(X, Y):
      ...     prod = starmap(mul, zip(X, Y))
      ...     ave = sum(prod) / len(X)
      ...     covar = ave - mean(X) * mean(Y)
      ...     std_prod = pstdev(X) * pstdev(Y)
      ...     return covar / std_prod
      >>>
      >>> cor_coef = correlation(X, Y)
  23. INVESTIGATE DEFORESTATION (MEZA)

      >>> from meza.io import read_xls
      >>> from meza.process import tfilter
      >>>
      >>> deforestation = read_xls('data.xlsx', sheet=1, sanitize=True)
      >>> hibernation = read_xls('data.xlsx', sanitize=True)
      >>>
      >>> pred = lambda y: float(y) in common
      >>>
      >>> drecords = tfilter(deforestation, 'year', pred)
  24. INVESTIGATE DEFORESTATION (MEZA)

      >>> hrecords = tfilter(hibernation, 'year', pred)
      >>>
      >>> X = [float(r['deforested_area_ha']) for r in drecords]
      >>> Y = [float(r['forest_area_acres']) for r in hrecords]
      >>>
      >>> cor_coef = correlation(X, Y)
      >>> cor_coef
      0.5801873106352112
  25. ¡FELICIDADES! YOU’VE OBTAINED THE DEFORESTATION CORRELATIONS
      [Chart: deforested area (ha) vs. occupied forest (acres)]
  26. HOW DO YOU WANT TO SAVE YOUR RESULTS? YOUR COLLEAGUES WOULD LIKE TO ACCESS YOUR FINDINGS AS EITHER A CSV OR JSON FILE
      SAVE TO CSV or SAVE TO JSON
  27. SAVE TO A JSON FILE (PANDAS)

      >>> from pandas import DataFrame
      >>>
      >>> metric = 'correlation coefficient'
      >>> data = {'metric': [metric], 'value': [cor_coef]}
      >>>
      >>> df = DataFrame(data=data)
      >>> df
                          metric     value
      0  correlation coefficient  0.311468
      >>> df.to_json('results.json')
  28. SAVE TO A JSON FILE (NOT PANDAS)

      >>> from json import dump
      >>>
      >>> row = {'metric': metric, 'value': cor_coef}
      >>>
      >>> with open('results.json', 'w') as f:
      ...     dump([row], f)
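(Editor's aside: one way to convince yourself the stdlib version round-trips cleanly is to write to an in-memory buffer and read it straight back. A minimal sketch; the record values are illustrative:)

```python
import json
from io import StringIO

# Hypothetical result record, mirroring the slide's shape
row = {'metric': 'correlation coefficient', 'value': 0.311468}

# Write to an in-memory buffer instead of results.json
buf = StringIO()
json.dump([row], buf)

# Read it back and confirm nothing was lost
buf.seek(0)
restored = json.load(buf)
print(restored == [row])  # True
```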
  29. SAVE TO A JSON FILE (MEZA)

      >>> from meza.convert import records2json
      >>> from meza.io import write
      >>>
      >>> record = {'metric': metric, 'value': cor_coef}
      >>>
      >>> results = records2json([record])
      >>> write('results.json', results)
      68
  30. SCRAPE THE HTML FILE (PANDAS)

      >>> from pandas import read_html
      >>>
      >>> with open('hibernation.html') as f:
      ...     df = read_html(f)[0]
      ...     hibernation = df[1:]
      >>>
      >>> hibernation.head()
         Year  Forest Area (acres)
      1  1993                15.39
      2  1994                 19.3
      3  1995                31.16
      4  1996                44.95
      5  1997                14.26
  31. SCRAPE THE HTML FILE (NOT PANDAS)

      >>> from bs4 import BeautifulSoup
      >>>
      >>> with open('hibernation.html') as f:
      ...     soup = BeautifulSoup(f, 'lxml')
      ...     trs = soup.table.find_all('tr')
      >>>
      >>> def gen_rows(trs):
      ...     for tr in trs:
      ...         row = tr.find_all('th')
      ...         row = row or tr.find_all('td')
      ...         yield tuple(el.text for el in row)
  32. SCRAPE THE HTML FILE (NOT PANDAS)

      >>> from itertools import islice
      >>>
      >>> rows = gen_rows(trs)
      >>> list(islice(rows, 3))
      [('Year', 'Forest Area (acres)'),
       ('1993', '15.39'),
       ('1994', '19.3')]
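(Editor's aside: if even BeautifulSoup is an unwanted dependency, the standard library's `html.parser` can extract the same rows. A minimal sketch; the HTML snippet is an illustrative stand-in for hibernation.html:)

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect each <tr> as a tuple of its <th>/<td> text."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('th', 'td'):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == 'tr' and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None
        elif tag in ('th', 'td'):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a cell
        if self._in_cell:
            self._row.append(data.strip())


html = """
<table>
  <tr><th>Year</th><th>Forest Area (acres)</th></tr>
  <tr><td>1993</td><td>15.39</td></tr>
  <tr><td>1994</td><td>19.3</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)
print(parser.rows)
# [('Year', 'Forest Area (acres)'), ('1993', '15.39'), ('1994', '19.3')]
```

This trades bs4's robustness against messy markup for zero dependencies, which fits the talk's theme.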
  33. SCRAPE THE HTML FILE (MEZA)

      >>> from meza.io import read_html
      >>> from meza.process import peek
      >>>
      >>> hibernation = read_html('hibernation.html', sanitize=True)
      >>>
      >>> hibernation, head = peek(hibernation)
      >>> head[:3]
      [{'forest_area_acres': '15.39', 'year': '1993'},
       {'forest_area_acres': '19.3', 'year': '1994'},
       {'forest_area_acres': '31.16', 'year': '1995'}]
  34. INVESTIGATE PESTICIDES (PANDAS)

      >>> from pandas import ExcelFile
      >>>
      >>> book = ExcelFile('data.xlsx')
      >>>
      >>> pesticides = book.parse('pesticides')
      >>> hibernation = book.parse('hibernation')
      >>>
      >>> df = pesticides.merge(hibernation, on='Year')
  35. INVESTIGATE PESTICIDES (PANDAS)

      >>> df.head()
         Year  Total...  Forest Area...
      0  1993  549.3853           15.39
      1  1994  568.4952           19.30
      2  1995  541.9101           31.16
      3  1996  597.3228           44.95
      4  1997  600.5113           14.26
  36. INVESTIGATE PESTICIDES (PANDAS)

      >>> X = df['Total (millions of lbs)']
      >>> X[:3]
      0    549.3853
      1    568.4952
      2    541.9101
      >>> Y = df['Forest Area (acres)']
      >>> Y[:3]
      0    15.39
      1    19.30
      2    31.16
      >>> cor_coef = X.corr(Y)
      >>> cor_coef
      0.3114682506273226
  37. INVESTIGATE PESTICIDES (NOT PANDAS)

      >>> from xlrd import open_workbook
      >>>
      >>> book = open_workbook('data.xlsx')
      >>>
      >>> pesticides = book.sheet_by_name('pesticides')
      >>> hibernation = book.sheet_by_name('hibernation')
      >>>
      >>> pest_years = pesticides.col_values(0, start_rowx=1)
      >>> hiber_years = hibernation.col_values(0, start_rowx=1)
  38. INVESTIGATE PESTICIDES (NOT PANDAS)

      >>> common = set(pest_years).intersection(hiber_years)
      >>> common
      {1993.0, 1994.0, 1995.0, 1996.0, 1997.0, 1998.0, 1999.0,
       2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0, …, 2008.0}
      >>> prows = pesticides.get_rows()
      >>> hrows = hibernation.get_rows()
      >>>
      >>> X = [
      ...     r[5].value for r in prows
      ...     if r[0].value in common]
  39. INVESTIGATE PESTICIDES (NOT PANDAS)

      >>> Y = [
      ...     r[1].value for r in hrows
      ...     if r[0].value in common]
  40. INVESTIGATE PESTICIDES (NOT PANDAS)

      >>> from statistics import mean, pstdev
      >>> from itertools import starmap
      >>> from operator import mul
      >>>
      >>> def correlation(X, Y):
      ...     prod = starmap(mul, zip(X, Y))
      ...     ave = sum(prod) / len(X)
      ...     covar = ave - mean(X) * mean(Y)
      ...     std_prod = pstdev(X) * pstdev(Y)
      ...     return covar / std_prod
      >>>
      >>> cor_coef = correlation(X, Y)
  41. INVESTIGATE PESTICIDES (MEZA)

      >>> from meza.io import read_xls
      >>> from meza.process import tfilter
      >>>
      >>> pesticides = read_xls('data.xlsx', sheet=2, sanitize=True)
      >>> hibernation = read_xls('data.xlsx', sanitize=True)
      >>>
      >>> pred = lambda y: float(y) in common
      >>>
      >>> precords = tfilter(pesticides, 'year', pred)
  42. INVESTIGATE PESTICIDES (MEZA)

      >>> hrecords = tfilter(hibernation, 'year', pred)
      >>>
      >>> X = [float(r['total_millions_of_lbs']) for r in precords]
      >>> Y = [float(r['forest_area_acres']) for r in hrecords]
      >>>
      >>> cor_coef = correlation(X, Y)
      >>> cor_coef
      0.3114682506273143
  43. ¡FELICIDADES! YOU’VE OBTAINED THE PESTICIDE CORRELATIONS
      [Chart: US pesticide use (millions of lbs) vs. occupied forest (acres)]
  44. SAVE TO A CSV FILE (PANDAS)

      >>> from pandas import DataFrame
      >>>
      >>> metric = 'correlation coefficient'
      >>> data = {'metric': [metric], 'value': [cor_coef]}
      >>>
      >>> df = DataFrame(data=data)
      >>> df
                          metric     value
      0  correlation coefficient  0.311468
      >>> df.to_csv('results.csv')
  45. SAVE TO A CSV FILE (NOT PANDAS)

      >>> from csv import DictWriter
      >>>
      >>> row = {'metric': metric, 'value': cor_coef}
      >>>
      >>> with open('results.csv', 'w') as f:
      ...     fieldnames = ['metric', 'value']
      ...     writer = DictWriter(f, fieldnames=fieldnames)
      ...     writer.writeheader()
      ...     writer.writerow(row)
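(Editor's aside: the csv docs recommend opening files with `newline=''` when writing, or Windows builds can emit a blank line after every row. A round-trip sketch using an in-memory buffer in place of results.csv; the record values are illustrative:)

```python
import csv
from io import StringIO

# Hypothetical result record, mirroring the slide's shape
row = {'metric': 'correlation coefficient', 'value': 0.311468}

# In-memory stand-in for open('results.csv', 'w', newline='')
buf = StringIO()
writer = csv.DictWriter(buf, fieldnames=['metric', 'value'])
writer.writeheader()
writer.writerow(row)

# Read it back; note csv returns every field as a string
buf.seek(0)
restored = list(csv.DictReader(buf))
print(restored)
# [{'metric': 'correlation coefficient', 'value': '0.311468'}]
```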
  46. SAVE TO A CSV FILE (MEZA)

      >>> from meza.convert import records2csv
      >>> from meza.io import write
      >>>
      >>> record = {'metric': metric, 'value': cor_coef}
      >>>
      >>> results = records2csv([record])
      >>> write('results.csv', results)
      58