You Might Not Need Pandas

While Pandas makes it convenient to perform most types of data analysis, it bears the cost of a complex installation and extra dependencies. In many cases, you can write utility code to forgo Pandas altogether. In this talk, you will learn pure Python alternatives to some of these common tasks.

Reuben Cummings

March 15, 2018

Transcript

  1. YOU MIGHT NOT NEED PANDAS
    DATA DAY MEXICO — MEXICO CITY
    MARCH 15, 2018
    BY REUBEN CUMMINGS
    @REUBANO

  2. WHO AM I?
    • Managing Director, Nerevu Development
    • Founder of Arusha Coders
    • Author of several popular Python packages
    • @reubano on Twitter and GitHub

  3. WHO DOESN’T LOVE THESE CUTE CUDDLY CREATURES?
    WHY PANDAS?
    • It’s fast
    • It’s ubiquitous
    • It gets the job done
    Photo credit: Mathias Appel (@mathiasappel)

  4. BECAUSE IT ISN’T THAT BAD AFTER ALL
    WHEN SHOULD YOU USE PANDAS?
    • Speed is paramount
    • You already have pandas code available
    • You don’t have time to learn anything new

  5. WELL, THEY NEVER CLAIMED TO BE GRACEFUL…
    WHY NOT PANDAS?
    • It’s complex
    • It’s large
    • It likes lots of memory
    Photo credit: CSBaltimore

  6. BECAUSE SOMETIMES IT ISN’T THAT GREAT EITHER
    WHEN SHOULDN’T YOU USE PANDAS?
    • You don’t want all those dependencies
    • You like functional programming
    • You have tight RAM constraints

  7. PANDAS ALTERNATIVES
    • Pure Python
    • meza (https://github.com/reubano/meza)
    • Other libraries (csvkit, messytables, etc.)

  8. LA MARIPOSA MONARCA (THE MONARCH BUTTERFLY)
    A CHOOSE YOUR OWN ADVENTURE ANALYSIS
    Photo credit: Pablo Leautaud (@pleautaud)

  9. DISCLAIMER

  10. ALL THE THINGS I AM NOT
    DISCLAIMER
    • Pandas expert
    • Statistician
    • Lepidopterologist (butterfly scientist)

  11. MONARCH BUTTERFLY
    HIBERNATING COLONIES
    WORLD WILDLIFE FUND MÉXICO

  12. Photo credit: Adam Jones (@adam_jones)

  13. THE DATA TABLE IS AVAILABLE AS BOTH EXCEL AND HTML FILES
    HOW WILL YOU OBTAIN THE DATA?
    Year  Occupied Forest (acres)
    1993  15.39
    1994  19.30
    1995  31.16
    1996  44.95
    1997  14.26

  14. THE DATA TABLE IS AVAILABLE AS BOTH EXCEL AND HTML FILES
    HOW WILL YOU OBTAIN THE DATA?
    PARSE THE EXCEL FILE
    OR
    SCRAPE THE HTML FILE

  15. PARSE THE EXCEL FILE (PANDAS)
    >>> from pandas import ExcelFile
    >>>
    >>> book = ExcelFile('data.xlsx')
    >>> hibernation = book.parse(
    ...     'hibernation')
    >>>
    >>> hibernation.head()
       Year  Forest Area (acres)
    0  1993                15.39
    1  1994                19.30
    2  1995                31.16
    3  1996                44.95
    4  1997                14.26

  16. PARSE THE EXCEL FILE (NOT PANDAS)
    >>> from xlrd import open_workbook
    >>>
    >>> book = open_workbook('data.xlsx')
    >>> hibernation = book.sheet_by_name(
    ...     'hibernation')
    >>>
    >>> hibernation.row_values(0)
    ['Year', 'Forest Area (acres)']
    >>> hibernation.row_values(1)
    [1993.0, 15.39]

  17. PARSE THE EXCEL FILE (MEZA)
    >>> from meza.io import read_xls
    >>> from meza.process import peek
    >>>
    >>> hibernation = read_xls(
    ...     'data.xlsx', sanitize=True)
    >>>
    >>> hibernation, head = peek(hibernation)
    >>> head[:3]
    [{'forest_area_acres': '15.39', 'year': '1993.0'},
     {'forest_area_acres': '19.3', 'year': '1994.0'},
     {'forest_area_acres': '31.16', 'year': '1995.0'}]

  18. ¡FELICIDADES! (CONGRATULATIONS!)
    YOU’VE OBTAINED ESTIMATES FOR THE FOREST AREA
    OCCUPIED BY BUTTERFLY COLONIES

  19. YOU’VE OBTAINED THE FOREST AREA ESTIMATES
    ¡FELICIDADES!
    • The declining forest area worries you
    • You suspect deforestation
    • You also suspect pesticides
    [Chart: acres of “butterfly occupied” forest, 1993–1997]

  20. YOU HAVE OBTAINED DATA FOR BOTH DEFORESTATION IN
    MEXICO AND PESTICIDE USAGE IN THE USA
    WHICH DATA SET WILL YOU INVESTIGATE FURTHER?
    Year   Occupied Forest (acres)  Deforestation (ha)  Pesticides (millions of lbs)
    2003*  23.0                     140.4               476.5
    2005*  10.0                     480.4               488.2
    2006   17.0                     462.4               485.9
    2007   11.4                     244.6               503.8
    2008   12.5                     259.0               516.1
    * two year averages

  21. YOU HAVE OBTAINED DATA FOR BOTH DEFORESTATION IN
    MEXICO AND PESTICIDE USAGE IN THE USA
    WHICH DATA SET WILL YOU INVESTIGATE FURTHER?
    INVESTIGATE DEFORESTATION
    OR
    INVESTIGATE PESTICIDES

  22. MONARCH BUTTERFLY
    RESERVE DEFORESTATION
    WORLD WILDLIFE FUND MÉXICO

  23. INVESTIGATE DEFORESTATION (PANDAS)
    >>> from pandas import ExcelFile
    >>>
    >>> book = ExcelFile('data.xlsx')
    >>>
    >>> deforestation = book.parse(
    ...     'deforestation')
    >>>
    >>> hibernation = book.parse(
    ...     'hibernation')
    >>>
    >>> df = deforestation.merge(
    ...     hibernation, on='Year')

  24. INVESTIGATE DEFORESTATION (PANDAS)
    >>> df.head()
    Year Deforested... Forest Area...
    0 2003 140.372293 27.48
    1 2005 480.449496 14.60
    2 2006 462.360283 16.98
    3 2007 244.566159 11.39
    4 2008 259.037529 12.50

  25. INVESTIGATE DEFORESTATION (PANDAS)
    >>> X = df['Deforested Area (ha)']
    >>> X[:3]
    0 140.372293
    1 480.449496
    2 462.360283
    >>> Y = df['Forest Area (acres)']
    >>> Y[:3]
    0 27.48
    1 14.60
    2 16.98
    >>> cor_coef = X.corr(Y)
    >>> cor_coef
    0.5801873106352113

  26. INVESTIGATE DEFORESTATION (NOT PANDAS)
    >>> from xlrd import open_workbook
    >>>
    >>> book = open_workbook('data.xlsx')
    >>>
    >>> deforestation = book.sheet_by_name(
    ...     'deforestation')
    >>>
    >>> hibernation = book.sheet_by_name(
    ...     'hibernation')
    >>>
    >>> def_years = deforestation.col_values(
    ...     0, start_rowx=1)
    >>>
    >>> hiber_years = hibernation.col_values(
    ...     0, start_rowx=1)

  27. INVESTIGATE DEFORESTATION (NOT PANDAS)
    >>> common = set(def_years).intersection(
    ...     hiber_years)
    >>>
    >>> common
    {2003.0, 2005.0, 2006.0, 2007.0, 2008.0,
    2009.0, 2010.0, 2011.0, 2012.0, 2013.0,
    2014.0}
    >>> drows = deforestation.get_rows()
    >>> hrows = hibernation.get_rows()
    >>>
    >>> X = [
    ...     r[2].value for r in drows
    ...     if r[0].value in common]

  28. INVESTIGATE DEFORESTATION (NOT PANDAS)
    >>> Y = [
    ...     r[1].value for r in hrows
    ...     if r[0].value in common]

  29. INVESTIGATE DEFORESTATION (NOT PANDAS)
    >>> from statistics import mean, pstdev
    >>> from itertools import starmap
    >>> from operator import mul
    >>>
    >>> def correlation(X, Y):
    ...     prod = starmap(mul, zip(X, Y))
    ...     ave = sum(prod) / len(X)
    ...     covar = ave - mean(X) * mean(Y)
    ...     std_prod = pstdev(X) * pstdev(Y)
    ...     return covar / std_prod
    >>>
    >>> cor_coef = correlation(X, Y)

  30. INVESTIGATE DEFORESTATION (NOT PANDAS)
    >>> cor_coef
    0.5801873106352112

  31. INVESTIGATE DEFORESTATION (MEZA)
    >>> from meza.io import read_xls
    >>> from meza.process import tfilter
    >>>
    >>> deforestation = read_xls(
    ...     'data.xlsx', sheet=1,
    ...     sanitize=True)
    >>>
    >>> hibernation = read_xls(
    ...     'data.xlsx', sanitize=True)
    >>>
    >>> pred = lambda y: float(y) in common
    >>>
    >>> drecords = tfilter(
    ...     deforestation, 'year', pred)

  32. INVESTIGATE DEFORESTATION (MEZA)
    >>> hrecords = tfilter(
    ...     hibernation, 'year', pred)
    >>>
    >>> X = [
    ...     float(r['deforested_area_ha'])
    ...     for r in drecords]
    >>>
    >>> Y = [
    ...     float(r['forest_area_acres'])
    ...     for r in hrecords]
    >>>
    >>> cor_coef = correlation(X, Y)
    >>> cor_coef
    0.5801873106352112

  33. ¡FELICIDADES!
    YOU’VE OBTAINED THE DEFORESTATION CORRELATIONS

  34. ¡FELICIDADES!
    YOU’VE OBTAINED THE DEFORESTATION CORRELATIONS
    [Chart: deforested area (ha) vs occupied forest (acres)]

  35. YOUR COLLEAGUES WOULD LIKE TO ACCESS YOUR FINDINGS
    AS EITHER A CSV OR JSON FILE
    HOW DO YOU WANT TO SAVE YOUR RESULTS?
    SAVE TO CSV
    OR
    SAVE TO JSON

  36. SAVE TO A JSON FILE (PANDAS)
    >>> from pandas import DataFrame
    >>>
    >>> metric = 'correlation coefficient'
    >>> data = {
    ...     'metric': [metric],
    ...     'value': [cor_coef]}
    >>>
    >>> df = DataFrame(data=data)
    >>> df
    metric value
    0 correlation coefficient 0.580187
    >>> df.to_json('results.json')

  37. SAVE TO A JSON FILE (NOT PANDAS)
    >>> from json import dump
    >>>
    >>> row = {
    ...     'metric': metric,
    ...     'value': cor_coef}
    >>>
    >>> with open('results.json', 'w') as f:
    ...     dump([row], f)
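
Since `dump` writes the list of records verbatim, reading the file straight back with `json.load` confirms what colleagues will see; the value below is the deforestation coefficient computed earlier in the deck:

```python
from json import dump, load

row = {'metric': 'correlation coefficient',
       'value': 0.5801873106352112}

with open('results.json', 'w') as f:
    dump([row], f)

# round-trip: floats survive JSON serialization exactly, so
# json.load returns the identical record
with open('results.json') as f:
    assert load(f) == [row]
```
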

  38. SAVE TO A JSON FILE (MEZA)
    >>> from meza.convert import records2json
    >>> from meza.io import write
    >>>
    >>> record = {
    ...     'metric': metric,
    ...     'value': cor_coef}
    >>>
    >>> results = records2json([record])
    >>> write('results.json', results)

  39. ¡FELICIDADES!
    YOU’RE NOW ABLE TO READ, MANIPULATE, AND SAVE
    DATA, WITHOUT EVER TOUCHING PANDAS!

  40. ¡GRACIAS! (THANK YOU!)

  41. ¿PREGUNTAS? (QUESTIONS?)
    REUBEN CUMMINGS
    @REUBANO

  42. EXTRA SLIDES

  43. SCRAPE THE HTML FILE (PANDAS)
    >>> from pandas import read_html
    >>>
    >>> with open('hibernation.html') as f:
    ...     df = read_html(f)[0]
    ...
    >>> hibernation = df[1:]
    >>>
    >>> hibernation.head()
    Year Forest Area (acres)
    1 1993 15.39
    2 1994 19.3
    3 1995 31.16
    4 1996 44.95
    5 1997 14.26

  44. SCRAPE THE HTML FILE (NOT PANDAS)
    >>> from bs4 import BeautifulSoup
    >>>
    >>> with open('hibernation.html') as f:
    ...     soup = BeautifulSoup(f, 'lxml')
    ...
    >>> trs = soup.table.find_all('tr')
    >>>
    >>> def gen_rows(trs):
    ...     for tr in trs:
    ...         row = tr.find_all('th')
    ...         row = row or tr.find_all('td')
    ...         yield tuple(
    ...             el.text for el in row)

  45. SCRAPE THE HTML FILE (NOT PANDAS)
    >>> from itertools import islice
    >>>
    >>> rows = gen_rows(trs)
    >>> list(islice(rows, 3))
    [('Year', 'Forest Area (acres)'),
    ('1993', '15.39'),
    ('1994', '19.3')]
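
BeautifulSoup is itself a third-party dependency. In the spirit of the talk, the same `(cell, cell)` rows can be pulled out with only the standard library's `html.parser`; this is a minimal sketch, and the inline `doc` string mirrors the hibernation table from the slides:

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect each <tr> as a tuple of its <th>/<td> cell text."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th'):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == 'tr' and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None
        elif tag in ('td', 'th'):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())


doc = """<table>
<tr><th>Year</th><th>Forest Area (acres)</th></tr>
<tr><td>1993</td><td>15.39</td></tr>
</table>"""

parser = TableParser()
parser.feed(doc)
print(parser.rows)
# → [('Year', 'Forest Area (acres)'), ('1993', '15.39')]
```

Unlike BeautifulSoup, this gives no tree navigation, but for a single well-formed table it removes the `bs4` and `lxml` installs entirely.
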

  46. SCRAPE THE HTML FILE (MEZA)
    >>> from meza.io import read_html
    >>> from meza.process import peek
    >>>
    >>> hibernation = read_html(
    ...     'hibernation.html', sanitize=True)
    >>>
    >>> hibernation, head = peek(hibernation)
    >>> head[:3]
    [{'forest_area_acres': '15.39', 'year': '1993'},
     {'forest_area_acres': '19.3', 'year': '1994'},
     {'forest_area_acres': '31.16', 'year': '1995'}]

  47. PESTICIDE USE IN U.S. AGRICULTURE
    USDA

  48. INVESTIGATE PESTICIDES (PANDAS)
    >>> from pandas import ExcelFile
    >>>
    >>> book = ExcelFile('data.xlsx')
    >>>
    >>> pesticides = book.parse(
    ...     'pesticides')
    >>>
    >>> hibernation = book.parse(
    ...     'hibernation')
    >>>
    >>> df = pesticides.merge(
    ...     hibernation, on='Year')

  49. INVESTIGATE PESTICIDES (PANDAS)
    >>> df.head()
    Year Total... Forest Area...
    0 1993 549.3853 15.39
    1 1994 568.4952 19.30
    2 1995 541.9101 31.16
    3 1996 597.3228 44.95
    4 1997 600.5113 14.26

  50. INVESTIGATE PESTICIDES (PANDAS)
    >>> X = df['Total (millions of lbs)']
    >>> X[:3]
    0 549.3853
    1 568.4952
    2 541.9101
    >>> Y = df['Forest Area (acres)']
    >>> Y[:3]
    0 15.39
    1 19.30
    2 31.16
    >>> cor_coef = X.corr(Y)
    >>> cor_coef
    0.3114682506273226

  51. INVESTIGATE PESTICIDES (NOT PANDAS)
    >>> from xlrd import open_workbook
    >>>
    >>> book = open_workbook('data.xlsx')
    >>>
    >>> pesticides = book.sheet_by_name(
    ...     'pesticides')
    >>>
    >>> hibernation = book.sheet_by_name(
    ...     'hibernation')
    >>>
    >>> pest_years = pesticides.col_values(
    ...     0, start_rowx=1)
    >>>
    >>> hiber_years = hibernation.col_values(
    ...     0, start_rowx=1)

  52. INVESTIGATE PESTICIDES (NOT PANDAS)
    >>> common = set(pest_years).intersection(
    ...     hiber_years)
    >>>
    >>> common
    {1993.0, 1994.0, 1995.0, 1996.0, 1997.0,
    1998.0, 1999.0, 2000.0, 2001.0, 2002.0,
    2003.0, 2004.0, 2005.0, …, 2008.0}
    >>> prows = pesticides.get_rows()
    >>> hrows = hibernation.get_rows()
    >>>
    >>> X = [
    ...     r[5].value for r in prows
    ...     if r[0].value in common]

  53. INVESTIGATE PESTICIDES (NOT PANDAS)
    >>> Y = [
    ...     r[1].value for r in hrows
    ...     if r[0].value in common]

  54. INVESTIGATE PESTICIDES (NOT PANDAS)
    >>> from statistics import mean, pstdev
    >>> from itertools import starmap
    >>> from operator import mul
    >>>
    >>> def correlation(X, Y):
    ...     prod = starmap(mul, zip(X, Y))
    ...     ave = sum(prod) / len(X)
    ...     covar = ave - mean(X) * mean(Y)
    ...     std_prod = pstdev(X) * pstdev(Y)
    ...     return covar / std_prod
    >>>
    >>> cor_coef = correlation(X, Y)

  55. INVESTIGATE PESTICIDES (NOT PANDAS)
    >>> cor_coef
    0.3114682506273143

  56. INVESTIGATE PESTICIDES (MEZA)
    >>> from meza.io import read_xls
    >>> from meza.process import tfilter
    >>>
    >>> pesticides = read_xls(
    ...     'data.xlsx', sheet=2,
    ...     sanitize=True)
    >>>
    >>> hibernation = read_xls(
    ...     'data.xlsx', sanitize=True)
    >>>
    >>> pred = lambda y: float(y) in common
    >>>
    >>> precords = tfilter(
    ...     pesticides, 'year', pred)

  57. INVESTIGATE PESTICIDES (MEZA)
    >>> hrecords = tfilter(
    ...     hibernation, 'year', pred)
    >>>
    >>> X = [
    ...     float(r['total_millions_of_lbs'])
    ...     for r in precords]
    >>>
    >>> Y = [
    ...     float(r['forest_area_acres'])
    ...     for r in hrecords]
    >>>
    >>> cor_coef = correlation(X, Y)
    >>> cor_coef
    0.3114682506273143

  58. ¡FELICIDADES!
    YOU’VE OBTAINED THE PESTICIDE CORRELATIONS

  59. ¡FELICIDADES!
    YOU’VE OBTAINED THE PESTICIDE CORRELATIONS
    [Chart: US pesticide use (millions of lbs) vs occupied forest (acres)]

  60. SAVE TO A CSV FILE (PANDAS)
    >>> from pandas import DataFrame
    >>>
    >>> metric = 'correlation coefficient'
    >>> data = {
    ...     'metric': [metric],
    ...     'value': [cor_coef]}
    >>>
    >>> df = DataFrame(data=data)
    >>> df
    metric value
    0 correlation coefficient 0.311468
    >>> df.to_csv('results.csv')

  61. SAVE TO A CSV FILE (NOT PANDAS)
    >>> from csv import DictWriter
    >>>
    >>> row = {
    ...     'metric': metric,
    ...     'value': cor_coef}
    >>>
    >>> with open('results.csv', 'w') as f:
    ...     fieldnames = ['metric', 'value']
    ...     writer = DictWriter(
    ...         f, fieldnames=fieldnames)
    ...     writer.writeheader()
    ...     writer.writerow(row)
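
One caveat the slide omits: the `csv` module documentation recommends opening files with `newline=''`, otherwise extra blank rows can appear on Windows. A `DictReader` round-trip doubles as a check; note that `DictReader` returns every field as a string, which is why the coefficient below is written as one:

```python
from csv import DictReader, DictWriter

row = {'metric': 'correlation coefficient',
       'value': '0.3114682506273143'}

# newline='' lets the csv module control line endings itself,
# preventing spurious blank rows on Windows
with open('results.csv', 'w', newline='') as f:
    writer = DictWriter(f, fieldnames=['metric', 'value'])
    writer.writeheader()
    writer.writerow(row)

# round-trip: DictReader yields the same record, all values as str
with open('results.csv', newline='') as f:
    assert list(DictReader(f)) == [row]
```
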

  62. SAVE TO A CSV FILE (MEZA)
    >>> from meza.convert import records2csv
    >>> from meza.io import write
    >>>
    >>> record = {
    ...     'metric': metric,
    ...     'value': cor_coef}
    >>>
    >>> results = records2csv([record])
    >>> write('results.csv', results)
