
John Fries - Pandas - PyDSLA meetup - Nov 2014

Data Science LA
November 05, 2014

Transcript

  1. Data Munging with Pandas v0.01

  2. The Zen of Pandas

  3. Who am I?
     • CTO at OpenMail in Venice Beach
     • former Google engineer
     • Python at YouTube
     • MIT math degree
     • Still intimidated by “Data Science”
  4. Who is this talk for?
     • knows Python pretty well
     • wants to get into data analysis
     • just getting started with Pandas
  5. What is Pandas?

  6. What is Pandas?
     • Pandas is a Python library for data analysis and data manipulation
     • A Python version of the R data.frame library
  7. Why learn Pandas?
     • If you like Python… it’s a better Python.
     • It’s a smoother path than raw numpy
     • Very easy to *share* data analysis
  8. Extends and completes Python’s core datatypes!
     • boolean indexing (same length)
     • fancy indexing (list of integers)
     • brings slices to dictionaries
     • brings boolean and fancy indexing to dictionaries too
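
The indexing styles named on this slide can be sketched roughly as follows. The variable names and sample values are made up for illustration; only the pandas calls themselves are real.

```python
import pandas as pd

# A plain Python list only accepts integer positions and slices.
values = [10, 20, 30, 40]

# A Series accepts those, plus boolean masks and lists of positions.
s = pd.Series(values)
mask = [True, False, True, False]       # boolean indexing: same length as the data
picked_by_mask = s[mask]
picked_by_position = s.iloc[[0, 3]]     # fancy indexing: a list of integers

# A label-indexed Series behaves like a dict that also supports
# slices, boolean masks and lists of keys.
prices = pd.Series({"apple": 1.0, "banana": 0.5, "cherry": 3.0, "date": 2.5})
label_slice = prices.loc["banana":"cherry"]   # label slices include both endpoints
expensive = prices[prices > 1.0]              # boolean indexing on a dict-like object
chosen = prices[["date", "apple"]]            # fancy indexing with a list of keys
```

Note that a label slice includes its right endpoint, unlike a positional slice on a list.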
  9. Getting Started
     • on ubuntu:
       >>pip install ipython[all]
       >>pip install pandas
       >>ipython notebook
     • also install ipython notebook viewer browser extension
  10. Getting Started

  11. Core Data Structures
      • Series
      • DataFrame

  12. Core operations: Create, Select, Insert, Map, Join, Sort, Clean, Bin, View, Update, Filter, Append, Group, Summarize, Conform, Rotate
  13. Create
      • Lots of ways to initialize a DataFrame.
      • protip: Easiest is just passing in a list of dictionaries
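
A minimal sketch of the list-of-dictionaries approach from the protip. The records are invented sample data.

```python
import pandas as pd

# Each dictionary becomes one row; the keys become column names.
records = [
    {"name": "Ann", "age": 34, "city": "Venice"},
    {"name": "Bob", "age": 27, "city": "Pasadena"},
    {"name": "Cy", "age": 41},   # a missing key becomes NaN in that column
]
df = pd.DataFrame(records)
```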
  14. View
      • Most common commands: head(), tail()
      • protip: from IPython.display import display
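
For illustration, head() and tail() on a made-up frame might look like this. In a notebook, the results can be passed to display() from IPython.display to render several frames from one cell; that call is left as a comment here so the sketch runs without IPython.

```python
import pandas as pd

df = pd.DataFrame({"n": range(100), "square": [i * i for i in range(100)]})

first_rows = df.head()    # first 5 rows by default
last_rows = df.tail(3)    # last 3 rows

# In a notebook: from IPython.display import display; display(first_rows)
```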
  15. Select
      • too many options: [], .loc[], .iloc[], .ix[], .iat[], .at[], .xs()
      • protip: .loc[] won’t let you down
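
A rough side-by-side of the selection options on a small invented frame. .ix[] is left out because it was later removed from pandas (in 1.0), which is one more reason to follow the .loc[] protip.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [10.0, 20.0, 30.0]},
    index=["x", "y", "z"],
)

column = df["a"]                       # [] selects a column by name
row = df.loc["y"]                      # .loc selects by label
cell = df.loc["y", "b"]                # row label, column label
block = df.loc["x":"y", ["a", "b"]]    # label slice, both ends included
same_row = df.iloc[1]                  # .iloc selects by integer position
fast_cell = df.at["y", "b"]            # .at: single value by label
fast_cell_pos = df.iat[1, 1]           # .iat: single value by position
```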
  16. Update

  17. Insert
      • maximally inefficient!
      • always preallocate and update or use concat if at all possible
      • but yeah, you can do it!
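
The two recommended patterns, plus the discouraged row-by-row insert, could be sketched like this. The column names and row count are hypothetical.

```python
import numpy as np
import pandas as pd

n = 100

# Option 1: preallocate the full frame, then update cells in place.
preallocated = pd.DataFrame({
    "i": np.zeros(n, dtype="int64"),
    "square": np.zeros(n, dtype="int64"),
})
for i in range(n):
    preallocated.at[i, "i"] = i
    preallocated.at[i, "square"] = i * i

# Option 2: build the pieces first, then concatenate them once.
chunks = [pd.DataFrame({"i": [i], "square": [i * i]}) for i in range(n)]
concatenated = pd.concat(chunks, ignore_index=True)

# Inserting a row by assigning to a new label does work,
# but it tends to be slow when repeated in a loop.
grown = pd.DataFrame({"i": [0], "square": [0]})
grown.loc[len(grown)] = [1, 1]
```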
  18. Filter
      • df[df['a'] < 10]
      • like the sql WHERE clause
      • boolean indexes
      • occasionally, df.where
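
As a sketch on invented data: a boolean index keeps only the matching rows, while where() keeps the original shape and blanks out the rest.

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 12, 7, 25], "b": ["w", "x", "y", "z"]})

mask = df["a"] < 10      # a boolean Series, one entry per row
small = df[mask]         # like: SELECT * FROM df WHERE a < 10

# Combine conditions with & and |, wrapping each one in parentheses.
combined = df[(df["a"] < 10) & (df["b"] != "w")]

# where() keeps the shape and puts NaN where the condition is False.
numbers = df[["a"]]
masked = numbers.where(numbers < 10)
```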
  19. Map
      • map - series elements
      • apply - columns
      • applymap - dataframe elements
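
A small sketch of map and apply on made-up data. applymap is not called here because recent pandas releases deprecate it in favour of DataFrame.map, so which name is available depends on the installed version.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# map: applies a function to each element of a Series.
doubled = df["a"].map(lambda x: x * 2)

# apply: the function receives one whole column at a time by default.
column_ranges = df.apply(lambda col: col.max() - col.min())

# apply with axis=1: the function receives one row at a time.
row_sums = df.apply(lambda row: row["a"] + row["b"], axis=1)
```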
  20. Append
      • df1 + df2 is not what you want
      • use pd.concat
      • choice of axis is important
      • use df.append() after you understand concat()
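
The difference between + and pd.concat, and the effect of the axis argument, on two invented frames. df.append() is not shown because it was removed in pandas 2.0; pd.concat covers the same use.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [5, 6], "b": [7, 8]})

# + adds values element by element after aligning on index and columns.
added = df1 + df2

# axis=0 stacks rows; ignore_index renumbers the result 0..n-1.
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)

# axis=1 places the frames side by side, aligned on the row index.
side_by_side = pd.concat([df1, df2], axis=1)
```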
  21. Join

  22. Group
      • groupby

  23. Summarize
      • agg() is your friend
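
Grouping and summarizing usually go together. A minimal sketch with hypothetical city and sales columns:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["LA", "LA", "SF", "SF", "SF"],
    "sales": [10, 20, 5, 15, 25],
})

grouped = df.groupby("city")

# One summary value per group.
totals = grouped["sales"].sum()

# agg() computes several summaries at once, one column per function.
summary = grouped["sales"].agg(["count", "mean", "max"])
```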

  24. Sort
      • sort_index
      • sort_index(by='a')
      • sort_index(by=['a', 'b'])
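
The by= argument to sort_index reflects pandas as of this 2014 talk. In later releases (0.17 onward) sorting by column values moved to sort_values, so a present-day sketch of the same three operations, on invented data, would be:

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [2, 1, 2, 1], "b": [4, 3, 1, 2]},
    index=["w", "x", "y", "z"],
)

by_label = df.sort_index(ascending=False)     # sort by the row labels
by_a = df.sort_values(by="a")                 # sort by one column
by_a_then_b = df.sort_values(by=["a", "b"])   # sort by a, break ties with b
```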

  25. Clean
      • dropna, fillna
      • drop duplicates
      • clean outliers
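
One way these cleaning steps might look. The data and the outlier bounds (0 to 100) are arbitrary assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 2.0, 1000.0],
    "y": ["a", "b", "c", "b", "e"],
})

# Check how many values are missing before deciding what to do.
n_missing = df["x"].isna().sum()

dropped = df.dropna()              # drop rows containing any NaN
filled = df.fillna({"x": 0.0})     # or replace NaN with a chosen value
deduped = df.drop_duplicates()     # drop rows that repeat an earlier row

# A simple outlier filter: keep rows whose x lies within the assumed bounds.
in_range = df[df["x"].between(0, 100)]
```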

  26. Conform
      • “make it look like this” (dropping or NaN’ing as needed)
      • reindex()
      • resample() (upsample, downsample, etc)
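
A sketch of both kinds of conforming, with invented labels and dates. reindex() forces the data onto a target index; resample() does the analogous thing for a time index.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# reindex: "make it look like this". b is dropped, z is filled with NaN.
conformed = s.reindex(["c", "a", "z"])

ts = pd.Series(
    [0, 1, 2, 3, 4, 5],
    index=pd.date_range("2014-11-01", periods=6, freq="D"),
)

# Downsample: daily values summed into two-day buckets.
downsampled = ts.resample("2D").sum()

# Upsample: back to daily, carrying the last known value forward.
upsampled = downsampled.resample("D").ffill()
```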
  27. Bin
      • pd.cut
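
Binning with pd.cut (a top-level pandas function rather than a DataFrame method) could look like this. The ages, bin edges and labels are made up.

```python
import pandas as pd

ages = pd.Series([5, 17, 30, 64, 80])

# Bin edges define the intervals (0, 18], (18, 65], (65, 120].
bins = [0, 18, 65, 120]
labels = ["child", "adult", "senior"]

age_group = pd.cut(ages, bins=bins, labels=labels)
counts = age_group.value_counts()
```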

  28. Rotate

  29. Rotate
      • df.T is the simplest case
      • pivot tables are really useful
      • prefer df.pivot() to df.pivot_table()
      • prefer df.set_index(), df.unstack() to df.pivot()
      • hierarchical indexes (aka MultiIndex) make pivoting pythonic
      • columns are indexes as well! therefore, they can contain multiindexes
      • reset_index() takes you back to the beginning
      • protip: think of a multiindex as an index of tuples.
      • trailing entries can be sparsely empty!
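
The rotation options above, sketched on a small invented long-format table with hypothetical city, year and sales columns:

```python
import pandas as pd

long = pd.DataFrame({
    "city": ["LA", "LA", "SF", "SF"],
    "year": [2013, 2014, 2013, 2014],
    "sales": [10, 20, 30, 40],
})

transposed = long.T    # the simplest rotation: rows become columns

# pivot: one row per city, one column per year.
wide_pivot = long.pivot(index="city", columns="year", values="sales")

# The same result via a MultiIndex: each index entry is a (city, year) tuple.
indexed = long.set_index(["city", "year"])
wide_unstack = indexed["sales"].unstack("year")

# reset_index() turns the index levels back into ordinary columns.
flat_again = indexed.reset_index()
```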
  35. Thank you!

  36. Acknowledgements
      • http://www.analyticsvidhya.com/blog/2014/08/baby-steps-python-performing-exploratory-analysis-python/ - used his titanic cell image
      • http://nbviewer.ipython.org/github/herrfz/dataanalysis/blob/master/week2/data_munging_basics.ipynb - used his cut example for binning
      • http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/ - for a merge example
  37. Random Points
      • ZEN: pandas is really an extension of the python language, or at least of its core data structures: list and dict
      • ZEN: OrderedDict kind of sucks
      • ZEN: try using pandas as a sql replacement if your dataset can fit into memory
      • ZEN: fundamentally, pandas likes to talk lists. if you can understand how pandas is extending python’s indexing methods to use lists, you are on your way to experiencing the zen of pandas
      • ZEN: when we say pandas is built on numpy, consider that numpy primarily supports integer indexing... like a python list does. pandas supports much broader datatypes for indexing (strings, datetimes, tuples, etc)
      • START: ipython notebook is awesome. ipython notebook is viral
      • DataFrame, Series: you can think of a DataFrame as a series of dicts, which all share the same index. however, in practice, I visualize a DataFrame as a table, and a Series as a list.
      • PLOT: often the goal is to model and predict/explain; more often, for me, the goal is to visualize. I would even say that if you can’t visualize it, your chances of explaining it are pretty poor
      • CLEAN: sometimes you want to fix broken rows, but more often than you might think you should just drop the nans and outliers. Just check how many there are first!
      • CONFORM: reindexing is confusing because you have to understand this notion of *index*, which took me a while to grok. not like a sql index! you could think of reindexing and resampling as examples of conforming
      • ROTATE: pivot() doesn’t like NaNs, so often you want to dropna()
      • SELECT: I’m not saying that .loc is always right or elegant, but if you are getting started it is always there and it always works.
      • BIN: you could think of binning as a special case of grouping if you really wanted to
      • JOIN: use df.merge. don’t worry about df.join() until you understand df.merge(). merge probably should have been called join. df.merge can be pretty confusing, compared to SQL syntax, but it provides equivalent functionality
      • MAP: map, apply, applymap (I don’t use applymap all that much because column types are different). Could map and apply have been called the same thing?
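
To make the JOIN point concrete, here is a rough sketch of df.merge next to the SQL it corresponds to. The users and orders tables are invented for illustration.

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({
    "order_id": [100, 101, 102],
    "user_id": [1, 1, 4],
    "total": [9.5, 20.0, 3.0],
})

# SQL: SELECT * FROM users JOIN orders USING (user_id)
inner = users.merge(orders, on="user_id", how="inner")

# SQL: SELECT * FROM users LEFT JOIN orders USING (user_id)
# Users without orders are kept, with NaN in the order columns.
left = users.merge(orders, on="user_id", how="left")
```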