John Fries - Pandas - PyDSLA meetup - Nov 2014

Data Science LA
November 05, 2014

Transcript

  1. Who am I?
    • CTO at OpenMail in Venice Beach
    • former Google engineer
    • Python at YouTube
    • MIT math degree
    • Still intimidated by “Data Science”
  2. Who is this talk for?
    • knows Python pretty well
    • wants to get into data analysis
    • just getting started with Pandas
  3. What is Pandas?
    • Pandas is a Python library for data analysis and data manipulation
    • A Python version of the R data.frame library
  4. Why learn Pandas?
    • If you like Python…it’s a better Python.
    • It’s a smoother path than raw numpy
    • Very easy to *share* data analysis
  5. Extends and completes Python’s core datatypes!
    • boolean indexing (same length)
    • fancy indexing (list of integers)
    • brings slices to dictionaries
    • brings boolean and fancy indexing to dictionaries too
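One way to picture the indexing styles on this slide is with a Series, which behaves much like a labeled dictionary. The sketch below is only an illustration; the data and variable names are made up and are not from the talk.

```python
import pandas as pd

# Hypothetical data: a Series acts like an ordered dict keyed by labels.
s = pd.Series({"a": 1, "b": 2, "c": 3, "d": 4})

# Boolean indexing: a mask of the same length keeps the True positions.
evens = s[s % 2 == 0]

# Fancy indexing: pass a list of labels, or a list of integer positions via .iloc.
picked = s[["d", "a"]]
by_position = s.iloc[[0, 2]]

# Slicing a dict-like object: label slices include both endpoints.
middle = s["b":"c"]
```

Note that the label slice `"b":"c"` includes `"c"`, unlike a plain Python list slice.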
  6. Getting Started
    on ubuntu:
    >>pip install ipython[all]
    >>pip install pandas
    >>ipython notebook
    also install ipython notebook viewer browser extension
  7. Core operations: Create, Select, Insert, Map, Join, Sort, Clean, Bin, View, Update, Filter, Append, Group, Summarize, Conform, Rotate
  8. Create
    • Lots of ways to initialize a DataFrame.
    • protip: Easiest is just passing in a list of dictionaries
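A minimal sketch of the "list of dictionaries" approach, using invented rows (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Each dict is one row; the keys become column names.
rows = [
    {"city": "Venice", "visits": 12},
    {"city": "Santa Monica", "visits": 7},
    {"city": "Pasadena"},  # a missing key becomes NaN in that column
]
df = pd.DataFrame(rows)
```

Rows do not need to share exactly the same keys; pandas takes the union of keys as columns and fills the gaps with NaN.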
  9. Filter
    • df[df['a'] < 10]
    • like the SQL WHERE clause
    • boolean indexes
    • occasionally, df.where
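The filter on this slide could look like the following in practice. The DataFrame contents are hypothetical; only the `df[df['a'] < 10]` pattern comes from the slide.

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 15, 8, 22], "b": ["w", "x", "y", "z"]})

# The comparison yields a boolean Series with one entry per row.
mask = df["a"] < 10

# Indexing with the mask keeps only the True rows (like WHERE a < 10).
small = df[mask]

# .where keeps the original length and puts NaN where the mask is False.
a_or_nan = df["a"].where(mask)
```

The difference between the two is shape: boolean indexing drops rows, while `.where` keeps every row and blanks out the non-matching values.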
  10. Append
    • df1 + df2 is not what you want
    • use pd.concat
    • choice of axis is important
    • use df.append() after you understand concat()
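A small sketch of why `df1 + df2` differs from `pd.concat`, and of what the axis choice does. The frames are invented for illustration. Note that `DataFrame.append` was removed in pandas 2.0, so on current versions `pd.concat` is the way to append.

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2]})
df2 = pd.DataFrame({"x": [10, 20]})

# "+" aligns on index and column labels and adds element-wise; it does not append.
added = df1 + df2

# axis=0 (the default) stacks rows; ignore_index renumbers the result 0..n-1.
stacked = pd.concat([df1, df2], ignore_index=True)

# axis=1 places the frames next to each other, aligned on the row index.
side_by_side = pd.concat([df1, df2], axis=1)
```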
  11. Conform
    • “make it look like this” (dropping or NaN’ing as needed)
    • reindex()
    • resample() (upsample, downsample, etc)
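To illustrate "make it look like this", here is a sketch with made-up data: `reindex` conforms a Series to a target set of labels, and `resample` conforms a time series to a target frequency (a downsample to daily totals in this case).

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# Conform to a new set of labels: "a" is dropped, "d" is filled with NaN.
conformed = s.reindex(["b", "c", "d"])

# A hypothetical time series with two readings per day.
ts = pd.Series(
    [1, 2, 3, 4],
    index=pd.to_datetime([
        "2014-11-01 00:00",
        "2014-11-01 12:00",
        "2014-11-02 00:00",
        "2014-11-02 12:00",
    ]),
)

# Downsample to one value per day by summing within each day.
daily = ts.resample("D").sum()
```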
  12. Rotate
    • df.T is the simplest case
    • pivot tables are really useful
    • prefer df.pivot() to df.pivot_table()
    • prefer df.set_index(), df.unstack() to df.pivot()
    • hierarchical indexes (aka MultiIndex) make pivoting pythonic
    • columns are indexes as well! therefore, they can contain multiindexes
    • reset_index() takes you back to the beginning
    • protip: think of a multiindex as an index of tuples.
    • trailing entries can be sparsely empty!
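The relationship between `pivot()`, `set_index()` plus `unstack()`, and `reset_index()` can be sketched as follows. The table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical "long" table: one row per (day, city) pair.
long = pd.DataFrame({
    "day": ["mon", "mon", "tue", "tue"],
    "city": ["LA", "SF", "LA", "SF"],
    "sales": [1, 2, 3, 4],
})

# pivot: rows become days, columns become cities.
wide_pivot = long.pivot(index="day", columns="city", values="sales")

# The same rotation via a MultiIndex: index by (day, city) tuples,
# then move the "city" level from the rows to the columns.
indexed = long.set_index(["day", "city"])
wide_unstack = indexed["sales"].unstack("city")

# reset_index turns the index levels back into ordinary columns.
flat = indexed.reset_index()

# .T swaps rows and columns.
transposed = wide_pivot.T
```

Seeing `indexed.index` as a list of tuples such as `("mon", "LA")` is one way to apply the slide's protip.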
  13. Random Points
    • ZEN: pandas is really an extension of the python language, or at least of its core data structures: list and dict
    • ZEN: OrderedDict kind of sucks
    • ZEN: try using pandas as a sql replacement if your dataset can fit into memory
    • ZEN: fundamentally, pandas likes to talk lists. if you can understand how pandas is extending python’s indexing methods to use lists, you are on your way to experiencing the zen of pandas
    • ZEN: when we say pandas is built on numpy, consider that numpy primarily supports integer indexing...like a python list does. pandas supports much broader datatypes for indexing (strings, datetimes, tuples, etc)
    • START: ipython notebook is awesome. ipython notebook is viral
    • DATAFRAME, SERIES: you can think of a DataFrame as a series of dicts, which all share the same index. however, in practice, I visualize a DataFrame as a table, and a Series as a list.
    • PLOT: often the goal is to model and predict/explain. more often, for me, the goal is to visualize. I would even say that if you can’t visualize it, your chances of explaining it are pretty poor
    • CLEAN: sometimes you want to fix broken rows, but more often than you might think you should just drop the nans and outliers. Just check how many there are first!
    • CONFORM: reindexing is confusing because you have to understand this notion of *index*, which took me a while to grok. not like a sql index! you could think of reindexing and resampling as examples of conforming
    • ROTATE: pivot() doesn’t like NaNs, so often you want to dropna()
    • SELECT: I’m not saying that .loc is always right or elegant, but if you are getting started it is always there and it always works.
    • BIN: you could think of binning as a special case of grouping if you really wanted to
    • JOIN: use df.merge. don’t worry about df.join() until you understand df.merge(). merge probably should have been called join. df.merge can be pretty confusing, compared to SQL syntax, but it provides equivalent functionality
    • MAP: map, apply, applymap (I don’t use applymap all that much because column types are different). Could map and apply have been called the same thing?
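As a rough illustration of the JOIN point, the sketch below pairs two `merge` calls with the SQL they roughly correspond to. The tables and column names are invented, and the SQL in the comments is an approximate equivalent rather than an exact translation.

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["ann", "bob", "cy"]})
orders = pd.DataFrame({"user_id": [1, 1, 3, 4], "total": [10, 5, 7, 9]})

# Roughly: SELECT * FROM users INNER JOIN orders USING (user_id)
inner = users.merge(orders, on="user_id", how="inner")

# Roughly: SELECT * FROM users LEFT JOIN orders USING (user_id)
# Users without orders are kept, with NaN in the order columns.
left = users.merge(orders, on="user_id", how="left")
```

The `how` argument plays the role of the SQL join type, and `on` names the key column shared by both frames.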