John Fries - Pandas - PyDSLA meetup - Nov 2014

Data Munging with Pandas v0.01

The Zen of Pandas

Who am I?
• CTO at OpenMail in Venice Beach
• former Google engineer
• Python at YouTube
• MIT math degree
• Still intimidated by “Data Science”

Who is this talk for?
• knows Python pretty well
• wants to get into data analysis
• just getting started with Pandas

What is Pandas?

What is Pandas?
• Pandas is a Python library for data analysis and data manipulation
• A Python counterpart to R's data.frame

Why learn Pandas?
• If you like Python…it’s a better Python.
• It’s a smoother path than raw numpy
• Very easy to *share* data analysis

Extends and completes Python’s core datatypes!
• boolean indexing (with a mask of the same length)
• fancy indexing (with a list of integers)
• brings slices to dictionaries
• brings boolean and fancy indexing to dictionaries too
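
A minimal sketch of these indexing styles on a Series. The values and labels are made up for illustration; the labelled Series stands in for the "dictionary" on the slide.

```python
import pandas as pd

# Illustrative data: four values with string keys, like an ordered dict.
s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

mask_pick = s[s > 15]             # boolean indexing: the mask has the same length as s
fancy_pick = s.iloc[[0, 2]]       # fancy indexing: a list of integer positions
label_slice = s.loc["b":"c"]      # a slice over keys (the end label is included)
label_pick = s.loc[["a", "d"]]    # fancy indexing by key
```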

Getting Started
on ubuntu:
>>pip install ipython[all]
>>pip install pandas
>>ipython notebook
also install ipython notebook viewer browser extension

Getting Started

Core Data Structures
• Series
• DataFrame

Core operations
Create, View, Select, Update, Insert, Filter, Map, Append, Join, Group, Summarize, Sort, Clean, Conform, Bin, Rotate

Create
• Lots of ways to initialize a DataFrame.
• protip: Easiest is just passing in a list of dictionaries
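
A small sketch of the protip, assuming made-up records. Each dictionary becomes a row, and keys missing from a record come out as NaN.

```python
import pandas as pd

# Hypothetical records; the third one has no "age" key.
records = [
    {"name": "ann", "age": 31},
    {"name": "bob", "age": 45},
    {"name": "cat"},
]

df = pd.DataFrame(records)   # one row per dict, one column per key
```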

View
• Most common commands: head(), tail()
• protip: from IPython.display import display

Select
• too many options: [], .loc[], .iloc[], .ix[], .iat[], .at[], .xs()
• (.ix[] was later deprecated and removed in pandas 1.0)
• protip: .loc[] won’t let you down
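
A sketch of the common .loc[] forms, on an illustrative frame. The last line shows that the same .loc[] syntax also covers updating, which is an assumption about what the Update slide demonstrated.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]},
    index=["x", "y", "z"],
)

one_cell = df.loc["y", "b"]           # single value: row label, column label
some_rows = df.loc[["x", "z"]]        # a list of row labels
row_slice = df.loc["x":"y", ["a"]]    # label slice (inclusive) plus a column list
by_mask = df.loc[df["a"] > 1, "b"]    # boolean mask for rows, one column

df.loc[df["a"] > 2, "b"] = 0.0        # update in place using the same selection syntax
```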

Update

Insert
• maximally inefficient!
• always preallocate and update, or use concat, if at all possible
• but yeah, you can do it!
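
The two alternatives to row-by-row insertion might look like this. It is a sketch with made-up rows, not a benchmark.

```python
import pandas as pd

# Hypothetical incoming rows.
rows = [{"a": i, "b": i * i} for i in range(3)]

# Option 1: collect the pieces first, then concatenate once.
pieces = [pd.DataFrame([row]) for row in rows]
combined = pd.concat(pieces, ignore_index=True)

# Option 2: preallocate the full frame, then update rows in place.
prealloc = pd.DataFrame(0, index=range(3), columns=["a", "b"])
for i, row in enumerate(rows):
    prealloc.loc[i, ["a", "b"]] = [row["a"], row["b"]]
```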

Filter
• df[df['a'] < 10]
• like the sql WHERE clause
• boolean indexes
• occasionally, df.where
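
A short sketch of filtering with boolean masks, using illustrative data. The where() line shows how it differs: the shape is kept and failing entries become NaN.

```python
import pandas as pd

df = pd.DataFrame({"a": [5, 12, 7, 20], "b": ["w", "x", "y", "z"]})

small = df[df["a"] < 10]                       # like SELECT * FROM df WHERE a < 10
both = df[(df["a"] < 10) & (df["b"] != "w")]   # combine masks with & or |, each in parentheses
masked = df["a"].where(df["a"] < 10)           # same length as df; failing entries become NaN
```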

Map
• map - series elements
• apply - columns
• applymap - dataframe elements
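
A sketch of map and apply on made-up data. applymap is left out here because the element-wise DataFrame method has been renamed in later pandas releases, so which name works depends on the installed version.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

doubled = df["a"].map(lambda x: x * 2)                        # map: one Series element at a time
col_range = df.apply(lambda col: col.max() - col.min())       # apply: one column at a time
row_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)   # apply with axis=1: one row at a time
```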

Append
• df1 + df2 is not what you want
• use pd.concat
• choice of axis is important
• use df.append() after you understand concat() (note: df.append() was later removed in pandas 2.0)
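
A sketch of why + is not an append, and what the axis choice does in pd.concat. The two frames are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [5, 6], "b": [7, 8]})

added = df1 + df2                                    # element-wise sum aligned on index, not an append
stacked = pd.concat([df1, df2], ignore_index=True)   # axis=0 (default): rows of df2 under df1
side_by_side = pd.concat([df1, df2], axis=1)         # axis=1: columns of df2 next to df1
```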

Join

Group
• groupby

Summarize
• agg() is your friend
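
Grouping and summarizing usually go together. A minimal sketch, assuming a made-up sales table:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["LA", "LA", "SF", "SF", "SF"],
    "sales": [10, 20, 5, 15, 25],
})

grouped = df.groupby("city")["sales"]             # like GROUP BY city
totals = grouped.sum()                            # one number per group
summary = grouped.agg(["count", "mean", "max"])   # several summaries at once, one column each
```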

Sort
• sort_index()
• sort_index(by='a'), now spelled sort_values(by='a')
• sort_index(by=['a', 'b']), now spelled sort_values(by=['a', 'b'])
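
A sketch of sorting by index and by column values. It uses sort_values(by=...), the current spelling of the 2014-era sort_index(by=...) call; the data is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [2, 1, 2], "b": [9, 8, 7]}, index=["z", "x", "y"])

by_index = df.sort_index()                    # order rows by their index labels
by_a = df.sort_values(by="a")                 # order rows by one column
by_a_then_b = df.sort_values(by=["a", "b"])   # ties in "a" are broken by "b"
```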

Clean
• dropna, fillna
• drop duplicates
• clean outliers
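
A sketch of the cleaning steps on made-up data. The outlier cutoff of 100 is a hypothetical threshold chosen for this example, not a rule.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 3.0, 1000.0],
    "b": ["x", "y", "z", "z", "w"],
})

n_missing = df["a"].isna().sum()             # check how many NaNs there are first
dropped = df.dropna()                        # drop rows containing any NaN
filled = df.fillna({"a": 0.0})               # or fill them, per column
deduped = df.drop_duplicates()               # drop fully repeated rows
no_outliers = dropped[dropped["a"] < 100]    # hypothetical outlier threshold
```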

Conform
• “make it look like this” (dropping or NaN’ing as needed)
• reindex()
• resample() (upsample, downsample, etc)
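
A sketch of both kinds of conforming, with illustrative data: reindex() makes a Series match a target set of labels, and resample() makes a time series match a target frequency.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
conformed = s.reindex(["b", "c", "d"])   # drops "a", adds "d" as NaN

ts = pd.Series(
    [1, 2, 3, 4],
    index=pd.date_range("2014-11-01", periods=4, freq="D"),
)
downsampled = ts.resample("2D").sum()            # daily -> every two days, summing each bin
upsampled = downsampled.resample("D").ffill()    # back to daily, carrying values forward
```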

Bin
• pd.cut
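
Binning is done with the top-level function pd.cut. A minimal sketch, where the bin edges and labels are hypothetical:

```python
import pandas as pd

ages = pd.Series([5, 17, 30, 64, 80])
bins = [0, 18, 65, 120]                  # hypothetical bin edges
labels = ["child", "adult", "senior"]    # hypothetical bin names

# Intervals are right-closed by default: (0, 18], (18, 65], (65, 120]
age_group = pd.cut(ages, bins=bins, labels=labels)
counts = age_group.value_counts()
```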

Rotate

Rotate
• df.T is the simplest case
• pivot tables are really useful
• prefer df.pivot() to df.pivot_table()
• prefer df.set_index(), df.unstack() to df.pivot()
• hierarchical indexes (aka MultiIndex) make pivoting pythonic
• columns are indexes as well! therefore, they can contain multiindexes
• reset_index() takes you back to the beginning
• protip: think of a multiindex as an index of tuples.
• trailing entries can be sparsely empty!
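
A sketch of the set_index() / unstack() route next to pivot(), on a made-up long-format table. Both produce the same wide table here.

```python
import pandas as pd

long = pd.DataFrame({
    "city": ["LA", "LA", "SF", "SF"],
    "year": [2013, 2014, 2013, 2014],
    "sales": [10, 20, 5, 15],
})

indexed = long.set_index(["city", "year"])   # MultiIndex: an index of (city, year) tuples
wide = indexed["sales"].unstack("year")      # rotate the "year" level into columns
same = long.pivot(index="city", columns="year", values="sales")
back = indexed.reset_index()                 # back to the flat starting table
transposed = wide.T                          # the simplest rotation: swap rows and columns
```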

Thank you!

Acknowledgements
• http://www.analyticsvidhya.com/blog/2014/08/baby-steps-python-performing-exploratory-analysis-python/ - used his titanic cell image
• http://nbviewer.ipython.org/github/herrfz/dataanalysis/blob/master/week2/data_munging_basics.ipynb - used his cut example for binning
• http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/ - for a merge example

Random Points
ZEN: pandas is really an extension of the python language, or at least of its core data structures: list and dict
ZEN: OrderedDict kind of sucks
ZEN: try using pandas as a sql replacement if your dataset can fit into memory
ZEN: fundamentally, pandas likes to talk lists. if you can understand how pandas is extending python’s indexing methods to use lists, you are on your way to experiencing the zen of pandas
ZEN: when we say pandas is built on numpy, consider that numpy primarily supports integer indexing...like a python list does. pandas supports much broader datatypes for indexing (strings, datetimes, tuples, etc)
START: ipython notebook is awesome. ipython notebook is viral
DATAFRAME, SERIES: you can think of a DataFrame as a dict of Series, which all share the same index. however, in practice, I visualize a DataFrame as a table, and a Series as a list.
PLOT: often the goal is to model and predict/explain. more often, for me, the goal is to visualize. I would even say that if you can’t visualize it, your chances of explaining it are pretty poor
CLEAN: sometimes you want to fix broken rows, but more often than you might think you should just drop the NaNs and outliers. Just check how many there are first!
CONFORM: reindexing is confusing because you have to understand this notion of *index*, which took me a while to grok. not like a sql index! you could think of reindexing and resampling as examples of conforming
ROTATE: pivot() doesn’t like NaNs, so often you want to dropna()
SELECT: I’m not saying that .loc is always right or elegant, but if you are getting started it is always there and it always works.
BIN: you could think of binning as a special case of grouping if you really wanted to
JOIN: use df.merge. don’t worry about df.join() until you understand df.merge(). merge probably should have been called join. df.merge can be pretty confusing, compared to SQL syntax, but it provides equivalent functionality
MAP: map, apply, applymap (I don’t use applymap all that much because column types are different). Could map and apply have been called the same thing?
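
For the JOIN point, a sketch of df.merge as the counterpart of SQL joins. The two tables and the user_id key are made up for illustration.

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["ann", "bob", "cat"]})
orders = pd.DataFrame({"user_id": [1, 1, 3, 4], "total": [9.5, 3.0, 7.25, 1.0]})

# Like: SELECT * FROM users JOIN orders USING (user_id)
inner = users.merge(orders, on="user_id", how="inner")

# Like: SELECT * FROM users LEFT JOIN orders USING (user_id)
left = users.merge(orders, on="user_id", how="left")
```

Users without orders disappear from the inner result and show up with NaN totals in the left result.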
