Slide 1

Slide 1 text

Data Munging with Pandas v0.01

Slide 2

Slide 2 text

The Zen of Pandas

Slide 3

Slide 3 text

Who am I?
● CTO at OpenMail in Venice Beach
● former Google engineer
● Python at YouTube
● MIT math degree
● Still intimidated by “Data Science”

Slide 4

Slide 4 text

Who is this talk for?
● knows Python pretty well
● wants to get into data analysis
● just getting started with Pandas

Slide 5

Slide 5 text

What is Pandas?

Slide 6

Slide 6 text

What is Pandas?
● Pandas is a Python library for data analysis and data manipulation
● Roughly, a Python counterpart to R’s data.frame

Slide 7

Slide 7 text

Why learn Pandas?
● If you like Python…it’s a better Python.
● It’s a smoother path than raw numpy
● Very easy to *share* data analysis

Slide 8

Slide 8 text

Extends and completes Python’s core datatypes!
● boolean indexing (a mask of the same length)
● fancy indexing (list of integers)
● brings slices to dictionaries
● brings boolean and fancy indexing to dictionaries too
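A minimal sketch of what these indexing styles look like on a Series, which behaves like a dict that also supports list-style indexing. The labels and values are made up for illustration.

```python
import pandas as pd

# A Series built from a dict: labels act like dict keys (illustrative data).
s = pd.Series({"a": 1, "b": 2, "c": 3, "d": 4})

by_key = s["b"]              # dict-style lookup by label
label_slice = s["b":"c"]     # slicing by label; both endpoints are included
picked = s[["a", "d"]]       # fancy indexing with a list of labels
masked = s[s > 2]            # boolean indexing with a same-length mask
positional = s.iloc[[0, 2]]  # fancy indexing with a list of integer positions

print(label_slice)
print(masked)
```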

Slide 9

Slide 9 text

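Getting Started
on Ubuntu:
>> pip install ipython[all]
>> pip install pandas
>> ipython notebook
also install the IPython Notebook viewer browser extension
(in current releases the notebook ships as Jupyter: pip install notebook, then jupyter notebook)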

Slide 10

Slide 10 text

Getting Started

Slide 11

Slide 11 text

Core Data Structures
● Series
● DataFrame

Slide 12

Slide 12 text

Core operations
Create, View, Select, Update, Insert, Filter, Map, Append, Join, Group, Sort, Summarize, Clean, Conform, Bin, Rotate

Slide 13

Slide 13 text

Create
● Lots of ways to initialize a DataFrame.
● protip: the easiest is just passing in a list of dictionaries
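A short sketch of the list-of-dictionaries approach. The records are hypothetical; each dict becomes one row, and keys missing from a dict show up as NaN.

```python
import pandas as pd

# Hypothetical sample records; each dict becomes one row.
records = [
    {"name": "ann", "age": 31, "city": "Venice"},
    {"name": "bob", "age": 27, "city": "Boston"},
    {"name": "cat", "age": 45},  # no "city" key, so that cell is NaN
]

df = pd.DataFrame(records)
print(df)
```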

Slide 14

Slide 14 text

View
● Most common commands: head(), tail()
● protip: from IPython.display import display
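A small sketch of the viewing commands. The frame is made up, and the fallback to print is an assumption added so the snippet also runs outside a notebook.

```python
import pandas as pd

try:
    # In a notebook, display() renders a DataFrame as a rich table.
    from IPython.display import display
except ImportError:
    display = print  # assumption: plain-text fallback outside IPython

# Illustrative data: 100 rows of numbers and their squares.
df = pd.DataFrame({"n": range(100), "sq": [i * i for i in range(100)]})

first = df.head()   # first 5 rows by default
last = df.tail(3)   # last 3 rows

display(first)
display(last)
```

display() is handy when a cell needs to show more than one frame, since a notebook only auto-renders the last expression.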

Slide 15

Slide 15 text

Select
● too many options: [], .loc[], .iloc[], .ix[], .iat[], .at[], .xs()
● protip: .loc[] won’t let you down
● (.ix[] was later deprecated and removed in pandas 1.0)
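A sketch of the common .loc[] patterns, using a made-up frame indexed by name. .loc[] always selects by label (or by boolean mask), which is why it is a safe default.

```python
import pandas as pd

# Illustrative frame with string row labels.
df = pd.DataFrame(
    {"age": [31, 27, 45], "city": ["Venice", "Boston", "Austin"]},
    index=["ann", "bob", "cat"],
)

one_cell = df.loc["bob", "age"]             # single value by row and column label
one_row = df.loc["ann"]                     # one row, returned as a Series
subset = df.loc[["ann", "cat"], ["city"]]   # lists of labels give a DataFrame
sliced = df.loc["ann":"bob"]                # label slices include both endpoints
over_30 = df.loc[df["age"] > 30, "city"]    # boolean mask for rows, label for column

print(subset)
print(over_30)
```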

Slide 16

Slide 16 text

Update

Slide 17

Slide 17 text

Insert
● maximally inefficient!
● always preallocate and update, or use concat, if at all possible
● but yeah, you can do it!
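A sketch of the two alternatives to row-by-row insertion named above: preallocate and update in place, or build pieces and concat them once. The column names and values are illustrative.

```python
import numpy as np
import pandas as pd

n = 4

# Option 1: preallocate the full frame, then update cells in place.
df = pd.DataFrame({"x": np.zeros(n), "y": np.zeros(n)})
for i in range(n):
    df.loc[i, "x"] = float(i)
    df.loc[i, "y"] = i * 10.0

# Option 2: collect the pieces first, then concatenate once at the end.
pieces = [pd.DataFrame({"x": [float(i)], "y": [i * 10.0]}) for i in range(n)]
combined = pd.concat(pieces, ignore_index=True)

print(df)
print(combined)
```

Both paths avoid growing a DataFrame one row at a time, which copies the data on every insert.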

Slide 18

Slide 18 text

Filter
● df[df['a'] < 10]
● like the SQL WHERE clause
● boolean indexes
● occasionally, df.where
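A sketch of boolean filtering on a made-up frame. The where() call is shown on a single column to keep the example simple; it keeps the original shape and fills non-matching entries with NaN instead of dropping them.

```python
import pandas as pd

# Illustrative data.
df = pd.DataFrame({"a": [3, 12, 7, 25], "b": ["w", "x", "y", "z"]})

mask = df["a"] < 10     # boolean Series, same length as df
small = df[mask]        # like: SELECT * FROM df WHERE a < 10

# Combine conditions with & and |, wrapping each condition in parentheses.
both = df[(df["a"] < 10) & (df["b"] != "w")]

# where() keeps every position and puts NaN where the mask is False.
masked = df["a"].where(mask)

print(small)
print(masked)
```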

Slide 19

Slide 19 text

Map
● map - Series elements
● apply - DataFrame columns (or rows, with axis=1)
● applymap - DataFrame elements (renamed DataFrame.map in pandas 2.1)
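A sketch of the three levels of mapping on a made-up frame. Because the element-wise DataFrame method has a different name in older and newer pandas, the last line uses apply plus Series.map as a version-neutral stand-in; that substitution is a choice made here, not something from the slides.

```python
import pandas as pd

# Illustrative data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Series.map: the function sees one element at a time.
doubled = df["a"].map(lambda v: v * 2)

# DataFrame.apply: the function sees one whole column at a time...
col_sums = df.apply(lambda col: col.sum())

# ...or one whole row at a time with axis=1.
row_sums = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Element-wise over the whole frame (what applymap does), spelled portably.
plus_one = df.apply(lambda col: col.map(lambda v: v + 1))

print(col_sums)
print(plus_one)
```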

Slide 20

Slide 20 text

Append
● df1 + df2 is not what you want
● use pd.concat
● choice of axis is important
● use df.append() after you understand concat() (df.append() was removed in pandas 2.0; pd.concat covers the same ground)
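A sketch contrasting df1 + df2 with pd.concat along each axis. The frames are made up, and the r_ prefix is only there to keep the column names distinct in the side-by-side case.

```python
import pandas as pd

# Two illustrative frames with the same columns.
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [5, 6], "b": [7, 8]})

# + is element-wise arithmetic on aligned labels, not appending.
added = df1 + df2

# axis=0 stacks rows; ignore_index renumbers the result 0..n-1.
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)

# axis=1 places frames side by side, aligned on the row index.
side = pd.concat([df1, df2.add_prefix("r_")], axis=1)

print(stacked)
print(side)
```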

Slide 21

Slide 21 text

Join
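The notes at the end of this deck recommend starting with df.merge for joins. A minimal sketch, with hypothetical users and orders tables, of how its how= argument maps onto SQL join types:

```python
import pandas as pd

# Hypothetical tables sharing a user_id key.
users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["ann", "bob", "cat"]})
orders = pd.DataFrame({"user_id": [1, 1, 3, 4], "total": [10, 15, 7, 99]})

# Like: SELECT * FROM users JOIN orders USING (user_id)
inner = users.merge(orders, on="user_id", how="inner")

# Like: SELECT * FROM users LEFT JOIN orders USING (user_id)
# Users without orders are kept, with NaN in the order columns.
left = users.merge(orders, on="user_id", how="left")

print(inner)
print(left)
```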

Slide 22

Slide 22 text

Group
● groupby

Slide 23

Slide 23 text

Summarize
● agg() is your friend
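A sketch of groupby followed by agg() on made-up sales data. Passing a list of function names to agg() produces one summary column per function.

```python
import pandas as pd

# Illustrative data.
df = pd.DataFrame({
    "city": ["Venice", "Boston", "Venice", "Boston", "Austin"],
    "sales": [10, 20, 30, 40, 50],
})

# Group rows by city, then compute several summaries of sales at once.
summary = df.groupby("city")["sales"].agg(["count", "sum", "mean"])

print(summary)
```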

Slide 24

Slide 24 text

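Sort
● sort_index()
● sort_index(by='a')
● sort_index(by=['a', 'b'])
● in current pandas, sorting by column is spelled sort_values(by=...); sort_index(by=...) is no longer accepted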
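A sketch of the same three sorts using the current method names, assuming a recent pandas where column sorting lives in sort_values. The data is made up.

```python
import pandas as pd

# Illustrative frame with an unsorted string index.
df = pd.DataFrame({"a": [2, 1, 2], "b": [1, 9, 0]}, index=["q", "p", "r"])

by_index = df.sort_index()                 # sort rows by index label
by_a = df.sort_values(by="a")              # sort rows by one column
by_a_b = df.sort_values(by=["a", "b"])     # sort by "a", break ties with "b"

print(by_index)
print(by_a_b)
```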

Slide 25

Slide 25 text

Clean
● dropna, fillna
● drop duplicates (drop_duplicates)
● clean outliers
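A sketch of the cleaning calls on made-up data. The outlier rule (a fixed threshold of 100) is purely an assumption for illustration; real data needs its own rule.

```python
import numpy as np
import pandas as pd

# Illustrative data with one NaN, one duplicate row, and one extreme value.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 2.0, 1000.0],
    "tag": ["a", "b", "c", "b", "d"],
})

# Check how many values are missing before deciding what to do.
n_missing = int(df["x"].isna().sum())

dropped = df.dropna()             # remove rows containing NaN
filled = df.fillna({"x": 0.0})    # or replace NaN with a chosen value
deduped = df.drop_duplicates()    # remove exact duplicate rows

# Hypothetical outlier rule: keep rows where x is below a fixed threshold.
trimmed = dropped[dropped["x"] < 100]

print(n_missing)
print(trimmed)
```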

Slide 26

Slide 26 text

Conform
● “make it look like this” (dropping or NaN-ing as needed)
● reindex()
● resample() (upsample, downsample, etc.)
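A sketch of conforming data to a target shape. reindex() conforms to a given set of labels; resample() conforms a time series to a new frequency. The series and dates are made up.

```python
import pandas as pd

# reindex: keep "b" and "c", drop "a", add "d" as NaN.
s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
conformed = s.reindex(["b", "c", "d"])

# A daily time series with illustrative values 0..5.
ts = pd.Series(range(6), index=pd.date_range("2014-01-01", periods=6, freq="D"))

# Downsample: combine each 2-day bucket by summing.
downsampled = ts.resample("2D").sum()

# Upsample: go back to daily, carrying the last known value forward.
upsampled = downsampled.resample("D").ffill()

print(conformed)
print(downsampled)
print(upsampled)
```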

Slide 27

Slide 27 text

Bin
● pd.cut
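A sketch of binning with pd.cut, which is a top-level pandas function rather than a DataFrame method. The bin edges and labels are hypothetical.

```python
import pandas as pd

# Illustrative ages.
ages = pd.Series([5, 17, 30, 64, 80])

# Hypothetical bin edges and labels; intervals are right-inclusive by default,
# so these bins are (0, 18], (18, 65], (65, 120].
bins = [0, 18, 65, 120]
labels = ["child", "adult", "senior"]

groups = pd.cut(ages, bins=bins, labels=labels)

# Count how many values landed in each bin, in bin order.
counts = groups.value_counts().sort_index()

print(groups)
print(counts)
```

The result is categorical, so it can be handed straight to groupby, which is one way to see binning as a special case of grouping.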

Slide 28

Slide 28 text

Rotate

Slide 29

Slide 29 text

Rotate
● df.T is the simplest case
● pivot tables are really useful
● prefer df.pivot() to df.pivot_table()
● prefer df.set_index(), df.unstack() to df.pivot()
● hierarchical indexes (aka MultiIndex) make pivoting pythonic
● columns are indexes as well! therefore, they can contain multiindexes
● reset_index() takes you back to the beginning
● protip: think of a multiindex as an index of tuples.
● trailing entries can be sparsely empty!
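A sketch of the rotation tools listed above on a made-up long-format table: transpose, set_index plus unstack, the equivalent pivot call, and reset_index to get back to flat columns.

```python
import pandas as pd

# Illustrative long-format data: one row per (city, year).
df = pd.DataFrame({
    "city": ["Venice", "Venice", "Boston", "Boston"],
    "year": [2013, 2014, 2013, 2014],
    "sales": [10, 20, 30, 40],
})

transposed = df.T   # simplest rotation: rows become columns

# Move city and year into a MultiIndex; each entry is a (city, year) tuple.
indexed = df.set_index(["city", "year"])
first_key = indexed.index[0]

# unstack() rotates the "year" level of the index into the columns.
wide = indexed["sales"].unstack("year")

# pivot() gets to the same shape in one call.
pivoted = df.pivot(index="city", columns="year", values="sales")

# reset_index() turns the index levels back into ordinary columns.
back = indexed.reset_index()

print(wide)
```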

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

Thank you!

Slide 36

Slide 36 text

Acknowledgements
● http://www.analyticsvidhya.com/blog/2014/08/baby-steps-python-performing-exploratory-analysis-python/ - used his Titanic cell image
● http://nbviewer.ipython.org/github/herrfz/dataanalysis/blob/master/week2/data_munging_basics.ipynb - used his cut example for binning
● http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/ - for a merge example

Slide 37

Slide 37 text

Random Points
● ZEN: pandas is really an extension of the Python language, or at least of its core data structures: list and dict
● ZEN: OrderedDict kind of sucks
● ZEN: try using pandas as a SQL replacement if your dataset can fit into memory
● ZEN: fundamentally, pandas likes to talk lists. If you can understand how pandas is extending Python’s indexing methods to use lists, you are on your way to experiencing the zen of pandas
● ZEN: when we say pandas is built on numpy, consider that numpy primarily supports integer indexing...like a Python list does. pandas supports much broader datatypes for indexing (strings, datetimes, tuples, etc.)
● START: IPython Notebook is awesome. IPython Notebook is viral
● DATAFRAME, SERIES: you can think of a DataFrame as a dict of Series, which all share the same index. However, in practice, I visualize a DataFrame as a table, and a Series as a list.
● PLOT: often the goal is to model and predict/explain; more often, for me, the goal is to visualize. I would even say that if you can’t visualize it, your chances of explaining it are pretty poor
● CLEAN: sometimes you want to fix broken rows, but more often than you might think you should just drop the NaNs and outliers. Just check how many there are first!
● CONFORM: reindexing is confusing because you have to understand this notion of *index*, which took me a while to grok. Not like a SQL index! You could think of reindexing and resampling as examples of conforming
● ROTATE: pivot() doesn’t like NaNs, so often you want to dropna()
● SELECT: I’m not saying that .loc is always right or elegant, but if you are getting started it is always there and it always works.
● BIN: you could think of binning as a special case of grouping if you really wanted to
● JOIN: use df.merge. Don’t worry about df.join() until you understand df.merge(). merge probably should have been called join. df.merge can be pretty confusing compared to SQL syntax, but it provides equivalent functionality
● MAP: map, apply, applymap (I don’t use applymap all that much because column types are different). Could map and apply have been called the same thing?