are really useful • prefer df.pivot() to df.pivot_table() • prefer df.set_index(), df.unstack() to df.pivot() • hierarchical indexes (aka MultiIndex) make pivoting pythonic • columns are indexes as well! therefore, they can contain multiindexes • reset_index() takes you back to the begining • protip: think of a multiindex as an index of tuples. • trailing entries can be sparsely empty!
python language, or at least of it’s core data structures: list and dict ZEN: OrderedDict kind of sucks ZEN: try using pandas as a sql replacement if your dataset can fit into memory ZEN: fundamentally, pandas likes to talk lists. if you can understand how pandas is extending python’s indexing methods to use lists, you are on your way to experiencing the zen of pandas ZEN: when we say pandas is built on numpy, consider that numpy primarily supports integer indexing...like a python list does. pandas supports much broader datatypes for indexing (strings, datetimes, tuples, etc) START:ipython notebook is awesome. ipython notebook is viral Dataframe, series: you can think of a DataFrame as a series of dicts, which all share the same index. however, in practice, I visualize a DataFrame as a table, and a Series as a list. PLOT: often the goal is to model and predict/explain, more often, for me, the goal is to visualize. I would even say that if you can’t visualize it, your chances of explaining it are pretty poor CLEAN: sometimes you want to fix broken rows, but more often than you might think you should just drop the nans and outliers. Just check how many there are first! CONFORM: reindexing is confusing because you have to understand this notion of *index*, which took me awhile to grok. not like a sql index! you could think of reindexing and resampling as examples of conforming ROTATE: pivot() doesn’t like NaNs, so often you want to dropna() SELECT: I’m not saying that .loc is always right or elegant, but if you are getting started it is always there and it always works. BIN: you could think of binning as a special case of grouping if you really wanted to JOIN: use df.merge don’t worry about df.join() until you understand df.merge() merge probably should have been called join. df.merge can be pretty confusing, compared to SQL syntax, but it provides equivalent functionality MAP: map, apply, applymap (I don’t use mapapply all that much because column types are different) Could map and apply have been called the same thing?