Diving into Open Data with IPython Notebook & Pandas (Julia Evans)

Slide 1

Slide 1 text

Diving into Open Data with IPython Notebook & Pandas

I'm Julia Evans
Data scientist, programmer, co-organizer of PyLadies MTL and Montréal All-Girl Hack Night

http://twitter.com/b0rk
http://jvns.ca
http://github.com/jvns
http://mtlallgirlhacknight.ca
http://meetup.com/pyladiesmtl

You can follow along with this talk at: http://bit.ly/pyconca-pandas

Slide 2

Slide 2 text

IPython Notebook
- web-based user interface to IPython
- pretty graphs
- literate programming
- Can make slideshows :) (this presentation)
- version controlled science!

Pandas
- "R for Python"
- Provides easy-to-use data structures & a ton of useful helper functions for data cleanup and transformations
- Fast! (backed by numpy arrays)
- integrates well with scikit-learn

Slide 3

Slide 3 text

An installation warning

Don't: Use the Ubuntu packages
    sudo apt-get install ipython-notebook
    sudo apt-get install python-pandas

Do: Use pip or Anaconda (https://store.continuum.io/)
    pip install ipython tornado pyzmq
    pip install pandas

Anaconda is amazing.

Slide 4

Slide 4 text

How to run IPython Notebook

    $ ipython notebook --pylab inline

Slide 5

Slide 5 text

The open data

Taken from http://donnees.ville.montreal.qc.ca/ (click "Vélos - comptage")
Number of people per day on 7 bike paths (collected using sensors)
http://donnees.ville.montreal.qc.ca/fiche/velos-comptage/

Slide 6

Slide 6 text

Part 1: Import the 2012 bike path data from a CSV

Before

Download and unzip the zip file from this page (http://donnees.ville.montreal.qc.ca/fiche/velos-comptage/) to run this yourself.

In [1]:
    import pandas as pd

In [3]:
    bike_data = pd.read_csv("./2012.csv")
    bike_data[:5]

Out[3]:
       Date;Berri 1;Br�beuf (donn�es non disponibles);C�te-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (donn�es non disponibles)
    0  01/01/2012;35;;0;38;51;26;10;16;
    1  02/01/2012;83;;1;68;153;53;6;43;
    2  03/01/2012;135;;2;104;248;89;3;58;
    3  04/01/2012;144;;1;116;318;111;8;61;
    4  05/01/2012;197;;2;124;330;97;13;95;

Slide 7

Slide 7 text

After

In [4]:
    bike_data = pd.read_csv("./2012.csv", encoding='latin1', sep=';', index_col='Date', parse_dates=True, dayfirst=True)

In [6]:
    bike_data = bike_data[['Berri 1', u'Côte-Sainte-Catherine', 'Maisonneuve 1']]
    bike_data[:5]

Out[6]:
                Berri 1  Côte-Sainte-Catherine  Maisonneuve 1
    Date
    2012-01-01       35                      0             38
    2012-01-02       83                      1             68
    2012-01-03      135                      2            104
    2012-01-04      144                      1            116
    2012-01-05      197                      2            124

Exercise: Parse the CSVs from 2011 and earlier (warning: it's annoying)
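To see what each read_csv argument contributes, here is a minimal sketch that runs without downloading anything. It assumes a three-row excerpt of the file (the rows shown in the "Before" output) held in memory as Latin-1 bytes; the names raw_csv and naive are made up for the illustration, and it is written for Python 3 rather than the Python 2 used in the talk.

```python
import io
import pandas as pd

# Three rows copied from the "Before" output, encoded as Latin-1 like the real file.
raw_csv = (
    "Date;Berri 1;Brébeuf (données non disponibles);Côte-Sainte-Catherine;"
    "Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;"
    "St-Urbain (données non disponibles)\n"
    "01/01/2012;35;;0;38;51;26;10;16;\n"
    "02/01/2012;83;;1;68;153;53;6;43;\n"
    "03/01/2012;135;;2;104;248;89;3;58;\n"
).encode("latin1")

# Default separator (a comma): every line ends up in one single column.
naive = pd.read_csv(io.BytesIO(raw_csv), encoding="latin1")

# sep=';' splits the fields, index_col/parse_dates turn 'Date' into a
# datetime index, and dayfirst=True reads 02/01/2012 as 2 January.
bike_data = pd.read_csv(
    io.BytesIO(raw_csv),
    encoding="latin1",
    sep=";",
    index_col="Date",
    parse_dates=True,
    dayfirst=True,
)
bike_data = bike_data[["Berri 1", "Côte-Sainte-Catherine", "Maisonneuve 1"]]

print(naive.shape)
print(bike_data)
```

Without dayfirst=True, a date such as 02/01/2012 is ambiguous and could be read as February 1, which would silently scramble the time series.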

Slide 8

Slide 8 text

Part 2: take a look at the data

We have a dataframe:

In [7]:
    bike_data[:3]

Out[7]:
                Berri 1  Côte-Sainte-Catherine  Maisonneuve 1
    Date
    2012-01-01       35                      0             38
    2012-01-02       83                      1             68
    2012-01-03      135                      2            104

Slide 9

Slide 9 text

In [8]:
    bike_data.plot()

Out[8]:
    /opt/anaconda/envs/ipython-1.0.0a1/lib/python2.7/site-packages/matplotlib/font_manager.py:1224: UserWarning: findfont: Font family ['normal'] not found. Falling back to Bitstream Vera Sans
      (prop.get_family(), self.defaultFamily[fontext]))

Slide 10

Slide 10 text

In [9]:
    bike_data.median()

Out[9]:
    Berri 1                  3128.0
    Côte-Sainte-Catherine    1269.0
    Maisonneuve 1            2019.5
    dtype: float64

Slide 11

Slide 11 text

In [10]:
    bike_data.median().plot(kind='bar')

Out[10]:

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Slicing dataframes

In [11]:
    # column slice
    column_slice = bike_data[['Berri 1', 'Maisonneuve 1']]
    # row slice
    column_slice[:3]

Out[11]:
                Berri 1  Maisonneuve 1
    Date
    2012-01-01       35             38
    2012-01-02       83             68
    2012-01-03      135            104
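The two kinds of slice can be tried on a tiny dataframe. This sketch rebuilds the first five days of 2012 from the values shown earlier in the talk; the variable names first_three and berri are made up for the illustration.

```python
import pandas as pd

# First five days of 2012, copied from the dataframe shown earlier in the talk.
bike_data = pd.DataFrame(
    {
        "Berri 1": [35, 83, 135, 144, 197],
        "Côte-Sainte-Catherine": [0, 1, 2, 1, 2],
        "Maisonneuve 1": [38, 68, 104, 116, 124],
    },
    index=pd.date_range("2012-01-01", periods=5, name="Date"),
)

# A list of column names selects those columns and returns a dataframe.
column_slice = bike_data[["Berri 1", "Maisonneuve 1"]]

# A plain integer slice selects rows by position.
first_three = column_slice[:3]

# A single column name (no inner list) returns a Series instead.
berri = bike_data["Berri 1"]

print(first_three)
```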

Slide 14

Slide 14 text

In [12]:
    column_slice.plot()

Out[12]:

Slide 15

Slide 15 text

Part 2: Do more people bike on weekdays or weekends?

Slide 16

Slide 16 text

Step 1: add a 'weekday' column to our dataframe

In [13]:
    bike_data['weekday'] = bike_data.index.weekday
    bike_data.head()

Out[13]:
                Berri 1  Côte-Sainte-Catherine  Maisonneuve 1  weekday
    Date
    2012-01-01       35                      0             38        6
    2012-01-02       83                      1             68        0
    2012-01-03      135                      2            104        1
    2012-01-04      144                      1            116        2
    2012-01-05      197                      2            124        3

Slide 17

Slide 17 text

Step 2: Use .groupby() and .aggregate() to get the counts

In [14]:
    counts_by_day = bike_data.groupby('weekday').aggregate(numpy.sum)
    counts_by_day

Out[14]:
             Berri 1  Côte-Sainte-Catherine  Maisonneuve 1
    weekday
    0         134298                  60329          90051
    1         135305                  58708          92035
    2         152972                  67344         104891
    3         160131                  69028         111895
    4         141771                  56446          98568
    5         101578                  34018          62067
    6          99310                  36466          55324
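Steps 1 and 2 can be checked by hand on a small example. The sketch below uses made-up counts for two full weeks (not the real data) so that each weekday total is the sum of exactly two rows; it calls .sum() directly, which is a shorthand for the same aggregation.

```python
import pandas as pd

# Hypothetical counts: 0, 10, 20, ... for two weeks starting Sunday 2012-01-01.
bike_data = pd.DataFrame(
    {"Berri 1": [i * 10 for i in range(14)]},
    index=pd.date_range("2012-01-01", periods=14, name="Date"),
)

# Step 1: the index holds dates, so it can report the day of the week
# (0 = Monday ... 6 = Sunday).
bike_data["weekday"] = bike_data.index.weekday

# Step 2: put rows with the same weekday in one group and add them up.
counts_by_day = bike_data.groupby("weekday").sum()

print(counts_by_day)
```

The two Sundays hold 0 and 70, so row 6 totals 70; the two Mondays hold 10 and 80, so row 0 totals 90.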

Slide 18

Slide 18 text

Step 3: draw a graph!

In [15]:
    counts_by_day.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    counts_by_day.plot()

Out[15]:
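Relabeling works because groupby sorts its keys, so rows 0 to 6 line up with Monday to Sunday. As an alternative to typing the names by hand, this sketch looks them up in Python's standard calendar module. The totals are made up, and the English names assume the default locale.

```python
import calendar
import pandas as pd

# Hypothetical weekday totals, indexed 0 (Monday) to 6 (Sunday).
counts_by_day = pd.DataFrame(
    {"Berri 1": [90, 110, 130, 150, 170, 190, 70]},
    index=pd.Index(range(7), name="weekday"),
)

# calendar.day_name[0] is 'Monday', matching pandas' weekday numbering.
counts_by_day.index = [calendar.day_name[i] for i in counts_by_day.index]

print(counts_by_day)
```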

Slide 19

Slide 19 text

There's more going on, though

In [16]:
    bike_data['Berri 1'].plot()

Out[16]:

Slide 20

Slide 20 text

Part 3: Grab some weather data and look at the temperatures

In [17]:
    def get_weather_data(year):
        url_template = "http://climate.weather.gc.ca/climateData/bulkdata_e.html?format=csv&stationID=5415&Year={year}&Month={month}&timeframe=1&submit=Download+Data"
        # mctavish station: 10761, airport station: 5415
        data_by_month = []
        for month in range(1, 13):
            url = url_template.format(year=year, month=month)
            weather_data = pd.read_csv(url, skiprows=16, index_col='Date/Time', parse_dates=True).dropna(axis=1)
            weather_data.columns = map(lambda x: x.replace('\xb0', ''), weather_data.columns)
            weather_data = weather_data.drop(['Year', 'Day', 'Month', 'Time', 'Data Quality'], axis=1)
            data_by_month.append(weather_data.dropna())
        # Concatenate and drop any empty columns
        return pd.concat(data_by_month).dropna(axis=1, how='all').dropna()

In [18]:
    weather_data = get_weather_data(2012)
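The function follows a common pattern: build one URL per month, load each month into its own dataframe, then stack the twelve pieces with pd.concat. The sketch below shows that pattern without touching the network. The helper fake_month is a hypothetical stand-in for the pd.read_csv(url, ...) call and returns made-up temperatures.

```python
import pandas as pd

url_template = (
    "http://climate.weather.gc.ca/climateData/bulkdata_e.html?format=csv&"
    "stationID=5415&Year={year}&Month={month}&timeframe=1&submit=Download+Data"
)

# One URL per month; str.format fills in the {year} and {month} placeholders.
urls = [url_template.format(year=2012, month=month) for month in range(1, 13)]


def fake_month(month):
    """Hypothetical stand-in for pd.read_csv(url): two hourly readings."""
    index = pd.to_datetime(
        ["2012-%02d-01 00:00" % month, "2012-%02d-01 01:00" % month]
    )
    return pd.DataFrame({"Temp (C)": [float(month), month + 0.5]}, index=index)


# pd.concat stacks the monthly frames top to bottom into one dataframe.
data_by_month = [fake_month(month) for month in range(1, 13)]
weather_data = pd.concat(data_by_month)

print(urls[0])
print(weather_data.head())
```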

Slide 21

Slide 21 text

In [19]:
    weather_data[:5]

Out[19]:
                         Dew Point Temp (C)  Rel Hum (%)  Stn Press (kPa)  Temp (C)  Visibility (km)               Weather  Wind Spd (km/h)
    Date/Time
    2012-01-01 00:00:00                -3.9           86           101.24      -1.8              8.0                   Fog                4
    2012-01-01 01:00:00                -3.7           87           101.24      -1.8              8.0                   Fog                4
    2012-01-01 02:00:00                -3.4           89           101.26      -1.8              4.0  Freezing Drizzle,Fog                7
    2012-01-01 03:00:00                -3.2           88           101.27      -1.5              4.0  Freezing Drizzle,Fog                6
    2012-01-01 04:00:00                -3.3           88           101.23      -1.5              4.8                   Fog                7

Slide 22

Slide 22 text

We need the temperatures every day, not every hour...

In [20]:
    bike_data['mean temp'] = weather_data['Temp (C)'].resample('D', how='mean')

In [21]:
    bike_data.head()

Out[21]:
                Berri 1  Côte-Sainte-Catherine  Maisonneuve 1  weekday  mean temp
    Date
    2012-01-01       35                      0             38        6   0.629167
    2012-01-02       83                      1             68        0   0.041667
    2012-01-03      135                      2            104        1 -14.416667
    2012-01-04      144                      1            116        2 -13.645833
    2012-01-05      197                      2            124        3  -6.750000
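Resampling groups the hourly readings into calendar days and reduces each group to one number. Here is a minimal sketch with made-up hourly temperatures. Note that current pandas releases no longer accept the how= keyword used on the slide, so the sketch spells the same operation as .resample('D').mean().

```python
import pandas as pd

# Hypothetical hourly temperatures 0.0, 1.0, ..., 47.0 covering two days.
hourly_index = pd.date_range(
    "2012-01-01", periods=48, freq=pd.Timedelta(hours=1)
)
temps = pd.Series(
    [float(hour) for hour in range(48)], index=hourly_index, name="Temp (C)"
)

# Bucket the readings by calendar day ('D') and average each bucket.
daily_mean = temps.resample("D").mean()

# Assigning a Series to a column aligns on the index, so each daily mean
# lands on the row with the matching date.
bike_data = pd.DataFrame(
    {"Berri 1": [35, 83]},
    index=pd.date_range("2012-01-01", periods=2, name="Date"),
)
bike_data["mean temp"] = daily_mean

print(bike_data)
```

The first day averages 0 through 23 (11.5) and the second averages 24 through 47 (35.5).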

Slide 23

Slide 23 text

Bikers per day and temperature

In [22]:
    bike_data[['Berri 1', 'mean temp']].plot(subplots=True)

Out[22]:
    array([, ], dtype=object)

Slide 24

Slide 24 text

Do people bike when it's raining?

In [23]:
    bike_data['Rain'] = weather_data['Weather'].str.contains('Rain').map(lambda x: int(x)).resample('D', how='mean')
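The 'Rain' column ends up holding, for each day, the fraction of hourly observations whose description mentions rain. The sketch below walks through the chain with made-up observations taken every six hours. It uses .astype(int) and .resample('D').mean() in place of the slide's .map(...) and how='mean', on the assumption that a current pandas release is installed.

```python
import pandas as pd

# Hypothetical weather descriptions, four observations per day for two days.
index = pd.date_range("2012-05-01", periods=8, freq=pd.Timedelta(hours=6))
weather = pd.Series(
    ["Fog", "Rain", "Rain,Fog", "Cloudy",
     "Clear", "Clear", "Rain Showers", "Clear"],
    index=index,
    name="Weather",
)

# True wherever the description contains 'Rain', then True/False -> 1/0.
is_rainy = weather.str.contains("Rain").astype(int)

# The daily mean of a 0/1 series is the share of rainy observations.
rain_fraction = is_rainy.resample("D").mean()

print(rain_fraction)
```

A value of 0.5 therefore means rain was reported in half of that day's observations, not that it rained for half the day with certainty.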

Slide 25

Slide 25 text

In [24]:
    bike_data[['Berri 1', 'Rain']].plot(subplots=True)

Out[24]:
    array([, ], dtype=object)

Slide 26

Slide 26 text

Let's look at unpopular days in the summer

In [25]:
    # Look at everything between May and September
    summertime_data = bike_data['2012-05-01':'2012-09-01']

In [26]:
    summertime_data['Berri 1'][:5] < 2500

Out[26]:
    Date
    2012-05-01     True
    2012-05-02    False
    2012-05-03    False
    2012-05-04    False
    2012-05-05    False
    Name: Berri 1, dtype: bool

Slide 27

Slide 27 text

In [27]:
    summertime_data = bike_data['2012-05-01':'2012-09-01']
    bad_days = summertime_data[summertime_data['Berri 1'] < 2500]
    bad_days[['Berri 1', 'Rain', 'mean temp', 'weekday']]

Out[27]:
                Berri 1      Rain  mean temp  weekday
    Date
    2012-05-01     1986  0.416667   9.437500        1
    2012-05-08     1241  0.666667  12.645833        1
    2012-05-22     2315  0.583333  18.279167        1
    2012-06-02      943  0.583333  13.566667        5
    2012-06-25     2245  0.208333  17.270833        0
    2012-08-05     1864  0.166667  25.783333        6
    2012-08-10     2414  0.458333  19.841667        4
    2012-08-11     2453  0.125000  20.891667        5
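Two ideas combine here: slicing a date-indexed dataframe with date strings, and filtering rows with a boolean Series. A minimal sketch with made-up counts follows. It uses .loc for the date slice and introduces the name is_bad for the intermediate mask; both are choices made for the illustration.

```python
import pandas as pd

# Hypothetical daily counts straddling the start of May 2012.
bike_data = pd.DataFrame(
    {"Berri 1": [3000, 1000, 1986, 4000, 2400, 5000]},
    index=pd.date_range("2012-04-29", periods=6, name="Date"),
)

# Date-string slicing keeps both endpoints; the two April rows drop out.
summertime_data = bike_data.loc["2012-05-01":"2012-09-01"]

# Comparing a column to a number gives one True/False per row ...
is_bad = summertime_data["Berri 1"] < 2500

# ... and indexing with that boolean Series keeps only the True rows.
bad_days = summertime_data[is_bad]

print(bad_days)
```

The 1000-count day on April 30 is below the threshold but is excluded because the date slice runs first.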

Slide 28

Slide 28 text

Some advice
- Read (some of) the documentation: http://pandas.pydata.org/ has a 460-page PDF with lots of examples
- Python for Data Analysis by Wes McKinney is great
- Always use vectorized operations, try not to write your own loops (though: see Numba)
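To make the advice about vectorized operations concrete, here is a small sketch that computes each day's share of the total twice: once with explicit Python loops and once as a single expression on the whole Series. The counts are the first five Berri 1 values from the talk; the names shares_loop and shares_vectorized are made up.

```python
import pandas as pd

counts = pd.Series([35, 83, 135, 144, 197], name="Berri 1")

# Loop version: visit every value by hand, twice.
total = 0
for value in counts:
    total += value
shares_loop = [value / total for value in counts]

# Vectorized version: one expression; pandas and numpy do the looping.
shares_vectorized = counts / counts.sum()

print(shares_vectorized)
```

Both versions give the same numbers. The vectorized one is shorter and, on large data, usually much faster because the loop runs in compiled code.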

Slide 29

Slide 29 text

Thanks! Questions?

In [29]:
    print 'Email:', julia['email']
    print 'Twitter:', julia['twitter']
    print 'Slides: http://bit.ly/pyconca-pandas'

    Email: [email protected]
    Twitter: http://twitter.com/b0rk
    Slides: http://bit.ly/pyconca-pandas