Slide 1


Log Analysis With Pandas

Taavi Burns
http://twitter.com/jaaaarel
http://taaviburns.ca
[email protected]

What is pandas?

"pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language."
-- http://pandas.pydata.org/

In [1]: from IPython.core.display import Image
        Image(filename='PythonForDataAnalysisCover.png')
Out[1]: [cover image of "Python for Data Analysis"]

Slide 2


Let's dive right in!

Installing things

OSX

$ brew install zeromq freetype

If you also want HDF5 support:

$ brew install hdf5

Debian/Ubuntu

$ sudo apt-get install libzmq-dev libpng-dev libfreetype6-dev g++

If you also want HDF5 support:

$ sudo apt-get install libhdf5-serial-dev

Python packages

$ pip install ipython pyzmq tornado numpy matplotlib pandas

If you also want HDF5 support:

$ pip install cython numexpr tables

Run it!

$ ipython notebook --pylab inline

Hello, World!

In [2]: print "Hello, world!"
Hello, world!

Inline plotting

Line plot

Slide 3


In [3]: data = range(-100, 100)
        title('Cubic plot')
        plot(data, [x**3 for x in data], label="$y = x^3$")
        legend(loc='best')
Out[3]: [line plot: 'Cubic plot', y = x^3]

Histogram

In [4]: import random
        datapoints = [random.gauss(0, 1) for _ in range(1000)]
        title('Gaussian distribution')
        h = hist(datapoints)

[histogram: 'Gaussian distribution']

Slide 4


Pandas is built on top of NumPy, providing higher-level abstractions. A Series is like an array: a 1-D list of homogeneously-typed items.

In [1]: import pandas as pd
        a = pd.Series([1, 2, 3])
        a
Out[1]: 0    1
        1    2
        2    3
        dtype: int64

In [2]: a.dtype
Out[2]: dtype('int64')

Here's a Series() of floating point numbers:

In [3]: b = pd.Series([1, 2.3, 3])
        b
Out[3]: 0    1.0
        1    2.3
        2    3.0
        dtype: float64

In [4]: b.dtype
Out[4]: dtype('float64')

Of course, if you mix things up, everything in Python is an object in the end:

In [5]: c = pd.Series(['a', None, 5])
        c
Out[5]: 0       a
        1    None
        2       5
        dtype: object

In [6]: c.dtype
Out[6]: dtype('O')

Broadcasting operations across a Series

You can apply conditional expressions to a Series, and it will return another Series with the result of that expression applied to each value. NumPy calls this "broadcasting".
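
Before the conditional examples that follow, here is a minimal sketch (not from the original notebook) showing that ordinary arithmetic broadcasts across a Series in exactly the same way:

    import pandas as pd

    a = pd.Series([1, 2, 3])
    # Arithmetic is applied element-wise ("broadcast") across the Series:
    print(a * 10)    # 10, 20, 30
    print(a + a)     # 2, 4, 6 -- element-wise addition of two Series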

Slide 5


In [7]: a
Out[7]: 0    1
        1    2
        2    3
        dtype: int64

In [8]: a > 1
Out[8]: 0    False
        1     True
        2     True
        dtype: bool

In [9]: a == 1
Out[9]: 0     True
        1    False
        2    False
        dtype: bool

It's also easy to broadcast your own callable by using Series.map():

In [10]: a.map(lambda x: x % 2 == 0)
Out[10]: 0    False
         1     True
         2    False
         dtype: bool

DataFrames

A DataFrame is essentially a set of Series objects (as columns) with a shared index (the row labels).
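
As a small illustration of that definition (a sketch, not from the original notebook), a DataFrame can be built directly from a dict of Series that share an index; the dict keys become the column names:

    import pandas as pd

    # Two Series sharing the same index become two columns of one DataFrame.
    prices = pd.Series([1.5, 2.0, 0.5], index=['a', 'b', 'c'])
    counts = pd.Series([10, 3, 7], index=['a', 'b', 'c'])
    df = pd.DataFrame({'price': prices, 'count': counts})
    print(df)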

Slide 6


In [11]: d = pd.DataFrame(
             [[1, 2.3, 'three'],
              [4, 5, 6],
              [7, 8, 9],
              [10, 11, 12]],
             columns=['Integers', 'Floats', 'Objects'],
             index=[1, 2, 3, 4])
         d
Out[11]:    Integers  Floats Objects
         1         1     2.3   three
         2         4     5.0       6
         3         7     8.0       9
         4        10    11.0      12

In [12]: d.dtypes
Out[12]: Integers      int64
         Floats      float64
         Objects      object
         dtype: object

Selecting data

Selecting a column using a key lookup:

In [13]: d['Floats']
Out[13]: 1     2.3
         2     5.0
         3     8.0
         4    11.0
         Name: Floats, dtype: float64

You can look up two columns by indexing using a list of columns:

Slide 7


In [14]: d[['Integers', 'Objects']]
Out[14]:    Integers Objects
         1         1   three
         2         4       6
         3         7       9
         4        10      12

You can select a range of rows using list slices. Note that this refers to the rows as if they were in a Python list()!

In [15]: d[2:]
Out[15]:    Integers  Floats Objects
         3         7       8       9
         4        10      11      12

You can also avoid the magic and just use DataFrame.xs() to access the rows by their indexed name:

In [16]: d.xs(3, axis=0)
Out[16]: Integers    7
         Floats      8
         Objects     9
         Name: 3, dtype: object

Or specifying column names:

In [17]: d.xs('Floats', axis=1)
Out[17]: 1     2.3
         2     5.0
         3     8.0
         4    11.0
         Name: Floats, dtype: float64
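
Not shown in the original slides: pandas 0.11 (the version these notebooks target) also added .loc for label-based and .iloc for position-based row access. A minimal sketch, rebuilding a frame similar to d above:

    import pandas as pd

    d = pd.DataFrame(
        [[1, 2.3, 'three'], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
        columns=['Integers', 'Floats', 'Objects'],
        index=[1, 2, 3, 4])
    # .loc is label-based: the row whose index label is 3 (like d.xs(3, axis=0)).
    print(d.loc[3])
    # .iloc is position-based: rows from position 2 onward (like d[2:]).
    print(d.iloc[2:])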

Slide 8


Row indexing can also be done using a mask:

In [18]: mask = [True, False, True, False]
         d[mask]
Out[18]:    Integers  Floats Objects
         1         1     2.3   three
         3         7     8.0       9

Combine this with conditional expression broadcasting and you can get some really interesting results:

In [19]: where_Integers_is_gt_4 = d['Integers'] > 4
         d[where_Integers_is_gt_4]
Out[19]:    Integers Floats Objects
         3         7      8       9
         4        10     11      12

In [20]: d[np.invert(where_Integers_is_gt_4)]
Out[20]:    Integers  Floats Objects
         1         1     2.3   three
         2         4     5.0       6

To get a subset of the rows based on the index value:

In [21]: d[d.index > 2]
Out[21]:    Integers Floats Objects
         3         7      8       9
         4        10     11      12
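
A small extension not in the original notebook: boolean Series combine element-wise with &, |, and ~, which is often handier than np.invert(). A sketch using the same kind of frame as d above:

    import pandas as pd

    d = pd.DataFrame(
        [[1, 2.3, 'three'], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
        columns=['Integers', 'Floats', 'Objects'],
        index=[1, 2, 3, 4])
    # Parentheses are needed: & and | bind more tightly than comparisons.
    print(d[(d['Integers'] > 4) & (d['Floats'] < 11)])   # row 3 only
    print(d[~(d['Integers'] > 4)])                       # same rows as np.invert(...)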

Slide 9


In [22]: from IPython.core.display import Image
         Image(filename='masking.png')
Out[22]: [diagram: masking.png]

Slide 10


So we have some logs

How to get them into a readable format is beyond the scope of this presentation. Let's assume you have your log files in CSV format already:

In [1]: from cStringIO import StringIO
        log_file = StringIO(
            'posix_timestamp,elapsed,sys,user,queries,query_time,rows,'
            'accountid,userid,contactid,level,silo,method\n'
            '1343103150,0.062353,0,4,6,0.01690,3,'
            '12345,1,-1,3,invoice_InvoiceResource,search\n'
        )

Then it's really easy to import the data into pandas:

In [2]: import pandas as pd
        from datetime import datetime
        import gc

        def posix_string_to_datetime(posix_string):
            return datetime.utcfromtimestamp(int(posix_string))

In [3]: # Disabling GC for CPython dramatically improves the speed of
        # bulk imports and won't actually use more memory unless we
        # create circular garbage. For example, loading 1.8M records
        # goes from taking 55s to taking 35s!
        gc.disable()
        df = pd.io.parsers.read_csv(
            log_file,
            # index_col is the first column, our posix_timestamp
            index_col=0,
            # Interpret the index column as a date
            parse_dates=0,
            date_parser=posix_string_to_datetime)
        gc.enable()
        df
Out[3]:
                      elapsed  sys  user  queries  query_time  rows  accountid  userid  contactid  level                     silo  method
posix_timestamp
2012-07-24 04:12:30  0.062353    0     4        6      0.0169     3      12345       1         -1      3  invoice_InvoiceResource  search

If you want to skip the minutes of waiting for your data to re-load (and feel ashamed about having to invoke gc.disable()), then install the tables (PyTables) module so you can use HDF5!

In [4]: store = pd.HDFStore('example.h5', complib='blosc')
        store['example_data'] = df

In [5]: store
Out[5]: File path: example.h5
        /example_data            frame        (shape->[1,12])

In [6]: del store
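
For genuinely large log files, a hedged aside not from the original notebook: read_csv() can also stream the file in chunks, so you can filter as you go instead of holding everything in memory. The file name and threshold below are made up:

    import pandas as pd

    # Hypothetical: 'big_log.csv' has the same columns as the example above.
    chunks = pd.read_csv('big_log.csv', index_col=0, chunksize=100000)
    # Keep only slow requests (> 1s) from each chunk, then stitch them together.
    slow = pd.concat([chunk[chunk['elapsed'] > 1.0] for chunk in chunks])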

Slide 11


In [7]: store = pd.HDFStore('example.h5')
        store['example_data']
Out[7]:
                      elapsed  sys  user  queries  query_time  rows  accountid  userid  contactid  level                     silo  method
posix_timestamp
2012-07-24 04:12:30  0.062353    0     4        6      0.0169     3      12345       1         -1      3  invoice_InvoiceResource  search
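
One more HDFStore detail worth knowing (not shown in the slides): the store behaves like a small key/value container, and it's polite to close it explicitly when you're done. A minimal sketch:

    import pandas as pd

    store = pd.HDFStore('example.h5')
    print(store.keys())            # e.g. ['/example_data']
    df = store['example_data']     # read a frame back out
    store.close()                  # closes the underlying HDF5 file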

Slide 12


Histogram example

In [1]: import pandas as pd
        store = pd.HDFStore('/Volumes/FreshBooks/data/store.h5')
        may07 = store['may07']

In [2]: # bins is all the edges of the bins
        # e.g. bins = [1, 2, 3, 4] defines the ranges
        # [1,2), [2,3), and [3,4]
        bins = [2**x for x in range(-10, 10)]

In [3]: xscale('log')
        xlim((bins[0], bins[-1]))
        xlabel('Response time (s)')
        ylabel('# of requests that took that long')
        hist(may07['elapsed'], bins=bins)
Out[3]: (array([     0,     24, 187141, 347874, 488751, 529416, 160609,
                 96496,  34292,  14574,   9848,  20565,    969,    237,
                    77,     56,      8,      0,      0]),
         array([ 9.76562500e-04, 1.95312500e-03, 3.90625000e-03,
                 7.81250000e-03, 1.56250000e-02, 3.12500000e-02,
                 6.25000000e-02, 1.25000000e-01, 2.50000000e-01,
                 5.00000000e-01, 1.00000000e+00, 2.00000000e+00,
                 4.00000000e+00, 8.00000000e+00, 1.60000000e+01,
                 3.20000000e+01, 6.40000000e+01, 1.28000000e+02,
                 2.56000000e+02, 5.12000000e+02]),
         )
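
If you want the per-bin counts as numbers rather than a picture, numpy computes the same thing directly. A sketch with synthetic data standing in for the 'elapsed' column:

    import numpy as np

    # Synthetic stand-in for may07['elapsed']: log-normally distributed times.
    elapsed = np.random.lognormal(mean=-3.0, sigma=1.0, size=10000)
    bins = [2.0 ** x for x in range(-10, 10)]
    counts, edges = np.histogram(elapsed, bins=bins)
    print(counts)   # the same counts that hist() plots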

Slide 13


[Figure: histogram of response times from the previous cell, log-scaled x-axis]

Slide 14


In [1]: import pandas as pd
        store = pd.HDFStore('/Volumes/FreshBooks/data/store.h5')
        may07 = store['may07']
        may08 = store['may08']

Is there a correlation between response time and response size?

We can see how queries that return more data take longer to run using the pylab plot() method:

In [2]: xlabel('Elapsed (s)')
        ylabel('Number of rows returned')
        plot(may07['elapsed'], may07['rows'], 'r.', label='May 7')
Out[2]: [scatter plot: number of rows returned vs elapsed time, May 7]

But that's a little generic, as it's looking at all requests. We can do better! And we can compare one day to another!

Let's start by selecting only a subset of the data. We do deployments around 9am, so we'll just look at requests that come in after release:

In [3]: from datetime import datetime
        import pytz

        toronto = pytz.timezone('America/Toronto')
        may07_0900 = toronto.localize(
            datetime(2012, 5, 7, 9, 0)
        ).astimezone(pytz.utc).replace(tzinfo=None)
        may08_0900 = toronto.localize(
            datetime(2012, 5, 8, 9, 0)
        ).astimezone(pytz.utc).replace(tzinfo=None)
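
The question at the top of this notebook can also be answered numerically. A sketch, not from the original slides, reusing the may07 frame loaded in In [1] above and pandas' built-in Pearson correlation:

    # 'elapsed' and 'rows' are columns of the log frame loaded above.
    print(may07['elapsed'].corr(may07['rows']))
    # Or the pairwise correlation matrix for a few numeric columns:
    print(may07[['elapsed', 'rows', 'queries', 'query_time']].corr())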

Slide 15


I deployed some interesting changes on the 8th, with the goal of improving performance for invoice-related requests for larger amounts of data. (With thanks to Mike Pirnat for his presentation on metaclassing two years ago!)

Let's start by getting the comparison data from the previous day:

In [4]: # Select only the requests that came in after 9am local time
        may07_after_0900 = may07[may07.index >= may07_0900]
        # Select only the requests for invoice things
        may07_invoice_after_0900 = may07_after_0900[
            may07_after_0900['silo'] == 'invoice_InvoiceResource']
        # Note that Pandas stores the DatetimeIndex values
        # as naïve datetimes in UTC!
        may07_invoice_after_0900
Out[4]: DatetimeIndex: 284115 entries, 2012-05-07 13:00:00 to 2012-05-08 03:59:59
        Data columns:
        request_id    284115  non-null values
        port          284115  non-null values
        accountid     284115  non-null values
        userid        284115  non-null values
        contactid     284115  non-null values
        level         284115  non-null values
        silo          284115  non-null values
        method        284115  non-null values
        rows          284115  non-null values
        queries       284115  non-null values
        query_time    284115  non-null values
        elapsed       284115  non-null values
        user          284115  non-null values
        sys           284115  non-null values
        dtypes: float64(2), int64(10), object(2)

Now we'll get the same range of data, but from after my code went live:

In [5]: may08_after_0900 = may08[may08.index >= may08_0900]
        may08_invoice_after_0900 = may08_after_0900[
            may08_after_0900['silo'] == 'invoice_InvoiceResource']
        may08_invoice_after_0900
Out[5]: DatetimeIndex: 263552 entries, 2012-05-08 13:00:00 to 2012-05-09 03:59:59
        Data columns:
        request_id    263552  non-null values
        port          263552  non-null values
        accountid     263552  non-null values
        userid        263552  non-null values
        contactid     263552  non-null values
        level         263552  non-null values
        silo          263552  non-null values
        method        263552  non-null values
        rows          263552  non-null values
        queries       263552  non-null values
        query_time    263552  non-null values
        elapsed       263552  non-null values
        user          263552  non-null values
        sys           263552  non-null values
        dtypes: float64(2), int64(10), object(2)
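
The same subset can also be selected in one expression by combining the two boolean conditions. A sketch, not from the slides, reusing may07 and may07_0900 from the cells above:

    # One-step equivalent of the two-step filtering in In [4]; parentheses
    # matter because & binds more tightly than the comparisons.
    invoice_after_0900 = may07[
        (may07['silo'] == 'invoice_InvoiceResource') &
        (may07.index >= may07_0900)]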

Slide 16


In [6]: title(
            'Response size vs request time for invoices, '
            'comparing two days')
        xlabel('Elapsed (s)')
        xlim(0, 30)  # There are some uninteresting outliers
        ylabel('Number of rows returned')
        plot(
            may07_invoice_after_0900['elapsed'],
            may07_invoice_after_0900['rows'],
            'r.', label='May 7')
        plot(
            may08_invoice_after_0900['elapsed'],
            may08_invoice_after_0900['rows'],
            'b.', label='May 8')
        legend()
Out[6]: [scatter plot: invoices, May 7 (red) vs May 8 (blue)]

For comparison, I didn't change much about Estimates:

Slide 17


In [7]: may07_estimate_after_0900 = may07_after_0900[
            may07_after_0900['silo'] == 'estimate_EstimateResource']
        may08_estimate_after_0900 = may08_after_0900[
            may08_after_0900['silo'] == 'estimate_EstimateResource']
        title(
            'Response size vs request time for estimates, '
            'comparing two days')
        xlabel('Elapsed (s)')
        xlim(0, 30)
        ylabel('Number of rows returned')
        plot(
            may07_estimate_after_0900['elapsed'],
            may07_estimate_after_0900['rows'],
            'r.', label='May 7')
        plot(
            may08_estimate_after_0900['elapsed'],
            may08_estimate_after_0900['rows'],
            'b.', label='May 8')
        legend()
Out[7]: [scatter plot: estimates, May 7 (red) vs May 8 (blue)]

But I did change estimates a week later! Let's see how they fare:

Slide 18


In [8]: may14 = store['may14']
        may14_0900 = toronto.localize(
            datetime(2012, 5, 14, 9, 0)
        ).astimezone(pytz.utc).replace(tzinfo=None)
        may14_after_0900 = may14[may14.index >= may14_0900]
        may14_estimate_after_0900 = may14_after_0900[
            may14_after_0900['silo'] == 'estimate_EstimateResource']

        may15 = store['may15']
        may15_0900 = toronto.localize(
            datetime(2012, 5, 15, 9, 0)
        ).astimezone(pytz.utc).replace(tzinfo=None)
        may15_after_0900 = may15[may15.index >= may15_0900]
        may15_estimate_after_0900 = may15_after_0900[
            may15_after_0900['silo'] == 'estimate_EstimateResource']

        title(
            'Response size vs request time for estimates, '
            'comparing two days')
        xlabel('Elapsed (s)')
        xlim(0, 30)
        ylabel('Number of rows returned')
        plot(
            may14_estimate_after_0900['elapsed'],
            may14_estimate_after_0900['rows'],
            'r.', label='May 14')
        plot(
            may15_estimate_after_0900['elapsed'],
            may15_estimate_after_0900['rows'],
            'b.', label='May 15')
        legend()
Out[8]: [scatter plot: estimates, May 14 (red) vs May 15 (blue)]

Bingo!

Slide 19


In [1]: import pandas as pd
        store = pd.HDFStore('/Volumes/FreshBooks/data/store.h5')
        may07 = store['may07']

If we want to get a graph over time like those you get from munin and graphite, we'll need to resample() our DataFrame:

In [2]: from datetime import timedelta
        import numpy as np

        first_minutes = may07[
            may07.index < (may07.index[0] + timedelta(minutes=10))]
        first_minutes['elapsed'].resample('T', how=np.median)
Out[2]: posix_timestamp
        2012-05-07 04:00:00    0.033588
        2012-05-07 04:01:00    0.032231
        2012-05-07 04:02:00    0.031700
        2012-05-07 04:03:00    0.031906
        2012-05-07 04:04:00    0.033401
        2012-05-07 04:05:00    0.032794
        2012-05-07 04:06:00    0.031914
        2012-05-07 04:07:00    0.031836
        2012-05-07 04:08:00    0.034356
        2012-05-07 04:09:00    0.030676
        Freq: T, dtype: float64

Then we can use the handy DataFrame.plot() method to get it displayed easily:

In [3]: ylabel('Median response time by minute (s)')
        may07['elapsed'].resample('T', how='median').plot()
Out[3]: [line plot: median response time by minute across May 7]

We can use other callables to indicate how we'd like the downsampling to occur, such as a percentile function:

Slide 20


In [4]: elapsed = may07['elapsed']
        ylabel('Seconds')
        elapsed.resample(
            'T', how=lambda x: np.percentile(x, 25)
        ).plot(label="25%ile", color='red')
        elapsed.resample(
            'T', how=lambda x: np.percentile(x, 50)
        ).plot(label="50%ile", color='orange')
        elapsed.resample(
            'T', how=lambda x: np.percentile(x, 75)
        ).plot(label="75%ile", color='green')
        legend()
Out[4]: [line plot: 25th/50th/75th percentile response times by minute]
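
The same resampling machinery also gives you a requests-per-minute graph; counting is just another aggregation. A minimal sketch with synthetic data, using the same how= style of resample() as the slides:

    from datetime import datetime, timedelta
    import numpy as np
    import pandas as pd

    # Fake log: 1000 requests at random second offsets within a ten-minute window.
    base = datetime(2012, 5, 7, 4, 0, 0)
    offsets = np.sort(np.random.randint(0, 600, size=1000))
    elapsed = pd.Series(
        np.random.lognormal(-3.0, 1.0, size=1000),
        index=pd.DatetimeIndex([base + timedelta(seconds=int(s)) for s in offsets]))
    # Downsample to one-minute buckets, counting the entries in each bucket.
    print(elapsed.resample('T', how='count'))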

Slide 21


Log analysis with pandas

As presented at PyCon US 2013
http://taaviburns.ca/presentations/log_analysis_with_pandas2013/

IPython Notebook

Don't forget to have sufficient requirements installed first, particularly (as of this writing) Pandas 0.11-dev from github.

1. Introduction (nbviewer)
2. Pandas Crash Course (nbviewer)
3. ZOMG LOGS (nbviewer)
4. Histogram Example (nbviewer)
5. Scatterplots (nbviewer)
6. Resampling (nbviewer)

PDF

1. Introduction
2. Pandas Crash Course
3. ZOMG LOGS
4. Histogram Example
5. Scatterplots
6. Resampling