Server Log Analysis with Pandas by Taavi Burns

Use IPython, matplotlib, and pandas to slice, dice, and visualize your application's behaviour through its logs.

PyCon 2013

March 17, 2013

Transcript

  1. Log Analysis With Pandas
     Taavi Burns, http://twitter.com/jaaaarel, http://taaviburns.ca, [email protected]

     What is pandas?
     "pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language." (http://pandas.pydata.org/)

     In [1]: from IPython.core.display import Image
             Image(filename='PythonForDataAnalysisCover.png')
     Out[1]: [cover image of the "Python for Data Analysis" book]
  2. Let's dive right in!

     Installing things

     OSX:
     $ brew install zeromq freetype
     If you also want HDF5 support:
     $ brew install hdf5

     Debian/Ubuntu:
     $ sudo apt-get install libzmq-dev libpng-dev libfreetype6-dev g++
     If you also want HDF5 support:
     $ sudo apt-get install libhdf5-serial-dev

     Python packages:
     $ pip install ipython pyzmq tornado numpy matplotlib pandas
     If you also want HDF5 support:
     $ pip install cython numexpr tables

     Run it!
     $ ipython notebook --pylab inline

     Hello, World!
     In [2]: print "Hello, world!"
     Hello, world!

     Inline plotting
     Line plot
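     A quick way to confirm the installation worked is to print the library versions from the first notebook cell; a minimal sketch (not part of the original slides):

         # Sanity check: each of these packages exposes __version__.
         import numpy
         import pandas
         import matplotlib

         print "numpy", numpy.__version__
         print "pandas", pandas.__version__
         print "matplotlib", matplotlib.__version__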
  3. In [3]: data = range(-100, 100)
             title('Cubic plot')
             plot(data, [x**3 for x in data], label="$y = x^3$")
             legend(loc='best')
     Out[3]: <matplotlib.legend.Legend at 0x104a2c2d0>

     Histogram
     In [4]: import random
             datapoints = [random.gauss(0, 1) for _ in range(1000)]
             title('Gaussian distribution')
             h = hist(datapoints)
  4. Pandas is built on top of NumPy, providing higher-level abstractions.

     A Series is like an array: a 1-D list of homogeneously-typed items.
     In [1]: import pandas as pd
             a = pd.Series([1, 2, 3])
             a
     Out[1]: 0    1
             1    2
             2    3
             dtype: int64
     In [2]: a.dtype
     Out[2]: dtype('int64')

     Here's a Series() of floating point numbers:
     In [3]: b = pd.Series([1, 2.3, 3])
             b
     Out[3]: 0    1.0
             1    2.3
             2    3.0
             dtype: float64
     In [4]: b.dtype
     Out[4]: dtype('float64')

     Of course, if you mix things up, everything in Python is an object in the end.
     In [5]: c = pd.Series(['a', None, 5])
             c
     Out[5]: 0       a
             1    None
             2       5
             dtype: object
     In [6]: c.dtype
     Out[6]: dtype('O')

     Broadcasting operations across a Series
     You can apply conditional expressions to a Series, and it will return another Series with the result of that expression applied to each value. NumPy calls this "broadcasting".
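     Arithmetic expressions broadcast in the same element-wise way as conditionals; a minimal sketch (not part of the original slides) using a Series like the one above:

         import pandas as pd

         a = pd.Series([1, 2, 3])
         # Each expression returns a new Series, applied element-wise.
         print a * 10   # 10, 20, 30
         print a + a    # 2, 4, 6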
  5. In [7]: a
     Out[7]: 0    1
             1    2
             2    3
             dtype: int64
     In [8]: a > 1
     Out[8]: 0    False
             1     True
             2     True
             dtype: bool
     In [9]: a == 1
     Out[9]: 0     True
             1    False
             2    False
             dtype: bool

     It's also easy to broadcast your own callable by using Series.map():
     In [10]: a.map(lambda x: x % 2 == 0)
     Out[10]: 0    False
              1     True
              2    False
              dtype: bool

     DataFrames
     A DataFrame is essentially a set of Series objects (as columns) with a shared index (the row labels).
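     A minimal sketch (not part of the original slides) showing that idea directly: building a DataFrame from a dict of Series that share an index.

         import pandas as pd

         # Two Series with the same index become two columns of one frame.
         integers = pd.Series([1, 4, 7], index=[1, 2, 3])
         floats = pd.Series([2.3, 5.0, 8.0], index=[1, 2, 3])
         frame = pd.DataFrame({'Integers': integers, 'Floats': floats})
         print frame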
  6. In [11]: d = pd.DataFrame(
                  [[1, 2.3, 'three'],
                   [4, 5, 6],
                   [7, 8, 9],
                   [10, 11, 12]],
                  columns=['Integers', 'Floats', 'Objects'],
                  index=[1, 2, 3, 4])
              d
     Out[11]:    Integers  Floats Objects
              1         1     2.3   three
              2         4     5.0       6
              3         7     8.0       9
              4        10    11.0      12
     In [12]: d.dtypes
     Out[12]: Integers      int64
              Floats      float64
              Objects      object
              dtype: object

     Selecting data
     Selecting a column by using a key lookup:
     In [13]: d['Floats']
     Out[13]: 1     2.3
              2     5.0
              3     8.0
              4    11.0
              Name: Floats, dtype: float64

     You can look up two columns by indexing using a list of columns:
  7. In [14]: d[['Integers', 'Objects']]
     Out[14]:    Integers Objects
              1         1   three
              2         4       6
              3         7       9
              4        10      12

     You can select a range of rows using list slices. Note that this refers to the rows as if they were in a Python list()!
     In [15]: d[2:]
     Out[15]:    Integers  Floats Objects
              3         7       8       9
              4        10      11      12

     You can also avoid the magic and just use DataFrame.xs() to access the rows by their indexed name:
     In [16]: d.xs(3, axis=0)
     Out[16]: Integers    7
              Floats      8
              Objects     9
              Name: 3, dtype: object

     Or specifying column names:
     In [17]: d.xs('Floats', axis=1)
     Out[17]: 1     2.3
              2     5.0
              3     8.0
              4    11.0
              Name: Floats, dtype: float64

     Row indexing can also be done using a mask:
  8. In [18]: mask = [True, False, True, False]
              d[mask]
     Out[18]:    Integers  Floats Objects
              1         1     2.3   three
              3         7     8.0       9

     Combine this with conditional expression broadcasting and you can get some really interesting results:
     In [19]: where_Integers_is_gt_4 = d['Integers'] > 4
              d[where_Integers_is_gt_4]
     Out[19]:    Integers Floats Objects
              3         7      8       9
              4        10     11      12
     In [20]: d[np.invert(where_Integers_is_gt_4)]
     Out[20]:    Integers  Floats Objects
              1         1     2.3   three
              2         4     5.0       6

     To get a subset of the rows based on the index value:
     In [21]: d[d.index > 2]
     Out[21]:    Integers Floats Objects
              3         7      8       9
              4        10     11      12
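     Boolean masks can also be combined; a minimal sketch (not part of the original slides), assuming the same DataFrame d as above:

         import pandas as pd

         d = pd.DataFrame(
             [[1, 2.3, 'three'], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
             columns=['Integers', 'Floats', 'Objects'],
             index=[1, 2, 3, 4])
         # Combine element-wise conditions with & (and) or | (or).
         # The parentheses are required because of operator precedence.
         print d[(d['Integers'] > 1) & (d['Floats'] < 11)]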
  9. So we have some logs
     How to get them into a readable format is beyond the scope of this presentation. Let's assume you have your log files in CSV format already:
     In [1]: from cStringIO import StringIO
             log_file = StringIO(
                 'posix_timestamp,elapsed,sys,user,queries,query_time,rows,'
                 'accountid,userid,contactid,level,silo,method\n'
                 '1343103150,0.062353,0,4,6,0.01690,3,'
                 '12345,1,-1,3,invoice_InvoiceResource,search\n'
             )

     Then it's really easy to import the data into pandas:
     In [2]: import pandas as pd
             from datetime import datetime
             import gc

             def posix_string_to_datetime(posix_string):
                 return datetime.utcfromtimestamp(int(posix_string))
     In [3]: # Disabling GC for CPython dramatically improves the speed of
             # bulk imports and won't actually use more memory unless we
             # create circular garbage. For example, loading 1.8M records
             # goes from taking 55s to taking 35s!
             gc.disable()
             df = pd.io.parsers.read_csv(
                 log_file,
                 # index_col is the first column, our posix_timestamp
                 index_col=0,
                 # Interpret the index column as a date
                 parse_dates=0,
                 date_parser=posix_string_to_datetime)
             gc.enable()
             df
     Out[3]:                       elapsed  sys  user  queries  query_time  rows  accountid  userid  contactid  level                     silo  method
             posix_timestamp
             2012-07-24 04:12:30  0.062353    0     4        6      0.0169     3      12345       1         -1      3  invoice_InvoiceResource  search

     If you want to skip the minutes of waiting for your data to re-load (and feel ashamed about having to invoke gc.disable()), then install the tables (PyTables) module so you can use HDF5!
     In [4]: store = pd.HDFStore('example.h5', complib='blosc')
     In [5]: store['example_data'] = df
             store
     Out[5]: <class 'pandas.io.pytables.HDFStore'>
             File path: example.h5
             /example_data    frame    (shape->[1,12])
     In [6]: del store
  10. In [7]: store = pd.HDFStore('example.h5')
              store['example_data']
      Out[7]:                       elapsed  sys  user  queries  query_time  rows  accountid  userid  contactid  level                     silo  method
              posix_timestamp
              2012-07-24 04:12:30  0.062353    0     4        6      0.0169     3      12345       1         -1      3  invoice_InvoiceResource  search
  11. Histogram example
      In [1]: import pandas as pd
              store = pd.HDFStore('/Volumes/FreshBooks/data/store.h5')
      In [2]: may07 = store['may07']
      In [3]: # bins is all the edges of the bins
              # e.g. bins = [1, 2, 3, 4] defines the ranges
              # [1,2), [2,3), and [3,4]
              bins = [2**x for x in range(-10, 10)]
              xscale('log')
              xlim((bins[0], bins[-1]))
              xlabel('Response time (s)')
              ylabel('# of requests that took that long')
              hist(may07['elapsed'], bins=bins)
      Out[3]: (array([     0,     24, 187141, 347874, 488751, 529416, 160609,
                       96496,  34292,  14574,   9848,  20565,    969,    237,
                          77,     56,      8,      0,      0]),
               array([  9.76562500e-04,   1.95312500e-03,   3.90625000e-03,
                        7.81250000e-03,   1.56250000e-02,   3.12500000e-02,
                        6.25000000e-02,   1.25000000e-01,   2.50000000e-01,
                        5.00000000e-01,   1.00000000e+00,   2.00000000e+00,
                        4.00000000e+00,   8.00000000e+00,   1.60000000e+01,
                        3.20000000e+01,   6.40000000e+01,   1.28000000e+02,
                        2.56000000e+02,   5.12000000e+02]),
               <a list of 19 Patch objects>)
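      Before (or instead of) plotting, it can help to look at summary statistics of the column being histogrammed; a minimal sketch (not part of the original slides), assuming the same may07 frame:

          # describe() reports count, mean, std, min, quartiles, and max
          # for the elapsed response times in one call.
          print may07['elapsed'].describe()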
  12. In [1]: import pandas as pd
              store = pd.HDFStore('/Volumes/FreshBooks/data/store.h5')
              may07 = store['may07']
              may08 = store['may08']

      Is there a correlation between response time and response size?
      We can see how queries that return more data take longer to run using the pylab plot() method:
      In [2]: xlabel('Elapsed (s)')
              ylabel('Number of rows returned')
              plot(may07['elapsed'], may07['rows'], 'r.', label='May 7')
      Out[2]: [<matplotlib.lines.Line2D at 0x111ac7410>]

      But that's a little generic, as it's looking at all requests. We can do better! And we can compare one day to another! Let's start by selecting only a subset of the data. We do deployments around 9am, so we'll just look at requests that come in after release:
      In [3]: from datetime import datetime
              import pytz

              toronto = pytz.timezone('America/Toronto')
              may07_0900 = toronto.localize(
                  datetime(2012, 5, 7, 9, 0)
              ).astimezone(pytz.utc).replace(tzinfo=None)
              may08_0900 = toronto.localize(
                  datetime(2012, 5, 8, 9, 0)
              ).astimezone(pytz.utc).replace(tzinfo=None)
  13. I deployed some interesting changes on the 8th, with the goal of improving performance for invoice-related requests for larger amounts of data. (With thanks to Mike Pirnat for his presentation on metaclassing two years ago!)

      Let's start by getting the comparison data from the previous day:
      In [4]: # Select only the requests that came in after 9am local time
              may07_after_0900 = may07[may07.index >= may07_0900]
              # Select only the requests for invoice things
              may07_invoice_after_0900 = may07_after_0900[
                  may07_after_0900['silo'] == 'invoice_InvoiceResource']
              # Note that Pandas stores the DatetimeIndex values
              # as naïve datetimes in UTC!
              may07_invoice_after_0900
      Out[4]: <class 'pandas.core.frame.DataFrame'>
              DatetimeIndex: 284115 entries, 2012-05-07 13:00:00 to 2012-05-08 03:59:59
              Data columns (284115 non-null values each): request_id, port, accountid, userid,
              contactid, level, silo, method, rows, queries, query_time, elapsed, user, sys
              dtypes: float64(2), int64(10), object(2)

      Now we'll get the same range of data, but from after my code went live:
      In [5]: may08_after_0900 = may08[may08.index >= may08_0900]
              may08_invoice_after_0900 = may08_after_0900[
                  may08_after_0900['silo'] == 'invoice_InvoiceResource']
              may08_invoice_after_0900
      Out[5]: <class 'pandas.core.frame.DataFrame'>
              DatetimeIndex: 263552 entries, 2012-05-08 13:00:00 to 2012-05-09 03:59:59
              Data columns (263552 non-null values each): request_id, port, accountid, userid,
              contactid, level, silo, method, rows, queries, query_time, elapsed, user, sys
              dtypes: float64(2), int64(10), object(2)
  14. In [6]: title(
                  'Response size vs request time for invoices, '
                  'comparing two days')
              xlabel('Elapsed (s)')
              xlim(0, 30)  # There are some uninteresting outliers
              ylabel('Number of rows returned')
              plot(
                  may07_invoice_after_0900['elapsed'],
                  may07_invoice_after_0900['rows'],
                  'r.', label='May 7')
              plot(
                  may08_invoice_after_0900['elapsed'],
                  may08_invoice_after_0900['rows'],
                  'b.', label='May 8')
              legend()
      Out[6]: <matplotlib.legend.Legend at 0x11198cf10>

      For comparison, I didn't change much about Estimates:
  15. In [7]: may07_estimate_after_0900 = may07_after_0900[
                  may07_after_0900['silo'] == 'estimate_EstimateResource']
              may08_estimate_after_0900 = may08_after_0900[
                  may08_after_0900['silo'] == 'estimate_EstimateResource']
              title(
                  'Response size vs request time for estimates, '
                  'comparing two days')
              xlabel('Elapsed (s)')
              xlim(0, 30)
              ylabel('Number of rows returned')
              plot(
                  may07_estimate_after_0900['elapsed'],
                  may07_estimate_after_0900['rows'],
                  'r.', label='May 7')
              plot(
                  may08_estimate_after_0900['elapsed'],
                  may08_estimate_after_0900['rows'],
                  'b.', label='May 8')
              legend()
      Out[7]: <matplotlib.legend.Legend at 0x139780f90>

      But I did change estimates a week later! Let's see how they fare:
  16. In [8]: may14 = store['may14']
              may14_0900 = toronto.localize(
                  datetime(2012, 5, 14, 9, 0)
              ).astimezone(pytz.utc).replace(tzinfo=None)
              may14_after_0900 = may14[may14.index >= may14_0900]
              may14_estimate_after_0900 = may14_after_0900[
                  may14_after_0900['silo'] == 'estimate_EstimateResource']

              may15 = store['may15']
              may15_0900 = toronto.localize(
                  datetime(2012, 5, 15, 9, 0)
              ).astimezone(pytz.utc).replace(tzinfo=None)
              may15_after_0900 = may15[may15.index >= may15_0900]
              may15_estimate_after_0900 = may15_after_0900[
                  may15_after_0900['silo'] == 'estimate_EstimateResource']

              title(
                  'Response size vs request time for estimates, '
                  'comparing two days')
              xlabel('Elapsed (s)')
              xlim(0, 30)
              ylabel('Number of rows returned')
              plot(
                  may14_estimate_after_0900['elapsed'],
                  may14_estimate_after_0900['rows'],
                  'r.', label='May 14')
              plot(
                  may15_estimate_after_0900['elapsed'],
                  may15_estimate_after_0900['rows'],
                  'b.', label='May 15')
              legend()
      Out[8]: <matplotlib.legend.Legend at 0x1397cd9d0>

      Bingo!
  17. In [1]: import pandas as pd
              store = pd.HDFStore('/Volumes/FreshBooks/data/store.h5')
              may07 = store['may07']

      If we want to get a graph over time like those you get from munin and graphite, we'll need to resample() our DataFrame:
      In [2]: from datetime import timedelta
              import numpy as np

              first_minutes = may07[
                  may07.index < (may07.index[0] + timedelta(minutes=10))]
              first_minutes['elapsed'].resample('T', how=np.median)
      Out[2]: posix_timestamp
              2012-05-07 04:00:00    0.033588
              2012-05-07 04:01:00    0.032231
              2012-05-07 04:02:00    0.031700
              2012-05-07 04:03:00    0.031906
              2012-05-07 04:04:00    0.033401
              2012-05-07 04:05:00    0.032794
              2012-05-07 04:06:00    0.031914
              2012-05-07 04:07:00    0.031836
              2012-05-07 04:08:00    0.034356
              2012-05-07 04:09:00    0.030676
              Freq: T, dtype: float64

      Then we can use the handy DataFrame.plot() method to get it displayed easily:
      In [3]: ylabel('Median response time by minute (s)')
              may07['elapsed'].resample('T', how='median').plot()
      Out[3]: <matplotlib.axes.AxesSubplot at 0x10587fa10>

      We can use other callables to indicate how we'd like the downsampling to occur, such as a percentile function:
  18. In [4]: elapsed = may07['elapsed']
              ylabel('Seconds')
              elapsed.resample(
                  'T', how=lambda x: np.percentile(x, 25)
              ).plot(label="25%ile", color='red')
              elapsed.resample(
                  'T', how=lambda x: np.percentile(x, 50)
              ).plot(label="50%ile", color='orange')
              elapsed.resample(
                  'T', how=lambda x: np.percentile(x, 75)
              ).plot(label="75%ile", color='green')
              legend()
      Out[4]: <matplotlib.legend.Legend at 0x111b46e50>
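      The same resampling idea works for throughput as well as latency; a minimal sketch (not part of the original slides), assuming the same may07 frame and the resample(how=...) API used above:

          # Count requests per one-minute bucket instead of aggregating
          # their values, then plot the result as a time series.
          requests_per_minute = may07['elapsed'].resample('T', how='count')
          ylabel('Requests per minute')
          requests_per_minute.plot()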
  19. Log analysis with pandas
      As presented at PyCon US 2013
      http://taaviburns.ca/presentations/log_analysis_with_pandas2013/

      IPython Notebook
      Don't forget to have sufficient requirements installed first. Particularly (as of this writing) Pandas 0.11-dev from github.
      1. Introduction (nbviewer)
      2. Pandas Crash Course (nbviewer)
      3. ZOMG LOGS (nbviewer)
      4. Histogram Example (nbviewer)
      5. Scatterplots (nbviewer)
      6. Resampling (nbviewer)

      PDF
      1. Introduction
      2. Pandas Crash Course
      3. ZOMG LOGS
      4. Histogram Example
      5. Scatterplots
      6. Resampling