
Server Log Analysis with Pandas by Taavi Burns

Use IPython, matplotlib, and Pandas to slice, dice, and visualize your application's behaviour through its logs.

PyCon 2013

March 17, 2013


Transcript

  1. Log Analysis With Pandas
    Taavi Burns, http://twitter.com/jaaaarel, http://taaviburns.ca, [email protected]
    What is pandas?
    “pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis
    tools for the Python programming language.”
    http://pandas.pydata.org/
    In [1]: from IPython.core.display import Image
    Image(filename='PythonForDataAnalysisCover.png')
    Out[1]: [cover image of "Python for Data Analysis"]

  2. Let's dive right in!
    Installing things
    OSX
    $ brew install zeromq freetype
    If you also want HDF5 support:
    $ brew install hdf5
    Debian/Ubuntu
    $ sudo apt-get install libzmq-dev libpng-dev libfreetype6-dev g++
    If you also want HDF5 support:
    $ sudo apt-get install libhdf5-serial-dev
    Python packages
    $ pip install ipython pyzmq tornado numpy matplotlib pandas
    If you also want HDF5 support:
    $ pip install cython numexpr tables
    Run it!
    $ ipython notebook --pylab inline
    Hello, World!
    In [2]: print "Hello, world!"
    Hello, world!
    Inline plotting
    Line plot

  3. In [3]: data = range(-100, 100)
    title('Cubic plot')
    plot(data, [x**3 for x in data], label="$y = x^3$")
    legend(loc='best')
    Out[3]:
    Histogram
    In [4]: import random
    datapoints = [random.gauss(0, 1) for _ in range(1000)]
    title('Gaussian distribution')
    h = hist(datapoints)

  4. Pandas is built on top of NumPy, providing higher-level abstractions.
    A Series is like an array: a 1-D list of homogeneously-typed items.
    In [1]: import pandas as pd
    a = pd.Series([1, 2, 3])
    a
    Out[1]: 0    1
    1    2
    2    3
    dtype: int64
    In [2]: a.dtype
    Out[2]: dtype('int64')
    Here's a Series() of floating point numbers:
    In [3]: b = pd.Series([1, 2.3, 3])
    b
    Out[3]: 0    1.0
    1    2.3
    2    3.0
    dtype: float64
    In [4]: b.dtype
    Out[4]: dtype('float64')
    Of course, if you mix things up, everything in Python is an object in the end.
    In [5]: c = pd.Series(['a', None, 5])
    c
    Out[5]: 0    a
    1    None
    2    5
    dtype: object
    In [6]: c.dtype
    Out[6]: dtype('O')
    Broadcasting operations across a Series
    You can apply conditional expressions to a Series, and it will return another Series with the result of that expression applied to each value. NumPy calls this "broadcasting".

  5. In [7]: a
    Out[7]: 0    1
    1    2
    2    3
    dtype: int64
    In [8]: a > 1
    Out[8]: 0    False
    1    True
    2    True
    dtype: bool
    In [9]: a == 1
    Out[9]: 0    True
    1    False
    2    False
    dtype: bool
    It's also easy to broadcast your own callable by using Series.map():
    In [10]: a.map(lambda x: x % 2 == 0)
    Out[10]: 0    False
    1    True
    2    False
    dtype: bool
    DataFrames
    A DataFrame is essentially a set of Series objects (as columns) with a shared index (the row labels).

  6. In [11]: d = pd.DataFrame(
        [
        [1, 2.3, 'three'],
        [4, 5, 6],
        [7, 8, 9],
        [10, 11, 12]],
        columns=['Integers', 'Floats', 'Objects'],
        index=[1, 2, 3, 4])
    d
    Out[11]:
    Integers Floats Objects
    1 1 2.3 three
    2 4 5.0 6
    3 7 8.0 9
    4 10 11.0 12
    In [12]: d.dtypes
    Out[12]: Integers     int64
    Floats     float64
    Objects     object
    dtype: object
    Selecting data
    Selecting a column by using a key lookup:
    In [13]: d['Floats']
    Out[13]: 1     2.3
    2     5.0
    3     8.0
    4    11.0
    Name: Floats, dtype: float64
    You can look up two columns by indexing using a list of columns:

  7. In [14]: d[['Integers', 'Objects']]
    Out[14]:
    Integers Objects
    1 1 three
    2 4 6
    3 7 9
    4 10 12
    You can select a range of rows using list slices. Note that this refers to the rows as if they were in a Python list()!
    In [15]: d[2:]
    Out[15]:
    Integers Floats Objects
    3 7 8 9
    4 10 11 12
    You can also avoid the magic and just use DataFrame.xs() to access the rows by their indexed name:
    In [16]: d.xs(3, axis=0)
    Out[16]: Integers    7
    Floats      8
    Objects     9
    Name: 3, dtype: object
    Or specifying column names:
    In [17]: d.xs('Floats', axis=1)
    Out[17]: 1     2.3
    2     5.0
    3     8.0
    4    11.0
    Name: Floats, dtype: float64
    Row indexing can also be done using a mask:

  8. In [18]: mask = [True, False, True, False]
    d[mask]
    Out[18]:
    Integers Floats Objects
    1 1 2.3 three
    3 7 8.0 9
    Combined with conditional expression broadcasting, you can get some really interesting results:
    In [19]: where_Integers_is_gt_4 = d['Integers'] > 4
    d[where_Integers_is_gt_4]
    Out[19]:
    Integers Floats Objects
    3 7 8 9
    4 10 11 12
    In [20]: d[np.invert(where_Integers_is_gt_4)]
    Out[20]:
    Integers Floats Objects
    1 1 2.3 three
    2 4 5.0 6
    To get a subset of the rows based on the index value:
    In [21]: d[d.index > 2]
    Out[21]:
    Integers Floats Objects
    3 7 8 9
    4 10 11 12

  9. In [22]: from IPython.core.display import Image
    Image(filename='masking.png')
    Out[22]: [masking.png illustration]

  10. So we have some logs
    How to get them into a readable format is beyond the scope of this presentation.
    Let's assume you have your log files in CSV format already:
    In [1]: from cStringIO import StringIO
    log_file = StringIO(
        'posix_timestamp,elapsed,sys,user,queries,query_time,rows,'
        'accountid,userid,contactid,level,silo,method\n'
        '1343103150,0.062353,0,4,6,0.01690,3,'
        '12345,1,-1,3,invoice_InvoiceResource,search\n'
    )
    Then it's really easy to import the data into pandas:
    In [2]: import pandas as pd
    from datetime import datetime
    import gc
    def posix_string_to_datetime(posix_string):
        return datetime.utcfromtimestamp(int(posix_string))
    # Disabling GC for CPython dramatically improves the speed of
    # bulk imports and won't actually use more memory unless we
    # create circular garbage. For example, loading 1.8M records
    # goes from taking 55s to taking 35s!
    gc.disable()
    df = pd.io.parsers.read_csv(
        log_file,
        # index_col is the first column, our posix_timestamp
        index_col=0,
        # Interpret the index column as a date
        parse_dates=0,
        date_parser=posix_string_to_datetime)
    gc.enable()
    In [3]: df
    Out[3]:
    elapsed sys user queries query_time rows accountid userid contactid level silo method
    posix_timestamp
    2012-07-24 04:12:30 0.062353 0 4 6 0.0169 3 12345 1 -1 3 invoice_InvoiceResource search
    If you want to skip the minutes of waiting for your data to re-load (and feel ashamed about having to invoke gc.disable()), then install the tables (PyTables) module so you can use HDF5!
    In [4]: store = pd.HDFStore('example.h5', complib='blosc')
    store['example_data'] = df
    In [5]: store
    Out[5]:
    File path: example.h5
    /example_data frame (shape->[1,12])
    In [6]: del store

  11. In [7]: store = pd.HDFStore('example.h5')
    store['example_data']
    Out[7]:
    elapsed sys user queries query_time rows accountid userid contactid level silo method
    posix_timestamp
    2012-07-24 04:12:30 0.062353 0 4 6 0.0169 3 12345 1 -1 3 invoice_InvoiceResource search

  12. Histogram example
    In [1]: import pandas as pd
    store = pd.HDFStore('/Volumes/FreshBooks/data/store.h5')
    In [2]: may07 = store['may07']
    In [3]: # bins is all the edges of the bins
    # e.g. bins = [1, 2, 3, 4] defines the ranges
    # [1,2), [2,3), and [3,4]
    bins = [2**x for x in range(-10, 10)]
    xscale('log')
    xlim((bins[0], bins[-1]))
    xlabel('Response time (s)')
    ylabel('# of requests that took that long')
    hist(may07['elapsed'], bins=bins)
    Out[3]: (array([     0,     24, 187141, 347874, 488751, 529416, 160609,  96496,
             34292,  14574,   9848,  20565,    969,    237,     77,     56,
                 8,      0,      0]),
     array([  9.76562500e-04,   1.95312500e-03,   3.90625000e-03,
              7.81250000e-03,   1.56250000e-02,   3.12500000e-02,
              6.25000000e-02,   1.25000000e-01,   2.50000000e-01,
              5.00000000e-01,   1.00000000e+00,   2.00000000e+00,
              4.00000000e+00,   8.00000000e+00,   1.60000000e+01,
              3.20000000e+01,   6.40000000e+01,   1.28000000e+02,
              2.56000000e+02,   5.12000000e+02]),
     <a list of 19 Patch objects>)
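    A quick numeric companion to the histogram (my addition, not from the original deck): Series.describe() summarizes the same column without plotting anything:
    # count, mean, std, min, 25%/50%/75% percentiles, and max of the response times
    may07['elapsed'].describe()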

  13. [Rendered histogram of response times from the previous cell]

  14. In [1]: import pandas as pd
    store = pd.HDFStore('/Volumes/FreshBooks/data/store.h5')
    may07 = store['may07']
    may08 = store['may08']
    Is there a correlation between response time and response size?
    We can see how queries that return more data take longer to run using the pylab plot() method:
    In [2]: xlabel('Elapsed (s)')
    ylabel('Number of rows returned')
    plot(may07['elapsed'], may07['rows'], 'r.', label='May 7')
    Out[2]: []
    But that's a little generic, as it's looking at all requests. We can do better! And we can compare one day to another!
    Let's start by selecting only a subset of the data. We do deployments around 9am, so we'll just look at requests that come in after release:
    In [3]: from datetime import datetime
    import pytz
    toronto = pytz.timezone('America/Toronto')
    may07_0900 = toronto.localize(
        datetime(2012, 5, 7, 9, 0)
    ).astimezone(pytz.utc).replace(tzinfo=None)
    may08_0900 = toronto.localize(
        datetime(2012, 5, 8, 9, 0)
    ).astimezone(pytz.utc).replace(tzinfo=None)
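    As a sanity check (my addition, not from the deck): Toronto is on EDT (UTC-4) in May, so the localized 9 a.m. becomes a naive 13:00 UTC, which matches the start of the DatetimeIndex on the next slide:
    may07_0900
    # -> datetime.datetime(2012, 5, 7, 13, 0)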

  15. I deployed some interesting changes on the 8th, with the goal of improving performance for invoice-related requests for larger amounts of data. (With thanks to Mike Pirnat for his presentation on metaclassing two years ago!) Let's start by getting the comparison data from the previous day:
    In [4]: # Select only the requests that came in after 9am local time
    may07_after_0900 = may07[may07.index >= may07_0900]
    # Select only the requests for invoice things
    may07_invoice_after_0900 = may07_after_0900[
        may07_after_0900['silo'] == 'invoice_InvoiceResource']
    # Note that Pandas stores the DatetimeIndex values
    # as naïve datetimes in UTC!
    may07_invoice_after_0900
    Out[4]:
    DatetimeIndex: 284115 entries, 2012-05-07 13:00:00 to 2012-05-08 03:59:59
    Data columns:
    request_id    284115 non-null values
    port          284115 non-null values
    accountid     284115 non-null values
    userid        284115 non-null values
    contactid     284115 non-null values
    level         284115 non-null values
    silo          284115 non-null values
    method        284115 non-null values
    rows          284115 non-null values
    queries       284115 non-null values
    query_time    284115 non-null values
    elapsed       284115 non-null values
    user          284115 non-null values
    sys           284115 non-null values
    dtypes: float64(2), int64(10), object(2)
    Now we'll get the same range of data, but from after my code went live:
    In [5]: may08_after_0900 = may08[may08.index >= may08_0900]
    may08_invoice_after_0900 = may08_after_0900[
        may08_after_0900['silo'] == 'invoice_InvoiceResource']
    may08_invoice_after_0900
    Out[5]:
    DatetimeIndex: 263552 entries, 2012-05-08 13:00:00 to 2012-05-09 03:59:59
    Data columns:
    request_id    263552 non-null values
    port          263552 non-null values
    accountid     263552 non-null values
    userid        263552 non-null values
    contactid     263552 non-null values
    level         263552 non-null values
    silo          263552 non-null values
    method        263552 non-null values
    rows          263552 non-null values
    queries       263552 non-null values
    query_time    263552 non-null values
    elapsed       263552 non-null values
    user          263552 non-null values
    sys           263552 non-null values
    dtypes: float64(2), int64(10), object(2)

  16. In [6]: title(
        'Response size vs request time for invoices, '
        'comparing two days')
    xlabel('Elapsed (s)')
    xlim(0, 30)  # There are some uninteresting outliers
    ylabel('Number of rows returned')
    plot(
        may07_invoice_after_0900['elapsed'],
        may07_invoice_after_0900['rows'], 'r.', label='May 7')
    plot(
        may08_invoice_after_0900['elapsed'],
        may08_invoice_after_0900['rows'], 'b.', label='May 8')
    legend()
    Out[6]:
    For comparison, I didn't change much about Estimates:

  17. In [7]: may07_estimate_after_0900 = may07_after_0900[
        may07_after_0900['silo'] == 'estimate_EstimateResource']
    may08_estimate_after_0900 = may08_after_0900[
        may08_after_0900['silo'] == 'estimate_EstimateResource']
    title(
        'Response size vs request time for estimates, '
        'comparing two days')
    xlabel('Elapsed (s)')
    xlim(0, 30)
    ylabel('Number of rows returned')
    plot(
        may07_estimate_after_0900['elapsed'],
        may07_estimate_after_0900['rows'], 'r.', label='May 7')
    plot(
        may08_estimate_after_0900['elapsed'],
        may08_estimate_after_0900['rows'], 'b.', label='May 8')
    legend()
    Out[7]:
    But I did change estimates a week later! Let's see how they fare:

  18. In [8]: may14 = store['may14']
    may14_0900 = toronto.localize(
        datetime(2012, 5, 14, 9, 0)
    ).astimezone(pytz.utc).replace(tzinfo=None)
    may14_after_0900 = may14[may14.index >= may14_0900]
    may14_estimate_after_0900 = may14_after_0900[
        may14_after_0900['silo'] == 'estimate_EstimateResource']
    may15 = store['may15']
    may15_0900 = toronto.localize(
        datetime(2012, 5, 15, 9, 0)
    ).astimezone(pytz.utc).replace(tzinfo=None)
    may15_after_0900 = may15[may15.index >= may15_0900]
    may15_estimate_after_0900 = may15_after_0900[
        may15_after_0900['silo'] == 'estimate_EstimateResource']
    title(
        'Response size vs request time for estimates, '
        'comparing two days')
    xlabel('Elapsed (s)')
    xlim(0, 30)
    ylabel('Number of rows returned')
    plot(
        may14_estimate_after_0900['elapsed'],
        may14_estimate_after_0900['rows'], 'r.', label='May 14')
    plot(
        may15_estimate_after_0900['elapsed'],
        may15_estimate_after_0900['rows'], 'b.', label='May 15')
    legend()
    Out[8]:
    Bingo!

  19. In [1]: import pandas as pd
    store = pd.HDFStore('/Volumes/FreshBooks/data/store.h5')
    may07 = store['may07']
    If we want to get a graph over time like those you get from munin and graphite, we'll need to resample() our DataFrame:
    In [2]: from datetime import timedelta
    import numpy as np
    first_minutes = may07[
        may07.index < (may07.index[0] + timedelta(minutes=10))]
    first_minutes['elapsed'].resample('T', how=np.median)
    Out[2]: posix_timestamp
    2012-05-07 04:00:00    0.033588
    2012-05-07 04:01:00    0.032231
    2012-05-07 04:02:00    0.031700
    2012-05-07 04:03:00    0.031906
    2012-05-07 04:04:00    0.033401
    2012-05-07 04:05:00    0.032794
    2012-05-07 04:06:00    0.031914
    2012-05-07 04:07:00    0.031836
    2012-05-07 04:08:00    0.034356
    2012-05-07 04:09:00    0.030676
    Freq: T, dtype: float64
    Then we can use the handy DataFrame.plot() method to get it displayed easily:
    In [3]: ylabel('Median response time by minute (s)')
    may07['elapsed'].resample('T', how='median').plot()
    Out[3]:
    We can use other callables to indicate how we'd like the downsampling to occur, such as a percentile function:

  20. In [4]: elapsed = may07['elapsed']
    ylabel('Seconds')
    elapsed.resample(
    'T', how=lambda x: np.percentile(x, 25)
    ).plot(label="25%ile", color='red')
    elapsed.resample(
    'T', how=lambda x: np.percentile(x, 50)
    ).plot(label="50%ile", color='orange')
    elapsed.resample(
    'T', how=lambda x: np.percentile(x, 75)
    ).plot(label="75%ile", color='green')
    legend()
    Out[4]:
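    An equivalent formulation (my sketch, not from the original deck) uses pandas' own Series.quantile() inside the callable instead of np.percentile:
    elapsed.resample(
        'T', how=lambda x: x.quantile(0.75)
    ).plot(label="75%ile", color='green')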

  21. Log analysis with pandas
    As presented at PyCon US 2013
    http://taaviburns.ca/presentations/log_analysis_with_pandas2013/
    IPython Notebook
    Don't forget to have sufficient requirements installed first. Particularly (as of this writing) Pandas 0.11-dev from github.
    1. Introduction (nbviewer)
    2. Pandas Crash Course (nbviewer)
    3. ZOMG LOGS (nbviewer)
    4. Histogram Example (nbviewer)
    5. Scatterplots (nbviewer)
    6. Resampling (nbviewer)
    PDF
    1. Introduction
    2. Pandas Crash Course
    3. ZOMG LOGS
    4. Histogram Example
    5. Scatterplots
    6. Resampling
