
Introduction to Data Analysis with Security Onion (and Other Open Source Tools)

A brief introduction to finding evil in network data with Python and Pandas.

As presented at Security Onion Conference 2018 (with bonus slides)

David J. Bianco

October 19, 2018

Transcript

  1. Security Onion Conference 2018



  3. Pandas is a fast, structure-centric library built around data
    manipulation and analysis.
    Series – Like a fancy array, but with more built-in operations.
    DataFrame – A 2D (usually tabular) data structure. Similar to a
    programmable spreadsheet, but in a good way.
    Extremely popular with scientists and data wranglers, so tons of good
    documentation and Stack Overflow examples exist. That’s pretty much
    how I learned it.
    CC BY-SA 2.5 es, https://commons.wikimedia.org/w/index.php?curid=531560


  4. From a CSV file
    From a TSV file (e.g. Zeek)
    From JSON (one record per row)
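The three load paths above can be sketched as follows. Inline strings stand in for the real files, and the column names (`ts`, `method`, `bytes`) are illustrative, not from the deck:

```python
import io

import pandas as pd

# From a CSV file (a file path would normally go here)
csv_data = io.StringIO("ts,method,bytes\n2018-10-01,GET,512\n2018-10-01,PUT,2048\n")
df_csv = pd.read_csv(csv_data)

# From a TSV file (e.g. Zeek logs are tab-separated)
tsv_data = io.StringIO("ts\tmethod\tbytes\n2018-10-01\tGET\t512\n")
df_tsv = pd.read_csv(tsv_data, sep="\t")

# From JSON, one record per row (JSON Lines format)
json_data = io.StringIO('{"ts": "2018-10-01", "method": "GET", "bytes": 512}\n')
df_json = pd.read_json(json_data, lines=True)
```

In each case you get back a DataFrame ready for the cleanup steps that follow.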


  5. Huntlib is an open source Python library to help with data analysis tasks
    common to Threat Hunting operations.
    Thanks to Target for allowing me to make this available to the community!
    Huntlib makes it easy to retrieve Elastic or Splunk search results as
    DataFrames via the ElasticDF() and SplunkDF() objects.
    It also has other functions, such as entropy and string similarity calculations,
    and will continue to grow over time.
    https://github.com/target/huntlib


  6. Pandas tries to infer datatypes, but is often wrong. Default is to store as ‘object’
    (native Python string). This increases memory use and restricts operations you
    can perform on the data.
    If you know all the columns and datatypes in advance, you can provide them
    to read_csv() / read_json(), but you can also fix them up later…
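Both approaches can be sketched like this. The Zeek-style column names (`ts`, `id_resp_p`) are assumptions for illustration:

```python
import io

import pandas as pd

raw = "ts,id_resp_p,resp_bytes\n2018-10-01 00:00:00,443,1024\n"

# Option 1: declare datatypes up front, if you know them
df = pd.read_csv(io.StringIO(raw),
                 dtype={"id_resp_p": "category"},
                 parse_dates=["ts"])

# Option 2: read first, fix the types up later
df2 = pd.read_csv(io.StringIO(raw))
df2["ts"] = pd.to_datetime(df2["ts"])
df2["id_resp_p"] = df2["id_resp_p"].astype("category")
```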


  7. When working with new datasets, a good first step might be to just look at a few
    rows to see what’s there.
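A quick first look might be as simple as (toy data for illustration):

```python
import pandas as pd

df = pd.DataFrame({"method": ["GET", "PUT", "POST", "GET", "GET"],
                   "resp_bytes": [512, 2048, 1024, 256, 128]})

print(df.head(3))    # peek at the first 3 rows
print(df.dtypes)     # what type did each column get?
print(df.shape)      # (rows, columns)
```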


  8. Dropping columns you don’t need is also a good way to save memory and
    speed up operations. It also avoids cluttering the output and makes your results
    easier to read.
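Dropping columns is one call; the column names here (`uid`) are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"uid": ["CUM0KZ3MLUfNB0cl11", "C9rXSW3KSpTYvPrlI1"],
                   "method": ["GET", "PUT"],
                   "resp_bytes": [512, 2048]})

# Keep only the columns we actually care about
df = df.drop(columns=["uid"])
```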


  9. Dropping rows you don’t need also saves memory and makes things faster.
    In our case, we’re interested in network transactions, so we want to drop rows
    that don’t have any bytes transferred (items that aren’t network flows).
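Boolean indexing handles this kind of row filtering; a minimal sketch with assumed column names:

```python
import pandas as pd

df = pd.DataFrame({"orig_bytes": [0, 1024, 0, 512],
                   "resp_bytes": [0, 2048, 0, 256]})

# Keep only rows where bytes were actually transferred in either direction
df = df[(df["orig_bytes"] > 0) | (df["resp_bytes"] > 0)]
```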


  10. Look especially for categorical data, which may be ‘disguised’ as some other
    type. If only a few specific values are allowed, it’s probably categorical. For
    example, dest ports look like ints but are usually categorical. Ditto HTTP methods
    and ‘http’ vs ‘https’.
    Categories are usually much more memory efficient than other types, too!
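The memory savings are easy to demonstrate. A sketch with toy repeated data:

```python
import pandas as pd

df = pd.DataFrame({"method": ["GET", "PUT", "GET", "POST"] * 1000,
                   "id_resp_p": [80, 443, 80, 8080] * 1000})

before = df.memory_usage(deep=True).sum()

# Each distinct value is stored once; rows hold small integer codes
df["method"] = df["method"].astype("category")
df["id_resp_p"] = df["id_resp_p"].astype("category")

after = df.memory_usage(deep=True).sum()
```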


  11. Our dataset here consists of basic info about outgoing HTTP transactions. We’d
    like to know the least common verbs, which may indicate suspicious activity.
    This may not look the way you expected. The groupby() doesn’t drop columns
    outside the grouping, so count() can return a different value for each
    column (the number of rows with a value, skipping NaN or null).
    Every row is guaranteed a timestamp, so we use that as the source for sorting,
    though it doesn’t really matter much in this case.
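The least-common-verbs query can be sketched like this (toy data; real input would be the HTTP log loaded earlier):

```python
import pandas as pd

df = pd.DataFrame({"ts": pd.date_range("2018-10-01", periods=6, freq="h"),
                   "method": ["GET", "GET", "GET", "POST", "PUT", "GET"]})

# count() gives per-column non-null counts; every row has a ts, so we
# sort on that column to rank verbs from least to most common
counts = df.groupby("method").count().sort_values("ts")
```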


  12. Select just the PUT rows to see what’s up. Something certainly looks suspicious…
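Selecting the PUT rows is the same boolean-indexing idiom as before (toy data for illustration):

```python
import pandas as pd

df = pd.DataFrame({"method": ["GET", "PUT", "POST", "PUT"],
                   "resp_bytes": [512, 1_073_741_824, 1024, 2048]})

# Keep only the PUT transactions for a closer look
puts = df[df["method"] == "PUT"]
```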


  13. Pandas makes it easy to apply an operation to a whole column of
    numbers.
    In this case, we’re dividing by (1024 * 1024) to convert bytes
    transferred into megabytes transferred.
    This is a bit easier to read.
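The column-wise arithmetic is a single vectorized expression; column names here are assumptions:

```python
import pandas as pd

df = pd.DataFrame({"orig_bytes": [1048576, 5242880, 1073741824]})

# Division applies to every value in the column at once
df["orig_mb"] = df["orig_bytes"] / (1024 * 1024)
```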


  14. A box plot is a great way to visualize the distribution of data in a
    column. You can easily see whether the data skews low, high, or
    center. Outlier detection is also pretty easy.
    (Figure labels, top to bottom: Outlier, Upper Bound, 75%, 50%
    (“Median”), 25%, Lower Bound; the 25%–75% span is the Interquartile
    Range.)
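The quantities a box plot draws can also be computed directly, which is handy for turning the visual into a filter. A sketch using the standard 1.5 × IQR whisker rule on toy byte counts:

```python
import pandas as pd

s = pd.Series([100, 120, 150, 200, 220, 250, 300, 9000])  # toy byte counts

q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1                      # interquartile range: middle 50% of data
lower_bound = q1 - 1.5 * iqr       # points outside these bounds are
upper_bound = q3 + 1.5 * iqr       # drawn as individual outlier points
outliers = s[(s < lower_bound) | (s > upper_bound)]
```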


  15. Let’s look at bytes for ONLY outgoing POST/PUT requests.
    The low value is pretty low (100 bytes or so), the high value is a
    little over 10,000, and most of the other values seem reasonable for
    outgoing HTTP traffic.
    We’re plotting outliers as points, and look at that one point at the
    top: more than 1,000,000,000 bytes!
    You should probably check that out. Any ideas how?


  16. Time Series Analysis (TSA) is basically just taking a good look at how things vary over
    time. Typically you ‘bin’ the data rather than plotting individual data points. This
    smooths things out a bit and also reduces the number of data points.
    Here we bin (or ‘resample’) the data into 15-minute buckets. The value for
    each bucket is the number of POST/PUT transactions to the Internet for all
    points in that bucket.
    We also call dropna() to remove any buckets with no data (collection errors?).
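The binning step can be sketched with resample() on a datetime-indexed frame (toy timestamps for illustration):

```python
import pandas as pd

# Toy POST/PUT transactions spread over an hour
ts = pd.to_datetime(["2018-10-01 00:01", "2018-10-01 00:05",
                     "2018-10-01 00:20", "2018-10-01 00:50"])
df = pd.DataFrame({"method": ["POST", "PUT", "POST", "POST"]}, index=ts)

# Bin into 15-minute buckets; the value is the transaction count per bucket
volume = df.resample("15min")["method"].count()
```

On real data you would follow this with dropna() to discard buckets with no data at all.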


  17. This shows about a month of raw transaction volume data.
    Which parts are unusually active? Maybe you can eyeball this, but
    there’s a lot of noise.
    If you have enough history, computing and correcting for seasonality
    may make things more clear.


  18. Seasonality refers to the normal pattern of activity that recurs over
    specific regular intervals (e.g., days, weeks, months).
    We’ll try a ‘daily’ period. Our data is binned by 15 mins, so the
    frequency is 24 hours * 60 minutes, divided by 15 (bucket size), or 96
    buckets per day.
    There is a clear seasonal component. The residual is the observed data
    corrected for the daily pattern and trend.


  19. Here’s a single iteration of the pattern. It goes from midnight to
    midnight, with clear peaks around 13:00 and 23:00 UTC.
    It otherwise looks fairly noisy, but roughly the same noise is repeated
    each day in our dataset.
    If we subtract this from our raw observations, we can more easily see
    the true peaks and valleys.
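The deck uses a statsmodels-style seasonal decomposition; the core idea of "subtract the average daily pattern" can be sketched in plain pandas. The data here is synthetic (a repeated sine wave with one injected spike):

```python
import numpy as np
import pandas as pd

# Four days of 15-minute buckets with a repeating daily pattern
idx = pd.date_range("2018-10-01", periods=4 * 96, freq="15min")
daily_pattern = np.sin(np.linspace(0, 2 * np.pi, 96))   # one "day"
observed = pd.Series(np.tile(daily_pattern, 4), index=idx)
observed.iloc[200] += 5                                 # inject an anomaly

# The seasonal component: average value for each time-of-day slot
seasonal = observed.groupby(observed.index.time).transform("mean")

# The residual: observations corrected for the daily pattern
residual = observed - seasonal
```

The injected spike stands out sharply in the residual even though the raw series is dominated by the daily cycle.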


  20. This version is much less noisy and much easier to interpret!
    The trend has been corrected for, so points center around 0 instead of
    fluctuating wildly.
    Daily seasonality has been removed, so remaining dips and spikes are
    due to something else.
    A good outlier threshold might be between 400 and 600.


  21. A common method of identifying outliers is to pick a threshold ‘n’, and say “anything
    above/below n standard deviations from the mean is an outlier.” The exact value of n is
    as much art as science sometimes, but 3 or 4 is pretty common.
    Standard deviation is just a measure of how close datapoints tend to be to the mean.
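The n-standard-deviations rule is a few lines of pandas (toy data with one obvious outlier):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 11, 13] * 2 + [95])

n = 3                      # 3 or 4 standard deviations is a common choice
mean, std = s.mean(), s.std()

# Anything more than n standard deviations from the mean is an outlier
outliers = s[(s - mean).abs() > n * std]
```

Note that a single extreme value inflates the standard deviation itself, so with very small samples this rule can miss outliers that eyeballing would catch.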


    Locate Data – Availability, Location, Collection, Format
    Read DataFrame – CSV, JSON, Elastic, Splunk
    Prepare Data – Drop columns, Drop rows, Clean / Normalize, Fix datatypes
    Analyze – Explore, Reduce, Query
    Visualize – Outlier identification, Time Series Analysis (TSA),
    Seasonality


  23. Python – Popular data analysis language
    • python.org
    Project Jupyter – Interactive Python (et al)
    environment
    • jupyter.org
    Numpy – Scientific computing for Python
    • numpy.org
    Pandas – High-performance data analysis
    • pandas.pydata.org
    Plotly – Interactive data visualization
    • plot.ly
    Cufflinks – Pandas bindings for Plotly
    • github.com/santosjorge/cufflinks
    Huntlib – Threat Hunting analysis package
    • github.com/target/huntlib
    Statsmodels – Statistics library
    • statsmodels.org
    David J. Bianco
    @DavidJBianco
    The ThreatHunting Project – ‘Hunter’ docker image
    containing all the above tools and libs
    • threathunting.net/hunting-platform
