Introduction to Data Analysis with Security Onion (and Other Open Source Tools)

Introduction to Data Analysis with Security Onion (and Other Open Source Tools)

A brief introduction to finding evil in network data with Python and Pandas.

As presented at Security Onion Conference 2018 (with bonus slides)

49d635b47da1fee5d0972745390e0633?s=128

David J. Bianco

October 19, 2018
Tweet

Transcript

  1. Security Onion Conference 2018

  2. None
  3. Fast, structure-centric library built around data manipulation and analysis. Series

    – Like a fancy array, but with more built-in operations DataFrame – 2D (usually) tabular data structure. Similar to a programmable spreadsheet, but in a good way. Extremely popular with scientists and data wranglers, so tons of good documentations and Stack Overflow examples exist. That’s pretty much how I learned it. CC BY-SA 2.5 es, https://commons.wikimedia.org/w/index.php?curid=531560
  4. From a CSV file From a TSV file (e.g. Zeek)

    From JSON (one record per row)
  5. Huntlib is a an Open Source Python library to help

    with data analysis tasks common to Threat Hunting operations. Thanks to Target for allowing me to make this available to the community! Huntlib makes it easy to retrieve Elastic or Splunk search results as DataFrames via the ElasticDF() and SplunkDF() objects. It also has other functions, such as entropy and string similarity calculations, and will continue to grow over time. https://github.com/target/huntlib
  6. Pandas tries to infer datatypes, but is often wrong. Default

    is to store as ‘object’ (native Python string). This increases memory use and restricts operations you can perform on the data. If you know all the columns and datatypes in advance, you can provide them to read_csv() / read_json(), but you can also fix them up later…
  7. When working with new datasets, a good first step might

    be to just look at a few rows to see what’s there.
  8. Dropping columns you don’t need is also a good way

    to save memory and speed up operations. It also avoids cluttering the output and makes your results easier to read.
  9. Dropping rows you don’t need also saves memory and makes

    things faster. In our case, we’re interested in network transactions, so we want to drop rows that don’t have any bytes transferred (items that aren’t network flows).
  10. Look especially for categorical data, which may be ‘disguised’ as

    some other type. If only a few specific values are allowed, it’s probably categorical. For example, dest ports look like ints but are usually categorical. Ditto HTTP methods and ‘http’ vs ‘https’. Categories are usually much more memory efficient than other types, too!
  11. Our dataset here consists of basic info about outgoing HTTP

    transactions. We’d like to know the least common verbs, which may indicate suspicious activity. This may not look as you expected. The groupby() doesn’t drop columns outside the grouping, so count() could returns a different value for each column (the number of rows with a value, skipping NaN or null). Every row is guaranteed a timestamp, so we use that as the source for sorting, though it doesn’t really matter much in this case.
  12. Select just the PUT rows to see what’s up. Something

    certainly looks suspicious…
  13. Pandas makes it easy to apply an operation to a

    whole column of numbers. In this case, we’re dividing by (1024 * 1024) to convert bytes transferred into megabytes transferred. This is a bit easier to read.
  14. A box plot is a great way to visualize the

    distribution of data in a column. You can easily see whether the data skews low, high or center. Outlier detection is also pretty easy. Outlier! Upper Bound 75% 50% (“Median”) 25% Lower Bound Interquartile Range
  15. Let’s look at bytes for ONLY outgoing POST/PUT requests. The

    low value is pretty low (100 bytes or so), the high value is a little over 10,000, and most of the other values seem reasonable for outgoing HTTP traffic. We’re plotting outliers as points, and look at that one point at the top. More than 1,000,000,000 bytes! You should probably check that out. Any ideas how?
  16. Time Series Analysis (TSA) is basically just taking a good

    look at how things vary over time. Typically you ‘bin’ the data rather than plotting individual data points. This smooths things out a bit and also reduces the number of data points. Here we bin (or ‘resample’) the data into 15 minute buckets. The value for each is the number of POST/PUT transactions to the Internet for all points in that bucket. We also call dropna() to remove any buckets with no data (collection errors?).
  17. This shows about a month of raw transaction volume data.

    Which parts are unusually active? Maybe you can eyeball this, but there’s a lot of noise. If you have enough history, computing and correcting for seasonality may make things more clear.
  18. Seasonality refers to the normal pattern of activity that over

    specific regular intervals (e.g., days, weeks, months). We’ll try a ‘daily’ period. Our data is binned by 15 mins, so frequency is 24 hours * 60 minutes, divided by 15 (bucket size). There is a clear seasonal component. The residual is the observed data corrected for the daily pattern and trend.
  19. Here’s a single iteration of the pattern. It goes from

    midnight to midnight, with clear peaks around 13:00 and 23:00 UTC. It otherwise looks fairly noisy, but the roughly the same noise is repeated each day in our dataset. If we subtract this from our raw observations, we can more easily see the true peaks and valleys.
  20. This version is much less noisy, and very much easier

    to interpret! The trending has been corrected for, so points center around 0 instead of fluctuating wildly. Daily seasonality has been removed, so remaining dips and spikes are due to something else. A good outlier threshold might be between 400 and 600.
  21. A common method of identifying outliers is to pick a

    threshold ‘n’, and say “anything above/below n standard deviations from the mean is an outlier.” The exact value of n is as much art as science sometimes, but 3 or 4 is pretty common. Standard deviation is just a measure of how close datapoints tend to be to the mean.
  22. •Availability •Location •Collection •Format Locate Data •CSV •JSON •Elastic •Splunk

    Read DataFrame •Drop columns •Drop rows •Clean / Normalize •Fix datatypes Prepare Data •Explore •Reduce •Query Analyze •Outlier identification •Time Series Analysis (TSA) •Seasonality Visualize
  23. Python – Popular data analysis language • python.org Project Jupyter

    – Interactive Python (et al) environment • jupyter.org Numpy – Scientific computing for Python • numpy.org Pandas – High-perf data analysis • pandas.pydata.org Plotly – Interactive data visualization • plot.ly Cufflinks – Pandas bindings for Plotly • github.com/santosjorge/cufflinks Huntlib – Threat Hunting analysis package • github.com/target/huntlib Statsmodels – Statistics library • statsmodels.org David J. Bianco @DavidJBianco The ThreatHunting Project – ‘Hunter’ docker image containing all the above tools and libs • threathunting.net/hunting-platform