Slide 1

Slide 1 text

Security Onion Conference 2018

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

Pandas is a fast, structure-centric library built around data manipulation and analysis. Series: like a fancy array, but with more built-in operations. DataFrame: a 2D (usually) tabular data structure. Similar to a programmable spreadsheet, but in a good way. Pandas is extremely popular with scientists and data wranglers, so tons of good documentation and Stack Overflow examples exist. That's pretty much how I learned it. (Image: CC BY-SA 2.5 es, https://commons.wikimedia.org/w/index.php?curid=531560)
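As a quick illustration of the two core structures (the column names and values here are made up, not from the talk):

    import pandas as pd

    # A Series is a labeled 1D array with vectorized operations built in.
    ports = pd.Series([80, 443, 8080], name='dest_port')

    # A DataFrame is a 2D table of named columns, each column a Series.
    df = pd.DataFrame({
        'host': ['example.com', 'example.org', 'example.net'],
        'bytes': [1024, 52313, 770],
    })
    print(df['bytes'].mean())  # built-in column operations, e.g. the average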

Slide 4

Slide 4 text

From a CSV file. From a TSV file (e.g. Zeek). From JSON (one record per row). A sketch of each loading path follows.
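A minimal sketch of the three paths; the file names are placeholders, and the Zeek comment-handling details are an assumption about how the logs were exported:

    import pandas as pd

    # CSV: pandas infers columns from the header row.
    df_csv = pd.read_csv('flows.csv')

    # TSV (e.g. Zeek): tab-separated; Zeek logs also carry '#' header lines.
    df_tsv = pd.read_csv('http.log', sep='\t', comment='#', header=None)

    # JSON with one record per row ("JSON lines"):
    df_json = pd.read_json('http.json', lines=True)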

Slide 5

Slide 5 text

Huntlib is an open-source Python library to help with data analysis tasks common to Threat Hunting operations. Thanks to Target for allowing me to make this available to the community! Huntlib makes it easy to retrieve Elastic or Splunk search results as DataFrames via the ElasticDF() and SplunkDF() objects. It also has other functions, such as entropy and string similarity calculations, and will continue to grow over time. https://github.com/target/huntlib
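A sketch of the search-to-DataFrame workflow based on the huntlib README; the hostnames, credentials, queries, and exact parameter names are placeholders and may differ by huntlib version:

    from huntlib.elastic import ElasticDF
    from huntlib.splunk import SplunkDF

    # Elastic search results straight into a DataFrame.
    e = ElasticDF(url='https://elastic.example.com:9200')
    es_df = e.search_df(lucene='event_type:http', days=1)

    # Same idea for Splunk.
    s = SplunkDF(host='splunk.example.com', username='hunter', password='...')
    sp_df = s.search_df(spl='search index=zeek sourcetype=bro_http',
                        start_time='-1d@d', end_time='@d')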

Slide 6

Slide 6 text

Pandas tries to infer datatypes, but is often wrong. Default is to store as ‘object’ (native Python string). This increases memory use and restricts operations you can perform on the data. If you know all the columns and datatypes in advance, you can provide them to read_csv() / read_json(), but you can also fix them up later…
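Both approaches sketched below; the column names are illustrative, not from the talk's dataset:

    import pandas as pd

    # Declare types up front so pandas doesn't fall back to 'object'.
    df = pd.read_csv(
        'http.csv',
        dtype={'host': str, 'method': 'category', 'resp_bytes': 'int64'},
        parse_dates=['ts'],
    )

    # ...or fix them up later on an existing DataFrame.
    df['method'] = df['method'].astype('category')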

Slide 7

Slide 7 text

When working with new datasets, a good first step might be to just look at a few rows to see what’s there.
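For example, with df being whatever DataFrame you just loaded:

    df.head()     # first 5 rows by default
    df.sample(5)  # or a random handful, if the top rows aren't representative
    df.dtypes     # and check what types pandas inferred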

Slide 8

Slide 8 text

Dropping columns you don’t need is also a good way to save memory and speed up operations. It also avoids cluttering the output and makes your results easier to read.
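Two equivalent ways to slim the frame down; the column names are illustrative:

    # Drop the columns you know you won't use...
    df = df.drop(columns=['uri', 'user_agent'])

    # ...or keep only the ones you will.
    df = df[['ts', 'host', 'method', 'orig_bytes', 'resp_bytes']]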

Slide 9

Slide 9 text

Dropping rows you don’t need also saves memory and makes things faster. In our case, we’re interested in network transactions, so we want to drop rows that don’t have any bytes transferred (items that aren’t network flows).
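A sketch of that filter using Zeek-style byte-count columns (the names are an assumption about this dataset):

    # Keep only rows that actually transferred bytes, i.e., real network flows.
    df = df[(df['orig_bytes'] > 0) | (df['resp_bytes'] > 0)]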

Slide 10

Slide 10 text

Look especially for categorical data, which may be ‘disguised’ as some other type. If only a few specific values are allowed, it’s probably categorical. For example, dest ports look like ints but are usually categorical. Ditto HTTP methods and ‘http’ vs ‘https’. Categories are usually much more memory efficient than other types, too!
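Converting a few 'disguised' categorical columns, with illustrative names:

    # Ports, HTTP methods, and protocol names all make good categories.
    for col in ['method', 'id.resp_p', 'service']:
        df[col] = df[col].astype('category')

    df.memory_usage(deep=True)  # compare before/after to see the savings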

Slide 11

Slide 11 text

Our dataset here consists of basic info about outgoing HTTP transactions. We'd like to know the least common verbs, which may indicate suspicious activity. This may not look the way you expected. The groupby() doesn't drop columns outside the grouping, so count() can return a different value for each column (the number of rows with a value, skipping NaN or null). Every row is guaranteed a timestamp, so we use that as the source for sorting, though it doesn't really matter much in this case.
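A sketch of that grouping; 'method' and 'ts' are assumed Zeek-style column names:

    # Count transactions per HTTP verb, least common first.
    df.groupby('method').count().sort_values('ts')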

Slide 12

Slide 12 text

Select just the PUT rows to see what’s up. Something certainly looks suspicious…
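Selecting those rows is a one-line boolean filter (column name assumed as before):

    puts = df[df['method'] == 'PUT']
    puts.head()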

Slide 13

Slide 13 text

Pandas makes it easy to apply an operation to a whole column of numbers. In this case, we’re dividing by (1024 * 1024) to convert bytes transferred into megabytes transferred. This is a bit easier to read.
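The column-wide arithmetic looks like this, with an illustrative byte-count column:

    # Dividing a whole column at once; pandas broadcasts the operation.
    df['resp_mb'] = df['resp_bytes'] / (1024 * 1024)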

Slide 14

Slide 14 text

A box plot is a great way to visualize the distribution of data in a column. You can easily see whether the data skews low, high, or center. Outlier detection is also pretty easy. [Figure: annotated box plot labeling an outlier point, the upper bound, the 75%, 50% ("median"), and 25% quartiles, the lower bound, and the interquartile range.]
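The talk's plots were likely built with Plotly/Cufflinks (see the resources slide); as a minimal stand-in, pandas' built-in matplotlib plotting does the same job:

    # Distribution of one numeric column as a box plot.
    df['resp_mb'].plot(kind='box')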

Slide 15

Slide 15 text

Let’s look at bytes for ONLY outgoing POST/PUT requests. The low value is pretty low (100 bytes or so), the high value is a little over 10,000, and most of the other values seem reasonable for outgoing HTTP traffic. We’re plotting outliers as points, and look at that one point at the top. More than 1,000,000,000 bytes! You should probably check that out. Any ideas how?
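One way to chase down that point: filter on the byte count and inspect the matching rows. The 1e9 threshold comes from eyeballing the plot, and the column names are assumptions:

    big = df[df['method'].isin(['POST', 'PUT']) & (df['orig_bytes'] > 1e9)]
    big[['ts', 'id.orig_h', 'host', 'orig_bytes']]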

Slide 16

Slide 16 text

Time Series Analysis (TSA) is basically just taking a good look at how things vary over time. Typically you 'bin' the data rather than plotting individual data points. This smooths things out a bit and also reduces the number of data points. Here we bin (or 'resample') the data into 15-minute buckets. The value for each bucket is the number of POST/PUT transactions to the Internet that fall within it. We also call dropna() to remove any buckets with no data (collection errors?).
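A sketch of the binning, assuming 'ts' is a parsed datetime column; the dropna() call mirrors the slide's cleanup of empty buckets:

    # Count POST/PUT transactions in 15-minute buckets.
    posts = df[df['method'].isin(['POST', 'PUT'])]
    counts = posts.resample('15min', on='ts')['method'].count().dropna()
    counts.plot()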

Slide 17

Slide 17 text

This shows about a month of raw transaction volume data. Which parts are unusually active? Maybe you can eyeball this, but there's a lot of noise. If you have enough history, computing and correcting for seasonality may make things clearer.

Slide 18

Slide 18 text

Seasonality refers to the normal pattern of activity that recurs over specific regular intervals (e.g., days, weeks, months). We'll try a 'daily' period. Our data is binned in 15-minute buckets, so the frequency is 24 hours * 60 minutes, divided by 15 (the bucket size), or 96 observations per day. There is a clear seasonal component. The residual is the observed data corrected for the daily pattern and trend.
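A sketch of the decomposition with statsmodels, continuing with the counts series from the resampling step; depending on your statsmodels version the argument is period= or the older freq=:

    from statsmodels.tsa.seasonal import seasonal_decompose

    # 24 hours * 60 minutes / 15-minute buckets = 96 observations per day.
    decomp = seasonal_decompose(counts, period=96)
    decomp.plot()  # observed, trend, seasonal, and residual panels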

Slide 19

Slide 19 text

Here's a single iteration of the pattern. It goes from midnight to midnight, with clear peaks around 13:00 and 23:00 UTC. It otherwise looks fairly noisy, but roughly the same noise is repeated each day in our dataset. If we subtract this from our raw observations, we can more easily see the true peaks and valleys.
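A sketch of inspecting one cycle and then working with the corrected series, continuing from the decomp object above:

    # One full day (96 buckets) of the repeating seasonal pattern.
    decomp.seasonal[:96].plot()

    # The residual already holds the observations with the seasonal
    # pattern and trend subtracted out.
    resid = decomp.resid.dropna()
    resid.plot()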

Slide 20

Slide 20 text

This version is much less noisy and far easier to interpret! The trend has been corrected for, so points center around 0 instead of fluctuating wildly. Daily seasonality has been removed, so the remaining dips and spikes are due to something else. A good outlier threshold might be between 400 and 600.

Slide 21

Slide 21 text

A common method of identifying outliers is to pick a threshold ‘n’, and say “anything above/below n standard deviations from the mean is an outlier.” The exact value of n is as much art as science sometimes, but 3 or 4 is pretty common. Standard deviation is just a measure of how close datapoints tend to be to the mean.
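A sketch of that rule applied to the residual series from the decomposition above:

    # Flag residual buckets more than n standard deviations from the mean.
    n = 3  # 3 or 4 is a common, somewhat artful choice
    resid = decomp.resid.dropna()
    outliers = resid[(resid - resid.mean()).abs() > n * resid.std()]
    print(outliers)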

Slide 22

Slide 22 text

Locate Data
• Availability
• Location
• Collection
• Format

Read DataFrame
• CSV
• JSON
• Elastic
• Splunk

Prepare Data
• Drop columns
• Drop rows
• Clean / Normalize
• Fix datatypes

Analyze
• Explore
• Reduce
• Query

Visualize
• Outlier identification
• Time Series Analysis (TSA)
• Seasonality

Slide 23

Slide 23 text

Python – Popular data analysis language • python.org
Project Jupyter – Interactive Python (et al.) environment • jupyter.org
NumPy – Scientific computing for Python • numpy.org
Pandas – High-performance data analysis • pandas.pydata.org
Plotly – Interactive data visualization • plot.ly
Cufflinks – Pandas bindings for Plotly • github.com/santosjorge/cufflinks
Huntlib – Threat Hunting analysis package • github.com/target/huntlib
Statsmodels – Statistics library • statsmodels.org
The ThreatHunting Project – 'Hunter' docker image containing all the above tools and libs • threathunting.net/hunting-platform

David J. Bianco
@DavidJBianco