A brief introduction to finding evil in network data with Python and Pandas.
As presented at Security Onion Conference 2018 (with bonus slides)
Fast, structure-centric library built around data
manipulation and analysis.
Series – Like a fancy array, but with more built-in functionality.
DataFrame – 2D (usually tabular) data structure. Similar to a
programmable spreadsheet, but in code.
Extremely popular with scientists and data wranglers,
so tons of good documentation and Stack Overflow
examples exist. That’s pretty much how I learned it.
From a CSV file
From a TSV file (e.g. Zeek)
From JSON (one record per row)
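A minimal sketch of each loader. Inline sample data stands in for real files here, and the column names are made up for illustration; in practice you'd pass a filename such as a Zeek log path:

```python
import io

import pandas as pd

# Inline sample data in place of real files on disk
csv_data = io.StringIO("ts,src,dest_port\n1,10.0.0.1,443\n2,10.0.0.2,80\n")
tsv_data = io.StringIO("ts\tsrc\tdest_port\n1\t10.0.0.1\t443\n")
json_data = io.StringIO('{"ts": 1, "src": "10.0.0.1", "dest_port": 443}\n')

df_csv = pd.read_csv(csv_data)                 # from a CSV file
df_tsv = pd.read_csv(tsv_data, sep="\t")       # from a TSV file (e.g. Zeek)
df_json = pd.read_json(json_data, lines=True)  # from JSON, one record per row
```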
Huntlib is an open source Python library that helps with data analysis tasks
common to threat hunting operations.
Thanks to Target for allowing me to make this available to the community!
Huntlib makes it easy to retrieve Elastic or Splunk search results as
DataFrames via the ElasticDF() and SplunkDF() objects.
It also has other functions, such as entropy and string similarity calculations,
and will continue to grow over time.
Pandas tries to infer datatypes, but is often wrong. The default is to store
values as ‘object’ (native Python strings). This increases memory use and
restricts the operations you can perform on the data.
If you know all the columns and datatypes in advance, you can provide them
to read_csv() / read_json(), but you can also fix them up later…
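Both approaches, sketched with hypothetical column names and dtypes:

```python
import io

import pandas as pd

data = io.StringIO("method,dest_port,bytes_out\nGET,443,1200\nPUT,443,900000\n")

# Provide the datatypes up front if you know them...
df = pd.read_csv(data, dtype={"method": "category",
                              "dest_port": "category",
                              "bytes_out": "int64"})

# ...or fix a column up later with astype()
df["method"] = df["method"].astype("category")
```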
When working with new datasets, a good first step might be to just look at a few
rows to see what’s there.
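head() and a couple of related attributes cover that first look; a quick sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({"method": ["GET", "PUT", "POST"],
                   "bytes_out": [1200, 900000, 3400]})

print(df.head())   # the first 5 rows (here, all 3)
print(df.dtypes)   # inferred datatype of each column
print(df.shape)    # (number of rows, number of columns)
```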
Dropping columns you don’t need is also a good way to save memory and
speed up operations. It also avoids cluttering the output and makes your results
easier to read.
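For example (the ‘uid’ column here is a made-up stand-in for a column you don’t need):

```python
import pandas as pd

df = pd.DataFrame({"ts": [1, 2], "method": ["GET", "PUT"], "uid": ["a", "b"]})

# Drop the column we don't need for this analysis
df = df.drop(columns=["uid"])
```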
Dropping rows you don’t need also saves memory and makes things faster.
In our case, we’re interested in network transactions, so we want to drop rows
that don’t have any bytes transferred (items that aren’t network flows).
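Row filtering uses a boolean mask; a sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"method": ["GET", "CONNECT", "PUT"],
                   "bytes_out": [1200, 0, 900000]})

# Keep only rows where bytes were actually transferred
df = df[df["bytes_out"] > 0]
```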
Look especially for categorical data, which may be ‘disguised’ as some other
type. If only a few specific values are allowed, it’s probably categorical. For
example, dest ports look like ints but are usually categorical. Ditto HTTP methods
and ‘http’ vs ‘https’.
Categories are usually much more memory efficient than other types, too!
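A sketch of the conversion and the memory difference, using a synthetic port column:

```python
import pandas as pd

# Dest ports look like ints, but only a few distinct values actually occur
df = pd.DataFrame({"dest_port": [443, 80, 443, 443, 80] * 1000})

before = df["dest_port"].memory_usage(deep=True)
df["dest_port"] = df["dest_port"].astype("category")
after = df["dest_port"].memory_usage(deep=True)

# The categorical column stores small integer codes plus a short
# lookup table of distinct values, so 'after' is much smaller
```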
Our dataset here consists of basic info about outgoing HTTP transactions. We’d
like to know the least common verbs, which may indicate suspicious activity.
This may not look as you expected. The groupby() doesn’t drop columns outside
the grouping, so count() can return a different value for each column (the
number of rows with a value, skipping NaN or null).
Every row is guaranteed a timestamp, so we use that as the source for sorting,
though it doesn’t really matter much in this case.
Select just the PUT rows to see what’s up. Something certainly looks suspicious…
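The groupby-sort-select pattern, sketched on toy data (column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"ts": pd.date_range("2018-10-01", periods=6, freq="h"),
                   "method": ["GET", "GET", "GET", "POST", "PUT", "GET"],
                   "bytes_out": [100, 200, 150, 300, 900000, 120]})

# count() counts non-null values per column; every row has a timestamp,
# so use 'ts' as the source column, then sort to find the rarest verbs
counts = df.groupby("method").count()["ts"].sort_values()

# Select just the PUT rows to see what's up
puts = df[df["method"] == "PUT"]
```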
Pandas makes it easy to apply an operation to a
whole column of data. In this case, we’re
dividing by (1024 * 1024) to convert bytes
to megabytes. This is a bit easier to read.
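The whole-column division looks like this (column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"bytes_out": [1048576, 5242880, 157286400]})

# Divide the entire column at once to convert bytes to megabytes
df["mb_out"] = df["bytes_out"] / (1024 * 1024)
```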
A box plot is a great way to visualize the distribution of data in a
column. You can easily see whether the data skews low, high or
center. Outlier detection is also pretty easy.
Let’s look at bytes for ONLY outgoing POST/PUT
requests. The low value is pretty low (100 bytes or
so), the high value is a little over 10,000, and most
of the other values seem reasonable for outgoing
requests. We’re plotting outliers as points, and
look at that one point at the top: more than
1,000,000,000 bytes. You should probably check
that out.
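The statistics a box plot draws can be computed directly; a sketch with made-up byte counts including one huge transaction (the plot call itself is commented out, since it needs a plotting backend such as matplotlib or cufflinks):

```python
import pandas as pd

bytes_out = pd.Series([120, 800, 2500, 4000, 9500, 11000, 1_200_000_000])

# A box plot shows the quartiles, plus outlier 'fences' at 1.5 * IQR
q1, median, q3 = bytes_out.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1
outliers = bytes_out[(bytes_out < q1 - 1.5 * iqr) |
                     (bytes_out > q3 + 1.5 * iqr)]

# To actually draw it:
# bytes_out.plot(kind="box")
```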
Time Series Analysis (TSA) is basically just taking a good look at how things vary over
time. Typically you ‘bin’ the data rather than plotting individual data points. This
smooths things out a bit and also reduces the number of data points.
Here we bin (or ‘resample’) the data into 15 minute buckets. The value for each is the
number of POST/PUT transactions to the Internet for all points in that bucket.
We also call dropna() to remove any buckets with no data (collection errors?).
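A sketch of the resampling step on simulated per-minute transactions (dropna() matters more when a bucket’s aggregate can come out NaN, e.g. after joining columns with missing data):

```python
import pandas as pd

# Simulate one outgoing POST per minute over two hours
idx = pd.date_range("2018-10-01", periods=120, freq="min")
df = pd.DataFrame({"method": ["POST"] * 120}, index=idx)

# Bin into 15-minute buckets; each value is the transaction count
volume = df.resample("15min")["method"].count().dropna()
```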
This shows about a month of raw transaction
volume. Which parts are unusually high?
Maybe you can eyeball this, but there’s a lot of
noise. If you have enough history, computing
and correcting for seasonality may make things
more clear. Seasonality refers to the normal
pattern of activity that recurs over specific
regular intervals (e.g., days or weeks).
We’ll try a ‘daily’ period. Our data is binned by
15 minutes, so the frequency is 24 hours * 60
minutes, divided by 15 (the bucket size): 96
buckets per day.
There is a clear seasonal
component. The residual
is the observed data
corrected for the daily
pattern and trend.
Here’s a single iteration of
the pattern. It goes from
midnight to midnight, with
clear peaks around 13:00
and 23:00 UTC.
It otherwise looks fairly
noisy, but roughly the
same noise is repeated
each day in our dataset.
If we subtract this from our
raw observations, we can
more easily see the true
peaks and valleys.
This version is much less
noisy, and very much
easier to interpret!
The trending has been
corrected for, so points
center around 0 instead
of fluctuating wildly.
Daily seasonality has
been removed, so
remaining dips and spikes
are due to something else.
A good outlier threshold
might be between 400
A common method of identifying outliers is to pick a threshold ‘n’, and say “anything
above/below n standard deviations from the mean is an outlier.” The exact value of n is
as much art as science sometimes, but 3 or 4 is pretty common.
Standard deviation is just a measure of how close datapoints tend to be to the mean.
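A sketch of that rule on synthetic residuals with one injected anomaly, using n = 3:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
residual = pd.Series(rng.normal(0, 50, 200))  # residuals centered on 0
residual.iloc[100] = 2000                     # one injected anomaly

n = 3  # 3 or 4 standard deviations is a common choice
mean, std = residual.mean(), residual.std()

# Anything more than n standard deviations from the mean is an outlier
outliers = residual[(residual - mean).abs() > n * std]
```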
DataFrame:
• Drop columns
• Clean / Normalize
Python – Popular data analysis language
Project Jupyter – Interactive Python (et al)
Numpy – Scientific computing for Python
Pandas – High-perf data analysis
Plotly – Interactive data visualization
Cufflinks – Pandas bindings for Plotly
Huntlib – Threat Hunting analysis package
Statsmodels – Statistics library
David J. Bianco
The ThreatHunting Project – ‘Hunter’ docker image
containing all the above tools and libs