The Statistics of Anomaly Detection

Slide 1

Slide 1 text

The Statistics of Anomaly Detection
ConFoo 2025
Philip Tellis / Akamai

Slide 2

Slide 2 text

Philip Tellis
Principal RUM Distiller @ Akamai
@bluesmoon

Slide 3

Slide 3 text

C’est mon pain (It’s my bread)
I bake bread

Slide 4

Slide 4 text

Story Time
● Bread was rationed: one 1 kg loaf per person
● Bread is handmade, so a loaf isn’t exactly 1 kg
● The loaf weights followed a normal distribution centered below 1 kg
● This showed that the baker was skimming
● The baker adjusted the process so loaves measured over 1 kg
● The new distribution showed that the only change was in who received the heavier loaves

Slide 5

Slide 5 text

mPulse: kinda the same but for web performance

Slide 6

Slide 6 text

Background on mPulse
● We collect web performance data from end user browsers using the boomerang JavaScript library.
● This is beaconed back to our performance analytics application – mPulse.
● The data is cleaned, filtered, sorted, and streamed to various backend tasks for storage, analysis, visualization, and alerting.
● We also have a task to do advanced data analysis and visualization.

Slide 7

Slide 7 text

Our Data
● Mostly time series (though we can ignore time for interesting views).
● Web performance data related to real user experiences.
● Data about multiple events during the page load process; analyze them independently or in relation to each other.
● Real Users implies an uncontrolled environment, and we can’t even be sure that all those users are human with non-malicious intent.

Slide 8

Slide 8 text

We can render individual waterfalls

Slide 9

Slide 9 text

Or combine multiple waterfalls

Slide 10

Slide 10 text

And of course time series
This isn’t really outside expected bounds…
But it is for this time of day

Slide 11

Slide 11 text

A better view of anomalies

Slide 12

Slide 12 text

Let’s learn some Stats!

Slide 13

Slide 13 text

Distributions (the shape of the data)
Site with very little data (this is my blog… please visit)

Slide 14

Slide 14 text

Distributions (the shape of the data)
Most sites with decent traffic

Slide 15

Slide 15 text

Distributions (the shape of the data)
Site with SPA & MPA navigation

Slide 16

Slide 16 text

Distributions (the shape of the data)
This site uses Early Hints to make things really fast

Slide 17

Slide 17 text

Distributions (the shape of the data)
Several distributions showing up together

Slide 18

Slide 18 text

Pop Quiz!
Define Average!

Slide 19

Slide 19 text

Central Tendency
● Arithmetic mean, Median, Mode, Geometric mean, Trimmed mean, and more
● Which one you use depends on the distribution
https://en.wikipedia.org/wiki/Central_tendency

Slide 20

Slide 20 text

Central Tendency
● Use Arithmetic Mean for Normal distributions
● Use Geometric Mean for Log-Normal distributions
● Use the median or trimmed mean or filtered median for other distributions
https://en.wikipedia.org/wiki/Central_tendency
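To see why the choice matters, here is a minimal sketch using only Python's standard library. The sample load times are made up for illustration; they are skewed to the right the way page load times often are.

```python
import statistics

# Hypothetical page load times in ms: mostly fast, with a long right tail.
load_times = [800, 900, 950, 1000, 1100, 1200, 1500, 2500, 9000]

arithmetic = statistics.mean(load_times)            # pulled up by the 9000 ms outlier
geometric = statistics.geometric_mean(load_times)   # suited to log-normal-like data
median = statistics.median(load_times)              # the middle value, robust to the tail

print(f"arithmetic={arithmetic:.0f} geometric={geometric:.0f} median={median}")
```

On skewed data like this, the arithmetic mean lands well above the typical experience, while the geometric mean and the median stay closer to where most of the points are.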

Slide 21

Slide 21 text

About the Trimmed Mean/Median…
● Take the entire data set
● Remove the top (and maybe also bottom) 1% of data points
● Find the mean or median of what’s left
We can also use IQR for something similar
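The trimming steps above can be sketched as follows. The function name `trimmed` and its parameters are illustrative assumptions, not part of any library.

```python
import statistics

def trimmed(values, trim_fraction=0.01, trim_bottom=False):
    """Return the sorted values with the top (and optionally bottom) fraction removed."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # how many points to drop on each trimmed side
    start = k if trim_bottom else 0
    end = len(ordered) - k
    return ordered[start:end]

# 99 ordinary points plus one extreme outlier (made-up data).
data = list(range(1, 100)) + [100000]

print(statistics.mean(data))             # dominated by the outlier
print(statistics.mean(trimmed(data)))    # mean of what is left after trimming the top 1%
print(statistics.median(trimmed(data)))  # median of what is left
```

With 100 points and a 1% trim, exactly one point (the outlier) is dropped from the top.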

Slide 22

Slide 22 text

Spread or dispersion
● How far the data moves away from the center
● Standard Deviation (arithmetic & geometric), Interquartile Range (IQR), Median Absolute Deviation (MAD)
● IQR and MAD are robust measures – superior for distributions with outliers
https://en.wikipedia.org/wiki/Statistical_dispersion

Slide 23

Slide 23 text

Spread or dispersion – IQR
https://en.wikipedia.org/wiki/Interquartile_range
● The delta between the 25th & 75th percentiles (aka 1st & 3rd quartiles)
● Multiply this range by 1.5 and extend it outward from each quartile (Q1 − 1.5 × IQR to Q3 + 1.5 × IQR) – this is a good filter for the trimmed median
● For most performance distributions, the lower bound will come out negative, so fix it at 0.
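A minimal sketch of the IQR filter described above, using `statistics.quantiles` from the standard library. The helper names, the choice of the "inclusive" quantile method, and the sample data are assumptions made for this example.

```python
import statistics

def iqr_fences(values, multiplier=1.5):
    """Return (lower, upper) bounds: the quartiles pushed outward by multiplier * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = max(0.0, q1 - multiplier * iqr)  # timings cannot be negative, so clamp at 0
    upper = q3 + multiplier * iqr
    return lower, upper

def iqr_filter(values, multiplier=1.5):
    """Keep only the values that fall inside the fences."""
    lower, upper = iqr_fences(values, multiplier)
    return [v for v in values if lower <= v <= upper]

# Made-up timings in ms with one obvious outlier.
timings = [100, 200, 300, 400, 500, 600, 700, 800, 900, 5000]
lower, upper = iqr_fences(timings)
filtered = iqr_filter(timings)
```

The median of `filtered` is then the filtered median mentioned earlier.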

Slide 24

Slide 24 text

Spread or dispersion – MAD
https://en.wikipedia.org/wiki/Median_absolute_deviation
● Take the median of the dataset
● Take the absolute value of the delta between every point and the median
● Take the median of these deltas
MAD = median( |x_i - median(x)| )
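The three steps translate almost directly into code. This is a small sketch with the standard library; the function name is an assumption.

```python
import statistics

def median_absolute_deviation(values):
    """MAD: the median of the absolute deviations from the median."""
    center = statistics.median(values)
    deviations = [abs(v - center) for v in values]
    return statistics.median(deviations)

sample = [1, 1, 2, 2, 4, 6, 9]
mad = median_absolute_deviation(sample)  # median is 2, deviations are [1, 1, 0, 0, 2, 4, 7]
```

Because both steps use the median, a single extreme value in `sample` barely moves the result, which is what makes MAD a robust measure.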

Slide 25

Slide 25 text

Fun Tip: Anscombe’s Quartet
Frank Anscombe
Plot of Anscombe's Quartet by Schutz & Avenue
● 4 data sets with the same summary statistics:
○ μ_x = 9, μ_y = 7.5
○ s_x² = 11, s_y² = 4.125
○ ρ_x,y = 0.816
○ Linear regression line: y = 3 + 0.5x
○ R² = 0.67
● Anscombe’s Quartet shows us why it’s important to visualize data and not just look at summary stats
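As a quick check, the sketch below computes these summary statistics for the first two of the four data sets, using the values from Anscombe's published table. The `summarize` helper is an assumption for this example, and the statistics are computed by hand rather than with a plotting or stats package.

```python
import statistics

# Data sets I and II of Anscombe's Quartet (they share the same x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def summarize(xs, ys):
    """Return mean, sample variance, correlation, and least-squares line for (xs, ys)."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    var_x, var_y = statistics.variance(xs), statistics.variance(ys)
    # Sample covariance, written out by hand.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / (len(xs) - 1)
    slope = cov / var_x
    return {
        "mean_x": mean_x, "mean_y": mean_y,
        "var_x": var_x, "var_y": var_y,
        "r": cov / (var_x * var_y) ** 0.5,
        "slope": slope, "intercept": mean_y - slope * mean_x,
    }

s1, s2 = summarize(x, y1), summarize(x, y2)
```

The two summaries agree to about two decimal places, even though a scatter plot shows one set as a noisy line and the other as a smooth curve.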

Slide 26

Slide 26 text

Temporal Data

Slide 27

Slide 27 text

Time Series - Trends
This is 1 year of performance data at 1 day intervals

Slide 28

Slide 28 text

Trends
Trends are useful to identify how data grows. We subtract the trend to remove expected growth.
● We can smooth this data to identify a trend

Slide 29

Slide 29 text

Trends
Trends are useful to identify how data grows. We subtract the trend to remove expected growth.
● We can smooth this data to identify a trend
● With a Simple Moving Average or Moving Median

Slide 30

Slide 30 text

Trends
Trends are useful to identify how data grows. We subtract the trend to remove expected growth.
● We can smooth this data to identify a trend
● With a Simple Moving Average or Moving Median
● Or Loess (LOcally Estimated Scatterplot Smoothing)

Slide 31

Slide 31 text

Trends
Trends are useful to identify how data grows. We subtract the trend to remove expected growth.
● We can smooth this data to identify a trend
● With a Simple Moving Average or Moving Median
● Or Loess (LOcally Estimated Scatterplot Smoothing)
● Or Savitzky-Golay
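Here is a rough sketch of the two simplest smoothers from the list, a trailing moving average and a trailing moving median, plus the detrending step. The helper names, the trailing (rather than centered) window, and the sample series are all assumptions for illustration. Loess and Savitzky-Golay are not implemented here; for the latter, SciPy provides `scipy.signal.savgol_filter`.

```python
import statistics

def rolling(values, window, reducer):
    """Apply reducer over each trailing window; early points use a shorter window."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        smoothed.append(reducer(values[start:i + 1]))
    return smoothed

def detrend(values, window, reducer=statistics.median):
    """Subtract the smoothed trend from the raw series."""
    trend = rolling(values, window, reducer)
    return [v - t for v, t in zip(values, trend)]

# Made-up daily values: slow growth with a one-day spike.
series = [10, 11, 12, 50, 14, 15, 16]

moving_average = rolling(series, 3, statistics.mean)
moving_median = rolling(series, 3, statistics.median)
residual = detrend(series, 3)
```

The moving median ignores the one-day spike when estimating the trend, so the spike survives intact in `residual`, while the moving average absorbs part of it into the trend.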

Slide 32

Slide 32 text

Time Series - Seasons
This is 1 month of performance data at 1 minute intervals

Slide 33

Slide 33 text

Seasonality
Seasonality is useful to identify how data repeats. We subtract the cycles to remove expected repetitions.
● Repeated cycles are predictable
● Fourier Analysis is a good way to identify repeating cycles
● There may be multiple cycles, e.g. daily, weekly, holidays…
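To make the Fourier idea concrete, the sketch below finds the strongest repeating cycle in a series with a naive discrete Fourier transform. It uses only the standard library and is quadratic in the series length, so treat it as an illustration; a real pipeline would more likely use an FFT implementation. The function name and the synthetic hourly signal are assumptions.

```python
import cmath
import math

def dominant_period(values):
    """Return the period, in samples, of the strongest non-constant frequency (or None)."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]  # remove the constant (zero-frequency) part
    best_k, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):         # k = number of full cycles across the series
        coeff = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                    for t, c in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k if best_k else None

# Synthetic example: 4 days of hourly samples with a pure daily cycle.
hourly = [10 + 3 * math.sin(2 * math.pi * t / 24) for t in range(96)]
period = dominant_period(hourly)
```

When several cycles are present (daily and weekly, say), the same power spectrum shows more than one strong peak, and each can be subtracted in turn.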

Slide 34

Slide 34 text

Slide 35

Slide 35 text

Slide 36

Slide 36 text

Noise
● If we subtract the trend and seasonality, we’re left with actual variation in the data
● This is what we can study for distributions and patterns

Slide 37

Slide 37 text

Noise
● We could look at the same minute from each day (seasonality frequency)
● This gives us a time series of distributions, central tendency, and spread

Slide 38

Slide 38 text

Noise
● We could look at the same minute from each day (seasonality frequency)
● This gives us a time series of distributions, central tendency, and spread
● If we add the time series of spreads to the seasonality, we get acceptable tolerance across a period of time
● The sensitivity of this tolerance can be adjusted with an IQR multiplier (0.7-1.5)
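One way to read these steps is sketched below: group historical samples by their slot in the cycle (for example, minute of day), take the per-slot median as the expected value, and widen it by a multiple of the per-slot IQR. This is an interpretation under stated assumptions, not necessarily how mPulse does it; the function names, the (slot, value) input shape, and the sample history are all hypothetical.

```python
import statistics
from collections import defaultdict

def tolerance_bands(samples, multiplier=1.5):
    """Build {slot: (low, high)} from (slot, value) pairs collected over many cycles."""
    by_slot = defaultdict(list)
    for slot, value in samples:
        by_slot[slot].append(value)
    bands = {}
    for slot, values in by_slot.items():
        q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
        spread = multiplier * (q3 - q1)  # a smaller multiplier means a more sensitive band
        bands[slot] = (max(0.0, median - spread), median + spread)
    return bands

def is_anomaly(bands, slot, value):
    """True when value falls outside the tolerance band for its slot."""
    low, high = bands[slot]
    return not (low <= value <= high)

# Made-up history: slot 0 is a quiet period, slot 1 is a busy period, 5 days each.
history = [(0, v) for v in (100, 110, 120, 130, 140)]
history += [(1, v) for v in (1000, 1100, 1200, 1300, 1400)]
bands = tolerance_bands(history)
```

A value of 1000 is unremarkable in the busy slot but far outside the band for the quiet slot, which is the "not outside expected bounds, but it is for this time of day" case from earlier.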

Slide 39

Slide 39 text

We use DSP to find noise!

Slide 40

Slide 40 text

Clustering
● A method to find groups of data points that fit closer together than others
● Regardless of the algorithm, you’ll need some kind of distance function
● DBSCAN is one of the popular algorithms. OPTICS is a newer variant.
● The most common distance function is Euclidean Distance, but it only works for numeric data.
● Levenshtein Distance compares strings by counting the single-character edits needed to turn one into the other.
● For other kinds of categorical data, consider the Jaccard Distance.
https://en.wikipedia.org/wiki/Cluster_analysis
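The three distance functions named above are short enough to sketch directly. These are plain standard-library versions written for illustration; a clustering library would normally supply its own or accept one of these as a callable.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def jaccard_distance(a, b):
    """1 minus (size of intersection / size of union) of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # two empty sets are treated as identical
    return 1 - len(a & b) / len(a | b)
```

For example, `levenshtein("kitten", "sitting")` is 3, and two sets that share two of four distinct items have a Jaccard distance of 0.5.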

Slide 41

Slide 41 text

Summary
● As with bread, preparing the dough (the model) takes way more time than consuming it.
● There are a lot of steps to creating a model
● Simple statistics are often faster to compute than more complex unsupervised learning techniques

Slide 42

Slide 42 text

Merci ! (Thank you!) ⚜