
The Statistics of Anomaly Detection

If you collect any kind of real-time data, you're probably interested in getting alerted when that data goes out of whack, but setting static thresholds has the downside of generating a lot of false positives. Anomaly detection can help determine normal patterns and alert you when the data deviates from them.
In this talk we'll look at some of the methods of detecting anomalies in your data.

https://confoo.ca/en/2025/session/the-statistics-of-anomaly-detection

Leave feedback at https://confoo.ca/en/2025/feedback/DB5044AB426889A6A2CA1510FBB56617

Philip Tellis

February 28, 2025

Transcript

  1. Story Time
     • Bread was rationed: 1 kg loaf per person
     • Bread is handmade, so loaves aren't exactly 1 kg
     • A normal distribution centered below 1 kg showed that the baker was skimming
     • The baker adjusted his process so loaves measured over 1 kg
     • The distribution then showed that the only change was in who received the heavier loaves
  2. Background on mPulse
     • We collect web performance data from end user browsers using the boomerang JavaScript library.
     • This is beaconed back to our performance analytics application – mPulse.
     • The data is cleaned, filtered, sorted, and streamed to various backend tasks for storage, analysis, visualization, and alerting.
     • We also have a task to do advanced data analysis and visualization.
  3. Our Data
     • Mostly time series (though we can ignore time for interesting views).
     • Web performance data related to real user experiences.
     • Data about multiple events during the page load process; we analyze them independently or in relation to each other.
     • Real users imply an uncontrolled environment, and we can't even be sure that all of those users are human, or that their intent is benign.
  4. And of course, time series
     (Chart annotations: "This isn't really outside expected bounds…" and "But it is for this time of day.")
  5. Distributions (the shape of the data)
     (Histogram of a site with very little data: "this is my blog… please visit")
  6. Central Tendency
     • Arithmetic mean, Median, Mode, Geometric mean, Trimmed mean, and more
     • Which one you use depends on the distribution
     https://en.wikipedia.org/wiki/Central_tendency
  7. Central Tendency
     • Use the Arithmetic Mean for Normal distributions
     • Use the Geometric Mean for Log-Normal distributions
     • Use the median, trimmed mean, or filtered median for other distributions (see the sketch below)
     https://en.wikipedia.org/wiki/Central_tendency
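
A minimal sketch of these choices in Python (NumPy/SciPy); the log-normal sample is illustrative, standing in for real page-load times:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    # Illustrative data: page-load times are often roughly log-normal
    load_times = rng.lognormal(mean=7.5, sigma=0.5, size=10_000)  # ~milliseconds

    print(np.mean(load_times))                # arithmetic mean: pulled up by the long tail
    print(stats.gmean(load_times))            # geometric mean: suits log-normal data
    print(np.median(load_times))              # median: robust to outliers
    print(stats.trim_mean(load_times, 0.01))  # trimmed mean: cut 1% from each tail

On skewed data the arithmetic mean lands well above the geometric mean and median, which is why the distribution should drive the choice.
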
  8. About the Trimmed Mean/Median…
     • Take the entire data set
     • Remove the top (and maybe also bottom) 1% of data points
     • Find the mean or median of what's left (sketched below)
     We can also use IQR for something similar
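
A hand-rolled version of the same trim, assuming the data is a NumPy array; scipy.stats.trim_mean covers the symmetric case in one call:

    import numpy as np

    def trimmed(data, top=0.01, bottom=0.0):
        # Keep only the points between the `bottom` and `1 - top` quantiles
        lo, hi = np.quantile(data, [bottom, 1 - top])
        kept = data[(data >= lo) & (data <= hi)]
        return np.mean(kept), np.median(kept)
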
  9. Spread or dispersion
     • How far does the data move away from the center?
     • Standard Deviation (arithmetic & geometric), Interquartile Range, Median Absolute Deviation
     • IQR and MAD are robust measures – superior for distributions with outliers
     https://en.wikipedia.org/wiki/Statistical_dispersion
  10. Spread or dispersion – IQR
     • The delta between the 25th & 75th percentiles (aka the 1st & 3rd quartiles)
     • Multiply this range by 1.5 and expand the quartiles out – this is a good filter for the trimmed median
     • For most performance distributions the left band will become negative, so fix it at 0 (see the sketch below)
     https://en.wikipedia.org/wiki/Interquartile_range
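
A minimal sketch of those IQR fences, with the lower fence clamped at 0 as described:

    import numpy as np

    def iqr_fences(data, k=1.5):
        q1, q3 = np.percentile(data, [25, 75])  # 1st & 3rd quartiles
        iqr = q3 - q1
        lower = max(q1 - k * iqr, 0)  # performance timers can't be negative
        upper = q3 + k * iqr
        return lower, upper

Points outside (lower, upper) are treated as outliers; everything inside feeds the trimmed median.
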
  11. Spread or dispersion – MAD
     • Take the median of the dataset
     • Take the absolute value of the delta between every point and the median
     • Take the median of these deltas (sketched below)
     MAD = median(|x_i − median(x)|)
     https://en.wikipedia.org/wiki/Median_absolute_deviation
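
The same three steps in code; scipy.stats.median_abs_deviation provides this as well, with an optional scale factor for consistency with the standard deviation:

    import numpy as np

    def mad(data):
        med = np.median(data)                 # median of the dataset
        return np.median(np.abs(data - med))  # median of the absolute deltas

Like the median itself, MAD barely moves when a handful of extreme outliers arrive, which is what makes it robust.
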
  12. Anscombe's Quartet
     (Image: plot of Anscombe's Quartet by Schutz & Avenue)
     • 4 data sets with the same summary statistics:
       ◦ μ_x = 9, μ_y = 7.5
       ◦ s_x² = 11, s_y² = 4.125
       ◦ ρ_x,y = 0.816
       ◦ Linear regression line: y = 3 + 0.5x
       ◦ R² = 0.67
     • Fun Tip: Anscombe's Quartet shows us why it's important to visualize data and not just look at summary stats
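
As a quick check, seaborn happens to bundle the quartet as a sample dataset (an assumption: load_dataset fetches it on first use), so the matching summary stats are easy to reproduce:

    import seaborn as sns

    df = sns.load_dataset("anscombe")  # columns: dataset, x, y
    print(df.groupby("dataset")[["x", "y"]].agg(["mean", "var"]))      # same means & variances
    print(df.groupby("dataset").apply(lambda g: g["x"].corr(g["y"])))  # same correlation

The numbers agree across all four sets even though the scatterplots look nothing alike.
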
  13. Trends
     Trends are useful to identify how data grows. We subtract the trend to remove expected growth.
     • We can smooth this data to identify a trend
     • With a Simple Moving Average or Moving Median
     • Or Loess (LOcally Estimated Scatterplot Smoothing)
     • Or Savitzky-Golay (see the sketch below)
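
A sketch of these smoothers on synthetic data (Loess lives elsewhere, e.g. statsmodels' lowess; the window sizes here are arbitrary):

    import numpy as np
    import pandas as pd
    from scipy.signal import savgol_filter

    rng = np.random.default_rng(0)
    series = pd.Series(np.linspace(100, 200, 500) + rng.normal(0, 10, 500))

    sma = series.rolling(window=25, center=True).mean()        # simple moving average
    mmed = series.rolling(window=25, center=True).median()     # moving median
    sg = savgol_filter(series, window_length=25, polyorder=2)  # Savitzky-Golay

    detrended = series - sma  # subtract the trend to remove expected growth

The moving median resists outliers better than the moving average, while Savitzky-Golay preserves peaks that a plain average would flatten.
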
  14. Seasonality
     Seasonality is useful to identify how data repeats. We subtract the cycles to remove expected repetitions.
     • Repeated cycles are predictable
     • Fourier Analysis is a good way to identify repeating cycles (sketched below)
     • There may be multiple cycles, e.g. daily, weekly, holidays…
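
A sketch of Fourier analysis picking out a daily cycle from synthetic minute-level data:

    import numpy as np

    rng = np.random.default_rng(1)
    day = 24 * 60           # minutes per day
    t = np.arange(7 * day)  # one week of minute-level samples
    signal = 100 + 20 * np.sin(2 * np.pi * t / day) + rng.normal(0, 5, t.size)

    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    freqs = np.fft.rfftfreq(signal.size, d=1.0)  # cycles per minute

    peak = freqs[np.argmax(spectrum)]
    print(f"dominant period: {1 / peak:.0f} minutes")  # ~1440, i.e. daily

With multiple cycles (daily plus weekly, say), several peaks show up in the spectrum, and each can be subtracted in turn.
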
  15. Noise
     • If we subtract the trend and seasonality, we're left with the actual variation in the data (see the sketch below)
     • This is what we can study for distributions and patterns
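
One way to perform that subtraction is an additive decomposition, sketched here with statsmodels on a synthetic hourly series:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.seasonal import seasonal_decompose

    rng = np.random.default_rng(2)
    n, period = 7 * 24, 24  # a week of hourly samples with a daily cycle
    series = pd.Series(
        np.linspace(100, 120, n)                          # trend
        + 10 * np.sin(2 * np.pi * np.arange(n) / period)  # seasonality
        + rng.normal(0, 3, n)                             # noise
    )

    result = seasonal_decompose(series, model="additive", period=period)
    noise = result.resid  # NaN at the edges, where the centered trend is undefined
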
  16. Noise
     • We could look at the same minute from each day (the seasonality frequency)
     • This gives us a time series of distributions, central tendency, and spread
     • If we add the time series of spreads to the seasonality, we get an acceptable tolerance across a period of time
     • The sensitivity of this tolerance can be adjusted with an IQR multiplier (0.7-1.5); see the sketch below
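
A sketch of those per-minute tolerance bands; the column names ('timestamp', 'value') and the default multiplier are assumptions, not mPulse's actual schema:

    import pandas as pd

    def tolerance_bands(df, k=1.0):
        """df: columns ['timestamp' (datetime64), 'value']; k: IQR multiplier (0.7-1.5)."""
        minute = df["timestamp"].dt.hour * 60 + df["timestamp"].dt.minute
        g = df.groupby(minute)["value"]
        q1, med, q3 = g.quantile(0.25), g.median(), g.quantile(0.75)
        iqr = q3 - q1
        return pd.DataFrame({
            "center": med,
            "lower": (q1 - k * iqr).clip(lower=0),  # timers can't go negative
            "upper": q3 + k * iqr,
        })

A new observation is flagged when it falls outside (lower, upper) for its minute of day; lowering k tightens the band and raises sensitivity.
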
  17. Clustering
     • A method to find groups of data points that fit closer together than others
     • Regardless of the algorithm, you'll need some kind of distance function
     • DBSCAN is one of the popular algorithms. OPTICS is a newer variant. (A DBSCAN sketch follows this list.)
     • The most common distance function is Euclidean Distance, but it only works for numeric data.
     • Levenshtein Distance compares strings based on lexicographic similarity.
     • For other kinds of categorical data, consider the Jaccard Distance.
     https://en.wikipedia.org/wiki/Cluster_analysis
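
A minimal DBSCAN sketch using scikit-learn's implementation with its default Euclidean distance; eps and min_samples are illustrative and need tuning for real data:

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(3)
    points = np.vstack([
        rng.normal((0, 0), 0.3, size=(100, 2)),  # a dense cluster
        rng.normal((5, 5), 0.3, size=(100, 2)),  # another cluster
        rng.uniform(-2, 7, size=(10, 2)),        # scattered noise
    ])

    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)
    print("clusters:", set(labels) - {-1}, "noise points:", (labels == -1).sum())

Points labeled -1 don't belong to any dense region, which is exactly the "doesn't fit with the rest" signal an anomaly detector wants.
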
  18. Summary
     • As with bread, preparing the dough (the model) takes way more time than consuming it.
     • There are a lot of steps to creating a model
     • Simple statistics are often faster to compute than more complex unsupervised learning techniques