
The Statistics of Anomaly Detection

If you collect any kind of real-time data, you're probably interested in getting alerted when that data goes out of whack, but setting static thresholds has the downside of generating a lot of false positives. Anomaly detection can help determine normal patterns and alert you when the data deviates from them.
In this talk we'll look at some of the methods of detecting anomalies in your data.

https://confoo.ca/en/2025/session/the-statistics-of-anomaly-detection

Leave feedback at https://confoo.ca/en/2025/feedback/DB5044AB426889A6A2CA1510FBB56617

Philip Tellis

February 28, 2025

Transcript

  1. Story Time
     • Bread was rationed: 1 kg loaf per person
     • Bread is handmade, so loaves aren't exactly 1 kg
     • A normal distribution centered below 1 kg showed that the baker was skimming
     • The baker adjusted his process so loaves measured over 1 kg
     • The distribution then showed that the only change was in who received the heavier loaves
  2. Background on mPulse
     • We collect web performance data from end user browsers using the boomerang JavaScript library.
     • This is beaconed back to our performance analytics application – mPulse.
     • The data is cleaned, filtered, sorted, and streamed to various backend tasks for storage, analysis, visualization, and alerting.
     • We also have a task to do advanced data analysis and visualization.
  3. Our Data
     • Mostly time series (though we can ignore time for interesting views).
     • Web performance data related to real user experiences.
     • Data about multiple events during the page load process; we analyze them independently or in relation to each other.
     • Real users imply an uncontrolled environment, and we can't even be sure that all of those users are human, or that their intent is benign.
  4. And of course, time series
     (Chart annotations: "This isn't really outside expected bounds…" and "But it is for this time of day.")
  5. Distributions (the shape of the data)
     (Histogram of a site with very little data: "this is my blog… please visit")
  6. Central Tendency
     • Arithmetic mean, Median, Mode, Geometric mean, Trimmed mean, and more
     • Which one you use depends on the distribution
     https://en.wikipedia.org/wiki/Central_tendency
  7. Central Tendency
     • Use the Arithmetic Mean for Normal distributions
     • Use the Geometric Mean for Log-Normal distributions
     • Use the median, trimmed mean, or filtered median for other distributions (see the sketch below)
     https://en.wikipedia.org/wiki/Central_tendency
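
A minimal sketch of these choices in Python (NumPy/SciPy); the log-normal sample is illustrative, standing in for real page-load times:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    # Illustrative data: page-load times are often roughly log-normal
    load_times = rng.lognormal(mean=7.5, sigma=0.5, size=10_000)  # ~milliseconds

    print(np.mean(load_times))                # arithmetic mean: pulled up by the long tail
    print(stats.gmean(load_times))            # geometric mean: suits log-normal data
    print(np.median(load_times))              # median: robust to outliers
    print(stats.trim_mean(load_times, 0.01))  # trimmed mean: cut 1% from each tail

On skewed data the arithmetic mean lands well above the geometric mean and median, which is why the distribution should drive the choice.
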
  8. About the Trimmed Mean/Median…
     • Take the entire data set
     • Remove the top (and maybe also bottom) 1% of data points
     • Find the mean or median of what's left (sketched below)
     We can also use IQR for something similar
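
A hand-rolled version of the same trim, assuming the data is a NumPy array; scipy.stats.trim_mean covers the symmetric case in one call:

    import numpy as np

    def trimmed(data, top=0.01, bottom=0.0):
        # Keep only the points between the `bottom` and `1 - top` quantiles
        lo, hi = np.quantile(data, [bottom, 1 - top])
        kept = data[(data >= lo) & (data <= hi)]
        return np.mean(kept), np.median(kept)
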
  9. Spread or dispersion
     • How far does the data move away from the center?
     • Standard Deviation (arithmetic & geometric), Interquartile Range, Median Absolute Deviation
     • IQR and MAD are robust measures – superior for distributions with outliers
     https://en.wikipedia.org/wiki/Statistical_dispersion
  10. Spread or dispersion – IQR
     • The delta between the 25th & 75th percentiles (aka the 1st & 3rd quartiles)
     • Multiply this range by 1.5 and expand the quartiles out – this is a good filter for the trimmed median
     • For most performance distributions the left band will become negative, so fix it at 0 (see the sketch below)
     https://en.wikipedia.org/wiki/Interquartile_range
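
A minimal sketch of those IQR fences, with the lower fence clamped at 0 as described:

    import numpy as np

    def iqr_fences(data, k=1.5):
        q1, q3 = np.percentile(data, [25, 75])  # 1st & 3rd quartiles
        iqr = q3 - q1
        lower = max(q1 - k * iqr, 0)  # performance timers can't be negative
        upper = q3 + k * iqr
        return lower, upper

Points outside (lower, upper) are treated as outliers; everything inside feeds the trimmed median.
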
  11. Spread or dispersion – MAD
     • Take the median of the dataset
     • Take the absolute value of the delta between every point and the median
     • Take the median of these deltas (sketched below)
     MAD = median(|x_i − median(x)|)
     https://en.wikipedia.org/wiki/Median_absolute_deviation
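
The same three steps in code; scipy.stats.median_abs_deviation provides this as well, with an optional scale factor for consistency with the standard deviation:

    import numpy as np

    def mad(data):
        med = np.median(data)                 # median of the dataset
        return np.median(np.abs(data - med))  # median of the absolute deltas

Like the median itself, MAD barely moves when a handful of extreme outliers arrive, which is what makes it robust.
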
  12. Anscombe's Quartet
     (Image: plot of Anscombe's Quartet by Schutz & Avenue)
     • 4 data sets with the same summary statistics:
       ◦ μ_x = 9, μ_y = 7.5
       ◦ s_x² = 11, s_y² = 4.125
       ◦ ρ_x,y = 0.816
       ◦ Linear regression line: y = 3 + 0.5x
       ◦ R² = 0.67
     • Fun Tip: Anscombe's Quartet shows us why it's important to visualize data and not just look at summary stats
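
As a quick check, seaborn happens to bundle the quartet as a sample dataset (an assumption: load_dataset fetches it on first use), so the matching summary stats are easy to reproduce:

    import seaborn as sns

    df = sns.load_dataset("anscombe")  # columns: dataset, x, y
    print(df.groupby("dataset")[["x", "y"]].agg(["mean", "var"]))      # same means & variances
    print(df.groupby("dataset").apply(lambda g: g["x"].corr(g["y"])))  # same correlation

The numbers agree across all four sets even though the scatterplots look nothing alike.
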
  13. Trends
     Trends are useful to identify how data grows. We subtract the trend to remove expected growth.
     • We can smooth this data to identify a trend
     • With a Simple Moving Average or Moving Median
     • Or Loess (LOcally Estimated Scatterplot Smoothing)
     • Or Savitzky-Golay (see the sketch below)
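
A sketch of these smoothers on synthetic data (Loess lives elsewhere, e.g. statsmodels' lowess; the window sizes here are arbitrary):

    import numpy as np
    import pandas as pd
    from scipy.signal import savgol_filter

    rng = np.random.default_rng(0)
    series = pd.Series(np.linspace(100, 200, 500) + rng.normal(0, 10, 500))

    sma = series.rolling(window=25, center=True).mean()        # simple moving average
    mmed = series.rolling(window=25, center=True).median()     # moving median
    sg = savgol_filter(series, window_length=25, polyorder=2)  # Savitzky-Golay

    detrended = series - sma  # subtract the trend to remove expected growth

The moving median resists outliers better than the moving average, while Savitzky-Golay preserves peaks that a plain average would flatten.
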
  14. Seasonality
     Seasonality is useful to identify how data repeats. We subtract the cycles to remove expected repetitions.
     • Repeated cycles are predictable
     • Fourier Analysis is a good way to identify repeating cycles (sketched below)
     • There may be multiple cycles, e.g. daily, weekly, holidays…
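
A sketch of Fourier analysis picking out a daily cycle from synthetic minute-level data:

    import numpy as np

    rng = np.random.default_rng(1)
    day = 24 * 60           # minutes per day
    t = np.arange(7 * day)  # one week of minute-level samples
    signal = 100 + 20 * np.sin(2 * np.pi * t / day) + rng.normal(0, 5, t.size)

    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    freqs = np.fft.rfftfreq(signal.size, d=1.0)  # cycles per minute

    peak = freqs[np.argmax(spectrum)]
    print(f"dominant period: {1 / peak:.0f} minutes")  # ~1440, i.e. daily

With multiple cycles (daily plus weekly, say), several peaks show up in the spectrum, and each can be subtracted in turn.
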
  15. Noise
     • If we subtract the trend and seasonality, we're left with the actual variation in the data (see the sketch below)
     • This is what we can study for distributions and patterns
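
One way to perform that subtraction is an additive decomposition, sketched here with statsmodels on a synthetic hourly series:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.seasonal import seasonal_decompose

    rng = np.random.default_rng(2)
    n, period = 7 * 24, 24  # a week of hourly samples with a daily cycle
    series = pd.Series(
        np.linspace(100, 120, n)                          # trend
        + 10 * np.sin(2 * np.pi * np.arange(n) / period)  # seasonality
        + rng.normal(0, 3, n)                             # noise
    )

    result = seasonal_decompose(series, model="additive", period=period)
    noise = result.resid  # NaN at the edges, where the centered trend is undefined
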
  16. Noise
     • We could look at the same minute from each day (the seasonality frequency)
     • This gives us a time series of distributions, central tendency, and spread
     • If we add the time series of spreads to the seasonality, we get an acceptable tolerance across a period of time
     • The sensitivity of this tolerance can be adjusted with an IQR multiplier (0.7-1.5); see the sketch below
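
A sketch of those per-minute tolerance bands; the column names ('timestamp', 'value') and the default multiplier are assumptions, not mPulse's actual schema:

    import pandas as pd

    def tolerance_bands(df, k=1.0):
        """df: columns ['timestamp' (datetime64), 'value']; k: IQR multiplier (0.7-1.5)."""
        minute = df["timestamp"].dt.hour * 60 + df["timestamp"].dt.minute
        g = df.groupby(minute)["value"]
        q1, med, q3 = g.quantile(0.25), g.median(), g.quantile(0.75)
        iqr = q3 - q1
        return pd.DataFrame({
            "center": med,
            "lower": (q1 - k * iqr).clip(lower=0),  # timers can't go negative
            "upper": q3 + k * iqr,
        })

A new observation is flagged when it falls outside (lower, upper) for its minute of day; lowering k tightens the band and raises sensitivity.
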
  17. Clustering
     • A method to find groups of data points that fit closer together than others
     • Regardless of the algorithm, you'll need some kind of distance function
     • DBSCAN is one of the popular algorithms. OPTICS is a newer variant. (A DBSCAN sketch follows this list.)
     • The most common distance function is Euclidean Distance, but it only works for numeric data.
     • Levenshtein Distance compares strings based on lexicographic similarity.
     • For other kinds of categorical data, consider the Jaccard Distance.
     https://en.wikipedia.org/wiki/Cluster_analysis
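
A minimal DBSCAN sketch using scikit-learn's implementation with its default Euclidean distance; eps and min_samples are illustrative and need tuning for real data:

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(3)
    points = np.vstack([
        rng.normal((0, 0), 0.3, size=(100, 2)),  # a dense cluster
        rng.normal((5, 5), 0.3, size=(100, 2)),  # another cluster
        rng.uniform(-2, 7, size=(10, 2)),        # scattered noise
    ])

    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)
    print("clusters:", set(labels) - {-1}, "noise points:", (labels == -1).sum())

Points labeled -1 don't belong to any dense region, which is exactly the "doesn't fit with the rest" signal an anomaly detector wants.
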
  18. Summary
     • As with bread, preparing the dough (the model) takes way more time than consuming it.
     • There are a lot of steps to creating a model
     • Simple statistics are often faster to compute than more complex unsupervised learning techniques