Anomaly Detection in IT Systems at Scale

Slide 1

Slide 1 text

ANOMALY DETECTION IN IT SYSTEMS Tom Veasey - Research Director

Slide 2

Slide 2 text

AGENDA •  Why? •  Use case •  Background •  Machine data characteristics •  Designing analytics for scale •  Analytics case studies

Slide 3

Slide 3 text

ABOUT US •  Prelert has a packaged solution for doing “relevant” anomaly detection for general time series data sets •  Our approach is designed from the outset to scale to handle massive data sets •  Our approach is designed from the outset to operate in a data stream setting where data characteristics change over time •  Our software is aimed at (statistical) lay people

Slide 4

Slide 4 text

WHY? •  Many practically important tasks are anomaly detection problems •  Detecting/diagnosing IT system faults •  Detecting fires/deforestation in satellite imagery •  Detecting mechanical problems in cars/planes/satellites, etc. •  Detecting congestion on traffic networks •  Detecting malware, network intrusion and extrusion •  Detecting rogue traders •  Adds value to many data sets

Slide 5

Slide 5 text

USE CASE

Slide 6

Slide 6 text

USE CASE – TRANSPORT CONGESTION IMPACTS DUE TO FAULTS •  Problem: Which faults and incidents are causing the biggest congestion impact? –  For example, is a traffic signal failure causing: 1.  Significant congestion? 2.  Bus lateness? •  Solution: Centralize data, analyze with Prelert: –  Determine (in real time) faults and incidents causing anomalous congestion and bus lateness –  List most impacting faults and incidents –  Filter impacting faults and incidents by type

Slide 7

Slide 7 text

USE CASE – TRANSPORT CONGESTION IMPACTS DUE TO FAULTS •  Average journey times are gathered for a large number of segments (links) of London’s roads •  This generates a large number of time series (>2000). Increases in journey time indicate congestion or other traffic issues on that link

Slide 8

Slide 8 text

USE CASE – TRANSPORT CONGESTION IMPACTS DUE TO FAULTS •  Prelert automatically creates statistical models of journey times across all links and identifies and correlates anomalies in real time – unsupervised, no custom analytics •  For example,

Slide 9

Slide 9 text

PUTTING IT ALL TOGETHER •  Correlate congestion impact and faults (using time and location) [Diagram: Congestion Anomalies, Bus Lateness Anomalies and Incidents feed into a Ranked List of Congestion Impacting Faults]

Slide 10

Slide 10 text

BACKGROUND

Slide 11

Slide 11 text

TAXONOMY OF ANOMALIES •  Point outliers •  Outliers in context

Slide 12

Slide 12 text

TAXONOMY OF ANOMALIES •  Bulk or pattern anomalies ABBA BABA BABA ABBA BAAB BABA •  Rare values AAAAAAAAAAAABAAAAAAAAAA

Slide 13

Slide 13 text

TAXONOMY OF ANOMALIES •  Structural (when objects have more structure than points, e.g. attribute graphs, functions, etc.)

Slide 14

Slide 14 text

WHAT YOU MEASURE IS IMPORTANT •  Cross anomalies are essentially impossible to detect unsupervised

Slide 15

Slide 15 text

WHAT YOU MEASURE IS IMPORTANT •  Cross anomalies are easily separated by a plane

Slide 16

Slide 16 text

WHAT YOU MEASURE IS IMPORTANT – EXAMPLE •  For the time series discussed earlier the correlation of increments between adjacent values increased at the time of the change •  For example, look at the increments’ sliding window variance, i.e. [formula not recovered], and the change is clear
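
The idea of measuring increments rather than raw values can be sketched as follows. This is only an illustration on synthetic data, not the series from the slides: the series, the AR(1) coefficient `phi` and the window length are all assumptions. Both halves of the series have the same mean and variance, but adjacent values become correlated halfway through, which shows up clearly in the sliding window variance of the increments.

```python
import random
from statistics import pvariance


def sliding_increment_variance(series, window):
    """Variance of successive differences over a sliding window."""
    increments = [b - a for a, b in zip(series, series[1:])]
    return [pvariance(increments[i:i + window])
            for i in range(len(increments) - window + 1)]


# Synthetic data (an assumption for illustration only).
rng = random.Random(0)
phi = 0.9  # hypothetical correlation between adjacent values after the change

# First half: independent unit-variance noise.
series = [rng.gauss(0.0, 1.0) for _ in range(500)]
# Second half: AR(1) noise with the same unit variance but correlated neighbours.
for _ in range(500):
    series.append(phi * series[-1] + (1 - phi ** 2) ** 0.5 * rng.gauss(0.0, 1.0))

variances = sliding_increment_variance(series, window=100)
before = variances[0]    # window lies entirely before the change
after = variances[-1]    # window lies entirely after the change
print(f"increment variance before: {before:.2f}, after: {after:.2f}")
```

Point values on either side of the change look alike, so a detector working on raw values would struggle, while the increment variance drops sharply.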

Slide 17

Slide 17 text

WHAT YOU MEASURE IS IMPORTANT – EXAMPLE

Slide 18

Slide 18 text

WHAT YOU MEASURE IS IMPORTANT – EXAMPLE

Slide 19

Slide 19 text

DISTANCE BASED METHODS •  Is more than a certain fraction, f, of data points further than some distance, d, from the point? (Knorr, Ng) [Figure labels: f = 1; f = 12/16]
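
A minimal sketch of this test, assuming Euclidean distance and a brute-force scan over all pairs (the function name `distance_outliers` and the toy data are hypothetical):

```python
import math


def distance_outliers(data, f, d):
    """Indices of points for which more than a fraction f of the
    other points lie further than distance d away."""
    outliers = []
    for i, p in enumerate(data):
        far = sum(1 for j, q in enumerate(data)
                  if j != i and math.dist(p, q) > d)
        if far > f * (len(data) - 1):
            outliers.append(i)
    return outliers


# Toy data: a tight 4 x 4 grid of 16 points plus one distant point.
cluster = [(x * 0.1, y * 0.1) for x in range(4) for y in range(4)]
data = cluster + [(5.0, 5.0)]
print(distance_outliers(data, f=0.9, d=1.0))  # only the distant point, index 16
```

The brute-force scan is quadratic in the number of points, so it is only meant to make the definition concrete.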

Slide 20

Slide 20 text

DENSITY BASED METHODS •  Local outlier factor: compare the distance from the point to its neighbours with the distances from those neighbours to their own neighbours (Breunig et al.) [Figure labels: Not outlier; Outlier]
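
The local outlier factor can be sketched in a few lines. This is a simplified brute-force version that assumes Euclidean distance and distinct points, and it takes exactly k neighbours even when distances tie; the function name and toy data are hypothetical.

```python
import math


def local_outlier_factors(data, k):
    """Simplified local outlier factor for each point (values near 1 are
    inliers, values much larger than 1 are outliers)."""
    n = len(data)
    # k nearest neighbours of each point as (distance, index), excluding itself.
    neighbours = []
    for i in range(n):
        dists = sorted((math.dist(data[i], data[j]), j)
                       for j in range(n) if j != i)
        neighbours.append(dists[:k])
    k_distance = [nb[-1][0] for nb in neighbours]

    def local_reachability_density(i):
        # Reachability distance from i to j is max(k-distance of j, d(i, j)).
        reach = [max(k_distance[j], dist) for dist, j in neighbours[i]]
        return len(reach) / sum(reach)

    density = [local_reachability_density(i) for i in range(n)]
    # Ratio of the neighbours' mean density to the point's own density.
    return [sum(density[j] for _, j in neighbours[i]) / (k * density[i])
            for i in range(n)]


cluster = [(x * 0.1, y * 0.1) for x in range(4) for y in range(4)]
data = cluster + [(5.0, 5.0)]
lof = local_outlier_factors(data, k=3)
print(f"largest factor in cluster: {max(lof[:16]):.2f}, distant point: {lof[16]:.1f}")
```

Because each point is compared with the density around its own neighbours, a point in a sparse cluster is not penalised merely for being far from a dense one.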

Slide 21

Slide 21 text

PREDICTIVE METHODS •  When a dynamical model is available for the system, e.g. satellite, plane, etc., look at outliers compared to the predictions (residuals). Kalman filtering, manoeuvre detection, etc. [Figure label: Outlier]

Slide 22

Slide 22 text

STATISTICAL •  Robust PCA [Figure labels: Low rank; Sparse; Small and full rank]

Slide 23

Slide 23 text

COMMONALITY •  What are these trying to do? •  Estimate some probability density function and then look for points in the tails of that distribution •  Distance amounts to this when clusters are uniform density, and this is the case where it works •  Local outlier factor tries to account for cluster density impact on distribution •  Predictive model amounts to a path dependent probability distribution •  Robust PCA is estimating the probability density function when it lies in a subspace and doing it robustly

Slide 24

Slide 24 text

DENSITY BASED METHODS – REVISITED •  Contour lines would be level sets of a probability density function that would be a reasonable fit to these points

Slide 25

Slide 25 text

DENSITY BASED METHODS – REVISITED •  Part of my talk at SIAM* workshop concerned how to measure the anomalousness of an outcome in a principled way given a probability distribution * Society for Industrial and Applied Mathematics

Slide 26

Slide 26 text

IS THIS THE ANSWER?

Slide 27

Slide 27 text

CONCENTRATION OF MEASURE •  A brief diversion: survival theory •  Assume uniform IID points in an interval •  Probability that the nearest neighbour is at least r away: [formula not recovered] •  Assumes r is small compared to total volume
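
The slide's own formula is not available, so the following is a reconstruction under stated assumptions rather than the author's expression: N points are uniform IID on an interval of length L (so the density is \rho = N / L), and the reference point is far enough from the ends that the whole window of half-width r fits inside the interval. Each of the other N - 1 points then misses that window independently with probability 1 - 2r/L.

```latex
% Assumptions: N uniform IID points on an interval of length L,
% density \rho = N / L, reference point away from the interval ends.
P(\text{nearest neighbour further than } r)
  = \left(1 - \frac{2r}{L}\right)^{N-1}
  \approx e^{-2\rho r}, \qquad r \ll L .
```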

Slide 28

Slide 28 text

CONCENTRATION OF MEASURE •  What about in two dimensions? •  In n dimensions the volume of the shell is [formula not recovered] •  Points become increasingly uniform •  People have suggested alternative metrics which don’t suffer this problem to the same extent (see for example Aggarwal et al.)
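
The effect is easy to see numerically. The sketch below assumes points drawn uniformly from the unit hypercube and Euclidean distance, and measures the relative contrast between the farthest and nearest point from a random query; the function name and sample sizes are arbitrary choices.

```python
import math
import random


def relative_contrast(dim, n_points, rng):
    """(farthest - nearest) / nearest distance from a random query point
    to n_points uniform random points in the unit hypercube."""
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(1)
for dim in (1, 2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim, 200, rng), 3))
```

As the dimension grows the contrast shrinks towards zero, so "nearest" and "farthest" neighbours become hard to tell apart, which is what undermines distance-based outlier tests.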

Slide 29

Slide 29 text

CONCENTRATION OF MEASURE

Slide 30

Slide 30 text

CURSE OF DIMENSIONALITY •  Previous picture a bit misleading •  I wasn’t clear about the constant of proportionality. Assumes number of points is varying •  Suppose you have 10 points to estimate density in 1 dimension (using Kernel Density Estimator) •  Need 100 to get the same resolution in 2 dimensions •  In general, you need 10^D points in D dimensions for the same resolution •  So in 10 dimensions you need 10,000,000,000 points •  Interestingly, the same problem also afflicts models with more parameters

Slide 31

Slide 31 text

CURSE OF DIMENSIONALITY – SIMPLE EXPERIMENT •  Consider the case of estimating the probability of seeing a more extreme sample from a spherically symmetric multivariate normal •  Estimate sample mean and full covariance matrix, i.e. D + D (D + 1) / 2 parameters •  Compute probability as 1 - F(r), where F(r) is the generalized CDF (see here) •  Fix a “3 sigma point”, red crosses in figures. (Note that r of point with constant probability varies with dimension) •  Estimate using 100 samples of spherically symmetric multivariate normal and repeat 500 times

Slide 32

Slide 32 text

CURSE OF DIMENSIONALITY – SIMPLE EXPERIMENT •  For higher dimensions we get really large variation in the magnitude of the probability, up to 10^6 across our 500 trials •  Can’t just estimate a model and then use it to compute probability of more extreme sample without understanding variation due to possible errors in estimating model parameters •  Consider bootstrap or Bayesian approach to understand uncertainty in assessment of how anomalous an event is
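
A rough re-creation of this kind of experiment, using only the Python standard library. Several details are assumptions rather than the author's setup: the tail probability is computed from the chi-squared distribution of the squared Mahalanobis distance (closed form, even dimensions only), the "3 sigma point" is placed on the first axis at the radius whose true tail probability matches a one-dimensional 3 sigma event, and the spread is summarised as the ratio of largest to smallest estimate.

```python
import math
import random


def chi2_sf_even(x, dof):
    """P(chi-squared with an even number of degrees of freedom > x)."""
    half, term, total = x / 2.0, 1.0, 1.0
    for k in range(1, dof // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def mahalanobis_sq(x, samples):
    """Squared Mahalanobis distance of x using the sample mean and full covariance."""
    n, d = len(samples), len(x)
    mean = [sum(s[i] for s in samples) / n for i in range(d)]
    centred = [[s[i] - mean[i] for i in range(d)] for s in samples]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Cholesky factor (cov = L L^T), then forward-solve L z = x - mean.
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = cov[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    z = []
    for i in range(d):
        z.append((x[i] - mean[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return sum(v * v for v in z)


def spread_of_estimates(dim, n_samples=100, n_trials=500, seed=0):
    """Ratio of largest to smallest estimated tail probability across trials."""
    rng = random.Random(seed)
    target = math.erfc(3 / math.sqrt(2))  # two-sided 3 sigma tail, about 0.0027
    lo, hi = 0.0, 100.0
    for _ in range(100):  # bisection for the radius with true tail = target
        mid = (lo + hi) / 2
        if chi2_sf_even(mid * mid, dim) > target:
            lo = mid
        else:
            hi = mid
    point = [lo] + [0.0] * (dim - 1)
    probs = []
    for _ in range(n_trials):
        samples = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                   for _ in range(n_samples)]
        probs.append(chi2_sf_even(mahalanobis_sq(point, samples), dim))
    return max(probs) / min(probs)


for dim in (2, 10):
    print(dim, f"{spread_of_estimates(dim, n_trials=100):.3g}")
```

The true probability of the fixed point is the same in every dimension, so any growth in the spread comes purely from the error in the estimated mean and covariance.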

Slide 33

Slide 33 text

CURSE OF DIMENSIONALITY – SIMPLE EXPERIMENT

Slide 34

Slide 34 text

CURSE OF DIMENSIONALITY – SIMPLE EXPERIMENT

Slide 35

Slide 35 text

CURSE OF DIMENSIONALITY – SIMPLE EXPERIMENT

Slide 36

Slide 36 text

IMPLICATIONS •  In high dimensions –  Data are very uniform –  Data are very sparse •  Advice: full multivariate analysis is not effective for anomaly detection in high dimensions •  Can be possible when the data are near some low dimensional manifold, or if there are constraints on the joint distribution, for example some independence •  You need a principled way to aggregate measures of anomalousness on the distribution factors

Slide 37

Slide 37 text

WARNING •  Insight is not always equivalent to actionability •  Telling a system administrator that his/her system “is operating in the tails of its distribution about the low dimensional manifold that describes its typical behaviour on a Monday morning” is likely to be met with a blank expression •  Actionability is a function of user expertise •  Good visualizations always help •  Note this is different from correlation between anomalies, which is always of interest, easy to understand, and useful for root cause analysis

Slide 38

Slide 38 text

HOW DO YOU DO ANOMALY DETECTION IN MACHINE DATA?

Slide 39

Slide 39 text

MACHINE DATA CHARACTERISTICS •  Periodic (daily and weekly) •  Sporadic features (bank holidays) •  Domain restricted (percentages) •  Heterogeneous (error messages, codes, counters, performance metrics, IP addresses …) •  Correlations (load balancing) •  Random (non-Gaussian residuals, more later) •  Characteristics constantly evolving (machines brought online, new users arrive, existing users leave) •  Data stream •  Variable data rates

Slide 40

Slide 40 text

MACHINE DATA IS BIG DATA •  >1,000,000 performance metrics for a big system •  Thousands of events per second •  People are indexing multiple terabytes of data a day and NetFlow data, packet header information, is much bigger

Slide 41

Slide 41 text

DESIGNING ANALYTICS FOR SCALE •  Map reducible features allow one to compress the data near where it is stored and only pass the compressed data to the anomaly detection processes, provided it is combinable •  At Prelert we use aggregate value statistics like count, minimum, maximum and mean metric values in short(ish) time intervals •  More complex features, such as least squares regression coefficients, can also be computed in this way
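
A minimal sketch of what "combinable" means for these statistics. The class name `BucketStats` and its methods are hypothetical, not Prelert's API: each data node summarises its own values, and the summaries merge into exactly the result a single pass over all the data would give.

```python
from dataclasses import dataclass


@dataclass
class BucketStats:
    """Combinable summary of the values seen in one time interval."""
    count: int = 0
    minimum: float = float("inf")
    maximum: float = float("-inf")
    mean: float = 0.0

    def add(self, value):
        """Fold one value into the summary."""
        self.count += 1
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        self.mean += (value - self.mean) / self.count

    def merge(self, other):
        """Combine two summaries computed on different nodes."""
        total = self.count + other.count
        if total == 0:
            return BucketStats()
        mean = (self.count * self.mean + other.count * other.mean) / total
        return BucketStats(total,
                           min(self.minimum, other.minimum),
                           max(self.maximum, other.maximum),
                           mean)


# Two nodes each summarise their local values; only the summaries travel.
node_a, node_b = BucketStats(), BucketStats()
for v in (1.0, 2.0, 3.0):
    node_a.add(v)
for v in (10.0, 20.0):
    node_b.add(v)
combined = node_a.merge(node_b)
print(combined)
```

The same pattern extends to regression coefficients by carrying the sums needed for the normal equations instead of the mean.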

Slide 42

Slide 42 text

DESIGNING ANALYTICS FOR SCALE •  Allows us to scale and means the runtime of the analytics is predictable at very different data rates •  One might think that this is a disadvantage in terms of accuracy. However, it gives different insight into the data (for simple bulk and structure anomaly detection) •  In the following time series, the process mean changes, but only by a small amount compared to the additive noise •  However, the change is clear when looking at the time window means
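
The effect can be reproduced on synthetic data. The numbers below (noise standard deviation 1, mean shift 0.3, windows of 200 points) are assumptions chosen for illustration, not the values behind the slides' figures.

```python
import random
from statistics import mean

# Synthetic series: the process mean shifts by 0.3 halfway through,
# which is small compared with the unit-variance additive noise.
rng = random.Random(42)
noise_sd, shift, window = 1.0, 0.3, 200
series = [rng.gauss(0.0, noise_sd) for _ in range(2000)]
series += [rng.gauss(shift, noise_sd) for _ in range(2000)]


def window_means(values, size):
    """Means of consecutive non-overlapping windows."""
    return [mean(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


means = window_means(series, window)
print("before:", [round(m, 2) for m in means[:10]])
print("after: ", [round(m, 2) for m in means[10:]])
```

Averaging 200 points shrinks the noise standard deviation by a factor of about 14, so a shift that is invisible in individual points becomes several standard deviations in the window means.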

Slide 43

Slide 43 text

DESIGNING ANALYTICS FOR SCALE

Slide 44

Slide 44 text

DESIGNING ANALYTICS FOR SCALE

Slide 45

Slide 45 text

DESIGNING ANALYTICS FOR SCALE

Slide 46

Slide 46 text

DESIGNING ANALYTICS FOR SCALE

Slide 47

Slide 47 text

DESIGNING ANALYTICS FOR SCALE •  Online algorithms typically get to see each piece of data exactly once •  Suitable for data streaming environment, particularly where data characteristics are evolving •  Also allow one to scale to arbitrarily large data sets, provided one can keep up with the data rate •  Optimization is also possible when the objective is amenable, e.g. SGD family of techniques •  Can make some things challenging, such as clustering •  Advice: it’s worth it

Slide 48

Slide 48 text

ANALYTICS CASE STUDY – PERIODICITY •  Periodicity is a good predictor for many real world data sets •  Advice: there is no danger of overfitting, since there’s plenty of data, so use a fully non-parametric approach •  Interestingly, if [condition not recovered] and [term not recovered] is mean zero •  Is solved by [formula not recovered]

Slide 49

Slide 49 text

ANALYTICS CASE STUDY – PERIODICITY •  Bucket data, compute means and then adapt standard spline formulation to handle mean value constraint rather than knot value constraints (very efficient eliminations still exist) •  Splines are particularly well suited to cases when there are near step changes or periodic spikes •  Means can be computed online storing only the current mean and count •  Advice: always think about your memory QoR curve, i.e. adapt your bucketing so that it has its highest resolution where function is changing fastest
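
The bucketing step can be sketched as below. This covers only the online bucket means, with equal-width buckets; the spline fit through the bucket means and the adaptive bucketing advised above are not implemented, and the class name `PeriodicBucketMeans` is hypothetical.

```python
class PeriodicBucketMeans:
    """Online estimate of a periodic pattern: one running mean per bucket
    of the period, storing only a mean and a count for each bucket."""

    def __init__(self, period, n_buckets):
        self.period = period
        self.n_buckets = n_buckets
        self.counts = [0] * n_buckets
        self.means = [0.0] * n_buckets

    def _bucket(self, t):
        # Position within the period, mapped to a bucket index.
        return int((t % self.period) * self.n_buckets // self.period)

    def add(self, t, value):
        """Update the running mean of the bucket containing time t."""
        b = self._bucket(t)
        self.counts[b] += 1
        self.means[b] += (value - self.means[b]) / self.counts[b]

    def predict(self, t):
        """Expected value at time t according to the periodic pattern."""
        return self.means[self._bucket(t)]


# Toy daily pattern: high load during working hours, low load otherwise.
DAY, HOUR = 86400, 3600
model = PeriodicBucketMeans(period=DAY, n_buckets=24)
for day in range(3):
    for hour in range(24):
        load = 100.0 if 9 <= hour < 17 else 10.0
        model.add(day * DAY + hour * HOUR + 1800, load)

print(model.predict(10 * HOUR + 900), model.predict(3 * HOUR))
```

Each observation is folded in once and then discarded, so memory stays at two numbers per bucket however long the stream runs.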

Slide 50

Slide 50 text

ANALYTICS CASE STUDY – CLUSTERING •  Clustering data streams is hard because irreversible decisions can have bad consequences •  Advice: avoid making them if at all possible •  Sketch data structures help here, see for example BIRCH •  If you are using mixture models, say a mixture of Gaussians, think about the memory QoR curve again, i.e. using spherically symmetric versus full covariance matrices means that you can use more components for the same total memory
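
As an illustration of the kind of sketch BIRCH uses, the block below implements a clustering feature: the point count, the per-dimension linear sum and the total squared norm. The class and method names are hypothetical, and the tree that BIRCH builds on top of these features is not shown.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterFeature:
    """BIRCH-style summary of a group of points."""
    n: int
    linear_sum: list
    square_sum: float

    @classmethod
    def from_point(cls, point):
        return cls(1, list(point), sum(x * x for x in point))

    def merge(self, other):
        """Summary of the union of two groups, without revisiting any points."""
        return ClusterFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.linear_sum, other.linear_sum)],
            self.square_sum + other.square_sum)

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        """Root mean squared distance of the points from the centroid."""
        c = self.centroid()
        return math.sqrt(max(self.square_sum / self.n - sum(x * x for x in c), 0.0))


cf = ClusterFeature.from_point((0.0, 0.0)).merge(ClusterFeature.from_point((2.0, 0.0)))
print(cf.centroid(), cf.radius())
```

Because features are additive, small groups can be kept separate for as long as memory allows and merged later, which postpones the irreversible decisions the slide warns about.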

Slide 51

Slide 51 text

ANALYTICS CASE STUDY – CLUSTERING

Slide 52

Slide 52 text

ANALYTICS CASE STUDY – CLUSTERING

Slide 53

Slide 53 text

ANALYTICS CASE STUDY – CLUSTERING

Slide 54

Slide 54 text

ANALYTICS CASE STUDY: TAIL BEHAVIOUR •  The Normal Distribution doesn’t visually appear to be too bad. However, for anomaly detection the key part of the model is the tail of the distribution: •  The error in calculating the probability of values in the tail becomes more and more significant with the Normal Distribution •  Given there are 3263 samples and 4 samples are > 12, Prelert’s probability (of order 10^-3) is a more reasonable estimate than the Normal distribution’s (of order 10^-6)

Value  Reference Probability*  Probability Using Normal Distribution  Probability Using Prelert
1      1                       0.319498                               0.560764
2      0.71835734              0.620394                               0.834183
3      0.493717438             0.996086                               0.49202
4      0.329757892             0.613486                               0.27675
5      0.200122587             0.314752                               0.157113
6      0.121973644             0.132196                               0.0906333
7      0.065277352             0.0448887                              0.0532552
8      0.033404842             0.0122142                              0.0318895
9      0.033404842             0.00264629                             0.0194524
10     0.015629789             0.000454401                            0.012078
11     0.006435795             6.16E-05                               0.00762581
12     0.002758198             6.58E-06                               0.00489119
13     0.001225866             5.53E-07                               0.00318382
14     0.000919399             3.65E-08                               0.00210125
15     0.000612933             1.89E-09                               0.00140482
16     ?                       7.64E-11                               0.000950655
17     ?                       2.42E-12                               0.000650663
18     ?                       6.01E-14                               0.000450111
19     ?                       1.17E-15                               0.000314514
20     ?                       1.77E-17                               0.000221852
* Tuned kernel estimate

[Chart: Error Magnitude (log scale, 1.00E+00 to 1.00E+06) against value (1 to 15); series: Normal Distribution; annotation: Anomalies (<1% probability)]
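
The size of the error can be read straight off the table. The sketch below takes the tabulated probabilities for values 1 to 15 and computes, for each model, the factor by which it differs from the reference. Defining "error magnitude" as this ratio is an assumption about what the chart plots, and the function name is hypothetical.

```python
# Probabilities copied from the table above, for values 1 to 15.
reference = [1, 0.71835734, 0.493717438, 0.329757892, 0.200122587,
             0.121973644, 0.065277352, 0.033404842, 0.033404842, 0.015629789,
             0.006435795, 0.002758198, 0.001225866, 0.000919399, 0.000612933]
normal = [0.319498, 0.620394, 0.996086, 0.613486, 0.314752,
          0.132196, 0.0448887, 0.0122142, 0.00264629, 0.000454401,
          6.16e-05, 6.58e-06, 5.53e-07, 3.65e-08, 1.89e-09]
prelert = [0.560764, 0.834183, 0.49202, 0.27675, 0.157113,
           0.0906333, 0.0532552, 0.0318895, 0.0194524, 0.012078,
           0.00762581, 0.00489119, 0.00318382, 0.00210125, 0.00140482]


def error_magnitude(estimate, truth):
    """Factor by which an estimate differs from the reference (always >= 1)."""
    return max(estimate / truth, truth / estimate)


normal_error = [error_magnitude(p, r) for p, r in zip(normal, reference)]
prelert_error = [error_magnitude(p, r) for p, r in zip(prelert, reference)]

for value, (n_err, p_err) in enumerate(zip(normal_error, prelert_error), start=1):
    print(f"{value:2d}  normal x{n_err:10.1f}  prelert x{p_err:5.2f}")
```

On this reading the Normal model's error grows by orders of magnitude towards the tail, while the Prelert estimate stays within a small factor of the reference throughout.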