Slide 1

Slide 1 text

Anomaly detection, Mass Volume curve, Extreme regions

The Mass Volume curve, a performance metric for unsupervised anomaly detection
Rare Events, Extremes and Machine Learning Workshop, May 24th, 2018
Albert Thomas (Huawei Technologies - Télécom ParisTech - Airbus Group Innovations)
Joint work with Stephan Clémençon, Alexandre Gramfort, Vincent Feuillard and Anne Sabourin.

Slide 2

Slide 2 text

Outline
1. Unsupervised anomaly detection
2. The Mass Volume curve
3. Anomaly detection in extreme regions

Slide 3

Slide 3 text

Unsupervised anomaly detection
1. Find anomalies in an unlabeled data set $X_1, \dots, X_n$
2. Anomalies are assumed to be rare events

Slide 4

Slide 4 text

Unsupervised anomaly detection
$X_1, \dots, X_n \in \mathbb{R}^d$: unlabeled data set, i.i.d. $\sim P$ with density $f$ w.r.t. the Lebesgue measure $\lambda$
Anomalies = rare events
Goal: estimate a density level set

Slide 5

Slide 5 text

Minimum Volume set (Polonik, 1997)
A density level set is a Minimum Volume set, i.e. a solution of
$$\min_{\Omega \in \mathcal{B}(\mathbb{R}^d)} \lambda(\Omega) \quad \text{such that} \quad P(\Omega) \ge \alpha$$
[Figure: two-dimensional sample with a minimum volume set]
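In one dimension the minimum volume set can be computed directly, which makes the definition concrete. A minimal sketch, assuming a standard normal sample (the distribution and all numerical values are illustrative, not from the slides): for a symmetric unimodal density, the minimum volume set of mass $\alpha$ is the symmetric interval $[-z, z]$ with $P(|X| \le z) = \alpha$.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(100_000)  # sample from N(0, 1), an assumed toy example

alpha = 0.9
# The minimum volume set of mass alpha for a symmetric unimodal density
# is [-z, z] with P(|X| <= z) = alpha, i.e. z is the alpha-quantile of |X|.
z = np.quantile(np.abs(x), alpha)
volume = 2 * z  # Lebesgue measure of the interval [-z, z]

# The exact value is 2 * 1.6449 ~ 3.29 for alpha = 0.9.
print(volume)
```

Any larger-volume set of the same mass (e.g. a union of two disjoint intervals avoiding the mode) is suboptimal, which is why, without flat parts, the solution coincides with a density level set.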

Slide 6

Slide 6 text

Minimum volume set = density level set
1. A density level set is always a minimum volume set.
2. If $f$ has no flat parts, a minimum volume set is a density level set.
(Einmahl and Mason, 1992), (Polonik, 1997), (Nunez-Garcia et al., 2003)

Slide 7

Slide 7 text

Unsupervised anomaly detection algorithms
Common approach for most algorithms:
1. Learn a scoring function $\hat{s} : x \in \mathbb{R}^d \mapsto \mathbb{R}$ such that the smaller $\hat{s}(x)$, the more abnormal $x$ is.
2. Threshold $\hat{s}$ at an offset $q$ such that $\hat{\Omega}_\alpha = \{x, \hat{s}(x) \ge q\}$ is an estimate of the Minimum Volume set with mass $\alpha$.
→ density estimation (Cadre et al., 2013), One-Class SVM / Support Vector Data Description (Schölkopf et al., 2001), (Vert and Vert, 2006), (Tax et al., 2004), k-NN (Sricharan and Hero, 2011)
→ Isolation Forest (Liu et al., 2008) and Local Outlier Factor (Breunig et al., 2000)
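The two-step approach can be sketched with scikit-learn's One-Class SVM (the Gaussian toy data and the bandwidth value are assumptions of this illustration; the slides do not prescribe an implementation):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(500, 2)  # unlabeled sample (toy data, an assumption here)

# Step 1: learn a scoring function s_hat (the lower the score, the more
# abnormal the point); gamma = 0.5 is an arbitrary illustrative choice.
model = OneClassSVM(gamma=0.5).fit(X)
scores = model.score_samples(X)

# Step 2: threshold s_hat at the offset q so that the estimated level set
# {x : s_hat(x) >= q} has empirical mass alpha.
alpha = 0.95
q = np.quantile(scores, 1 - alpha)
is_normal = scores >= q  # points inside the estimated minimum volume set

print(is_normal.mean())  # close to alpha
```

The same thresholding step applies unchanged to any scoring function, e.g. one produced by Isolation Forest or a density estimator.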

Slide 8

Slide 8 text

Ideal scoring functions
Ideal scoring functions $s$ preserve the order induced by the density $f$:
$$s(x_1) \le s(x_2) \iff f(x_1) \le f(x_2)$$
i.e. $s$ is a strictly increasing transform of $f$.
[Figure: density $f$ together with the transforms $f(x - 0.05)$ and $f(x) + 2$]
$s$ does not need to be close to $f$ in the sense of an $L_p$ norm.

Slide 9

Slide 9 text

Scoring function of the One-Class SVM
[Figure: Gaussian mixture density $f$ and the One-Class SVM scoring function $s$]
Asymptotically constant near the modes and proportional to the density in the low-density regions (Vert and Vert, 2006).

Slide 10

Slide 10 text

Problem
One-Class SVM with Gaussian kernel $k_\sigma$: the user needs to choose the bandwidth $\sigma$ of the kernel.
[Figure: estimated vs. true level sets; $\sigma = 0.5$ overfits, $\sigma = 10$ underfits]
How to automatically choose $\sigma$?

Slide 11

Slide 11 text

Problem
Given a data set $S_n = (X_1, \dots, X_n)$, a hyperparameter space $\Theta$ and an unsupervised anomaly detection algorithm
$$A : S_n \times \Theta \to \mathbb{R}^{\mathbb{R}^d}, \quad (S_n, \theta) \mapsto \hat{s}_\theta$$
How to assess the performance of $\hat{s}_\theta$ without a labeled data set?
(Thomas, Clémençon, Feuillard, Gramfort, Anomaly Detection Workshop, ICML 2016)

Slide 12

Slide 12 text

Mass Volume curve
$X \sim P$, scoring function $s : \mathbb{R}^d \to \mathbb{R}$, $t$-level set of $s$: $\{x, s(x) \ge t\}$
$\alpha_s(t) = P(s(X) \ge t)$: mass of the $t$-level set
$\lambda_s(t) = \lambda(\{x, s(x) \ge t\})$: volume of the $t$-level set

Slide 13

Slide 13 text

Mass Volume curve
Mass Volume curve $MV_s$ of a scoring function $s$ (Clémençon and Jakubowicz, 2013), (Clémençon and Thomas, 2017):
$$t \in \mathbb{R} \mapsto (\alpha_s(t), \lambda_s(t))$$
[Figure: two scoring functions $f$ and $s$, and their Mass Volume curves $MV_f$ and $MV_s$]

Slide 14

Slide 14 text

Mass Volume curve
Easier to work with the following definition: $MV_s$ is the plot of the function
$$MV_s : \alpha \in (0, 1) \mapsto \lambda_s(\alpha_s^{-1}(\alpha)) = \lambda(\{x, s(x) \ge \alpha_s^{-1}(\alpha)\})$$
where $\alpha_s^{-1}$ is the generalized inverse of $\alpha_s$.
Property (Clémençon and Jakubowicz, 2013), (Clémençon and Thomas, 2017): assume that the underlying density $f$ has no flat parts; then for all scoring functions $s$,
$$\forall \alpha \in (0, 1), \quad MV^*(\alpha) \overset{def}{=} MV_f(\alpha) \le MV_s(\alpha)$$
The lower $MV_s$, the better $s$.

Slide 15

Slide 15 text

Learning hyperparameters
Consider $\hat{s}_\theta = A(S_n, \theta)$. Choose $\theta^*$ minimizing the area under $MV_{\hat{s}_\theta}$.
As $MV_{\hat{s}_\theta}$ depends on $P$, use the empirical MV curve estimated on a test set:
$$\widehat{MV}_s : \alpha \in [0, 1) \mapsto \lambda_s(\hat{\alpha}_s^{-1}(\alpha)) \quad \text{where} \quad \hat{\alpha}_s(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{x,\, s(x) \ge t\}}(X_i)$$
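A minimal sketch of the empirical MV curve: the quantile $\hat{\alpha}_s^{-1}(\alpha)$ is taken from a test sample, and the level-set volume is approximated by Monte Carlo on a bounding box (the box, the sample sizes, and the $s(x) = -|x|$ sanity check are all assumptions of this illustration):

```python
import numpy as np

def empirical_mv(score, X_test, alphas, n_mc=100_000, box=(-6.0, 6.0),
                 seed=0):
    """Empirical Mass Volume curve of a scoring function `score`.

    The quantile alpha_s^{-1}(alpha) is estimated on X_test; the volume
    of each level set {x : score(x) >= t} is estimated by Monte Carlo on
    a box assumed to contain the support (an assumption of this sketch).
    """
    rng = np.random.RandomState(seed)
    d = X_test.shape[1]
    U = rng.uniform(box[0], box[1], size=(n_mc, d))
    box_volume = (box[1] - box[0]) ** d
    # thresholds hat{alpha}_s^{-1}(alpha): empirical (1 - alpha)-quantiles
    thresholds = np.quantile(score(X_test), 1 - np.asarray(alphas))
    s_unif = score(U)
    return np.array([box_volume * np.mean(s_unif >= t) for t in thresholds])

# Sanity check on N(0, 1): any increasing transform of the density, e.g.
# s(x) = -|x|, should give MV_s(alpha) = 2 * Phi^{-1}((1 + alpha) / 2),
# i.e. about 1.35 at alpha = 0.5 and about 3.29 at alpha = 0.9.
rng = np.random.RandomState(1)
X = rng.randn(5000, 1)
vols = empirical_mv(lambda x: -np.abs(x[:, 0]), X, alphas=[0.5, 0.9])
```

Monte Carlo volume estimation is only practical in low dimension; this is a known limitation of volume-based criteria rather than of this particular sketch.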

Slide 16

Slide 16 text

Algorithm
Choose $\hat{\theta}$ minimizing the area under the Mass Volume curve, $AMV_{\hat{s}_\theta}$.
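The selection step can be sketched as a grid search over the One-Class SVM bandwidth, scoring each candidate by the area under its empirical MV curve (the data, the Monte Carlo box, and the `gamma` grid are assumptions of this illustration; the slides do not fix these choices):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
X = rng.randn(600, 2)
X_train, X_test = X[:300], X[300:]

# Monte Carlo points for the volume term, on a box assumed to contain
# the bulk of the support (an assumption of this sketch).
U = rng.uniform(-5, 5, size=(50_000, 2))
box_volume = 10.0 ** 2
alphas = np.linspace(0.9, 0.99, 10)  # the interval I = [0.9, 0.99]

def area_under_mv(model):
    """Area under the empirical MV curve over I (simple Riemann sum)."""
    s_test = model.score_samples(X_test)
    s_unif = model.score_samples(U)
    thresholds = np.quantile(s_test, 1 - alphas)
    vols = np.array([box_volume * np.mean(s_unif >= t) for t in thresholds])
    return vols.mean() * (alphas[-1] - alphas[0])

# Hyperparameter grid chosen purely for illustration.
grid = [0.01, 0.1, 1.0, 10.0]
amvs = {g: area_under_mv(OneClassSVM(gamma=g).fit(X_train)) for g in grid}
best_gamma = min(amvs, key=amvs.get)
```

The companion repository linked in the references implements the full procedure.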

Slide 17

Slide 17 text

Aggregation
For each random split we may obtain a different $\hat{\theta}$.
→ To reduce the variance of the estimator, consider $B$ random splits of the data set; for each random split $b$ we get $\hat{\theta}_b$ and $\hat{s}^b_{\hat{\theta}_b}$; final scoring function:
$$\hat{S} = \frac{1}{B} \sum_{b=1}^{B} \hat{s}^b_{\hat{\theta}_b}$$
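The aggregation step amounts to averaging the per-split scoring functions. A minimal sketch, assuming One-Class SVM base models with a fixed bandwidth (in the full procedure the hyperparameter would be re-tuned on each split; fixing it keeps the sketch short):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(300, 2)

# B random 80/20 splits of the data set.
B = 5
models = []
splitter = ShuffleSplit(n_splits=B, test_size=0.2, random_state=0)
for train_idx, _ in splitter.split(X):
    # gamma is fixed here for brevity; each split would normally select
    # its own hyperparameter by minimizing the empirical AMV.
    models.append(OneClassSVM(gamma=0.5).fit(X[train_idx]))

def s_aggregated(x):
    """Final scoring function: average of the B per-split scores."""
    return np.mean([m.score_samples(x) for m in models], axis=0)

center = s_aggregated(np.zeros((1, 2)))[0]   # high-density point
far = s_aggregated(np.full((1, 2), 6.0))[0]  # point far from the data
```

The averaged score inherits the ordering behavior of its components, so points near the bulk of the data score higher than remote ones.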

Slide 18

Slide 18 text

Toy example
Minimum Volume set estimation with the One-Class SVM and $B = 50$ ($\alpha = 0.95$)
[Figure: estimated and true minimum volume sets]

Slide 19

Slide 19 text

Experiments
Given an anomaly detection algorithm $A$, compare:
our approach → $s_{tuned}$
a priori fixed hyperparameters → $s_{fixed}$
Performance criterion: relative gain
$$G_A(s_{tuned}, s_{fixed}) = \frac{AMV_I(s_{fixed}) - AMV_I(s_{tuned})}{AMV_I(s_{fixed})}$$
where $AMV_I$ is the area under the MV curve over the interval $I = [0.9, 0.99]$, computed on left-out data.
If $G_A(s_{tuned}, s_{fixed}) > 0$ then $s_{tuned}$ is better than $s_{fixed}$.

Slide 20

Slide 20 text

Results
For $s_{tuned}$ we consider 50 random splits (80/20).
[Figure: relative gains (%) on the GM (d=2), GM (d=4), Banana and HTTP data sets, for n = 100 and n = 500]
KLPE (Sricharan and Hero, 2011); aKLPE: Average KLPE (Qian and Saligrama, 2012); OCSVM: One-Class SVM (Schölkopf et al., 2001); iForest: Isolation Forest (Liu et al., 2008); KS: Kernel Smoothing

Slide 21

Slide 21 text

Consistency of $\widehat{MV}_s$
We used $\widehat{MV}_s$ as an estimate of $MV_s$, where
$$\forall \alpha \in [0, 1), \quad \widehat{MV}_s(\alpha) = \lambda(\{x, s(x) \ge \hat{\alpha}_s^{-1}(\alpha)\})$$
with $\hat{\alpha}_s^{-1}$ the generalized inverse of $\hat{\alpha}_s$.
Two questions:
1. Consistency of $\widehat{MV}_s$ as $n \to \infty$?
2. How to build confidence regions?

Slide 22

Slide 22 text

Consistency
Let $s$ be a scoring function and $\varepsilon \in (0, 1]$.
Assumptions:
- conditions on the random variable $s(X)$ and its distribution $F_s$
- $\lambda_s$ is $C^2$
Theorem (Clémençon and Thomas, 2017)
(i) Consistency. With probability one,
$$\sup_{\alpha \in [0, 1 - \varepsilon]} |\widehat{MV}_s(\alpha) - MV_s(\alpha)| \underset{n \to \infty}{\longrightarrow} 0$$
(ii) Functional CLT. There exists a sequence of Brownian bridges $\{B_n(\alpha)\}_{\alpha \in [0, 1]}$ such that, almost surely, uniformly over $[0, 1 - \varepsilon]$, as $n \to \infty$,
$$\sqrt{n}\left(\widehat{MV}_s(\alpha) - MV_s(\alpha)\right) = \frac{\lambda_s'(\alpha_s^{-1}(\alpha))}{f_s(\alpha_s^{-1}(\alpha))} B_n(\alpha) + O(n^{-1/2} \log n)$$

Slide 23

Slide 23 text

Confidence bands using the smoothed bootstrap
[Figure: $MV_f$ with a 90% confidence band]
Rates, up to $\log n$ factors (Clémençon and Thomas, 2017):
- Functional CLT: rate in $O(n^{-1/2})$, but requires knowledge of $f_s$
- Naive (non-smoothed) bootstrap: rate in $O(n^{-1/4})$
- Smoothed bootstrap: rate in $O(n^{-2/5})$

Slide 24

Slide 24 text

Anomaly detection in extreme regions
Goal of multivariate Extreme Value Theory: model the tail of the distribution $F$ of a multivariate random variable $X = (X^{(1)}, \dots, X^{(d)})$.
Motivation for unsupervised anomaly detection:
Anomalies are likely to be located in extreme regions, i.e. regions far from the mean $E[X]$.
The lack of data in these regions makes it difficult to distinguish between large normal instances and anomalies.
Relying on multivariate Extreme Value Theory, we suggest an algorithm to detect anomalies in extreme regions.

Slide 25

Slide 25 text

Multivariate Extreme Value Theory
Common approach: standardization to unit Pareto margins, $T(X) = V \in \mathbb{R}^d$, where
$$V^{(j)} = \frac{1}{1 - F_j(X^{(j)})}, \quad \forall j \in \{1, \dots, d\}$$
with $F_j$, $1 \le j \le d$, the margins. Note that $X$ and $V$ share the same dependence structure/copula.
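In practice the margins $F_j$ are unknown and are replaced by their empirical counterparts, giving a rank transform. A minimal sketch (the heavy-tailed toy sample and the $n + 1$ normalization, which avoids division by zero, are assumptions of this illustration):

```python
import numpy as np

def to_unit_pareto(X):
    """Rank-transform each margin to the unit-Pareto scale:
    V^(j) = 1 / (1 - F_j(X^(j))), with the empirical margin
    F_j_hat(x) = rank / (n + 1)."""
    n = X.shape[0]
    # rank of each observation within its column, from 1 to n
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    return 1.0 / (1.0 - ranks / (n + 1.0))

rng = np.random.RandomState(0)
X = rng.standard_t(df=3, size=(100, 2))  # heavy-tailed toy sample
V = to_unit_pareto(X)
```

Because the transform is monotone per margin, $V$ keeps the same copula as $X$, which is exactly the property the slides rely on.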

Slide 26

Slide 26 text

Multivariate Extreme Value Theory
$(r(v), \varphi(v)) = (\|v\|_\infty, v / \|v\|_\infty)$: polar coordinates; $S_{d-1}$: positive orthant of the unit hypercube.
Theorem (Resnick, 1987). Under mild assumptions on the distribution $F$, there exists a finite (angular) measure $\Phi$ on $S_{d-1}$ such that for all $\Omega \subset S_{d-1}$,
$$\Phi_t(\Omega) \overset{def}{=} t \cdot P(r(V) > t, \varphi(V) \in \Omega) \underset{t \to \infty}{\longrightarrow} \Phi(\Omega)$$

Slide 27

Slide 27 text

Main idea
Anomalies in extreme regions: observations that deviate from the dependence structure of the tail.
[Figure: extreme observations under dependence vs. independence]
To find the most likely directions of the extreme observations, we estimate a Minimum Volume set of the angular measure $\Phi$.

Slide 28

Slide 28 text

Minimum volume set estimation on the sphere
Solve the empirical optimization problem (Scott and Nowak, 2006)
$$\min_{\Omega \in G} \lambda_d(\Omega) \quad \text{subject to} \quad \hat{\Phi}_{n,k}(\Omega) \ge \alpha - \psi_k(\delta)$$
where $\hat{\Phi}_{n,k}$ is estimated from $t \cdot P(r(V) > t, \varphi(V) \in \Omega)$ with $t = n/k$:
$$\hat{\Phi}_{n,k}(\Omega) = \frac{1}{k} \sum_{i=1}^{n} \mathbb{1}_{\{r(V_i) \ge n/k, \; \varphi(V_i) \in \Omega\}}$$
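The empirical angular measure above is a simple counting estimator. A minimal sketch with a hypothetical hand-built sample (the ten points, $k = 2$, and the example regions are all assumptions of this illustration):

```python
import numpy as np

def empirical_angular_mass(V, k, in_omega):
    """Phi_hat_{n,k}(Omega): (1/k) times the number of observations with
    r(V_i) >= n/k whose angle phi(V_i) = V_i / r(V_i) falls in Omega,
    with r(v) = ||v||_inf."""
    n = V.shape[0]
    r = np.max(V, axis=1)   # sup-norm radius r(v)
    phi = V / r[:, None]    # angular component on the sup-norm sphere
    extreme = r >= n / k
    return np.sum(extreme & in_omega(phi)) / k

# Hypothetical sample: n = 10, k = 2, so the radial threshold is n/k = 5;
# only the first two points are extreme.
V = np.array([[6.0, 1.0], [1.0, 8.0], [2.0, 2.0], [1.0, 1.0], [3.0, 1.0],
              [1.0, 2.0], [0.5, 4.0], [1.5, 1.0], [2.5, 0.5], [1.0, 3.0]])
# Omega = whole sphere: both extreme points are counted, mass = 2/2 = 1.
mass_all = empirical_angular_mass(V, k=2,
                                  in_omega=lambda p: np.ones(len(p), bool))
# Omega = angles closer to the first axis: only [6, 1] counts, mass = 0.5.
mass_axis1 = empirical_angular_mass(V, k=2,
                                    in_omega=lambda p: p[:, 0] >= p[:, 1])
```

The minimum volume set on the sphere is then searched within a class $G$ of candidate regions, scoring each candidate with this estimator.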

Slide 29

Slide 29 text

Theorem
$\Omega^*_\alpha$ being the true Minimum Volume set, define the excess risk
$$R(\Omega) = \left(\lambda_d(\Omega) - \lambda_d(\Omega^*_\alpha)\right)_+ + \left(\alpha - \Phi_{n/k}(\Omega)\right)_+$$
Theorem (Thomas, Clémençon, Gramfort, Sabourin, 2017). Assume that
- the margins $F_j$ are known,
- the class $G$ is of finite VC dimension,
- common assumptions for existence and uniqueness of MV sets are fulfilled by $\Phi_{n/k}$.
Then there exists a constant $C > 0$ such that
$$E[R(\hat{\Omega}_\alpha)] \le \left(\inf_{\Omega \in G_\alpha} \lambda_d(\Omega) - \lambda_d(\Omega^*_\alpha)\right) + C \sqrt{\frac{\log k}{k}}$$
where $G_\alpha = \{\Omega \in G, \Phi_{n/k}(\Omega) \ge \alpha\}$.

Slide 30

Slide 30 text

Numerical experiments
Global scoring function: $\hat{s}(r(v), \varphi(v)) = 1/r(v)^2 \cdot \hat{s}_\varphi(\varphi(v))$.

Data set      OCSVM    Isolation Forest    Score $\hat{s}$
shuttle       0.981    0.963               0.987
SF            0.478    0.251               0.660
http          0.997    0.662               0.964
ann           0.372    0.610               0.518
forestcover   0.540    0.516               0.646

ROC-AUC scores are computed on test sets made of normal and abnormal instances and restricted to the extreme region ($k = \sqrt{n}$).

Slide 31

Slide 31 text

References
- Mass Volume curves and anomaly ranking. S. Clémençon, A. Thomas. Submitted, 2017.
- Learning hyperparameters for unsupervised anomaly detection. A. Thomas, S. Clémençon, V. Feuillard, A. Gramfort. Anomaly Detection Workshop, ICML 2016.
- Anomaly detection in extreme regions via empirical MV sets on the sphere. A. Thomas, S. Clémençon, A. Gramfort, A. Sabourin. AISTATS 2017.
Code for hyperparameter tuning available on GitHub: https://github.com/albertcthomas/anomaly_tuning

Slide 32

Slide 32 text

Thank you