The Mass Volume curve, a performance metric for unsupervised anomaly detection

Rare Events, Extremes and Machine Learning Workshop, May 24th, 2018

Albert Thomas
Huawei Technologies - Télécom ParisTech - Airbus Group Innovations

Joint work with Stephan Clémençon, Alexandre Gramfort, Vincent Feuillard and Anne Sabourin.
(Polonik, 1997) A density level set is a Minimum Volume set, i.e. a solution of

$$\min_{\Omega \in \mathcal{B}(\mathbb{R}^d)} \lambda(\Omega) \quad \text{such that} \quad P(\Omega) \geq \alpha$$

[Figure: two-dimensional sample with an estimated Minimum Volume set]
density level set

1. A density level set is always a minimum volume set.
2. If $f$ has no flat parts, a minimum volume set is a density level set.

(Einmahl and Mason, 1992), (Polonik, 1997), (Nunez-Garcia et al., 2003)
algorithms

Common approach for most algorithms:

1. Learn a scoring function $\hat{s} : x \in \mathbb{R}^d \mapsto \mathbb{R}$ such that the smaller $\hat{s}(x)$, the more abnormal $x$ is.
2. Threshold $\hat{s}$ at an offset $q$ such that $\hat{\Omega}_\alpha = \{x, \hat{s}(x) \geq q\}$ is an estimate of the Minimum Volume set with mass $\alpha$.

→ density estimation (Cadre et al., 2013), One-Class SVM / Support Vector Data Description (Schölkopf et al., 2001), (Vert and Vert, 2006), (Tax et al., 2004), k-NN (Sricharan and Hero, 2011)
→ Isolation Forest (Liu et al., 2008) and Local Outlier Factor (Breunig et al., 2000)
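The thresholding step above can be sketched with an empirical quantile of the scores; a minimal sketch, where the Gaussian scores stand in for the learned $\hat{s}(X_i)$ (any actual algorithm would supply them):

```python
import numpy as np

def offset_at_mass(scores, alpha):
    """Offset q such that the level set {x : s(x) >= q} has
    empirical mass alpha on the training scores."""
    # the (1 - alpha)-quantile of the scores leaves mass alpha above it
    return np.quantile(scores, 1.0 - alpha)

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)        # stand-in for the learned scores s_hat(X_i)
q = offset_at_mass(scores, alpha=0.95)
mass = np.mean(scores >= q)           # close to 0.95 by construction
```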
Ideal scoring functions

$s$ preserves the order induced by the density $f$:

$$s(x_1) \leq s(x_2) \iff f(x_1) \leq f(x_2)$$

i.e. $s$ is a strictly increasing transform of $f$.

[Figure: density $f$, the shift $f(x - 0.05)$ and the transform $f(x) + 2$]

$s$ does not need to be close to $f$ in the sense of an $L_p$ norm.
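To illustrate, any strictly increasing transform of $f$ ranks observations exactly as $f$ does; a toy check with the standard normal density (the transform $\log f + 2$ is one hypothetical choice, far from $f$ in any $L_p$ sense):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)

f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # standard normal density
s = lambda x: np.log(f(x)) + 2.0                      # strictly increasing transform of f

# s induces the same ordering as f, hence flags the same
# points as abnormal, even though |s - f| is large
same_order = np.array_equal(np.argsort(f(x)), np.argsort(s(x)))
```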
the One-Class SVM

[Figure: Gaussian mixture density $f$ and One-Class SVM scoring function $s$]

Asymptotically constant near the modes and proportional to the density in the low-density regions (Vert and Vert, 2006).
data set $S_n = (X_1, \dots, X_n)$, a hyperparameter space $\Theta$ and an unsupervised anomaly detection algorithm

$$A : \mathcal{S}_n \times \Theta \to \mathbb{R}^{\mathbb{R}^d}, \quad (S_n, \theta) \mapsto \hat{s}_\theta$$

How to assess the performance of $\hat{s}_\theta$ without a labeled data set?

(Anomaly Detection Workshop, Thomas, Clémençon, Feuillard, Gramfort, ICML 2016)
$X \sim P$, scoring function $s : \mathbb{R}^d \to \mathbb{R}$, $t$-level set of $s$: $\{x, s(x) \geq t\}$

$\alpha_s(t) = P(s(X) \geq t)$: mass of the $t$-level set
$\lambda_s(t) = \lambda(\{x, s(x) \geq t\})$: volume of the $t$-level set
Mass Volume curve $MV_s$ of a scoring function $s$ (Clémençon and Jakubowicz, 2013), (Clémençon and Thomas, 2017):

$$t \in \mathbb{R} \mapsto (\alpha_s(t), \lambda_s(t))$$

[Figure: scoring functions $f$ and $s$ (left) and their Mass Volume curves $MV_f$ and $MV_s$ (right)]
Easier to work with the following definition: $MV_s$ is defined as the plot of the function

$$MV_s : \alpha \in (0, 1) \mapsto \lambda_s(\alpha_s^{-1}(\alpha)) = \lambda(\{x, s(x) \geq \alpha_s^{-1}(\alpha)\})$$

where $\alpha_s^{-1}$ is the generalized inverse of $\alpha_s$.

Property (Clémençon and Jakubowicz, 2013), (Clémençon and Thomas, 2017): assume that the underlying density $f$ has no flat parts; then for all scoring functions $s$,

$$\forall \alpha \in (0, 1), \quad MV^*(\alpha) \stackrel{\text{def}}{=} MV_f(\alpha) \leq MV_s(\alpha)$$

The lower $MV_s$, the better $s$.
$\hat{s}_\theta = A(S_n, \theta)$

Choose $\theta^\star$ minimizing the area under $MV_{\hat{s}_\theta}$.

As $MV_{\hat{s}_\theta}$ depends on $P$, use the empirical MV curve estimated on a test set:

$$\widehat{MV}_s : \alpha \in [0, 1) \mapsto \lambda_s(\hat{\alpha}_s^{-1}(\alpha)) \quad \text{where} \quad \hat{\alpha}_s(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{x, s(x) \geq t\}}(X_i)$$
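A minimal sketch of the empirical MV curve, with the volume of each level set estimated by Monte Carlo over a bounding interval (the one-dimensional data and the toy scoring function are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=500)          # test sample from P
s = lambda x: -np.abs(x)          # toy scoring function: higher = more normal

def empirical_mv(s, X, alphas, n_mc=100_000, low=-6.0, high=6.0):
    """For each mass alpha, return the volume of {x : s(x) >= q_alpha},
    with q_alpha the empirical (1 - alpha)-quantile of the scores
    (generalized inverse of hat{alpha}_s).
    The volume is estimated by Monte Carlo over [low, high]."""
    scores = s(X)
    U = rng.uniform(low, high, size=n_mc)   # uniform points for the volume estimate
    sU = s(U)
    volumes = []
    for a in alphas:
        q = np.quantile(scores, 1.0 - a)    # hat{alpha}_s^{-1}(a)
        volumes.append((high - low) * np.mean(sU >= q))
    return np.array(volumes)

alphas = np.linspace(0.1, 0.9, 9)
mv = empirical_mv(s, X, alphas)   # non-decreasing in alpha
```

The area under this curve over a mass interval is then the quantity minimized over hyperparameters.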
random split we may obtain a different $\theta^\star$

→ to reduce the variance of the estimator, consider $B$ random splits of the data set. For each random split $b$ we get $\theta^\star_b$ and $\hat{s}_{\theta^\star_b}$. Final scoring function:

$$\hat{S} = \frac{1}{B} \sum_{b=1}^{B} \hat{s}_{\theta^\star_b}$$
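The aggregation step can be sketched as follows; the kernel-style base scorer and the fixed $\theta^\star_b$ are hypothetical placeholders for $A(S_n, \theta)$ and for the per-split selection by the empirical MV curve:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
B = 5

def fit_scorer(X_train, theta):
    """Stand-in for A(S_n, theta): a kernel-density-style scoring
    function with bandwidth theta (hypothetical base algorithm)."""
    def score(x):
        d2 = ((x[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * theta ** 2)).mean(axis=1)
    return score

scorers = []
for b in range(B):
    idx = rng.permutation(len(X))
    train = idx[:200]            # the remaining half would serve as test split
    # theta_star_b would be chosen by minimizing the area under the
    # empirical MV curve on the test split; fixed here for brevity
    theta_star_b = 0.5
    scorers.append(fit_scorer(X[train], theta_star_b))

# final scoring function: average of the B per-split scoring functions
s_final = lambda x: np.mean([s(x) for s in scorers], axis=0)
vals = s_final(X[:10])
```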
anomaly detection algorithm $A$, compare

our approach → $s_{\text{tuned}}$
a priori fixed hyperparameters → $s_{\text{fixed}}$

Performance criterion: relative gain

$$G_A(s_{\text{tuned}}, s_{\text{fixed}}) = \frac{AMV_I(s_{\text{fixed}}) - AMV_I(s_{\text{tuned}})}{AMV_I(s_{\text{fixed}})}$$

where $AMV_I$ is the area under the MV curve over the interval $I = [0.9, 0.99]$, computed on left-out data.

If $G_A(s_{\text{tuned}}, s_{\text{fixed}}) > 0$ then $s_{\text{tuned}}$ is better than $s_{\text{fixed}}$.
$MV_s$: we used $\widehat{MV}_s$ as an estimate of $MV_s$, where

$$\forall \alpha \in [0, 1), \quad \widehat{MV}_s(\alpha) = \lambda(\{x, s(x) \geq \hat{\alpha}_s^{-1}(\alpha)\})$$

with $\hat{\alpha}_s^{-1}$ the generalized inverse of $\hat{\alpha}_s$.

Two questions:
- Consistency of $\widehat{MV}_s$ as $n \to \infty$?
- How to build confidence regions?
extreme regions

Goal of multivariate Extreme Value Theory: model the tail of the distribution $F$ of a multivariate random variable $X = (X^{(1)}, \dots, X^{(d)})$.

Motivation for unsupervised anomaly detection:
- Anomalies are likely to be located in extreme regions, i.e. regions far from the mean $\mathbb{E}[X]$.
- The lack of data in these regions makes it difficult to distinguish between large normal instances and anomalies.

Relying on multivariate Extreme Value Theory, we suggest an algorithm to detect anomalies in extreme regions.
Theory

Common approach: standardization to unit Pareto margins, $T(X) = V \in \mathbb{R}^d$ where

$$V^{(j)} = \frac{1}{1 - F_j(X^{(j)})}, \quad \forall j \in \{1, \dots, d\}$$

with $F_j$, $1 \leq j \leq d$, the margins. Note that $X$ and $V$ share the same dependence structure/copula.
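In practice the margins $F_j$ are unknown and replaced by their empirical counterparts; a minimal rank-based sketch (the $n + 1$ denominator, which keeps the empirical cdf away from 1, is one common convention, not necessarily the paper's):

```python
import numpy as np

def to_unit_pareto(X):
    """Rank-transform each margin to unit Pareto:
    V^(j) = 1 / (1 - F_j_hat(X^(j))), with F_j_hat the empirical cdf
    computed from ranks divided by n + 1."""
    n, d = X.shape
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1  # ranks 1..n per column
    F = ranks / (n + 1.0)
    return 1.0 / (1.0 - F)

rng = np.random.default_rng(0)
X = rng.standard_t(df=3, size=(1000, 3))   # heavy-tailed toy sample
V = to_unit_pareto(X)                      # margins approximately unit Pareto
```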
Theory

$(r(v), \varphi(v)) = (\|v\|_\infty, v / \|v\|_\infty)$: polar coordinates
$S_\infty^{d-1}$: positive orthant of the unit hypercube

Theorem (Resnick, 1987). Under mild assumptions on the distribution $F$, there exists a finite (angular) measure $\Phi$ on $S_\infty^{d-1}$ such that for all $\Omega \subset S_\infty^{d-1}$,

$$\Phi_t(\Omega) \stackrel{\text{def}}{=} t \cdot P(r(V) > t, \varphi(V) \in \Omega) \xrightarrow[t \to \infty]{} \Phi(\Omega)$$
in extreme regions: observations that deviate from the dependence structure of the tail

[Figure: extreme observations under tail dependence vs. independence]

To find the most likely directions of the extreme observations, we estimate a Minimum Volume set of the angular measure $\Phi$.
estimation on the sphere

Solve the empirical optimization problem (Scott and Nowak, 2006):

$$\min_{\Omega \in \mathcal{G}} \lambda_d(\Omega) \quad \text{subject to} \quad \hat{\Phi}_{n,k}(\Omega) \geq \alpha - \psi_k(\delta)$$

where $\hat{\Phi}_{n,k}$ is estimated from $t \cdot P(r(V) > t, \varphi(V) \in \Omega)$ with $t = n/k$:

$$\hat{\Phi}_{n,k}(\Omega) = \frac{1}{k} \sum_{i=1}^{n} \mathbb{1}_{\{r(V_i) \geq n/k,\ \varphi(V_i) \in \Omega\}}$$

[Figure: extreme region beyond radius $n/k$]
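The empirical angular measure $\hat{\Phi}_{n,k}$ can be sketched directly from its definition; the fully dependent bivariate Pareto sample and the diagonal region $\Omega$ are toy assumptions chosen so the angular mass concentrates near $(1, 1)$:

```python
import numpy as np

def empirical_angular_mass(V, k, region):
    """hat{Phi}_{n,k}(Omega) = (1/k) * sum_i 1{r(V_i) >= n/k, phi(V_i) in Omega},
    with r(v) = ||v||_inf and phi(v) = v / ||v||_inf (V has positive entries).
    `region` is a boolean function of the angles, encoding Omega."""
    n = len(V)
    r = np.max(V, axis=1)          # sup-norm radius
    phi = V / r[:, None]           # angular component on the sup-norm sphere
    extreme = r >= n / k           # keep only the k most extreme (in expectation)
    return np.sum(extreme & region(phi)) / k

rng = np.random.default_rng(0)
U = rng.uniform(size=2000)
V = np.column_stack([1.0 / U, 1.0 / U])   # fully dependent unit-Pareto pair

# Omega: angles near the diagonal direction (1, 1)
mass = empirical_angular_mass(V, k=100, region=lambda p: np.all(p > 0.9, axis=1))
```

Under full dependence all extremes point along the diagonal, so this region carries (empirical) angular mass close to 1.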
scoring function $\hat{s}(r(v), \varphi(v)) = \frac{1}{r(v)^2} \cdot \hat{s}_\varphi(\varphi(v))$

Data set       OCSVM   Isolation Forest   Score $\hat{s}$
shuttle        0.981   0.963              0.987
SF             0.478   0.251              0.660
http           0.997   0.662              0.964
ann            0.372   0.610              0.518
forestcover    0.540   0.516              0.646

ROC-AUC computed on test sets made of normal and abnormal instances and restricted to the extreme region ($k \ll n$).
Mass Volume curves and anomaly ranking. S. Clémençon, A. Thomas. Submitted, 2017.

Learning hyperparameters for unsupervised anomaly detection. A. Thomas, S. Clémençon, V. Feuillard, A. Gramfort. Anomaly Detection Workshop, ICML 2016.

Anomaly detection in extreme regions via empirical MV sets on the sphere. A. Thomas, S. Clémençon, A. Gramfort, A. Sabourin. AISTATS 2017.

Code for hyperparameter tuning available on GitHub: https://github.com/albertcthomas/anomaly_tuning