The Mass Volume curve, a performance metric for unsupervised anomaly detection

Albert
May 24, 2018

Slides of my talk at the workshop on Rare Events, Extremes and Machine Learning - Telecom ParisTech, May 2018


Transcript

  1. The Mass Volume curve, a performance metric for unsupervised anomaly detection

    Rare Events, Extremes and Machine Learning Workshop, May 24th, 2018. Albert Thomas, Huawei Technologies - Télécom ParisTech - Airbus Group Innovations. Joint work with Stephan Clémençon, Alexandre Gramfort, Vincent Feuillard and Anne Sabourin.
  2. Outline

    1. Unsupervised anomaly detection
    2. The Mass Volume curve
    3. Anomaly detection in extreme regions
  3. Unsupervised anomaly detection

    1. Find anomalies in an unlabeled data set $X_1, \dots, X_n$.
    2. Anomalies are assumed to be rare events.
  4. Unsupervised anomaly detection

    $X_1, \dots, X_n \in \mathbb{R}^d$: unlabeled data set, i.i.d. $\sim P$, with density $f$ w.r.t. the Lebesgue measure $\lambda$. Anomalies are rare events. Estimate a density level set.
  5. Minimum Volume set (Polonik, 1997)

    A density level set is a Minimum Volume set, i.e. a solution of
    $\min_{\Omega \in \mathcal{B}(\mathbb{R}^d)} \lambda(\Omega)$ such that $P(\Omega) \ge \alpha$.
    [Figure: two-dimensional plot, both axes from -6 to 6]
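As a worked illustration (an assumption for this note, not an example from the talk), take $P$ to be the standard normal distribution on $\mathbb{R}$. Its density level sets are symmetric intervals, so the Minimum Volume set of mass $\alpha$ is $[-z, z]$ with $z$ the $(1+\alpha)/2$ quantile, and its volume is $2z$. A minimal sketch with the Python standard library (the function name is hypothetical):

```python
from statistics import NormalDist

def gaussian_mv_set(alpha):
    """Minimum Volume set of mass alpha for the standard normal on R.

    Returns (lower bound, upper bound, volume)."""
    z = NormalDist().inv_cdf((1 + alpha) / 2)
    return -z, z, 2 * z

lower, upper, volume = gaussian_mv_set(0.95)

# An interval with the same mass 0.95 that is not a density level set
# (from the 1% quantile to the 96% quantile) is longer.
shifted_length = NormalDist().inv_cdf(0.96) - NormalDist().inv_cdf(0.01)
```

The comparison with `shifted_length` shows the minimality property on one alternative set only; it is not a proof.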
  6. Minimum volume set and density level set

    1. A density level set is always a minimum volume set.
    2. If $f$ has no flat parts, a minimum volume set is a density level set.

    (Einmahl and Mason, 1992), (Polonik, 1997), (Nunez-Garcia et al., 2003)
  7. Unsupervised anomaly detection algorithms

    Common approach for most algorithms:

    1. Learn a scoring function $\hat{s} : x \in \mathbb{R}^d \mapsto \mathbb{R}$ such that the smaller $\hat{s}(x)$, the more abnormal $x$ is.
    2. Threshold $\hat{s}$ at an offset $q$ such that $\hat{\Omega}_\alpha = \{x, \hat{s}(x) \ge q\}$ is an estimate of the Minimum Volume set with mass $\alpha$.

    → density estimation (Cadre et al., 2013), One-Class SVM/Support Vector Data Description (Schölkopf et al., 2001), (Vert and Vert, 2006), (Tax et al., 2004), k-NN (Sricharan & Hero, 2011)
    → Isolation Forest (Liu et al., 2008) and Local Outlier Factor (Breunig et al., 2000)
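The thresholding step can be sketched as follows. This is a minimal illustration, assuming the offset $q$ is taken as an empirical quantile of the scores on the sample; the function name and the toy scoring function (the standard normal density) are hypothetical choices, not taken from the talk.

```python
import math
import random

def threshold_at_mass(scores, alpha):
    """Offset q such that about a fraction alpha of the scores satisfy score >= q."""
    ordered = sorted(scores, reverse=True)
    k = max(1, math.ceil(alpha * len(ordered)))
    return ordered[k - 1]

def toy_score(x):
    """Toy scoring function: the standard normal density (low value = abnormal)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(1000)]
scores = [toy_score(x) for x in sample]

q = threshold_at_mass(scores, 0.95)
# Points kept in the estimated Minimum Volume set; the rest are flagged as anomalies.
inliers = [x for x, s in zip(sample, scores) if s >= q]
anomalies = [x for x, s in zip(sample, scores) if s < q]
```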
  8. Ideal scoring functions

    Ideal scoring functions $s$ preserve the order induced by the density $f$:
    $s(x_1) < s(x_2) \iff f(x_1) < f(x_2)$,
    i.e. $s$ is a strictly increasing transform of $f$.
    [Figure: curves labeled density f, f(x − 0.05) and f(x) + 2, for x between 0 and 1]
    $s$ does not need to be close to $f$ in the sense of an $L_p$ norm.
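To make the order-preservation property concrete, the sketch below checks that two strictly increasing transforms of a density ($f + 2$ and $\log f$) rank a handful of points exactly as $f$ does, even though neither is close to $f$ in norm. The density (standard normal) and the points are arbitrary choices made for this illustration.

```python
import math

def density(x):
    """Standard normal density, used here as a stand-in for f."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def ranking(score, points):
    """Indices of the points sorted from most abnormal (lowest score) to most normal."""
    return sorted(range(len(points)), key=lambda i: score(points[i]))

points = [-2.5, -1.0, 0.1, 0.7, 1.9, 3.2]
rank_f = ranking(density, points)
rank_plus_two = ranking(lambda x: density(x) + 2.0, points)
rank_log = ranking(lambda x: math.log(density(x)), points)
```

All three rankings coincide, so the three functions define the same level sets up to relabeling of the thresholds.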
  9. Scoring function of the One-Class SVM

    [Figure: two panels, Gaussian mixture density $f$ and One-Class SVM scoring function $s$]
    Asymptotically constant near the modes and proportional to the density in the low density regions (Vert and Vert, 2006)
  10. Problem

    One-Class SVM with Gaussian kernel $k_\sigma$. The user needs to choose the bandwidth $\sigma$ of the kernel.
    [Figure: estimated and true sets for $\sigma = 0.5$ (overfitting) and $\sigma = 10$ (underfitting)]
    How to automatically choose $\sigma$?
  11. Problem

    Given a data set $S_n = (X_1, \dots, X_n)$, a hyperparameter space $\Theta$ and an unsupervised anomaly detection algorithm
    $A : S_n \times \Theta \to \mathbb{R}^{\mathbb{R}^d}$, $(S_n, \theta) \mapsto \hat{s}_\theta$.
    How to assess the performance of $\hat{s}_\theta$ without a labeled data set?
    (Anomaly Detection Workshop, Thomas, Clémençon, Feuillard, Gramfort, ICML 2016)
  12. Mass Volume curve

    $X \sim P$, scoring function $s : \mathbb{R}^d \to \mathbb{R}$, $t$-level set of $s$: $\{x, s(x) \ge t\}$.
    $\alpha_s(t) = P(s(X) \ge t)$: mass of the $t$-level set.
    $\lambda_s(t) = \lambda(\{x, s(x) \ge t\})$: volume of the $t$-level set.
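Both quantities can be approximated numerically. The sketch below is an illustration under stated assumptions, not the talk's implementation: the mass is estimated by the fraction of a sample falling in the level set, and the volume by uniform Monte Carlo sampling on a bounding interval `[low, high]` that is assumed to contain the level set. The one-dimensional setting and the toy scoring function are chosen for brevity.

```python
import math
import random

def empirical_mass(score, sample, t):
    """Fraction of the sample in the t-level set {x : s(x) >= t}."""
    return sum(score(x) >= t for x in sample) / len(sample)

def monte_carlo_volume(score, t, low, high, n_draws, rng):
    """Lebesgue measure of {x in [low, high] : s(x) >= t}, by uniform sampling."""
    hits = sum(score(rng.uniform(low, high)) >= t for _ in range(n_draws))
    return (high - low) * hits / n_draws

def toy_score(x):
    """Toy scoring function: the standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(5000)]

t = toy_score(1.0)  # the t-level set is then the interval [-1, 1]
mass = empirical_mass(toy_score, sample, t)
volume = monte_carlo_volume(toy_score, t, -6.0, 6.0, 20000, rng)
```

For this choice of `t` the exact values are a mass of about 0.683 and a volume of 2, which the two estimates should approach as the sample sizes grow.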
  13. Mass Volume curve

    Mass Volume curve $MV_s$ of a scoring function $s$ (Clémençon and Jakubowicz, 2013), (Clémençon and Thomas, 2017):
    $t \in \mathbb{R} \mapsto (\alpha_s(t), \lambda_s(t))$
    [Figure: panel "Scoring functions" (score against x, for f and s) and panel "Mass Volume curves" (volume against mass, for $MV_f$ and $MV_s$)]
  14. Mass Volume curve

    Easier to work with the following definition: $MV_s$ is defined as the plot of the function
    $MV_s : \alpha \in (0, 1) \mapsto \lambda_s(\alpha_s^{-1}(\alpha)) = \lambda(\{x, s(x) \ge \alpha_s^{-1}(\alpha)\})$
    where $\alpha_s^{-1}$ is the generalized inverse of $\alpha_s$.

    Property (Clémençon and Jakubowicz, 2013), (Clémençon and Thomas, 2017): assume that the underlying density $f$ has no flat parts. Then for all scoring functions $s$,
    $\forall \alpha \in (0, 1), \quad MV^*(\alpha) \overset{\text{def}}{=} MV_f(\alpha) \le MV_s(\alpha)$.
    The lower $MV_s$ is, the better $s$ is.
  15. Learning hyperparameters

    Consider $\hat{s}_\theta = A(S_n, \theta)$. Choose $\theta^*$ minimizing the area under $MV_{\hat{s}_\theta}$.
    As $MV_{\hat{s}_\theta}$ depends on $P$, use the empirical MV curve estimated on a test set:
    $\widehat{MV}_s : \alpha \in [0, 1) \mapsto \lambda_s(\hat{\alpha}_s^{-1}(\alpha))$
    where $\hat{\alpha}_s(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{x, s(x) \ge t\}}(X_i)$.
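A minimal sketch of the empirical MV curve in one dimension, under assumptions that are not from the talk: the generalized inverse of the empirical mass is taken as an order statistic of the test scores, and the volume of each level set is approximated by a Riemann sum on a regular grid over a bounding interval `[low, high]`. All names are hypothetical.

```python
import math
import random

def empirical_mv(score, test_sample, alphas, low, high, n_grid=2000):
    """Empirical MV curve of a 1D scoring function at the given masses."""
    test_scores = sorted((score(x) for x in test_sample), reverse=True)
    step = (high - low) / n_grid
    grid_scores = [score(low + (i + 0.5) * step) for i in range(n_grid)]
    volumes = []
    for alpha in alphas:
        # Threshold such that about a fraction alpha of test scores are >= t.
        k = max(1, math.ceil(alpha * len(test_scores)))
        t = test_scores[k - 1]
        # Volume of {x : s(x) >= t} restricted to [low, high].
        volumes.append(step * sum(g >= t for g in grid_scores))
    return volumes

def area_under_curve(alphas, volumes):
    """Trapezoidal area under the MV curve."""
    return sum((a1 - a0) * (v0 + v1) / 2
               for a0, a1, v0, v1 in zip(alphas, alphas[1:], volumes, volumes[1:]))

rng = random.Random(0)
score = lambda x: math.exp(-0.5 * x * x)  # increasing transform of the N(0, 1) density
test_sample = [rng.gauss(0.0, 1.0) for _ in range(2000)]
alphas = [0.5, 0.9, 0.95]
volumes = empirical_mv(score, test_sample, alphas, -6.0, 6.0)
area = area_under_curve(alphas, volumes)
```

With this ideal scoring function the volumes should be close to the optimal ones for the standard normal, roughly 1.35, 3.29 and 3.92.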
  16. Algorithm

    Choose $\hat{\theta}$ minimizing the area under the Mass Volume curve, $AMV_{\hat{s}_\theta}$.
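The selection step can be sketched end to end on a toy problem. Everything specific here is an assumption made for illustration: a one-dimensional Gaussian kernel density estimate stands in for the anomaly detection algorithm, its bandwidth plays the role of the hyperparameter, a single 80/20 split is used, and volumes are approximated on a grid. The author's actual implementation is the `anomaly_tuning` repository referenced at the end of the deck.

```python
import math
import random

def kde(train, bandwidth):
    """Gaussian kernel density estimate, used as a stand-in scoring function."""
    norm = len(train) * bandwidth * math.sqrt(2 * math.pi)
    return lambda x: sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                         for xi in train) / norm

def area_under_mv(score, test, alphas, low, high, n_grid=400):
    """Trapezoidal area under the empirical MV curve at the given masses."""
    test_scores = sorted((score(x) for x in test), reverse=True)
    step = (high - low) / n_grid
    grid_scores = [score(low + (i + 0.5) * step) for i in range(n_grid)]
    volumes = []
    for alpha in alphas:
        t = test_scores[max(1, math.ceil(alpha * len(test_scores))) - 1]
        volumes.append(step * sum(g >= t for g in grid_scores))
    return sum((a1 - a0) * (v0 + v1) / 2
               for a0, a1, v0, v1 in zip(alphas, alphas[1:], volumes, volumes[1:]))

def tune(data, candidates, alphas, low, high, rng):
    """Return the candidate with the smallest held-out area, and all areas."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    train, test = shuffled[:cut], shuffled[cut:]
    areas = {h: area_under_mv(kde(train, h), test, alphas, low, high)
             for h in candidates}
    return min(areas, key=areas.get), areas

rng = random.Random(0)
# Toy data: balanced mixture of N(-5, 1) and N(5, 1).
data = [rng.gauss(-5, 1) if rng.random() < 0.5 else rng.gauss(5, 1)
        for _ in range(300)]
alphas = [0.9 + 0.01 * i for i in range(10)]  # masses between 0.9 and 0.99
best, areas = tune(data, [0.05, 0.5, 20.0], alphas, -12.0, 12.0, rng)
```

The very large bandwidth merges the two modes into one bump, so its level sets must also cover the empty region between the modes; this is the underfitting case that the criterion penalizes.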
  17. Aggregation

    For each random split we may obtain a different $\hat{\theta}$.
    → To reduce the variance of the estimator, consider $B$ random splits of the data set.
    For each random split $b$ we get $\hat{\theta}_b$ and $\hat{s}^b_{\hat{\theta}_b}$.
    Final scoring function: $\hat{S} = \frac{1}{B} \sum_{b=1}^{B} \hat{s}^b_{\hat{\theta}_b}$
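The averaging over random splits can be written generically. In the sketch below, `fit_tuned(train, test)` is a hypothetical callable that returns the scoring function fitted on `train` with the hyperparameter selected on `test`; the toy version used in the example (negative distance to the training mean) only serves to make the code runnable.

```python
import random

def aggregate_over_splits(data, fit_tuned, n_splits, rng, train_fraction=0.8):
    """Average the tuned scoring functions obtained on n_splits random splits."""
    scorers = []
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(train_fraction * len(shuffled))
        scorers.append(fit_tuned(shuffled[:cut], shuffled[cut:]))
    return lambda x: sum(s(x) for s in scorers) / len(scorers)

def fit_tuned(train, test):
    """Toy stand-in: score = negative distance to the training mean."""
    center = sum(train) / len(train)
    return lambda x: -abs(x - center)

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(100)]
final_score = aggregate_over_splits(data, fit_tuned, n_splits=5, rng=rng)
```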
  18. Toy example

    Minimum Volume set estimation with One-Class SVM and $B = 50$ ($\alpha = 0.95$).
    [Figure: estimated and true Minimum Volume sets]
  19. Experiments

    Given an anomaly detection algorithm $A$, compare
    our approach → $s_{tuned}$
    a priori fixed hyperparameters → $s_{fixed}$

    Performance criterion: relative gain
    $G_A(s_{tuned}, s_{fixed}) = \frac{AMV_I(s_{fixed}) - AMV_I(s_{tuned})}{AMV_I(s_{fixed})}$
    where $AMV_I$ is the area under the MV curve over the interval $I = [0.9, 0.99]$, computed on left-out data.
    If $G_A(s_{tuned}, s_{fixed}) > 0$ then $s_{tuned}$ is better than $s_{fixed}$.
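The criterion is a one-line computation once the two areas are available. The function name and the numbers in the example are made up for illustration.

```python
def relative_gain(amv_fixed, amv_tuned):
    """Relative gain of the tuned scoring function over the fixed one.

    Positive when the tuned scoring function has the smaller area under its
    MV curve, i.e. when tuning helped."""
    return (amv_fixed - amv_tuned) / amv_fixed

gain = relative_gain(amv_fixed=2.0, amv_tuned=1.5)    # 0.25, i.e. a 25% gain
loss = relative_gain(amv_fixed=2.0, amv_tuned=2.5)    # negative: tuning hurt
```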
  20. Results

    For $s_{tuned}$ we consider 50 random splits (80/20).
    [Figure: relative gain (%) of aKLPE, KLPE, OCSVM, KS and iForest on the GM d=2, GM d=4, Banana and HTTP data sets, for $n = 100$ and $n = 500$]
    KLPE (Sricharan & Hero, 2011); aKLPE: Average KLPE (Qian & Saligrama, 2012); OCSVM: One-Class SVM (Schölkopf et al., 2001); iForest: Isolation Forest (Liu et al., 2008); KS: Kernel Smoothing
  21. Consistency of $\widehat{MV}_s$

    We used $\widehat{MV}_s$ as an estimate of $MV_s$, where
    $\forall \alpha \in [0, 1), \quad \widehat{MV}_s(\alpha) = \lambda\{x, s(x) \ge \hat{\alpha}_s^{-1}(\alpha)\}$
    with $\hat{\alpha}_s^{-1}$ the generalized inverse of $\hat{\alpha}_s$.

    2 questions:
    Consistency of $\widehat{MV}_s$ as $n \to \infty$?
    How to build confidence regions?
  22. Consistency

    Let $s$ be a scoring function and $\varepsilon \in (0, 1]$.
    Assumptions on the random variable $s(X)$ and its distribution $F_s$; $\lambda_s$ is $C^2$.

    Theorem (Clémençon and Thomas, 2017)
    (i) Consistency. With probability one,
    $\sup_{\alpha \in [0, 1-\varepsilon]} |\widehat{MV}_s(\alpha) - MV_s(\alpha)| \to 0$ as $n \to \infty$.
    (ii) Functional CLT. There exists a sequence of Brownian bridges $\{B_n(\alpha)\}_{\alpha \in [0, 1]}$ such that, almost surely, uniformly over $[0, 1-\varepsilon]$, as $n \to \infty$,
    $\sqrt{n} \left( \widehat{MV}_s(\alpha) - MV_s(\alpha) \right) = \frac{\lambda_s'(\alpha_s^{-1}(\alpha))}{f_s(\alpha_s^{-1}(\alpha))} B_n(\alpha) + O(n^{-1/2} \log n)$
  23. Confidence bands using smoothed bootstrap

    [Figure: $MV_f$ with its 90% confidence band, volume against mass]
    Up to $\log n$ factors (Clémençon and Thomas, 2017):
    Functional CLT: rate in $O(n^{-1/2})$ but requires knowledge of $f_s$
    Naive (non-smoothed) bootstrap: rate in $O(n^{-1/4})$
    Smoothed bootstrap: rate in $O(n^{-2/5})$
  24. Anomaly detection in extreme regions

    Goal of multivariate Extreme Value Theory: model the tail of the distribution $F$ of a multivariate random variable $X = (X^{(1)}, \dots, X^{(d)})$.

    Motivation for unsupervised anomaly detection:
    Anomalies are likely to be located in extreme regions, i.e. regions far from the mean $E[X]$.
    Lack of data in these regions makes it difficult to distinguish between large normal instances and anomalies.

    Relying on multivariate Extreme Value Theory we suggest an algorithm to detect anomalies in extreme regions.
  25. Multivariate Extreme Value Theory

    Common approach: standardization to unit Pareto, $T(X) = V \in \mathbb{R}^d$ where
    $V^{(j)} = \frac{1}{1 - F_j(X^{(j)})}, \quad \forall j \in \{1, \dots, d\}$,
    $F_j$, $1 \le j \le d$, being the margins.
    Note that $X$ and $V$ share the same dependence structure/copula.
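When the margins are unknown, a common workaround is to plug in rank-based empirical margins. The sketch below does this with $\hat{F}_j(x) = \mathrm{rank}/(n+1)$, which is an assumption of this illustration (the slide is stated with the true margins $F_j$); dividing by $n + 1$ rather than $n$ keeps the largest observation finite. Ties are broken by input order.

```python
def to_unit_pareto(data):
    """Rank-transform each coordinate to an (approximately) unit Pareto margin.

    data: list of n points, each a list of d coordinates.
    Returns V with V[i][j] = 1 / (1 - rank_ij / (n + 1))."""
    n, d = len(data), len(data[0])
    transformed = [[0.0] * d for _ in range(n)]
    for j in range(d):
        order = sorted(range(n), key=lambda i: data[i][j])
        for rank, i in enumerate(order, start=1):
            transformed[i][j] = 1.0 / (1.0 - rank / (n + 1))
    return transformed

V = to_unit_pareto([[1.0, 30.0], [2.0, 10.0], [3.0, 20.0]])
```

Because only ranks are used, any strictly increasing change of units in a coordinate leaves `V` unchanged, which is consistent with the remark that `X` and `V` share the same copula.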
  26. Multivariate Extreme Value Theory

    Polar coordinates: $(r(v), \varphi(v)) = (\|v\|_\infty, v / \|v\|_\infty)$.
    $S_{d-1}$: positive orthant of the unit hypercube.

    Theorem (Resnick, 1987): with mild assumptions on the distribution $F$, there exists a finite (angular) measure $\Phi$ on $S_{d-1}$ such that for all $\Omega \subset S_{d-1}$,
    $\Phi_t(\Omega) \overset{\text{def}}{=} t \cdot P(r(V) > t, \varphi(V) \in \Omega) \to \Phi(\Omega)$ as $t \to \infty$.
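The polar decomposition in the sup norm is straightforward to compute; a minimal sketch (the function name is hypothetical):

```python
def polar(v):
    """Sup-norm polar coordinates of a nonzero vector v: (radius, angle)."""
    r = max(abs(c) for c in v)
    return r, [c / r for c in v]

radius, angle = polar([2.0, 4.0])
```

For a vector with positive coordinates, the angle always has at least one coordinate equal to 1, so it lies on the boundary of the unit hypercube in the positive orthant.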
  27. Main idea

    Anomalies in extreme regions: observations that deviate from the dependence structure of the tail.
    [Figure: two panels, "Dependence" and "Independence"]
    To find the most likely directions of the extreme observations we estimate a Minimum Volume set of the angular measure $\Phi$.
  28. Minimum volume set estimation on the sphere

    Solve the empirical optimization problem (Scott and Nowak, 2006)
    $\min_{\Omega \in \mathcal{G}} \lambda_d(\Omega)$ subject to $\hat{\Phi}_{n,k}(\Omega) \ge \alpha - \psi_k(\delta)$
    where $\hat{\Phi}_{n,k}$ is estimated from $t \cdot P(r(V) > t, \varphi(V) \in \Omega)$ with $t = n/k$:
    $\hat{\Phi}_{n,k}(\Omega) = \frac{1}{k} \sum_{i=1}^{n} \mathbb{1}\{r(V_i) \ge \frac{n}{k}, \varphi(V_i) \in \Omega\}$
    [Figure: axes marked at n/k]
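The empirical angular measure counts the observations whose radius exceeds $n/k$ and whose angle falls in $\Omega$, then divides by $k$. In the sketch below, the points are assumed to be already standardized to unit Pareto margins (so all coordinates are positive), and `in_region` is a hypothetical indicator function describing $\Omega$ through the angle.

```python
def empirical_angular_measure(points, k, in_region):
    """(1/k) * number of points with sup-norm radius >= n/k and angle in Omega."""
    n = len(points)
    count = 0
    for v in points:
        r = max(v)  # sup norm, valid because coordinates are positive
        if r >= n / k and in_region([c / r for c in v]):
            count += 1
    return count / k

points = [[1.2, 1.1], [5.0, 4.5], [6.0, 1.0], [1.5, 8.0],
          [2.0, 2.0], [1.1, 1.3], [1.0, 1.6], [9.0, 9.0]]

# Omega: angles whose coordinates are all at least 0.5 (points near the diagonal).
near_diagonal = lambda angle: min(angle) >= 0.5

mass_near_diagonal = empirical_angular_measure(points, k=2, in_region=near_diagonal)
total_mass = empirical_angular_measure(points, k=2, in_region=lambda angle: True)
```

With $n = 8$ and $k = 2$ the radius threshold is 4: four points exceed it, two of which are near the diagonal.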
  29. Theorem

    $\Omega^*_\alpha$ being the true Minimum Volume set, define
    $R(\Omega) = \left(\lambda_d(\Omega) - \lambda_d(\Omega^*_\alpha)\right)_+ + \left(\alpha - \Phi_{n/k}(\Omega)\right)_+$

    Theorem (Thomas, Clémençon, Gramfort, Sabourin, 2017). Assume that
    the margins $F_j$ are known,
    the class $\mathcal{G}$ is of finite VC dimension,
    common assumptions for existence and uniqueness of MV sets are fulfilled by $\Phi_{n/k}$.
    Then there exists a constant $C > 0$ such that
    $E[R(\hat{\Omega}_\alpha)] \le \left( \inf_{\Omega \in \mathcal{G}_\alpha} \lambda_d(\Omega) - \lambda_d(\Omega^*_\alpha) \right) + C \sqrt{\frac{\log k}{k}}$
    where $\mathcal{G}_\alpha = \{\Omega \in \mathcal{G}, \Phi_{n/k}(\Omega) \ge \alpha\}$.
  30. Numerical experiments

    Global scoring function: $\hat{s}(r(v), \varphi(v)) = 1/r(v)^2 \cdot \hat{s}_\varphi(\varphi(v))$.

    Data set      OCSVM   Isolation Forest   Score $\hat{s}$
    shuttle       0.981   0.963              0.987
    SF            0.478   0.251              0.660
    http          0.997   0.662              0.964
    ann           0.372   0.610              0.518
    forestcover   0.540   0.516              0.646

    ROC-AUC values are computed on test sets made of normal and abnormal instances and restricted to the extreme region ($k = \sqrt{n}$).
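The global scoring function combines a radial factor $1/r^2$ with a score on the angle. A minimal sketch, in which `angular_score` is a hypothetical stand-in for the learned angular scoring function (here simply the smallest angle coordinate, which is high near the diagonal):

```python
def global_score(v, angular_score):
    """Score of a standardized point v (positive coordinates): s_phi(angle) / r^2."""
    r = max(v)
    angle = [c / r for c in v]
    return angular_score(angle) / r ** 2

toy_angular_score = min  # assumption: high score when all coordinates are large together

score_diagonal = global_score([4.0, 4.0], toy_angular_score)   # 1 / 16
score_off_axis = global_score([8.0, 1.0], toy_angular_score)   # 0.125 / 64
```

With this toy choice, a point that is further out and far from the diagonal receives the lower score, i.e. is ranked as more abnormal.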
  31. References

    Mass Volume curves and anomaly ranking. S. Clémençon, A. Thomas. Submitted, 2017.
    Learning hyperparameters for unsupervised anomaly detection. A. Thomas, S. Clémençon, V. Feuillard, A. Gramfort. Anomaly Detection Workshop, ICML 2016.
    Anomaly detection in extreme regions via empirical MV sets on the sphere. A. Thomas, S. Clémençon, A. Gramfort, A. Sabourin. AISTATS 2017.

    Code for hyperparameter tuning available on GitHub: https://github.com/albertcthomas/anomaly_tuning