Anomaly detection in scikit-learn - Ongoing work and future developments

Albert
February 14, 2018

Slides for the February 2018 Paris-Saclay Center for Data Science meeting - Outlier detection in scikit-learn

Transcript

  1. Anomaly detection in scikit-learn: Ongoing work and future developments

    CDS meeting, February 14th, 2018. Albert Thomas, Télécom ParisTech - Huawei Technologies
  2. Anomaly detection

    Imbalanced classification: anomalies and normal data available.
    Novelty detection: only normal data available; fit on normal data, predict on unlabeled data.
  3. Anomaly detection

    Imbalanced classification: anomalies and normal data available.
    Novelty detection: only normal data available; fit on normal data, predict on unlabeled data.
    Outlier detection: unlabeled data only; fit and predict on unlabeled data.
    Outliers are rare events, located in the low-density regions.
  4. Anomaly detection

    Novelty detection: only normal data available; fit on normal data, predict on unlabeled data.
    Outlier detection: unlabeled data only; fit and predict on unlabeled data.
    Outliers are rare events, located in the low-density regions.
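
The two settings on this slide can be sketched as follows. This is a minimal illustration on synthetic data, not code from the talk: the data, the choice of OneClassSVM for novelty detection and IsolationForest for outlier detection, and all parameter values are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_normal = rng.normal(size=(200, 2))         # clean ("normal") observations
X_far = rng.uniform(6.0, 9.0, size=(10, 2))  # points far from the normal cluster

# Novelty detection: fit on normal data only, then predict on new, unlabeled data.
X_new = np.vstack([rng.normal(size=(20, 2)), X_far])
novelty_clf = OneClassSVM(nu=0.05, gamma=0.1).fit(X_normal)
novelty_labels = novelty_clf.predict(X_new)  # +1 for inliers, -1 for novelties

# Outlier detection: fit and predict on the same unlabeled, contaminated data.
X_unlabeled = np.vstack([X_normal, X_far])
outlier_clf = IsolationForest(contamination=0.05, random_state=0)
outlier_labels = outlier_clf.fit_predict(X_unlabeled)  # +1 inliers, -1 outliers
```

The only difference in usage is which data is passed to fit: clean data in the novelty setting, the contaminated data itself in the outlier setting.
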
  5. Outlier and novelty detection algorithms

    Common approach:
    1. Learn a scoring function $s$ such that the smaller $s(x)$, the more abnormal $x$ is.
    2. Threshold $s$ at an offset $q$: outliers/novelties are the points in $\{x : s(x) < q\}$.
    $q$ usually depends on a contamination parameter.
    Estimators: EllipticEnvelope, OneClassSVM, IsolationForest (iForest) and LocalOutlierFactor (LOF).
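
The two-step approach above can be sketched with NumPy alone. The scoring function used here (negative distance to the coordinate-wise median) is a deliberately simple stand-in chosen for illustration; it is not one of the estimators listed on the slide, and the data and contamination level are assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
# 95 points around the origin plus 5 points far away.
X = np.vstack([rng.normal(size=(95, 2)), rng.uniform(5.0, 8.0, size=(5, 2))])

# Step 1: scoring function s; the smaller s(x), the more abnormal x is.
# Illustrative choice: negative Euclidean distance to the coordinate-wise median.
center = np.median(X, axis=0)


def s(points):
    return -np.linalg.norm(points - center, axis=1)


# Step 2: threshold s at an offset q derived from a contamination level.
contamination = 0.05
q = np.percentile(s(X), 100 * contamination)

is_abnormal = s(X) < q  # membership in the set {x : s(x) < q}
```
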
  6. Outlier Detection

    [Figure: panels for Robust covariance, One-Class SVM, Isolation Forest and Local Outlier Factor, each annotated with its running time (.00s to .11s)]
  7. Novelty detection

    [Figure: panels for Robust covariance, One-Class SVM and Isolation Forest, each annotated with its running time (.00s to .11s)]
  8. [Figure: Gaussian mixture density and One-Class SVM scoring function]
  9. Scikit-learn API

    PR on API consistency last week:
    same decision_function for all estimators;
    new score_samples method.
  10. Scikit-learn API

    EllipticEnvelope, OneClassSVM and IsolationForest: instantiate your estimator, e.g.
        clf = OneClassSVM()
        clf.fit(X_train)
    clf.score_samples(X_test): raw score s
    clf.decision_function(X_test): thresholded score
        clf.score_samples(X_test) - clf.offset_  # >= 0 for inliers, < 0 for outliers
        clf.predict(X_test)  # +1 for inliers, -1 for outliers
  11. Scikit-learn API

    EllipticEnvelope, OneClassSVM and IsolationForest: instantiate your estimator, e.g.
        clf = OneClassSVM()
        clf.fit(X_train)
    clf.score_samples(X_test): raw score s
    clf.decision_function(X_test): thresholded score
        clf.score_samples(X_test) - clf.offset_  # >= 0 for inliers, < 0 for outliers
        clf.predict(X_test)  # +1 for inliers, -1 for outliers
    Not valid for LOF.
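
The relationship between the three methods can be checked numerically. The sketch below assumes a scikit-learn release in which the consistency PR mentioned in the talk has been merged (so that score_samples and offset_ exist on all three estimators); the synthetic data and parameter values are illustrative.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
X_train = rng.normal(size=(200, 2))
X_test = rng.normal(scale=2.0, size=(50, 2))

estimators = [
    EllipticEnvelope(random_state=0),
    OneClassSVM(gamma="scale"),
    IsolationForest(random_state=0),
]

checks = {}
for clf in estimators:
    clf.fit(X_train)
    raw = clf.score_samples(X_test)              # raw score s
    thresholded = clf.decision_function(X_test)  # s minus the offset
    labels = clf.predict(X_test)                 # +1 inliers, -1 outliers
    checks[type(clf).__name__] = (
        # decision_function should equal score_samples - offset_
        np.allclose(thresholded, raw - clf.offset_),
        # predict should follow the sign of the thresholded score
        np.array_equal(labels, np.where(thresholded < 0, -1, 1)),
    )
```

Each entry of `checks` holds two booleans per estimator, one for each identity stated on the slide.
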
  12. Scikit-learn API - Outlier detection

    LOF is based on k-NN distances.
    predict on $x \in X_{\text{train}}$: take the k nearest neighbors of $x$ in $X_{\text{train}} \setminus \{x\}$.
    predict on $x \notin X_{\text{train}}$: take the k nearest neighbors of $x$ in $X_{\text{train}}$.
    Hard to check whether $x \in X_{\text{train}}$ or not...
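
The distinction between the two neighbor queries can be seen directly with scikit-learn's NearestNeighbors, used here as a stand-in for the neighbor search inside LOF. The data and the value of k are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X_train = rng.normal(size=(50, 2))
k = 3

nn = NearestNeighbors(n_neighbors=k).fit(X_train)

# Query with no argument: each training point is excluded from its own
# neighbors, i.e. neighbors are searched in X_train \ {x}.
dist_excl, idx_excl = nn.kneighbors()

# Query with X_train passed explicitly: the points are treated as new data,
# so each one finds itself first, at distance 0.
dist_incl, idx_incl = nn.kneighbors(X_train)
```

Passing the training set back in as if it were new data silently changes every neighborhood, which is why a plain predict(X_train) is problematic for LOF.
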
  13. Scikit-learn API - Outlier detection

    1st solution:
    fit_predict(X_train) to predict on X_train;
    predict(X_test) to predict on X_test.
  14. Scikit-learn API - Outlier detection

    1st solution:
    fit_predict(X_train) to predict on X_train;
    predict(X_test) to predict on X_test.
    fit(X).predict(X) != fit_predict(X)
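
The mismatch noted on this slide can be reproduced at the level of scores. The sketch below assumes a later scikit-learn release in which LocalOutlierFactor accepts a `novelty` parameter (listed as future work in this talk); with it, the training scores and the scores obtained by treating the same data as new can be compared. Data and n_neighbors are illustrative.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))

# Outlier detection mode: scores of the training points, each point being
# excluded from its own neighborhood.
lof_outlier = LocalOutlierFactor(n_neighbors=10).fit(X)
train_scores = lof_outlier.negative_outlier_factor_

# Novelty mode (assumes the `novelty` parameter is available): X is scored as
# if it were new data, so each point counts itself among its neighbors.
lof_novelty = LocalOutlierFactor(n_neighbors=10, novelty=True).fit(X)
new_scores = lof_novelty.score_samples(X)

scores_differ = not np.allclose(train_scores, new_scores)
```
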
  15. Scikit-learn API - Outlier detection

    2nd solution and current one:
    only fit_predict is public;
    _score_samples, _decision_function and _predict are private.
  16. Scikit-learn API - Outlier detection

    2nd solution and current one:
    only fit_predict is public;
    _score_samples, _decision_function and _predict are private.
    Novelty detection is not (officially) supported for LOF.
  17. Scikit-learn API

    contamination parameter in $(0, 1)$ / nu for OneClassSVM:
    for outlier detection: proportion of outliers in the data set;
    for novelty detection: type-I error (false positive rate).
    Used to compute the offset $q$ on the training set:
    $\{x : s(x) < q\}$ corresponds to clf.score_samples(x) < clf.offset_
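
As a rough check of how contamination determines the offset, the sketch below fits an IsolationForest with an explicit contamination and measures the fraction of training points whose raw score falls below offset_. The data and the contamination value are illustrative, and a release exposing score_samples and offset_ is assumed.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = rng.normal(size=(500, 2))

contamination = 0.1
clf = IsolationForest(contamination=contamination, random_state=0).fit(X_train)

scores = clf.score_samples(X_train)
# Fraction of training points in {x : s(x) < q}, with q = clf.offset_.
flagged_fraction = np.mean(scores < clf.offset_)
```

On the training set, `flagged_fraction` should be close to `contamination`, since the offset is computed from the training scores.
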
  18. Scikit-learn API

    contamination can be set to 'auto' for iForest and LOF.
    iForest: score of inliers close to 0 and score of outliers close to -1; clf.offset_ = -0.5
    LOF: score of inliers $\approx -1$; clf.offset_ = -1.5
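
The fixed offsets quoted on this slide can be read back from fitted estimators. The sketch assumes a scikit-learn release in which contamination='auto' is accepted by both estimators; the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))

iforest = IsolationForest(contamination="auto", random_state=0).fit(X)
lof = LocalOutlierFactor(contamination="auto").fit(X)

# With 'auto', the offset is a fixed constant rather than a training quantile.
iforest_offset = float(iforest.offset_)
lof_offset = float(lof.offset_)

# iForest raw scores lie between -1 (most abnormal) and 0 (most normal).
iforest_scores = iforest.score_samples(X)
```
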
  19. Common tests

    PR on common tests for outlier detection estimators.
    Create an OutlierMixin that defines a fit_predict for all outlier and novelty detection estimators.
    For all estimators: test fit_predict.
    For novelty detection estimators: test score_samples, decision_function, offset_.
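
A sketch of what the mixin provides, assuming a scikit-learn release in which `sklearn.base.OutlierMixin` exists. `DistanceToMeanDetector` is a hypothetical toy estimator written only for this illustration; it defines fit and predict and relies on the mixin for fit_predict.

```python
import numpy as np
from sklearn.base import BaseEstimator, OutlierMixin


class DistanceToMeanDetector(OutlierMixin, BaseEstimator):
    """Hypothetical toy detector, for illustration only."""

    def __init__(self, contamination=0.1):
        self.contamination = contamination

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        # Offset chosen so that a `contamination` fraction of X scores below it.
        self.offset_ = np.percentile(
            self.score_samples(X), 100 * self.contamination
        )
        return self

    def score_samples(self, X):
        # The smaller the score, the more abnormal the sample.
        X = np.asarray(X, dtype=float)
        return -np.linalg.norm(X - self.mean_, axis=1)

    def decision_function(self, X):
        return self.score_samples(X) - self.offset_

    def predict(self, X):
        return np.where(self.decision_function(X) < 0, -1, 1)


rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
# fit_predict is not defined above: it is inherited from OutlierMixin.
labels = DistanceToMeanDetector(contamination=0.1).fit_predict(X)
```
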
  20. OneClassSVM with SGD

    PR on OneClassSVM using SGD.
    Solves a linear version of the OneClassSVM.
    Pipeline with a kernel approximation:
        from sklearn.kernel_approximation import Nystroem
        from sklearn.linear_model import SGDOneClassSVM
        from sklearn.pipeline import make_pipeline
        nystroem = Nystroem(gamma=gamma)
        online_ocsvm = SGDOneClassSVM(nu=nu)
        pipe_online = make_pipeline(nystroem, online_ocsvm)
        pipe_online.fit(X_train)
  21. Future work

    Novelty detection for LOF: add a novelty parameter in __init__.

                               fit_predict            predict / decision_function / score_samples
    novelty=False (default)    OK                     NotImplementedError
    novelty=True               NotImplementedError    OK
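
The availability table above can be probed as follows, assuming a later scikit-learn release in which the `novelty` parameter has been added. The sketch uses `hasattr` rather than catching a specific exception, because the exact error type raised for the unsupported methods in released versions is not confirmed by the slides and may differ from the NotImplementedError proposed here.

```python
from sklearn.neighbors import LocalOutlierFactor

method_names = ["fit_predict", "predict", "decision_function", "score_samples"]

# For each mode, record which methods the (unfitted) estimator exposes.
availability = {
    novelty: {
        name: hasattr(LocalOutlierFactor(novelty=novelty), name)
        for name in method_names
    }
    for novelty in (False, True)
}
```

`availability[False]` describes the default outlier detection mode and `availability[True]` the novelty detection mode.
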
  22. Future work

    OutlierMixin: add predict? decision_function?
    Anomaly detection benchmarks: one script per context (novelty vs outlier detection); convert fast benchmarks into examples.
    New estimators: SGDOneClassSVM, Local Outlier Probabilities, SVDD, online univariate anomaly detection.
    Documentation.