Anomaly detection in scikit-learn - Ongoing work and future developments

Anomaly detection in scikit-learn
Ongoing work and future developments
CDS meeting - February 14th, 2018
Albert Thomas, Télécom ParisTech - Huawei Technologies

Anomaly detection
- Imbalanced classification: anomalies and normal data available
- Novelty detection: only normal data available; fit on normal data, predict on unlabeled data
- Outlier detection: unlabeled data only; fit and predict on unlabeled data
Outliers are rare events, located in the low density regions.

Anomaly detection
- Novelty detection: only normal data available; fit on normal data, predict on unlabeled data
- Outlier detection: unlabeled data only; fit and predict on unlabeled data
Outliers are rare events, located in the low density regions.

Outlier and novelty detection algorithms
Common approach
1. Learn a scoring function $s$ such that the smaller $s(x)$, the more abnormal $x$ is
2. Threshold $s$ at an offset $q$: outliers/novelties are the points in $\{x : s(x) < q\}$
$q$ usually depends on a contamination parameter
Estimators: EllipticEnvelope, OneClassSVM, IsolationForest (iForest) and LocalOutlierFactor (LOF)
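
The two steps above can be sketched as follows. This is only an illustration: the toy data, the choice of IsolationForest as the scoring function, and the 5% contamination level are assumptions, not values from the talk.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy data (assumed): a Gaussian cloud plus a few uniformly scattered points.
rng = np.random.RandomState(0)
X = np.r_[rng.randn(200, 2), rng.uniform(-6, 6, size=(10, 2))]

# Step 1: learn a scoring function s; the smaller s(x), the more abnormal x.
clf = IsolationForest(random_state=0).fit(X)
s = clf.score_samples(X)

# Step 2: threshold s at an offset q derived from an assumed contamination level.
contamination = 0.05
q = np.percentile(s, 100 * contamination)
is_outlier = s < q
```

Here `q` is computed by hand to make the thresholding explicit; the estimators expose the same idea through their `contamination` (or `nu`) parameter.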

[Figure: Outlier detection. Panels: Robust covariance, One-Class SVM, Isolation Forest, Local Outlier Factor; each panel is annotated with a timing in seconds.]

[Figure: Local Outlier Factor]

[Figure: Novelty detection. Panels: Robust covariance, One-Class SVM, Isolation Forest; each panel is annotated with a timing in seconds.]

[Figure: two plots, Gaussian mixture density and One-Class SVM scoring function]

Scikit-learn API
PR on API consistency last week
- same decision_function for all estimators
- new score_samples method

Scikit-learn API
EllipticEnvelope, OneClassSVM and IsolationForest
Instantiate your estimator, e.g. clf = OneClassSVM()
    clf.fit(X_train)
    clf.score_samples(X_test)       # raw score s
    clf.decision_function(X_test)   # thresholded score: clf.score_samples(X_test) - clf.offset_
                                    # >= 0 for inliers, < 0 for outliers
    clf.predict(X_test)             # +1 for inliers, -1 for outliers
Not valid for LOF
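
A runnable version of the calls listed on this slide might look like the sketch below. The toy data and the `nu` and `gamma` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Toy data (assumed): train on a Gaussian cloud, test on a mix of
# similar points and a few uniformly scattered ones.
rng = np.random.RandomState(0)
X_train = rng.randn(100, 2)
X_test = np.r_[rng.randn(20, 2), rng.uniform(-6, 6, size=(5, 2))]

clf = OneClassSVM(nu=0.1, gamma=0.5).fit(X_train)

raw = clf.score_samples(X_test)      # raw score s
dec = clf.decision_function(X_test)  # thresholded score: s - offset_
pred = clf.predict(X_test)           # +1 for inliers, -1 for outliers
```

The relation between the two scores can then be checked with `np.allclose(dec, raw - clf.offset_)`.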

Scikit-learn API - Outlier detection
LOF is based on k-NN distances
- predict on $x \in X_{\text{train}}$: take the k nearest neighbors of $x$ in $X_{\text{train}} \setminus \{x\}$
- predict on $x \notin X_{\text{train}}$: take the k nearest neighbors of $x$ in $X_{\text{train}}$
Hard to check whether $x \in X_{\text{train}}$ or not...
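
The difference between the two neighbor queries can be seen directly with `NearestNeighbors`, the kind of k-NN search LOF relies on. This is a sketch on assumed toy data, not the LOF implementation itself.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X_train = rng.randn(50, 2)  # toy data (assumed)
k = 3

nn = NearestNeighbors(n_neighbors=k).fit(X_train)

# No argument: the query points are the training points themselves,
# and each point is excluded from its own neighbor list.
dist_train, idx_train = nn.kneighbors()

# Explicit array: the points are treated as new data, so a training
# point passed here finds itself as a neighbor at distance 0.
dist_new, idx_new = nn.kneighbors(X_train)
```

Passing the training set back as if it were new data therefore changes the neighborhoods, which is why the two prediction paths cannot simply be merged.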

Scikit-learn API - Outlier detection
1st solution
- fit_predict(X_train) to predict on X_train
- predict(X_test) to predict on X_test
fit(X).predict(X) != fit_predict(X)

Scikit-learn API - Outlier detection
2nd solution and current one
- only fit_predict is public
- _score_samples, _decision_function and _predict are private
Novelty detection is not (officially) supported for LOF
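
Using only the public entry point, outlier detection with LOF can be sketched as below. The toy data, with one point placed far from the rest, and `n_neighbors=20` are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy data (assumed): a Gaussian cloud plus one point far away from it.
rng = np.random.RandomState(0)
X = np.r_[rng.randn(100, 2), [[8.0, 8.0]]]

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)  # +1 for inliers, -1 for outliers

# Scores of the training samples; the lower, the more abnormal.
scores = lof.negative_outlier_factor_
```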

Scikit-learn API
contamination parameter in $(0, 1)$ / nu for OneClassSVM
- for outlier detection: proportion of outliers in the data set
- for novelty detection: type-I error (false positive rate)
Used to compute the offset $q$ on the training set
$\{s(x) < q\}$ = {clf.score_samples(x) < clf.offset_}
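
A quick way to see the link between contamination and the offset is to count how many training points fall below `offset_`. The sketch assumes a recent scikit-learn release, toy Gaussian data, and an arbitrary contamination of 0.1.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = rng.randn(1000, 2)  # toy data (assumed)

contamination = 0.1  # assumed proportion of outliers
clf = IsolationForest(contamination=contamination, random_state=0).fit(X_train)

# Points whose raw score falls below the learned offset are flagged.
scores = clf.score_samples(X_train)
flagged = scores < clf.offset_
frac_flagged = flagged.mean()
```

On the training set, `frac_flagged` should come out close to the requested contamination.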

Scikit-learn API
contamination can be set to 'auto' for iForest and LOF
- iForest: score of inliers close to 0 and score of outliers close to -1; clf.offset_ = -0.5
- LOF: score of inliers ≈ -1; clf.offset_ = -1.5
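
The fixed offsets can be read back from fitted estimators, as in this sketch on assumed toy data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = rng.randn(200, 2)  # toy data (assumed)

iforest = IsolationForest(contamination="auto", random_state=0).fit(X)
lof = LocalOutlierFactor(contamination="auto").fit(X)

# With contamination='auto' the offset is a fixed constant rather than
# a quantile of the training scores (values as stated on the slide).
iforest_offset = iforest.offset_
lof_offset = lof.offset_
```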

Common tests
PR on common tests for outlier detection estimators
- Create an OutlierMixin: defines a fit_predict for all outlier and novelty detection estimators
- For all estimators, test fit_predict
- For novelty detection estimators, test score_samples, decision_function, offset_
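
In released versions of scikit-learn, `OutlierMixin` lives in `sklearn.base`. The sketch below checks that the four estimators discussed in this talk share it and runs `fit_predict` on each; the toy data is an assumption, and the loop is an illustration rather than the actual common tests.

```python
import numpy as np
from sklearn.base import OutlierMixin
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(100, 2)  # toy data (assumed)

estimators = [EllipticEnvelope(), OneClassSVM(), IsolationForest(), LocalOutlierFactor()]

# Which estimators inherit from the mixin, and what fit_predict returns.
uses_mixin = {type(est).__name__: isinstance(est, OutlierMixin) for est in estimators}
labels = {type(est).__name__: est.fit_predict(X) for est in estimators}
```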

OneClassSVM with SGD
PR on OneClassSVM using SGD
- Solves a linear version of the OneClassSVM
- Pipeline with a kernel approximation
    from sklearn.kernel_approximation import Nystroem
    from sklearn.linear_model import SGDOneClassSVM
    from sklearn.pipeline import make_pipeline
    # gamma, nu and X_train are assumed to be defined beforehand
    nystroem = Nystroem(gamma=gamma)
    online_ocsvm = SGDOneClassSVM(nu=nu)
    pipe_online = make_pipeline(nystroem, online_ocsvm)
    pipe_online.fit(X_train)

Future work
Novelty detection for LOF: add a novelty parameter in __init__
- novelty=False (default): fit_predict OK; predict/decision_function/score_samples raise NotImplementedError
- novelty=True: fit_predict raises NotImplementedError; predict/decision_function/score_samples OK
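
A `novelty` parameter along these lines is available in later scikit-learn releases. The sketch below shows both modes on assumed toy data. It only checks with `hasattr` whether the methods of the other mode are exposed, and makes no claim about the exception type, which may differ from the one proposed on the slide.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_train = rng.randn(100, 2)  # toy data (assumed)
X_test = np.r_[rng.randn(10, 2), [[8.0, 8.0]]]

# Outlier detection mode (novelty=False, the default): fit_predict only.
lof_outlier = LocalOutlierFactor(n_neighbors=20)
train_labels = lof_outlier.fit_predict(X_train)

# Novelty detection mode: fit on normal data, then score new data.
lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
test_labels = lof_novelty.predict(X_test)
test_scores = lof_novelty.score_samples(X_test)

# Each mode hides the methods that belong to the other one.
outlier_mode_has_predict = hasattr(lof_outlier, "predict")
novelty_mode_has_fit_predict = hasattr(lof_novelty, "fit_predict")
```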

Future work
- OutlierMixin: add predict? decision_function?
- Anomaly detection benchmarks: one script per context (novelty vs outlier detection); convert fast benchmarks into examples
- New estimators: SGDOneClassSVM, Local Outlier Probabilities, SVDD, online univariate anomaly detection
- Documentation

Thanks to
- @agramfort, @jnothman, @ngoix, @amueller, @lesteve, @ogrisel, @TomDLT
- the scikit-learn community
