
From billions to hundreds - How machine learning helps experts detect sensitive data leaks

Giulia
November 07, 2021


At CybelAngel we scan the internet looking for data leaks. We bring back billions of candidate alerts only to send very few really sensitive leaks to their legitimate owners.

To go from billions of alerts down to the hundreds that analysts can realistically curate, machine learning is an essential step for filtering out false alerts and reducing noise.

Since we are looking for a needle in a haystack, one of the challenges we face when training a machine learning model is dealing with highly unbalanced classes. In this talk I present methods to tackle this problem and train a performant model.

DevFest Nantes 2021
Conference 45 minutes

Video https://youtu.be/d8LsGdGS5UY


Transcript

  1. From billions to hundreds - How machine learning helps experts detect sensitive data leaks (21/10/2021 - Giulia Bianchi)
  2. Giulia Bianchi - previously Data Scientist @ Xebia, currently Senior Data Scientist @ CybelAngel, always passionate about sharing knowledge. @Giuliabianchl
  3. CybelAngel data lifecycle - pipeline stages (documents per day): data collection (raw feed, comprehensive scanning) → data processing (client-specific: client keyword filtering, keyword matching, machine learning) → human intelligence (refined feed, human analysis) → SaaS delivery (qualified alerts on the client and analyst platforms). Volumes shrink from billions of documents to thousands of servers to hundreds of alerts.
  4. Some numbers - 5 days in July, 1 scope: 50k alerts discarded by ML, 5k alerts in the feed treated by analysts, 50 investigated alerts, 10 sent alerts.
  5. One of the main challenges for our models is that sensitive leaks are very few compared to the large number of alerts.
  6. Model predictions - balanced classes, binary classification. Training data: 50% sensitive documents, 50% non-sensitive documents. Predictions after training: predicted class = sensitive (1 error), predicted class = non-sensitive (1 error).
  7. Model performance - balanced classes, binary classification. Accuracy score: 8/10 = 0.8. Predictions after training: predicted class = sensitive (1 error), predicted class = non-sensitive (1 error).
  8. Model predictions - unbalanced classes, binary classification. Training data: 10% sensitive documents, 90% non-sensitive documents. Predictions after training: predicted class = sensitive (1 error), predicted class = non-sensitive (1 error).
  9. Model performance - unbalanced classes, binary classification. Accuracy score: 8/10 = 0.8. Predictions after training: predicted class = sensitive (1 error), predicted class = non-sensitive (1 error).
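A minimal sketch (not from the slides) of the same accuracy trap, assuming scikit-learn: a classifier that always predicts the majority class still reaches 90% accuracy on 90/10 data while missing every sensitive document.

    import numpy as np
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import accuracy_score, recall_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = np.array([1] * 100 + [0] * 900)            # 10% sensitive, 90% non-sensitive

    clf = DummyClassifier(strategy="most_frequent").fit(X, y)
    y_pred = clf.predict(X)

    print(accuracy_score(y, y_pred))   # 0.9 -> looks good
    print(recall_score(y, y_pred))     # 0.0 -> every sensitive document is missed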
  10. Possible actions at different levels: 1. data preparation, 2. model training, 3. model evaluation, 4. final prediction. The hypothesis is that you already have a significant, representative sample and that the feature engineering is OK.
  11. Possible actions at different levels: 1. data preparation → restore class balance, 2. model training → update class weights, 3. model evaluation → use an appropriate metric, 4. final prediction → tweak the probability threshold.
  12. Over- and undersampling - objective: increase the minority class proportion by adding or removing samples from the original dataset. References: imbalanced-learn/over-sampling, imbalanced-learn/under-sampling.
  13. Oversampling - objective: increase the minority class proportion by adding samples of the minority class (that just don't exist -.-). Mainly achievable in 2 ways: duplicate samples of the minority class and reuse them (easy, but high risk of overfitting), or synthesize samples from existing ones → SMOTE, Synthetic Minority Over-sampling Technique (more sophisticated, but risk of feeding data that will never be seen in production). Reference: imbalanced-learn/over-sampling.
  14. Undersampling - objective: increase the minority class proportion by removing samples from the majority class (there are plenty!). Easy peasy: randomly exclude samples from the majority class (potential loss of precious information). Reference: imbalanced-learn/under-sampling.
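A minimal sketch of both resampling strategies, assuming the imbalanced-learn package referenced on the slides; fit_resample returns a rebalanced copy of the training set (the test set is left untouched).

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler

    # Synthetic 90/10 training set, mirroring the unbalanced setting of the talk.
    X_train, y_train = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    # Oversampling: synthesize new minority-class samples (SMOTE).
    X_over, y_over = SMOTE(random_state=42).fit_resample(X_train, y_train)

    # Undersampling: randomly drop majority-class samples.
    X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)

    print(Counter(y_train))   # roughly {0: 900, 1: 100}
    print(Counter(y_over))    # balanced by adding synthetic minority samples
    print(Counter(y_under))   # balanced by removing majority samples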
  15. Errors on the minority class should count more - the class_weight parameter. Technically, the weights of misclassification errors are modified in the cost function during training: classical log-loss in logistic regression vs. modified log-loss. Reference: How to Improve Class Imbalance using Class Weights in Machine Learning.
  16. Errors on the minority class should count more - the class_weight parameter. Technically, the weights of misclassification errors are modified in the cost function during training: modified log-loss with W0 < W1, where W0 is the weight for the majority class and W1 the weight for the minority class. In scikit-learn this is achieved by setting the class_weight parameter. Reference: sklearn - class_weight.
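A minimal sketch of class weighting in scikit-learn (the exact formula shown on the slide is not reproduced here): with a weighted log-loss of the form -(W1·y·log(p) + W0·(1-y)·log(1-p)) and W0 < W1, mistakes on the minority (sensitive) class cost more during training.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X_train, y_train = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    # Explicit weights: W0 for the majority class (0), W1 for the minority class (1), W0 < W1.
    # The 1:9 ratio is only an illustrative choice for a 90/10 split.
    clf = LogisticRegression(class_weight={0: 1, 1: 9}).fit(X_train, y_train)

    # Or let scikit-learn derive weights as n_samples / (n_classes * class count).
    clf_balanced = LogisticRegression(class_weight="balanced").fit(X_train, y_train)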
  17. Better capture model performance - ROC curve: maximise TP (and TN), minimize FN and FP; with unbalanced classes TN is naturally big and FP is naturally small. Ideally TPR = 1 and FPR = 0. Reference: Receiver operating characteristic.
  18. FPR and TPR - balanced classes (4 TP, 4 TN, 1 FP, 1 FN). Accuracy score: 8/10 = 0.8 👍. False Positive Rate: FP/(FP+TN) = 1/(1+4) = 0.2 👍. True Positive Rate: TP/(TP+FN) = 4/(4+1) = 0.8 👍.
  19. FPR and TPR - unbalanced classes (0 TP, 8 TN, 1 FP, 1 FN). Accuracy score: 8/10 = 0.8 👍. False Positive Rate: FP/(FP+TN) = 1/(1+8) = 0.11 👍. True Positive Rate: TP/(TP+FN) = 0/(0+1) = 0 🤨.
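The unbalanced toy example above, recomputed with scikit-learn (a sketch, not slide code): accuracy still looks fine while the TPR exposes the missed leak.

    import numpy as np
    from sklearn.metrics import accuracy_score, confusion_matrix

    # 0 TP, 8 TN, 1 FP, 1 FN, as on the slide.
    y_true = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    y_pred = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(accuracy_score(y_true, y_pred))   # 0.8 -> still looks good
    print(fp / (fp + tn))                   # FPR = 1/9 ≈ 0.11
    print(tp / (tp + fn))                   # TPR = 0/1 = 0 -> the sensitive document is missed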
  20. Better capture model performance - Precision-Recall curve: maximise TP, minimize FN and FP; with unbalanced classes TN is naturally big and FP is naturally small 👉 TN is no longer part of the metrics. Ideally Precision = 1 and Recall = 1. Reference: Precision and Recall.
  21. Precision and Recall - balanced classes (4 TP, 4 TN, 1 FP, 1 FN). Accuracy score: 8/10 = 0.8 👍. Precision: TP/(TP+FP) = 4/(4+1) = 0.8 👍. Recall: TP/(TP+FN) = 4/(4+1) = 0.8 👍.
  22. Precision and Recall - unbalanced classes (0 TP, 8 TN, 1 FP, 1 FN). Accuracy score: 8/10 = 0.8 👍. Precision: TP/(TP+FP) = 0/(0+1) = 0 😭. Recall: TP/(TP+FN) = 0/(0+1) = 0 😭.
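The same toy predictions through precision and recall (sketch): because true negatives appear in neither formula, both metrics drop to zero and the problem is visible immediately.

    import numpy as np
    from sklearn.metrics import precision_score, recall_score

    y_true = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    y_pred = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

    print(precision_score(y_true, y_pred, zero_division=0))   # 0/(0+1) = 0
    print(recall_score(y_true, y_pred))                       # 0/(0+1) = 0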
  23. Prediction: class vs. probability. model.predict_proba outputs a score between 0 and 1 that, depending on the model, can represent a probability 👉 if the score is well calibrated it represents the proportion of positive samples. model.predict outputs a label (the class): 1 if the score ≥ 0.5, 0 if the score < 0.5 👉 the 0.5 threshold is not adapted to unbalanced classes, where the proportion of positive samples is less than half 💡 use a more appropriate threshold.
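A minimal sketch of replacing the default 0.5 cut-off of model.predict with a custom threshold on model.predict_proba (the 0.25 value is purely illustrative and would be tuned on validation data):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    model = LogisticRegression().fit(X_train, y_train)

    scores = model.predict_proba(X_test)[:, 1]     # score of the positive (sensitive) class
    threshold = 0.25                               # illustrative value, not the default 0.5
    y_pred = (scores >= threshold).astype(int)     # matches model.predict only when threshold = 0.5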
  24. How to choose the best threshold? 🧠 The threshold impacts TP, TN, FP, FN… and all the metrics derived from them! Quantitative point of view: choose the threshold that optimises a given metric → evaluate how the metrics along the ROC and precision-recall curves evolve with the threshold. Qualitative point of view: assign business meaning to the metrics → quantify the impact of prioritizing recall or precision.
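One possible quantitative reading of the slide, as a sketch: sweep the thresholds returned by precision_recall_curve and keep the one that maximises F1 (any other metric could be optimised instead).

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
    scores = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]

    # One (precision, recall) point per candidate threshold.
    precision, recall, thresholds = precision_recall_curve(y_test, scores)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)

    best = f1[:-1].argmax()               # the last point has no threshold attached
    print(thresholds[best], f1[best])     # candidate operating point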
  25. How to choose the best threshold? 🧠 Translation of precision and recall into CybelAngel business language. Conservative choice 👉 higher recall, lower precision 👉 big volume 👉 keep all true leaks 👉 too much time to go through the feed + need more analysts. Non-conservative choice 👉 higher precision, lower recall 👉 smaller feed 👉 possible miss 👉 client not happy 💸 loss of money.
  26. Generic pipeline (Kaggle - Random acts of pizza)

    # Imports implied by the slide (not shown on it).
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support
    from sklearn.pipeline import Pipeline

    count_vectorizer = CountVectorizer(
        strip_accents="ascii",
        lowercase=True,
        stop_words="english",
        ngram_range=(1, 3),
        max_features=500,
        min_df=.01,
        max_df=.90,
    )
    tfidf_transformer = TfidfTransformer()
    model = <MODEL>

    pipeline = Pipeline([
        ('vect', count_vectorizer),
        ('tfidf', tfidf_transformer),
        ('clf', model),
    ])
    pipeline = pipeline.fit(X_train, y_train)

    # Custom threshold on the positive-class score instead of the default 0.5.
    predicted_probabilities = pipeline.predict_proba(X_test)
    pos_th = <THRESHOLD>
    predicted_classes = np.array([
        True if i[-1] >= pos_th else False for i in predicted_probabilities
    ])

    [precision, recall, f1, support] = precision_recall_fscore_support(y_test, predicted_classes)
    acc = accuracy_score(y_test, predicted_classes)
    perf = {
        "accuracy score": round(acc, 2),
        "precision": round(precision[-1], 2),
        "recall": round(recall[-1], 2),
    }
  27. Results comparison - 25% positive class vs. 75% negative class (Kaggle - Random acts of pizza)

    <MODEL>                                      <THRESHOLD>   accuracy   precision   recall   FPR
    MultinomialNB()                              0.5           0.75       0.5         0.01     0.003
    MultinomialNB()                              0.246         0.59       0.3         0.5      0.38
    LogisticRegression()                         0.5           0.76       0.58        0.06     0.01
    LogisticRegression(class_weight="balanced")  0.5           0.63       0.35        0.57     0.35
  28. Results comparison - 25% positive class vs. 75% negative class (Kaggle - Random acts of pizza).
  29. Results comparison - 25% positive class vs. 75% negative class (Kaggle - Random acts of pizza).
  30. Spend time on due diligence (good data & feature engineering) ⚠ Pay attention to overly optimistic performance and overfitting 🏆 Combine and test different technical solutions 🧠 Interpret results according to your use case 🚀 Enjoy!