Slide 1

Slide 1 text

21/10/2021 - Giulia Bianchi - From billions to hundreds: how machine learning helps experts detect sensitive data leaks (© CybelAngel 2021)

Slide 2

Slide 2 text

Giulia Bianchi - previously Data Scientist @ Xebia, currently Senior Data Scientist @ CybelAngel, always passionate about sharing knowledge - @Giuliabianchl

Slide 3

Slide 3 text

[Pipeline diagram] CybelAngel data lifecycle stages (documents per day): data collection (raw feed) - comprehensive scanning, billions of documents, thousands of servers → data processing (client-specific) - keyword matching, client-keyword filtering → human intelligence (refined feed) - machine learning and human analysis, hundreds of alerts → SaaS delivery - qualified alerts on the client platform and analyst platform

Slide 4

Slide 4 text

Some numbers - 5 days in July, 1 scope: 50k alerts discarded by ML, 5k alerts in the feed treated by analysts, 50 investigated alerts, 10 sent alerts

Slide 5

Slide 5 text

One of the main challenges for our models is that sensitive leaks are very rare compared to the large number of alerts

Slide 6

Slide 6 text

Unbalanced class problem - binary classification

Slide 7

Slide 7 text

Balanced classes - binary classification - model predictions. Training data: 50% sensitive documents, 50% non-sensitive documents. Predictions after training: predicted class = sensitive (1 error), predicted class = non-sensitive (1 error)

Slide 8

Slide 8 text

Balanced classes - binary classification - model performance. Accuracy score: 8/10 = 0.8. Predictions after training: predicted class = sensitive (1 error), predicted class = non-sensitive (1 error)

Slide 9

Slide 9 text

Unbalanced classes - binary classification - model predictions. Training data: 10% sensitive documents, 90% non-sensitive documents. Predictions after training: predicted class = sensitive (1 error), predicted class = non-sensitive (1 error)

Slide 10

Slide 10 text

Unbalanced classes - binary classification - model performance. Accuracy score: 8/10 = 0.8. Predictions after training: predicted class = sensitive (1 error), predicted class = non-sensitive (1 error)

Slide 11

Slide 11 text

Same good accuracy score, but the second model is useless!

Slide 12

Slide 12 text

How to tackle the problem

Slide 13

Slide 13 text

Possible actions at different levels: 1. Data preparation 2. Model training 3. Model evaluation 4. Final prediction. The assumption is that you already have a significant, representative sample and that the feature engineering is sound

Slide 14

Slide 14 text

Possible actions at different levels: 1. Data preparation → restore class balance 2. Model training → update class weight 3. Model evaluation → use an appropriate metric 4. Final prediction → tweak the probability threshold

Slide 15

Slide 15 text

1. Restore class balance (data preparation)

Slide 16

Slide 16 text

Over- and undersampling. Objective: increase the minority class proportion by adding or removing samples from the original dataset. [Diagrams: oversampling, undersampling] References: imbalanced-learn/over-sampling, imbalanced-learn/under-sampling

Slide 17

Slide 17 text

Oversampling. Objective: increase the minority class proportion by adding samples of the minority class (which just don't exist -.-). Mainly achievable in two ways (see the sketch below):
● Duplicate samples of the minority class and reuse them
○ Easy, but high risk of overfitting
● Synthesize new samples from existing ones → SMOTE (Synthetic Minority Over-sampling Technique)
○ More sophisticated, but risk of feeding data that will never be seen in production
Reference: imbalanced-learn/over-sampling
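The slide cites imbalanced-learn; below is a minimal, hypothetical sketch of both oversampling options on a toy dataset built with scikit-learn's make_classification (the data is an assumption, not from the talk):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Toy unbalanced dataset: roughly 10% positive class, as in the earlier example.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("original:", Counter(y))

# Option 1: duplicate existing minority samples (easy, high risk of overfitting).
X_dup, y_dup = RandomOverSampler(random_state=0).fit_resample(X, y)
print("random oversampling:", Counter(y_dup))

# Option 2: SMOTE synthesizes new minority samples by interpolating between neighbours.
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
print("SMOTE:", Counter(y_sm))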

Slide 18

Slide 18 text

Undersampling. Objective: increase the minority class proportion by removing samples from the majority class (there are plenty!). Easy peasy (see the sketch below):
● Randomly exclude samples from the majority class
○ Potential loss of precious information
Reference: imbalanced-learn/under-sampling
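Again following the imbalanced-learn reference, a hypothetical undersampling sketch on the same kind of toy data (make_classification is an assumption, not from the talk):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("original:", Counter(y))

# Randomly drop majority-class samples until the classes are balanced.
# Cheap, but the discarded documents may carry useful information.
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("random undersampling:", Counter(y_res))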

Slide 19

Slide 19 text

2. Increase the minority class weight (model training)

Slide 20

Slide 20 text

Errors on the minority class should count more: the class_weight parameter. Technically, the weights of misclassification errors are modified in the cost function during training:
● Classical log-loss in logistic regression
● Modified (class-weighted) log-loss
Reference: How to Improve Class Imbalance using Class Weights in Machine Learning
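The log-loss formulas appear as images in the original deck; as a reconstruction (the exact notation is an assumption), the classical log-loss and its class-weighted variant are:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[\,y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\,\bigr]$$

$$\mathcal{L}_w = -\frac{1}{N}\sum_{i=1}^{N}\bigl[\,w_1\, y_i \log \hat{p}_i + w_0\,(1 - y_i)\log(1 - \hat{p}_i)\,\bigr]$$

where $\hat{p}_i$ is the predicted probability that sample $i$ is positive, $w_0$ weights errors on the majority class and $w_1$ weights errors on the minority class.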

Slide 21

Slide 21 text

Errors on the minority class should count more: the class_weight parameter. Technically, the weights of misclassification errors are modified in the cost function during training:
● Modified log-loss with W0 < W1
● W0 is the weight for the majority class, W1 the weight for the minority class
● In scikit-learn this is achieved by setting the class_weight parameter
Reference: sklearn - class_weight
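A minimal sketch of the class_weight parameter in scikit-learn, on a toy dataset (the explicit weight values are illustrative assumptions, not from the talk):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# "balanced" sets each class weight to n_samples / (n_classes * count_of_class),
# so the rarer (positive) class gets the larger weight, i.e. W0 < W1.
clf_balanced = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Explicit weights can also be passed as a dict {class_label: weight}.
clf_manual = LogisticRegression(class_weight={0: 1, 1: 9}, max_iter=1000).fit(X, y)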

Slide 22

Slide 22 text

3. Choose an appropriate metric (model evaluation)

Slide 23

Slide 23 text

Better capture model performance: ROC curve (receiver operating characteristic).
● Maximise TP (and TN)
● Minimise FN and FP
● With unbalanced classes, TN is naturally big and FP is naturally small
Ideally TPR = 1 and FPR = 0
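A hypothetical sketch of computing the ROC curve with scikit-learn (the toy data and model are assumptions, not the talk's pipeline):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]

# One (FPR, TPR) point per candidate threshold; the ideal corner is (0, 1).
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("ROC AUC:", roc_auc_score(y_test, scores))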

Slide 24

Slide 24 text

FPR and TPR - balanced classes (10 documents: 4 TP, 4 TN, 1 FP, 1 FN).
Accuracy score: 8/10 = 0.8 👍
False positive rate: FP/(FP+TN) = 1/(1+4) = 0.2 👍
True positive rate: TP/(TP+FN) = 4/(4+1) = 0.8 👍

Slide 25

Slide 25 text

FPR and TPR - unbalanced classes (10 documents: 0 TP, 8 TN, 1 FP, 1 FN).
Accuracy score: 8/10 = 0.8 👍
False positive rate: FP/(FP+TN) = 1/(1+8) = 0.11 👍
True positive rate: TP/(TP+FN) = 0/(0+1) = 0 🤨
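The slide's unbalanced example can be reproduced numerically; a small sketch (the ordering of the 10 documents is arbitrary):

import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# 10 documents, 1 truly sensitive (label 1); the model flags a single
# non-sensitive one instead, as on the slide.
y_true = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy:", accuracy_score(y_true, y_pred))  # 0.8
print("FPR:", fp / (fp + tn))                       # 1/9 ≈ 0.11
print("TPR:", tp / (tp + fn))                       # 0/1 = 0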

Slide 26

Slide 26 text

Better capture model performance: precision-recall curve.
● Maximise TP
● Minimise FN and FP
● With unbalanced classes, TN is naturally big and FP is naturally small 👉 TN no longer appears in the metrics
Ideally precision = 1 and recall = 1
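A hypothetical sketch of the precision-recall curve with scikit-learn (toy data and model are assumptions):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# Precision and recall per threshold; TN never appears in either formula,
# so the curve is not inflated by the large majority class.
precision, recall, thresholds = precision_recall_curve(y_test, scores)
print("average precision:", average_precision_score(y_test, scores))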

Slide 27

Slide 27 text

Precision and recall - balanced classes (10 documents: 4 TP, 4 TN, 1 FP, 1 FN).
Accuracy score: 8/10 = 0.8 👍
Precision: TP/(TP+FP) = 4/(1+4) = 0.8 👍
Recall: TP/(TP+FN) = 4/(4+1) = 0.8 👍

Slide 28

Slide 28 text

Precision and recall - unbalanced classes (10 documents: 0 TP, 8 TN, 1 FP, 1 FN).
Accuracy score: 8/10 = 0.8 👍
Precision: TP/(TP+FP) = 0/(0+1) = 0 😭
Recall: TP/(TP+FN) = 0/(0+1) = 0 😭
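The same 10-document example, checked with scikit-learn's precision and recall (sketch only):

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Unbalanced example from the slide: 1 FN, 1 FP, 8 TN, 0 TP.
y_true = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

print("precision:", precision_score(y_true, y_pred))  # 0.0
print("recall:", recall_score(y_true, y_pred))        # 0.0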

Slide 29

Slide 29 text

4. Tweak the threshold (final prediction)

Slide 30

Slide 30 text

Prediction: class vs. probability - what the model outputs.
model.predict_proba:
● Outputs a score
● The score varies between 0 and 1
● Depending on the model it can represent a probability 👉 if the score is well calibrated, it represents the proportion of positive samples
model.predict:
● Outputs a label (the class)
● The value is 1 if the score ≥ 0.5
● The value is 0 if the score < 0.5
👉 The 0.5 threshold is not adapted to unbalanced classes, where the proportion of positive samples is less than half 💡 use a more appropriate threshold (sketch below)
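A minimal sketch of replacing the implicit 0.5 cut-off of predict with a custom threshold on predict_proba (toy data, model and the 0.2 threshold are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

scores = clf.predict_proba(X_test)[:, 1]  # score for the positive class

default_pred = clf.predict(X_test)        # implicit 0.5 threshold
pos_th = 0.2                              # hypothetical lower threshold
custom_pred = (scores >= pos_th).astype(int)

print("positives at 0.5:", default_pred.sum(), "- positives at", pos_th, ":", custom_pred.sum())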

Slide 31

Slide 31 text

How to choose the best threshold? 🧠 The threshold impacts TP, TN, FP, FN... and all the metrics derived from them!
Quantitative point of view:
● Choose the threshold that optimises a given metric → evaluate how the metrics evolve with the threshold along the ROC and precision-recall curves (sketch below)
Qualitative point of view:
● Assign business meaning to the metrics → quantify the impact of prioritising recall or precision
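One possible quantitative approach, sketched below: sweep the thresholds returned by precision_recall_curve and keep the one that maximises F1 (the choice of F1 and the toy data are assumptions; any metric could be optimised):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, scores)
# precision/recall have one extra trailing point with no threshold; drop it.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = np.argmax(f1)
print("best threshold for F1:", thresholds[best], "- F1:", f1[best])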

Slide 32

Slide 32 text

How to choose the best threshold? 🧠 Translating precision and recall into CybelAngel business language.
Conservative choice 👉 higher recall, lower precision 👉 bigger volume 👉 all true leaks are kept 👉 too much time to go through the feed + need for more analysts.
Non-conservative choice 👉 higher precision, lower recall 👉 smaller feed 👉 possible misses 👉 client not happy 💸 loss of money

Slide 33

Slide 33 text

A practical example

Slide 34

Slide 34 text

Generic pipeline (Kaggle - Random acts of pizza)

# Imports implied by the slide
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

count_vectorizer = CountVectorizer(
    strip_accents="ascii",
    lowercase=True,
    stop_words="english",
    ngram_range=(1, 3),
    max_features=500,
    min_df=.01,
    max_df=.90,
)
tfidf_transformer = TfidfTransformer()

# Classifier under test: left blank on the slide; e.g. MultinomialNB() or
# LogisticRegression(), as in the results slide
model = MultinomialNB()

pipeline = Pipeline([
    ('vect', count_vectorizer),
    ('tfidf', tfidf_transformer),
    ('clf', model),
])

# X_train, y_train, X_test, y_test come from the Kaggle "Random acts of pizza" data
pipeline = pipeline.fit(X_train, y_train)
predicted_probabilities = pipeline.predict_proba(X_test)

# Positive-class probability threshold: left blank on the slide; 0.5 by default,
# lowered (e.g. to 0.246) in one of the experiments
pos_th = 0.5

predicted_classes = np.array([
    True if i[-1] >= pos_th else False for i in predicted_probabilities
])

[precision, recall, f1, support] = precision_recall_fscore_support(y_test, predicted_classes)
acc = accuracy_score(y_test, predicted_classes)
perf = {
    "accuracy score": round(acc, 2),
    "precision": round(precision[-1], 2),
    "recall": round(recall[-1], 2),
}

Slide 35

Slide 35 text

Results comparison - 25% positive class vs. 75% negative class (Kaggle - Random acts of pizza). Columns: model, positive-class threshold (pos_th), accuracy, precision, recall, FPR.
MultinomialNB() | 0.5 | 0.75 | 0.5 | 0.01 | 0.003
MultinomialNB() | 0.246 | 0.59 | 0.3 | 0.5 | 0.38
LogisticRegression() | 0.5 | 0.76 | 0.58 | 0.06 | 0.01
LogisticRegression(class_weight="balanced") | 0.5 | 0.63 | 0.35 | 0.57 | 0.35

Slide 36

Slide 36 text

Results comparison - 25% positive class vs. 75% negative class (Kaggle - Random acts of pizza)

Slide 37

Slide 37 text

Results comparison - 25% positive class vs. 75% negative class (Kaggle - Random acts of pizza)

Slide 38

Slide 38 text

Conclusion

Slide 39

Slide 39 text

Spend time on due diligence (good data & feature engineering)
⚠ Pay attention to overly optimistic performance and overfitting
🏆 Combine and test different technical solutions
🧠 Interpret results according to your use case
🚀 Enjoy!

Slide 40

Slide 40 text

Thank you! Q&A