Giulia
November 07, 2021

From billions to hundreds - How machine learning helps experts detect sensitive data leaks

At CybelAngel we scan the internet looking for data leaks. We bring back billions of candidate alerts, yet send only a very small number of truly sensitive leaks to their legitimate owners.

In the process of going from billions to hundreds of alerts, so that curation by analysts remains possible, machine learning is an essential step to filter out false alerts and reduce noise.

As we are looking for a needle in a haystack, one of the challenges we face when training a machine learning model is dealing with highly unbalanced classes. In this talk I present methods to tackle this problem and still obtain a performant model.

DevFest Nantes 2021
Conference (45 minutes)

Video https://youtu.be/d8LsGdGS5UY

Transcript

1. From billions to hundreds: How machine learning helps experts detect sensitive data leaks (21/10/2021 - Giulia Bianchi)
2. Giulia Bianchi (@Giuliabianchl): previously Data Scientist @ Xebia, currently Senior Data Scientist @ CybelAngel, always passionate about sharing knowledge.
3. CybelAngel data lifecycle. Stages: data collection (raw feed), data processing (client-specific), human intelligence (refined feed), SaaS delivery (client platform / analyst platform). Data pipeline: comprehensive scanning, keyword matching, client keywords filtering, machine learning, human analysis. Documents per day: billions of documents, thousands of servers, hundreds of alerts, qualified alerts.
4. Some numbers (5 days in July, 1 scope): 50 k alerts discarded by ML, 5 k alerts in the feed treated by analysts, 50 investigated alerts, 10 sent alerts.
5. One of the main challenges for our models is that sensitive leaks are very few compared to the large number of alerts.
6. Model predictions, balanced classes (binary classification). Training data: 50 % sensitive documents, 50 % non-sensitive documents. Predictions after training ("It's sensitive" / "It's not sensitive"): predicted class = sensitive (1 error), predicted class = non-sensitive (1 error).
7. Model performance, balanced classes (binary classification). Accuracy score: 8/10 = 0.8. Predictions after training: predicted class = sensitive (1 error), predicted class = non-sensitive (1 error).
8. Model predictions, unbalanced classes (binary classification). Training data: 10 % sensitive documents, 90 % non-sensitive documents. Predictions after training: predicted class = sensitive (1 error), predicted class = non-sensitive (1 error).
9. Model performance, unbalanced classes (binary classification). Accuracy score: 8/10 = 0.8. Predictions after training: predicted class = sensitive (1 error), predicted class = non-sensitive (1 error).
10. Possible actions at different levels: 1. data preparation, 2. model training, 3. model evaluation, 4. final prediction. The hypothesis is that you already have a sufficiently large, representative sample and that the feature engineering is OK.
11. Possible actions at different levels: 1. data preparation → restore class balance, 2. model training → update the class weights, 3. model evaluation → use an appropriate metric, 4. final prediction → tweak the probability threshold.
12. Over- and undersampling. Objective: increase the minority class proportion by adding or removing samples from the original dataset. imbalanced-learn/over-sampling, imbalanced-learn/under-sampling
13. Oversampling. Objective: increase the minority class proportion by adding samples of the minority class (that just don't exist -.-). Mainly achievable in 2 ways:
    • duplicate samples of the minority class and reuse them: easy, but high risk of overfitting
    • synthesize samples from existing ones → SMOTE, Synthetic Minority Over-sampling Technique: more sophisticated, but risk of feeding data that will never be seen in production
    imbalanced-learn/over-sampling
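
A minimal sketch of both oversampling flavours with imbalanced-learn, assuming a feature matrix X and a label vector y where class 1 is the minority (the toy data below is illustrative, not CybelAngel data):

    import numpy as np
    from imblearn.over_sampling import RandomOverSampler, SMOTE

    # Toy unbalanced dataset: 90 non-sensitive samples (class 0), 10 sensitive samples (class 1)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = np.array([0] * 90 + [1] * 10)

    # Duplicate existing minority samples until the classes are balanced
    X_dup, y_dup = RandomOverSampler(random_state=0).fit_resample(X, y)

    # Synthesize new minority samples by interpolating between neighbours (SMOTE)
    X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)

    print(np.bincount(y), np.bincount(y_dup), np.bincount(y_smote))  # [90 10] [90 90] [90 90]
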
14. Undersampling. Objective: increase the minority class proportion by removing samples from the majority class (there are plenty!). Easy peasy:
    • randomly exclude samples from the majority class: potential loss of precious information
    imbalanced-learn/under-sampling
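
The undersampling counterpart with imbalanced-learn, a minimal sketch under the same assumptions (X, y, class 1 as minority):

    import numpy as np
    from imblearn.under_sampling import RandomUnderSampler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = np.array([0] * 90 + [1] * 10)

    # Randomly drop majority-class samples until both classes have 10 samples
    X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
    print(np.bincount(y_under))  # [10 10]
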
15. Errors on the minority class should count more: the class_weight parameter. Technically, the weights of misclassification errors are modified in the cost function during training: the classical log-loss of logistic regression becomes a modified (weighted) log-loss. How to Improve Class Imbalance using Class Weights in Machine Learning
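
To make the idea concrete (standard weighted log-loss written with the W0/W1 notation of the next slide; the slide's exact formula is not reproduced here), for one sample with true label y and predicted probability p:

    classical log-loss:  L   = -[ y·log(p) + (1 - y)·log(1 - p) ]
    weighted log-loss:   L_w = -[ W1·y·log(p) + W0·(1 - y)·log(1 - p) ],  with W0 < W1

so errors on the minority (positive) class cost more during training.
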
16. Errors on the minority class should count more: the class_weight parameter. Technically, the weights of misclassification errors are modified in the cost function during training (modified log-loss), with W0 < W1, where W0 is the weight for the majority class and W1 the weight for the minority class. In scikit-learn this is achieved by setting the parameter class_weight. sklearn - class_weight
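
A minimal scikit-learn sketch, assuming a training set X, y where class 1 is the rare sensitive class (the toy data is illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.utils.class_weight import compute_class_weight

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))
    y = (rng.random(1000) < 0.1).astype(int)   # roughly 10 % positive class

    # Option 1: weights inversely proportional to class frequencies, computed by scikit-learn
    clf = LogisticRegression(class_weight="balanced").fit(X, y)

    # Option 2: compute or set the weights explicitly (W0 for class 0 < W1 for class 1)
    w0, w1 = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
    clf = LogisticRegression(class_weight={0: w0, 1: w1}).fit(X, y)
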
17. Better capture model performance: ROC curve. Maximise TP (and TN), minimize FN and FP. With unbalanced classes, TN is naturally big and FP is naturally small. Ideally TPR = 1 and FPR = 0. Receiver operating characteristic
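
A minimal sketch of the corresponding scikit-learn calls, assuming ground-truth labels and the positive-class scores returned by predict_proba (toy values shown):

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
    scores = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.6, 0.7, 0.4])

    # One (FPR, TPR) point per candidate threshold, plus the area under the curve
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    auc = roc_auc_score(y_true, scores)
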
18. FPR and TPR, balanced classes (4 TP, 4 TN, 1 FP, 1 FN). Accuracy score: 8/10 = 0.8 👍. False positive rate: FP/(FP+TN) = 1/(1+4) = 0.2 👍. True positive rate: TP/(TP+FN) = 4/(4+1) = 0.8 👍.
19. FPR and TPR, unbalanced classes (0 TP, 8 TN, 1 FP, 1 FN). Accuracy score: 8/10 = 0.8 👍. False positive rate: FP/(FP+TN) = 1/(1+8) = 0.11 👍. True positive rate: TP/(TP+FN) = 0/(0+1) = 0 🤨.
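
The unbalanced example can be reproduced as a minimal sketch (the ten labels and predictions below mirror the slide's toy grid):

    import numpy as np
    from sklearn.metrics import accuracy_score, confusion_matrix

    # 1 sensitive document out of 10; the model misses it and raises one false alert
    y_true = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    y_pred = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(accuracy_score(y_true, y_pred))   # 0.8, despite catching no sensitive document
    print(fp / (fp + tn), tp / (tp + fn))   # FPR ≈ 0.11, TPR = 0.0
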
20. Better capture model performance: precision-recall curve. Maximise TP, minimize FN and FP. With unbalanced classes, TN is naturally big and FP is naturally small 👉 TN does not appear in these metrics anymore. Ideally precision = 1 and recall = 1. Precision and Recall
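
The precision-recall counterpart, a minimal sketch under the same assumptions (labels plus positive-class scores from predict_proba):

    import numpy as np
    from sklearn.metrics import precision_recall_curve, average_precision_score

    y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
    scores = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.6, 0.7, 0.4])

    # One (precision, recall) point per candidate threshold, summarised by average precision
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    ap = average_precision_score(y_true, scores)
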
21. Precision and recall, balanced classes (4 TP, 4 TN, 1 FP, 1 FN). Accuracy score: 8/10 = 0.8 👍. Precision: TP/(TP+FP) = 4/(4+1) = 0.8 👍. Recall: TP/(TP+FN) = 4/(4+1) = 0.8 👍.
22. Precision and recall, unbalanced classes (0 TP, 8 TN, 1 FP, 1 FN). Accuracy score: 8/10 = 0.8 👍. Precision: TP/(TP+FP) = 0/(0+1) = 0 😭. Recall: TP/(TP+FN) = 0/(0+1) = 0 😭.
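
The same toy unbalanced predictions as above, evaluated with precision and recall (minimal sketch):

    import numpy as np
    from sklearn.metrics import precision_score, recall_score

    y_true = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    y_pred = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

    # Both are 0: the only sensitive document is missed and the only alert raised is false
    print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
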
23. Prediction: class vs. probability. model.predict_proba outputs a score between 0 and 1; depending on the model it can represent a probability 👉 if the score is well calibrated, it represents the proportion of positive samples. model.predict outputs a label (the class): 1 if the score is ≥ 0.5, 0 if the score is < 0.5 👉 the 0.5 threshold is not adapted to unbalanced classes, where the proportion of positive samples is less than half 💡 use a more appropriate threshold.
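
A minimal sketch of replacing the implicit 0.5 cut-off of predict with a custom threshold (the classifier and the 0.3 value are illustrative assumptions):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(1000, 20))
    y_train = (rng.random(1000) < 0.1).astype(int)
    X_test = rng.normal(size=(100, 20))

    clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

    pos_th = 0.3                                     # to be tuned, see the next slide
    scores = clf.predict_proba(X_test)[:, 1]         # score of the positive class
    custom_pred = (scores >= pos_th).astype(int)     # instead of clf.predict(X_test), which uses 0.5
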
24. How to choose the best threshold? 🧠 The threshold impacts TP, TN, FP, FN… and all the metrics derived from them!
    • Quantitative point of view: choose the threshold that optimises a given metric → evaluate how the metrics of the ROC and precision-recall curves evolve as the threshold changes
    • Qualitative point of view: assign a business meaning to the metrics → quantify the impact of prioritizing recall or precision
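
One common quantitative heuristic (an illustrative choice, not necessarily the one used at CybelAngel) is to scan the precision-recall curve and keep the threshold that maximises the F1 score:

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
    scores = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.6, 0.7, 0.4])

    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)   # avoid division by zero
    best = np.argmax(f1[:-1])            # the last (precision, recall) point has no threshold
    print(thresholds[best], f1[best])
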
25. How to choose the best threshold? 🧠 Translation of precision and recall into CybelAngel business language.
    • Conservative choice 👉 higher recall, lower precision 👉 big volume 👉 keep all true leaks 👉 too much time to go through the feed + need for more analysts
    • Non-conservative choice 👉 higher precision, lower recall 👉 smaller feed 👉 possible misses 👉 client not happy 💸 loss of money
26. Generic pipeline (dataset: Kaggle - Random acts of pizza):

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    count_vectorizer = CountVectorizer(
        strip_accents="ascii",
        lowercase=True,
        stop_words="english",
        ngram_range=(1, 3),
        max_features=500,
        min_df=.01,
        max_df=.90,
    )
    tfidf_transformer = TfidfTransformer()
    model = <MODEL>

    pipeline = Pipeline([
        ('vect', count_vectorizer),
        ('tfidf', tfidf_transformer),
        ('clf', model),
    ])
    pipeline = pipeline.fit(X_train, y_train)

    # Score of the positive class, turned into labels with a custom threshold
    predicted_probabilities = pipeline.predict_proba(X_test)
    pos_th = <THRESHOLD>
    predicted_classes = np.array([
        True if i[-1] >= pos_th else False for i in predicted_probabilities
    ])

    [precision, recall, f1, support] = precision_recall_fscore_support(y_test, predicted_classes)
    acc = accuracy_score(y_test, predicted_classes)
    perf = {
        "accuracy score": round(acc, 2),
        "precision": round(precision[-1], 2),
        "recall": round(recall[-1], 2),
    }
27. Results comparison, 25 % positive class vs. 75 % negative class (Kaggle - Random acts of pizza):

    <MODEL>                                      <THRESHOLD>   accuracy   precision   recall   FPR
    MultinomialNB()                              0.5           0.75       0.5         0.01     0.003
    MultinomialNB()                              0.246         0.59       0.3         0.5      0.38
    LogisticRegression()                         0.5           0.76       0.58        0.06     0.01
    LogisticRegression(class_weight="balanced")  0.5           0.63       0.35        0.57     0.35
28. Results comparison, 25 % positive class vs. 75 % negative class (Kaggle - Random acts of pizza).
29. Results comparison, 25 % positive class vs. 75 % negative class (Kaggle - Random acts of pizza).
30. Spend time on due diligence (good data & feature engineering). ⚠ Watch out for overly optimistic performance and overfitting. 🏆 Combine and test different technical solutions. 🧠 Interpret results according to your use case. 🚀 Enjoy!