
Where We’re Going, We Don’t Need Labels: Anomaly Detection for 2FA

Stefano Meschiari
March 11, 2022

Authors: Becca Lynch and Stefano Meschiari, presented at the AI Village at DEF CON 29

Typical machine learning models in the security space use labels (annotations that describe whether a certain action is benign or malicious) in order to learn how to discriminate between threats and normal activity.

In practice, however, many systems in the security space that would benefit from machine learning models are critically hampered by a scarcity of labels. This may be due to many factors, such as low coverage of the collected labels, long latency between threat events and receiving the corresponding label, and noise in the feedback from domain experts and the system's users. New systems may have to be bootstrapped in the complete absence of established historical data (cold starts). And because human behavior is intrinsically difficult to predict, we are often left with benign activity that shifts constantly and attack techniques that keep improving.

In this talk, we discuss how we addressed the issues stemming from this complex ecosystem in the detection of two-factor authentication anomalies. We describe some of the algorithms, heuristics, and systems we developed to both understand user behavior and detect attack vectors, and discuss the many ways in which we fail miserably (and -- sometimes -- enjoy small successes).

Transcript

  1. © 2020 Cisco Systems, Inc. and/or its affiliates. All rights reserved. Where We’re Going, We Don’t Need Labels*: Anomaly Detection for 2FA. Becca Lynch & Stefano Meschiari
  2. Hello! Becca Lynch, Data Scientist; Stefano Meschiari, Data Scientist
  3. Agenda: Problem Space (attacks, data issues, and a world without labels); Our Approach (self-supervised models and simple(ish) heuristics); Lessons Learned (where we fail and fall short in the “real world”); Conclusions & What’s Next (future research directions and improvements); Questions
  4. Problem Space: 2FA attacks, lack of labels, data dimensionality
  5. Authenticates to access applications
  6. Typically credentials, something you know
  7. Using an additional “factor”, something you have/are
  8. The factor is the method used for secondary authentication
  9. (no transcript text)
  10. Goals: identify successful attacks or abnormal access (compromised primary credentials; phishing; access with no 2FA or low-trust factor; abnormal based on corporate policy)
  11. Goals: identify successful attacks or abnormal access (compromised primary credentials; phishing; access with no 2FA or low-trust factor; abnormal based on corporate policy); provide accessible, timely, and actionable visibility (clear and concise info on suspicious access; enable analysts to make informed decisions on user security and policies)
  12. Framework
  13. Framework: a security analyst who configures Duo for the users and wants to know if auths are anomalous
  14. Framework: ML models trained on historical data used to detect whether new authentications are suspicious
  15. Framework: ML models trained on historical data used to detect whether new authentications are suspicious. Does the work of detecting and ranking “events”, and is available to the analyst without any setup or extra installation
  16. Framework: ML models trained on historical data used to detect whether new authentications are suspicious. Does the work of detecting and ranking “events”, and is available to the analyst without any setup or extra installation. For a model to be trained, it needs some kind of label in order to differentiate between suspicious and benign behavior
  17. Constraints: avoiding alarm fatigue for analysts; detections need an associated explanation (a black box solution won’t cut it!); threat models and out-of-band information need to be inferred (not all detections have equal value)
  18. Data: primarily categorical with many possible levels for each attribute; makes some algorithms very hard to implement; certain components of the data may be highly censored; every customer has their own setup for authentication. Example auth record: { "customer": "Hollywood Studios", "user": "kfrog", "timestamp": "2020-06-09 12:36:00", "app": "Studio Creator Portal", "factor": "Duo Push", "access_ip": "123.45.678.XXX", "access_device": "Mac OS X, Chrome", "country_code": "US", "result": "FAILURE", "reason": "User not in allowed group", ... }
  19. Labels, user-generated: users can approve or deny an authentication; if denied, it can be marked as fraudulent; our first label is whether or not the user chose to mark an auth as fraud. Problems: users are not security experts; the authentication may time out; generally unreliable
  20. Labels, analyst-generated: analysts review surfaced authentications. Problems: analysts lack bandwidth; “suspicious” is conditional on expertise and providing the right context; providing feedback is optional
  21. Approach: self-supervised anomaly detection, OCCAM, detection pipeline
  22. Approach: our approach includes a number of interacting components that make up a detection pipeline
  23. Approach: we will later see how the components of our pipeline work together...
  24. Approach: ...but first we will zoom in on one of the more interesting algorithms
  25. A Different Spin on Anomaly Detection. Challenges: sparse, noisy, long-latency labels; data unsuitable for most common anomaly detection algorithms. Idea: use an ensemble of supervised models as an “avatar” for the unsupervised problem. This leverages off-the-shelf base learning algorithms; models structure in the data; applies more naturally to mixed-type data; and reserves labels for careful evaluation, rather than supervision
  26. OCCAM: Self-Supervision on Authentication Data (Outlier Classification with Categorical Attribute Models). An algorithm we developed specifically for this type of data. OCCAM recasts an unsupervised problem (no labels) into a self-supervised problem (the data itself provides the supervision). Intuition: define anomalies as authentications with attributes that are predicted to be unlikely, even given all the contextual information
  27. Ingredients: self-supervised submodels, anomaly ratios, submodel weights, anomaly score
  28. Ingredient 1: Self-supervised Submodel Ensemble. Goal: recover responses that have been hidden from the model. Each submodel is trained on historical auth data to recover one relevant attribute of the authentication at a time. Base learners (random forests) for each submodel return probabilistic predictions for all alternatives. factor ~ S1(user, location, device, …); device ~ S2(user, location, factor, …); location ~ S3(user, device, factor, …)
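The submodel idea is easy to sketch. The talk uses random forests as base learners; the toy below substitutes a plain conditional-frequency table so the idea fits in a few lines, and every attribute name and record here is made up for illustration:

```python
from collections import Counter, defaultdict

def train_submodel(auths, target):
    """Learn P(target | all other attributes) from historical auths.

    A conditional-frequency table stands in for the random-forest
    base learners described in the talk; it returns probabilistic
    predictions for all alternative values of `target`.
    """
    table = defaultdict(Counter)
    for auth in auths:
        context = tuple(sorted((k, v) for k, v in auth.items() if k != target))
        table[context][auth[target]] += 1

    def predict_proba(auth):
        context = tuple(sorted((k, v) for k, v in auth.items() if k != target))
        counts = table[context]
        total = sum(counts.values())
        return {value: n / total for value, n in counts.items()} if total else {}

    return predict_proba

# One submodel per attribute: factor ~ S1(user, location), location ~ S2(user, factor), ...
# Illustrative history: "lee" usually confirms with a push, occasionally via SMS.
history = [
    {"user": "lee", "location": "US", "factor": "push"},
    {"user": "lee", "location": "US", "factor": "push"},
    {"user": "lee", "location": "US", "factor": "sms"},
]
factor_model = train_submodel(history, "factor")
probs = factor_model({"user": "lee", "location": "US", "factor": "push"})
# probs -> {"push": 2/3, "sms": 1/3}
```

The "hidden response" framing is what makes this self-supervised: the label for each submodel is just one of the auth's own attributes, withheld during prediction.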
  29. Ingredient 2: Anomaly Ratios. Anomaly ratio r: for each auth, the ratio of the predicted probability of the most probable attribute value to the probability of the observed attribute value.
  30. Ingredient 3: Submodel Weights. Submodel weight w (Brier score): a proxy for the accuracy of the probabilistic predictions of a given submodel, measured on calibration data. (Anomaly ratio r as defined on the previous slide.)
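A minimal multiclass Brier score, the proxy for submodel quality named above. The talk does not specify how the score is mapped into a weight w, and the probabilities below are illustrative:

```python
def brier_score(predictions, observed):
    """Mean squared difference between predicted probabilities and
    one-hot outcomes over a calibration set; 0 is perfect, higher is
    worse. `predictions` is a list of {value: probability} dicts and
    `observed` the list of values that actually occurred."""
    total = 0.0
    for proba, actual in zip(predictions, observed):
        for value, p in proba.items():
            y = 1.0 if value == actual else 0.0
            total += (p - y) ** 2
    return total / len(observed)

# A well-calibrated location submodel earns a lower (better) score:
good = brier_score([{"US": 0.9, "IT": 0.1}], ["US"])  # (0.9-1)^2 + (0.1-0)^2 = 0.02
bad = brier_score([{"US": 0.9, "IT": 0.1}], ["IT"])   # (0.9-0)^2 + (0.1-1)^2 = 1.62
```

One plausible (but hypothetical) mapping is w = 1 / (score + ε), so that better-calibrated submodels contribute more to the final anomaly score.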
  31. OCCAM Anomaly Score. Anomaly score A: average of how “surprised” each submodel is to see the auth’s observed attributes, weighted by the quality of each submodel. (Anomaly ratio r and submodel weight w as defined on the previous slides.)
  32. Diagram: low anomaly score (ratio < 1) when the observed attribute value is more likely than any alternative; high anomaly score (ratio >> 1) when an alternative is far more likely than the observed value. For the location submodel, the contribution is w_location × p(most likely alternative location) / p(actual location), e.g. comparing p(Italy | Lee, Personal Device, Saturday, …) with p(United States | Lee, Work Device, Saturday, …)
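Putting the ingredients together, a minimal sketch of the ratio and score computation, following the diagram's "most likely alternative / actual location" reading. The probabilities, weights, and the second ratio below are illustrative, not from the talk:

```python
def anomaly_ratio(proba, observed):
    """Slide notation r: probability of the most likely *alternative*
    value divided by the probability of the observed value.
    r < 1 when the observed value is the most probable one,
    r >> 1 when a different value was much more likely."""
    alternatives = [p for value, p in proba.items() if value != observed]
    p_obs = proba.get(observed, 1e-6)  # floor for never-before-seen values
    return max(alternatives) / p_obs

def occam_score(ratios, weights):
    """Anomaly score A: weighted average of the per-submodel ratios.
    In the talk the weights come from Brier scores on calibration
    data; the constants used below are illustrative."""
    return sum(w * r for w, r in zip(weights, ratios)) / sum(weights)

# Lee usually authenticates from the United States:
location_proba = {"United States": 0.9, "Italy": 0.1}
usual = anomaly_ratio(location_proba, "United States")  # 0.1 / 0.9, unsurprising
unusual = anomaly_ratio(location_proba, "Italy")        # 0.9 / 0.1, surprising
# Combine the location ratio with a second (hypothetical) submodel's ratio:
score = occam_score([unusual, 2.0], weights=[0.6, 0.4])
```

An auth whose attributes all match the submodels' top predictions keeps every ratio below 1 and scores low; a single wildly unlikely attribute can dominate the weighted average.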
  33. What does this buy us? Self-supervision for categorical data. Models latent structure: even if an attribute is rare, it might be expected given other properties of the auth. Provides a backbone for iteration: OCCAM provides a framework for evaluating and incorporating new features and base learners
  34. Detection Pipeline: OCCAM; detectors that focus on specific aspects of risk or trust (example: 2FA executed via SMS or by higher-risk user groups; office locations); expert rules and heuristics (example: administrator action enabling 2FA bypass); meta-models (example: data drift detector)
  35.-38. (no transcript text)
  39. Lessons Learned: successes and failures
  40. Measuring Success: we observed initial customers using the system and finding relevant events, with reports of true positives; our pipeline surfaces a variable number of daily detections for analysts to triage; can we use these feedback labels to measure metrics like precision and recall and evaluate models at scale?
  41. Measuring Success
  42. Interpreting feedback is hard. Prioritization: low volume of triageable detections, from multiple layers of models and rules. Presentation: the semantic gap will influence feedback if not enough explanation and context is provided. Human factors: UX, actionability, experience, time constraints, and out-of-band knowledge are an implicit part of feedback
  43. Practical Consequences: low volume of labels; determining ground truth and getting a handle on FN and TN is difficult and sensitive; disentangling model performance from human-in-the-loop considerations is hard; introspecting decisions is complex. Takeaway: build mechanisms for observability and introspection
  44. Iteration & Collaboration in ML Development: collaboration with Product & Design to help define qualitative and quantitative metrics; regular internal “dog-fooding”; before A/B testing, recruit customers for interviews to test hypotheses and watch out for semantic gaps
  45. Watching out for silent failure: a lack of labels exacerbates any robustness and data quality issues (data and concept drift; authentication semantics; lack of data); build meta-models that can monitor and correct for data issues; when all else fails, reduce scope
  46. Conclusions & Future Directions
  47. Tying it all together... a combination of probabilistic sub-models, detectors based on an understanding of universally risky/trusted behavior, and rule-based heuristics
  48. Tying it all together... a combination of probabilistic sub-models, detectors based on an understanding of universally risky/trusted behavior, and rule-based heuristics; feedback from customers is interpreted from both feedback data (labels from experts) and customer engagement
  49. Tying it all together... in reality: feedback is optional, sparse, and highly subjective; straightforward performance evaluation is nearly impossible; progress can be made by solving proxy problems and by careful monitoring and interpretation of what feedback we do have
  50. Future Work (ask us more in Discord!). User Clustering: cluster users based on authentication data; understand patterns and outliers. Geolocation: model location as a density function for each customer; improve precision by reducing false positives in likely locations, and identify unknown anomalies in unusual locations within a typical country. Weak Supervision: combine disparate sources of supervision in a more principled manner (anomaly scores, focused detectors, threat feeds, expert labels, ...)
  51. Where We’re Going We Don’t Need Labels
  52. If You Don’t Have Labels... Be Careful Where You’re Going. Thanks! See you on Discord