Detecting Fraudulent Skype Users via Machine Learning

Presentation by Kevin Markham
March 17, 2014
Based on the research paper: “Early Security Classification of Skype Users via Machine Learning”
http://research.microsoft.com/pubs/205472/aisec10-leontjeva.pdf
Paper and figures are copyright 2013 ACM

What is Skype?
• Tool for:
  – Voice-over-IP calls
  – Webcam videos
  – Instant messaging
• Released in 2003; acquired by Microsoft in 2011
• At least 250 million monthly users

Fraud on Skype
• Credit card fraud
• Online payment fraud
• Spam instant messages
• etc.

Detecting Fraud on Skype
Skype already employs techniques for detecting fraud:
• “Majority of fraudulent users are detected within one day”
Some challenges in fraud detection:
• Legitimate accounts get hijacked and don’t necessarily “look” fraudulent
• Sparse data

Improving Fraud Detection
Why is it worth improving?
• Manual fraud detection is very expensive
Who wrote this paper?
• Team from Microsoft Research
What was their goal?
• “Detect stealthy fraudulent users” that fool Skype’s existing defenses for a long period of time

Classification
• Classification problem: predicting whether a user is fraudulent (yes or no)
• Data consists of features (or “variables” or “predictors”) and a response
• Contrasts with a regression problem: predicting a continuous response such as a stock price

Data Used in the Study
• Anonymized snapshot provided by Skype
• “Does not contain information about individual calls and their contents”

Classification Workflow

Feature Type 1: Profile Information
• Gender
• Age
• Country
• OS platform
• etc.

Feature Type 2: Skype Product Usage
• Activity logs:
  – Connected days
  – Audio call days
  – Video call days
  – Chat days
• Data is not “rich”:
  – Only indicates the number of days per month that the user performed that activity
  – Does not distinguish which pair of users communicated, number of calls per day, etc.

Feature Type 3: Local Social Activity
• Activity logs (graph data):
  – Adding a user
  – Being added by a user
  – Deleting a user
  – Being deleted by a user
• Number of connections in their list
• Acceptance rate of outbound friend requests

Feature Type 4: Global Social Activity
• “PageRank” and “local clustering coefficient” computed for each user

Classification Workflow
• Pre-processing is unnecessary for profile info, but necessary for other feature types

Pre-processing Activity Logs
• Why?
  – Activity logs are time series data
  – It doesn’t make sense to use every data point as a feature
  – It makes more sense to “compress” the data into a single number
• How?
  – For a given feature (e.g., audio calls), build a model of what “normal” user activity looks like and another model of what fraudulent activity looks like
  – Score each user based upon which model their activity is closer to
  – This is called computing “log-likelihood ratios”
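The scoring step can be illustrated with a minimal sketch. It assumes each monthly day count is drawn independently from a smoothed frequency table, which is a simplification chosen here for illustration and not necessarily the model used in the paper. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter

MAX_DAYS = 31  # activity is logged as days per month, so values range from 0 to 31


def fit_day_count_model(sequences, alpha=1.0):
    """Estimate P(days active in a month) with add-alpha smoothing (an assumption)."""
    counts = Counter(days for seq in sequences for days in seq)
    total = sum(counts.values()) + alpha * (MAX_DAYS + 1)
    return [(counts[d] + alpha) / total for d in range(MAX_DAYS + 1)]


def log_likelihood_ratio(seq, fraud_model, normal_model):
    """Sum over months of log P_fraud(x) - log P_normal(x).

    Positive means the sequence looks more like the fraud model,
    negative means it looks more like the normal model.
    """
    return sum(math.log(fraud_model[d]) - math.log(normal_model[d]) for d in seq)


# Made-up toy data: audio call days per month for three users of each kind.
normal_users = [[12, 15, 14], [20, 18, 22], [9, 11, 10]]
fraud_users = [[1, 0, 2], [0, 0, 1], [2, 1, 0]]

normal_model = fit_day_count_model(normal_users)
fraud_model = fit_day_count_model(fraud_users)

# One number per user and per activity type, usable as a classifier feature.
score_low_activity = log_likelihood_ratio([0, 1, 1], fraud_model, normal_model)
score_high_activity = log_likelihood_ratio([14, 16, 12], fraud_model, normal_model)
```

In this toy setup the low-activity user gets a positive score and the high-activity user a negative one, so a whole time series is compressed into a single signed number.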

Computing Global Social Scores
• PageRank:
  – Invented by Google’s founders
  – Gives users a high score if they have many connections and if they have connections from other high-scoring users
• Local clustering coefficient:
  – Measure of how well your connections are connected to one another
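Both scores can be sketched in a few lines of plain Python. This version assumes an undirected contact graph stored as a dictionary of neighbor sets, a damping factor of 0.85, and a fixed number of power iterations; those choices, the function names, and the four-user graph are illustrative assumptions, not details taken from the paper.

```python
def pagerank(adj, damping=0.85, iterations=50):
    """PageRank by power iteration on a graph given as {node: set of neighbors}."""
    nodes = list(adj)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by users with no connections is spread evenly over everyone.
        dangling = sum(rank[u] for u in nodes if not adj[u])
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            for v in adj[u]:
                # Each user passes an equal share of their rank to every connection.
                new_rank[v] += damping * rank[u] / len(adj[u])
        rank = new_rank
    return rank


def local_clustering(adj, u):
    """Fraction of pairs of u's neighbors that are connected to each other."""
    neighbors = list(adj[u])
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbors[j] in adj[neighbors[i]]
    )
    return 2.0 * links / (k * (k - 1))


# Hypothetical contact graph: a, b, c form a triangle; d is connected only to a.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

ranks = pagerank(graph)
clustering = {u: local_clustering(graph, u) for u in graph}
```

Here user a, with the most connections, ends up with the highest PageRank, while b and c have a clustering coefficient of 1.0 because their only two contacts know each other.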

Classification Workflow

Choosing a Classifier
• Trained several classifiers: Random Forests, support vector machines, logistic regression
• Estimated prediction accuracy using cross-validation
• Chose Random Forests because it had the best initial performance
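The selection procedure (score every candidate by cross-validation, keep the best) can be sketched without any library. The two candidates below are deliberately trivial stand-ins, not the Random Forests, support vector machines, or logistic regression from the study, and the single feature and its values are made up.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(train_fn, X, y, k=5):
    """Mean accuracy on held-out folds for the classifier built by train_fn."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        predict = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


def train_majority(X, y):
    """Stand-in classifier 1: always predict the majority class of the training data."""
    majority = 1 if 2 * sum(y) > len(y) else 0
    return lambda x: majority


def train_threshold(X, y):
    """Stand-in classifier 2: cut halfway between the two class means."""
    mean_fraud = sum(x for x, label in zip(X, y) if label == 1) / max(1, sum(y))
    mean_normal = sum(x for x, label in zip(X, y) if label == 0) / max(1, len(y) - sum(y))
    cut = (mean_fraud + mean_normal) / 2
    # Predict fraud (1) on whichever side of the cut the fraud mean lies.
    return lambda x: 1 if (x < cut) == (mean_fraud < mean_normal) else 0


# Made-up data: one feature (connected days per month); label 1 = fraudulent.
X = [2, 20, 1, 18, 3, 22, 0, 19, 4, 25]
y = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

candidates = {"majority": train_majority, "threshold": train_threshold}
cv_scores = {name: cross_val_accuracy(fn, X, y, k=5) for name, fn in candidates.items()}
best = max(cv_scores, key=cv_scores.get)
```

Because every example is held out exactly once, each candidate is judged only on data it did not see during training, which is what makes the comparison between classifiers fair.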

Rating Model Accuracy
• ROC curve: Plots “true positive” rate vs “false positive” rate
• Ideal classifier hugs top left corner
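As a small illustration of how such a curve is built, the sketch below sweeps a threshold over classifier scores and records one (false positive rate, true positive rate) point per threshold. The scores and labels are made up, and the area-under-curve helper is an addition for illustration that the slides do not mention.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, label in zip(scores, labels) if s >= threshold and label == 1)
        fp = sum(1 for s, label in zip(scores, labels) if s >= threshold and label == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Made-up fraud scores (higher = more suspicious) and true labels (1 = fraudulent).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
auc = area_under_curve(curve)
```

A curve that hugs the top left corner reaches a high true positive rate while the false positive rate is still near zero, which pushes the area toward 1.0; random guessing gives about 0.5.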

Rating Model Accuracy (cont’d)
• Best result is obtained by using all four feature types
• At a false positive rate of 5%, true positive rate was 68%
• Acceptable false positive rate is a business decision
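Once the business has fixed an acceptable false positive rate, the matching operating point can be read off the scores. The sketch below picks the lowest threshold whose false positive rate stays within a given budget; the function name and the data are hypothetical and do not reproduce the 68% figure from the paper.

```python
def operating_point(scores, labels, max_fpr):
    """Return (threshold, fpr, tpr) for the lowest threshold with fpr <= max_fpr."""
    positives = sum(labels)
    negatives = len(labels) - positives
    best = None
    for threshold in sorted(set(scores), reverse=True):
        flagged = [label for s, label in zip(scores, labels) if s >= threshold]
        fpr = (len(flagged) - sum(flagged)) / negatives
        tpr = sum(flagged) / positives
        if fpr > max_fpr:
            break  # lowering the threshold further can only add false positives
        best = (threshold, fpr, tpr)
    return best


# Made-up scores: 5 fraudulent users (label 1) and 20 legitimate users (label 0).
fraud_scores = [0.95, 0.9, 0.8, 0.5, 0.3]
legit_scores = [0.85, 0.6] + [0.2 - 0.01 * i for i in range(18)]

scores = fraud_scores + legit_scores
labels = [1] * len(fraud_scores) + [0] * len(legit_scores)

threshold, fpr, tpr = operating_point(scores, labels, max_fpr=0.05)
```

With a 5% budget and 20 legitimate users, at most one legitimate user may be flagged, so the threshold stops at 0.8 and three of the five fraudulent users are caught.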

Projected Model Effects

Performance on Different Fraud Types
• Fraud types are defined by Skype but are not public
• Type II is most common, and the classifier works best on that type

Possible Model Improvements
• Optimize separate models for each fraud type
• Attempt to detect points in time when accounts are hijacked
• Prevent fraudsters from evading the model

Other Possible Applications
• Predicting credit card fraud
• Predicting failure in data center disks
• Any environment in which user behavior can be monitored and fraudulent behavior “looks different” from normal behavior

Thank You!
Research Paper: http://research.microsoft.com/pubs/205472/aisec10-leontjeva.pdf
My blog post: http://www.dataschool.io/detecting-fraudulent-skype-users-via-machine-learning/
