
Detecting scams using AI, for real

A one-year journey developing Penelope, a Machine Learning product that keeps HousingAnywhere scam-free: challenges, intuitions and discoveries. HousingAnywhere is an accommodation marketplace with listings in 400+ cities around the world, where advertisers publish hundreds of listings every day. How we moved from a rule-based engine to a semi-automated process powered by Machine Learning.


Massimo Belloni

January 27, 2020

Transcript

  1. Detecting Scams Using AI [for real]: how we built one of the best fraud prevention systems on the market. 27th January 2020, High Tech Campus
  2. Massimo Belloni, Data Scientist / Machine Learning Engineer at HousingAnywhere, massimo@housinganywhere.com, @massibelloni ▪ Started as an intern ▪ Penelope - Scam Detection ▪ RentRadar - Price suggestion ▪ MSc Computer Science and Engineering, Politecnico di Milano ▪ BSc Computer Science and Engineering ▪ Focus on Machine Learning ▪ Generic interest in Algebra and Philosophy of Mind
  3. ▪ Founded in 2009, located in Rotterdam ▪ Housing platform: two-sided marketplace, 100+ countries ▪ Deloitte NL Fast50 Ranking (2018, 2019) ▪ ~110 employees and counting ▪ 30k+ listings created per year (+100% YoY)
  4. What does a scam look like?

  5. None
  6. None
  7. Trust as a core value: what are users paying for? ▪ 0 scammed tenants ▪ 30k+ listings created per year ▪ 2000+ detected scams
  8. Agenda. Data Science: ▪ Dataset collection ▪ Feature engineering ▪ Algorithm selection ▪ Ensemble architecture. ML Engineering: ▪ Model deployment ▪ Retraining procedure ▪ Monitoring. Operations: ▪ Model usage ▪ Notification management ▪ Platform integration
  9. Dataset, models & metrics Data Science

  10. Dataset collection ▪ 100k records, collected since Nov 2016 ▪ Imbalanced: 10% scam ▪ Different sources: listing and user info retrieved from the data warehouse
  11. Feature Engineering. The knowledge of the people who deal with the task every day is crucial. Social features: ▪ Users using numbers in the email address are 106% more likely to be scammers ▪ Users using more than 3 dots in the email address are 200% more likely to be scammers. Geographical features: ▪ Users logging in from Nigeria account for 0.74% of all listings but 5% of all scams ▪ Listings located in countries the user isn't logging in from are 137% more likely to be scams. Listing quality features: ▪ Listings located in Amsterdam are 18% of all scams (1% of all listings); London 4% (0.56%) ▪ 70% of all scams are priced below the City Reference Price.
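Signals like the email-based ones above are cheap string features. A minimal sketch of how they could be computed; the feature names and thresholds here are illustrative assumptions, not HousingAnywhere's production code:

```python
import re

def email_features(email: str) -> dict:
    """Toy extractor for email-based scam signals (illustrative only)."""
    local_part = email.split("@", 1)[0]
    return {
        # "users using numbers in the email"
        "has_digits": bool(re.search(r"\d", local_part)),
        # "more than 3 dots in the email address"
        "many_dots": email.count(".") > 3,
    }

print(email_features("john.doe1984@example.com"))
```

Boolean features like these feed straight into tree-based models without any scaling or encoding.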
  12. Gradient Boosted Decision Trees 101

  13. Decision Tree: a weak learner. Usually low performance (prone to overfitting), but easy to explain and understand.
  14. Tree Ensembles From Random Forests to XGBoost

  15. Improving XGBoost: LightGBM. Gradient-based One-Side Sampling (GOSS) reduces the number of samples: downsampling over the instance space based on the gradient, retaining all instances with a large gradient and sampling only among instances with a small gradient. Exclusive Feature Bundling (EFB) reduces the number of features: due to encoding, a lot of features are never non-zero together; these features can be bundled together, reducing the size of the feature space. 15-20x faster than XGBoost, with comparable performance and better built-in tricks.
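The GOSS idea fits in a few lines. A simplified, stdlib-only sketch of the sampling step (not LightGBM's actual implementation, which applies this inside the tree-building loop); the fractions `a` and `b` are illustrative:

```python
import random

def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Sketch of Gradient-based One-Side Sampling (GOSS).

    Keeps the a-fraction of instances with the largest absolute
    gradient, randomly samples a b-fraction of the rest, and
    upweights the sampled small-gradient instances by (1 - a) / b
    so the information gain estimate stays approximately unbiased.
    Returns (kept indices, instance weights).
    """
    rng = random.Random(seed)
    order = sorted(range(len(gradients)), key=lambda i: -abs(gradients[i]))
    top_k = int(a * len(gradients))
    rest = order[top_k:]
    sampled = rng.sample(rest, int(b * len(gradients)))
    indices = order[:top_k] + sampled
    weights = [1.0] * top_k + [(1 - a) / b] * len(sampled)
    return indices, weights

idx, w = goss_sample(
    [0.9, -0.8, 0.05, 0.01, -0.02, 0.03, 0.04, -0.06, 0.07, 0.1],
    a=0.2, b=0.2,
)
```

In LightGBM itself this corresponds to `boosting_type="goss"` with `top_rate` and `other_rate` playing the roles of `a` and `b`.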
  16. A 6-model ensemble trained on different subsets of the dataset: the training set is sampled horizontally and vertically for each model (LightGBM + LogReg).
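Horizontal sampling means each ensemble member sees a subset of the rows; vertical sampling means a subset of the features. A minimal sketch of generating the per-model subsets; the fractions, feature names and model count here are assumptions for illustration, not Penelope's actual configuration:

```python
import random

def make_subsets(n_rows, feature_names, n_models=6,
                 row_frac=0.8, col_frac=0.7, seed=42):
    """For each ensemble member, sample the training set
    horizontally (rows) and vertically (features)."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_models):
        rows = rng.sample(range(n_rows), int(row_frac * n_rows))
        cols = rng.sample(feature_names, int(col_frac * len(feature_names)))
        subsets.append((sorted(rows), cols))
    return subsets

subsets = make_subsets(
    1000,
    ["price", "country", "dots_in_email", "login_country"],
    n_models=6,
)
```

Each (rows, cols) pair would then be used to fit one LightGBM or LogReg member, and the members' probabilities combined at prediction time.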
  17. Dataset shift: concepts to be learned can change over time. ▪ Cross-validation isn't a reliable approach for testing performance ▪ Temporal holdout: predict over the latest samples only ▪ The accuracy of a classifier predicting a sample's period is a good indicator of shift ▪ Vertical selection on the dataset while training
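The temporal holdout can be sketched in a few lines (the record layout and holdout fraction are assumptions for illustration). The same split also supports the shift check from the slide: train a classifier to predict whether a sample is "old" or "recent"; accuracy well above chance indicates drift.

```python
from datetime import date

def temporal_holdout(records, holdout_frac=0.2):
    """Temporal holdout: instead of random cross-validation folds,
    train on the oldest samples and evaluate on the most recent
    ones, mimicking how the model is used in production.
    `records` is a list of (created_at, features, label) tuples.
    """
    ordered = sorted(records, key=lambda r: r[0])
    split = int(len(ordered) * (1 - holdout_frac))
    return ordered[:split], ordered[split:]

records = [(date(2019, m, 1), {"f": m}, m % 2) for m in range(1, 11)]
train, holdout = temporal_holdout(records)
```

Every record in the training split strictly predates every record in the holdout, so no information from the evaluation period leaks into training.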
  18. Iterating on the model: Being Agile. ▪ Greedy experimentation: new features, different imputation and encoding strategies ▪ Tracking: datasets, parameters, weights ▪ Deliver fast: prevent explosion of complexity and brute force over the space of experiments; deploy an MVP quickly to bring value fast ▪ Overcome SCRUM: every experiment is planned after another; evolving epics, clear scope. "Agile software development comprises various approaches to software development under which requirements and solutions evolve through the collaborative effort of self-organizing and cross-functional teams and their customer(s)/end user(s). It advocates adaptive planning, evolutionary development, early delivery, and continual improvement, and it encourages rapid and flexible response to change" (Wikipedia)
  19. Deployment, retraining & monitoring ML Engineering

  20. Incoming requests are managed by a Flask API. A Redis server manages the queue of jobs and the communication between the two containers. Logic and models live in the same Docker image. Retraining the model requires a PR on GitHub: Docker images are pushed to GCR, and Flux picks the latest and updates the Helm chart.
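The Flask-plus-Redis setup is a classic producer/consumer pattern: the API enqueues prediction jobs, a worker container pops and executes them. A stdlib-only sketch of that shape (the real system uses Redis and Celery across containers; names and the stubbed prediction are assumptions):

```python
import queue
import threading

jobs = queue.Queue()   # stand-in for the Redis-backed job queue
results = {}

def worker():
    """Stand-in for the worker container: pops jobs, runs the
    (stubbed) model, stores the verdict."""
    while True:
        listing_id, features = jobs.get()
        if listing_id is None:           # poison pill to stop the worker
            break
        # in the real service: model.predict_proba(features)
        results[listing_id] = {"scam_probability": 0.02}
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()
# what the Flask request handler would do on an incoming listing
jobs.put(("listing-123", {"dots_in_email": 1}))
jobs.put((None, None))
t.join()
```

Decoupling the API from the model worker means slow predictions never block request handling, and the two can scale independently.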
  21. Retraining the model. Problem: performance starts decreasing over time and models need to be retrained; the more often you do it, the quicker and more robust the process has to be. Risk: live data isn't always the same as training data. How can one prevent shipping a worse model? Challenge: why release a new (heavy) image just to release a new model?
  22. Move the model to Google Cloud Storage and retrieve it when the pod starts. The codebase only contains the logic to accept requests and schedule jobs. The model is retrained asynchronously and pushed to GCS only if it passes performance checks in a simulated environment as similar as possible to production.
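The "performance checks" gate is the part worth sketching: a retrained candidate replaces the live model only if its simulated-production metrics don't regress. The metric names and tolerance below are assumptions for illustration, not the actual Penelope checks:

```python
def should_promote(candidate_metrics, production_metrics, min_gain=-0.005):
    """Promotion gate for a retrained model: push to storage only
    if no tracked metric regresses beyond a small tolerance."""
    for name in ("precision", "recall"):
        if candidate_metrics[name] - production_metrics[name] < min_gain:
            return False
    return True

ok = should_promote(
    {"precision": 0.97, "recall": 0.91},   # candidate: recall dropped
    {"precision": 0.96, "recall": 0.92},   # current production model
)
```

Only if the gate passes would the pipeline upload the artifact to GCS, where the next pod start picks it up, so a bad retrain can never silently reach production.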
  23. Monitoring (engineering best practices): ▪ How many requests is your endpoint receiving? (How many jobs are executed?) ▪ How much time is needed to execute a job? ▪ How many times does the model send out alerts?

    PENELOPE_LATENCY = Histogram(
        'penelope_latency_seconds',
        'Time needed to execute prediction jobs',
        registry=registry,
    )

    elapsed = job_ended - job_started
    PENELOPE_LATENCY.observe(elapsed)
    push_to_gateway(
        PROMETHEUS_PUSHGATEWAY_ADDRESS,
        job='penelope_worker-%s-%s' % (instance_id, pid),
        registry=registry,
    )
  24. Changes in the code are required to define and collect metrics. Celery jobs are ephemeral and need a Pushgateway to store statistics. A shared Prometheus instance retrieves the metrics, and a shared Grafana instance shows them in a structured way, providing an alerting infrastructure.
  25. Use the model efficiently and have an impact Business Process

  26. Set up a measurable process: define a business metric that can prove the model's impact, beyond offline precision and recall. Take risks, trust the model: limit human intervention as much as possible by automating whatever you can; leverage the output probabilities.
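"Leverage the output probabilities" typically means turning the model score into more than a single yes/no cutoff. A minimal sketch of a three-way routing policy; the thresholds and action names are illustrative assumptions, not the actual Penelope policy:

```python
def route(scam_probability, block_at=0.9, review_at=0.5):
    """Map a model score to an action instead of a binary verdict."""
    if scam_probability >= block_at:
        return "auto_suspend"    # trust the model, no human involved
    if scam_probability >= review_at:
        return "human_review"    # grey zone goes to the support team
    return "publish"             # low risk: fully automated approval

print(route(0.97), route(0.6), route(0.05))
```

Only the grey zone consumes human time, which is how automation can cut hours spent on the task while keeping scammed tenants at zero.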
  27. Fraud Prevention metrics, December 2019 vs 2018, 0 scammed tenants: ▪ -93% time needed to detect a scam: Mes2Det (median) from 15.2 to 1.03 hours ▪ -94% active conversations with a scammer: 73 (2018) vs 4 (2019) ▪ -55% time spent on the task by Customer Solutions: 40 (2018) vs 18 (2019) hours per month
  28. TL;DR ▪ Define a process to optimise first: the best model won't have any impact on a poor, untracked process ▪ Deliver a PoC quickly, don't overthink best practices ▪ Decouple model and prediction logic: higher scalability for team and project
  29. Questions?