Detecting scams using AI, for real

A one-year journey to develop Penelope, a Machine Learning product that keeps HousingAnywhere scammer-free: challenges, intuitions and discoveries. HousingAnywhere is an accommodation marketplace with listings in 400+ cities around the world, where advertisers publish hundreds of listings every day. The talk covers how we moved from a rule-based engine to a semi-automated process powered by Machine Learning.


Massimo Belloni

January 27, 2020


  1. Detecting Scams Using AI [for real]: how we built one of the best fraud prevention systems on the market. 27th January 2020, High Tech Campus
  2. Massimo Belloni (@massibelloni), Data Scientist / Machine Learning Engineer, massimo@housinganywhere.com. HousingAnywhere ▪ Started as an intern ▪ Penelope: scam detection ▪ RentRadar: price suggestion. Education ▪ MSc Computer Science and Engineering, Politecnico di Milano ▪ BSc Computer Science and Engineering ▪ Focus on Machine Learning ▪ Generic interest in Algebra and Philosophy of Mind
  3. ▪ Founded in 2009, located in Rotterdam ▪ Housing platform: two-sided marketplace, 100+ countries ▪ Deloitte NL Fast50 Ranking (2018, 2019) ▪ ~110 employees and counting ▪ 30k+ listings created per year (+100% YoY)
  4. What does a scam look like?

  5. (no text on this slide)
  6. (no text on this slide)
  7. Trust as a core value. What are users paying for? ▪ 0 scammed tenants ▪ 30k+ listings created per year ▪ 2000+ detected scams
  8. Agenda. Data Science ▪ Dataset collection ▪ Feature engineering ▪ Algorithm selection ▪ Ensemble architecture. ML Engineering ▪ Model deployment ▪ Retraining procedure ▪ Monitoring. Operations ▪ Model usage ▪ Notification management ▪ Platform integration
  9. Dataset, models & metrics Data Science

  10. Dataset collection ▪ 100k records, collected since Nov 2016 ▪ Imbalanced: 10% scam ▪ Different sources: listing and user info retrieved from the data warehouse
  11. Feature Engineering. The knowledge of the people who deal with the task every day is crucial. Social features ▪ Users with numbers in their email address are 106% more likely to be scammers. ▪ Users with more than 3 dots in their email address are 200% more likely to be scammers. Geographical features ▪ Users logging in from Nigeria account for 0.74% of all listings but 5% of all scams. ▪ Listings located in a country the user isn't logging in from are 137% more likely to be scams. Listing quality features ▪ Listings located in Amsterdam account for 18% of all scams (1% of all listings); London for 4% (0.56%). ▪ 70% of all scams are priced below the City Reference Price.
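The signals listed on this slide could be turned into model inputs roughly as follows. This is a minimal sketch, not Penelope's actual code: the `Listing` record, its field names, and `extract_features` are hypothetical and only illustrate the kind of hand-crafted features the slide describes.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    """Hypothetical record; field names are illustrative, not the real schema."""
    email: str
    login_country: str
    listing_country: str
    price: float
    city_reference_price: float


def extract_features(listing: Listing) -> dict:
    """Compute the boolean signals named on the slide for one listing."""
    return {
        # Social features: digits and many dots in the email address.
        "email_has_digits": any(ch.isdigit() for ch in listing.email),
        "email_many_dots": listing.email.count(".") > 3,
        # Geographical feature: listing country differs from login country.
        "country_mismatch": listing.login_country != listing.listing_country,
        # Listing quality feature: price below the city reference price.
        "below_reference_price": listing.price < listing.city_reference_price,
    }
```

Each flag is cheap to compute at listing-creation time, which matters if predictions are made shortly after a listing is published.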
  12. Gradient Boosted Decision Trees 101

  13. Decision Tree ▪ Weak learner ▪ Usually low performance (overfitting) ▪ Easy to explain and understand
  14. Tree Ensembles From Random Forests to XGBoost

  15. Improving XGBoost: LightGBM ▪ 15-20x faster than XGBoost ▪ Comparable performance ▪ Better built-in tricks. Reduce the number of samples: Gradient-based One-Side Sampling (GOSS). Downsampling over the instance space based on the gradient: retain all the instances with a large gradient and just sample over the instances with a small gradient. Reduce the number of features: Exclusive Feature Bundling (EFB). Because of the encoding, many features are never non-zero together. These features can be bundled together, reducing the size of the feature space.
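To make the GOSS idea concrete, here is a simplified, standalone sketch of one sampling step. It is not LightGBM's implementation: `goss_sample` and its parameter names are made up for illustration. The re-weighting factor `(1 - top_rate) / other_rate` follows the scheme described in the LightGBM paper, where the sampled small-gradient instances are amplified to compensate for the ones that were dropped.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) chosen by a simplified GOSS step.

    gradients: per-instance gradient values from the current boosting round.
    top_rate: fraction of instances kept because their gradient is large.
    other_rate: fraction of instances sampled from the remaining ones.
    """
    if not (0 < other_rate and 0 <= top_rate and top_rate + other_rate <= 1):
        raise ValueError("rates must satisfy 0 < other_rate and top_rate + other_rate <= 1")
    n = len(gradients)
    # Sort instance indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    # Keep every large-gradient instance, sample uniformly among the rest.
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))
    # Amplify the sampled small-gradient instances to offset the dropped ones.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return sorted(weights), weights
```

With the defaults above, only 30% of the instances take part in the next tree's split finding, which is where most of the speed-up comes from.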
  16. An ensemble of 6 models trained on different subsets of the dataset: the training set is sampled horizontally (rows) and vertically (features) for each model. Models: LightGBM, LogReg
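The per-model sampling could look like the sketch below. The function names, the sampling fractions, and the seed are assumptions for illustration. The slide does not say how the individual outputs are combined, so the plain average in `average_probability` is only one possible choice, not necessarily the one used in Penelope.

```python
import random


def sample_views(n_rows, feature_names, n_models=6,
                 row_fraction=0.8, feature_fraction=0.7, seed=42):
    """Draw one (row indices, feature names) view of the data per model."""
    rng = random.Random(seed)
    n_r = max(1, int(row_fraction * n_rows))
    n_f = max(1, int(feature_fraction * len(feature_names)))
    views = []
    for _ in range(n_models):
        rows = sorted(rng.sample(range(n_rows), n_r))    # horizontal sampling
        cols = sorted(rng.sample(feature_names, n_f))    # vertical sampling
        views.append((rows, cols))
    return views


def average_probability(per_model_probabilities):
    """Combine per-model scam probabilities by simple averaging (an assumption)."""
    return sum(per_model_probabilities) / len(per_model_probabilities)
```

Each model would then be fit only on its own rows and columns, so the ensemble members make partly independent errors.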
  17. Dataset shift. Concepts to be learned can change with time. ▪ Cross-validation isn't a reliable approach for testing performance ▪ Temporal holdout to predict over the latest samples only ▪ The accuracy of a classifier predicting each sample's period is a good indicator of shift ▪ Vertical selection on the dataset while training
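A temporal holdout, as opposed to a random cross-validation split, can be sketched as below. The record layout and the `created_at` key are hypothetical; the point is only that everything before the cutoff is used for training and everything from the cutoff onwards for evaluation, so the model is always tested on data newer than what it has seen.

```python
from datetime import date


def temporal_holdout(records, cutoff, date_key="created_at"):
    """Split records into (train, test) around a cutoff date.

    Records dated before `cutoff` go to train; the rest go to test.
    """
    train = [r for r in records if r[date_key] < cutoff]
    test = [r for r in records if r[date_key] >= cutoff]
    return train, test
```

The same split also supports the shift check mentioned on the slide: label each record by the side of the cutoff it falls on and see how well a classifier can tell the two periods apart.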
  18. Iterating on the model ▪ Greedy experimentation: new features, different imputation and encoding strategies. ▪ Tracking: datasets, parameters, weights. ▪ Deliver fast: prevent explosion of complexity and brute force over the space of the experiments. Deploy an MVP quickly to bring value fast. ▪ Overcome SCRUM: every experiment is planned after another. Evolving Epics. Clear scope. Being Agile: "Agile software development comprises various approaches to software development under which requirements and solutions evolve through the collaborative effort of self-organizing and cross-functional teams and their customer(s)/end user(s). It advocates adaptive planning, evolutionary development, early delivery, and continual improvement, and it encourages rapid and flexible response to change" (Wikipedia)
  19. Deployment, retraining & monitoring ML Engineering

  20. Incoming requests are managed by a Flask API. A Redis server is used for managing the queue of jobs and for communication between the two containers. Logic and models are in the same Docker image. Retraining the model requires a PR on GitHub. Docker images are pushed to GCR. Flux picks the latest and updates the Helm chart.
  21. Retraining the model ▪ Problem: performance decreases over time and models need to be retrained. The more often you do it, the quicker and more robust the process has to be. ▪ Risk: live data isn't always the same as training data. How can one prevent shipping a worse model? ▪ Challenge: why release a new (heavy) image when releasing a new model?
  22. Move the model to Google Cloud Storage and retrieve it when the pod starts. The codebase only contains the logic to accept requests and schedule jobs. The model is retrained asynchronously and pushed to GCS if it passes some performance checks in a simulated environment as similar as possible to the production one.
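The "push only if it passes some performance checks" step can be pictured as a small gate in front of the upload. This is a sketch under assumptions: the metric names, the `tolerance` value, and the `upload` callback (standing in for whatever writes the model file to GCS) are all hypothetical, since the slide does not specify the actual checks.

```python
from typing import Callable, Dict


def should_promote(candidate: Dict[str, float],
                   production: Dict[str, float],
                   tolerance: float = 0.01) -> bool:
    """Accept the candidate only if no tracked metric drops by more than `tolerance`."""
    return all(candidate[name] >= production[name] - tolerance
               for name in production)


def release(candidate: Dict[str, float],
            production: Dict[str, float],
            upload: Callable[[], None]) -> bool:
    """Run the gate and call `upload` (e.g. a push to GCS) only on success."""
    if should_promote(candidate, production):
        upload()
        return True
    return False
```

Because the serving image only loads whatever model file is currently in the bucket, a rejected candidate never reaches production and no new image has to be built.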
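  23. Monitoring: engineering best practices ▪ How many requests is your endpoint receiving? (How many jobs are executed?) ▪ How much time is needed to execute a job? ▪ How many times does the model send out alerts?

        from prometheus_client import CollectorRegistry, Histogram, push_to_gateway

        # Dedicated registry, so only this worker's metrics are pushed.
        registry = CollectorRegistry()

        PENELOPE_LATENCY = Histogram(
            'penelope_latency_seconds',
            'Time needed to execute prediction jobs',
            registry=registry,
        )

        # job_started, job_ended, instance_id, pid and
        # PROMETHEUS_PUSHGATEWAY_ADDRESS are defined elsewhere in the worker.
        elapsed = job_ended - job_started
        PENELOPE_LATENCY.observe(elapsed)

        push_to_gateway(
            PROMETHEUS_PUSHGATEWAY_ADDRESS,
            job='penelope_worker-%s-%s' % (instance_id, pid),
            registry=registry,
        )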
  24. Changes in the code are required to define and collect metrics. Celery jobs are ephemeral and need a Pushgateway to store statistics. A shared Prometheus instance retrieves the metrics. A shared Grafana instance shows them in a structured way and provides an alerting infrastructure.
  25. Use the model efficiently and have an impact Business Process

  26. ▪ Set up a measurable process: define a business metric that can prove the model's impact, beyond offline precision and recall. ▪ Take risks, trust the model: try to limit human intervention as much as possible by automating whatever you can. Leverage the output probabilities.
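One way to leverage the output probabilities is to map them to different levels of automation. The sketch below is illustrative only: the thresholds and the action names are hypothetical and would have to be tuned against the business metric rather than taken as given.

```python
def route(probability, block_threshold=0.95, review_threshold=0.50):
    """Map a scam probability to an action (thresholds are illustrative)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if probability >= block_threshold:
        return "auto_block"      # confident enough to act without a human
    if probability >= review_threshold:
        return "manual_review"   # notify the team and let a person decide
    return "no_action"
```

Raising or lowering `block_threshold` trades human workload against the risk of acting automatically on a false positive.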
  27. Fraud Prevention metrics, December 2019 vs 2018, 0 scammed tenants ▪ -93% time needed to detect a scam: Mes2Det (median) from 15.2 to 1.03 hours ▪ -94% active conversations with a scammer: 73 (2018) vs 4 (2019) ▪ -55% time spent on the task by Customer Solutions: 40 (2018) vs 18 (2019) hours per month
  28. TL;DR ▪ Define a process to optimise first. The best model won't have any impact on a poor, untracked process. ▪ Deliver a PoC quickly, don't overthink best practices. ▪ Decouple model and prediction logic. Higher scalability for team and project.
  29. Questions?