Slide 1

Slide 1 text

Soledad Galli, PhD
Data Scientist
London PyData, 6th May 2017

Slide 2

Slide 2 text

Who are Zopa?

Slide 3

Slide 3 text

Zopa today
• World’s first peer-to-peer lending platform, since 2004
• Lent £2.33 billion to date, and our growth is accelerating
• 246,000 people have taken a Zopa loan
• 59,000 actively invest through Zopa

Slide 4

Slide 4 text

Why I joined Zopa
• State-of-the-art machine learning and data analytics
• Agile, proprietary technology
• End-to-end team collaboration, from R&D to deployment

Slide 5

Slide 5 text

Peer-to-peer lending at Zopa
Diagram labels: Investor, Borrower; £££ to lend out, £££ loan, £££ repayments, £££ interest + capital

Slide 6

Slide 6 text

Machine Learning at Zopa
• Credit Risk
• Fraud
• ID verification
• Marketing
• Document tampering
• Pricing
• Customer segmentation

Slide 7

Slide 7 text

Credit Risk Assessment
Strictly Private & Confidential
Borrower and loan repayment. A loan that cannot be repaid leads to:
• An impact on the borrower’s financial situation
• Loss of investors’ capital and interest
• Damaged reputation as a responsible lender

Slide 8

Slide 8 text

Credit Risk Assessment Journey
• Gathering the Dataset
• Target Definition
• Feature / Variable Optimisation
• Feature / Variable Pre-Processing
• Feature / Variable selection
• Machine Learning Model building and optimisation
• Model stacking
• Credit Risk Model deployment

Slide 10

Slide 10 text

Dataset
Credit agencies: financial information
• Mortgages
• Credit cards
• Current accounts, etc.
• Balances and payments
• Predictive variables
Applicant’s snapshot: intended use of loan, job, salary, etc.
Dataset: 100s of thousands of borrowers, ~3000 characteristics (features / variables)

Slide 12

Slide 12 text

Target Definition
Credit products considered:
• Mortgages
• Credit cards
• Current accounts
• Others
Signs of financial difficulty:
• Default
• Missed payments
→ borrower with risk (0)
Signs of financial health
→ borrower without risk (1)
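
The labelling rule on this slide can be sketched in a few lines of Python. This is only an illustration: the `CreditHistory` fields and the zero-tolerance threshold are hypothetical, since the slide does not say how many missed payments count as a sign of difficulty.

```python
from dataclasses import dataclass


@dataclass
class CreditHistory:
    """Hypothetical summary of one borrower's credit products."""
    has_default: bool
    missed_payments: int


def define_target(history: CreditHistory) -> int:
    """Return 0 for a borrower with risk, 1 for a borrower without risk."""
    # Any default or missed payment counts as a sign of financial difficulty.
    # The zero-tolerance threshold is an assumption made for this sketch.
    if history.has_default or history.missed_payments > 0:
        return 0
    return 1


borrowers = [
    CreditHistory(has_default=False, missed_payments=0),
    CreditHistory(has_default=True, missed_payments=3),
    CreditHistory(has_default=False, missed_payments=1),
]
target = [define_target(b) for b in borrowers]
print(target)
```

Note that the encoding follows the slide: 1 is the healthy class, so a high predicted probability means a borrower who is expected to repay.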

Slide 13

Slide 13 text

Dataset + Target
• Rows: 100s of thousands of borrowers
• Columns: ~3000 characteristics (features / variables) from the credit agencies
• Target vector: 1 / 0 per borrower (Risk = 0, No Risk = 1), e.g. 0, 1, 1, 1, 0

Slide 15

Slide 15 text

Feature Engineering I – Categorical variables
1. Null values
• Replace with random sample of data
• Replace with most frequent category
• Add an additional category for null values
2. Rare values
• Replace with random sample of data
• Replace with most frequent category
• Add an additional category for rare values
3. Convert to numbers
• Replace by the number of items in that category
• Assign order when ordinal
• One hot encoding
• Assign value according to target mean per category
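
A minimal pandas sketch of some of these options, assuming a toy DataFrame. The column name `employment`, its values and the 20% rarity cut-off are invented for illustration and are not taken from the talk.

```python
import pandas as pd

# Toy data: the column name "employment" and all values are made up.
df = pd.DataFrame({
    "employment": ["employed", "employed", None, "self-employed",
                   "employed", "student", "self-employed", "employed"],
    "target": [1, 1, 0, 1, 0, 0, 1, 1],
})

# 1. Null values: add an additional category.
cat = df["employment"].fillna("Missing")

# 2. Rare values: group categories seen in fewer than 20% of rows.
freq = cat.value_counts(normalize=True)
rare = freq[freq < 0.20].index
cat = cat.where(~cat.isin(rare), "Rare")

# 3a. Replace each category by the number of rows in that category.
count_encoded = cat.map(cat.value_counts())

# 3b. Replace each category by the target mean of that category.
target_mean = df["target"].groupby(cat).mean()
mean_encoded = cat.map(target_mean)

# 3c. One hot encoding.
one_hot = pd.get_dummies(cat, prefix="employment")
```

In practice the category counts and target means would be learned on the training set only and then applied to new data, otherwise the target leaks into the features.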

Slide 16

Slide 16 text

Feature Engineering II – Numerical variables
1. Null values
• Replace with random sample of data
• Replace with median or mean
• Replace with number far out in the distribution
2. Outliers (only for Linear Models)
• Replace with random sample of data
• Replace with median or mean
• Replace with number at ends of distribution
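
The same options for a numerical variable, again as a hedged sketch. The `income` values, the `-999.0` sentinel and the 5th/95th percentile caps are arbitrary choices for the example, not values from the talk.

```python
import numpy as np
import pandas as pd

# Toy data: "income" and its values are made up for illustration.
income = pd.Series([20.0, 25.0, np.nan, 30.0, 35.0, 40.0, np.nan, 500.0])
is_null = income.isna()

# 1a. Null values: replace with a random sample of the observed values.
observed = income.dropna()
sampled = observed.sample(n=int(is_null.sum()), replace=True, random_state=0)
sampled.index = income[is_null].index   # align samples with the null rows
random_filled = income.fillna(sampled)

# 1b. Null values: replace with the median of the observed values.
median_filled = income.fillna(income.median())

# 1c. Null values: replace with a number far out in the distribution,
#     here a sentinel well below the observed minimum.
far_out_filled = income.fillna(-999.0)

# 2. Outliers: replace with a number at the ends of the distribution,
#    here by capping at the 5th and 95th percentiles.
lower, upper = median_filled.quantile([0.05, 0.95])
capped = median_filled.clip(lower=lower, upper=upper)
```

A far-out value lets tree models isolate the formerly missing rows in one split, which is presumably why the slide restricts outlier handling to linear models.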

Slide 17

Slide 17 text

Feature Engineering III – Variable Pre-Processing
1. Normalisation (Linear Models and Neural Networks)
2. Linearisation (Linear Models)
• Transformation (log, sqrt, etc.)
• Rebinning
• Replace by the probability output of a shallow tree
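
Normalisation, a log transformation and rebinning can be sketched with NumPy alone. The `balance` values and the bin edges are made up, and the shallow-tree option is left out of this sketch.

```python
import numpy as np

# Toy right-skewed variable; the values are made up.
balance = np.array([0.0, 10.0, 100.0, 1000.0, 10000.0])

# Normalisation: rescale to zero mean and unit variance.
standardised = (balance - balance.mean()) / balance.std()

# Linearisation by transformation: log(1 + x) compresses the long right tail.
log_balance = np.log1p(balance)

# Rebinning: replace each value by the index of the interval it falls in.
# The bin edges are arbitrary choices for this sketch.
edges = np.array([1.0, 50.0, 500.0, 5000.0])
binned = np.digitize(balance, edges)
```

Scaling statistics and bin edges would normally be fitted on the training data and reused unchanged at scoring time.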

Slide 19

Slide 19 text

Feature Selection
~3000 features → ~tens of features
Less is more
• Easier implementation
• Faster
• More reliable

Slide 20

Slide 20 text

Feature Selection
~3000 features → ~tens of features
First Stage (from a few 1000s to a few 100s)
• Single variable predictive value vs Target
• Feature importance in Random Forests
Second Stage (from a few 100s to a few 10s)
• Recursive feature elimination
• Logit
• Random Forests
• XGB
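
A scaled-down sketch of the two stages with scikit-learn, assuming synthetic data from `make_classification` in place of the credit data. Single-variable ROC-AUC stands in for "predictive value vs Target", and a logit model drives the recursive feature elimination. The feature counts (30 to 10 to 5) are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the credit data: 500 rows, 30 features.
X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           n_redundant=0, random_state=0)

# First stage: score every feature on its own against the target.
# An AUC of 0.5 means no predictive value, so rank by distance from 0.5.
single_auc = np.array([roc_auc_score(y, X[:, j]) for j in range(X.shape[1])])
strength = np.abs(single_auc - 0.5)
first_stage = np.argsort(strength)[::-1][:10]   # indices of the 10 strongest

# Second stage: recursive feature elimination with a logit model, which
# repeatedly refits and drops the feature with the smallest coefficient.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X[:, first_stage], y)
second_stage = first_stage[rfe.support_]
print(sorted(second_stage.tolist()))
```

The estimator passed to `RFE` could be swapped for a random forest or a gradient boosted model, since `RFE` accepts any estimator that exposes `coef_` or `feature_importances_`.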

Slide 22

Slide 22 text

Machine Learning Model Building
Input: ~tens of features from the credit agencies
• Linear Models: Logistic Regression, MARS
• Tree Models: Random Forests, Gradient Boosted Trees
• Neural Networks
Output: Loan Repayment

Slide 23

Slide 23 text

Machine Learning Model Building
Input: ~tens of features from the credit agencies
• Linear Models: Logistic Regression, MARS
• Tree Models: Random Forests, Gradient Boosted Trees
• Neural Networks
Output: Loan Repayment
The models learn to assign:
• High probability to customers that can repay their loans.
• Low probability to customers that may encounter difficulties in repaying their loans.

Slide 24

Slide 24 text

Machine Learning Model - Performance
• Linear Models: Logistic Regression, MARS
• Tree Models: Random Forests, Gradient Boosted Trees
• Neural Networks
ROC-AUC
For each probability value:
• How many times the model made a good assessment
• How many times the model made a wrong assessment
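
To make the metric concrete, here is a small pure-Python sketch. It counts good and wrong assessments at each probability cut-off, then computes ROC-AUC as the fraction of (no-risk, risk) borrower pairs that the model ranks in the right order. The labels and probabilities are made up.

```python
def correct_and_wrong(y_true, y_prob, threshold):
    """Count good and wrong assessments when predicting 1 for prob >= threshold."""
    correct = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return correct, len(y_true) - correct


def roc_auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    # A tie between a positive and a negative counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 1, 0]            # 1 = no risk, 0 = risk
y_prob = [0.9, 0.8, 0.7, 0.6, 0.2]  # made-up model outputs

for threshold in sorted(set(y_prob)):
    print(threshold, correct_and_wrong(y_true, y_prob, threshold))
print(round(roc_auc(y_true, y_prob), 3))
```

Because ROC-AUC depends only on the ordering of the scores, it summarises performance across all cut-offs rather than at a single one.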

Slide 25

Slide 25 text

Ensemble Models - Trees
Random Forest and Gradient Boosted Trees
Diagram labels: Credit Agencies, Average Probability, Loan Repayment

Slide 26

Slide 26 text

Improving Logit – Bagging of the Predictors
Diagram labels: Credit Agencies, Average Probability, Loan Repayment
• Improves ROC-AUC at the second decimal
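
One way to read this slide is sketched below with scikit-learn's `BaggingClassifier`: many logit models are fitted on bootstrap samples of the rows and random subsets of the features, and their probabilities are averaged. Whether "predictors" on the slide means the models or the input variables is not stated, so this setup, the synthetic data and every parameter value are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

single = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Each of the 25 logit models sees a bootstrap sample of the rows and a
# random half of the features; predict_proba averages their probabilities.
bagged = BaggingClassifier(LogisticRegression(max_iter=1000),
                           n_estimators=25, max_features=0.5,
                           random_state=0).fit(X_train, y_train)

auc_single = roc_auc_score(y_test, single.predict_proba(X_test)[:, 1])
auc_bagged = roc_auc_score(y_test, bagged.predict_proba(X_test)[:, 1])
print(f"single logit: {auc_single:.3f}, bagged logit: {auc_bagged:.3f}")
```

On synthetic data the two scores may be very close or even reversed. The gain at the second decimal reported on the slide refers to Zopa's own data.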

Slide 28

Slide 28 text

Model Stacking – Meta Ensembling
Logit, RF, GBT, NN, MARS → one Prob per model → Meta Model → Loan Repayment

Slide 29

Slide 29 text

Model Stacking – Meta Ensembling
Logit, RF, GBT, NN, MARS → one Prob per model → Meta Model
Meta Model options:
1. Average Probability
2. Meta Machine Learning Model
Result: No performance improvement
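
Both meta-model options can be sketched with scikit-learn. This example uses synthetic data and only three of the five base models (logit, random forest, gradient boosted trees). The neural network and MARS are omitted to keep it short, and the choice of a logit as the meta model is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the credit data.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

base_models = {
    "logit": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=30, random_state=0),
    "gbt": GradientBoostingClassifier(n_estimators=30, random_state=0),
}

# Out-of-fold probabilities: each row is scored by models that never saw it,
# so the meta model is not trained on leaked predictions.
probs = np.column_stack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    for model in base_models.values()
])

# Option 1: average probability.
average_prob = probs.mean(axis=1)

# Option 2: a meta machine learning model trained on the base probabilities.
meta_model = LogisticRegression().fit(probs, y)
meta_prob = meta_model.predict_proba(probs)[:, 1]
```

A fair comparison of the two options would score `average_prob` and the meta model on a held-out set that neither the base models nor the meta model has seen.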

Slide 30

Slide 30 text

Machine Learning Model Building
~tens of features → Gradient Boosted Trees + Neural Networks → Average Probability → Loan Repayment

Slide 32

Slide 32 text

Predictor
Python integrated framework for variable pre-processing and ML model building
Logos shown: PREDICTOR, Seaborn

Slide 33

Slide 33 text

Predictor
Python integrated framework for variable pre-processing and ML model building
• Fast: end-to-end model building without re-writing code
• Easy deploy: decreases overhead between model building and model deployment (few weeks)
• Flexible: can add layers of complexity on the go
• Proprietary: we are thinking of open sourcing Predictor
