Machine Learning in Finance - London PyData 2017

Soledad Galli, PhD, Data Scientist. London PyData, 6th May 2017

Who are Zopa?

Zopa today
• World’s first peer-to-peer lending platform, since 2004
• Lent £2.33 billion to date, and our growth is accelerating
• 246,000 people have taken a Zopa loan
• 59,000 actively invest through Zopa
1st

Why I joined Zopa
• State-of-the-art machine learning and data analytics
• Agile, proprietary technology
• End-to-end team collaboration from R&D to deployment

Peer-to-peer lending at Zopa
[Diagram: Investor: £££ to lend out. Borrower: £££ loan, £££ repayments. Investor: £££ interest + capital.]

Machine Learning at Zopa
• Credit Risk
• Fraud
• ID verification
• Marketing
• Document tampering
• Pricing
• Customer segmentation

Credit Risk Assessment
Strictly Private & Confidential
[Diagram: Borrower, Loan Repayment]
• Affects the borrower’s financial situation
• Loss of investors’ capital and interest
• Damaged reputation as a responsible lender

Credit Risk Assessment Journey
• Gathering the Dataset
• Target Definition
• Feature / Variable Optimisation
• Feature / Variable Pre-Processing
• Feature / Variable Selection
• Machine Learning Model building and optimisation
• Model stacking
• Credit Risk Model deployment

Dataset
Credit Agencies: Financial Information
• Mortgages
• Credit cards
• Current accounts, etc.
• Balances and Payments
• Predictive variables
Applicant’s snapshot: intended use of loan, job, salary, etc.
Dataset: 100s of thousands of borrowers, ~3000 characteristics (features / variables)

Target Definition
• Mortgages
• Credit cards
• Current accounts
• Others
• Signs of financial difficulty (default, missed payments): borrower with risk (0)
• Signs of financial health: borrower without risk (1)
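The target definition above can be sketched in a few lines of pandas. This is an illustration only: the column names `defaulted` and `missed_payments` and the toy values are assumptions, not the fields Zopa actually uses.

```python
import pandas as pd

# Hypothetical per-borrower credit history flags (illustrative names and values).
history = pd.DataFrame({
    "defaulted":       [0, 1, 0, 0],
    "missed_payments": [0, 0, 3, 0],
})

# Signs of financial difficulty: a default or at least one missed payment.
in_difficulty = (history["defaulted"] == 1) | (history["missed_payments"] > 0)

# Slide convention: 0 = borrower with risk, 1 = borrower without risk.
target = (~in_difficulty).astype(int)
print(target.tolist())
```

Only the first and last borrowers show no sign of difficulty, so they are the ones labelled 1.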

Dataset + Target
• Rows: 100s of thousands of borrowers
• Columns: ~3000 characteristics (features / variables)
• Target vector (1 / 0): Risk = 0, No Risk = 1, No Risk = 1, No Risk = 1, Risk = 0

Feature Engineering I – Categorical variables
1. Null values
   • Replace with random sample of data
   • Replace with most frequent category
   • Add an additional category for null values
2. Rare values
   • Replace with random sample of data
   • Replace with most frequent category
   • Add an additional category for rare values
3. Convert to numbers
   1. Replace by the number of items in that category
   2. Assign order when ordinal
   3. One hot encoding
   4. Assign value according to target mean per category
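A minimal pandas sketch of some of the categorical options listed above: an extra category for nulls, an extra category for rare values, and three of the numeric encodings. The `job` column, the 20% rarity threshold and the toy data are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "job":    ["nurse", "teacher", None, "nurse", "pilot", "teacher", "nurse", None],
    "target": [1, 1, 0, 1, 0, 1, 0, 1],
})

# 1. Null values: add an additional category.
job = df["job"].fillna("Missing")

# 2. Rare values: group categories seen in fewer than 20% of rows (assumed threshold).
freq = job.value_counts(normalize=True)
rare = freq[freq < 0.2].index
job = job.where(~job.isin(rare), "Rare")

# 3a. Replace each category by the number of items in that category.
count_enc = job.map(job.value_counts())

# 3b. One hot encoding.
one_hot = pd.get_dummies(job, prefix="job")

# 3c. Target mean per category (in practice, learn this on training data only).
target_mean_enc = job.map(df["target"].groupby(job).mean())

print(count_enc.tolist())
```

Target mean encoding uses the label, so the mapping should be learned on the training split and then applied to validation and test data to avoid leakage.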

Feature Engineering II – Numerical variables
1. Null values
   • Replace with random sample of data
   • Replace with median or mean
   • Replace with number far out in the distribution
2. Outliers (only for Linear Models)
   • Replace with random sample of data
   • Replace with median or mean
   • Replace with number at ends of distribution
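The numerical options above might look like this in pandas. The `salary` values, the "mean + 3 standard deviations" choice for a far-out value, and the 5th/95th percentile caps are all assumptions made for the sketch.

```python
import numpy as np
import pandas as pd

# Toy numerical variable with nulls and one extreme value (illustrative only).
salary = pd.Series([21000, 25000, np.nan, 30000, 28000, np.nan, 500000, 26000],
                   dtype=float)
mask = salary.isna()
observed = salary.dropna()

# 1a. Replace nulls with a random sample of the observed data.
rng = np.random.default_rng(0)
random_imputed = salary.copy()
random_imputed[mask] = rng.choice(observed.to_numpy(), size=int(mask.sum()))

# 1b. Replace nulls with the median.
median_imputed = salary.fillna(salary.median())

# 1c. Replace nulls with a number far out in the distribution (assumed: mean + 3 std).
far_out_imputed = salary.fillna(salary.mean() + 3 * salary.std())

# 2. Outliers: replace with numbers at the ends of the distribution
#    (assumed caps: 5th and 95th percentiles).
lower, upper = median_imputed.quantile([0.05, 0.95])
capped = median_imputed.clip(lower=lower, upper=upper)
```

As with any imputation, the median, caps and sampling pool would normally be computed on training data and reused unchanged at scoring time.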

Feature Engineering III – Variable Pre-Processing
1. Normalisation (Linear Models and Neural Networks)
2. Linearization (Linear Models)
   • Transformation (log, sqrt, etc.)
   • Rebinning
   • Replace by the probability output of a shallow tree
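The pre-processing steps above can be sketched with scikit-learn and pandas. The synthetic `balance` variable, the five quantile bins and the depth-2 tree are assumptions for illustration, not settings from the talk.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic skewed variable and a noisy binary target (illustrative only).
rng = np.random.default_rng(0)
balance = rng.lognormal(mean=8.0, sigma=1.0, size=500)
p_repay = 1 / (1 + np.exp(-(np.log(balance) - 8.0)))
y = (rng.random(500) < p_repay).astype(int)
X = balance.reshape(-1, 1)

# 1. Normalisation: zero mean, unit variance.
scaled = StandardScaler().fit_transform(X)

# 2a. Transformation: log to reduce skew.
logged = np.log1p(X)

# 2b. Rebinning: five quantile bins, encoded 0..4.
binned = pd.qcut(balance, q=5, labels=False)

# 2c. Replace the variable by the probability output of a shallow tree.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
tree_prob = tree.predict_proba(X)[:, 1]
```

A depth-2 tree has at most four leaves, so the variable is replaced by at most four distinct probability values, which is a coarse, target-aware form of rebinning.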

Feature Selection
From ~3000 features to ~tens
Less is more
• Easier implementation
• Faster
• More reliable

Feature Selection
From ~3000 features to ~tens
First stage (from a few 1000s to a few 100s)
• Single variable predictive value vs target
• Feature importance in Random Forests
Second stage (from a few 100s to a few 10s)
• Recursive feature elimination
• Logit
• Random Forests
• XGB
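A scaled-down sketch of the two-stage selection with scikit-learn, on synthetic data. The feature counts (40 to at most 30 to 5), the use of single-variable ROC-AUC as the "predictive value vs target", and logistic regression as the estimator inside recursive feature elimination are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the credit dataset (illustrative only).
X, y = make_classification(n_samples=600, n_features=40, n_informative=6,
                           n_redundant=4, random_state=0)

# Stage 1a: single-variable predictive value, measured as distance of ROC-AUC from 0.5.
single_auc = np.array([roc_auc_score(y, X[:, j]) for j in range(X.shape[1])])
single_strength = np.abs(single_auc - 0.5)

# Stage 1b: feature importance from a random forest.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importance = forest.feature_importances_

# Keep the union of the top 15 features under either criterion.
top_auc = np.argsort(single_strength)[-15:]
top_rf = np.argsort(importance)[-15:]
stage_one = np.union1d(top_auc, top_rf)

# Stage 2: recursive feature elimination with a logit model.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X[:, stage_one], y)
selected = stage_one[rfe.support_]
print(selected)
```

In a real project the selection would be run on training folds only, so that the final performance estimate is not biased by the selection step.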

Machine Learning Model Building
• Linear Models: Logistic Regression, MARS
• Tree Models: Random Forests, Gradient Boosted Trees
• Neural Networks
[Diagram: ~tens of features (Credit Agencies), models, Loan Repayment]
The models learn to assign:
• High probability to customers that can repay their loans.
• Low probability to customers that may encounter difficulties in repaying their loans.

Machine Learning Model - Performance
• Tree Models: Random Forests, Gradient Boosted Trees
• Neural Networks
• Linear Models: Logistic Regression, MARS
• ROC-AUC
• For each probability value:
   • How many times the model made a good assessment
   • How many times the model made a wrong assessment
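To make the ROC-AUC idea concrete, here is a small sketch with made-up labels and probabilities. It counts good and wrong assessments at one cut-off (0.5, chosen arbitrarily) and then computes ROC-AUC, which summarises the same trade-off over every probability value.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up example: 1 = repaid (no risk), 0 = had difficulties (risk).
y_true = np.array([1, 1, 0, 1, 0, 1, 0, 1])
y_prob = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.2, 0.85])

# At a single cut-off: how many assessments were good, how many wrong.
pred = (y_prob >= 0.5).astype(int)
n_good = int((pred == y_true).sum())
n_wrong = int((pred != y_true).sum())
print(n_good, n_wrong)  # 6 2

# ROC curve: true and false positive rates at every distinct probability value.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# ROC-AUC: fraction of (repaid, not repaid) pairs ranked in the right order.
auc = roc_auc_score(y_true, y_prob)
print(round(auc, 3))  # 0.8
```

Here 12 of the 15 (repaid, not repaid) pairs are ranked correctly, hence an AUC of 0.8.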

Ensemble Models - Trees
Random Forest and Gradient Boosted Trees
[Diagram: Credit Agencies, Average Probability, Loan Repayment]

Improving Logit – Bagging of the Predictors
[Diagram: Credit Agencies, Average Probability, Loan Repayment]
Improves ROC-AUC at the second decimal
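One way to read "bagging of the predictors" is bootstrap aggregation of logistic regressions whose probabilities are averaged. The sketch below uses scikit-learn's `BaggingClassifier` under that assumption; the synthetic data, 25 estimators and 70% feature subsampling are illustrative choices, and no particular gain is implied on this toy data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit dataset (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A single logit as the baseline.
single = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Bagged logits: each is fitted on a bootstrap sample of rows and a random
# 70% of the features; predict_proba averages the individual probabilities.
bagged = BaggingClassifier(LogisticRegression(max_iter=1000), n_estimators=25,
                           max_features=0.7, random_state=0).fit(X_tr, y_tr)

auc_single = roc_auc_score(y_te, single.predict_proba(X_te)[:, 1])
auc_bagged = roc_auc_score(y_te, bagged.predict_proba(X_te)[:, 1])
print(auc_single, auc_bagged)
```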

Model Stacking – Meta Ensembling
[Diagram: Logit, RF, GBT, NN and MARS each output a probability (Prob); the probabilities feed a Meta Model, which predicts Loan Repayment]

Model Stacking – Meta Ensembling
[Diagram: Logit, RF, GBT, NN and MARS each output a probability (Prob); the probabilities feed a Meta Model]
Meta Model:
1. Average Probability
2. Meta Machine Learning Model
No performance improvement
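The two meta-model options above can be sketched with scikit-learn. To keep the example short it stacks only three of the five base models (logit, RF, GBT), uses synthetic data, and assumes a logistic regression as the meta machine learning model; none of these choices come from the talk.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic stand-in for the credit dataset (illustrative only).
X, y = make_classification(n_samples=800, n_features=20, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

base_models = {
    "logit": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    "gbt": GradientBoostingClassifier(random_state=0),
}

# Out-of-fold probabilities on the training set, so the meta model never sees
# a prediction made on a row the base model was trained on.
oof = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models.values()
])

# Base-model probabilities on the test set, after refitting on all training rows.
test_probs = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models.values()
])

# Option 1: average probability.
avg_prob = test_probs.mean(axis=1)

# Option 2: meta machine learning model trained on the base probabilities.
meta = LogisticRegression().fit(oof, y_tr)
meta_prob = meta.predict_proba(test_probs)[:, 1]

print(roc_auc_score(y_te, avg_prob), roc_auc_score(y_te, meta_prob))
```

Comparing the two printed AUCs against the best single base model is the kind of check that leads to a conclusion like the slide's "no performance improvement".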

Machine Learning Model Building
[Diagram: ~tens of features (Credit Agencies), Gradient Boosted Trees and Neural Networks, Average Probability, Loan Repayment]
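A minimal sketch of the final set-up shown above, averaging the probabilities of a gradient boosted tree model and a neural network. The scikit-learn estimators, the network size and the synthetic data are assumptions; the talk does not say which implementations were used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ~tens of selected features (illustrative only).
X, y = make_classification(n_samples=800, n_features=20, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Gradient boosted trees need no scaling.
gbt = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# The neural network gets normalised inputs via a pipeline.
nn = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
).fit(X_tr, y_tr)

# Final score: the average of the two repayment probabilities.
final_prob = (gbt.predict_proba(X_te)[:, 1] + nn.predict_proba(X_te)[:, 1]) / 2
final_auc = roc_auc_score(y_te, final_prob)
print(final_auc)
```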

Predictor
Python integrated framework for variable pre-processing and ML model building
[Diagram: PREDICTOR, Seaborn]

Predictor
Python integrated framework for variable pre-processing and ML model building
• Fast: end-to-end model building without re-writing code
• Easy deploy: decreases overhead between model building and model deployment (few weeks)
• Flexible: can add layers of complexity on the go
• Proprietary: we are thinking of open sourcing Predictor
