Machine Learning in Finance - London PyData 2017

Sole
May 23, 2017

Slides presented during my talk at London PyData 2017

Transcript

  1. Soledad Galli, PhD, Data Scientist. London PyData, 6th May 2017

  2. Who are Zopa?

  3. Zopa today

    • World’s first peer-to-peer lending platform, since 2004
    • Lent £2.33 billion to date, and our growth is accelerating
    • 246,000 people have taken a Zopa loan
    • 59,000 actively invest through Zopa
    • 1st
  4. Why I joined Zopa

    • State of the art machine learning and data analytics
    • Agile, proprietary technology
    • End to end team collaboration from R&D to deployment
  5. Peer to peer lending at Zopa

    Diagram labels: Investor, Borrower, £££ to lend out, £££ loan, £££ repayments, £££ interest + capital
  6. Machine Learning at Zopa

    • Credit Risk
    • Fraud
    • ID verification
    • Marketing
    • Document tampering
    • Pricing
    • Customer segmentation
  7. Credit Risk Assessment (Strictly Private & Confidential)

    Diagram labels: Borrower, Loan Repayment
    • Affects the borrower’s financial situation
    • Loss of investors’ capital and interest
    • Damaged reputation as a responsible lender
  8. Credit Risk Assessment Journey

    • Gathering the Dataset
    • Target Definition
    • Feature / Variable Optimisation
    • Feature / Variable Pre-Processing
    • Feature / Variable Selection
    • Machine Learning Model building and optimisation
    • Model stacking
    • Credit Risk Model deployment
  10. Dataset

    Credit Agencies: Financial Information
    • Mortgages
    • Credit cards
    • Current accounts, etc.
    • Balances and Payments
    • Predictive variables
    Applicant’s snapshot: intended use of loan, job, salary, etc.
    Dataset: 100s of thousands of borrowers, ~ 3000 characteristics (features / variables)
  12. Target Definition

    • Mortgages
    • Credit cards
    • Current accounts
    • Others
    Signs of financial difficulty (default, missed payments): borrower with risk (0)
    Signs of financial health: borrower without risk (1)
  13. Dataset + Target

    • Rows: 100s of thousands of borrowers
    • Columns: ~ 3000 characteristics (features / variables) from Credit Agencies
    • Target vector: 1 / 0 (Risk = 0, No Risk = 1)
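The target definition on slides 12 and 13 can be illustrated with a minimal pandas sketch. The column names (`defaulted`, `missed_payments`) and the toy rows are hypothetical; the slides only state that signs of financial difficulty map to 0 and signs of financial health map to 1.

```python
import pandas as pd

# Hypothetical credit-history flags, one row per borrower (not from the talk).
history = pd.DataFrame(
    {
        "defaulted": [False, False, True, False, False],
        "missed_payments": [0, 0, 1, 3, 0],
    }
)

# Risk (0) if there is any sign of financial difficulty, otherwise no risk (1).
in_difficulty = history["defaulted"] | (history["missed_payments"] > 0)
target = (~in_difficulty).astype(int)

print(target.tolist())
```

With this encoding a higher model output means a borrower who is more likely to repay, which matches the convention used later in the slides.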
  15. Feature Engineering I – Categorical variables

    1. Null values
      • Replace with random sample of data
      • Replace with most frequent category
      • Add an additional category for null values
    2. Rare values
      • Replace with random sample of data
      • Replace with most frequent category
      • Add an additional category for Rare Values
    3. Convert to numbers
      1. Replace by the number of items in that category
      2. Assign order when ordinal
      3. One hot encoding
      4. Assign value according to target mean per category
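A few of the categorical options above can be sketched with pandas. This is an illustration under assumptions, not the code used at Zopa: the `job` column, the toy values, and the 20% rarity threshold are all hypothetical.

```python
import pandas as pd

# Toy data; the column name and values are hypothetical.
df = pd.DataFrame(
    {
        "job": ["nurse", "nurse", "teacher", None, "teacher", "pilot", "nurse", "teacher"],
        "target": [1, 1, 0, 1, 1, 0, 1, 0],
    }
)

# 1. Null values: add an additional category.
job = df["job"].fillna("Missing")

# 2. Rare values: group categories seen in fewer than 20% of rows
#    (the threshold is an arbitrary choice for this sketch).
freq = job.value_counts(normalize=True)
rare = freq[freq < 0.2].index
job = job.where(~job.isin(rare), "Rare")

# 3.1 Replace each category by the number of rows in that category.
count_encoded = job.map(job.value_counts())

# 3.4 Replace each category by the target mean of that category.
target_mean = df["target"].groupby(job).mean()
mean_encoded = job.map(target_mean)

# 3.3 One hot encoding.
one_hot = pd.get_dummies(job, prefix="job")
```

In a real project the counts and target means would normally be learned on the training set only and then applied to new data, otherwise the encoding leaks target information.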
  16. Feature Engineering II – Numerical variables

    1. Null values
      • Replace with random sample of data
      • Replace with median or mean
      • Replace with number far out in the distribution
    2. Outliers (Only for Linear Models)
      • Replace with random sample of data
      • Replace with median or mean
      • Replace with number at ends of distribution
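The numerical options above can be sketched in a few lines of pandas. The toy values, the `-1.0` sentinel, and the 5th/95th percentile caps are assumptions made for illustration; the slides do not give specific values.

```python
import numpy as np
import pandas as pd

# Toy data; values are hypothetical.
income = pd.Series([20.0, 22.0, 25.0, np.nan, 27.0, 30.0, 500.0])

# 1. Null values: replace with the median of the observed values.
median_filled = income.fillna(income.median())

# 1 (alternative). Replace with a number far out in the distribution.
far_out_filled = income.fillna(-1.0)

# 2. Outliers: replace with a number at the ends of the distribution.
#    The 5th and 95th percentiles are an arbitrary choice for this sketch.
lower, upper = median_filled.quantile([0.05, 0.95])
capped = median_filled.clip(lower=lower, upper=upper)
```

Capping keeps the row in the dataset while limiting the leverage that an extreme value such as 500 would have on a linear model.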
  17. Feature Engineering III – Variable Pre-Processing

    1. Normalisation (Linear Models and Neural Networks)
    2. Linearisation (Linear Models)
      • Transformation (log, sqrt, etc)
      • Rebinning
      • Replace by the probability output of shallow tree
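The pre-processing steps above can be sketched with scikit-learn. The synthetic feature, the noisy target, and the tree depth of 2 are assumptions chosen to keep the example small; they are not taken from the talk.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic right-skewed feature and a noisy binary target (both hypothetical).
x = rng.lognormal(mean=0.0, sigma=1.0, size=500).reshape(-1, 1)
y = (rng.random(500) < x.ravel() / (1.0 + x.ravel())).astype(int)

# 1. Normalisation: zero mean and unit variance.
x_scaled = StandardScaler().fit_transform(x)

# 2. Transformation: a log compresses the long right tail.
x_log = np.log(x)

# 2. Replace the variable by the probability output of a shallow tree.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(x, y)
x_tree = tree.predict_proba(x)[:, 1]
```

A depth-2 tree has at most four leaves, so `x_tree` takes at most four distinct values. In that sense the tree acts as a supervised rebinning whose output is already on a probability scale.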
  19. Feature Selection

    From ~ 3000 Credit Agencies variables to ~ tens
    Less is more
    • Easier implementation
    • Faster
    • More reliable
  20. Feature Selection

    From ~ 3000 Credit Agencies variables to ~ tens
    First Stage (from few 1000s to few 100s)
    • Single variable predictive value vs Target
    • Feature importance in Random Forests
    Second Stage (from few 100s to few 10s)
    • Recursive feature elimination
    • Logit
    • Random Forests
    • XGB
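The two-stage selection above might look roughly like the following scikit-learn sketch. The synthetic dataset, the feature counts (30 to 15 to 5), and the use of a logit inside the recursive elimination are assumptions for illustration; the slide does not say how the stages were combined.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the credit dataset (hypothetical sizes).
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=5, n_redundant=0, random_state=0
)

# First stage: predictive value of each single variable vs the target.
# The AUC is folded around 0.5 so the direction of the relationship is ignored.
single_auc = np.array([roc_auc_score(y, X[:, j]) for j in range(X.shape[1])])
single_auc = np.maximum(single_auc, 1.0 - single_auc)

# First stage: feature importance in a random forest, keeping the top 15.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
keep = np.argsort(forest.feature_importances_)[::-1][:15]

# Second stage: recursive feature elimination with a logit, down to 5.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X[:, keep], y)
selected = keep[rfe.support_]

print("selected column indices:", sorted(selected.tolist()))
```

The cheap first-stage filters cut the candidate set before the more expensive recursive elimination, which refits the model once per eliminated feature.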
  22. Machine Learning Model Building

    • Linear Models: Logistic Regression, MARS
    • Tree Models: Random Forests, Gradient Boosted Trees
    • Neural Networks
    Diagram labels: ~ tens Credit Agencies, Loan Repayment
  23. Machine Learning Model Building

    • Linear Models: Logistic Regression, MARS
    • Tree Models: Random Forests, Gradient Boosted Trees
    • Neural Networks
    Diagram labels: ~ tens Credit Agencies, Loan Repayment
    The models learn to assign:
    • High probability to customers that can repay their loans.
    • Low probability to customers that may encounter difficulties in repaying their loans.
  24. Machine Learning Model - Performance

    • Tree Models: Random Forests, Gradient Boosted Trees
    • Neural Networks
    • Linear Models: Logistic Regression, MARS
    ROC-AUC
    • For each probability value:
      • How many times the model made a good assessment
      • How many times the model made a wrong assessment.
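As a rough illustration of the model building and ROC-AUC comparison described above, the sketch below fits three of the listed model families on synthetic data. The dataset, hyperparameters, and the use of scikit-learn's `GradientBoostingClassifier` are assumptions; MARS and neural networks are left out to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the selected credit variables (hypothetical sizes).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "logit": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gbt": GradientBoostingClassifier(random_state=0),
}

auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Probability of class 1, i.e. of a borrower without risk.
    prob_repay = model.predict_proba(X_test)[:, 1]
    auc[name] = roc_auc_score(y_test, prob_repay)
    print(f"{name}: ROC-AUC = {auc[name]:.3f}")
```

ROC-AUC summarises, over every probability threshold, the trade-off between correct and wrong assessments, so it lets models be compared without fixing a single cut-off.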
  25. Ensemble Models - Trees

    Random Forest and Gradient Boosted Trees
    Diagram labels: Credit Agencies, Average Probability, Loan Repayment
  26. Improving Logit – Bagging of the Predictors

    Improves ROC-AUC at the second decimal
    Diagram labels: Credit Agencies, Average Probability, Loan Repayment
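One possible reading of "bagging of the predictors" is to fit many logits on random subsets of the variables and average their probabilities. The sketch below follows that reading with scikit-learn's `BaggingClassifier`; the interpretation, the synthetic data, and the settings (25 estimators, half of the features each) are assumptions, and the example makes no claim about the size of any improvement.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit dataset (hypothetical sizes).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Single logit as the reference point.
logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc_single = roc_auc_score(y_test, logit.predict_proba(X_test)[:, 1])

# Bagged logits: each one sees a bootstrap sample of the rows and a random
# half of the variables; predict_proba averages the individual probabilities.
bagged = BaggingClassifier(
    LogisticRegression(max_iter=1000),
    n_estimators=25,
    max_features=0.5,
    random_state=0,
).fit(X_train, y_train)
bagged_prob = bagged.predict_proba(X_test)[:, 1]
auc_bagged = roc_auc_score(y_test, bagged_prob)

print(f"single logit: {auc_single:.3f}, bagged logit: {auc_bagged:.3f}")
```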
  28. Model Stacking – Meta Ensembling

    Diagram: Logit, RF, GBT, NN and MARS each output a probability (Prob) that feeds a Meta Model, which predicts Loan Repayment
  29. Model Stacking – Meta Ensembling

    Diagram: Logit, RF, GBT, NN and MARS each output a probability (Prob) that feeds a Meta Model
    Meta Model options:
    1. Average Probability
    2. Meta Machine Learning Model
    No performance improvement
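The two meta-model options above can be sketched as follows. This is a minimal illustration on synthetic data with three base models (logit, random forest, gradient boosted trees); the choice of a logistic regression as the meta model and the 5-fold out-of-fold scheme are assumptions, not details given in the talk.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic stand-in for the credit dataset (hypothetical sizes).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

base_models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=100, random_state=0),
    GradientBoostingClassifier(random_state=0),
]

# Out-of-fold probabilities on the training set, so the meta model is not
# trained on predictions for rows the base models have already seen.
train_probs = np.column_stack(
    [
        cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
        for m in base_models
    ]
)
# Base models refitted on the full training set, then scored on the test set.
test_probs = np.column_stack(
    [m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base_models]
)

# Option 1: average probability.
avg_prob = test_probs.mean(axis=1)

# Option 2: meta machine learning model trained on the base probabilities.
meta = LogisticRegression().fit(train_probs, y_train)
meta_prob = meta.predict_proba(test_probs)[:, 1]

auc_avg = roc_auc_score(y_test, avg_prob)
auc_meta = roc_auc_score(y_test, meta_prob)
print(f"average: {auc_avg:.3f}, meta model: {auc_meta:.3f}")
```

Comparing `auc_avg` and `auc_meta` against the best single base model is the kind of check that would show whether stacking adds anything on a given dataset.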
  30. Machine Learning Model Building

    • Gradient Boosted Trees
    • Neural Networks
    Diagram labels: ~ tens Credit Agencies, Average Probability, Loan Repayment
  32. Predictor

    Python integrated framework for variable pre-processing and ML model building
    Diagram labels: PREDICTOR, Seaborn
  33. Predictor

    Python integrated framework for variable pre-processing and ML model building
    • Fast: end to end model building without re-writing code
    • Easy deploy: decreases overhead between model building and model deployment (few weeks)
    • Flexible: can add layers of complexity on the go
    • Proprietary: we are thinking of open sourcing Predictor