Agencies
Applicant's snapshot
• Intended use of loan, job, salary, etc.
Dataset
• 100s of thousands of borrowers
• ~ 3000 characteristics (features / variables)
Financial Information
• Mortgages
• Credit cards
• Current accounts, etc.
• Balances and Payments
• Predictive variables
Categorical variables
Credit Agencies
1. Null values
   • Replace with random sample of data
   • Replace with most frequent category
   • Add an additional category for null values
2. Rare values
   • Replace with random sample of data
   • Replace with most frequent category
   • Add an additional category for rare values
3. Convert to numbers
   • Replace by the number of items in that category
   • Assign order when ordinal
   • One-hot encoding
   • Assign value according to target mean per category
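A few of the options above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the `jobs` column, the `repaid` target and the `min_count` threshold are hypothetical and do not come from the slides.

```python
from collections import Counter

def fill_nulls_with_category(values, label="Missing"):
    """Add an additional category for null values."""
    return [label if v is None else v for v in values]

def group_rare(values, min_count=2, label="Rare"):
    """Replace categories seen fewer than min_count times with one extra category."""
    counts = Counter(values)
    return [label if counts[v] < min_count else v for v in values]

def count_encode(values):
    """Replace each category by the number of items in that category."""
    counts = Counter(values)
    return [counts[v] for v in values]

def target_mean_encode(values, target):
    """Replace each category by the mean of the target within that category."""
    sums, counts = Counter(), Counter()
    for v, t in zip(values, target):
        sums[v] += t
        counts[v] += 1
    return [sums[v] / counts[v] for v in values]

# Hypothetical toy data: job title per applicant and whether the loan was repaid.
jobs = ["nurse", "teacher", None, "nurse", "pilot", "teacher", "nurse", None]
repaid = [1, 1, 0, 1, 0, 0, 1, 0]

clean = group_rare(fill_nulls_with_category(jobs))
by_count = count_encode(clean)
by_target = target_mean_encode(clean, repaid)
```

In practice the category counts and target means would presumably be learned on the training set only and then applied to new applicants, so that the target does not leak into the encoded feature.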
Numerical variables
1. Null values
   • Replace with random sample of data
   • Replace with median or mean
   • Replace with number far out in the distribution
2. Outliers (only for Linear Models)
   • Replace with random sample of data
   • Replace with median or mean
   • Replace with number at ends of distribution
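As a rough sketch of three of these options (median imputation, imputation with a value far out in the distribution, and capping outliers at the ends of the distribution), assuming a hypothetical `salaries` column and hand-picked capping bounds:

```python
import statistics

def impute_median(values):
    """Replace nulls with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]

def impute_far_value(values, n_std=3):
    """Replace nulls with a number far out in the distribution (mean + n_std * std)."""
    observed = [v for v in values if v is not None]
    far = statistics.mean(observed) + n_std * statistics.pstdev(observed)
    return [far if v is None else v for v in values]

def cap_outliers(values, lower, upper):
    """Replace values beyond the bounds with the bounds themselves."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical toy column (in thousands); None marks a null, 500 is an outlier.
salaries = [30, 35, None, 40, 45, 500]

filled = impute_median(salaries)
far_filled = impute_far_value(salaries)
# The bounds are assumptions; they would normally come from chosen percentiles.
capped = cap_outliers(filled, lower=30, upper=100)
```

The far-out value keeps "was missing" visible to tree models as its own region of the variable, whereas the median hides it; which one is preferable depends on the model family.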
Pre-Processing
1. Normalisation (Linear Models and Neural Networks)
2. Linearization (Linear Models)
   • Transformation (log, sqrt, etc.)
   • Rebinning
   • Replace by the probability output of shallow tree
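A minimal sketch of normalisation, a log transformation and rebinning, using only the standard library. The `balances` column and the bin edges are hypothetical, and standardisation to zero mean and unit variance is just one possible reading of "normalisation"; the shallow-tree option is not shown.

```python
import bisect
import math
import statistics

def standardise(values):
    """Normalise to zero mean and unit standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def log_transform(values):
    """Compress a long right tail; log1p keeps zero balances at zero."""
    return [math.log1p(v) for v in values]

def rebin(values, edges):
    """Return the index of the bin each value falls in, given sorted bin edges."""
    return [bisect.bisect_right(edges, v) for v in values]

# Hypothetical, heavily skewed account balances.
balances = [0, 10, 100, 1000, 10000]

z = standardise(balances)
logged = log_transform(balances)
binned = rebin(balances, edges=[50, 500, 5000])
```

After the log transformation the values are roughly evenly spaced, which is closer to the shape a linear model can use directly.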
First Stage (from a few 1000s to a few 100s)
• Single variable predictive value vs Target
• Feature importance in Random Forests
Second Stage (from a few 100s to a few 10s)
• Recursive feature elimination
• Logit
• Random Forests
• XGB
Variables: ~ 3000 reduced to ~ tens
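The first-stage idea of scoring each variable on its own against the target could look roughly like the sketch below. It assumes single-variable AUC as the measure of predictive value, which the slides do not specify; the column names and data are hypothetical, and the second stage (recursive feature elimination) is not shown.

```python
def auc(scores, labels):
    """Chance that a random positive scores higher than a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_features(columns, target):
    """Order features by how far their single-variable AUC is from 0.5 (no signal)."""
    strength = {name: abs(auc(col, target) - 0.5) for name, col in columns.items()}
    return sorted(strength, key=strength.get, reverse=True)

# Hypothetical toy data: 1 = loan repaid.
columns = {
    "salary": [20, 25, 30, 50, 60, 70],
    "missed_payments": [3, 1, 2, 0, 1, 0],
    "noise": [5, 1, 4, 2, 6, 3],
}
target = [0, 0, 0, 1, 1, 1]

ranking = rank_features(columns, target)
selected = ranking[:2]  # keep the strongest variables for the next stage
```

Using the distance from 0.5 rather than the raw AUC keeps variables that are strongly negatively related to repayment, such as `missed_payments` here.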
Linear Models
• Logistic Regression
• MARS
Tree Models
• Random Forests
• Gradient Boosted Trees
Neural Networks
~ tens of variables
Loan Repayment
The models learn to assign:
• High probability to customers that can repay their loans.
• Low probability to customers that may encounter difficulties in repaying their loans.
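To illustrate what "learning to assign a probability" means for the simplest model in the list, here is a minimal logistic regression fitted by batch gradient descent. It is a sketch, not the pipeline's implementation: the two features, the data, the learning rate and the number of epochs are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of repayment for one customer with feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Hypothetical, already normalised features: [income, missed payments]; 1 = repaid.
X = [[1.2, -0.8], [0.9, -0.5], [0.4, -1.0], [-0.7, 0.9], [-1.1, 0.6], [-0.3, 1.2]]
y = [1, 1, 1, 0, 0, 0]

w, b = fit_logistic(X, y)
p_good = predict_proba(w, b, [1.0, -1.0])   # high income, few missed payments
p_bad = predict_proba(w, b, [-1.0, 1.0])    # low income, many missed payments
```

The fitted model gives the first customer a higher repayment probability than the second, which is the behaviour described in the bullets above.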
Tree Models
• Random Forests
• Gradient Boosted Trees
Neural Networks
Linear Models
• Logistic Regression
• MARS
ROC-AUC
• For each probability value:
  • How many times the model made a good assessment
  • How many times the model made a wrong assessment
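The ROC-AUC bullets can be made concrete with a small sketch: for each probability threshold, count the correct and incorrect assessments, turn them into a point on the ROC curve, and integrate with the trapezoid rule. The predicted probabilities and labels below are hypothetical, with 1 meaning the loan was repaid.

```python
def roc_points(probs, labels):
    """For each probability threshold, return the (false positive rate, true positive rate) point."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(probs), reverse=True):
        # Customers at or above the threshold are assessed as "will repay".
        tp = sum(1 for p, y in zip(probs, labels) if p >= thr and y == 1)  # good assessments
        fp = sum(1 for p, y in zip(probs, labels) if p >= thr and y == 0)  # wrong assessments
        points.append((fp / n_neg, tp / n_pos))
    return points

def roc_auc(probs, labels):
    """Area under the ROC curve by the trapezoid rule."""
    pts = roc_points(probs, labels)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Hypothetical model outputs and true outcomes.
probs = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

score = roc_auc(probs, labels)  # 8/9, roughly 0.89
```

A score of 1.0 would mean every repaying customer is ranked above every defaulting one, and 0.5 corresponds to random ranking.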
variable pre-processing and ML model building
PREDICTOR
Fast
• End-to-end model building without re-writing code
Easy deploy
• Decreases overhead between model building and model deployment (few weeks)
Flexible
• Can add layers of complexity on the go
Proprietary
• We are thinking of open sourcing Predictor