Variable              Mean     Min   Max       Median   25%     75%
Default (Response)    6.68%    0     1         -        -       -
Utilization           6.04     0     50,708    15%      2.9%    56%
Age                   52.29    0     109       52       41      63
# 30-59 Past Due      0.42     0     98        0        0       0
Debt Ratio            353      0     329,664   36%      17%     86%
Monthly Income        6,670    0     300,875   5,400    3,400   8,250
# Open Credit Line    8.45     0     58        8        5       11
# 90 Days Late        0.26     0     98        0        0       0
# Real Estate Loan    1.02     0     54        1        0       2
# 60-89 Past Due      0.24     0     98        0        0       0
# Dependents          0.76     0     20        0        0       1
• Utilization and Debt Ratio contain extreme outliers (maxima of 50,708 and 329,664 against 75th percentiles of 56% and 86%)
• Heavily right-skewed: # 30-59 Past Due, Monthly Income, # Open Credit Line, # 90 Days Late, # Real Estate Loan, # 60-89 Past Due, and # Dependents
2016 QIZESHOWCASE.WORDPRESS.COM ALL RIGHTS RESERVED
Utilization
[Figure: Univariate Plot, Empirical Logit vs. Utilization (0% to 100%)]
[Figure: Distribution Plot]
Plot annotations: Non-linear trend; Long tail; * Outliers (average utilization=115)
• Overall, utilization is positively and roughly linearly correlated with the logit of default
• At the low end (utilization < 5%), the empirical logit first decreases and then increases
• Some observations have utilization > 1000; these outliers need to be treated separately
* Empirical logit is calculated as log((Y+0.5)/(N-Y+0.5)), where Y is the count of defaults and N is the count of observations in a given bin
* Distribution Plot is generated with the Python Seaborn package
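The empirical logit footnote above can be illustrated with a minimal sketch. The function name `empirical_logit`, the bin edges, and the toy data are assumptions made for illustration; only the formula log((Y+0.5)/(N-Y+0.5)) comes from the slide.

```python
import math
from collections import defaultdict

def empirical_logit(values, defaults, bin_edges):
    """Return {bin_index: log((Y + 0.5) / (N - Y + 0.5))} for each non-empty bin.

    Bin i covers bin_edges[i] <= value < bin_edges[i + 1]. Values outside
    the edges are skipped, mirroring the idea of treating outliers separately.
    """
    n = defaultdict(int)  # N: observations per bin
    y = defaultdict(int)  # Y: defaults per bin
    for v, d in zip(values, defaults):
        for i in range(len(bin_edges) - 1):
            if bin_edges[i] <= v < bin_edges[i + 1]:
                n[i] += 1
                y[i] += d
                break
    return {i: math.log((y[i] + 0.5) / (n[i] - y[i] + 0.5)) for i in sorted(n)}

# Toy data, made up for illustration (not taken from the slides)
utilization = [0.02, 0.03, 0.10, 0.30, 0.70, 0.90, 0.95, 0.40]
default = [0, 0, 0, 0, 1, 1, 0, 0]
logits = empirical_logit(utilization, default, [0.0, 0.05, 0.5, 1.0])
```

The 0.5 added to numerator and denominator keeps the logit finite for bins with no defaults (or only defaults), which is why it is convenient for sparse bins in the tail.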
# of 30-59 Past Due
[Figure: Univariate Plot, Empirical Logit vs. # of 30-59 Past Due (0 to 20)]
[Figure: Distribution Plot]
Plot annotations: Linear trend; * Outliers count=96,98
• From 0 to 7, # of 30-59 Past Due is positively and linearly correlated with the logit of default
• From 7 to 15, the trend is not clear
• Above 15, the logit of default is stable and high
Debt Ratio
[Figure: Univariate Plot, Empirical Logit vs. Debt Ratio (0.00 to 2.00)]
[Figure: Distribution Plot]
Plot annotations: Non-linear trend; Long tail; * Outliers (average ratio>50)
• Customers with debt ratio < 5% show a stable, low default rate
• Between 5% and 20%, debt ratio is negatively correlated with the empirical logit of default
• From 20% to 80%, debt ratio is positively correlated with the empirical logit
• Above 80%, the default rate is stable and high
Income
[Figure: Univariate Plot, Empirical Logit vs. Income (up to about 30,000)]
[Figure: Distribution Plot]
Plot annotations: Long tail
• Customers with missing income show a relatively low default rate
• For customers with income < $3K, income is positively correlated with default
• For income between $3K and $10K, the default rate decreases as income increases
• For income above $10K, the default rate is stable and low
# of Open Credit Line
[Figure: Univariate Plot, Empirical Logit vs. # of Open Credit Line (0 to 25)]
[Figure: Distribution Plot]
Plot annotations: Non-linear trend
• Below 5 open credit lines, each additional open credit line predicts a lower default rate
• Above 5 open credit lines, the default rate is stable
# of 90 Days Late
[Figure: Univariate Plot, Empirical Logit vs. # of 90 Days Late (0 to 20)]
[Figure: Distribution Plot]
Plot annotations: Positive correlation; Negative correlation (counter-intuitive); * Outliers count=96,98
• From 0 to 5, # of 90 Days Late is positively and linearly correlated with the logit of default
• From 7 to 15, # of 90 Days Late is negatively and linearly correlated with the logit of default
• Above 15, the logit of default is stable and high
# of Real Estate Loan
[Figure: Univariate Plot, Empirical Logit vs. # of Real Estate Loan (0 to 20)]
[Figure: Distribution Plot]
Plot annotations: Positive correlation; Negative correlation; Trend not clear
• From 0 to 6, # of Real Estate Loan is positively and linearly correlated with the logit of default
• From 7 to 10, # of Real Estate Loan is negatively and linearly correlated with the logit of default
• Above 10, the trend in the logit of default is not clear
# of Dependents
[Figure: Univariate Plot, Empirical Logit vs. # of Dependents (0 to 20)]
[Figure: Distribution Plot]
Plot annotations: Positive correlation; Negative correlation
• From 0 to 5, # of Dependents is positively and linearly correlated with the logit of default
• From 6 to 10, # of Dependents is negatively and linearly correlated with the logit of default
• Above 10, the logit of default is high
All Population
  # 90 Days Late<2
    # 90 Days Late=0
      Revolving Util<60%
        # 30-59 PD=0: # obs: 100,200, Bad Rate: 1.7%
        # 30-59 PD>0: # obs: 12,082, Bad Rate: 8.5%
      Revolving Util>=60%
        # 30-59 PD=0: # obs: 21,838, Bad Rate: 8.8%
        # 30-59 PD>0: # obs: 7,551, Bad Rate: 23.8%
    Remaining branch (by elimination, # 90 Days Late=1)
      Revolving Util<50%: # obs: 1,676, Bad Rate: 20.7%
      Revolving Util>=50%: # obs: 3,567, Bad Rate: 39.7%
  # 90 Days Late>=2: # obs: 3,095, Bad Rate: 55.1%
Risk tiers: Low Risk, Mid Risk, High Risk
* Results are based on Decision Tree Classification in scikit-learn
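The segmentation above can be written as a plain rule function, which is one way such a tree might be applied to new customers. This is a sketch, not the original scikit-learn model: the function name, the leaf labels, and the reading of the unlabeled branch as # 90 Days Late=1 are assumptions; the thresholds and bad rates are the ones shown on the slide.

```python
def risk_segment(late_90, util, pd_30_59):
    """Map one customer to a leaf of the tree on the slide.

    late_90 and pd_30_59 are counts; util is a fraction (0.60 means 60%).
    """
    if late_90 >= 2:
        return "90DL>=2"
    if late_90 == 0:
        util_part = "util<60%" if util < 0.60 else "util>=60%"
        pd_part = "PD=0" if pd_30_59 == 0 else "PD>0"
        return f"90DL=0, {util_part}, {pd_part}"
    # Assumption: the remaining branch under "# 90 Days Late<2" is late_90 == 1
    return "90DL=1, util<50%" if util < 0.50 else "90DL=1, util>=50%"

# Bad rates per leaf, copied from the slide
LEAF_BAD_RATE = {
    "90DL=0, util<60%, PD=0": 0.017,
    "90DL=0, util<60%, PD>0": 0.085,
    "90DL=0, util>=60%, PD=0": 0.088,
    "90DL=0, util>=60%, PD>0": 0.238,
    "90DL=1, util<50%": 0.207,
    "90DL=1, util>=50%": 0.397,
    "90DL>=2": 0.551,
}

segment = risk_segment(late_90=0, util=0.75, pd_30_59=1)
expected_bad_rate = LEAF_BAD_RATE[segment]
```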
Variable: Cleansing & Transformation
Utilization: Split at 5%, cap at 100%, impute missing values with the median
Age: Cap at 85, impute missing values with the median
# 30-59 Past Due: Zero indicator, split at 7, cap at 15, impute missing values with the median
Debt Ratio: Split at 20%, cap at 80%, impute missing values with the median
Monthly Income: Split at 3K, cap at 10K, impute missing values with the median
# Open Credit Line: Square transformation, cap at 10, impute missing values with the median
# 90 Days Late: Zero indicator, split at 5, cap at 15, impute missing values with the median
# Real Estate Loan: Zero indicator, split at 6 and 10, high value indicator for >10, impute missing values with the median
# 60-89 Past Due: Zero indicator, split at 6 and 10, high value indicator for >10, impute missing values with the median
# Dependents: Split at 5 and 10, high value indicator for >10, impute missing values with the median
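The transformations listed above could be sketched as a few small helpers. The slide does not say how a "split" is encoded; this sketch assumes a piecewise-linear encoding (one term below the knot, one above), and all function names and the sample values are hypothetical.

```python
from statistics import median

def impute_median(values):
    """Replace None (missing) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def cap(values, upper):
    """Cap each value at an upper bound."""
    return [min(v, upper) for v in values]

def split_at(values, knot):
    """Assumed encoding of a split: two piecewise-linear terms around the knot."""
    low = [min(v, knot) for v in values]
    high = [max(v - knot, 0.0) for v in values]
    return low, high

def zero_indicator(values):
    """1 where the count is exactly zero, else 0."""
    return [1 if v == 0 else 0 for v in values]

# Utilization: impute at median, cap at 100%, split at 5% (made-up sample)
raw_util = [0.02, None, 0.30, 12.0, 0.80]
util = cap(impute_median(raw_util), 1.0)
util_low, util_high = split_at(util, 0.05)

# Zero indicator as used for the past-due counts (made-up sample)
past_due_is_zero = zero_indicator([0, 2, 0, 98])
```

Imputing before capping means the median is taken over uncapped values; because the median is insensitive to extreme values, the order makes little difference here.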
• Only interactions among the top 5 most important variables are considered
• The predictive power of each two-way interaction term is tested with the random forest variable importance method
Interaction Term               Importance
Age*MonthlyIncome              0.048944
Age*DebtRatio                  0.043506
DebtRatio*MonthlyIncome        0.039051
Age*NumberOfTimes90DaysLate    0.023346
Example: Age*Monthly Income
[Figure: Two Way Analysis Plot, Empirical Logit vs. Income (0 to 12,000), one series each for age<=40, 40<age<=60, and age>60]
The middle age group (40<age<=60) shows a steeper trend between income and default than the young and senior populations
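A two-way analysis like the Age*Monthly Income example can be sketched by computing the empirical logit within each age band and income bin. The age bands follow the plot legend; the 2,000 income bin width, the function names, and the toy rows are assumptions for illustration only.

```python
import math
from collections import defaultdict

def age_band(age):
    """Age bands as labeled in the two-way plot legend."""
    if age <= 40:
        return "age<=40"
    if age <= 60:
        return "40<age<=60"
    return "age>60"

def two_way_logit(rows, income_bin_width=2000):
    """rows: iterable of (age, income, default) tuples.

    Returns {(age_band, income_bin_start): empirical logit}, using
    log((Y + 0.5) / (N - Y + 0.5)) within each cell.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [N, Y]
    for age, income, default in rows:
        income_bin = int(income // income_bin_width) * income_bin_width
        cell = counts[(age_band(age), income_bin)]
        cell[0] += 1
        cell[1] += default
    return {k: math.log((y + 0.5) / (n - y + 0.5)) for k, (n, y) in counts.items()}

# Toy rows, made up for illustration (not taken from the slides)
rows = [(35, 2500, 1), (38, 2100, 0), (50, 2500, 1),
        (55, 9000, 0), (45, 8100, 0), (70, 9500, 0)]
cells = two_way_logit(rows)
```

Plotting one line per age band over the income bins would give the kind of comparison shown in the slide, where a steeper line for one band suggests an interaction between age and income.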