
Give Me Some Credit Analysis

Qize Le
September 25, 2016


Transcript

  1. Data Overview 9/25/2016 2 © 2016 QIZESHOWCASE.WORDPRESS.COM ALL RIGHTS RESERVED

    | Variable | Mean | Min | Max | Median | 25% | 75% |
    |---|---|---|---|---|---|---|
    | Default (Response) | 6.68% | 0 | 1 | - | - | - |
    | Utilization | 6.04 | 0 | 50,708 | 15% | 2.9% | 56% |
    | Age | 52.29 | 0 | 109 | 52 | 41 | 63 |
    | # 30-59 Past Due | 0.42 | 0 | 98 | 0 | 0 | 0 |
    | Debt Ratio | 353 | 0 | 329,664 | 36% | 17% | 86% |
    | Monthly Income | 6,670 | 0 | 300,875 | 5,400 | 3,400 | 8,250 |
    | # Open Credit Line | 8.45 | 0 | 58 | 8 | 5 | 11 |
    | # 90 Days Late | 0.26 | 0 | 98 | 0 | 0 | 0 |
    | # Real Estate Loan | 1.02 | 0 | 54 | 1 | 0 | 2 |
    | # 60-89 Past Due | 0.24 | 0 | 98 | 0 | 0 | 0 |
    | # Dependents | 0.76 | 0 | 20 | 0 | 0 | 1 |

    • Outliers in Utilization and Debt Ratio
    • Heavily right-skewed: # 30-59 Past Due, Monthly Income, # Open Credit Line, # 90 Days Late, # Real Estate Loan, # 60-89 Past Due, and # Dependents
  2. Univariate – Revolving Utilization

    [Figure: Univariate Plot, empirical logit vs. utilization (x-axis 0% to 100%). Annotations: non-linear trend; outliers (average utilization = 115).]
    [Figure: Distribution Plot. Annotation: long tail.]

    • Overall, utilization shows a positive, roughly linear relationship with the logit of default.
    • At the low end (utilization < 5%), the empirical logit first decreases and then increases.
    • There are outliers with utilization > 1000. Those observations need to be treated separately.

    * Empirical logit is calculated as log((Y+0.5)/(N-Y+0.5)), where Y is the count of defaults and N is the count of observations in a given bin.
    * The Distribution Plot is generated with the Python Seaborn package.
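
The empirical logit formula on this slide can be sketched in a few lines of plain Python. This is only an illustration of log((Y+0.5)/(N-Y+0.5)) per bin: the helper names, the bin edges, and the toy data are assumptions made up here, not taken from the deck.

```python
import math
from collections import defaultdict


def empirical_logit(defaults: int, total: int) -> float:
    """log((Y + 0.5) / (N - Y + 0.5)) for a single bin."""
    return math.log((defaults + 0.5) / (total - defaults + 0.5))


def empirical_logit_by_bin(values, labels, edges):
    """Return {bin_index: empirical logit}.

    Bin i holds values v with edges[i] <= v < edges[i + 1]. Values outside
    the edges are skipped, e.g. extreme outliers that are handled separately.
    """
    counts = defaultdict(lambda: [0, 0])  # bin index -> [Y, N]
    for v, y in zip(values, labels):
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i][0] += y
                counts[i][1] += 1
                break
    return {i: empirical_logit(y, n) for i, (y, n) in sorted(counts.items())}


# Toy data, invented for illustration only (1 = default).
utilization = [0.01, 0.03, 0.30, 0.45, 0.70, 0.90, 0.95, 1200.0]
default = [0, 0, 0, 1, 0, 1, 1, 1]
logits = empirical_logit_by_bin(utilization, default, edges=[0.0, 0.05, 0.5, 1.0])
```

The 0.5 added to both counts keeps the logit finite for bins with no defaults (or only defaults), which matters for sparsely populated bins.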
  3. Univariate – Age

    [Figure: Univariate Plot, empirical logit vs. age (x-axis 25 to 95).]
    [Figure: Distribution Plot.]

    • Overall, age shows a negative, roughly linear relationship with the logit of default.
  4. Univariate – # of 30-59 Past Due

    [Figure: Univariate Plot, empirical logit vs. # of 30-59 Past Due (x-axis 0 to 20). Annotation: linear trend.]
    [Figure: Distribution Plot.]

    • From 0 to 7, # of 30-59 Past Due shows a positive linear relationship with the logit of default.
    • From 7 to 15, the trend is not clear.
    • Above 15, the logit of default is stably high.

    * Outliers: counts of 96 and 98.
  5. Univariate – Debt Ratio

    [Figure: Univariate Plot, empirical logit vs. debt ratio (x-axis 0.00 to 2.00). Annotations: non-linear trend; outliers (average ratio > 50).]
    [Figure: Distribution Plot. Annotation: long tail.]

    • Customers with debt ratio < 5% show a stable, low default rate.
    • Between 5% and 20%, debt ratio is negatively correlated with the empirical logit of default.
    • From 20% to 80%, debt ratio is positively correlated with the empirical logit.
    • Above 80%, the default rate is stably high.
  6. Univariate – Income

    [Figure: Univariate Plot, empirical logit vs. income (x-axis ticks (1.00) to 29,999.00).]
    [Figure: Distribution Plot. Annotation: long tail.]

    • Customers with missing income show a relatively low default rate.
    • For customers with income < $3K, income is positively correlated with default.
    • For income between $3K and $10K, the default rate decreases as income increases.
    • For income above $10K, the default rate is stable and low.
  7. Univariate – Open Credit Line Count

    [Figure: Univariate Plot, empirical logit vs. # of Open Credit Line (x-axis 0 to 25). Annotation: non-linear trend.]
    [Figure: Distribution Plot.]

    • Below 5 open credit lines, a higher open credit line count predicts lower default.
    • Above 5 open credit lines, the default rate is stable.
  8. Univariate – # of 90 Days Late

    [Figure: Univariate Plot, empirical logit vs. # of 90 Days Late (x-axis 0 to 20). Annotations: positive correlation; negative correlation (counter-intuitive).]
    [Figure: Distribution Plot.]

    • From 0 to 5, # of 90 Days Late shows a positive linear relationship with the logit of default.
    • From 7 to 15, # of 90 Days Late shows a negative linear relationship with the logit of default.
    • Above 15, the logit of default is stably high.

    * Outliers: counts of 96 and 98.
  9. Univariate – # of Real Estate Loan

    [Figure: Univariate Plot, empirical logit vs. # of Real Estate Loan (x-axis 0 to 20). Annotations: positive correlation; negative correlation; trend not clear.]
    [Figure: Distribution Plot.]

    • From 0 to 6, # of Real Estate Loan shows a positive linear relationship with the logit of default.
    • From 7 to 10, # of Real Estate Loan shows a negative linear relationship with the logit of default.
    • Above 10, the trend in the logit of default is not clear.
  10. Univariate – # of 60-89 Days Past Due

    [Figure: Univariate Plot, empirical logit vs. # of 60-89 Days Past Due (x-axis 0 to 20). Annotations: positive correlation; negative correlation.]
    [Figure: Distribution Plot.]

    • From 0 to 6, # of 60-89 Days Past Due shows a positive linear relationship with the logit of default.
    • From 7 to 10, # of 60-89 Days Past Due shows a negative linear relationship with the logit of default.
    • Above 10, the trend in the logit of default is not clear.
  11. Univariate – # of Dependents

    [Figure: Univariate Plot, empirical logit vs. # of Dependents (x-axis 0 to 20). Annotations: positive correlation; negative correlation.]
    [Figure: Distribution Plot.]

    • From 0 to 5, # of Dependents shows a positive linear relationship with the logit of default.
    • From 6 to 10, # of Dependents shows a negative linear relationship with the logit of default.
    • Above 10, the logit of default is high.
  12. Default Risk Indicator (Rule Based)

    Segments of the decision tree, starting from All Population:

    | # 90 Days Late | Revolving Util | # 30-59 PD | # obs | Bad Rate |
    |---|---|---|---|---|
    | = 0 | < 60% | = 0 | 100,200 | 1.7% |
    | = 0 | < 60% | > 0 | 12,082 | 8.5% |
    | = 0 | >= 60% | = 0 | 21,838 | 8.8% |
    | = 0 | >= 60% | > 0 | 7,551 | 23.8% |
    | < 2 | < 50% | | 1,676 | 20.7% |
    | < 2 | >= 50% | | 3,567 | 39.7% |
    | >= 2 | | | 3,095 | 55.1% |

    Risk tiers shown on the slide: Low Risk, Mid Risk, High Risk.

    * Results are based on Decision Tree Classification in scikit-learn
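
The rule-based segments above can be read as a small lookup function. The sketch below is one possible reading, not the author's code: it assumes the "# 90 Days Late<2" branch with utilization splits at 50% covers applicants with exactly one 90-days-late event, and it reads the "55/1%" figure as 55.1%. The function name and argument names are hypothetical.

```python
def segment_bad_rate(n_90_days_late: int, utilization: float, n_30_59_past_due: int) -> float:
    """Return the bad rate reported on the slide for the matching segment.

    Assumption: the "# 90 Days Late < 2" branch that is not "= 0" means
    exactly one 90-days-late event. Utilization is a fraction (0.6 = 60%).
    """
    if n_90_days_late >= 2:
        return 0.551  # assumed reading of "55/1%"
    if n_90_days_late == 1:
        return 0.207 if utilization < 0.50 else 0.397
    # n_90_days_late == 0
    if utilization < 0.60:
        return 0.017 if n_30_59_past_due == 0 else 0.085
    return 0.088 if n_30_59_past_due == 0 else 0.238


# Hypothetical applicants.
clean_history = segment_bad_rate(n_90_days_late=0, utilization=0.10, n_30_59_past_due=0)
repeat_late = segment_bad_rate(n_90_days_late=3, utilization=0.10, n_30_59_past_due=0)
```

The slide does not say which segments map to the Low, Mid, and High Risk tiers, so the sketch returns the reported bad rate rather than a tier label.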
  13. Variable Cleansing & Transformation

    | Variable | Cleansing & Transformation |
    |---|---|
    | Utilization | Split at 5%, cap at 100%, missing imputed at median value |
    | Age | Cap at 85, missing imputed at median value |
    | # 30-59 Past Due | Zero indicator, split at 7, cap at 15, missing imputed at median value |
    | Debt Ratio | Split at 20%, cap at 80%, missing imputed at median value |
    | Monthly Income | Split at 3K, cap at 10K, missing imputed at median value |
    | # Open Credit Line | Square transformation, cap at 10, missing imputed at median value |
    | # 90 Days Late | Zero indicator, split at 5, cap at 15, missing imputed at median value |
    | # Real Estate Loan | Zero indicator, split at 6 and 10, >10 with high value indicator, missing imputed at median value |
    | # 60-89 Past Due | Zero indicator, split at 6 and 10, >10 with high value indicator, missing imputed at median value |
    | # Dependents | Split at 5 and 10, >10 with high value indicator, missing imputed at median value |
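
A minimal sketch of the cleansing steps listed above (median imputation, capping, zero indicator, splits), in plain Python. The slides do not spell out how a "split" is encoded, so this assumes piecewise-linear segment terms, which may or may not be what the `_s1`/`_s2` features in the deck contain. All function names are hypothetical.

```python
from statistics import median


def impute_median(values):
    """Replace None with the median of the observed (non-missing) values."""
    med = median(v for v in values if v is not None)
    return [med if v is None else v for v in values]


def cap(x, upper):
    """Clip x at an upper bound."""
    return min(x, upper)


def zero_indicator(x):
    """1 if the count is exactly zero, else 0."""
    return 1 if x == 0 else 0


def split_terms(x, knots, upper):
    """Piecewise-linear pieces of x after capping at `upper`.

    Assumed encoding: one term per segment between consecutive bounds
    (0, knots..., upper); the terms sum to min(x, upper).
    """
    x = cap(x, upper)
    bounds = [0.0, *knots, upper]
    return [max(0.0, min(x, hi) - lo) for lo, hi in zip(bounds, bounds[1:])]


# Monthly Income: split at 3K, cap at 10K (toy value).
income_terms = split_terms(4500, knots=[3000], upper=10000)
# # 30-59 Past Due: zero indicator, split at 7, cap at 15 (toy outlier value 98).
past_due_terms = split_terms(98, knots=[7], upper=15)
past_due_zero = zero_indicator(98)
# Age: missing imputed at the median of the observed values.
ages = impute_median([30, None, 50])
```

Capping before splitting keeps extreme values such as the 96 and 98 counts from stretching the last segment.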
  14. Variable Importance

    | Feature | Importance |
    |---|---|
    | RevolvingUtilizationOfUnsecuredLines | 0.13842 |
    | RevolvingUtilizationOfUnsecuredLines_s2 | 0.12707 |
    | age | 0.091988 |
    | DebtRatio | 0.061434 |
    | MonthlyIncome | 0.060277 |
    | MonthlyIncome_s2 | 0.04758 |
    | DebtRatio_s2 | 0.045814 |
    | NumberOfTimes90DaysLate_0 | 0.034609 |
    | NumberOfTimes90DaysLate_s1 | 0.032311 |
    | NumberOfTime30_59DaysPastDueNotWorse_s1 | 0.031688 |
    | NumberOfTime30_59DaysPastDueNotWorse | 0.030462 |
    | NumberOfTimes90DaysLate | 0.028534 |
    | NumberOfOpenCreditLinesAndLoans_sq | 0.02847 |
    | NumberOfOpenCreditLinesAndLoans | 0.028289 |
    | NumberOfTime30_59DaysPastDueNotWorse_0 | 0.027215 |
    | RevolvingUtilizationOfUnsecuredLines_s1 | 0.022843 |
    | DebtRatio_s1 | 0.021536 |
    | MonthlyIncome_s1 | 0.018943 |
    | NumberOfDependents_s1 | 0.018521 |
    | NumberOfDependents | 0.018506 |
    | NumberOfTime60_89DaysPastDueNotWorse_s1 | 0.01789 |
    | NumberRealEstateLoansOrLines | 0.016533 |
    | NumberRealEstateLoansOrLines_s1 | 0.016209 |
    | NumberOfTime60_89DaysPastDueNotWorse | 0.01503 |
    | NumberOfTime60_89DaysPastDueNotWorse_0 | 0.014257 |
    | NumberRealEstateLoansOrLines_0 | 0.004053 |
    | NumberRealEstateLoansOrLines_s2 | 0.000735 |
    | NumberOfTimes90DaysLate_s2 | 0.000257 |
    | NumberRealEstateLoansOrLines_h | 0.00024 |
    | NumberOfDependents_s2 | 0.000176 |
    | NumberOfTime30_59DaysPastDueNotWorse_s2 | 0.000079 |
    | NumberOfTime60_89DaysPastDueNotWorse_s2 | 0.000015 |
    | NumberOfTime60_89DaysPastDueNotWorse_h | 0.000009 |
    | NumberOfDependents_h | 0.000007 |

    Annotation: the first 25 features (through NumberOfTime60_89DaysPastDueNotWorse_0) are marked "Importance>0.01".

    * Results are based on Random Forest Model in scikit-learn
  15. Two Way Interaction Terms

    • Only interactions among the top 5 important variables are considered.
    • The predictive power of the two-way interaction terms is tested with the random forest variable importance method.

    | Interaction Term | Importance |
    |---|---|
    | Age*MonthlyIncome | 0.048944 |
    | Age*DebtRatio | 0.043506 |
    | DebtRatio*MonthlyIncome | 0.039051 |
    | Age*NumberOfTimes90DaysLate | 0.023346 |

    Example: Age*Monthly Income

    [Figure: Two Way Analysis Plot, empirical logit vs. income (x-axis 0 to 12,000), one line each for age<=40, 40<age<=60, and age>60.]

    The middle age group (40<age<=60) shows a steeper trend between income and default than the young and senior populations.
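
As a rough illustration, a two-way interaction term such as Age*MonthlyIncome can be built as the product of the two variables. The deck does not state how its interaction features were computed, so the product form, the helper name, and the toy row below are assumptions.

```python
from itertools import combinations


def add_interactions(row: dict, names: list) -> dict:
    """Return a copy of `row` with one product term per pair of `names`."""
    out = dict(row)
    for a, b in combinations(names, 2):
        out[f"{a}*{b}"] = row[a] * row[b]
    return out


# Toy applicant, invented for illustration only.
applicant = {"age": 45, "DebtRatio": 0.5, "MonthlyIncome": 6000}
with_terms = add_interactions(applicant, ["age", "DebtRatio", "MonthlyIncome"])
```

Each candidate term can then be scored alongside the original features, for example with a variable importance method like the one named on the slide.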
  16. Logistic Regression Modeling & Performance

    | Features | Model 1 | Model 2 | Model 3 |
    |---|---|---|---|
    | Intercept | -7.38E-07 | -7.5E-07 | -7.52E-07 |
    | RevolvingUtilizationOfUnsecuredLines | 0.0011574 | 0.0004254 | 0.0002984 |
    | RevolvingUtilizationOfUnsecuredLines_s2 | 0.0011327 | 0.0004164 | 0.0002922 |
    | age | -0.003671 | -0.001266 | -0.000971 |
    | DebtRatio | -9.56E-06 | -8.16E-06 | -5.25E-06 |
    | MonthlyIncome | -0.000619 | -0.000613 | -0.000593 |
    | MonthlyIncome_s2 | 0.0007445 | 0.0007743 | 0.000733 |
    | DebtRatio_s2 | -1.49E-05 | -8.77E-06 | -6.3E-06 |
    | NumberOfTimes90DaysLate_0 | -0.000676 | -0.000309 | -0.000225 |
    | NumberOfTimes90DaysLate_s1 | -0.000791 | -0.000485 | -0.000374 |
    | NumberOfTime30_59DaysPastDueNotWorse_s1 | -0.002283 | -0.000904 | -0.000634 |
    | NumberOfTime30_59DaysPastDueNotWorse | 0.0019156 | 0.0008296 | 0.0005876 |
    | NumberOfTimes90DaysLate | 0.0002765 | 0.0003848 | 0.0003156 |
    | NumberOfOpenCreditLinesAndLoans_sq | 0.0019003 | -0.000481 | -0.000491 |
    | NumberOfOpenCreditLinesAndLoans | -0.000253 | -0.000213 | -0.000165 |
    | NumberOfTime30_59DaysPastDueNotWorse_0 | -0.001155 | -0.000429 | -0.000302 |
    | RevolvingUtilizationOfUnsecuredLines_s1 | -2.48E-05 | -9.02E-06 | -6.16E-06 |
    | DebtRatio_s1 | -5.49E-06 | -7.65E-07 | -1.21E-06 |
    | MonthlyIncome_s1 | -0.00085 | -0.000863 | -0.000931 |
    | NumberOfDependents_s1 | -0.000604 | -0.000269 | -0.000178 |
    | NumberOfDependents | 0.0006077 | 0.0002669 | 0.0001752 |
    | NumberOfTime60_89DaysPastDueNotWorse_s1 | -0.00083 | -0.00037 | -0.000273 |
    | NumberRealEstateLoansOrLines | 0.0002401 | 6.264E-05 | 2.884E-05 |
    | NumberRealEstateLoansOrLines_s1 | -0.000147 | -3.92E-05 | -6.32E-06 |
    | NumberOfTime60_89DaysPastDueNotWorse | -0.003474 | -0.000477 | -0.000249 |
    | NumberOfTime60_89DaysPastDueNotWorse_0 | -0.000686 | -0.000271 | -0.000193 |
    | Age*MonthlyIncome | -5.29E-06 | -5.66E-06 | -5.49E-06 |
    | Age*DebtRatio | -0.002222 | -0.000944 | -0.000673 |
    | DebtRatio*MonthlyIncome | 0.000141 | 0.000131 | 0.0001203 |
    | Age*NumberOfTimes90DaysLate | 0.0203988 | 0.0189057 | 0.0154811 |

    [Figure: 3 Folds Cross Validation diagram (Fold 1, Fold 2, Fold 3; Training Data, Testing Data). AUC values shown: Model 1 0.744, Model 2 0.742, Model 3 0.745; labels "Training AUC" and "Test AUC", with a further value of 0.744.]
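
For readers unfamiliar with the evaluation on this slide, the sketch below shows one way to compute AUC and to form three cross-validation folds in plain Python. It is not the author's code: the pairwise AUC definition is standard, but the fold assignment rule (row index modulo 3) and the toy scores are assumptions made for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison.

    Fraction of (positive, negative) pairs in which the positive case gets
    the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def three_fold_indices(n_rows, n_folds=3):
    """Yield (train_idx, test_idx); row i is assigned to fold i % n_folds."""
    for k in range(n_folds):
        test = [i for i in range(n_rows) if i % n_folds == k]
        train = [i for i in range(n_rows) if i % n_folds != k]
        yield train, test


# Toy labels and model scores, invented for illustration only.
toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
folds = list(three_fold_indices(6))
```

In a setup like the one pictured, each of the three models would be fit on one training split and scored with `auc` on the held-out rows. The pairwise loop is quadratic, so a rank-based implementation is preferable for a dataset of this size.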