Effects of missing data in credit risk scoring. A comparative analysis of methods to achieve robustness in the absence of sufficient data

  • Part 1: Consumer Credit Risk Modelling
Journal of the Operational Research Society

Abstract

The 2004 Basel II Accord highlighted the benefits of managing credit risk through internal models that use internal data to estimate the risk components: probability of default (PD), loss given default, exposure at default and maturity. Internal data are the primary source for PD estimates; banks are permitted to use statistical default prediction models to estimate borrowers’ PD, subject to requirements on the accuracy, completeness and appropriateness of the data. In practice, however, internal records are usually incomplete or do not contain enough history to estimate the PD. Missing data are especially critical for low default portfolios, which are characterised by inadequate default records and therefore make it difficult to design statistically significant prediction models. Several methods can be used to deal with missing data, such as list-wise deletion, application-specific list-wise deletion, substitution techniques and imputation models (simple and multiple variants). List-wise deletion is an easy-to-use method widely applied by social scientists, but it discards substantial data and reduces the diversity of information, resulting in bias in the model's parameters, results and inferences. The choice of the best method largely depends on the nature of the missing values, that is, on whether they are missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR), but there is a lack of empirical analysis of their effect on credit risk, which limits the validity of the resulting models. In this paper, we analyse the nature and effects of missing data in credit risk modelling (MCAR, MAR and MNAR processes) and take into account a scarce data set on consumer borrowers, which includes different percentages and distributions of missing data. The findings are used to analyse the performance of several methods for dealing with missing data: list-wise deletion, simple imputation methods, maximum likelihood estimation (MLE) models and advanced multiple imputation (MI) alternatives based on Markov Chain Monte Carlo and re-sampling methods. Results are evaluated and compared across models in terms of robustness, accuracy and complexity. In particular, MI models are found to provide very valuable solutions for missing data in credit risk.
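To make the two simplest approaches named in the abstract concrete, the following is a minimal sketch of list-wise deletion and mean substitution on a toy set of applicant records. The records, field names and helper functions are hypothetical illustrations, not the data or code used in the paper.

```python
from statistics import mean

# Toy applicant records (hypothetical values); None marks a missing item.
records = [
    {"age": 30.0, "income": 40.0},
    {"age": None, "income": 55.0},
    {"age": 50.0, "income": None},
    {"age": 40.0, "income": 65.0},
]
fields = ["age", "income"]


def listwise_delete(rows, fields):
    """Keep only the rows in which every field is observed."""
    return [r for r in rows if all(r[f] is not None for f in fields)]


def mean_substitute(rows, fields):
    """Replace each missing item by the mean of that field's observed values."""
    means = {f: mean(r[f] for r in rows if r[f] is not None) for f in fields}
    return [
        {f: (r[f] if r[f] is not None else means[f]) for f in fields}
        for r in rows
    ]


complete_cases = listwise_delete(records, fields)  # drops incomplete rows
filled = mean_substitute(records, fields)          # keeps every row
```

In this toy case list-wise deletion discards half of the records, while mean substitution keeps all of them at the cost of shrinking the variance of the filled fields, which is one reason the abstract goes on to consider imputation models.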

Notes

  1. A complete analysis on causes, prevention and treatment of item non-response can be consulted in De Leeuw et al (2003).

  2. Variables are dichotomized through the substitution of observed values by 1 and missing values by 0.

  3. The degree on mean square error will often be more than one standard error and its direction will depend on the application, pattern of missing data and model estimated (Sherman, 2000).

  4. Obtained by assuming that the imputed data set is the complete data set and calculating the usual variance estimate.

  5. Where vec(·) operator stacks the unique elements.

  6. Information on this data set is available at http://mlearn.ics.uci.edu/databases/credit-screening/. Original data used in this paper can be obtained at http://mlearn.ics.uci.edu/databases/credit-screening/crx.data (information on missing values is included in the ‘crx.data’ file, see ‘?’ symbols).
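Notes 2, 4 and 6 can be illustrated with a short sketch. It assumes that records are comma-separated and that the variance in note 4 plays the role of the within-imputation variance in the combining rules of Rubin (1987); both are assumptions made for illustration. The function names, the sample record and the numeric estimates are hypothetical and are not taken from the paper or from the 'crx.data' file.

```python
from statistics import mean, variance

MISSING = "?"  # symbol that marks missing items in 'crx.data' (note 6)


def parse_row(line):
    """Split one comma-separated record; '?' fields become None."""
    return [None if v.strip() == MISSING else v.strip() for v in line.split(",")]


def response_indicators(row):
    """Note 2: code observed values as 1 and missing values as 0."""
    return [0 if v is None else 1 for v in row]


def pool(estimates, within_variances):
    """Combine estimates from m imputed data sets with Rubin's (1987) rules."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    w = mean(within_variances)     # average within-imputation variance
    b = variance(estimates)        # between-imputation variance (divisor m - 1)
    return q_bar, w + (1 + 1 / m) * b  # total variance T = W + (1 + 1/m) B


# Hypothetical record in a comma-separated layout, not a line from the file.
row = parse_row("b,?,4.5,u,?")
flags = response_indicators(row)

# Hypothetical estimates and variances from m = 3 imputed data sets.
estimate, total_var = pool([0.50, 0.60, 0.70], [0.04, 0.04, 0.04])
```

The total variance exceeds the average complete-data variance whenever the imputed data sets disagree, which is how multiple imputation carries the uncertainty due to missing values into the final inference.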

References

  • Anderson AB, Basilevsky A and Hum DPJ (1983). Missing data: A review of the literature. In: Rossi PH, Wright JD and Anderson AB (eds). Handbook of Survey Research. Academic Press: New York, pp. 415–494.

  • Baesens B, Viaene S, van Gestel TM, Suykens JAL, Dedene G, de Moor B and Vanthienen J (2000). An Empirical Assessment of Kernel Type Performance for Least Squares SVM Classifiers. Dept. Applied Economic Sciences, Katholieke Universiteit Leuven: Leuven, Belgium.

  • Basel Committee on Banking Supervision (BCBS) (2004). International Convergence of Capital Measurement and Capital Standards. A Revised Framework. Bank for International Settlements: Basel, June.

  • Basel Committee on Banking Supervision (BCBS) (2005a). Studies on the Validation of Internal Rating Systems. Working Paper no. 14, Bank for International Settlements, Basel, February.

  • Basel Committee on Banking Supervision (BCBS) (2005b). Validation of low-default portfolios in the Basel II Framework. Newsletter no. 6, Bank for International Settlements, Basel, September.

  • Basel Committee on Banking Supervision (BCBS) (2006). The IRB Use Test: Background and Implementation. Newsletter no. 9, Bank for International Settlements, Basel, September.

  • Benjamin N, Cathcart A and Ryan K (2006). Low default portfolios: A proposal for conservative estimation of default probabilities. Working Paper, Financial Services Authority, London, April.

  • British Bankers' Association (BBA) (2004). The IRB Approach for Low Default Portfolios (LDPs). BBA: London, August.

  • Carey M and Hrycay M (2001). Parameterizing credit risk models with rating data. J Bank Financ 25: 197–270.

  • Carpenter J, Kenward M, Evans S and White I (2004). Last observation carry-forward and last observation analysis. Stat Med 23: 3241–3244.

  • Carpenter J, Kenward M and Vansteelandt S (2006). A comparison of multiple imputation and doubly robust estimation for analyses with missing data. J R Stat Soc Ser A 169: 571–584.

  • Cavaretta MJ and Chellapilla L (1999). Data mining using genetic programming: The implications of parsimony on generalization error. In: Angeline PJ, Michalewicz Z, Schoenauer M, Yao X and Zalzala A (eds) Proceedings of the Congress on Evolutionary Computation 2, IEEE Press: Washington, DC, pp 1330–1337.

  • Cheng TG and Victoria-Feser MP (2000). Robust correlation estimation with missing data. Cahiers du département d'économétrie n. 2000-5, University of Geneva.

  • Collins LM, Schafer JL and Kam CM (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychol Methods 6: 330–351.

  • De Leeuw E, Hox J and Huisman M (2003). Prevention and treatment of item nonresponse. J Off Stat 19: 153–176.

  • Dempster AP, Laird NM and Rubin DB (1977). Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39: 1–22.

  • Eggermont J, Kok JN and Kosters WA (2004). Genetic programming for data classification: Partitioning the search space. In: Haddad HM, Omicini A, Wainwright RL and Liebrock LM (eds) Proceedings of the 2004 Symposium on Applied Computing, ACM Press: New York, pp 1001–1005.

  • European Central Bank (ECB) (2004). Credit Risk Transfer by EU Banks: Activities, Risks and Risk Management. European Central Bank: Frankfurt am Main, May.

  • Financial Services Authority (FSA) (2005). Expert Group Paper on Low Default Portfolios. FSA: London, August.

  • Hair JF, Anderson RE, Tatham RL and Black WC (1999). Multivariate analysis. Prentice-Hall: New York.

  • Honaker J and King G (2006). What to do about missing values in time series cross-section data. Working Paper, Harvard University, September.

  • Horton NJ and Kleinman KP (2007). Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. Am Stat 61: 79–90.

  • Horton NJ, Lipsitz SR and Parzen M (2003). A potential for bias when rounding in multiple imputation. Am Stat 57: 229–232.

  • Huang CL, Chen MC and Wang CJ (2006). Credit scoring with a data mining approach based on support vector machines. Expert Sys Appl 33: 847–856.

  • Ibrahim JG (1990). Incomplete data in generalized linear models. J Am Stat Assoc 85: 765–769.

  • Ibrahim JG, Chen MH, Lipsitz SR and Herring AH (2005). Missing data methods for generalized linear models: A comparative review. J Am Stat Assoc 100(469): 332–346.

  • Jacobson T and Roszbach K (2003). Bank lending policy, credit scoring and value-at-risk. J Bank Financ 27: 615–633.

  • Jansen I, Beunckens C, Molenberghs G, Verbeke G and Mallinckrodt C (2006). Analyzing incomplete discrete longitudinal clinical trial data. Stat Sci 21: 222–230.

  • King G, Honaker J, Joseph A and Scheve K (2001). Analyzing incomplete political science data: An alternative algorithm for multiple imputation. Am Polit Sci Rev 95(1): 49–69.

  • Little JR (1988). Missing-data adjustments in large surveys (with discussion). J Bus Econ Stat 6: 287–301.

  • Little JR and Rubin D (2002). Statistical Analysis with Missing Data. Wiley: New York.

  • Little JR and Schenker N (1995). Missing data. In: Arminger G, Clogg CC and Sobel ME (eds). Handbook of Statistical Modeling for the Social and Behavioral Sciences. Plenum: New York, pp. 39–75.

  • Moons KGM, Donders RA, Stijnen T and Harrell FE (2006). Using the outcome for imputation of missing predictor values was preferred. J Clin Epidemiol 59: 1092–1101.

  • Oesterreichische Nationalbank (OeNB) (2004). Rating Models and Validation. Guidelines on Credit Risk Management. Oesterreichische Nationalbank and Austrian Financial Authority: Vienna, November.

  • Pluto K and Tasche D (2004). Estimating the probabilities of default for low default portfolios. Working Paper, Deutsche Bundesbank, April.

  • Quinlan RS (1979). Discovering rules by induction from large collections of examples. In: Michie D (ed). Expert Systems in the Microelectronic Age. Edinburgh University Press: Edinburgh, pp. 168–201.

  • Quinlan RS (1992). C4.5: Programs for Machine Learning. Morgan Kaufmann: New York.

  • Rocke DM (1996). Robustness properties of S-estimators of multivariate location and shape in high dimension. Ann Stat 24: 1327–1345.

  • Rubin DB (1976). Inference and missing data. Biometrika 63: 581–590.

  • Rubin DB (1987). Multiple Imputation for Nonresponse in Surveys. Wiley and Sons: New York.

  • Schafer JL (1997). Analysis of Incomplete Multivariate Data. Chapman & Hall: New York.

  • Schuermann T and Hanson S (2005). Confidence intervals for probabilities of default. Working Paper, Federal Reserve Bank of New York, July.

  • Sherman RP (2000). Tests of certain types of ignorable nonresponse in surveys subject to item nonresponse or attrition. Am J Polit Sci 44: 356–368.

  • Staten ME and Cate FH (2003). The impact of Opt-In Privacy Rules on retail credit markets: A case study of MBNA. Duke Law J 52: 745–786; June.

  • van Buuren S, Boshuizen HC and Knook DL (1999). Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med 18: 681–694.

Acknowledgements

The author gratefully acknowledges the helpful comments and questions of two anonymous reviewers.

Author information

Correspondence to R Florez-Lopez.

Appendix A

Tables A1, A2, A3, A4, A5 and A6.

Table 8 Models for dealing with missing values. Results for Listwise Deletion.
Table 9 Models for dealing with missing values. Results for Mean Substitution.
Table 10 Models for dealing with missing values. Results for EM algorithm.
Table 11 Models for dealing with missing values. Results for IP algorithm.
Table 12 Models for dealing with missing values. Results for EMis algorithm.
Table 13 Models for dealing with missing values. Results for EMB algorithm.

About this article

Cite this article

Florez-Lopez, R. Effects of missing data in credit risk scoring. A comparative analysis of methods to achieve robustness in the absence of sufficient data. J Oper Res Soc 61, 486–501 (2010). https://doi.org/10.1057/jors.2009.66

  • DOI: https://doi.org/10.1057/jors.2009.66
