1 Introduction

Credit allows access to resources today in exchange for an agreement to repay over a period of time, usually at regular intervals. The resources may be financial, or they may consist of products or services. Credit has become an essential component of everyday life. Although credit cards are currently the most popular form of credit, other credit plans include residential mortgages, auto loans, student loans, small business loans, trade financing and bonds, among others.

The main risk for banks and financial institutions lies in the difficulty of distinguishing creditworthy applicants from those who will probably default on repayments. The recent world financial crisis has drawn increasing attention from financial institutions to credit risk prediction and assessment. The decision to grant credit to an applicant was traditionally based upon subjective judgments made by human experts, using past experiences and some guiding principles. Common practice was to consider the classic 3 C's, 4 C's or 5 C's of credit: character, capacity, capital, collateral and conditions (Abrahams and Zhang, 2008). This method suffers, however, from high training costs, frequent incorrect decisions, and inconsistent decisions made by different experts for the same application.

These shortcomings have led to a rise in more formal and accurate methods to assess the risk of default. In this context, automatic credit scoring has become a primary tool for financial institutions to evaluate credit risk, improve cash flow, reduce possible risks, and make managerial decisions (Thomas et al, 2002). Credit scoring is the set of decision models and their underlying methods that help lenders determine whether credit should be granted to an applicant. The ultimate goal of credit scoring is to assess creditworthiness and discriminate between ‘good’ and ‘bad’ debts, depending on how likely applicants are to default on their repayments (Lim and Sohn, 2007). Compared with the subjective methods, automatic credit scoring models present a number of interesting advantages (Rosenberg and Gleit, 1994; Thomas et al, 2002; Blöchlinger and Leippold, 2006): (i) reduction in the cost of the credit evaluation process and in the expected risk of bad loans; (ii) time and effort savings; (iii) consistent recommendations based on objective information, thus eliminating human biases and prejudices; (iv) facilities to incorporate changes in policy and/or the economy into the system; and (v) the performance of the credit scoring model can be monitored, tracked, and adjusted at any time.

Since the seminal reference to credit scoring in the paper by Altman (1968), many other developments have subsequently been proposed (Baesens et al, 2003; Bahrammirzaee, 2010; Abdou and Pointon, 2011). After the Basel II recommendations issued by the Basel Committee on Banking Supervision in 2004, it has become almost a regulatory requirement for banks and financial organizations to utilize advanced credit scoring models in order to enhance the efficiency of capital allocation. The effects of Basel II on the competitive advantage of an institution are so large that research on the adequacy, applicability, and validity of the adopted systems is set to be a cardinal research initiative (Crook et al, 2007).

The classical approaches to credit scoring employ statistical methods (discriminant analysis, linear and logistic regression, multivariate adaptive regression splines, classification and regression trees, nonparametric smoothing, survival analysis) or operations research models (linear programming, quadratic programming, integer programming, multiple criteria programming, dynamic programming), but it is also possible to find more sophisticated techniques belonging to the area of computational intelligence (often referred to as data mining or soft computing), such as neural networks, support vector machines, fuzzy systems, rough sets, artificial immune systems, and evolutionary algorithms.

Evolutionary computing methods are highly capable of extracting meaning from imprecise data and detecting trends that are too complex to be discovered by either humans or other conventional techniques. However, Fensterstock (2005) points out two main weaknesses in the application of these techniques to credit risk assessment: (i) they require a major computational effort, and (ii) they remain little known to financial and business analysts. Despite these criticisms, one can find a considerable number of articles where evolutionary computing has been applied to credit scoring, which reflects the interest of the scientific community in this subject.

The aim of this review article is to synthesize, analyse, and evaluate relevant evolutionary computing techniques (considered as a subfield of computational intelligence) that have been applied to credit risk management, exploring the most recent research trends and identifying weaknesses and suggestions for further research. The review will focus on the three major areas or problem types where evolutionary computing has been used in credit scoring: classification, variable selection, and parameter optimization.

The remainder of the paper is organized as follows. Section 2 gives a brief introduction to evolutionary computing techniques. Section 3 outlines the research methodology, defines credit scoring as a classification problem, and provides a general description of feature or variable selection. Section 4 examines the existing research on evolutionary computation applied to credit scoring. Section 5 discusses the key points of the review, putting the emphasis on the shortcomings and the potential problems and topics that should be addressed in the future. Finally, Section 6 summarizes the general findings from this study.

2 Evolutionary computation

Computational intelligence is defined as the study of adaptive mechanisms to enable or facilitate intelligent behaviour in complex and changing environments (Bezdek, 1994; Engelbrecht, 2007). These mechanisms include artificial intelligence concepts, paradigms, algorithms, and implementations that exhibit an ability to learn or adapt to new situations, to generalize, abstract, discover, and associate.

To tackle complex real-world problems, scientists have been looking into natural processes and creatures, both as model and metaphor. Optimization is at the heart of many natural processes including Darwinian evolution, social group behaviour, and foraging strategies. Over the last few decades, there has been growing interest in the field of biologically or nature-inspired search and optimization algorithms. Currently, these techniques are applied to a variety of problems, from scientific research to finance, industry, and commerce, to name but a few. The two main families of methods that primarily comprise this field are evolutionary computing and swarm intelligence.

In particular, evolutionary computation constitutes a subfield of computational intelligence mainly concerned with combinatorial and other optimization problems. Thus evolutionary computing is the generic term for a set of problem-solving techniques based on the Darwinian principles of natural selection and evolution (Eiben and Smith, 2007). It is inspired by the biological processes of inheritance, mutation, natural selection, and the genetic crossover that occurs when parents mate to produce offspring. It makes use of the concept of survival of the fittest by progressively accepting better solutions to the problem.

The common underlying idea behind all variants of evolutionary algorithms is that, given a population of individuals, the environment causes natural selection and this produces a rise in the fitness of the population. Given a quality function to be maximized, we can randomly generate a set of candidate solutions (eg, a set of IF-THEN classification rules) and apply the quality function as an abstract fitness measure. Based on this, some of the better candidates are chosen to seed the next generation by applying selection, mutation, and crossover operators. These operators lead to a set of new candidates that compete with the old ones for a place in the next generation. This process can be iterated until a candidate with sufficient quality is found or a previously fixed computational limit is reached.

In general, evolutionary algorithms are stochastic search algorithms that operate on a population of candidate solutions. Traditionally, they refer to four categories of similar but independently developed techniques (Fogel, 2000): genetic algorithms, evolution strategies, evolutionary programming, and genetic programming. Of these, the genetic algorithm and genetic programming approaches have been the most widely used in credit scoring applications.

2.1. Genetic algorithms and genetic programming

Genetic algorithms (Holland, 1975, 1992) provide a method to perform randomized global search in a solution space. They operate on a population of potential solutions applying the principle of survival of the fittest to produce (hopefully) better and better approximations to a solution. At each generation, a new set of approximations is created by the process of selecting individuals according to their level of fitness in the problem domain and breeding them together using operators borrowed from natural genetics. This process leads to the evolution of populations of individuals that are better suited to their environment than the individuals that they were created from.

Usually, the algorithm starts with a random population of N candidate solutions, which are internally encoded as chromosomes (in the form of a string). Next, the quality of each chromosome x in the population is evaluated by a fitness function φ(x), and pairs of chromosomes are selected, favouring the fitter individuals, to cross over and form new solutions (offspring). A further genetic operator, called mutation, may then be applied to the new offspring, causing the individual genetic representation to be changed according to some probabilistic rule. After recombination and mutation, the process continues through subsequent generations and terminates either after a predefined number of iterations or when the best member of the latest populations has not improved during a certain number of iterations.
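
To make the procedure concrete, the following is a minimal sketch of such a generational loop for a binary string encoding, written in Python. It is purely illustrative: the encoding, the tournament selection scheme, and all parameter values are assumptions rather than a prescription, and the fitness function is supplied by the user.

```python
import random

def genetic_algorithm(fitness, n_genes, pop_size=50, generations=100,
                      crossover_rate=0.8, mutation_rate=0.01):
    """Minimal generational genetic algorithm over binary chromosomes (illustrative)."""
    # 1. Random initial population of binary strings
    population = [[random.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        new_population = []
        while len(new_population) < pop_size:
            # 2. Selection: binary tournament favouring fitter individuals
            parents = []
            for _ in range(2):
                a, b = random.sample(range(pop_size), 2)
                parents.append(population[a] if scores[a] >= scores[b] else population[b])
            # 3. Crossover: single-point recombination of the two parents
            child = list(parents[0])
            if random.random() < crossover_rate:
                point = random.randint(1, n_genes - 1)
                child = parents[0][:point] + parents[1][point:]
            # 4. Mutation: flip each gene with a small probability
            child = [1 - g if random.random() < mutation_rate else g for g in child]
            new_population.append(child)
        population = new_population
    # Return the best chromosome found in the final generation
    return max(population, key=fitness)
```

Calling genetic_algorithm(sum, n_genes=20), for example, evolves a 20-bit chromosome towards the all-ones string, since the fitness is then simply the number of ones.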

The three most important aspects of using genetic algorithms are: (i) definition of the fitness function, (ii) definition and implementation of the genetic representation, and (iii) definition and implementation of the genetic operators.

On the other hand, genetic programming, proposed by Koza (1992), is an extension of the original concept of genetic algorithms. The population in genetic programming is composed of variable-length, tree-like candidate solutions whose size, shape, and complexity can change dynamically during the process. Each of these candidates, called a program, may contain functional nodes, enabling the solution to perform arbitrarily complex actions. Genetic programming also uses crossover and mutation as the transformation operators to change candidate solutions into new candidate solutions, as genetic algorithms do.
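
As an illustration of this tree-based representation, the sketch below encodes a candidate credit scoring rule as a nested Python tuple and evaluates it recursively; the attribute names, the function set, and the threshold are hypothetical and only serve to show how a program tree differs from the fixed-length strings used by genetic algorithms.

```python
import operator

# Function set of the (illustrative) genetic programming system
FUNCTIONS = {'add': operator.add, 'sub': operator.sub,
             'mul': operator.mul, 'gt': operator.gt}

def evaluate(node, sample):
    """Recursively evaluate a program tree on one credit applicant (a dict)."""
    if isinstance(node, tuple):                       # functional node
        func = FUNCTIONS[node[0]]
        return func(*(evaluate(child, sample) for child in node[1:]))
    if isinstance(node, str):                         # terminal: input variable
        return sample[node]
    return node                                       # terminal: constant

# Example program: classify as 'good' when income - 0.3*debt exceeds a threshold
program = ('gt', ('sub', 'income', ('mul', 0.3, 'debt')), 2000.0)
applicant = {'income': 3500.0, 'debt': 4000.0}
print('good' if evaluate(program, applicant) else 'bad')
```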

The differences between genetic programming and genetic algorithms lie in the particular representation of the solution: genetic programming produces computer programs or programming language expressions as the solution, whereas genetic algorithms give a string of numbers that represents the solution. The major advantage of genetic algorithms and genetic programming is that one does not have to specify all the details of a problem in advance; instead, potential solutions are evaluated by a fitness function representing the problem to be solved.

3 Research methodology

In order to review the literature that deals with evolutionary computation applied to credit scoring, two groups of keywords were used to cross-search for related papers in seven databases of scientific publications: ProQuest ABI/Inform, Google Scholar, IEEE Xplore, Emerald, ScienceDirect, ACM Digital Library, and Scopus. The first group of descriptors includes computational intelligence, soft computing, artificial intelligence, data mining, machine learning, evolutionary computation, evolutionary programming, genetic algorithm, and genetic programming; the second group comprises credit scoring, credit risk, financial risk, credit evaluation, credit assessment, credit analysis, credit decision, and credit management.

The present review was confined to conference papers and journal articles published within the period 2000–2012 in order to explore and analyse the current state of the art of credit scoring models based on evolutionary computation techniques. The reason why this review focuses on works from the last decade is that there already exist some reviews of past literature (up to 2000) on Statistics and Artificial Intelligence applied to general topics of finance, accounting, human resources, marketing, production, and distribution (Dimitras et al, 1996; Hand and Henley, 1997; Refenes et al, 1997; Wong et al, 1997; Vellido et al, 1999; Thomas, 2000; Metaxiotis and Psarras, 2004), but to the best of our knowledge not specifically to credit scoring.

The methodological framework of this research was defined by analysing a variety of problem types related to general applications of computational intelligence. Thus the articles were classified into three main groups according to the purpose with which the evolutionary computing techniques have been applied to credit scoring: (i) classification, (ii) variable selection, and (iii) parameter optimization. In addition, a fourth group was also included to gather a miscellany of other problems barely covered by evolutionary methods.

3.1. Classification

Classification is one of the main applications of intelligent systems (Komar, 2005). From a practical point of view, credit scoring can be deemed a classical classification or prediction problem in which a new input sample (customer) must be categorized into one of the predefined classes based on a number of observed variables (features, attributes) related to that sample. In this context, the input of the classifier consists of a variety of information that describes the socio-demographic characteristics and economic conditions of the applicant (age, gender, number of dependents, loan purpose, annual income, marital status, educational level, occupation, number of years employed, house owned or rented, estimated assets), and the credit scoring model will produce the output in terms of the customer's creditworthiness. In its most usual form, credit scoring aims at assigning credit applicants to one of two classes: good customers (those who are liable to reimburse the financial obligation) and bad customers (those who should be denied credit because of the high probability of defaulting on repayments).

More formally, given a data set of n customers, S = {(x_1, y_1), …, (x_n, y_n)}, where each customer x_i = (x_i1, x_i2, …, x_iD) is characterized by D variables defined on an input feature space X^D, and y_i ∈ {good, bad} denotes the type of customer, a credit scoring model (the classifier) can be defined as a mapping function f: X^D → {good, bad} that predicts the value y for a new credit applicant x, that is, f(x) = y.
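
The following sketch shows this mapping in code, assuming the scikit-learn library and a toy sample of customers; the three variables, their values, and the choice of logistic regression as the underlying classifier are all illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training set: each row is a customer x_i described by D = 3 variables
# (eg, age, annual income in thousands, years employed); labels are good/bad.
X_train = np.array([[42, 55, 10],
                    [23, 18,  1],
                    [35, 40,  6],
                    [29, 12,  0],
                    [51, 75, 20],
                    [31, 22,  2]])
y_train = np.array(['good', 'bad', 'good', 'bad', 'good', 'bad'])

# The fitted model plays the role of the mapping f: X^D -> {good, bad}
f = LogisticRegression(max_iter=1000).fit(X_train, y_train)

x_new = np.array([[38, 47, 8]])        # a new credit applicant
print(f.predict(x_new))                # predicted class y = f(x)
```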

There are five major characteristics of intelligent systems that make them especially appealing for solving financial and banking problems, as is the case with credit scoring and assessment (Goonatilake and Treleaven, 1995):

  • Learning: The ability to learn decisions and tasks from historical data and/or previous examples.

  • Adaptation: The ability to adapt to a changing environment such as policy changes, new government regulations or changes in economic conditions.

  • Flexibility: The ability to make decisions and perform tasks even in the presence of incomplete or unreliable data.

  • Transparency: The ability to explain how the decisions were reached. In some cases, due to legal or organizational reasons, it is compulsory to provide explanations of how decisions have been made.

  • Discovery: The ability to discover new processes or relationships that were previously unknown.

3.2. Variable selection

Variable or feature selection is the process of finding the best subset from a given set of variables in a data set (Dash and Liu, 1997). This is an important issue in designing classification systems such as scorecards. It is imperative to select the most relevant variables and to limit the number of input features in a scorecard in order to have a more predictive and less computationally intensive model. Furthermore, with a smaller set of variables to decide upon credit approval, the scorecard analyst will gain more insight into the model and better comprehensibility of the decisions made.

The objective of variable selection is three-fold (Guyon and Elisseeff, 2003): improving the performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data.

Suppose a large set of D features is given, from which we have to find a small subset of d features according to some optimality criteria. Although it may not be apparent, the subset selection problem may become prohibitive because of its computational complexity. In general, it will not be possible to rank the variables simply based on their individual properties and select only the best. The feature properties may depend strongly on each other, and a subset of individually weak variables may prove to be rather good because of positive interaction effects. Because of this uncertainty, the only apparent way of searching for optimal subsets is to evaluate all the 2^D possible variable combinations. However, testing all subsets is a combinatorial problem that requires an exponential amount of computational time and consequently, a broad range of sub-optimal feature selection methods has been developed over the last decades.

Traditionally, two families of variable selection methods have been considered: filters and wrappers (George, 2000). Filter techniques use some evaluation function (information theoretic measures, distance measures, consistency measures, correlation measures) that relies solely on properties of the data, thus being independent of the choice of the learning algorithm used to construct the classifier; they attempt to remove irrelevant and redundant variables before training the model. Conversely, wrapper methods utilize the learning algorithm itself to evaluate the candidate subsets of features (Kohavi and John, 1997). In its most general formulation, the wrapper methodology consists of using the prediction performance of a given learning algorithm to assess the relative usefulness of candidate subsets of features. It is argued that, compared with wrappers, filters provide a faster and more generic selection of variables, not tuned to a particular learning algorithm (Guyon and Elisseeff, 2003).
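
A wrapper built around an evolutionary search can be sketched as follows: a binary chromosome marks which of the D variables are kept, and the fitness of a chromosome is the cross-validated accuracy of a classifier trained on that subset. The snippet assumes scikit-learn, uses logistic regression as an arbitrary base learner, and would be driven by a generic genetic algorithm loop such as the one sketched in Section 2.1.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

def subset_fitness(chromosome, X, y):
    """Wrapper fitness: cross-validated accuracy of a classifier trained only on
    the variables whose genes are set to 1 in the binary chromosome."""
    mask = np.array(chromosome, dtype=bool)
    if not mask.any():                       # empty subsets are not admissible
        return 0.0
    scores = cross_val_score(LogisticRegression(max_iter=1000),
                             X[:, mask], y, cv=5, scoring='accuracy')
    return scores.mean()

# Usage with a genetic algorithm such as the one in Section 2.1 (illustrative):
# each chromosome has one gene per input variable, and the search explores the
# 2^D possible subsets for a high-scoring one.
# best_subset = genetic_algorithm(lambda c: subset_fitness(c, X, y),
#                                 n_genes=X.shape[1])
```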

4 Evolutionary computing in credit scoring

This section surveys relevant research works carried out in the context of evolutionary computation applied to credit scoring. The review has been organized around the main problems tackled through the development of genetic algorithms and genetic programming. Table 1 provides a summary with the articles that will be analysed in the next sections.

Table 1 Summary of research in the period 2000–2012

4.1. Classification

In the context of the evolutionary approaches to credit scoring, one has a number of scorecards that mutate and blend together according to their classification fitness (Fogarty and Ireson, 1993). One of the earliest papers comparing genetic algorithms with other credit scoring models is that by Yobas et al (2000). This work compares the predictive performance of a traditional statistical method (linear discriminant analysis, LDA) with that of three computational intelligence techniques (a neural network, a decision tree, and a genetic algorithm), using a small sample of credit scoring data. The authors find that linear discriminant analysis is superior to genetic algorithms and neural networks. Fritz and Hosemann (2000) also conclude that the linear discriminant analysis model is better than genetic algorithms and three other approaches to credit scoring, including artificial neural networks, the M6 decision tree, and the k-nearest neighbours decision rule (KNN). However, the validity of their conclusions can be questioned because they do not utilize the same training and test sets for experimenting with the different techniques. In fact, these results are inconsistent with those of other previous studies (Desai et al, 1996, 1997; Varetto, 1998). More recently, Ong et al (2005) observe that credit classification based on standard genetic programming outperforms other models constructed with artificial neural networks (ANN), decision trees (CART and C4.5), rough sets, and logistic regression (LogR).

Huang et al (2007) compare the performance of genetic programming against neural network, support vector machine (SVM), and C4.5 decision tree models in a credit scoring application using the Australian and German benchmarking databases. Their study reveals better classification accuracy with genetic programming than with the other techniques, although differences are marginal.

Tsakonas et al (2006) evaluate four computational intelligence methods with a sample of enterprises that applied for a loan in a European banking organization, paying special attention to the comprehensibility and ease of use of the acquired decision models. Two of the methods employ grammar-guided genetic programming for the generation of either classification trees (Tsakonas and Dounias, 2002) or fuzzy rule-based systems (Tsakonas et al, 2001). The approaches based on genetic programming appear to achieve somewhat lower accuracy, but their comprehensibility and generalization ability are higher.

Mukerjee et al (2002) present a novel tool for the application of a multi-objective genetic algorithm, namely NSGA-II (Deb, 2001), to a bank credit management problem, with the aim of finding an appropriate trade-off between the multiple objectives of return maximization and risk minimization. The new technique produces an approximation to the set of Pareto-optimal solutions, which may increase the decision flexibility available to the bank manager, and is also computationally efficient when compared with the traditional multi-objective method of epsilon-constraints.

Finlay (2006, 2009) introduces an evolutionary approach to the implementation of credit scoring models, demonstrating that genetic algorithms can perform as well as, if not marginally better than, competing models constructed using neural networks, OLS regression, and logistic regression. For the experiments, the data are supplied by Experian UK and contain details of credit applications made over a given time period.

Chen and Huang (2003) combine genetic algorithms and artificial neural networks for inverse classification in order to reassign the rejected instances (bad customers) to the preferable accepted class (good customers), trying also to analyse the reasons for the rejection of applicants. Zhang et al (2007) develop a combined model based on back-propagation neural networks, genetic programming, and support vector machines, concluding that it can perform better than any single classifier over the Australian and German databases. Jiang et al (2011) use simulated annealing to optimize a genetic algorithm and construct a combined credit scoring model, which achieves the highest prediction rates in comparison with back-propagation and RBF neural networks.

Cai et al (2009) present a genetic algorithm that employs an appropriate fitness evaluation function, taking into account the values of both good and bad customers in the training set; experiments over the German credit database reflect an acceptable accuracy rate on the class of good customers, but poor performance on the samples belonging to bad customers.

The problem of classifying credit data with unequal misclassification costs is studied by Bedingfield and Smith (2003): an evolutionary algorithm is employed to generate systems of classification rules. In addition to the misclassification costs, various other properties of the classification systems generated by the evolutionary algorithm, such as accuracy and coverage, are also considered and discussed.

A two-stage genetic programming approach (called 2SGP) is introduced by Huang et al (2006) to address the credit scoring problem by incorporating the advantages of IF-THEN rules and discriminant functions. From the experimental results over the German and Australian credit databases, the authors conclude that 2SGP is more flexible and performs significantly better than the other models tested (multi-layer perceptron neural network, CART and C4.5 decision trees, k-nearest neighbours rule, rough sets and logistic regression).

Zhang et al (2008a) propose a hybrid credit scoring model by exploiting the advantages of genetic programming and support vector machines. In a first stage, genetic programming is used to derive simple decision rules, and in a second stage, those samples that have not satisfied any rule or have satisfied more than one rule are employed to train a support vector machine and build the discriminant function. In this paper, genetic programming utilizes a reduced set of logical connectives (AND, OR, NOT), the relation operators (⩽, =, ⩾), and the conditional operator (IF-THEN). Moreover, the variables with continuous values are first discretized because many studies have shown that rules with discrete values are often shorter and more understandable, and because discretization can lead to improved predictive accuracy. The hybrid model is empirically compared with support vector machines, single genetic programming, the C4.5 decision tree, logistic regression, and a back-propagation neural network, demonstrating better accuracy of the hybrid approach over the German and Australian databases.

Abdou (2009) investigates the ability of genetic programming in the analysis of credit scoring models using data from Egyptian public sector banks, comparing this technique with probit analysis (a well-studied alternative to logistic regression (Dionne et al, 1996)) and weight of evidence (WOE) measure (a credit scoring model that focuses on the odds ratio of good scores to bad scores (Bailey, 2001)). Experimental results show that genetic programming achieves the highest accuracy rate and also the lowest type-I and type-II errors when compared with the other two techniques. However, the genetic programming approach does not provide the lowest estimated misclassification cost, which corresponds to WOE.

Despite the powerful learning capabilities of genetic algorithms, one major criticism of applying them to credit scoring is their poor comprehensibility (Goldberg, 1989). Several solutions to this opacity problem consist of combining genetic algorithms with other techniques that can be better understood by humans. For instance, Hoffmann et al (2002) employ a boosted genetic fuzzy classification rule (Cordón et al, 1998) on real credit scoring data, finding that it performs better than the NefClass neuro-fuzzy algorithm and the C4.5 decision tree. Similarly, descriptive fuzzy rules are extracted by means of a genetic algorithm in which all fuzzy rules share a common, linguistically interpretable definition of membership functions (Hoffmann et al, 2007). In this work, the genetic fuzzy rules outperform other classifiers in terms of classification accuracy when applied to the German and Australian databases and two credit scoring data sets from Benelux financial institutions.

4.2. Variable selection

An important application field of evolutionary computing is feature selection. For example, Eklund et al (2001) utilize genetic programming to choose the most explanatory variables for credit risk analysis. Liu et al (2008) argue that, in credit scoring applications, characteristics derived from the input data variables are more discriminative and may capture cross-information among those attributes. Because the process of selecting this kind of characteristics can be viewed as a combinatorial optimization problem over mathematical symbols and original attributes, the authors propose a hybrid method with genetic programming and human interaction to determine the derived variables with most practical significance. These characteristics are then used in a linear discriminant analysis credit scoring model, with the conclusion that this may reduce the risk in credit prediction.

Sexton et al (2006) apply a genetic-based algorithm, called the neural network simultaneous optimization algorithm, to a credit approval data set. The algorithm determines those input variables that contribute to the estimation of the underlying function. This can help with the analysis of the problem, improve generalization, and reduce the size of the neural network structure for credit scoring.

In the paper by Huang et al (2007), genetic algorithms are used to simultaneously perform feature selection and parameter optimization in a support vector machine. For the Australian data set, the support vector machine tuned with genetic algorithms achieves better accuracy than the other two approaches (based on grid search and the F-score), even with a smaller number of variables selected; the results over the German database are similar both in terms of accuracy and in the number of features selected.

Sustersic et al (2009) develop a back-propagation neural network credit scoring model with a data preprocessing stage to select a pseudo-optimal subset of the input features. To this end, the authors apply a genetic-based feature selection algorithm (Zupan and Novic, 1999), producing a substantial reduction in the number of variables while retaining high classification accuracy. Their initial hypothesis is that variables with poor information content and co-linear variables might introduce noise into the model, thus leading to a decrease in accuracy.

With the idea of using evolutionary computation to preprocess the data, the K-means clustering algorithm and a genetic algorithm are combined to further improve the accuracy of a credit scoring model based on a C4.5 decision tree (Zhang et al, 2008b). The genetic algorithm is firstly utilized to decrease the number of attributes (by removing redundant variables) and make the decision tree simpler and more understandable, while the clustering algorithm aims at removing noisy data. Computational results show that the use of genetic algorithms and K-means clustering can effectively improve the prediction rate of the credit scoring models analysed, including a decision tree, a neural network, a support vector machine, and a random subspace classifier.

Marinakis et al (2008) propose to optimize the nearest neighbour decision rule through three metaheuristic algorithms: tabu search (Glover, 1989, 1990), genetic algorithms, and ant colony optimization (Dorigo and Stutzle, 2004). The three algorithms are tested over data derived from the loan portfolio of the Commercial Bank of Greece in a credit risk assessment problem. The paper concludes that such algorithms can be successfully applied to the feature selection problem, leading to models that provide good classification results with a small number of attributes. In particular, the ant colony optimization approach appears to be the method with the best accuracy using almost half of the original features.

Similarly, Wang and Huang (2009) use the so-called non-dominated sorting genetic algorithm (Deb et al, 2000) with inter-correlation and intra-correlation for feature selection. The basic idea is to find out the best attribute subset with the property of lower intra-correlation within the set of variables, but higher inter-correlation between the set and the corresponding class, thereby making the set discriminatory. The authors apply the new feature selection algorithm to the Australian and German databases using various classifiers: back-propagation and perceptron multi-layer neural networks, C4.5 decision tree, nearest neighbour rule, support vector machine, naive Bayes classifier, and linear discriminant analysis.

Chi and Hsu (2012) select relevant variables by means of a genetic algorithm to construct a dual scoring model for credit risk management of mortgage accounts, which combines the bank's internal behavioural scoring model with the external credit bureau scoring model. This enables more accurate risk judgment and segmentation, helping to identify the parts of the mortgage portfolio whose management or control needs to be strengthened.

The study by Huang and Wu (2011) compares several prediction models taken from data mining methods, and improves traditional credit scoring techniques by using boosting and genetic algorithms for feature selection. The empirical results indicate that the use of genetic algorithms improves the performance of underlying classifiers and appears to be more robust than the single prediction methods.

Oreski et al (2012) investigate the performance of seven feature selection techniques over a data set from a Croatian bank, showing that the genetic algorithm gives superior results compared with the other approaches. However, when considering different misclassification costs, the evolutionary-based feature selection method appears to be worse than some other techniques. The authors point out that this is because the genetic algorithm aims to optimize the accuracy rate, without taking into account other particular conditions.

4.3. Parameter optimization

Since genetic algorithms are basically an optimization heuristic, Lacerda et al (2005) investigate the use of RBF neural networks designed by means of genetic algorithms and conclude that such an implementation is superior (lowest average classification error and smallest average number of hidden nodes) to other methods in a credit assessment application. Besides, while the other techniques have their parameters defined by a trial-and-error iterative process, the genetic-based approach automatically searches for the parameters of the RBF neural network. A similar but more recent approach (Zhou and Bai, 2008; Zhou et al, 2009) utilizes a genetic algorithm to search for proper parameters of a support vector machine and compares it with three other optimization methods, showing no significant differences in the experimental accuracies. Yu et al (2011) investigate the performance of genetic algorithms for parameter selection in a weighted least squares support vector machine and find that other optimization methods outperform the evolutionary approach.
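
The general pattern behind these parameter optimization studies can be illustrated with a brief sketch (not the implementation of any of the cited papers): each chromosome encodes the exponents of the regularization and kernel parameters of an RBF support vector machine, and its fitness is the cross-validated accuracy, which an evolutionary loop with real-valued crossover and mutation would then maximize. The snippet assumes scikit-learn.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def svm_fitness(chromosome, X, y):
    """Fitness of a chromosome encoding (log2 C, log2 gamma): the mean
    cross-validated accuracy of an RBF support vector machine."""
    C, gamma = 2.0 ** chromosome[0], 2.0 ** chromosome[1]
    clf = SVC(C=C, gamma=gamma, kernel='rbf')
    return cross_val_score(clf, X, y, cv=5).mean()
```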

Wang et al (2008) establish a hybrid neural network based on the combination of genetic and back-propagation algorithms and apply the model to personal credit scoring in commercial banks. The genetic algorithm is applied to optimize the initial weights and biases of a neural network. The experimental results indicate that such a hybrid model can effectively improve the learning ability of a neural network and overcome the drawbacks of the back-propagation algorithm. Analogously, Fu and Liu (2011) construct a hybrid personal credit scoring model by combining RBF neural networks with a genetic algorithm; the new model employs the genetic algorithm to optimize the weights and other parameters of the RBF neural network, so that the model converges rapidly and presents good generalization over the German database.

With the aim of obtaining both accuracy and interpretability, an evolutionary-neuro-fuzzy method is adopted in the works by Lahsasna et al (2008) and Ainon et al (2009). In a first phase, a fuzzy rule base is automatically extracted from a data set using a clustering method, and then a genetic algorithm is used to increase the performance of the fuzzy inference system. Finally, a multi-objective genetic algorithm is applied to keep the accuracy of the fuzzy model at a predefined level and enhance its interpretability by reducing the number of fuzzy sets in the rule base.

A quantum-inspired genetic algorithm for a credit assessment application using a neural approach is proposed by de Pinho et al (2009). Specifically, the genetic algorithm is employed to completely configure a feed-forward neural network in terms of selecting the relevant input variables, the number of neurons in the hidden layer, and all the synaptic weights. The experiments over the Australian database demonstrate the effectiveness of the new model in comparison with other techniques, providing good classification accuracy and a high degree of flexibility.

Yao (2009) defines three strategies to construct hybrid fuzzy support vector machine credit scoring models: using the CART decision tree to select input variables, using multivariate adaptive regression splines (MARS) to select input features, and using a genetic algorithm to optimize model parameters. The experiments carried out over the Australian and German databases demonstrate that the hybrid model has the best overall classification accuracy and also the lowest type-II error.

Eletter et al (2010) support the use of genetic algorithms to improve the process of finding better parameters of a multi-layer feed-forward neural network model for a credit scoring application in the Jordanian commercial banks. Also, Li et al (2011) propose a multiple kernels multi-criteria programming model, which optimizes the parameters based on the idea of natural evolution. Experiments over the Australian and German databases achieve successful results in terms of both the classification accuracy and the number of features selected, when compared with a number of prediction models that include linear discriminant analysis, quadratic discriminant analysis (QDA), support vector machine, C4.5 decision tree, logistic regression, and k-nearest neighbours rule.

Mahmoud et al (2010) claim that a genetic algorithm can help to build optimal C4.5 decision trees by selecting appropriate features and generating better decision trees for credit scoring. Under this assumption, the model proposed by the authors selects and combines the best decision trees according to a given optimality criterion and constructs the final decision tree for the credit scoring of customers.

Xu et al (2010) propose a novel support vector machine-based ensemble model for credit risk assessment. Firstly, the method employs principal component analysis for feature selection, and then several support vector machines with different kernels are trained, using a genetic algorithm to optimize the parameters. Finally, all the results produced by the different support vector machines are combined by several ensemble strategies.

It is accepted that artificial neural networks are powerful tools for classification and regression, but it is difficult and time-consuming to determine the best architecture for a given problem. Correa and González (2011) use genetic algorithms to optimize the architecture of a multi-layer perceptron, in order to improve the predictive power of the credit risk scorecards. The objective function to maximize is based on the ROC curve, and the input variables are the number of hidden layers and units, the activation function, the use or not of a bias, and whether there is a direct connection between the initial and the final layer. The results show that this method outperforms logistic regression and the default neural network architecture.

Vukovic et al (2012) employ genetic algorithms to set the parameters of each preference function and the weights of the attributes in a case-based reasoning model that uses preference theory functions for measuring similarity between samples. More specifically, the model corresponds to the k-nearest neighbours classifier, and the preference functions are based on the Promethee multi-criteria decision-making method. Although the experiments over three credit databases indicate that the k-nearest neighbours rule optimized by genetic algorithms may outperform the traditional model, the study presents some limitations that should be taken into account for further research.

In the paper by Li et al (2012), the parameters of an adaptive support vector machine with a Gaussian kernel are optimized through an evolutionary strategy. The advantages of using a genetic algorithm instead of grid search to conduct parameter optimization lie in very important savings in computation time, as demonstrated by experiments over the Australian and German benchmarking databases and a real-life US commercial bank credit data set. Danenas and Garsva (2012) employ genetic algorithms and particle swarm optimization to select the parameters of a support vector machine.

4.4. Other problems covered by evolutionary techniques

Because of their opacity, support vector machines can sometimes be difficult to apply to credit risk evaluation. To overcome this limitation, rules can be extracted from the trained machine that are easily interpretable by humans and keep as much of the accuracy of the support vector machine as possible. Martens et al (2007) introduce a rule extraction technique based on genetic programming, which is then tested over two credit scoring databases: the Australian benchmarking data set and one obtained from a major Benelux financial institution. The experiments show that the support vector machine rule extraction technique using genetic programming loses only a small percentage in performance compared with support vector machines, whereas it ranks at the top of the comprehensible classification models.

Another problem common to many financial applications is the presence of missing data (samples with missing values in some variables), which may introduce bias and result in misleading conclusions being drawn from the analysis carried out. Sakprasat and Sinclair (2007) focus on how to handle missing data in a credit approval database; the experimental results show that genetic programming can be successfully used with credit data that contain missing values, although it appears to be preferable to preprocess the data and then apply the genetic approach for classification.

Finlay (2010) compares predictive models of continuous financial behaviour with binary (good/bad payer) models of customer default, concluding that the use of continuous models for consumer risk assessment outperforms the traditional binary classification approaches. This work also suggests that scoring functions developed to specifically optimize the profit contribution, using genetic algorithms, outperform scoring functions derived from optimizing more general functions such as the sum of squared errors.

Hens and Tiwari (2012) study the computational time required by two implementations of support vector machines (one based on genetic algorithms, the other using stratified sampling), an artificial neural network, and genetic programming. The authors conclude that the neural network and the evolutionary models can become too costly because they need to optimize a number of parameters in order to achieve accuracy rates similar to those of the support vector machine based on stratified sampling.

García et al (2012) explore the behaviour of 20 noise filtering techniques applied to instance-based credit risk prediction models. The experiments over eight credit databases demonstrate that the evolutionary algorithms obtain the highest set size reduction rates, but no significant differences are achieved in terms of accuracy. On the other hand, Tsai and Chou (2011) employ a genetic algorithm to simultaneously perform feature selection and data reduction as a preprocessing stage when modelling a support vector machine and k-nearest neighbours rule, concluding that instance selection is much more important than feature selection in the domain of credit risk prediction.

5 Discussion and challenges

From the literature review conducted in the previous section, it can be observed that the application of evolutionary computing techniques to credit scoring is an emerging trend that has attracted much attention from practitioners and academics in the last decade. As can be seen in Table 1, this paradigm of computational intelligence has been used for credit risk prediction, but mostly for feature selection and/or parameter optimization as a preprocessing stage that helps increase the performance of other classification models.

While credit scoring systems based on evolutionary computation have been tried with some success in terms of predictive power, they have often been criticized because they are considered black-box techniques whose resulting decisions are not easily interpretable by financial and business analysts (Lucas, 2001). This is one of the main reasons put forward by decision makers for utilizing other more comprehensible classifiers, such as rule-based models and decision trees, rather than evolutionary techniques. In other cases, genetic algorithms and genetic programming have been combined with other models aiming at both accuracy and interpretability (Tsakonas et al, 2006; Hoffmann et al, 2007; Lahsasna et al, 2008; Zhang et al, 2008a, 2008b).

Also, it is worth noting that, in a large percentage of the articles analysed here, the evolutionary algorithms have been included solely to be compared with other models and techniques (see, eg, the studies by Yobas et al (2000), Eklund et al (2001), Ong et al (2005), Huang et al (2007), Marinakis et al (2008), Abdou (2009), Cai et al (2009), Yu et al (2011), Chi and Hsu (2012), Oreski et al (2012), García et al (2012)), whereas there are just a few new developments or modifications of existing algorithms. Examples of relatively new implementations are those proposed by Mukerjee et al (2002), Bedingfield and Smith (2003), Huang et al (2006), Zhang et al (2008a), and de Pinho et al (2009).

One of the key points for the evolutionary algorithms to provide good performance is to properly set several internal parameters, such as the fitness function, the population size, the maximum number of generations, the mutation rate, and the crossover rate. For example, if the mutation rate is too high, the system will never converge towards a stable solution; conversely, if it is too low, the population will usually converge on some local optimum. Although this is an important issue that should be clearly addressed in any scientific publication, we have observed that only a few papers indicate whether or not they have estimated the optimal values of the parameters. Some authors have not even reported the specific parameter settings of the models used in their experiments. The problem is that, without this information, it is difficult for another researcher to replicate the experiments with the aim of providing a point of comparison for new developments.

The review carried out also raises other relevant issues that will be addressed in the next sections. In particular, a discussion of the performance evaluation measures, the statistical significance tests, and the databases used in the experiments may be of great interest to the reader in order to identify deficiencies and gaps in existing publications that should be further explored in future research on the application of evolutionary computation to credit scoring. Finally, we also discuss some issues of data preprocessing and modelling that have not been tackled through evolutionary computation.

5.1. Performance evaluation criteria

A frequent problem encountered in the literature reviewed refers to the variety of criteria employed for assessing model performance. Even more importantly, some of these metrics are not the most appropriate for credit scoring applications because of the different misclassification costs associated with each class (good customers and bad customers).

The accuracy or hit rate, which measures the proportion of correctly predicted cases, has been considered a significant criterion in evaluating the classification capability of the scoring models and in fact, it corresponds to the performance evaluator most commonly used by the authors (more than 60% of the papers studied here). However, empirical and theoretical evidence has demonstrated that the accuracy rate is strongly biased with respect to imbalance in the class distribution and ignores the different misclassification costs (Fawcett and Provost, 1997). In real applications, it is generally believed that the class of bad customers is under-represented in comparison with the class of good customers and, in addition, the costs associated with type-II errors (bad customers misclassified as good) are much higher than the misclassification costs associated with type-I errors (good customers mispredicted as bad) (West, 2000; Baesens et al, 2003).

Even with the important drawbacks related to the accuracy rate, less than 20% of the papers report the type-I and type-II errors, and only two articles (Abdou, 2009; Oreski et al, 2012) employ the estimated misclassification cost (the relative cost of accepting credit applications that become bad versus rejecting applications that would be good). The problem with the estimated misclassification cost criterion is that it is very difficult to obtain reliable estimates of the costs.
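
For reference, the type-I and type-II error rates and an estimated misclassification cost can be computed as in the sketch below; the relative costs of the two error types are purely illustrative, since, as noted above, reliable estimates are hard to obtain in practice.

```python
import numpy as np

def scorecard_metrics(y_true, y_pred, cost_type1=1.0, cost_type2=5.0):
    """Accuracy, type-I/type-II error rates and an estimated misclassification
    cost from good/bad labels; the relative costs are illustrative only."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    good, bad = (y_true == 'good'), (y_true == 'bad')
    type1 = np.mean(y_pred[good] == 'bad') if good.any() else 0.0   # good rejected
    type2 = np.mean(y_pred[bad] == 'good') if bad.any() else 0.0    # bad accepted
    accuracy = np.mean(y_true == y_pred)
    # Expected cost weighted by the prior probability of each class
    cost = cost_type1 * type1 * good.mean() + cost_type2 * type2 * bad.mean()
    return accuracy, type1, type2, cost
```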

The receiver operating characteristic (ROC) curve is one of the most widely used methods for visualization and comparison of the performance of classifiers. It is a two-dimensional graph that plots the proportion of bad cases predicted as bad (sensitivity or true positive rate) versus the proportion of good cases predicted as bad (one minus specificity) at all cut-off score values. In the ROC space, the upper left corner corresponds to perfect classification, whereas a diagonal line represents random classification. The ROC curve shows the behaviour of models independently of misclassification costs and class distributions (Thomas et al, 2002). Despite its usefulness, only three papers of this review (Wang and Huang, 2009; Correa and González, 2011; Chi and Hsu, 2012) have employed ROC analysis to assess the performance of the models.
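
A minimal example of ROC analysis, assuming scikit-learn and a model that outputs an estimated probability of default for each applicant (the labels and scores below are invented for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# 1 denotes a bad payer (the event of interest), 0 a good payer; the scores are
# the model's estimated probabilities of default.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.35, 0.8, 0.6, 0.2, 0.45, 0.5, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print('AUC =', roc_auc_score(y_true, y_score))      # area under the ROC curve
```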

Other metrics used in a few papers include the mean squared error, the Gini coefficient, the area under the ROC curve, the return rate, the Kolmogorov-Smirnov statistic, the sensitivity and specificity (or true negative rate), the coefficient of concordance, the F-measure, and the precision.

Finally, it should be noted that only a small percentage of the studies utilize some measure of efficiency, such as CPU time (Chen and Huang, 2003; Sexton et al, 2006; Liu et al, 2008; Zhou et al, 2009; Correa and González, 2011; Hens and Tiwari, 2012), number of rules generated (Hoffmann et al, 2002, 2007; Martens et al, 2007; Lahsasna et al, 2008; Ainon et al, 2009) and decision tree size (Mahmoud et al, 2010). These two latter metrics, the number of rules and the decision tree size, can also be viewed as a way to measure the comprehensibility or transparency of the model, which has become a very important factor for realistic credit scoring systems.

One can conclude that there does not exist an optimal performance evaluation criterion; rather, it depends on a number of factors relative to each particular application. More specifically, the choice of one measure or another seems to be determined both by the problem in hand and by the availability of certain data (eg, the misclassification costs).

5.2. Statistical significance tests

It is worth noting that most of the articles here reviewed do not contain any formal statistical test, so it is unclear whether or not the apparent differences in performance are significant.

On the other hand, some papers (Lacerda et al, 2005; Hoffmann et al, 2007; Martens et al, 2007; de Pinho et al, 2009; Oreski et al, 2012) include the paired t-test for assessing the statistical significance of differences, but this appears to be conceptually inappropriate and statistically unsafe because parametric tests assume independence, normality, and homogeneity of variance, conditions that are often violated due to the nature of the problems (Demšar, 2006).

In general, the non-parametric procedures, such as the Wilcoxon signed ranks test and the Friedman test with the corresponding post-hoc tests, should be preferred over the parametric ones. In the present review, Sexton et al (2006), García et al (2012) and Vukovic et al (2012) report a Wilcoxon signed rank test to statistically compare the performance of the algorithms, whereas Finlay (2010) employs a Kruskal-Wallis test to show that differences between pairs of techniques are significant.
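
As an illustration of these non-parametric procedures, the sketch below applies the Wilcoxon signed ranks test and the Friedman test, assuming SciPy and per-data-set accuracies for three hypothetical models:

```python
import numpy as np
from scipy.stats import wilcoxon, friedmanchisquare

# Accuracy of three models over the same six credit data sets (invented values)
acc_a = np.array([0.76, 0.81, 0.73, 0.88, 0.79, 0.84])
acc_b = np.array([0.74, 0.80, 0.70, 0.85, 0.78, 0.82])
acc_c = np.array([0.71, 0.78, 0.69, 0.83, 0.77, 0.80])

# Wilcoxon signed ranks test for a pairwise comparison of two models
stat_w, p_pair = wilcoxon(acc_a, acc_b)

# Friedman test for comparing all models over multiple data sets at once
stat_f, p_all = friedmanchisquare(acc_a, acc_b, acc_c)
print(p_pair, p_all)
```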

5.3. Credit databases

Many authors have experimented only with small benchmarking databases such as the well-known German and Australian databases (Frank and Asuncion, 2010) and therefore, the conclusions drawn from a limited amount of results should be taken cautiously because these data sets might not be representative enough of real-life credit scoring applications.

There are also other works that have used data gathered from different sources, such as a bankruptcy data set from a Benelux financial institution (Martens et al, 2007), a data set provided by a subsidiary of the Construction Bank of China (Zhou and Bai, 2008), a loan portfolio of the Commercial Bank of Greece (Marinakis et al, 2008), personal loan data sets from banks in Egypt (Abdou, 2009), data supplied by Experian UK (Finlay, 2006, 2009), a consumer loan database from Jordanian banks (Eletter et al, 2010), data from a US commercial bank (Li et al, 2012), or a credit data set provided by a Croatian bank (Oreski et al, 2012). These papers mainly focus on a single database, turning their models into ad-hoc solutions that can become hard to use under different conditions.

Unfortunately, there are relatively few studies that have conducted experiments over a sufficiently large number of data sets, thus allowing their conclusions to be generalized (see, eg, the investigation by García et al, 2012). This issue is especially important in the case of evolutionary algorithms because they require setting many parameters whose optimal values strongly depend on the characteristics of each database, among other things.

Although it is not easy to access customer credit data from banks and financial institutions due to the laws and regulations on data security and privacy protection of customer information, it seems necessary to create a public repository of credit data that can become available to the academic community, with certain confidentiality restrictions to prevent the identification of individual customers.

5.4. Some data preprocessing issues

Data preprocessing constitutes a critical step in developing high performance classification and prediction systems. The primary aim is to derive a subset of data, which is expected to be a representative sample of the problem to be modelled (Pyle, 1999; Dasu and Johnson, 2003).

There are different tools and methods used for data preprocessing, each one focused on solving some specific problem related to data quality (Zhang et al, 2005). Thus the purpose of sampling is to minimize the size of very large data sets, whereas filtering or cleaning aims at removing noise from data; discretization converts continuous attributes into discrete data, and normalization is a scaling-down transformation of the attributes; feature selection and extraction pull out as much irrelevant and redundant information as possible. Other outstanding issues in real-life credit scoring applications include reject inference (process whereby the performance of previously rejected applicants is inferred primarily using the repayment behaviour of accepted applicants), class imbalance (usually referred to as the low-default portfolio problem in the context of credit scoring), missing values in attributes, and population drift (changes occurring in the population due to unexpected changes in economic conditions and other factors).

Credit data sets are usually heavily imbalanced, that is, the number of default cases is typically very low, which may cause the estimate of the probability of default to be statistically unreliable (Flórez-López, 2010; Kennedy et al, 2010). Also, we typically have information only about the repayment behaviour of those who were granted credit in the past, but not of those previous applicants who were rejected; a model built using data only on accepted applicants will be biased when employed for all applicants. Reject inference investigates how to include those previous applicants in the modelling process in order to make the classifier more accurate and less biased (Crook and Banasik, 2004). This situation can become even worse with the existence of incomplete data, which may be caused by many circumstances (eg, loss of information or refusal of applicants to answer certain questions) (Flórez-López, 2010). On the other hand, if changes occur in the population of applicants, the credit risk prediction model needs some adaptive learning strategy in order to avoid a significant deterioration in performance (Pavlidis et al, 2012).

While feature selection has widely been handled by means of evolutionary computation, the other problems have not been addressed in depth. As already seen in Section 4.4, Sakprasat and Sinclair (2007) utilize genetic programming to deal with the presence of missing data in a credit approval database, Tsai and Chou (2011) conduct feature and instance selection with a genetic algorithm to model a support vector machine and a k-nearest neighbours rule, and García et al (2012) use several evolutionary algorithms to remove noisy samples and reduce the size of the data set. However, no more articles related to data preprocessing with evolutionary computing in credit scoring have been found in the literature review.

6 Conclusions

The future of credit risk analysis is an increased reliance on computerized credit scoring models. Automatic decision-making will never take the place of the financial analyst, but it can help make quick, consistent decisions to approve or deny the majority of transactions that fall above or below certain credit score thresholds.

This paper has presented a literature review of recent applications of evolutionary techniques to credit scoring and assessment. It is now generally admitted that genetic algorithms and genetic programming are efficient, flexible methods for finding optimal or near-optimal solutions in the search space, which makes them especially appealing for a wide variety of economic and financial applications. In this sense, we have found that evolutionary computation has mainly been applied to variable selection and parameter optimization as two important preprocessing steps for further classification with other prediction models, especially artificial neural networks and support vector machines.

An important subject in credit scoring is the comprehensibility or transparency of the model, which has become a key factor for the lending industry. In general, it has been seen that evolutionary algorithms should preferably be combined with (neuro-)fuzzy rules or decision trees to achieve a higher degree of transparency. Another promising method in this direction is the so-called multi-objective genetic algorithm because it is suitable for systems with conflicting goals, which in the case of credit scoring are to increase the predictive power and to reduce the complexity of the models. On the other hand, the use of evolutionary computation in variable selection leads to a reduction of the complexity, which also increases the comprehensibility of the model.

When focusing on classification, one can observe that there is no single best algorithm across different credit databases. A technique may be the best on some particular data sets, but it will perform worse than the other algorithms on other data sets. Therefore it is not possible to conclude that genetic algorithms and genetic programming are better or worse than other models; they are simply an alternative to conventional methods. However, the major interest of using evolutionary methods probably comes from their ability to properly optimize the internal parameters of those classifiers that have been shown to perform well in credit scoring applications.

The present review has also revealed some weaknesses and limitations that should be taken into account in the future: (i) the lack of publicly available credit databases makes it difficult to perform extensive comparisons between methods and thus, several papers have merely included experiments over the German and Australian benchmarking data sets; (ii) there exists a great variety of performance evaluation criteria that have been used in the articles, but some are not adequate due to the different misclassification costs and/or imbalanced class distributions often encountered in credit scoring problems; (iii) the conclusions drawn by most authors have not been supported by statistical tests to demonstrate the significance of the experimental results and, in other cases, the statistical test applied is inappropriate because of the incorrect assumptions made; (iv) although data preprocessing is a very important task in classifier modelling and involves a lot of problems or imperfections in data (eg, missing data, population drift, noisy examples, low-default portfolio), it has been observed that variable selection is the only problem that has largely been tackled with evolutionary techniques.

Finally, it is also interesting to point out that most research in evolutionary computation applied to credit scoring has been published in journals that belong to the field of Computer Science, especially Expert Systems with Applications, but also others such as Applied Soft Computing, Decision Support Systems, Soft Computing, International Journal of Intelligent Systems, and Applied Intelligence. A second group of journals with articles on this topic can be deemed interdisciplinary: European Journal of Operational Research, Journal of Forecasting, Operations Research, and Journal of Global Optimization. Paradoxically, despite the need for cooperation between the scientific communities of Economics and Computer Science, only a limited number of papers on evolutionary computing applications to credit scoring have been published in journals that belong to fields such as Business, Economics, Finance, or Management (eg, Journal of Banking & Finance, American Journal of Economics and Business Administration, Economic Bulletin, and Intelligent Systems in Accounting, Finance & Management).