Wednesday, October 22, 2014

Something about credit scoring

Data science and credit scoring

I really love reading Data Science Central and, in particular, Analytic Bridge. A few days ago, I received their newsletter, which covered a lot of topics, including the ever-popular Big Data. One of the articles gives an overview of a selection of big-data case studies.

One of them speaks briefly (and provides a broken link) about a German, Hamburg-based fintech company, Kreditech, which applies sophisticated data analysis to credit scoring. Exciting, right? Especially considering that Kreditech takes the behavioural angle into account. As reported, the company has found, and makes use of, interesting connections between a person's social media behaviour and their financial credibility. These guys have achieved massive success and have expanded to many countries, including Spain and Russia. On their webpage, they say that they rely on big data and complex machine-learning algorithms to make faster and better scoring decisions.

Some time ago, I read a great article that I found on LinkedIn. It is a blog post by Simon Gibbons, who has worked in the credit business for more than 20 years, and he admits that he is glad there are things about this job that have not changed since he started.

Which got me thinking that machine learning is one of them. Indeed, it is no secret that logistic regression is The Algorithm used everywhere to perform credit scoring. And if you take what is probably the most amazing online course on machine learning available today, offered by Stanford University on the Coursera platform and elsewhere (do act on this urge!), you will very soon find out that logistic regression is one of the most basic machine learning algorithms. Statisticians may laugh now. Economists working in scoring may now say they are advanced in machine learning.

Nevertheless, in my humble opinion, logistic regression does a great job, because its output is interpretable and flexible. First and foremost, the regression coefficients can be translated into the odds of a person being creditworthy conditional on a given factor. Second, the method's output is the probability of a person being creditworthy or not, so when you want to classify people accordingly, you can play a little with the probability threshold. Thus, you can mitigate the risks of misclassification, namely avoid getting too many false positives or false negatives. Third, because the underlying algorithm is standard, it can be upgraded (by adding regularisation terms, for example), which can improve a model's predictive power on new data.
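The analysis in this post is done in R, but the two points above - exponentiated coefficients as odds ratios, and a movable probability threshold - can be sketched in Python with scikit-learn. The data below is a synthetic stand-in, not the German credit dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for a credit dataset: two numeric features.
X = rng.normal(size=(200, 2))
# Creditworthiness depends positively on feature 0, negatively on feature 1.
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Exponentiated coefficients are conditional odds ratios: the multiplicative
# change in the odds of creditworthiness per unit change of a feature.
odds_ratios = np.exp(model.coef_[0])
print(odds_ratios)  # first > 1 (raises odds), second < 1 (lowers odds)

# The model outputs probabilities; the decision threshold is ours to choose.
proba = model.predict_proba(X)[:, 1]
strict = (proba >= 0.7).astype(int)   # fewer, more confident positives
lenient = (proba >= 0.3).astype(int)  # fewer false negatives
```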

A hidden gem: German credit scoring datasets

It is really hard to get personal financial data; it may be even harder than getting clinical data. Luckily, there exists a wonderful German credit scoring dataset. Provided by Munich's LMU university, it has been used extensively by well-known German statistics professors - G. Tutz, L. Fahrmeir and A. Hamerle - for educational purposes. The dataset can be downloaded from the LMU webpage or as part of the R package Fahrmeir, but I suggest the first option, because in the package, at least in my download, the dataset appears to be somewhat trimmed.

In his book "Regression for Categorical Data", in particular in exercise 15.13, Prof. G. Tutz suggests that the reader use this dataset to fit a bunch of classifiers, including linear, logistic, nearest neighbours, tree and random forest methods. It is further proposed to split the set into training and test sets in the proportion 80/20 and compute test errors.
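The 80/20 split itself is a one-liner; here is a Python sketch on a synthetic 1000-row stand-in for the dataset, matching its size.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))        # stand-in for the 1000-row credit data
y = rng.integers(0, 2, size=1000)

# 80/20 split into training and test sets, as the exercise suggests.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(len(X_train), len(X_test))  # 800 200
```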

Playground

I, too, could not resist the temptation to get my hands on this dataset, and I have tried several methods. Below I report on three of them that I am particularly fond of: how they were applied and what results they can give.

At first, the whole dataset was fitted by these models in order to see how well they approximate it and which features they rely upon most. The next step was to split the data several times into train and test sets, as recommended in the book, and check how these models, fitted on the train data, predict on the test data.

Fitting the dataset

Logistic regression

So, to start with, I ran a logistic regression classifier. It yielded the following set of significant predictors:



The following factors are among them:

  • balance of current account (categorised),
  • credit period,
  • moral, namely, the empirically observed tendency to repay borrowed money,
  • purpose of credit,
  • balance of savings account (categorised),
  • being a woman,
  • instalment as a percentage of available income,
  • being a guarantor,
  • time of residence at the current home address,
  • type of one's housing,
  • not being a foreign worker (Gastarbeiter in German).


Here, the values are rounded to 3 digits. Exponentiate the coefficients to obtain the conditional odds ratios.

Then, the goodness of fit can be deduced from a number of statistical tests for models, or by cross-tabulating the actual values against the fitted ones. Here, the threshold issue should be brought up again. The logistic model outputs fitted probabilities; therefore, one can classify the respective observations by setting a decision threshold. The most common and intuitive choice, also suggested by the shape of the logistic function curve, is 0.5. This yields the following fit:


or 78.7% of correctly classified values. In statistics and machine learning, three terms are used quite extensively to assess the goodness of fit of a given classifier: precision, recall and the F1 score:

  • Precision is the share of true positives among all predicted positives.
  • Recall is the share of true positives among all observed positives.
  • F1 score is the harmonic mean of precision and recall; it provides an estimate of how good the classifier is: 0 is the worst value, 1 is the best.
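These definitions are easy to check by hand on a small confusion table; the counts below are made up purely for illustration.

```python
# Made-up confusion-matrix counts for illustration.
tp, fp, fn = 50, 10, 20

precision = tp / (tp + fp)  # true positives among predicted positives
recall = tp / (tp + fn)     # true positives among observed positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.833 0.714 0.769
```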

Depending on the primary aim of a classifier, the threshold value can deviate from 0.5. If one wants to predict creditworthiness very confidently, the threshold can be set higher. If one wants to avoid missing too many worthy people, namely, to avoid false negatives, the threshold can be set lower.
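The moving-threshold idea can be demonstrated on a toy example; the probabilities and labels below are invented, not taken from the fitted model.

```python
import numpy as np

# Toy fitted probabilities and true labels, invented for illustration.
proba = np.array([0.2, 0.4, 0.6, 0.8])
y = np.array([0, 0, 1, 1])

# The same probabilities yield different classifications at each threshold.
for threshold in (0.3, 0.5, 0.7):
    pred = (proba >= threshold).astype(int)
    accuracy = (pred == y).mean()
    print(threshold, accuracy)
```

Lowering the threshold to 0.3 flips the 0.4 case to a (false) positive, raising it to 0.7 flips the 0.6 case to a (false) negative; only 0.5 classifies all four toy cases correctly.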

Here below is the table showing what different threshold values yield:


So, as expected, setting the threshold to 0.7 yields the best precision, the value of 0.3 provides the best recall and the highest F1 score, and the value of 0.5 leads to the highest overall percentage of correctly classified subjects.

Tree

The CART algorithm, introduced in 1984 by Leo Breiman and colleagues, cannot be overestimated. I love this algorithm and I use it a lot. What I particularly like about it is that it can report on variable importance both in terms of Gini importance (or impurity) and information gain. Also, pruning, i.e. reducing the size of a tree, is a very important concept which helps create usable models.

To begin with, a classification tree was fitted with no pruning involved.

Variable importance (Gini):

  • balance of current account: 31
  • duration: 15
  • purpose of credit: 11
  • credit amount: 11
  • value of savings: 10
  • most valuable assets: 9
  • living at current address: 7
  • previous credits: 2
  • type of housing: 1
  • working for current employer: 1
  • job type: 1

These Gini importance values have been rescaled to add up to 100, so one can quickly see the relative importance of the factors.
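The rescaling is just a normalisation of the raw importance scores so they sum to 100. A sketch in Python (scikit-learn's `feature_importances_` sum to 1, so the same one-liner applies); the raw scores below are arbitrary stand-ins, not the actual rpart output.

```python
import numpy as np

# Arbitrary raw importance scores, standing in for rpart's output.
raw = np.array([6.2, 3.0, 2.2, 2.2, 2.0, 1.8, 1.4, 0.4, 0.2, 0.2, 0.2])

# Rescale so the importances sum to 100, making them easy to compare.
rescaled = 100 * raw / raw.sum()
print(np.round(rescaled).astype(int))
```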

The first tree, grown with a complexity parameter of 0.01, turned out to have a size of 81.

In R, a CART tree can be fitted using the rpart package. As for fitted values, rpart can return them either as class labels or as probabilities of belonging to a class. If the latter option is chosen, one can employ the same moving-threshold paradigm.
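In Python, scikit-learn's DecisionTreeClassifier plays the role of rpart in this respect: `predict` gives class labels, `predict_proba` gives class probabilities that can be thresholded by hand. The data and the depth limit below are illustrative assumptions, not the post's actual setup.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# max_depth here is only loosely analogous to rpart's complexity pruning.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

labels = tree.predict(X)             # default class labels (0.5 cut-off)
proba = tree.predict_proba(X)[:, 1]  # class-1 probabilities

# The same moving-threshold idea applies to the probabilities:
strict = (proba >= 0.7).astype(int)

print(round((labels == y).mean(), 3))
```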

Opting for the default fitted values yielded the following classification:


The fit is better than for logistic regression, both in each category and overall: 79.7%.

Refitting the model with splits based on information gain elicited the following variable importances:

  • balance of current account: 35
  • duration: 14
  • purpose of credit: 11
  • moral: 11
  • value of savings: 10
  • credit amount: 7
  • most valuable assets: 6
  • living at current address: 2
  • previous credits: 1
  • type of housing: 1
  • age: 1

This is a slightly different set. The fit is a little worse under the same tree complexity (0.01) - 79.3% - but still a bit better than for logistic regression.

Pruned to a complexity of 0.05, both trees yield the same overall accuracy of fit: 74.7%.


Support Vector Machine

I am a massive fan of SVM, mostly because of the implicit feature-space mappings and the "kernel trick". Support vectors offer a whole different approach to classification, and the underlying models are very flexible, thanks to kernels and regularisation.

The implementation of SVM in R is amazing and is provided by the e1071 package. The default kernel is the Gaussian one, referred to as the radial basis function, and the selection of other implemented kernels includes linear (dot product), polynomial and sigmoid. I think this is an exhaustive set for basic research needs, but I am somewhat interested in implementing other kernels and using them with this classifier. The only minor thing I would change is the naming: I would call the radial basis kernel Gaussian, which it actually is. RBF is a broader term: a Laplacian kernel is also a radial basis function. But I am being picky, perhaps.
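The same kernel menu is available in scikit-learn's SVC (both it and e1071 wrap libsvm), so the comparison can be sketched in Python; the data here is synthetic, so the numbers will not match the post's tables.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = (X[:, 0] ** 2 + X[:, 1] > 0.5).astype(int)

# The same kernel menu as in e1071: Gaussian ("rbf"), linear, poly, sigmoid.
fits = {}
for kernel in ("rbf", "linear", "poly", "sigmoid"):
    model = SVC(kernel=kernel, C=1.0).fit(X, y)
    fits[kernel] = model
    # Report the number of support vectors and the training accuracy.
    print(kernel, int(model.n_support_.sum()), round(model.score(X, y), 3))
```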

Anyway, as my aim was simply to fit the dataset, I considered it OK to massively overfit it and set the regularisation term to whatever worked.

The table below reports accuracy of classification for different kernels and cost parameters (in %):


For the Gaussian and polynomial (of degree 3, which is the default) kernels, the fit improves drastically as the cost parameter grows. Below is a similar table, but reflecting the number of support vectors each model relies upon:



The size of the dataset is 1000 observations.

The gamma parameter (the kernel scale parameter), which by default in the function equals 1/(number of features) - dummies included - has remained untouched.
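For reference, e1071's default of 1/(number of features) corresponds to `gamma="auto"` in scikit-learn, whose own default, `"scale"`, additionally divides by the feature variance; a quick sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))  # 4 features, dummies included

# e1071's default / sklearn's gamma="auto": 1 / (number of features).
gamma_auto = 1.0 / X.shape[1]
# sklearn's default gamma="scale": 1 / (n_features * variance of X).
gamma_scale = 1.0 / (X.shape[1] * X.var())

print(gamma_auto, round(gamma_scale, 3))
```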


Prediction

The second part of the exercise from Prof. Tutz's book suggests splitting the dataset several times into train and test parts, then fitting the model on the first part and testing its predictive performance on the second.

Using random sampling, I split the dataset 10 times, assigning 20% of it to the training set and 80% to the test set. The tree model used the default complexity parameter of 0.01. The SVM model was implemented with the Gaussian kernel and a cost parameter of 5.
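The resampling loop can be sketched in Python as follows; the scikit-learn models stand in for the R ones (glm, rpart, e1071's svm), the data is synthetic, and the tree's depth limit is an assumption, so the scores will not match the post's figures.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.8, size=1000) > 0).astype(int)

models = {
    "logit": LogisticRegression(),
    "tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "svm": SVC(kernel="rbf", C=5.0),
}

# 10 random 20/80 train/test splits, as described above.
scores = {name: [] for name in models}
for trial in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.8, random_state=trial)
    for name, model in models.items():
        scores[name].append(model.fit(X_tr, y_tr).score(X_te, y_te))

for name, vals in scores.items():
    print(name, round(np.mean(vals), 3), round(np.std(vals), 3))
```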

Below are the validation results for each trial, reported as the percentage of correctly classified cases in the test set:


And the summary of the results:




As seen from the boxplot, SVM outperforms the other methods in prediction accuracy, followed by logistic regression. CART, however, had a higher average performance than logistic regression, and the smallest variability of results of the three (SD = 1.81%).

I then ran the same analysis resampling the data 100 times and came up with the following results:


The respective standard deviations are:

  • Logistic regression: 2.44 %
  • CART: 2.85 %
  • SVM: 2.75 %

Finally, I ran 1000 iterations of the same analysis, just to see if the results hold. And they hold:



As for variability of results, the respective standard deviations were:

  • Logistic regression: 2.84 %
  • CART: 2.86 %
  • SVM: 2.66 %

This comparison could be taken several steps further. Namely, the data could be split into train, validation and test sets, where the first serves to fit the model, the second to tune the parameters, and the third to test the performance of the resulting classifier. However, there is always room for improvement, and these results can already give one an idea of the methods.
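The three-way split is just two chained random splits; a sketch in Python with hypothetical 60/20/20 proportions (the post does not specify any).

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First carve out the test set, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 * 0.8 = 0.2

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```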


If you are still bearing with me, please let me draw your attention to the existence of such an important predictor in the dataset as the presence or absence of a telephone in a person's possession. I believe back in the day they meant landline phones, not even the oversized mobile Motorolas. What would it be now, an iPhone 6?
