Olga Ivina’s blog: October 2014

Thursday, 30 October 2014

Lean Startup Munich: from bleeding edge to leading edge

Going lean in Munich

Almost anyone who speaks English and has given a thought to founding or joining a startup is surely aware of the Lean Startup philosophy.

The concept was invented by Eric Ries, a young and successful U.S. entrepreneur who wrote a great book explaining how to build and run startup businesses successfully. The book has had a major impact, and the idea has grown into an international movement with many practitioners across the globe.

Munich, too, has a top-notch group of lean startuppers, and it has been up and running since 2012. The group gathers once a month, and the meetings are organised through the Meetup platform. Like everything in Bavaria, these meetings are of superb quality: the venues are great, the speakers are awesome, and the public is friendly and inspiring. Also, like everything in Bavaria, these meetups involve beer. Good beer. And pizzas. Both are courtesy of the group's numerous sponsors.

Yesterday, there was a meetup. I have decided to blog about it because it was just great.

Here's an outline of the talks:

Remote employees for an IT company: yay or nay? A CTO's view

The first presentation was given by Dimitar Siljanovski, the CTO of Cuponation, a successful startup that offers retail discount coupons to customers and operates all across Europe. A tech person himself, but also an executive, Dimitar talked about hiring and managing remote teams for IT businesses.

In his talk, he explained the main motivations for remote hires: cost and quality. I found it particularly funny that Dimitar mentioned that they prefer to hire Ukrainians rather than Russians, because the latter are too expensive. He told the audience where and how they find people, and he was able to show that the concept actually works: the churn rate among in-house staff is twice as high as that among remotes! To say nothing of the fact that very few employees have left the company at all.

Dimitar also elaborated quite a bit on how they at Cuponation keep the remotes motivated (e.g. promotions, relocation etc.) and on how developers located elsewhere help expand the teams locally. He mentioned that one of the obstacles for foreign developers was a slight insecurity about their English language skills, for example when they need to participate in meetings, and explained how easily this can be overcome. He stressed that understanding different cultures is of the highest importance. Dimitar also admitted that one of the issues he once bumped into was a conflict between two dazzling personalities. He confessed that he always aims to hire the best, those he himself can learn from, but it is understandable that when two stars collide it might cause problems for the whole team. Such issues are, however, also manageable, as far as he explained.

All in all, remote working in IT is working. The complete presentation can be found here.

Hogwarts for entrepreneurs: The Founder Institute

The second talk was presented by Jan Kennedy from the Founder Institute. This is an educational organisation backed by Microsoft which trains people to be entrepreneurs. They start by testing whether you have the so-called Entrepreneur DNA, which determines how apt you are to found and run a business. More information on that can be found here, but as far as I gathered from Jan's brief presentation, one has to be a determined and well-balanced person in order to succeed in building a company from scratch. Once the test is passed, you can apply for the program. Then, for 4 months, you get mentored on how to be an entrepreneur, and you are promised a personal approach based on your own strengths and weaknesses as revealed by the DNA test.

The enrollment deadline for Munich is approaching, but if you are elsewhere and are curious about the FI activities, you can attend one of their events listed here.

When your train is late, there is a business opportunity there

Finally, the last talk was presented by a lean startupper, Thomas Hartmann, who took up a common problem and found a market in it.

Not many people are aware of it, but, according to EU legislation, if your train is 1 hour late, you are entitled to a refund of 25% of your travel fare. If the delay reaches or exceeds 2 hours, your compensation rises to a whopping 50%. This is very well explained here (in German).

However, train companies seem to be reluctant to give your money back. For example, with DB, the main German train company, you can only request your refund at the station, in person, from a company representative, or by writing them a letter. No e-mails or phone calls are accepted, and there is no app for that either.

Or rather, there was no such app. Until Thomas came up with the idea of setting up a service that would help people get reimbursed without making them fill in and file all these forms or lose time queuing among other frustrated customers. He has created a business called Bahn-Erstattung.de which aims to simplify the refund process for end users. All you have to do is send a picture of your ticket and a repayment request. You do so via a smartphone app. The company takes it from there, and you have your money back in about a month.

This idea, simple and smart, aroused great interest in the audience. When people asked Thomas what he would have done differently if he were to start all over again, he said that he would not invest that much in the technical part. Similarly to Cuponation, he also hired a remote developer, in this particular case from India, to create the software. And this has worked well for him.

Is remote work a one-size-fits-all solution?

To close the loop, I would like to add a couple of thoughts of my own on remote work. The successes achieved by hiring remote programmers have got me thinking whether long-distance collaboration is a good idea for any kind of labour. I have found a very interesting article in the MIT Sloan Management Review explaining how to "set up remote workers to thrive". The author lists four challenges standing in the way of distant working and then takes a closer look at each of them, suggesting solutions to the problems. In my opinion, this article is not only an absorbing read, but also the result of thorough research backed by 13 publications. What surprised me quite a bit is that no particular kind of job is listed anywhere in the text. This suggests that remote work challenges are generic, regardless of what you actually do for the company. Since these issues are (allegedly) manageable, one may be able to work successfully from "a Galaxy far, far away" doing many different things, and not necessarily writing code.

This kind of work relationship might not suit every kind of personality and situation, but it is worth considering.

Wednesday, 22 October 2014

A bit about credit scoring

Data science and credit scoring

I really love reading Data Science Central and, in particular, Analytic Bridge. Several days ago, I received their newsletter, in which they reported on a lot of stuff including the ever-popular Big Data. One of the articles gives an overview of a selection of case studies on big data.

One of them speaks (briefly, and with a wrong link) about a German, Hamburg-based fintech company, Kreditech, that applies sophisticated data analysis for the purposes of credit scoring. Exciting, right? Especially considering that Kreditech takes the behavioural angle into account. As reported, the company has found and makes use of interesting connections between a person's social media behaviour and their financial credibility. These guys have achieved massive success, and they have expanded the shop to many countries, including Spain and Russia. On their webpage, they say that they rely on big data and complex machine-learning algorithms to make faster and better scoring decisions.

Some time ago, I read a great article that I found on LinkedIn. It is a blog post from Simon Gibbons, someone who has worked in the credit business for more than 20 years, and he admits that he is glad that there are things about this job that have not changed since then.

Which got me thinking that machine learning is one of them. Indeed, it is no secret that logistic regression is The Algorithm used to perform credit scoring. And if you want to take probably the most amazing online course on machine learning existing nowadays, offered by Stanford University on the Coursera platform and elsewhere (do act on this urge!), you will very soon find out that logistic regression is one of the very basic machine learning algorithms. Statisticians may laugh now. Economists working in scoring may now say they are advanced in machine learning.

Nevertheless, in my humble opinion, logistic regression does a great job, because its output is easy to interpret and flexible. First and foremost, the regression coefficients can be translated into the odds of a person being credit-worthy conditional on a given factor. Second, the output of the method is the probability of a person being credit-worthy. So, when you want to classify people accordingly, you can play a little bit with the probability threshold value. Thus, you can mitigate the risks of misclassification, namely avoid getting too many false positives or, conversely, too many false negatives. Third, because the underlying algorithm is standard, it can be upgraded (by adding regularisation terms, for example), and this can improve a model's predictive power on new data.

A hidden gem: German credit scoring datasets

It is really hard to get personal financial data - it may be even harder than getting clinical data. Luckily, there exists a wonderful German credit scoring data set. Provided by Munich's LMU university, it has been used extensively by well-known German statistics professors - G. Tutz, L. Fahrmeir and A. Hamerle - for educational purposes. The dataset can be downloaded from the LMU webpage or as a part of the R package Fahrmeir, but I suggest the first option, because in the package, at least in my download, the dataset appears to be somewhat trimmed.

In his book "Regression for Categorical Data", specifically in exercise 15.13, Dr. G. Tutz suggests that the reader use this dataset to fit a bunch of classifiers, including linear, logistic, nearest neighbours, trees or random forest methods. It is further proposed to split the set into training and test sets in the proportion 80/20 and to compute test errors.

Playground

I, too, could not resist the temptation to get my hands on this dataset, and I've tried several methods. Below I report on three of them that I am particularly fond of: their application and the results obtained.

At first, the whole dataset was fitted by these models in order to see how they approximate it and which features they rely upon most. The next step was to split the data several times into train and test sets, as recommended in the book, and test how these models, fitted on train data, predict on test data.

Fitting the dataset

Logistic regression

So, to start with, I ran a logistic regression classifier. This yielded the following set of significant predictors:

The following factors are among them:

balance of current account (categorised),
credit period,
moral, namely the empirically observed tendency to repay borrowed money,
purpose of credit,
balance of savings account (categorised),
being a woman,
installment in percent of available income,
being a guarantor,
time of residence at the current home address,
type of one's housing,
not being foreign workforce (Gastarbeiter in German).

Here, the values are rounded to 3 digits. Exponentiate the coefficients to obtain the conditional odds ratios.
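To make the exponentiation step concrete, here is a minimal Python sketch. The coefficient names and values are hypothetical stand-ins, not the fitted values from this analysis, whose table is not reproduced here.

```python
import math

# Hypothetical log-odds coefficients, for illustration only.
# They are NOT the fitted values from this post.
coefficients = {"credit_period": -0.03, "guarantor": 0.5}


def odds_ratio(beta):
    """Multiplicative change in the odds for a one-unit increase in a predictor."""
    return math.exp(beta)


odds_ratios = {name: round(odds_ratio(beta), 3) for name, beta in coefficients.items()}
print(odds_ratios)
```

An odds ratio above 1 means the factor raises the odds of the modelled outcome, a value below 1 means it lowers them, and a coefficient of 0 leaves the odds unchanged.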

Then, the goodness of fit can be assessed with a bunch of statistical tests for models, or by cross-tabulating the actual values against the fitted ones. Here, the threshold issue should be brought up again. The logistic model outputs fitted probabilities. Therefore, one can classify the respective cases by setting a decision threshold. The most common and intuitive choice, also suggested by the shape of the logistic function curve, is 0.5. This would yield the following fit:

or 78.7% of correctly classified values. In statistics and machine learning, three measures are used quite extensively to assess the goodness of fit of a given classifier, namely precision, recall and the F1-score:

Precision is the share of true positives among all predicted positives.
Recall is the share of true positives among all observed positives.
F1 score is the harmonic mean of precision and recall that provides an estimate of how good the classifier is: 0 is the worst value, 1 is the best value.
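The three measures can be sketched in a few lines of Python. The label vectors below are toy data invented for the example, with 1 standing for a credit-worthy person. F1 is computed here as the harmonic mean of precision and recall, which is its usual definition.

```python
def precision_recall_f1(actual, predicted):
    """Return (precision, recall, F1) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels, made up for illustration.
actual = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 0, 1, 1, 0, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(actual, predicted)
print(precision, recall, f1)
```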

Depending on what the primary aim of a classifier is, the threshold value can deviate from 0.5. If one wants to predict credit worthiness very confidently, then the threshold can be set higher. If one wants to avoid missing too many worthy people, namely, to avoid false negatives, the threshold can be set lower.

Here below is the table showing what different threshold values yield:

So, as expected, setting the threshold to 0.7 yields the best precision, the value of 0.3 provides the best recall and the highest F1-score, and the value of 0.5 leads to the highest overall percentage of correctly classified subjects.
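The moving-threshold idea can be sketched as follows in Python. The fitted probabilities and labels are invented toy values, so the numbers printed say nothing about the German credit data; the point is only how raising the threshold trades false positives for false negatives.

```python
def classify(probabilities, threshold=0.5):
    """Label a case 1 (credit-worthy) when its fitted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def error_counts(actual, predicted):
    """Return (false positives, false negatives)."""
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return fp, fn


# Toy fitted probabilities and true labels, made up for illustration.
fitted = [0.95, 0.81, 0.66, 0.58, 0.42, 0.35, 0.74, 0.28]
actual = [1, 1, 1, 0, 1, 0, 1, 0]

for threshold in (0.3, 0.5, 0.7):
    predicted = classify(fitted, threshold)
    fp, fn = error_counts(actual, predicted)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```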

Tree

The CART algorithm, introduced in 1984 by Leo Breiman and colleagues, should not be underestimated. I love this algorithm and I use it a lot. What I particularly like about it is that it can report on variable importance both in terms of GINI importance (or impurity) and information gain. Also, pruning, i.e. reducing the size of a tree, is a very important concept which helps create usable models.

In the beginning, a classification tree was fitted with no pruning involved.
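As a rough illustration of the two split criteria, the Python sketch below computes the Gini impurity and the entropy of a node, and the reduction in either that a candidate split achieves (with entropy, that reduction is the information gain). The class counts are invented toy numbers, and the code is a textbook-style sketch rather than what any particular tree package does internally.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())


def split_gain(parent, left, right, impurity):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - weighted


# Toy node with 6 good and 4 bad credits, split into two children.
parent = ["good"] * 6 + ["bad"] * 4
left = ["good"] * 5 + ["bad"] * 1
right = ["good"] * 1 + ["bad"] * 3

print("Gini decrease:   ", split_gain(parent, left, right, gini))
print("Information gain:", split_gain(parent, left, right, entropy))
```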

Variable importance (GINI):

balance of current account: 31
duration: 15
purpose of credit: 11
credit amount: 11
value of savings: 10
most valuable assets: 9
living at current address: 7
previous credits: 2
type of housing: 1
working for current employer: 1
job type: 1

These values of GINI impurities have been rescaled to add up to 100, so one can quickly see the relative importance of the factors.
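The rescaling itself is a one-liner; a small Python sketch follows. The raw importance values are hypothetical, since the unscaled numbers are not given here.

```python
def rescale_to_100(raw_importance):
    """Rescale non-negative importances so that they add up to 100."""
    total = sum(raw_importance.values())
    if total == 0:
        raise ValueError("all importances are zero")
    return {name: 100.0 * value / total for name, value in raw_importance.items()}


# Hypothetical raw importances, made up for illustration.
raw = {"balance of current account": 46.5, "duration": 22.5, "purpose of credit": 16.5}
scaled = rescale_to_100(raw)
for name, value in sorted(scaled.items(), key=lambda item: -item[1]):
    print(f"{name}: {round(value)}")
```

Rounding each value separately can make the printed figures add up to slightly more or less than 100.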

The first tree with the complexity parameter of 0.01 has resulted to have 81.

In R, a CART tree can be fitted using the package rpart. When it comes to fitted values, rpart can return them, in particular, both as fitted values and as probabilities of belonging to a class. If the latter option is chosen, then one can employ the same moving threshold paradigm.

Opting for the default fitted values has yielded the following classification:

The fit is better than for logistic regression: in each category in particular and overall - 79.7%.

Refitting the model with splits based on information gain yielded the following variable importances:

balance of current account: 35
duration: 14
purpose of credit: 11
moral: 11
value of savings: 10
credit amount: 7
most valuable assets: 6
living at current address: 2
previous credits: 1
type of housing: 1
age: 1

This is a slightly different set. The fit is a little bit worse - under the same complexity of the tree (0.01) - 79.3%, but still a bit better than for logistic regression.

Pruned to a complexity of 0.05, both trees yield the same overall accuracy of fit: 74.7%

Support Vector Machine

I am a massive fan of SVM. Mostly because of all this dimensionality reduction and the "kernel trick". Support vectors offer a whole different approach to classification, and the underlying models are very flexible - because of kernels and regularisation.

The implementation of SVM in R is amazing and is available via the e1071 package. The default kernel is Gaussian, which is referred to as the radial basis function, and the selection of other implemented kernels includes linear (dot product), polynomial and sigmoid. I think that this is an exhaustive set for basic research needs, but I am kind of interested in implementing other kernels and using them with this classifier. The only minor thing that I'd change is that I'd call the radial basis kernel Gaussian - which it actually is. RBF is a broader term: a Laplacian kernel is also a radial basis function. But I'm being picky, perhaps.

Anyway, as my aim was to fit the dataset, I considered it OK to massively overfit it and set the regularisation term to whatever works.
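The naming point can be illustrated with a short Python sketch of both kernels. The Gaussian form exp(-gamma * ||x - y||^2) is the usual definition of the radial kernel; for the Laplacian kernel the sketch assumes the Euclidean-distance form exp(-gamma * ||x - y||), whereas some libraries use the Manhattan distance instead. Both depend on the two points only through the distance between them, which is what makes them radial basis functions.

```python
import math


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def gaussian_kernel(x, y, gamma):
    """exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * euclidean_distance(x, y) ** 2)


def laplacian_kernel(x, y, gamma):
    """exp(-gamma * Euclidean distance); the choice of norm is an assumption here."""
    return math.exp(-gamma * euclidean_distance(x, y))


x, y = [0.0, 0.0], [3.0, 4.0]  # toy points at distance 5
gamma = 1.0 / len(x)           # 1 / (number of features)
print(gaussian_kernel(x, y, gamma), laplacian_kernel(x, y, gamma))
```

At the same gamma, the Gaussian kernel decays much faster with distance than the Laplacian one once the distance exceeds 1.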

The table below reports accuracy of classification for different kernels and cost parameters (in %):

For Gaussian and polynomial (of degree 3, which is the default) kernels the fit improves drastically with the growth of the cost parameter. Here below is a similar table but reflecting the number of support vectors each model relies upon:

The size of the dataset is 1000 observations.

The gamma parameter (or the kernel scale parameter), which by default in the function equals 1/(number of features), including dummies, has remained untouched.

Prediction

The second part of the exercise from Prof. Tutz's book suggests splitting the dataset several times into train and test parts, then fitting the model on the first part and testing its predictive performance on the second.

Using random sampling, I have split the dataset 10 times, assigning 80% of it to the training set and 20% to the test set. The tree model has used the default complexity parameter of 0.01. The SVM model has been implemented with the use of the Gaussian kernel and the cost parameter of 5.

Below are the validation results for each trial, reported as the percentage of test-set cases classified correctly:
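The resampling procedure can be sketched in Python as below. To keep the sketch self-contained, a majority-class guesser stands in for the real classifiers, and the labels are a toy vector of 1000 cases whose 70/30 class balance is an assumption made for the example. The training fraction is a parameter, set here to the 80/20 split suggested in the book.

```python
import random
import statistics


def train_test_split(rows, train_fraction, rng):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def majority_class(train_labels):
    """Stand-in 'model': always predict the most frequent training label."""
    return max(set(train_labels), key=train_labels.count)


def repeated_holdout(labels, n_repeats, train_fraction, seed=0):
    """Return the test accuracy (in %) of the stand-in model for each random split."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_repeats):
        train, test = train_test_split(labels, train_fraction, rng)
        guess = majority_class(train)
        hits = sum(1 for label in test if label == guess)
        accuracies.append(100.0 * hits / len(test))
    return accuracies


# Toy labels: 1000 cases, class balance assumed for illustration.
labels = [1] * 700 + [0] * 300
accuracies = repeated_holdout(labels, n_repeats=10, train_fraction=0.8)
print("mean:", statistics.mean(accuracies), "SD:", statistics.stdev(accuracies))
```

Swapping the stand-in for a fitted logistic regression, tree or SVM gives the kind of per-trial accuracy table and standard deviations discussed here.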

And the summary of the results:

As seen from the boxplot, SVM outperforms other methods in prediction accuracy, followed by logistic regression. CART, however, had higher average performance than logistic regression, and the smallest variability of results of the three (SD=1.81%).

I then ran the same analysis resampling the data 100 times and came up with the following results:

The respective standard deviations are:

Logistic regression: 2.44 %
CART: 2.85 %
SVM: 2.75 %

Finally, I have run 1000 iterations of the same analysis, just to see if the results hold. And they hold:

As for variability of results, the respective standard deviations were:

Logistic regression: 2.84 %
CART: 2.86 %
SVM: 2.66 %

This comparison could be taken several steps further. Namely, the data could be split into train, cross-validation and test sets, where the first serves to fit the model, the second - to adjust the parameters, and the third - to test the performance of the resulting classifier. However, there is always room for improvement, and these results already can provide one with an idea of the methods.
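A three-way split of this kind might look as follows in Python. The 60/20/20 proportions are an assumption chosen for the example, not a recommendation taken from the book.

```python
import random


def three_way_split(rows, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle the rows and cut them into train, validation and test parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_validation = round(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test


# Row indices of a dataset with 1000 observations.
train, validation, test = three_way_split(range(1000))
print(len(train), len(validation), len(test))
```

The model is fitted on the first part, parameters such as the tree complexity or the SVM cost are chosen on the second, and the third is touched only once, for the final performance estimate.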

If you are still bearing with me, please let me draw your attention to the existence of such an important predictor in the dataset as the presence or absence of a telephone in a person's possession. I believe that, back in the day, they meant landline phones, not even the oversized mobile Motorolas. What would it be now, an iPhone 6?

Monday, 13 October 2014

Unit tests in R made simple

There is nothing new under the moon.

During the last six months, I have been working mostly in R. R is great for research purposes, and I am not participating in these endless discussions about what is cooler: R, Python, Matlab, SAS or you name it. As I am privileged to speak all of the above-mentioned languages with greater or lesser fluency, I can compare, and therefore I think that it all comes down to what you want to do in the end.

One of the things that I have adopted from my working-exclusively-in-Python experience is the test-driven development (TDD) paradigm. Now, even when writing my research code in R, I can't help creating these tests.

There is actually not much new to say about unit testing, because the topic is extensively covered elsewhere. In my humble opinion, this blog post offers the most awesome coverage of unit testing that I have ever seen.

TDD in general and unit tests in particular are often neglected by R users - unless they are writing a package.

I think the added value of unit tests for research code cannot be overestimated since, despite popular beliefs of people unfamiliar with R, the language is much more than - as one of my classmates liked to put it - "a sophisticated statistical calculator".

Of course, many, many research findings have been successfully made employing script-based code, but when you have to do similar things multiple times, and when you can wrap your code up and make it unfold beautifully with every call, testing comes in very handy.

R has a certain characteristic: there exists at least one implementation (i.e., package) for almost anything. For some things, there are multiple ways to do them. I don't really know why people reinvent the wheel, but my guess is that when the current state of things is not working for them, they prefer to start from scratch rather than dig into someone else's code.

So, if you are eager to unit test your thoroughly developed work, you can opt for - at least - these three packages:

The last one does not seem to be used very often. The second is famous because it was developed by Hadley Wickham himself, and is allegedly used by him in his packages. To those unfamiliar with the name, let me just say that he is the reference R guy, a visualisation guru and the ggplot2 creator. He has a $60 book published by Springer Verlag and a stackoverflowing reputation on Stack Overflow.

I am using the first package from the list, RUnit, and not for the reason that it has been created by fellow German people working in the field of epidemiology. I do so merely because RUnit is so similar to the unit testing framework of Python that I am already familiar with. It is reportedly similar to the unit testing approach implemented in Java. I don't know Java, so I can't tell. What I can tell is that RUnit is great to use. It is clear, comprehensive and unambiguous. Moreover, it comes with a terrific reference manual that is a great read - apart from being informative. It provides a simple yet exhaustive explanation of what unit tests are, why they are helpful and how they differ from integration tests. Also, it provides guidelines on how to write unit tests. It is quite unlikely to encounter a line like:

"Once a bug has been found, add a corresponding test case"

or like:

"Develop test cases parallel to implementing your functionality. Keep testing all the time (code - test - simplify cycle)."

in an R document (and I've read quite a few of them).

This blog post provides a nice comparison of RUnit and testthat.

Unfortunately, to my knowledge, there exists no implementation of test suites in any of the R IDEs. But this is not a major problem, especially for those R users who, like me, started their journey with R using the console only.

So, if you want to define a test suite in R, all you need to do is load the library, then call defineTestSuite(), runTestSuite() and, if you wish, printTextProtocol() for your tests.

Like that:
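As a rough sketch of the same three steps (define a suite, run it, print a text report) in Python's unittest, the framework RUnit is likened to above. This is an analogue and not RUnit code; the function under test and the test names are made up for illustration.

```python
import io
import unittest


def odds_to_probability(odds):
    """Hypothetical function under test: convert odds to a probability."""
    return odds / (1.0 + odds)


class OddsTests(unittest.TestCase):
    def test_even_odds_give_one_half(self):
        self.assertAlmostEqual(odds_to_probability(1.0), 0.5)

    def test_zero_odds_give_zero(self):
        self.assertEqual(odds_to_probability(0.0), 0.0)


# 1. Define the test suite.
suite = unittest.TestLoader().loadTestsFromTestCase(OddsTests)

# 2. Run it, collecting the text report in a buffer.
report = io.StringIO()
result = unittest.TextTestRunner(stream=report, verbosity=2).run(suite)

# 3. Print the text protocol.
print(report.getvalue())
```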
