Olga Ivina’s blog: 2014

Monday, 1 December 2014

Subjective measures of one's solvency

Some time ago, when I was looking into the German credit scoring dataset, different models showed that a person's "moral", namely their empirically observed tendency to return borrowed money, is one of the main predictors of their credit score.

Which is kind of obvious.

This measure of "moral" can be deduced when data on one's credit history is available. In countries with long-established credit traditions, like Germany, it is normally not a problem to get this empirical evidence. However, in some countries this is not the case. Moreover, you can always stumble upon people who have not applied for any kind of loan, ever, anywhere.

How would you go about this problem?

For example, you can rely on a report coming from an independent credit agency (think SCHUFA in Germany). They track your incomes and expenditure and come up with a verdict. You can reach out to these agencies when you need to be backed up for, say, renting a flat, too. Such ratings are not available everywhere, and also, their estimate is not always the ultimate truth: SCHUFA has allegedly committed some errors in its analyses.

The second option to consider is to try to come up with a tailored subjective estimate of your solvency based on your behavior and mindset. These two options should be taken into consideration separately.

In 2012, SCHUFA announced that they were going to use personal data from people's Facebook profiles in their scoring, which, of course, drew a lot of negative feedback. Nevertheless, Kreditech, a Hamburg-based fintech startup founded in 2012 that provides microcredits, does act on this SCHUFA initiative. According to this article in the Economist, Kreditech asks a potential borrower for access to their data on Facebook and, based on their profile and contacts, can infer whether this person is likely to return a loan. The Economist quotes a Kreditech representative as saying that an applicant with a Facebook friend who has defaulted on a firm's loan would most probably be rejected.

This kind of verdict is well in line with the proverb: "Tell me who your friend is and I'll tell you who you are". However, an average person aged between 25 and 43 has 360 friends on Facebook. This is in the US, but you get my point. So, out of these 360, one could have a friend (or two) who is a taxidermist, a tuk tuk driver or who has otherwise failed in grad school, but that does not help infer anything about the applicant. I am still waiting for a reply from Kreditech; hopefully, I'll receive it sometime.

Facebook should indeed make a very good living from people's mundane life data. My assumption is supported by the fact that someone very knowledgeable I am friends with left his job as a quantitative analyst in finance to take a role as a software developer at Facebook. Also, they have just announced their new hire, Vladimir Vapnik, which means that they can afford it. The question is whether the users benefit from Facebook to the same extent as Facebook benefits from them.

Back to the task of credit scoring, there is another, indirect way to estimate someone's credibility: psychometric scores based on questionnaires. The major name among companies that do such analyses is the Entrepreneurial Finance Lab (or the EFL). It is a business that originates from Harvard University. It was founded in 2006, and its aim has been to take credit scoring to the next level, making it scalable and independent of data on past credit history, business field or the financial documentation in one's possession.

I have found some evidence on the internet that psychometric analyses are being used for credit scoring, but no example of a questionnaire itself. While I can imagine that analysts may employ some kind of deep learning to establish links between someone's personality traits and their "moral", on both the micro and the macro level, I know too little about how this is actually done, so it is still a black box to me. So I should just take someone's word that this works. And if it does, this is indeed amazing, because it can help bridge so many gaps in lending that have not been covered before.

In the meantime, it would, perhaps, be easier to make the potential borrowers take a polygraph test.

Tuesday, 11 November 2014

Quality of sleep

Sleep has been a field of ongoing research for many years now. It is quantified, qualified, widely documented, discussed, argued about etc. Every sensible person in the world knows that they should sleep around 8 hours per day, and there are many guidelines out there, e.g. the ones from the U.S. National Heart, Lung, and Blood Institute.

For the last year and a half, I have been very concerned about how much I sleep. Since I got an activity tracker, I have realised that on average I sleep less than the benchmark value of 8 hours.

The absolute majority of portable sleep-tracking devices aimed at both clinical and everyday use contain an accelerometer.

Seriously, check out the last link if you are in search of a sleep device for yourself: it's a great read!

So, an accelerometer registers your acceleration along the three axes, and then the relevant software translates the data into a one-dimensional continuous variable representing activity counts. Based on specific algorithms (e.g. this one), the ability of an accelerometer to track even the tiniest motions and the supposition that when you sleep you don't move, the software classifies your time in bed into periods when you are asleep and periods when you are awake. The latter is called wake after sleep onset (WASO). The time it takes you to fall asleep is referred to as the sleep latency. When your portable monitoring device says that you have slept 7 hours, it means the time you were asleep, i.e. the time in bed minus WASO minus sleep latency.

You normally also get to know your sleep efficiency, i.e. the time you have slept as a percentage of your total time in bed, or of your goal if you are aiming for a specific sleep time.

This efficiency measure is one of the ways to look at the quality of sleep.
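The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the definitions given here, not the algorithm of any particular tracker; the function name, its arguments and the example numbers are assumptions made up for this sketch.

```python
def sleep_summary(time_in_bed_min, sleep_latency_min, waso_min, goal_min=None):
    """Return (minutes asleep, sleep efficiency in percent).

    Hypothetical helper: all inputs are in minutes. Efficiency is computed
    against the time in bed, or against goal_min when a sleep goal is given.
    """
    asleep = time_in_bed_min - sleep_latency_min - waso_min
    if asleep < 0:
        raise ValueError("latency and WASO cannot exceed the time in bed")
    reference = goal_min if goal_min is not None else time_in_bed_min
    efficiency = 100.0 * asleep / reference
    return asleep, efficiency


# Made-up night: 8 hours in bed, 20 minutes to fall asleep, 40 minutes awake.
asleep, efficiency = sleep_summary(480, 20, 40)
print(asleep, efficiency)
```

With these made-up inputs the night amounts to 7 hours of sleep, which is the figure a device would report.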

There is a huge body of scientific research that considers sleep quality: 6246 records on PubMed as of today. Quality of sleep is something that is conceptually simple and complex at the same time, and, to my knowledge, there are many different ways to measure and report on it, but no definitive one.

Three main directions to approach quality of sleep can be outlined. The first one is by mere asking. You can address a respondent with a question about how well they have slept. You can take it one step further and suggest that they rate the quality of their sleep on some scale. In order to boost the validity of your judgement, you can ask them to do so for several days in a row (for example, seven) - and thus you will come up with what is called a sleep diary.

The second way is to employ the renowned Pittsburgh Sleep Quality Index (PSQI). This is a questionnaire-based measure developed by researchers at the University of Pittsburgh, and it has been extensively used in research (1680 records on PubMed as of today). The index's values range from "poor" to "good", and various sleep parameters across several domains are used along the way, including subjective sleep quality, sleep latency, use of sleep medication and others. In the questionnaire, a subject should answer all questions based on what they think applies to the majority of days during the month prior to questioning. It is a highly subjective questionnaire and index, and this is what it is sometimes criticised for.

The third way to track sleep quality is to use the quantitative, objective measure of sleep efficiency mentioned above.

Regardless of how you measure the quality of sleep, it is obvious that getting quality sleep every night, or at least most nights, is very important for the well-being of practically everyone.

The fact that sleep affects performance is soundly backed by scientific evidence, and it is no wonder that articles on how to boost your sleep quality appear in places like Yahoo! Finance. People I am friends with would even purchase blue light blocking glasses to wear in the evening - blue light allegedly (I have never checked) shuts down melatonin production in humans.

It is naturally up to you how far you should go in controlling your performance in sleep. I have worn both commercial and professional devices, and I can say that wearing a tracker makes you think more about how much you sleep and move. This is not necessarily a good thing and, depending on your mindset, it can make you upset or anxious.

Anyway, self-awareness is a good thing, and if you don't want to commit to wearing a monitor, you can keep a sleep diary for a while or even compute the value of the Pittsburgh Sleep Quality Index for yourself.

Thursday, 30 October 2014

Lean Startup Munich: from bleeding edge to leading edge

Going lean in Munich

Almost anyone who speaks English and has given a thought to founding or joining a startup is most surely aware of the Lean Startup philosophy.

The concept was invented by Eric Ries, a young and successful U.S. entrepreneur who wrote a great book explaining how to successfully build and run startup businesses. This book has had a major impact, and the idea has grown into an international movement with many practitioners across the globe.

Munich too has a top-notch group of lean startuppers, and it's been up and running since 2012. The group gathers once a month, and the meetings are organised through the Meetup platform. Like everything in Bavaria, these meetings are of superb quality: the venues are great, the speakers are awesome, and the public is friendly and inspiring. Also, like everything in Bavaria, these meetups involve beer. Good beer. And pizzas. Which are courtesy of numerous group sponsors.

Yesterday, there was a meetup. I have decided to blog about it because it was just great.

Here's an outline of talks:

Remote employees for an IT company: yay or nay? A view of a CTO

The first presentation was given by Dimitar Siljanovski, the CTO of Cuponation, a successful startup that offers retail discount coupons to customers and operates all across Europe. As a tech person himself, but also an executive, Dimitar talked about hiring and managing remote teams for IT businesses.

In his talk, he explained the main motivation for remote hires, which is cost and quality. I found it particularly funny that Dimitar mentioned that they prefer to hire Ukrainians rather than Russians, because the latter are too expensive. He told the audience where and how they find the people, and he was able to show that the concept actually works: the churn rate among in-house staff is twice as high as the one among remotes! To say nothing of the fact that very few employees have left the company at all.

Dimitar also elaborated quite a bit on how they at Cuponation keep the remote staff motivated (e.g. by promoting or relocating them), and on how developers located elsewhere help expand the local teams. He mentioned that one of the obstacles for foreign developers was their slight insecurity about their English language skills, for example when they need to participate in meetings, and explained how easily this could be overcome. He mentioned that understanding different cultures is of the highest importance. Dimitar also admitted that one of the issues he once bumped into was a conflict between two dazzling personalities. He confessed that he always intends to hire the best, those he himself can learn from, but it is understandable that when two stars collide it might cause problems for the whole team. Such issues are, however, also manageable, as far as he explained.

All in all, remote working in IT is working. The complete presentation can be found here.

Hogwarts for entrepreneurs: The Founder Institute

The second talk was presented by Jan Kennedy from the Founder Institute. This is an educational organisation backed by Microsoft which trains people to be entrepreneurs. They start by testing whether you have the so-called Entrepreneur DNA, which determines how apt you are to found and run a business. More information on that can be found here, but as far as I gathered from Jan's brief presentation, one has to be a determined and well-balanced person in order to succeed in building a company from scratch. Once the test is passed, you can apply for the program. Then, for 4 months, you will be mentored on how to be an entrepreneur, and you are promised a personal approach based on your own strengths and weaknesses as revealed by the DNA test.

The enrollment deadline for Munich is approaching, but if you are elsewhere and are curious about the FI activities, you can attend one of their events listed here.

When your train is late, there is a business opportunity there

Finally, the last talk was presented by a lean startupper, Thomas Hartmann, who has taken up a common problem and found a market in it.

Not many people are aware of it, but, according to EU legislation, if your train is 1 hour late, you are entitled to receive a refund of 25% of your travel fare. If the delay reaches or exceeds 2 hours, your compensation rises up to a whopping 50%. This is very well explained here (in German).
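The two thresholds described above translate into a very small function. The following Python sketch only encodes the rule as stated in this post (25% from one hour of delay, 50% from two hours); the function name and the example fare are invented for illustration, and real refund rules may include conditions that are not modelled here.

```python
def delay_refund(fare, delay_minutes):
    """Refund for a delayed train under the rule quoted in the post.

    Hypothetical helper: 25% of the fare from 60 minutes of delay,
    50% from 120 minutes, nothing below one hour.
    """
    if delay_minutes >= 120:
        rate = 0.50
    elif delay_minutes >= 60:
        rate = 0.25
    else:
        rate = 0.0
    return round(fare * rate, 2)


# Made-up ticket price of 80 euros at three different delays.
for minutes in (30, 75, 120):
    print(minutes, delay_refund(80.0, minutes))
```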

However, train companies seem to be reluctant to give your money back. For example, with DB, the main German train company, you can only request your refund at the station, in person, directly addressing a company representative, or you can write them a letter with a request. No e-mails or phone calls are accepted, and there is no app for that either.

Or, there was no such app. Until Thomas came up with the idea of setting up a service that would help people get reimbursed without making them fill in and file all these forms or queue up, losing time among other frustrated customers. He has created a business called Bahn-Erstattung.de which aims to simplify the refund process for the end users. All you have to do is send a picture of your ticket and a repayment request. You do so via a smartphone app. The company takes it from there, and you have your money back in about a month.

This idea, simple and smart, aroused great interest in the audience. When people asked Thomas what he would have done differently if he were to start all over again, he said that he would not invest that much in the technical part. Similarly to Cuponation, he has also hired a remote developer, in this particular case from India, to create the software. And this has worked well for him.

Is remote work a one-size-fits-all solution?

To close the loop, I would like to add a couple of thoughts of mine on remote work. The successes achieved by hiring remote programmers have got me thinking about whether long-distance collaboration is a good idea for any kind of labour. I have found a very interesting article in the MIT Sloan Management Review explaining how to "set up remote workers to thrive". The author lists four challenges standing in the way of remote working and then takes a closer look at each of them, suggesting solutions to the problems. In my opinion, this article is not only an absorbing read, but also the result of thorough research backed by 13 publications. What has surprised me quite a bit is that no particular kind of job is listed anywhere in the text. This leads to the conclusion that remote work challenges are generic, regardless of what you actually do for the company. Since these issues are (allegedly) manageable, one may be able to successfully work from "a Galaxy far, far away" doing many different things, and not necessarily writing code.

This kind of work relationship might not suit every kind of personality and situation but is worth consideration.

Wednesday, 22 October 2014

A bit about credit scoring

Data science and credit scoring

I really love reading Data Science Central and, in particular, Analytic Bridge. Several days ago, I received their newsletter, in which they reported on a lot of stuff including the ever popular Big Data. In one of the articles, there is an overview of a selection of case studies on big data.

One of them speaks (briefly, and provides a wrong link) about a German, Hamburg-based fintech company - Kreditech - that applies sophisticated data analysis for the purposes of credit scoring. Exciting, right? Especially taking into consideration the fact that Kreditech takes into account the behavioral angle. As reported, the company has found and makes use of interesting connections between a person's social media behaviour and their financial credibility. These guys have achieved massive success, and they have expanded the shop to many countries, including Spain and Russia. On their webpage, they say that they rely on big data and complex machine-learning algorithms to make faster and better scoring decisions.

Some time ago, I read a great article that I found on LinkedIn. It is a blog post from Simon Gibbons, someone who has worked in the credit business for more than 20 years, and he admits that he is glad that there are things about this job that have not changed since then.

Which has brought me to thinking that machine learning is one of them. Indeed, it is no secret that logistic regression is The Algorithm used to perform credit scoring. And if you want to take probably the most amazing online course on machine learning existing nowadays, offered by Stanford University on the Coursera platform and elsewhere - (do act on this urge!) - you will very soon find out that logistic regression is one of the very basic machine learning algorithms. Statisticians may laugh now. Economists working in scoring may now say they are advanced in machine learning.

Nevertheless, in my humble opinion, logistic regression does a great job, because its output is well-interpretable and flexible. First and foremost, the regression coefficients can be translated to the odds of a person being credit-worthy conditional on a given factor. Second, the output of the method is the probability of a person being credit-worthy or not. So, when you want to classify people accordingly, you can play a little bit with the probability threshold value. Thus, you can mitigate the risks of misclassification, namely avoid getting too many false positives or too many false negatives. Third, because the underlying algorithm is standard, it can be upgraded (by adding regularisation terms, for example), and this can improve a model's predictive power on new data.
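To make the first two points concrete, here is a minimal Python sketch of how a fitted logistic model is read. The coefficient and feature values are placeholders invented for illustration, not estimates from any dataset, and the function names are assumptions of this sketch.

```python
import math


def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase in a predictor."""
    return math.exp(coefficient)


def predicted_probability(intercept, coefficients, features):
    """Logistic model output: inverse logit of the linear predictor."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear))


# Placeholder numbers, chosen only to show the mechanics.
print(odds_ratio(0.7))
print(predicted_probability(-1.0, [0.7, -0.3], [1.0, 2.0]))
```

A coefficient of zero gives an odds ratio of one, i.e. the factor leaves the odds unchanged, which is a quick sanity check when reading a coefficient table.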

A hidden gem: German credit scoring datasets

It is really hard to get personal financial data - it may be even harder than getting clinical data. Luckily, there exists a wonderful German credit scoring dataset. Provided by Munich's LMU university, it has been used extensively by well-known German statistics professors - G. Tutz, L. Fahrmeir and A. Hamerle - for educational purposes. The dataset can be downloaded from the LMU webpage or as a part of the R package Fahrmeir, but I suggest the first option, because in the package, at least in my download, the dataset appears to be somewhat trimmed.

In his book "Regression for Categorical Data", in exercise 15.13, Dr. G. Tutz suggests that the reader use this dataset to fit a bunch of classifiers, including linear, logistic, nearest neighbours, trees or random forest methods. It is further proposed to split the set into training and test sets in the proportion 80/20 and compute test errors.

Playground

I, too, could not resist the temptation to get my hands on this dataset, and I've tried several methods. Below I report on three of them that I am particularly fond of: the application and possible results.

At first, the whole dataset was fitted by these models in order to see how they approximate it and which features they rely on most. The next step was to split the data several times into train and test sets, as recommended in the book, and test how these models, fitted on train data, predict on test data.

Fitting the dataset

Logistic regression

So, to start with, I ran a logistic regression classifier. This yielded the following set of significant predictors:

The following factors are among them:

balance of current account (categorised),
credit period,
moral, namely, empirically observed tendency to return borrowed money,
purpose of credit,
balance of savings account (categorised),
being a woman,
installment in percent of available income,
being a guarantor,
time of residence at the current home address,
type of one's housing,
not being foreign workforce (Gastarbeiter in German).

Here, the values are rounded to 3 digits. Exponentiate the coefficients to obtain the conditional odds ratios.

Then, the goodness of fit can be deduced from a bunch of statistical tests for models, or by cross-tabulating the actual values and the fit. Here, the threshold issue should be brought up again. The logistic model outputs fitted probabilities. Therefore, one can classify the respective values by setting a decision threshold. The most common and intuitive way to do so, also suggested by the shape of the logistic function curve, is to set it to 0.5. This would yield the following fit:

or 78.7% of correctly classified values. In statistics and machine learning, three terms are used quite extensively to assess the goodness-of-fit of a given classifier: namely, precision, recall and the F1-score:

Precision is the share of true positives out of the whole predicted positives.
Recall is the share of true positives out of the whole observed positives.
F1 score is the harmonic mean of precision and recall that provides an estimate of how good the classifier is: 0 is the worst value, 1 is the best value.
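The three definitions above can be written down directly from the counts in a confusion table. This is a plain Python sketch with a hypothetical function name; the counts in the example are made up and are not the fit reported in this post.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 from confusion-table counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    predicted_pos = true_pos + false_pos
    observed_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / observed_pos if observed_pos else 0.0
    total = precision + recall
    f1 = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up counts: 8 true positives, 2 false positives, 8 false negatives.
print(precision_recall_f1(8, 2, 8))
```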

Depending on what the primary aim of a classifier is, the threshold value can deviate from 0.5. If one wants to predict credit worthiness very confidently, then the threshold can be set higher. If one wants to avoid missing too many worthy people, namely, to avoid false negatives, the threshold can be set lower.

Here below is the table showing what different threshold values yield:

So, as expected, setting the threshold to 0.7 yields the best precision, the value of 0.3 provides the best recall and the highest F1-score, and the value of 0.5 leads to the highest overall percentage of correctly classified subjects.
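The moving-threshold idea is easy to reproduce on any vector of fitted probabilities. The sketch below is in Python and uses six toy values invented purely for illustration; they are not taken from the German credit data, and the helper names are assumptions.

```python
def classify(probabilities, threshold=0.5):
    """Label a case 1 (credit-worthy) when its fitted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def accuracy(actual, predicted):
    """Share of cases whose predicted label equals the actual label."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)


# Toy values invented for illustration only.
fitted = [0.90, 0.75, 0.60, 0.40, 0.35, 0.20]
actual = [1, 1, 0, 1, 0, 0]

for threshold in (0.3, 0.5, 0.7):
    predicted = classify(fitted, threshold)
    print(threshold, predicted, round(accuracy(actual, predicted), 2))
```

Lowering the threshold labels more cases as positive, which is why recall tends to rise and precision tends to fall as the threshold goes down.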

Tree

The CART algorithm, introduced in 1984 by Leo Breiman and colleagues, should not be underestimated. I love this algorithm and I use it a lot. What I particularly like about it is that it can report on variable importance both in terms of Gini importance (or impurity) and information gain. Also, pruning, i.e. reducing the size of a tree, is a very important concept which helps create usable models.

In the beginning, a classification tree was fitted with no pruning involved.

Variable importance (Gini):
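For readers who want to see what the two splitting criteria measure, here is a small Python sketch of Gini impurity and of information gain for a binary split. It illustrates the textbook definitions only and is not the rpart implementation; the function names and the toy labels are assumptions.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    shares = [count / n for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in shares)


def information_gain(parent, left, right):
    """Entropy of the parent node minus the size-weighted entropy of its children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted


# Toy node with two good (1) and two bad (0) credits, split perfectly.
parent = [0, 0, 1, 1]
print(gini_impurity(parent), information_gain(parent, [0, 0], [1, 1]))
```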

balance of current account: 31
duration: 15
purpose of credit: 11
credit amount: 11
value of savings: 10
most valuable assets: 9
living at current address: 7
previous credits: 2
type of housing: 1
working for current employer: 1
job type: 1

These Gini importance values have been rescaled to add up to 100, so one can quickly see the relative importance of the factors.

The first tree with the complexity parameter of 0.01 has resulted to have 81.

In R, a CART tree can be fitted using the package rpart. When it comes to fitted values, rpart can return them both as predicted classes and as probabilities of belonging to a class. If the latter option is chosen, then one can employ the same moving threshold paradigm.

Opting for the default fitted values has yielded the following classification:

The fit is better than for logistic regression: in each category in particular and overall - 79.7%.

Refitting the model with the splits based on information gain yielded the following variable importances:

balance of current account: 35
duration: 14
purpose of credit: 11
moral: 11
value of savings: 10
credit amount: 7
most valuable assets: 6
living at current address: 2
previous credits: 1
type of housing: 1
age: 1

This is a slightly different set. The fit is a little bit worse - under the same complexity of the tree (0.01) - 79.3%, but still a bit better than for logistic regression.

Pruned to a complexity of 0.05, both trees yield the same accuracy of the overall fit: 74.7%

Support Vector Machine

I am a massive fan of SVM. Mostly because of all this dimensionality reduction and the "kernel trick". Support vectors offer a whole different approach to classification, and the underlying models are very flexible - because of kernels and regularisation.

The implementation of SVM in R is amazing and is available by loading the e1071 package. The default kernel is Gaussian, which is referred to as the radial basis function, and the selection of other implemented functions includes linear (dot product), polynomial and sigmoid kernels. I think that this is an exhaustive set for basic research needs, but I am kind of interested in implementing other kernels and using them with this classifier. The only minor thing that I'd change is that I'd call the radial basis kernel Gaussian - which it actually is. RBF is a broader term: a Laplacian kernel is also a radial basis function. But I'm being picky, perhaps.

Anyway, as my aim was to fit the dataset, I considered it ok to massively overfit it and set the regularisation term to whatever works.
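The distinction between the two kernels is easy to see side by side. The Python sketch below writes both as functions of the distance between two points; it is an illustration rather than the e1071 code, and the Laplacian form shown (based on the Euclidean distance) is one common definition among several. The gamma values and points are arbitrary.

```python
import math


def gaussian_kernel(x, y, gamma):
    """Gaussian kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def laplacian_kernel(x, y, gamma):
    """Laplacian kernel (one common form): exp(-gamma * Euclidean distance)."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return math.exp(-gamma * distance)


# Both depend on the points only through their distance, hence "radial".
u, v = [0.0, 0.0], [3.0, 4.0]
print(gaussian_kernel(u, v, gamma=0.1), laplacian_kernel(u, v, gamma=0.1))
```

The Gaussian kernel decays with the squared distance and the Laplacian with the distance itself, so the latter has heavier tails for distant points.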

The table below reports accuracy of classification for different kernels and cost parameters (in %):

For Gaussian and polynomial (of degree 3, which is the default) kernels the fit improves drastically with the growth of the cost parameter. Here below is a similar table but reflecting the number of support vectors each model relies upon:

The size of the dataset is 1000 observations.

The gamma parameter (or the kernel scale parameter), which by default in the function equals 1/(number of features), including dummies, has remained untouched.

Prediction

The second part of the exercise from Prof. Tutz's book suggests splitting the dataset several times into train and test parts, then fitting the model using the first part and testing its predictive performance on the second.

Using random sampling, I have split the dataset 10 times, assigning 80% of it to the training set and 20% to the test set. The tree model has used the default complexity parameter of 0.01. The SVM model has been implemented with the Gaussian kernel and a cost parameter of 5.
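The resampling scheme itself is independent of the model being tested. Here is a minimal Python sketch of repeated random splitting that follows the 80/20 proportion suggested in the book; the function names, the seed and the use of row indices instead of actual data are assumptions of this sketch.

```python
import random


def random_split(n_obs, train_fraction, rng):
    """Shuffle the row indices and cut them into a train and a test part."""
    indices = list(range(n_obs))
    rng.shuffle(indices)
    n_train = round(n_obs * train_fraction)
    return indices[:n_train], indices[n_train:]


def repeated_splits(n_obs, n_repeats, train_fraction=0.8, seed=0):
    """Return n_repeats independent (train, test) index pairs."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    return [random_split(n_obs, train_fraction, rng) for _ in range(n_repeats)]


# Ten splits of a dataset with 1000 observations.
splits = repeated_splits(1000, 10)
train, test = splits[0]
print(len(splits), len(train), len(test))
```

Each model would then be fitted on the rows in the train part and scored on the rows in the test part, once per split.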

Below, there are the validation results for each trial reported as the percentage of cases classified in the test set correctly:

And the summary of the results:

As seen from the boxplot, SVM outperforms the other methods in prediction accuracy, followed by logistic regression. CART, however, had a higher average performance than logistic regression, and the smallest variability of results of the three (SD=1.81%).

I then ran the same analysis resampling the data 100 times and have come up with the following results:

The respective standard deviations are:

Logistic regression: 2.44 %
CART: 2.85 %
SVM: 2.75 %

Finally, I have run 1000 iterations of the same analysis, just to see if the results hold. And they hold:

As for variability of results, the respective standard deviations were:

Logistic regression: 2.84 %
CART: 2.86 %
SVM: 2.66 %

This comparison could be taken several steps further. Namely, the data could be split into train, cross-validation and test sets, where the first serves to fit the model, the second - to adjust the parameters, and the third - to test the performance of the resulting classifier. However, there is always room for improvement, and these results already can provide one with an idea of the methods.

If you are still bearing with me, please let me draw your attention to the existence of such an important predictor in the dataset as the presence/absence of a telephone in a person's possession. I believe that back in the day they meant landline phones, not even the oversized mobile Motorolas. What would it be now, an iPhone 6?

Monday, 13 October 2014

Unit tests in R made simple

There is nothing new under the moon.

During the last six months, I have been working mostly in R. R is great for research purposes, and I am not participating in these endless discussions about what is cooler: R, Python, Matlab, SAS or you name it. Being privileged to speak all of the above-mentioned languages with greater or lesser fluency, I can compare, and therefore I think that it all comes down to what you want to do in the end.

One of the things that I have adopted from my working-exclusively-in-Python experience is the test-driven development (TDD) paradigm. Now, even writing my research code in R, I can't help creating these tests.

There is actually not much new to say about unit testing, because the topic is extensively covered elsewhere. In my humble opinion, this blog post offers the most awesome coverage of unit testing that I have ever seen.

TDD in general and unit tests in particular are often neglected by R users - unless they are writing a package.

I think the added value of unit tests for research code cannot be overestimated since, despite the popular beliefs of people unfamiliar with R, the language is much more than - as one of my classmates liked to put it - "a sophisticated statistical calculator".

Of course, many, many research findings have been successfully made employing script-based code, but when you have to do similar things multiple times, and when you can wrap your code up and make it unfold beautifully with every call, testing comes in very handy.

R has a certain characteristic: there exists at least one implementation (i.e., package) for almost anything. For some things, there are multiple ways to do them. I don't really know why people reinvent the wheel, but my guess is that when the current state of things is not working for them, they prefer to start from scratch rather than dig into someone else's code.

So, if you are eager to unit test your thoroughly developed work, you can opt for - at least - these three packages:

The last one does not seem to be used very often. The second is famous because it was developed by Hadley Wickham himself, and is allegedly used by him in his packages. To those unfamiliar with the name, let me just say that he is the reference R guy, a visualisation guru and the ggplot2 creator. He has a book worth $60 published by Springer Verlag and a stackoverflowing reputation on Stack Overflow.

I am using the first package from the list, RUnit, and not for the reason that it has been created by fellow German people working in the field of epidemiology. I do so merely because RUnit is so similar to the unit testing framework of Python that I am already familiar with. It is reportedly similar to the unit testing approach implemented in Java. I don't know Java, so I can't tell. What I can tell is that RUnit is great to use. It is clear, comprehensive and unambiguous. Moreover, it comes with a terrific reference manual that is a great read - apart from being informative. It provides a simple yet exhaustive explanation of what unit tests are, why they are helpful and how they differ from integration tests. Also, it provides guidelines on how to write unit tests. It is quite unusual to encounter a line like:

"Once a bug has been found, add a corresponding test case"

or like:

"Develop test cases parallel to implementing your functionality. Keep testing all the time (code - test - simplify cycle)."

in an R document (and I've read quite a few of them).

This blog post provides a nice comparison of RUnit and testthat.

Unfortunately, to my knowledge, there exists no implementation of test suites in any of the R IDEs. But this is not a major problem, especially for those R users who, like me, started their journey with R using the console only.

So, if you want to define a test suite in R, all you need to do is load the library, call defineTestSuite() and runTestSuite() and, if you wish, printTextProtocol() for your tests.

Like this:
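A minimal sketch of such a suite, assuming the RUnit package is installed. The temporary directory, the file name runit_add.R, the toy function add() and the suite name are all made up for illustration:

```r
library(RUnit)

# Hypothetical setup: a temporary directory holding one test file.
test_dir <- file.path(tempdir(), "runit_demo")
dir.create(test_dir, showWarnings = FALSE)

# The file defines a toy function and one test function for it.
# RUnit picks up files matching testFileRegexp and, inside them,
# functions matching testFuncRegexp.
writeLines(c(
  "add <- function(x, y) x + y",
  "test.add <- function() {",
  "  checkEquals(add(1, 2), 3)",
  "  checkException(add(1, 'a'))",
  "}"
), file.path(test_dir, "runit_add.R"))

# Describe the suite: where the test files live and how they are named.
suite <- defineTestSuite(
  name = "demo suite",
  dirs = test_dir,
  testFileRegexp = "^runit.+\\.R$",
  testFuncRegexp = "^test.+"
)

# Run every test function and print a plain-text report to the console.
result <- runTestSuite(suite)
printTextProtocol(result)
```

In a real project the test files would live next to the code they exercise, and dirs would point at that folder instead of a temporary one.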

Friday, 6 June 2014

p-values for test statistics in R

Imagine that you have a statistical hypothesis (a null hypothesis) that you are about to test. You get (or you know how to get) the value of a test statistic, but the p-value for this statistic is somehow not provided by the program.

Normally, all the statistical software packages, as well as R and the relevant Python (numpy, scipy) functions, offer the p-values together with the computed statistic values, but exceptions can still happen.

In applied data analysis, the p-value has become a down-to-earth guideline for judging whether something is likely to be true or significant. By definition, a p-value measures the strength of the evidence against the null hypothesis. It denotes the probability of obtaining a value of the test statistic at least as extreme as the one observed, given that the null hypothesis is true. If the p-value is below a certain threshold, called the significance level, the null hypothesis is rejected. Otherwise it is not rejected.

There are a couple of nice short videos out there, like this one, that explain the concept behind the p-values quite well.

Knowing the distribution of the test statistic, it is easy to find the relevant p-values. One approach is to use the statistical tables for distributions. The other is to compute the p-values using statistical software.

R provides a set of functions like pnorm, pt, etc. - their names start with "p" (for "probability") followed by an abbreviation of the relevant distribution. These functions can be used to compute p-values. By default, they output one-sided, left-sided probabilities: R computes the probability of obtaining a value of the test statistic that is as small as the observed one or smaller. For right-sided probabilities, the parameter lower.tail in the function call should be set to FALSE.

Now, the other thing is that for testing some hypotheses, two-sided tests are needed. The basic rule of thumb here - if I correctly recall my graduate stats classes - is the following: if the hypothesis is formulated using the sign "=", then a two-sided test is needed, whereas if it is formulated using the signs "<, <=, >, >=", then a one-sided test should be used.

For a two-sided test based on a symmetric distribution, the one-sided p-value of the tail beyond the observed value should be multiplied by 2. If we assume that our test statistic is a z-score and that its value is -3.7, then the one-sided p-value in R is pnorm(-3.7) and the two-sided one is 2 * pnorm(-3.7).
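Spelled out as a small runnable snippet, using only base R (the variable names are arbitrary):

```r
z <- -3.7  # observed z-score from the text

# Left-tail probability P(Z <= z); lower.tail = TRUE is the default.
p_one_sided <- pnorm(z)

# Two-sided p-value: double the tail beyond |z|. This works because the
# standard normal distribution is symmetric around 0.
p_two_sided <- 2 * pnorm(-abs(z))

# Right-tail probability P(Z > z), shown for comparison.
p_right <- pnorm(z, lower.tail = FALSE)

print(c(one_sided = p_one_sided, two_sided = p_two_sided, right = p_right))
```

The one-sided value comes out at roughly 1.1e-04, and the two-sided one is twice that. Writing 2 * pnorm(-abs(z)) rather than 2 * pnorm(z) keeps the formula valid for positive z-scores as well.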

Now, common sense suggests that -3.7 is an unusual value for a z-score, since z-scores follow the standard normal distribution, with mean 0 and standard deviation 1. About 99.7% of the data lie within 3 standard deviations of the mean, so anything outside the range [-3, 3] already looks suspicious. However, if a precise p-value is needed, it should be computed or obtained from a table.

Also, sometimes p-values are not provided in regression model outputs. They can be very useful for making inferences about the significance of a given predictor in the regression equation. For example, some model outputs obtained with the package VGAM - the one aimed at fitting vector generalized linear and additive models - lack p-values for the coefficient estimates.

Take the proportional odds model that the developers of the package use as an example. There, the z-scores of the coefficient estimates suggest significance for both intercepts and for the independent variable. The hypothesis here is two-sided.
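A sketch of how this could look, assuming the VGAM package is installed and using the pneumo data that ships with it, as in the package documentation. The z-scores are rebuilt here from the estimates and their standard errors rather than read from the summary output, whose layout may differ between VGAM versions:

```r
library(VGAM)

# pneumo ships with VGAM; the log of exposure time is the predictor.
pneumo <- transform(pneumo, let = log(exposure.time))

# Proportional odds model: cumulative logits with parallel slopes.
fit <- vglm(cbind(normal, mild, severe) ~ let,
            family = cumulative(parallel = TRUE, reverse = TRUE),
            data = pneumo)

est <- coef(fit)              # two intercepts and the slope for let
se  <- sqrt(diag(vcov(fit)))  # standard errors of the estimates
z   <- est / se               # z-scores (Wald statistics)
p   <- 2 * pnorm(-abs(z))     # two-sided p-values

print(cbind(estimate = est, std_error = se, z = z, p_value = p))
```

The p-values rely on the usual large-sample normal approximation for the z-scores, which is the same assumption the two-sided rule above makes.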

Last but not least, I would like to announce that I will not be attending the useR! 2014 conference, although my abstract has been accepted. This is due to personal circumstances, and I truly look forward to applying again, hopefully getting accepted, and then going there next year.

Monday, 5 May 2014

See you at useR! 2014

This is just a quick note to tell the world, and to remind myself, that my abstract has been accepted to the useR! 2014 conference.

Ta-dam!!!

This will be a poster, not a full-bodied talk, but still. I am very happy, especially since I have been thinking of going there for three years in a row.

The subject of my poster will be air pollution exposure, geostatistics and machine learning, and how to draw inferences about the first via the methods of the second and the third using R.

Now I have developed a great interest in poster design and Paso Robles vineyards: my intention is to take a road trip from LA to San Francisco after the conference. I want to see my friends in SF, and since my husband has promised to drive, I cannot miss the chance to enjoy some Californian wines in situ.

Also, I expect to finally meet Dr. Virgilio Gomez Rubio, who is an invited speaker at the conference. I have made massive use of the book that he has coauthored, and I missed his talk once in Barcelona (by arriving 45 minutes late from Girona), so now, since I have to go farther, I hope not to miss this one.

Wednesday, 30 April 2014

Keep it simple!

Some time ago, I figured out that the most knowledgeable among us are those who are able to explain the most complex stuff in the plainest terms.

Repeating after the renowned French poet and critic, Nicolas Boileau-Despréaux: "Ce que l'on conçoit bien s'énonce clairement, Et les mots pour le dire arrivent aisément."

What is conceived well is expressed clearly, and the words to say it come easily.

There is a famous Russian anecdote (yes, we do tell anecdotes all the time). Two university professors meet, and one asks the other: "How's the exam going?" "Terrible," - the other answers. "What's wrong?" - inquires the first. "Well, they just can't get it straight. I've explained it to them once, explained it twice, I have even understood it myself already, but they still don't get it!"

So, some people complicate things in order to seem more knowledgeable, to get chicks or to rock a job interview, but does this really work out?

It doesn't.

Daniel Kahneman, the forefather of behavioral finance, sums it up in his amazing book "Thinking, Fast and Slow" as follows:

"If you care about being thought credible and intelligent, do not use complex language where simpler language will do."

This person has a Nobel prize, so he kind of knows what he is talking about. Kahneman refers to the findings of his friend and colleague Daniel Oppenheimer from UCLA (formerly at Princeton University). In his (Ig Nobel-winning, actually) paper "Consequences of Erudite Vernacular Utilized Irrespective of Necessity: Problems with Using Long Words Needlessly", Oppenheimer provides broad, statistically backed evidence of why people should keep it simple.

One of the experiments that he describes uses an excerpt from the classic work "Meditation IV" by René Descartes. The first paragraph of this text was translated into English by two different translators, and the translations were then presented to a group of Stanford University undergraduates. Half of the participants read the more complex, 98-word translation, whereas the other half read the simpler, 82-word version. Moreover, half of the participants were told that the text had been written by Descartes, while the other half were told that it came from an anonymous author. The students were instructed to rate the difficulty of comprehending the text and the intelligence of its author, both on a 7-point scale.

The results show that those who read the "simple" text and knew that it belonged to Descartes rated Descartes as smarter than those who read the "complex" text did. The same was observed amongst those who attributed the text to the anonymous author.

In this experiment, the complexity of the text was negatively correlated with the perceived intelligence of the author and positively correlated with the difficulty of comprehension. The difficulty of comprehension was then used as a mediator in the analysis aimed at establishing the link between the complexity of the text and the perceived intelligence of its author. To do so, Sobel's mediation test was employed.

The results of the analysis are summarized graphically in a figure taken from Oppenheimer's paper.

Sobel's test is one of the most common tests for intervening variable effects, which are also referred to as tests for mediation (in psychology) or tests for surrogate or intermediate endpoint effects (in epidemiology). If you are interested in these, please take a look at this article.

In plain terms, we would use mediation tests if we wanted to see whether the relationship between one variable and another depends on some intervening factor. These tests are well implemented in R and SAS; however, I failed to find a Python implementation of them.
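To make the idea concrete, here is a hand-rolled sketch of the Sobel statistic in base R. It is not the routine of any particular package, and the data are simulated purely for illustration, so the coefficients 0.5, 0.4 and 0.1 are arbitrary assumptions:

```r
set.seed(1)  # simulated, purely illustrative data

n <- 200
x <- rnorm(n)                      # independent variable
m <- 0.5 * x + rnorm(n)            # mediator, partly driven by x
y <- 0.4 * m + 0.1 * x + rnorm(n)  # outcome

# Path a: effect of x on the mediator.
fit_a <- summary(lm(m ~ x))$coefficients
a  <- fit_a["x", "Estimate"]
sa <- fit_a["x", "Std. Error"]

# Path b: effect of the mediator on y, controlling for x.
fit_b <- summary(lm(y ~ m + x))$coefficients
b  <- fit_b["m", "Estimate"]
sb <- fit_b["m", "Std. Error"]

# Sobel statistic: indirect effect a * b over its first-order standard error.
sobel_z <- (a * b) / sqrt(b^2 * sa^2 + a^2 * sb^2)
sobel_p <- 2 * pnorm(-abs(sobel_z))  # two-sided p-value

print(c(z = sobel_z, p = sobel_p))
```

A small p-value here would suggest that part of the effect of x on y runs through the mediator m.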

One of the limitations of the Descartes study that Oppenheimer points out is that it was conducted on smart people. So, if you expect to deal with someone whose knowledge you question, the chances of mesmerizing them by casting complex magic words are hard to predict.

Tuesday, 15 April 2014

Why abstaining from drinking may be a bad idea when applying for a job in finance

One of the major motivations of statistics is to attempt to figure out whether there is a link or a lack thereof between something and something else.

The "somethings" tend to be described in data format, and therefore mathematical procedures come in very handy to make an inference and to support or reject what common sense is suggesting.

Whereas for numeric data the methods are quite straightforward, as they stand on the shoulders of giants and draw on knowledge from, say, physics, geometry, calculus, differential equations etc., things get somewhat trickier when one has to deal with categorical data.

There is an ample bunch of tests of dependence/independence for categorical data, and there are well-written books on the matter - like this one or this one. Nonetheless, the suggestion to use this or that particular test is often driven by empirical knowledge and looks more like an omen from technical analysis than like thorough, strict, sigma-algebra-based mathematics.

Whatever works.

While searching for yet another test to evaluate the existence of a link between an important health outcome and a common exposure factor, I've bumped into a classic study conducted by the great mind behind it all, Karl Pearson. In 1909, he evaluated whether there is a connection between criminal behavior and the consumption of alcoholic beverages.

Pearson studied 1426 criminals, and his null hypothesis was that there was no association between the type of crime and alcohol consumption.

Below is a descriptive table for this study. It has been taken from the book by A. Elliott and W. Woodward, Statistical Analysis Quick Reference Guidebook: With SPSS Examples.

Just by a simple 'look see' method, one can firmly reject the null hypothesis: drinkers are clearly more prone to conducting criminal activities.

Except for one. Fraud. Indeed, fraud should require some solid intellectual input, and therefore one must be really clear-headed when doing something fraudulent.

As actions of this kind are most frequently associated with the financial industry, HR departments of the relevant institutions could take a closer look at this interesting deviation from the common pattern. Also, prospective candidates for financial jobs could abstain from proudly proclaiming themselves devoted non-drinkers during interviews.

Not only is this rude, because, you know, in this industry people do not drink alcohol, but the admirers of Pearson's contribution to statistics might also find these life choices not so zero cool.

I'm kidding, of course.

The value of the chi-square statistic for the test of independence is 49.731, and the p-value is reported as 0.000, i.e. below 0.001.
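The test can be rerun in base R. The counts below are the ones commonly reproduced for Pearson's study of 1426 criminals; treat them as an assumption and check them against the table in the book before relying on them:

```r
# Counts as commonly reproduced from Pearson's 1909 study (an assumption
# to verify against the published table).
crime <- matrix(
  c(50, 43,
    88, 62,
    155, 110,
    379, 300,
    18, 14,
    63, 144),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    crime = c("Arson", "Rape", "Violence", "Stealing", "Coining", "Fraud"),
    drinking = c("Drinker", "Abstainer")
  )
)

# Pearson's chi-square test of independence (5 degrees of freedom).
test <- chisq.test(crime)
print(test)

# Share of drinkers and abstainers within each type of crime.
print(round(prop.table(crime, margin = 1), 2))
```

The row proportions make the pattern visible at a glance: fraud is the only category in which abstainers clearly outnumber drinkers.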
