Friday, 23 November 2012

Dr. What the thesis was about

On Tuesday I defended my thesis, and so I became a doctor. I was graded "Excellent", but I still have no idea what that means, nor whether it is relevant for my future career. I only know this is a European doctorate program, and therefore a European PhD title, which means that I spent roughly 1/4 of my PhD time abroad, learned a lot, met great people and had a lot of fun. For another 1/4 of my PhD time it was summer in Spain, so you can imagine, right? In a nutshell, these last three years have been pretty awesome.
Here is a presentation with some brief highlights of my thesis. It is a shortened version of my actual PhD defense presentation: half of it, to be precise. My talk lasted approximately an hour, followed by another hour of questions.



The basic idea of the dissertation is to provide valid predictions of air pollution concentrations even when observed data are severely lacking. It is no news that air pollution adversely affects people's health and well-being. Both adults and children are affected, and both short- and long-term exposures lead to health effects. This is why air pollution assessment methodology is constantly being updated. There are plenty of methods nowadays that serve this purpose, but they can be roughly divided into two groups: cheap to implement, and expensive to implement yet extremely precise. Precision is, of course, a desideratum for all methods. Sometimes, though, a prediction may lack validity, especially when the actual concentrations at some points of the map (say, exactly where a person with some health effect lives) are not available. Then the prediction error cannot be properly assessed.

In my case, I had a tiny data set of mean annual concentrations of some contaminants. These major contaminants are collectively referred to as "criteria pollutants". The two most investigated of them are nitrogen dioxide and fine particulate matter, and these are the two I took up for my study. For each of them, I had the annual values measured at the monitoring stations across the Barcelona Metropolitan Region. There are 49 stations in total, and for every year and every pollutant measurements were available at roughly 24 of them. So there were about 24 points on average for every year and pollutant.


To provide a valid prediction of the pollution surface for every year and pollutant, conformal predictors were employed. This is a technique recently developed at Royal Holloway, University of London, more precisely in its unit called the Computer Learning Research Centre. It is a machine learning method, and it comes from statistical learning theory. A conformal predictor is valid by construction (its prediction intervals cover the true value at the chosen confidence level, provided the usual exchangeability assumption holds), and it can be built upon almost any statistical algorithm, including, of course, regression, the one most commonly used for air pollution modeling.
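To give a feel for how a conformal predictor wraps around a regression, here is a minimal sketch. It uses the split (inductive) variant and a plain least-squares line, both chosen for brevity; this is an assumption for illustration, not the kriging-based predictor derived in the thesis. The function names and the toy data are made up.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def split_conformal_interval(train, calib, x_new, alpha=0.1):
    """Prediction interval with target coverage 1 - alpha (split conformal)."""
    slope, intercept = fit_line(*zip(*train))
    # Nonconformity scores: absolute residuals on the held-out calibration set
    scores = sorted(abs(y - (slope * x + intercept)) for x, y in calib)
    k = math.ceil((len(scores) + 1) * (1 - alpha))
    if k > len(scores):
        # Too few calibration points for this confidence level
        return -math.inf, math.inf
    q = scores[k - 1]
    pred = slope * x_new + intercept
    return pred - q, pred + q

# Toy data (hypothetical): a perfect line for training, noisy calibration points
train = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
calib = [(float(i), i + 0.1 * (i + 1)) for i in range(9)]
lo, hi = split_conformal_interval(train, calib, x_new=5.0, alpha=0.1)
print(lo, hi)
```

With very few calibration points and a high confidence level, the interval becomes infinite, which is the honest answer when data are that scarce.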


A conformal predictor has been derived on the basis of classical kriging; this method was chosen because of the given data configuration. The next step is to derive an anisotropic approach for kriging, once more data is available. Also, a conformal predictor based on the most frequently used algorithm (I'd say "pop", but this is a serious blog), land use regression, is on the way.

If you have any questions, please do not hesitate to ask.

P.S. Now I am back to financial data analysis. My next post will be about personal income savings, the most popular bank in Spain for this purpose (in my perception) and its efficiency as such.

Tuesday, 6 November 2012

What I Did Last Saturday (feat. Banc Sabadell daily data)

Some time ago I decided that my ideal job situation would be to stay in Barcelona and work as a quantitative analyst, otherwise known as a quant. However, it seems that in Barcelona the term "quant" barely exists. These people are as rare as white truffles, and the algorithm for becoming one in this city is impossible to deduce.

A keyword search on LinkedIn revealed that quants mostly inhabit Spain's fourth-largest bank, Banc Sabadell (a Catalonia-based bank).

Banc Sabadell does not seem to need extra quants, or at least it does not post such offers online. But what it does post is an amazing piece of marketing featuring my glorious compatriot Yuri Gagarin:


The text below Gagarin's photo says that Banc Sabadell is the first Spanish bank to offer 24/7 customer service via Twitter. So, they're not only the first to hire quants. The ad is impressive, right? I headed to Banc Sabadell's webpage to try to apply for a job. I didn't find any relevant offer, but what I did find was publicly available data on their stock prices. This is valuable, since you cannot get this data from Yahoo! Finance: the ticker "SAB" is used for another company on global markets.

I downloaded the daily closing prices for the last 4 years. Within this time frame, the share price fell from €5.57 to €1.26, which is understandable considering the general recession in the Spanish economy. Banc Sabadell is one of the 35 companies included in the main Spanish stock market index, IBEX35. Here is a graph of the performance of both over the last 4 years:


Banc Sabadell's daily stock prices show a steady decrease over time. IBEX35 has followed a similar path but with more pronounced ups and downs. This is because IBEX35 also reflects the performance of another 34 companies. Also, the index is based on the weighted market capitalization of the businesses, not on their stock prices.

As I have mentioned before, I've been taking a course on financial econometrics on Coursera. While most of the methods repeat what I had already studied in my university courses on financial mathematics and econometrics, some approaches are really new to me. For instance, the constant expected return model is something I first came across in this course. I decided to test the approach on the Sabadell data.

First, I derived the continuously compounded returns for the shares. The data as it comes from the Banc Sabadell server contains missing values, in the sense that observations are absent for some days. Therefore, to obtain a regular time series (suitable for prediction and for fitting models like ARIMA), I converted the daily returns data to monthly averages:
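These two steps can be sketched in a few lines. This is a minimal illustration on made-up prices, not the code behind the numbers in this post; the function names are hypothetical. Note that a return computed across a gap in the dates covers more than one day.

```python
import math
from collections import defaultdict
from datetime import date

def cc_returns(prices):
    """Continuously compounded returns r_t = ln(P_t / P_{t-1})."""
    return [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

def monthly_means(dates, values):
    """Average the values falling into each (year, month), in date order."""
    buckets = defaultdict(list)
    for d, v in zip(dates, values):
        buckets[(d.year, d.month)].append(v)
    return {key: sum(v) / len(v) for key, v in sorted(buckets.items())}

# Made-up closing prices; some calendar days have no observation
dates = [date(2012, 9, 27), date(2012, 9, 28), date(2012, 10, 1), date(2012, 10, 3)]
prices = [2.00, 1.98, 1.95, 1.97]

returns = cc_returns(prices)
# Each return is dated by the day on which it is realised
monthly = monthly_means(dates[1:], returns)
print(monthly)
```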


The mean value of the continuously compounded returns is a tiny bit below zero, but for practical purposes these returns are zero. This is consistent with a t-test, which does not reject the hypothesis of a zero mean:

t = -0.5601, df = 48, p-value = 0.578
sample mean: -0.0005466
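The t statistic can be roughly checked from the summary statistics quoted in this post (49 monthly values, mean -0.0005466, standard deviation 0.0068). The sketch below is only a sanity check under those rounded inputs; the critical value 2.0106 is the approximate 97.5% quantile of Student's t with 48 degrees of freedom, taken from tables.

```python
import math
import statistics

def one_sample_t(values, mu0=0.0):
    """t statistic and degrees of freedom for H0: mean == mu0."""
    n = len(values)
    se = statistics.stdev(values) / math.sqrt(n)
    return (statistics.mean(values) - mu0) / se, n - 1

# Check against the rounded summary statistics quoted in the post
n, mean, sd = 49, -0.0005466, 0.0068
se = sd / math.sqrt(n)
t_check = mean / se          # about -0.56, close to the reported -0.5601
t_crit = 2.0106              # approx. t quantile (0.975, 48 df), from tables
ci = (mean - t_crit * se, mean + t_crit * se)
print(round(t_check, 2), [round(v, 4) for v in ci])
```

The small difference from the reported statistic comes from the standard deviation being rounded to two significant digits.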

The 95% confidence interval for the mean is -0.0025 to 0.0014. The plots of the autocorrelation and partial autocorrelation functions reveal no significant dependence on previous observations:



Thus, these monthly cc returns behave like a realisation of a white noise process.
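For reference, the usual check behind such a statement compares the sample autocorrelations with the approximate 95% band of ±1.96/sqrt(n), which for 49 observations is ±0.28. A minimal sketch follows, with hypothetical function names and a toy series rather than the Sabadell data.

```python
import math

def sample_acf(x, max_lag):
    """Sample autocorrelations for lags 1..max_lag."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    return [
        sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / c0
        for k in range(1, max_lag + 1)
    ]

def white_noise_band(n):
    """Approximate 95% band for the autocorrelations of white noise."""
    return 1.96 / math.sqrt(n)

# Toy series: strictly alternating values are strongly autocorrelated
toy = [1.0, -1.0] * 10
acf = sample_acf(toy, max_lag=3)
band = white_noise_band(len(toy))
print([abs(r) > band for r in acf])
```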

Is it Gaussian white noise?

To fit a constant expected return model to this data, I need to check the assumption that the residuals, otherwise known as "random news shocks", are normally distributed. The Jarque-Bera test rejects the hypothesis of normality:
Chi-squared = 44.65, df = 2, p-value = 2.02e-10
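The Jarque-Bera statistic combines the sample skewness S and kurtosis K as JB = n/6 * (S^2 + (K - 3)^2 / 4) and is compared with a chi-squared distribution with 2 degrees of freedom, whose upper tail probability is simply exp(-JB/2). A minimal sketch (hypothetical function name, toy data) that also reproduces the quoted p-value from the quoted statistic:

```python
import math

def jarque_bera(x):
    """Jarque-Bera statistic and its asymptotic chi-squared(2) p-value."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    # For 2 degrees of freedom the chi-squared survival function is exp(-x/2)
    return jb, math.exp(-jb / 2)

jb_toy, p_toy = jarque_bera([-1.0, 0.0, 1.0])

# p-value implied by the statistic quoted in the post
p_quoted = math.exp(-44.65 / 2)   # about 2.02e-10
print(jb_toy, p_toy, p_quoted)
```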


Gaussian white noise with the same mean (-0.00055) and the same standard deviation (0.0068) as the monthly cc returns can be simulated, and a constant expected return model can be fitted:



Then the histograms of the real and the simulated data can be compared. It is clear that the Banc Sabadell monthly continuously compounded returns are not normally distributed.
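The constant expected return model says that each return is a constant mean plus an independent normal shock, so simulating it only needs the two estimates quoted in this post. The sketch below is a minimal illustration with hypothetical function names and an arbitrary seed; to compare with real data, the real returns would be binned with the same `histogram` call.

```python
import random

def simulate_gaussian(n, mean, sd, seed=0):
    """Draw n independent normal values with a fixed seed for repeatability."""
    rng = random.Random(seed)
    return [rng.gauss(mean, sd) for _ in range(n)]

def histogram(values, lo, hi, bins):
    """Counts per equal-width bin on [lo, hi]; values outside are ignored."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v <= hi:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
    return counts

# Same length, mean and standard deviation as the monthly cc returns
simulated = simulate_gaussian(49, mean=-0.00055, sd=0.0068)
sim_counts = histogram(simulated, lo=-0.03, hi=0.03, bins=12)
print(sim_counts)
```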



While the simulated data is skewed but still oscillates around zero, the real data does not. Its sample mean, median and mode are a tiny bit below zero. Almost every month, the observed average continuously compounded return is about -0.1%, which explains the slow but steady decrease in stock prices. The maximum observed monthly return is 2.6%.
  
If I had purchased Banc Sabadell shares for €1000 in October 2008 (and I had not even started grad school then, so I couldn't have afforded a bigger investment anyway), what would this investment be worth right now?

First, let's check the historical VaR:


The 1% and 5% quantiles of the daily continuously compounded returns are -0.055 and -0.034 respectively, so the daily VaR on the €1000 investment amounts to €54.03 and €33.59. For the monthly averages of continuously compounded returns, the 1% and 5% historical VaR is €11.90 and €8.47 respectively, which is far smaller, as one would expect from averaged daily returns.
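Historical VaR takes an empirical quantile q of the cc returns and converts it into a loss on the invested amount, W * (1 - exp(q)). The sketch below assumes this conversion and a linearly interpolated quantile; both are assumptions about how the figures were obtained, and the function names and sample returns are made up.

```python
import math

def quantile(data, p):
    """Empirical quantile with linear interpolation between order statistics."""
    s = sorted(data)
    h = (len(s) - 1) * p
    lo = math.floor(h)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (h - lo) * (s[hi] - s[lo])

def var_from_quantile(q, wealth):
    """Loss (as a positive amount) implied by a cc-return quantile q."""
    return wealth * (1 - math.exp(q))

def historical_var(cc_returns, p, wealth):
    return var_from_quantile(quantile(cc_returns, p), wealth)

# Made-up returns, just to show the call
sample = [-0.05, -0.02, 0.0, 0.01, 0.03]
print(historical_var(sample, 0.25, 1000))

# Plugging in the rounded quantiles quoted in the post
print(var_from_quantile(-0.055, 1000), var_from_quantile(-0.034, 1000))
```

With the rounded quantiles this gives roughly €53.5 and €33.4, close to the quoted figures; the remaining gap is presumably due to rounding of the quantiles.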

The actual performance of the shares can be tracked using a simple loop. Had I invested €1000 in these shares in October 2008, i.e. 49 months ago, then judging by the real monthly returns I would now have some €984. This doesn't look bad, but it is somewhat doubtful, since the real price of the shares has fallen to less than a quarter of its value in these 4 years.

Indeed, if the real daily returns are used, and the performance of the €1000 investment is evaluated over the 1016-day time frame (the number of available daily observations is 1017), on October 29, 2012 I would have €373.30, which seems correct. So, having invested €1000 back then, I would have lost €627, which for a grad student is a solid amount.
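The gap between the two results has a simple mechanical explanation: each monthly value is an average of daily returns, so compounding 49 of them compounds only 49 "typical days" instead of about a thousand. A minimal sketch of the loop on made-up numbers (two months of 20 trading days, each day losing 1%; the function name is hypothetical):

```python
import math

def final_wealth(initial, cc_returns):
    """Compound an initial amount through a sequence of cc returns."""
    wealth = initial
    for r in cc_returns:
        wealth *= math.exp(r)
    return wealth

# Made-up data: 40 trading days, each with a cc return of -1%
daily = [-0.01] * 40
# Monthly averages of the daily returns, as used for the monthly series
monthly_avg = [sum(daily[:20]) / 20, sum(daily[20:]) / 20]

print(round(final_wealth(1000, daily), 2))
print(round(final_wealth(1000, monthly_avg), 2))
```

Here the daily series leaves about €670, while the two monthly averages leave about €980, the same kind of discrepancy as between €373.30 and €984.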

Moreover, I am inclined to think that students should not invest their money in stocks at all. Well, unless they are filthy rich. Instead, they should find a bank that offers a savings account with a fixed and stable interest rate, pay some 10% of their income into this account every month, and never touch the invested money until... well, I'm going to dig into this in a separate blog entry, perhaps.

SUMMARY: Banc Sabadell's shares perform no worse than the Spanish economy does. The average monthly returns data is misleading, since it is too smoothed to yield correct loss estimates. Also, the volatility/bias of the average monthly returns needs further investigation. The daily data as it is, however, contains missing values and is thus unsuitable for regular time series modeling. This is a problem for the data scientist, however, not the bank. I know.