Monday, February 2, 2015

World's inequality, repeated measures: R and SAS

Theory

Repeated-measures studies are conducted frequently in all areas of research. In epidemiology these designs are used very often. Sometimes they are referred to as longitudinal studies, implying that subjects are followed over time, with the aim of tracking their exposure to risk factors, their outcomes, or both.

In economics, longitudinal data are called panel data. These describe how a variety of economic factors (for instance, macroeconomic parameters like inflation, poverty or GDP) change over time across economic subjects. Panel data come in handy for competitiveness studies. Econometricians know a vast variety of recipes to cook such data sets: from plain univariate ANOVA to Bayesian MCMC models.

Many tutorials on the practical handling of repeated-measures data employ the dietox data set: 831 observations containing weekly weight measurements of slaughter pigs. So these designs are indeed encountered in every area of research.

In practice, two main families of models are used to deal with panel data: generalised linear mixed models (GLMM) and generalised estimating equations (GEE). A lot of smart scientific text has been written on both of them, so I won't elaborate on how they work. I'll only briefly mention that they provide somewhat similar inferences. GLMMs come up with "more precise" estimates thanks to random effects, and they offer a goodness-of-fit measure, the maximum likelihood. GEE models do not compute a likelihood; they use a so-called quasi-likelihood, which is computationally simpler to obtain.

The GEE approach has attracted many followers who praise its advantages over the GLMM method. Its main strengths are relative ease of computation and weaker dependence on a "correct" multivariate distribution specification: GEE models can yield consistent results even when the correlation structure of the data is misspecified. The main criticisms of GEE models in comparison to GLMMs are that they are population-averaged (read "less precise") and that the quasi-likelihood they use is only apt for estimation: it provides no means for testing model fit and therefore for choosing a correct model specification. However, Pan (2001) proposed a quasi-likelihood information criterion (QIC), and it has been successfully used in practice.

Also, I massively like the saying of George E. P. Box, a prominent statistician: "All models are wrong, but some are useful." I want a T-shirt with this phrase.

Practice

When it comes to the practical handling of data, the equally wrong GEE and GLMM methods are provided in Matlab, SAS and R; GEE has also made its way into Python's statsmodels. In any case, you can run R code from Python, and the rpy2 library is the most popular way to do this.

A recent paper (Nooraee et al. 2014) compared GEE models fitted by SAS, SPSS and several R packages and found differences in the output.

I mostly use R for reasons of convenience and, wait for it, crossplatformness (this word does exist!), but I have all due respect for, and can even use, SAS and Matlab. So I decided to fit a GEE model in different programs and see what comes out.

I have deliberately left Matlab behind. GEE models are implemented in Matlab via the GEEQBOX toolbox (a package designed by Sarah Ratcliffe and Justine Shults of the University of Pennsylvania), but it does not come with the standard distribution and enjoys no official MathWorks support. In SAS, however, GEE models are included in the base distribution and are supported accordingly.

And R is free software, which means that you use it at your own risk. It also means that if you want to do something, anything, there are at least two libraries written independently by different people that let you do it. In particular, if you want to fit a GEE model, your first choices would be the packages geepack, gee and repolr (see Nooraee et al. 2014 for a comprehensive comparison of their performance).

Inequality

So, what about the inequality?

In this entry, inequality does not refer to the comparison between analytical models or their practical realisations. What is meant is real-world economic inequality, at the macro level. I downloaded a small data set from the website of the World Bank. It shows economic disparity measured by the so-called Gini coefficient. 20 EU countries are used for the analysis, and the data cover 11 years (with missing values): from 2004 to 2014.

The following indicators have been used: the GINI index, mean consumption per capita, and the shares of income distributed among the quintiles of the population. The latter means: how much of the total income is held by the poorest 20%, and so on up to how much is held by the richest 20%. Yes, this is exactly the "how much of the world's money is possessed by the top 1%" question, but on a fractioned, deconstructed level. Then I computed an interquintile range (IQiR) as the difference in income share between the richest and the poorest quintiles. These factors were correlated with one another (Pearson) to check for basic linear relationships:


               gini  consum  1st qu  2nd qu  3rd qu  4th qu  5th qu    IQiR
gini           1.00   -0.56   -0.96   -0.97   -0.93   -0.01    0.99    1.00
consumption   -0.56    1.00    0.49    0.66    0.51   -0.12   -0.55   -0.54
1st quintile  -0.96    0.49    1.00    0.90    0.81   -0.16   -0.93   -0.96
2nd quintile  -0.97    0.66    0.90    1.00    0.92   -0.06   -0.96   -0.96
3rd quintile  -0.93    0.51    0.81    0.92    1.00    0.25   -0.96   -0.93
4th quintile  -0.01   -0.12   -0.16   -0.06    0.25    1.00   -0.13   -0.04
5th quintile   0.99   -0.55   -0.93   -0.96   -0.96   -0.13    1.00    0.99
IQiR           1.00   -0.54   -0.96   -0.96   -0.93   -0.04    0.99    1.00

It can be seen that the interquintile range correlates with the GINI coefficient almost perfectly (r = 0.999), and so does the 5th quintile, which stands for the share of income held by the richest 20%. Basically, the wealth of the wealthiest alone reflects inequality almost perfectly.
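For clarity, the interquintile range computation and the Pearson check can be sketched like this (toy numbers, not the actual World Bank figures):

```python
import pandas as pd

# toy country-level figures, illustrative only
d = pd.DataFrame({
    "gini": [30.2, 35.1, 27.4, 33.0],
    "q1":   [8.5, 6.9, 9.4, 7.5],      # income share of the poorest 20%
    "q5":   [38.0, 42.5, 35.6, 40.8],  # income share of the richest 20%
})
d["iqir"] = d["q5"] - d["q1"]          # interquintile range
print(d[["gini", "q5", "iqir"]].corr())  # Pearson correlation by default
```

Even with made-up numbers the pattern is the same: the richest quintile's share and the IQiR track the Gini coefficient closely.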

A scatterplot: [figure omitted]
I decided to proceed with this predictor because mean consumption was only available for 16 observations. My other idea was to add some GDP measure as a predictor, but my aim was to compare software performance, not to make a macroeconomic finding, so I left that urge unattended.

A very basic GEE model uses 76 observations of 18 countries over 9 years. In both the R and the SAS realisations, these models operate on complete observations with no missing values. GEE fits population-averaged models and relies on the concept of a working correlation structure over time. Here the correlation structure is assumed to be autoregressive of order 1, because the inequality of a given year will have the strongest relationship with the inequality of the preceding and/or the subsequent year.
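Under AR(1), the working correlation between measurements at occasions i and j is alpha to the power |i - j|, so it decays geometrically with the time lag. A small sketch, using 0.9541 (the alpha SAS reports for this model) as an example value:

```python
import numpy as np

def ar1_corr(n_times: int, alpha: float) -> np.ndarray:
    """AR(1) working correlation matrix: entry (i, j) equals alpha ** |i - j|."""
    idx = np.arange(n_times)
    return alpha ** np.abs(idx[:, None] - idx[None, :])

R = ar1_corr(9, 0.9541)
print(R[0, :3])  # 1.0, alpha, alpha squared: geometric decay along the row
```

Note how the off-diagonals of the SAS working correlation matrix below are exactly the successive powers of a single alpha; that one parameter is all AR(1) estimates.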


The same analysis can be done in SAS. The output:


GEE Model Information
Correlation Structure AR(1)
Subject Effect country_name (18 levels)
Number of Clusters 18
Correlation Matrix Dimension 9
Maximum Cluster Size 9
Minimum Cluster Size 1


Algorithm converged.


Working Correlation Matrix
Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9
Row1 1.0000 0.9541 0.9103 0.8685 0.8286 0.7906 0.7543 0.7197 0.6867
Row2 0.9541 1.0000 0.9541 0.9103 0.8685 0.8286 0.7906 0.7543 0.7197
Row3 0.9103 0.9541 1.0000 0.9541 0.9103 0.8685 0.8286 0.7906 0.7543
Row4 0.8685 0.9103 0.9541 1.0000 0.9541 0.9103 0.8685 0.8286 0.7906
Row5 0.8286 0.8685 0.9103 0.9541 1.0000 0.9541 0.9103 0.8685 0.8286
Row6 0.7906 0.8286 0.8685 0.9103 0.9541 1.0000 0.9541 0.9103 0.8685
Row7 0.7543 0.7906 0.8286 0.8685 0.9103 0.9541 1.0000 0.9541 0.9103
Row8 0.7197 0.7543 0.7906 0.8286 0.8685 0.9103 0.9541 1.0000 0.9541
Row9 0.6867 0.7197 0.7543 0.7906 0.8286 0.8685 0.9103 0.9541 1.0000


GEE Fit Criteria
QIC 79.3518
QICu 86.0000


Analysis Of GEE Parameter Estimates
Model-Based Standard Error Estimates
Parameter Estimate Standard Error 95% Confidence Limits Z P-value
Intercept -16.3367 0.9553 -18.2090 -14.4645 -17.10 <.0001
year 2004 -0.2521 0.2592 -0.7602 0.2560 -0.97 0.3309
year 2005 -0.2286 0.2650 -0.7480 0.2907 -0.86 0.3882
year 2006 -0.2318 0.2598 -0.7410 0.2774 -0.89 0.3722
year 2007 -0.1116 0.2535 -0.6085 0.3852 -0.44 0.6597
year 2008 -0.3641 0.2533 -0.8605 0.1324 -1.44 0.1506
year 2009 -0.2944 0.2585 -0.8011 0.2124 -1.14 0.2549
year 2010 0.2325 0.2425 -0.2428 0.7078 0.96 0.3378
year 2011 0.0223 0.2294 -0.4273 0.4719 0.10 0.9226
year 2012 0.0000 0.0000 0.0000 0.0000 . .
fifth_quintile 1.2089 0.0233 1.1633 1.2545 51.97 <.0001
Scale 0.7654 . . . . .


The zero estimate for year 2012 is partly an artefact of SAS treating the last level of a CLASS variable as the reference, and partly explained by the fact that there is only one observation for this data point, Romania (Gini of 27.3). It is not good practice to include such observations in a model, but they are handy for showing differences in software output.

SAS and R (geepack) have yielded somewhat different estimates. The scale parameter reported by R is 0.3038, while SAS reports 0.765. In the correlation structure, the alpha reported by R is 0.698 and the one reported by SAS is 0.954. The significance, direction and magnitude of the coefficient estimates hold.
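One plausible source of the scale discrepancy is the divisor used when estimating the dispersion from squared Pearson residuals; whether a given package divides by N or by N - p (or uses a different parameterisation altogether) is something to check in its documentation, so treat this sketch as an assumption with toy numbers:

```python
import numpy as np

pearson_resid = np.array([0.5, -1.2, 0.8, -0.3, 1.1, -0.9])  # toy residuals
n, p = pearson_resid.size, 2   # toy counts of observations and parameters
scale_div_n  = np.sum(pearson_resid ** 2) / n        # divisor N
scale_div_np = np.sum(pearson_resid ** 2) / (n - p)  # divisor N - p
print(scale_div_n, scale_div_np)  # same residuals, two different "scales"
```

With a small effective sample, the two conventions alone can shift the reported scale noticeably, even before any difference in the fitting algorithm.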

Bonus track: Pigs

So, as I mentioned in the beginning, the dietox data set is often used for training purposes in R. Not having found any comparison of analyses of this data set between R and SAS, I ran a very plain model in both: a GEE model with a Poisson distribution function and a logarithmic link (the default), where a pig's weight is modelled as a function of time and copper intake. Coefficient estimates and standard errors differed in the second decimal place, which is fine and may be due to different convergence algorithms and floating-point handling. However, the scale and correlation parameters showed differences in the first decimal place.

Résumé

SAS and R come up with slightly different estimates, yet for the sake of practical analysis the difference is negligible. The results are generally consistent with one another.
