Friday, June 6, 2014

p-values for test statistics in R

Imagine that you have a statistical hypothesis (a null hypothesis) that you are about to test. You have (or know how to get) a test statistic value, but the p-value for this statistic is, for some reason, not provided by the program.

Normally, statistical software packages, such as R and the relevant Python libraries (numpy, scipy), report p-values together with the computed statistic values, but exceptions still happen.

In applied data analysis, the p-value has become a down-to-earth guideline for suspecting that something could be true, or significant, or not. As defined, a p-value measures the strength of the evidence against the null hypothesis: it is the probability of obtaining a value of the test statistic at least as extreme as the one observed, given that the null hypothesis is true. If the p-value is below a certain threshold, called the significance level, the null hypothesis is rejected; otherwise, it is not rejected.

There are a couple of nice short videos out there, like this one, that explain the concept behind p-values quite well.

Knowing the distribution of the test statistic, it is easy to find the relevant p-value. One approach is to use statistical tables for the distributions. The other is to compute the p-value using statistical software.

R provides a set of functions like pnorm, pt, etc. Their names start with "p" (for "probability") followed by the abbreviation of the relevant distribution. These functions can be used to compute p-values. By default, they output one-sided, left-tailed probabilities: R computes the probability of obtaining a value of the test statistic at least as small as the one observed. For right-tailed probabilities, the parameter lower.tail in the function call should be set to FALSE.
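For example, for the standard normal distribution at a value of 1.96, the two tails look like this:

```r
# Left-tailed probability: P(Z <= 1.96)
pnorm(1.96)                      # approximately 0.975

# Right-tailed probability: P(Z > 1.96)
pnorm(1.96, lower.tail = FALSE)  # approximately 0.025
```

The two probabilities sum to 1, since they are complementary tails of the same distribution.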

Now, the other thing is that testing some hypotheses requires two-sided tests. The basic rule of thumb here - if I correctly recall my graduate stats classes - is the following: if the hypothesis is formulated using the sign "=", then a two-sided test is needed, whereas if it is formulated using the signs "<, <=, >, >=", then a one-sided test should be used.

For a two-sided test, the one-sided p-value should be multiplied by 2. This can be done as follows in R. If we assume that our test statistic is a z-score, and the value we get for it is -3.7, then the one-sided and the two-sided p-values can be computed as follows:
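A minimal sketch, using pnorm for the standard normal distribution and the z-score of -3.7 assumed above:

```r
# The observed z-score
z <- -3.7

# One-sided (left-tailed) p-value: P(Z <= z)
p_one_sided <- pnorm(z)
p_one_sided                 # approximately 0.000108

# Two-sided p-value: double the tail probability beyond |z|
p_two_sided <- 2 * pnorm(-abs(z))
p_two_sided                 # approximately 0.000216
```

Taking -abs(z) before doubling makes the two-sided formula work for positive z-scores as well, since it always picks the tail beyond the observed value.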

Now, common sense suggests that -3.7 is a strange value for a z-score, since z-scores follow the standard normal distribution. Therefore, their mean is 0 and their standard deviation is 1. 99.7% of the data lies within 3 standard deviations of the mean, so anything beyond the range [-3, 3] already looks suspicious. However, if a precise p-value is needed, it should be computed or obtained from a table.
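The 99.7% figure itself is easy to verify with pnorm:

```r
# Probability mass of the standard normal within 3 standard deviations of the mean
pnorm(3) - pnorm(-3)  # approximately 0.9973
```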

Also, sometimes p-values are not provided in regression model outputs, where they can be very useful for making an inference about the significance of a given predictor in the regression equation. For example, some model outputs obtained with the package VGAM - the one aimed at fitting vector generalized additive models - lack p-values for the coefficient estimates.

The code below fits a proportional odds model - it is an example from the developers of the package. Here, the z-scores for the coefficient estimates suggest significance for both the intercepts and the independent variable. The hypothesis here is two-sided.
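A sketch of such a fit, adapted from the proportional odds example in the VGAM documentation (the pneumo data set ships with the package; I am assuming this is the example in question), with the two-sided p-values recovered from the reported z-values:

```r
library(VGAM)

# Proportional odds model on the built-in pneumo data set
pneumo <- transform(pneumo, let = log(exposure.time))
fit <- vglm(cbind(normal, mild, severe) ~ let, propodds, data = pneumo)

# The coefficient table reports estimates, standard errors and z-values
ctable <- coef(summary(fit))

# Two-sided p-values from the z-values, via the standard normal distribution
pvals <- 2 * pnorm(-abs(ctable[, "z value"]))
cbind(ctable, "p value" = pvals)
```

The same 2 * pnorm(-abs(z)) trick from above does all the work here; the rest is just extracting the z-values from the summary table.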

Last but not least, I would like to announce that I will not be attending the useR! 2014 conference, although my abstract was accepted. This is a matter of some personal circumstances, and I truly look forward to applying for the conference next year, hopefully getting accepted, and going there.