Multicollinearity


Advance warning: This is a tedious post, and it is extremely unlikely you will find it either interesting or informative.

Santosh Anagol is an economics PhD student at Yale, and he blogs at Brown Man's Burden. Going through his stuff, I came across a short paper he wrote back in 2004 about the implications of multicollinearity. (I won't link to Wikipedia on this, as its article on multicollinearity is lacking and potentially misleading; for more information, read a standard econometrics textbook.)

What he does is simple enough:


y = β1·x1 + β2·x2 + u

where β1 = 2 and β2 = 1, the error is normally distributed and uncorrelated with the x's, etc. He then proceeds to run the regression three times, with σ12 (the covariance of x1 with x2) going from zero to .99.

At correlations below .999 our statistical model nails the point estimates and has large t-values. So we don’t need to worry about correlated regressors unless the correlation is EXTREMELY high.
Talking about variables with a correlation of .99 is not very relevant for practical purposes (for many popular datasets, I doubt the correlation between the recorded values and their true values is even as high as .95). In any case, the sample size chosen (1000) is large, and it is not surprising that the OLS estimators yield estimates close to the true values even in the presence of .95 correlation (not surprising to an experienced econometrician, that is; see the conclusion to the post). What is more interesting to observe is how the confidence interval around these point estimates changes as σ12 is chosen to be higher. With the x's barely correlated, the confidence interval for the coefficient on x1 is roughly .13 units wide; with σ12=.5 it goes to .15 units, and at σ12=.95 it reaches almost .4 units.
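For anyone who wants to replicate this kind of exercise, here is a minimal sketch in Python (using numpy and statsmodels). The coefficients of 2 and 1, the sample size of 1000 and the correlation levels are taken from the discussion above; the unit-variance normal x's and the standard normal error are my own assumptions, since Santosh's exact setup is not reproduced here.

```python
import numpy as np
import statsmodels.api as sm

def fit_once(rho, n=1000, seed=0):
    """Simulate y = 2*x1 + 1*x2 + u with corr(x1, x2) = rho and fit OLS."""
    rng = np.random.default_rng(seed)
    x = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
    y = 2 * x[:, 0] + 1 * x[:, 1] + rng.normal(size=n)
    return sm.OLS(y, sm.add_constant(x)).fit()

for rho in [0.0, 0.5, 0.95, 0.99]:
    res = fit_once(rho)
    ci = res.conf_int()  # 95% confidence intervals, one row per coefficient
    print(f"rho={rho:.2f}  b1={res.params[1]:.3f}  b2={res.params[2]:.3f}  "
          f"t1={res.tvalues[1]:.1f}  CI width for b1={ci[1, 1] - ci[1, 0]:.3f}")
```

The number to watch as rho rises is the confidence-interval width.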

Continuing with the results:
With regressors that have correlations around .99, we get some bad results. In this case the point estimates are off, and one of them is significant. This would obviously be the wrong conclusion about the DGP.



'Statistical significance' is often misunderstood to be a measure of confidence in the point estimate, but it is nothing of the sort. Finding an estimate to be 'statistically significant' simply means that the (95%, in this case) confidence interval does not include zero - in other words, there's a low chance that the true value of the parameter in the population is zero, and thus the variable of interest is likely to have an effect on y.

So, the conclusions we would draw about the DGP from the above results are actually the right ones: β1 is not likely to be equal to zero (and it isn't; it equals 2 by construction), and there's a 95% chance that β2 lies between -3.12 and 2.9897 (which it does; by construction, β2=1). The only reason β1 is found to be statistically significant and β2 not is that β1 was picked to equal 2 and β2 to equal 1, so we need to feed our estimators with more information to establish that x2 has an effect on y than we do to establish the same thing for x1.
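To see why the larger coefficient is the one that clears the significance hurdle, here is a toy calculation. The numbers are made up for illustration (they are not Santosh's output); the point is only that two estimates with the same standard error are treated very differently by a test against zero.

```python
se = 0.6                                   # hypothetical standard error, identical for both
for name, b in [("b1 (true value 2)", 1.9), ("b2 (true value 1)", 0.9)]:
    t = b / se                             # t-statistic against the null of zero
    lo, hi = b - 1.96 * se, b + 1.96 * se  # 95% confidence interval
    print(f"{name}: t = {t:.2f}, 95% CI = ({lo:.2f}, {hi:.2f}), "
          f"significant: {abs(t) > 1.96}")
```

Only the interval around the smaller coefficient reaches down to zero, so only that coefficient fails the significance test, even though both are estimated with the same precision.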

The point estimates are indeed off, but this is purely due to the particular random sample - and the large confidence intervals alert us to the possibility that this is the case. Run the same model with a larger sample size (or take a large number of other random samples and plot the distribution of your estimators), and the OLS estimates will be spot on.
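A quick way to check this is to rerun the high-correlation case over many simulated samples, or on one much larger sample. The sketch below uses the same assumed DGP as before; the 500 replications and the n of 100,000 are arbitrary choices of mine.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

def slopes(n, rho=0.99):
    """One random sample from the assumed DGP and the fitted slope coefficients."""
    x = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
    y = 2 * x[:, 0] + 1 * x[:, 1] + rng.normal(size=n)
    return sm.OLS(y, sm.add_constant(x)).fit().params[1:]  # drop the constant

# Sampling distribution of the estimator: average over many samples of size 1000 ...
draws = np.array([slopes(1000) for _ in range(500)])
print("mean estimates across 500 samples:", draws.mean(axis=0))   # close to (2, 1)

# ... or simply one much larger sample.
print("estimates from a single n=100,000 sample:", slopes(100_000))
```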

And a final observation:

If two variables are highly correlated, will it screw up coefficients on other, exogenous variables? I ran the model with another regressor x3 that was uncorrelated with x1 and x2, and with a coefficient of 3 in the data generating process. The degree of correlation between x1 and x2 DOES NOT CHANGE point estimates and t-stats of our coefficient on x3.


...which is to be expected from theory. Any explanatory variable that is not correlated with the x's of interest does not need to enter the model at all - it can safely reside in the error term without introducing any bias (the Gauss-Markov assumptions only require the error to have a zero expected value given the x's). The coefficient on x3 would be the same even if x1 and x2 were not included in the regression, and the coefficients on x1 and x2 are not affected by the inclusion of x3 in the specification.
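The same simulated setup can be used to see this directly: add an x3 that is uncorrelated with the other regressors and has a coefficient of 3 (as in Santosh's extension), then compare the coefficient on x3 with and without the correlated pair in the regression. Again, this is a sketch under the assumptions above, not his actual code.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, rho = 1000, 0.99
x12 = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
x3 = rng.normal(size=n)                        # uncorrelated with x1 and x2
y = 2 * x12[:, 0] + 1 * x12[:, 1] + 3 * x3 + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x12, x3]))).fit()
short = sm.OLS(y, sm.add_constant(x3)).fit()   # x1 and x2 left in the error term

print("coefficient on x3, full regression:   ", round(full.params[3], 3))
print("coefficient on x3, x3-only regression:", round(short.params[1], 3))
```

Both estimates come out close to 3: leaving the highly correlated pair in the error term inflates the residual variance, but introduces no bias in the coefficient on x3.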

Before leaving this post, I should make clear that I am not critical of Santosh's note; in fact, I think it's great and his effort is to be applauded. From the introduction to the paper:

I’ve been confused for a while about the effects of having x variables that are correlated. This is pretty embarrassing, given this is undergrad metrics stuff. But I’ve also seen enough grad students and professors throw around “multicollinearity” without really understanding its implications that it’s worth straightening out.

This is not 'undergrad metrics stuff' at all. It is true that economics undergrads learn about the qualitative effect of 'multicollinearity', but an understanding of its significance in practice only comes after substantial exposure to the literature and hands-on experience (as with so many things in econometrics). Santosh's attitude is the right one, and playing around with simulated data is a great, low-cost way to digest the theory and really understand econometrics - one that tutors should be encouraging far more than is currently the case.



by datacharmer | Sunday, October 07, 2007