Collinear predictors present a challenge in model construction and interpretation. This topic is covered in an intuitive and engaging style in Chapter 5 of the excellent book Statistical Rethinking, by Richard McElreath.
Predictors are collinear when they correlate strongly with one another. To see why this can be a problem, consider the following example.
Let’s say we have:
Independent variables: x, y
Dependent variable: z
We can use R to generate sample data sets in the following way:
x = rnorm(50)
y = rnorm(50,mean = -x,sd = 1)
This draws 50 normally distributed samples, x, with mean 0 and sd 1. It then draws each y from a normal distribution with sd 1 whose mean is the corresponding -x. The result is a pair of negatively correlated variables.
Correlation varies from -1 to +1, with 0 being uncorrelated and +/- 1 indicating perfect (linear) correlation.
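To put a number on that correlation, here is a minimal sketch using R's built-in cor(). The set.seed() call and its value are assumptions added only for reproducibility, so the numbers it prints will not match the ones quoted elsewhere in this post.

```r
# Assumption: the seed is arbitrary and not part of the original example.
set.seed(1)
x <- rnorm(50)
y <- rnorm(50, mean = -x, sd = 1)

# Sample (Pearson) correlation between the two predictors.
r_xy <- cor(x, y)
print(r_xy)

# Value implied by the generating process:
# Cov(x, y) = -1, Var(x) = 1, Var(y) = 2, so cor = -1 / sqrt(2).
r_theory <- -1 / sqrt(2)
print(r_theory)
```

With only 50 samples the sample correlation will scatter around the theoretical value of about -0.71 rather than hit it exactly.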
We can plot this data in 2 dimensions to get a better picture:
As we can see, x and y are clearly negatively correlated.
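A scatter plot along these lines can be drawn with base R graphics. This is only a sketch: the seed, the title and the dashed reference line are illustrative choices, not necessarily what was used for the original figure.

```r
# Assumption: arbitrary seed, used only so the sketch is reproducible.
set.seed(1)
x <- rnorm(50)
y <- rnorm(50, mean = -x, sd = 1)

# Scatter plot of the two predictors.
plot(x, y, main = "y against x")

# Reference line y = -x, the mean of y given x.
abline(a = 0, b = -1, col = "gray", lty = 2)
```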
We can now generate a sample for z:
z = rnorm(50, mean = x+y)
Let’s take a look at the x-z and y-z plots:
Here x and z appear completely uncorrelated, while y and z look to have a slight positive correlation. Is that all we can do?
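The two plots can be reproduced side by side with base R, and cor() gives a numeric check on what the eye sees. As before, the seed and the plot titles are assumptions made for this sketch.

```r
# Assumption: arbitrary seed, used only so the sketch is reproducible.
set.seed(1)
x <- rnorm(50)
y <- rnorm(50, mean = -x, sd = 1)
z <- rnorm(50, mean = x + y)

# Two panels: z against each predictor on its own.
old_par <- par(mfrow = c(1, 2))
plot(x, z, main = "z against x")
plot(y, z, main = "z against y")
par(old_par)  # restore the previous layout

# Pairwise correlations of each predictor with z.
cors <- c(cor_xz = cor(x, z), cor_yz = cor(y, z))
print(cors)
```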
We can also use built-in R functions to look at the point estimates for the slopes of a linear regression on these data sets:
LMxz = lm(z ~ x)
LMyz = lm(z ~ y)
As we suspected, the slope for x is close to 0, while for y it is larger and positive.
(Note that we have only looked at the point estimates for the slopes without quantifying uncertainty. In general, it’s better to have an estimate for the uncertainty in model parameters, but we’ll leave that for another time.)
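Printing a fitted lm object shows its coefficients, but the slopes can also be pulled out directly with coef(). The sketch below refits the two single-predictor models on freshly generated data. The seed is an assumption, so the slopes it prints will differ from the ones quoted in this post.

```r
# Assumption: arbitrary seed, used only so the sketch is reproducible.
set.seed(1)
x <- rnorm(50)
y <- rnorm(50, mean = -x, sd = 1)
z <- rnorm(50, mean = x + y)

# Single-predictor regressions, as in the text.
LMxz <- lm(z ~ x)
LMyz <- lm(z ~ y)

# coef() returns a named vector: intercept first, then the slope.
slope_x <- coef(LMxz)[["x"]]
slope_y <- coef(LMyz)[["y"]]
print(c(x_only = slope_x, y_only = slope_y))
```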
Is this the whole story? Let’s try the following:
LMxyz = lm(z ~ x + y)
Here we can see that when the linear model includes both x and y, both slopes are positive and near 1. What happened to our other coefficients, -0.17 (x alone) and 0.51 (y alone)?
Picture a three-dimensional space, with the x-axis coming towards you, the y-axis going left and right, and the z-axis going up and down.
Now remember, most of our data is actually scattered roughly around the line y = -x. Combine that with the plane z = x+y (roughly what we got from the multivariate linear model), and this is what it looks like:
The plane cuts through the origin, coming down, towards us and to the left, while going away, up and to the right. The line y = -x is indicated by a thin gray line. The line and the plane intersect at z = 0 (!): every point on the line has x + y = 0.
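A rough version of this picture can be drawn with base R's persp() and trans3d(). This is a sketch only: the grid range, viewing angles and colors are arbitrary assumptions, and the plane is the idealized z = x + y rather than the fitted one.

```r
# Grid over which to draw the idealized plane z = x + y.
g <- seq(-3, 3, length.out = 25)
plane <- outer(g, g, "+")

# persp() draws the surface and returns the viewing transformation matrix.
pmat <- persp(g, g, plane, theta = 30, phi = 20,
              xlab = "x", ylab = "y", zlab = "z", col = "lightblue")

# Project the line y = -x at height z = 0 into the 2D plot and overlay it.
ln <- trans3d(g, -g, 0, pmat)
lines(ln, col = "gray40", lwd = 2)
```

Along the gray line the surface height is exactly zero, which is the cancellation discussed in the text.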
This is how the single-predictor models miss or understate the effect while the two-predictor model shows it. The data cluster around the line y = -x, where the plane sits at z = 0, so an increase in x tends to come with an offsetting decrease in y, and z barely moves. It is only after controlling for y that the effect of x can be determined (and vice versa).
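The same conclusion can be reached on paper. The short derivation below writes the generating process with explicit noise terms (the symbols \varepsilon_y and \varepsilon_z are names introduced here, not in the original code) and uses the fact that a single-predictor least-squares slope estimates covariance divided by variance.

```latex
\begin{align*}
y &= -x + \varepsilon_y, \qquad z = x + y + \varepsilon_z, \qquad
  x,\ \varepsilon_y,\ \varepsilon_z \sim \mathcal{N}(0, 1) \text{ independent} \\
\operatorname{Var}(y) &= \operatorname{Var}(x) + \operatorname{Var}(\varepsilon_y) = 2, \qquad
  \operatorname{Cov}(x, y) = -\operatorname{Var}(x) = -1 \\
\text{slope of } z \sim x &= \frac{\operatorname{Cov}(x, z)}{\operatorname{Var}(x)}
  = \frac{\operatorname{Var}(x) + \operatorname{Cov}(x, y)}{\operatorname{Var}(x)}
  = \frac{1 - 1}{1} = 0 \\
\text{slope of } z \sim y &= \frac{\operatorname{Cov}(y, z)}{\operatorname{Var}(y)}
  = \frac{\operatorname{Cov}(x, y) + \operatorname{Var}(y)}{\operatorname{Var}(y)}
  = \frac{-1 + 2}{2} = \frac{1}{2}
\end{align*}
```

These population values, 0 and 1/2, are consistent with the sample slopes of -0.17 and 0.51 once the noise from having only 50 points is taken into account.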
The predictive power of an x-only or y-only model is not awful (though it is worse than that of the model with both predictors), so long as new data still cluster around y = -x. However, the interpretation of the model is significantly clouded by the collinearity between x and y until the model with both predictors is inspected.
Fun side note: the result we get nearly recovers the statistical process used to generate z ( z = rnorm(50, mean = x+y) ).
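One way to see how close the recovery can get is to repeat the experiment with far more data. The sketch below assumes an arbitrary seed and a sample size of 100,000, both chosen only for illustration. With that many points the fitted intercept should land near 0 and both slopes near 1.

```r
# Assumptions: arbitrary seed and a much larger sample than in the post.
set.seed(1)
n <- 1e5
x <- rnorm(n)
y <- rnorm(n, mean = -x, sd = 1)
z <- rnorm(n, mean = x + y)

# Same two-predictor model as LMxyz, fitted to the larger sample.
fit <- lm(z ~ x + y)
print(round(coef(fit), 3))
```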