Statistical Diversions: Collinear Predictors

Collinear predictors present a challenge in model construction and interpretation. This topic is covered in an intuitive and engaging style in Chapter 5 of the excellent book Statistical Rethinking, by Richard McElreath.

Predictors are collinear when they correlate strongly with one another. To see why this can be a problem, consider the following example.

Let’s say we have:

Independent variables: x,  y

Dependent variable: z

We can use R to generate sample data sets in the following way:

x = rnorm(50)
y = rnorm(50,mean = -x,sd = 1)

This draws 50 normally distributed samples, x, with mean 0 and sd 1. It then draws 50 values of y, each using the negative of the corresponding x as its mean. This produces correlated variables:

cor(x,y)
[1] -0.7490608

Correlation ranges from -1 to +1, with 0 indicating no linear correlation and +/- 1 indicating perfect linear correlation.
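The value above can be checked against what the generating process implies. Writing y = -x + e, where e is independent standard normal noise, gives cov(x, y) = -1 and var(y) = 2, so the expected correlation is -1/sqrt(2), or about -0.71. Here is a minimal sketch of that check. The seed and the larger sample size are assumptions added for illustration and are not part of the original run:

```r
set.seed(1)   # arbitrary seed (an assumption), only for reproducibility
n = 1e5       # far more than the 50 samples above, to shrink sampling noise

x = rnorm(n)
y = rnorm(n, mean = -x, sd = 1)

# cov(x, y) / (sd(x) * sd(y)) = -1 / (1 * sqrt(2))
theory = -1 / sqrt(2)

c(sample = cor(x, y), theory = theory)
```

With only 50 samples, some scatter around the theoretical value is to be expected.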

We can plot this data in 2 dimensions to get a better picture:

[Figure: scatter plot of x and y]

As we can see, x & y are clearly correlated.

We can now generate a sample for z:

z = rnorm(50, mean = x+y)

Let’s take a look at the x-z and y-z plots:

[Figure: scatter plot of x and z]

Here x and z appear completely uncorrelated.

[Figure: scatter plot of y and z]

y and z, on the other hand, look to have a slight positive correlation. Is that all we can do?

We can also use built-in R functions to look at the point estimates for the slopes of a linear regression on these data sets:

LMxz = lm(z ~ x)
LMxz$coefficients[2]

x
-0.1678911

LMyz = lm(z ~ y)
LMyz$coefficients[2]

y
0.5118612

As we suspected, the slope for x is close to 0, while for y it is larger and positive.
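These slopes are also roughly what the generating process predicts. Substituting y = -x + e into z = x + y + noise leaves z = e + noise, which does not depend on x, so the expected x-only slope is 0. For the y-only model, the expected slope is cov(z, y) / var(y) = 1/2. A rough sketch of that check follows, where the seed and the large sample size are assumptions made for illustration:

```r
set.seed(2)   # arbitrary seed (an assumption)
n = 1e5       # large sample so the estimates sit close to their expectations

x = rnorm(n)
y = rnorm(n, mean = -x, sd = 1)
z = rnorm(n, mean = x + y)

slope_x = unname(coef(lm(z ~ x))[2])   # expected to be near 0
slope_y = unname(coef(lm(z ~ y))[2])   # expected to be near 0.5

c(x_only = slope_x, y_only = slope_y)
```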

(Note that we have only looked at the point-estimates for the slopes without quantifying uncertainty. In general, it’s better to have an estimate for the uncertainty in model parameters, but we’ll leave that for another time.)
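For readers who want a quick look at that uncertainty now, one option is R's built-in confint(), which returns confidence intervals (95% by default) for the coefficients of a fitted lm. This is only a sketch on freshly simulated data with an arbitrary seed, so the numbers will not match the ones printed above:

```r
set.seed(3)   # arbitrary seed (an assumption); results differ from the post's run

x = rnorm(50)
y = rnorm(50, mean = -x, sd = 1)
z = rnorm(50, mean = x + y)

# Pull out the slope row from each single-predictor fit
ci_x = confint(lm(z ~ x))["x", ]
ci_y = confint(lm(z ~ y))["y", ]

rbind(x_only = ci_x, y_only = ci_y)
```

An interval that comfortably spans 0 is consistent with the "no visible effect" impression from the scatter plot.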

Is this the whole story? Let’s try the following:

LMxyz = lm(z ~ x + y)
LMxyz$coefficients[2:3]

        x         y
0.9391245 0.9768274

Here we can see that when the linear model includes both x & y, both slopes are positive and near 1. What happened to our other coefficients, -0.17 and 0.51?

Picture a three-dimensional space, with the x-axis coming towards you, the y-axis going left and right, and the z-axis going up and down.

[Figure: 3D plot of the data on x, y and z axes]

Now remember, most of our data is actually scattered roughly around the line y = -x. Combine that with the plane z = x + y (roughly what we got from the linear model with both predictors), and this is what it looks like:

[Figure: 3D plot of the data with the plane z = x + y and the line y = -x]

The plane cuts through the origin, coming down, towards us and to the left, while going away, up and to the right. The line y = -x is indicated by a thin gray line. Since x + y = 0 everywhere along that line, the line lies in the plane at z = 0 (!).

This is why the single-predictor models show little or no effect, while the model with both predictors does. The data cluster around the line y = -x, where the plane sits at z = 0, so any increase in z due to x tends to be cancelled by the accompanying decrease in y. It is only after controlling for y that the effect of x can be determined (and vice versa).
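One way to make "controlling for y" concrete is to strip out the part of x that y linearly explains and regress z on what is left. By a standard result for least squares (the Frisch-Waugh-Lovell theorem), the resulting slope equals the x coefficient from the model with both predictors. Here is a minimal sketch on freshly simulated data. The seed and the names x_resid, slope_partial and slope_full are introduced here for illustration:

```r
set.seed(4)   # arbitrary seed (an assumption)

x = rnorm(50)
y = rnorm(50, mean = -x, sd = 1)
z = rnorm(50, mean = x + y)

# The part of x that y does not linearly explain
x_resid = resid(lm(x ~ y))

slope_partial = unname(coef(lm(z ~ x_resid))[2])     # z on the leftover part of x
slope_full    = unname(coef(lm(z ~ x + y))["x"])     # x coefficient with y included

c(partial = slope_partial, full = slope_full)
```

The two numbers agree up to floating-point error, which is the sense in which the two-predictor model isolates the effect of x.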

The predictive power of an x-only or y-only model is not awful (though it is worse), so long as the new data still clusters around y = -x. However, the interpretation of the model is significantly clouded by the collinearity between x & y until the model with both predictors is inspected.
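The claim about predictive power can be sketched by fitting all three models on one sample and scoring them on new data drawn from the same process. The helper sim(), the seed, the size of the held-out set and the use of root-mean-square error are all assumptions made for this illustration:

```r
set.seed(5)   # arbitrary seed (an assumption)

# Hypothetical helper: draw n points from the same process used in this post
sim = function(n) {
  x = rnorm(n)
  y = rnorm(n, mean = -x, sd = 1)
  z = rnorm(n, mean = x + y)
  data.frame(x = x, y = y, z = z)
}

train = sim(50)     # same size as the sample in the post
test  = sim(1000)   # new data that still clusters around y = -x

fits = list(
  x_only = lm(z ~ x, data = train),
  y_only = lm(z ~ y, data = train),
  both   = lm(z ~ x + y, data = train)
)

# Root-mean-square prediction error of each model on the held-out data
rmse = sapply(fits, function(fit) {
  sqrt(mean((test$z - predict(fit, newdata = test))^2))
})

rmse
```

The single-predictor errors should come out somewhat larger than the two-predictor error, but of the same order of magnitude.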

Fun side note: the result we get nearly recovers the statistical process used to generate z (z = rnorm(50, mean = x+y)).

 
