Multiple Imputation with Colliders

I have found myself thinking a lot recently about multiple imputation in the presence of colliders. Proponents of MI commonly recommend that any variable available in the dataset should be included in the imputation stage, even if it will not be included in the analysis stage. In one important statement (PDF):

For greater efficiency, add any other variables in the data set that would help predict the missing values (p. 57).

I have often wondered if this is true—if there are particular cases in which adding variables atheoretically at the imputation stage can lead to bias in the analysis stage even if the analysis stage model is correct.

Colliders are a candidate for such a variable. A collider is a variable that is jointly caused by both an independent variable and a dependent variable: X -> C <- Y. It is well known that in a regression of Y on X in which the true causal effect of X is zero, conditioning on a collider can create the illusion that X actually causes Y. Take the simple case where X and Y are random variables that are unrelated to one another, but imagine that they jointly cause a third variable C, and that we include C as a control variable. If we were to generate those data and run that regression in R, here is what we would get.

x <- rnorm(100)
y <- rnorm(100)
c <- y + x + rnorm(100)
summary(lm(y~x+c))$coefficients

##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  0.1009087 0.07196625  1.402167 1.640576e-01
## x           -0.3802116 0.09452905 -4.022166 1.141927e-04
## c            0.4776374 0.04893261  9.761127 4.397550e-16

Conditioning on a collider in this simple example generates a highly significant but entirely spurious association between X and Y, even though the two were generated independently.
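The size of that spurious coefficient is not an accident of one random draw. As a minimal sketch, assuming the data-generating process in the code above (x, y, and the noise in c are independent standard normals), the population regression slopes can be computed directly from the implied covariances. The names Sigma, cov_with_y, and beta are purely illustrative.

```r
# Population moments implied by c = y + x + e, with x, y, e independent N(0, 1).
# Covariance matrix of the regressors (x, c): Var(x) = 1, Var(c) = 3, Cov(x, c) = 1.
Sigma <- matrix(c(1, 1,
                  1, 3),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("x", "c"), c("x", "c")))

# Covariances of the regressors with y: Cov(x, y) = 0, Cov(c, y) = Var(y) = 1.
cov_with_y <- c(x = 0, c = 1)

# Population OLS slopes solve Sigma %*% beta = cov_with_y.
beta <- solve(Sigma, cov_with_y)
print(round(beta, 3))  # slope on x is -0.5, slope on c is 0.5
```

The sample estimates above (-0.38 on x and 0.48 on c, with 100 observations) are noisy versions of these population values.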

So what happens if you were to include a collider when imputing missing data? My intuition is that doing so would create similar problems, but to test that intuition I made a little simulation.* Consider the best possible case: the analyst knows that the correct analysis model excludes the collider, but includes the collider in the imputation stage because it is correlated with the variables with missing data. First load the necessary packages and reset ggplot2's colors away from the default mouldy waffle setting.

library(mice)
library(reshape2)
library(ggplot2)
theme_set(theme_classic())
set.seed(14850)

The following function simulates a dataset with missing data in Y whose probability of missingness depends on the value of Y. It then runs five analyses:

with the full data (no missing values): Full
using listwise deletion/complete case analysis: LD
using MI with no auxiliary variables: MI
using MI with an auxiliary variable that predicts Y (u.y): MI-full
using MI with the collider C: MI-collider

simulation<-function(p){

  # set up the data 
  N <- 1000
  x <- rnorm(N)
  u.y <- rnorm(N)
  y <- u.y + rnorm(N)
  c <- y + x + rnorm(N)

  # full data frame, no missing values
  dat.full <- as.data.frame(cbind(y,x,u.y,c)) 

  # create missingness in Y
  dat.missing.collider <- dat.full                     
  dat.missing.collider$miss.y<-rbinom(N,1,pnorm(y+p))
  dat.missing.collider$y[dat.missing.collider$miss.y==1]<-NA
  dat.missing.collider$miss.y <-NULL  

  # df includes Uy but no collider, missing values for Y
  dat.missing.full<-dat.missing.collider
  dat.missing.full$c<-NULL

  # df includes collider but not Uy, missing values for Y
  dat.missing.collider$u.y <- NULL    

  # df omits Uy and c, missing values for Y
  dat.missing <- dat.missing.collider
  dat.missing$c <- NULL                 

  # if all data were observed
  res.full <- summary(lm(y ~ x, data=dat.full))$coefficients
  b.full <- res.full[2,1]
  p.full <- res.full[2,4]

  # listwise deletion
  res.missing.ld <- summary(lm(y ~ x, data=dat.missing))$coefficients
  b.missing.ld <- res.missing.ld[2,1]
  p.missing.ld <- res.missing.ld[2,4]

  # impute the data
  mi <- mice(dat.missing)                   
  mi.full <- mice(dat.missing.full)          
  mi.collider <- mice(dat.missing.collider) 

  # results without c or Uy
  # (named columns assume mice >= 3.0, where summary(pool()) returns a
  #  data frame; older versions returned a matrix indexed by position)
  res.missing.mi <- summary(pool(with(mi, lm(y ~ x))))
  b.missing.mi <- res.missing.mi[2,"estimate"]
  p.missing.mi <- res.missing.mi[2,"p.value"]

  # results with Uy
  res.missing.mi.full <- summary(pool(with(mi.full, lm(y ~ x))))
  b.missing.mi.full <- res.missing.mi.full[2,"estimate"]
  p.missing.mi.full <- res.missing.mi.full[2,"p.value"]

  # results with collider
  res.missing.mi.collider <- summary(pool(with(mi.collider, lm(y ~ x))))
  b.missing.mi.collider <- res.missing.mi.collider[2,"estimate"]
  p.missing.mi.collider <- res.missing.mi.collider[2,"p.value"]

  outputs <- cbind(b.full,
                   b.missing.ld,
                   b.missing.mi,
                   b.missing.mi.full,
                   b.missing.mi.collider,
                   p.full,
                   p.missing.ld,
                   p.missing.mi,
                   p.missing.mi.full,
                   p.missing.mi.collider)
  return(outputs)
}
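The argument p shifts the probability that Y is missing, through pnorm(y + p). As a rough sketch of what different values of p imply, the snippet below compares the simulated share of missing Y with the closed-form value pnorm(p / sqrt(3)), which follows from Y having variance 2 in this setup. The seed and the grid p_values are arbitrary choices for illustration.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
y <- rnorm(1e5) + rnorm(1e5)      # same construction as in simulation(): Var(y) = 2

p_values <- c(-2, -1, 0, 1, 2)    # hypothetical values of the argument p

# Expected share of missing Y: the average of pnorm(y + p) over the draws of y
simulated <- sapply(p_values, function(p) mean(pnorm(y + p)))

# Closed form: E[pnorm(y + p)] = pnorm(p / sqrt(1 + Var(y))) for y ~ N(0, Var(y))
analytic <- pnorm(p_values / sqrt(3))

print(round(cbind(p = p_values, simulated, analytic), 3))
```

With p = 0, the value used below, about half of the Y values are set to missing.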

Run this simulation 250 times and collect the output.

sims <- data.frame(t(matrix(replicate(250, simulation(0)),nrow=10)))
sims <- cbind(1:nrow(sims),sims)
names(sims) <- c("Iter",rep(c("Full","LD","MI","MI-full","MI-collider"),2))

First, we compare the estimates across all five estimators (remembering that the true effect of X on Y is zero).

colMeans(sims[2:6])

##         Full           LD           MI      MI-full  MI-collider 
## 0.0006753861 0.0030361351 0.0016691195 0.0021527321 0.0029543371

to.plot.b <- melt(sims[1:6], id.vars="Iter", variable.name="Estimator", value.name="Estimate")
ggplot(to.plot.b, aes(x=Estimate, color=Estimator)) + geom_density()

[Figure: density plot of the coefficient estimates for each of the five estimators]

This looks pretty good: the distribution of coefficient estimates is centered around zero for each of these models, and the mean results for MI with a collider aren't any worse than for MI without a collider. But what of Type-1 error rates, a particular concern when conditioning on a collider?

colMeans(sims[7:11]<.05)

##        Full          LD          MI     MI-full MI-collider 
##       0.044       0.060       0.036       0.048       0.044

To my surprise, Type-1 error rates don't seem to be a problem here so long as the collider does not appear in the analysis stage.
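With only 250 replications, these rejection rates are themselves noisy. A quick sketch of the Monte Carlo uncertainty, assuming a true rejection rate equal to the nominal 0.05 and using a normal approximation (the names mc_se, band, and rates are illustrative):

```r
n_sims <- 250      # number of replications used above
alpha  <- 0.05     # nominal Type-1 error rate

# Monte Carlo standard error of a rejection rate when the true rate is alpha
mc_se <- sqrt(alpha * (1 - alpha) / n_sims)

# Rough 95% band (normal approximation) around the nominal rate
band <- alpha + c(-1, 1) * qnorm(0.975) * mc_se

# Rejection rates copied from the output above
rates <- c(Full = 0.044, LD = 0.060, MI = 0.036,
           "MI-full" = 0.048, "MI-collider" = 0.044)

print(round(mc_se, 4))
print(round(band, 3))
print(rates >= band[1] & rates <= band[2])
```

All five rates fall inside the band of roughly 0.023 to 0.077, so none of them is distinguishable from the nominal 5 percent at this number of replications.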

Puzzled by this result and not believing my lying eyes, I looked online for further discussion, and came across this interesting paper (PDF) on auxiliary variables in MI. The authors diagnose a more subtle problem: sometimes adding a collider into the imputation stage can transform the data missingness from Missing at Random (MAR) to Missing Not at Random (MNAR). In the simulations above, data are MAR because the missing values were confined to Y and we have X through which to model Y. But if Y causes a variable C that is correlated with the missingness mechanism, and that missingness mechanism also depends on X, then conditioning on C will transform MAR data into MNAR data.

To illustrate this problem, adjust the simulation function as follows:

simulation_coll_missingness<-function(p){

  # set up the data 
  N <- 1000
  u.y <- rnorm(N)
  u.c <- rnorm(N)
  x <- rnorm(N)
  y <- u.y + rnorm(N)
  c <- y + u.c + rnorm(N)

  # full data frame, no missing values
  dat.full <- as.data.frame(cbind(y,x,u.y,c)) 

  # create missingness
  # p(Y missing) correlated with C and X
  dat.missing.collider <- dat.full                     
  dat.missing.collider$miss.y<-rbinom(N,1,pnorm(u.c+x)) 
  dat.missing.collider$y[dat.missing.collider$miss.y==1]<-NA
  dat.missing.collider$miss.y <-NULL  

  # df includes Uy but no collider, missing values for Y
  dat.missing.full<-dat.missing.collider
  dat.missing.full$c<-NULL

  # df includes collider but not Uy, missing values for Y
  dat.missing.collider$u.y <- NULL    

  # df omits Uy and c, missing values for Y
  dat.missing <- dat.missing.collider
  dat.missing$c <- NULL                 

  # if all data were observed
  res.full <- summary(lm(y ~ x, data=dat.full))$coefficients
  b.full <- res.full[2,1]
  p.full <- res.full[2,4]

  # listwise deletion
  res.missing.ld <- summary(lm(y ~ x, data=dat.missing))$coefficients
  b.missing.ld <- res.missing.ld[2,1]
  p.missing.ld <- res.missing.ld[2,4]

  # impute the data
  mi <- mice(dat.missing)                   
  mi.full <- mice(dat.missing.full)         
  mi.collider <- mice(dat.missing.collider) 

  # results without c or Uy
  # (named columns assume mice >= 3.0, where summary(pool()) returns a
  #  data frame; older versions returned a matrix indexed by position)
  res.missing.mi <- summary(pool(with(mi, lm(y ~ x))))
  b.missing.mi <- res.missing.mi[2,"estimate"]
  p.missing.mi <- res.missing.mi[2,"p.value"]

  # results with Uy
  res.missing.mi.full <- summary(pool(with(mi.full, lm(y ~ x))))
  b.missing.mi.full <- res.missing.mi.full[2,"estimate"]
  p.missing.mi.full <- res.missing.mi.full[2,"p.value"]

  # results with collider
  res.missing.mi.collider <- summary(pool(with(mi.collider, lm(y ~ x))))
  b.missing.mi.collider <- res.missing.mi.collider[2,"estimate"]
  p.missing.mi.collider <- res.missing.mi.collider[2,"p.value"]

  outputs <- cbind(b.full,
                   b.missing.ld,
                   b.missing.mi,
                   b.missing.mi.full,
                   b.missing.mi.collider,
                   p.full,
                   p.missing.ld,
                   p.missing.mi,
                   p.missing.mi.full,
                   p.missing.mi.collider)
  return(outputs)
}

The difference now is that the probability that Y is missing depends both on X and on u.c, a factor that also determines C. Simulating these data 250 times again, here is what we get:

sims <- data.frame(t(matrix(replicate(250, simulation_coll_missingness(0)),nrow=10)))
sims <- cbind(1:nrow(sims),sims)
names(sims) <- c("Iter",rep(c("Full","LD","MI","MI-full","MI-collider"),2))
to.plot.b <- melt(sims[1:6], id.vars="Iter", variable.name="Estimator", value.name="Estimate")
ggplot(to.plot.b, aes(x=Estimate, color=Estimator)) + geom_density()

[Figure: density plot of the coefficient estimates for each of the five estimators, second simulation]

Now we see that using a collider in the imputation stage generates bias in the analysis stage even when the analysis stage does not control for the collider.

colMeans(sims[7:11]<.05)

##        Full          LD          MI     MI-full MI-collider 
##       0.028       0.040       0.048       0.052       0.644

And Type-1 error rates are unacceptably high.
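One way to see where the problem comes from is to check, in the full data, whether Y still differs between missing and observed cases once we condition on the variables in the imputation model. The following is only a rough diagnostic sketch, using the same data-generating process as simulation_coll_missingness() but with a larger N and an arbitrary seed. The data frame d and the names b_without and b_with are illustrative.

```r
set.seed(1)                               # arbitrary seed, for reproducibility only
N   <- 1e5                                # larger than in the simulation, to reduce noise
u.c <- rnorm(N)
x   <- rnorm(N)
y   <- rnorm(N) + rnorm(N)                # u.y plus noise, as in the simulation
c   <- y + u.c + rnorm(N)                 # Y and u.c both cause C
miss <- rbinom(N, 1, pnorm(u.c + x))      # missingness indicator for Y

d <- data.frame(y, x, c, miss)

# Conditioning on X only: Y is unrelated to the missingness indicator
b_without <- coef(lm(y ~ x + miss, data = d))["miss"]

# Conditioning on X and C: Y now differs between missing and observed cases
b_with <- coef(lm(y ~ x + c + miss, data = d))["miss"]

print(round(b_without, 3))   # close to zero
print(round(b_with, 3))      # clearly negative
```

Given X alone, knowing whether Y is missing tells us nothing about Y. Given X and C, it does: holding C fixed, a higher u.c implies a lower Y, and u.c also drives missingness. An imputation model for Y that uses X and C, fitted to the observed cases, will therefore not describe the missing cases.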

What is an example of an “imputation-stage collider”? Imagine we wish to use education (X) to predict partisanship (Y), but we have missing data on partisanship for people who feel excluded from the political system and who also have lower levels of education. And let's also imagine that members of party R are more likely to respond that they don't trust the government (C), as are people who feel excluded from the political system. Adding trust in government at the imputation stage will bias estimates of the effect of education on partisanship, but excluding it (given a series of other assumptions about what causes what that I'll leave unspecified here) would not.

Note also that trust in government should be highly correlated both with education and with partisanship, so it looks like a good candidate for including in our imputation model if all we thought about was its predictive capacity.

So what have we learned from this exercise? The main takeaway—as always—is that for MI to yield unbiased estimates of regression coefficients or causal parameters of interest, its assumptions have to be met. But more precisely, even having the correct model of the analysis stage does not absolve the analyst of considering the relationship between the imputation stage variables, the causal model, and the missingness mechanism. It turns out that in this simple example, imputing with an analysis-stage collider is innocuous (so long as it is excluded at the analysis stage). But imputation-stage colliders can wreck MI even if they are excluded from the analysis stage.**

As I have argued elsewhere, MI cannot be a theory-free exercise. Just as there is no rule of thumb for comparing MI to its alternatives, there is no simple rule of thumb for auxiliary variables such as “include as many variables as you can in the imputation stage” or “in principle, including all of the remaining auxiliary variables in the imputation model is desirable” (PDF).

NOTES

*This post also serves as a test to see if I could make the Rmarkdown-to-Wordpress integration work, via knit2wp.
**And in case you're wondering if including the imputation-stage collider as a control in the analysis stage will help, it won't.
