Modeling Possibly Nonlinear Confounders

In recent empirical work, my coauthors and I have found it useful to treat ordinal variables as factor variables rather than as continuous variables when entering them on the right-hand side of a regression. What I mean is a situation as follows: we are estimating the model

Y = a + βT + γU + ε

where the parameter of substantive interest is β—the partial correlation between T and Y—but we include variable U on the hypothesis that U confounds the relationship between T and Y. A motivating example would a situation in which I want to know the partial correlation between party ID and tax policy preferences, but I suspect that income affects party identification and tax policy preferences, so we need to condition on income in order to “deconfound” the partial correlation of interest. 

When we measure income, we normally don’t have a continuous indicator of dollars/year or something like that, but rather a set of income categories: $20k/year or less, $20k/year-$40k/year, $40k/year-$60k/year, and so forth. This is an ordered variable (category 2 is more than category 1, category 3 is more than category 2…), but it is not a continuous variable. Lately, I have chosen to estimate 

Preferences = a + b*PartyID + c1*IncomeCat2 + c2*IncomeCat3 + c3*IncomeCat4 + c4*IncomeCat5

rather than 

Preferences = a + b*PartyID + c*Income

The former approach allows the relationship between income and tax preferences to be nonlinear, whereas the latter constrains that relationship to be linear. 

But here’s the thing: in this example we are entirely uninterested in the coefficients c1, c2… we only care about my estimate of b. And this weekend, over beers and hamburgers, some wise friends asked me the following: what’s the point of treating U as a factor variable if we are indifferent to c? The question arose in the context of a discussion of Hainmueller et al’s analysis of interaction terms, which shows that standard advice assumes that interactions are linear (which is often not true, and can have pernicious consequences) and recommends exploring nonlinear relationships rather than imposing a linear functional form. One can do this very simply by “discretizing” continuous variables and treating them as factors, just in the model above. 

Although I have a good handle on why Hainmueller’s flexible nonlinear approach is useful for modeling interaction effects, I found myself at a loss to explain why the nonlinear approach would make sense in a non-interactive context in which U is nothing but a confounder. Why not just control for it? What’s the benefit of allowing that relationship to be nonlinear?

As is often the case, a little bit of simulation can help to develop intuitions. So this is what I have done, prompted by some idle ruminations while watching TV over breakfast. The full exposition is below, but the TL;DR result is that treating a control variable U as a factor variable allows a standard OLS setup to estimate b in contexts where both of the following are true: (1) the relationship between U and T is nonlinear, and (2) the relationship between U and Y is nonlinear. When only one (or neither) of those two conditions is true, treating U as continuous will work just fine, but there is little cost to modeling it as a factor variable. 

The Setup

Here is a simple simulation in R in which there is a binary causal variable of interest T that is a (possibly nonlinear) function of U, an ordinal confounder that has five values (1…5).  Y, in turn, is a function of T and a (possibly nonlinear) function of U.

  u <- sort(rep(seq(1:5),n/5))

  t <- rep(NA,n)

  t[u==1] <- rbinom(n/5,1,p1)

  t[u==2] <- rbinom(n/5,1,p2)

  t[u==3] <- rbinom(n/5,1,p3)

  t[u==4] <- rbinom(n/5,1,p4)

  t[u==5] <- rbinom(n/5,1,p5)

  y <- rep(NA,n)

  y[u==1] <- t[u==1] + rbinom(n/5,4,q1)

  y[u==2] <- t[u==2] + rbinom(n/5,4,q2)

  y[u==3] <- t[u==3] + rbinom(n/5,4,q3)

  y[u==4] <- t[u==4] + rbinom(n/5,4,q4)

  y[u==5] <- t[u==5] + rbinom(n/5,4,q5)

In this setup, we capture the relationship between U and T with the parameters p1…p5. An example of a linear relationship between U and T is if p1 = .1, p2 = .2, p3 = .3, p4 = .4, and p5 = .5, in which the probability that T = 1 increases linearly across the values of U. 

Here, by contrast, are examples of nonlinear relationship between T and U:

  1. p1 = .1, p2 = .1, p3 = .2, p4 = .2, and p5 = .9
  2. p1 = .6, p2 = .5, p3 = .2, p4 = .3, and p5 = .9

In the former, the probability that T = 1 increases discontinuously at the highest value of U. In the latter, the probability that T = 1 is higher when U = 1 than when U = 2, 3, or 4, subsequently rising again when U = 5. 

We capture the relationship between U and Y with the parameters q1…q5 in an analogous way. Y is our outcome of interest, which is always a linear function of T, with β = 1, plus U. Notice that if q1=q2=q3=q4=q5 then Y is independent of U; otherwise U confounds our estimates of β. Notice as well that we have a statistical model that looks pretty much like the motivating example of tax preferences, partisanship, and income, in which Y is a five-category dependent variable, much like a survey response on a Likert scale. 

To imagine the nonlinear relationships that this model will capture, start with income (U). It could be that the probability that one is a Republican (T) rises linearly with income. But it could also be the case that the probability that one is a Republican is much higher at the highest income bracket than in any of the lower income brackets; or it could be that the probability that one is a Republican is actually a bit higher among those with low incomes than among those at the middle of the income distribution, rising again among those at the highest income brackets. 

The same is true with views about tax cuts. It could be that the probability that one supports tax cuts (Y) rises linearly with income (U), but it could also be that preferences for tax cuts are substantially higher among the wealthiest than they are among others, or that there is some other relationship in the data. We suspect, though, that those who identify as Republicans (T) are more likely to support tax cuts (Y).*

Estimation

We are using ordinary least squares regression to estimate β.** Consider three different ways to estimate this relationship.

  1. Omit U entirely, which will certainly create bias unless q1=q2...=q5 (call this “No U”)
  2. Enter U as a single control variable (“Linear”)
  3. Enter U as a series of k-1 dummy variables, where k is the number of categories of U (“Factor”)

Below I have plotted the results from estimating this model 250 times on randomly generates U, T, and Y with a sample size of 1000. I allow for nonlinear relationships between U and T as well as U and Y, calculating the bias of the estimates as βb and plotting the distribution of estimates of bias across replications. The footer of the graph shows you the specific values of the p and q terms.

Look first at the blue line, which shows the distribution of bias in b when using the Factor approach. This distribution is centered around 0, which is just what we’d expect if this method were unbiased. It is not surprising to see that the black line, corresponding to estimates that omit U entirely, are highly biased, centered far from 0. But compare these to the red line: here, just accounting for the confounder U using the Linear approach yields biased estimates of the parameter of interest.

What happens if we explore different forms of nonlinearity? The substantive conclusions remain the same.

In this case, the relationship between U and Y is non-monotonic, not just nonlinear as above, but the conclusion remains the same. And just how much nonlinearity is needed to produce biased estimates? Here is a relatively mild case of nonlinear relationships between U and T and U and Y:

When nonlinearity is relatively mild, the bias is relatively small, but even here it is clear that the Linear approach produces worse estimates than the Factor approach.

As it turns out, however, that both nonlinearity in U,Y and U,T are essential for the Factor approach to be clearly superior to the Linear approach. Here is a case where the relationship between U and T is linear, but the relationship between U and Y is not:

We see that both Factor and Linear estimates appear to be unbiased, although the spread of the Factor estimates around 0 is less, suggesting that the Factor approach is more precise. And here is a case where the relationship between U and T is nonlinear, but the relationship between U and Y is.

In this case, both the Factor and Linear approaches appear to be unbiased, but the Factor approach is less precise than the Linear approach—at last, a point in favor of the Linear approach. And for completeness, when both U and T, and U and Y, are linearly related.

This last set of results tells us that if there are no nonlinearities at all, it is immaterial whether one models confounders as linear or not.

Summary Conclusions

The takeaway message from this simulation exercise is fairly simple. Modeling confounders as factor variables makes good sense if there is any possibility that the relationships between both the confounder and the causal variable and the confounder and the outcome variable are nonlinear. If both of these are linear, it doesn’t much matter which you choose. If one relationship is linear but the other is nonlinear, the Linear approach can sometimes yield more precise estimates than the Factor approach, but the Factor approach is still unbiased.

This is an argument for including confounders as factors as the default. There are costs to including long strings of dummied-out confounders in terms of degrees of freedom, but with sufficient sample size these costs are probably relatively minor. It would be interesting to explore whether one might specify the choice of linear versus factor specifications in terms of a bias-variance tradeoff.

Bigger Picture: Nonparametric Identifiability versus Estimation

Stepping back from the question of how best to model confounders, this example provides a useful example of the pedagogical limits of Directed Acyclic Graphs for applied causal inference research.

Here is the point: to my understanding, every single data generating process described above would be modeled using the following DAG:

The great benefit of DAGs is that they are extraordinary tools for ascertaining the nonparametric identifiability of causal relationships. Because our model holds that U -> T and U -> Y, we know that we must condition on U to identify the effect of T on Y. But moving from identification to estimation is not straightforward, and nothing about this DAG tells us about that relationship as an actual estimation problem.*** Thinking about nonparametric identifiability is crucially important for any causal quantity, but this case reinforces to me that in practice, there is still a devil in the statistical modeling details.

Notes

* One could also allow the relationship between partisanship and tax preferences to vary by income itself (Y = T + U + T×U), but that is the standard interaction effect case that Hainmueller et al have already studied.

** We could estimate β using a nonlinear ordered dependent variable approach, but that won’t make much difference.

*** It could be that there does exist a DAG that differentiates among those cases, with implications for how we estimate them. But I am not aware of it (and would be pleased to learn).