Category: Research

  • Modeling Possibly Nonlinear Confounders

    In recent empirical work, my coauthors and I have found it useful to treat ordinal variables as factor variables rather than as continuous variables when entering them on the right-hand side of a regression. What I mean is a situation as follows: we are estimating the model

    Y = a + βT + γU + ε

    where the parameter of substantive interest is β—the partial correlation between T and Y—but we include variable U on the hypothesis that U confounds the relationship between T and Y. A motivating example would be a situation in which I want to know the partial correlation between party ID and tax policy preferences, but I suspect that income affects both party identification and tax policy preferences, so we need to condition on income in order to “deconfound” the partial correlation of interest. 

    When we measure income, we normally don’t have a continuous indicator of dollars/year or something like that, but rather a set of income categories: $20k/year or less, $20k/year-$40k/year, $40k/year-$60k/year, and so forth. This is an ordered variable (category 2 is more than category 1, category 3 is more than category 2…), but it is not a continuous variable. Lately, I have chosen to estimate 

    Preferences = a + b*PartyID + c1*IncomeCat2 + c2*IncomeCat3 + c3*IncomeCat4 + c4*IncomeCat5

    rather than 

    Preferences = a + b*PartyID + c*Income

    The former approach allows the relationship between income and tax preferences to be nonlinear, whereas the latter constrains that relationship to be linear. 
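    In R, the two specifications differ only in whether the income variable is wrapped in factor(). Here is a minimal sketch with simulated stand-in data (the data frame, its column names, and the nonlinear income effect are illustrative, not the actual survey data):

```r
set.seed(1)
n <- 1000
# Hypothetical stand-in data: 5 income categories, a binary party ID
d <- data.frame(income = sample(1:5, n, replace = TRUE),
                party_id = rbinom(n, 1, 0.5))
# Preferences depend nonlinearly on income (sqrt is purely illustrative)
d$prefs <- d$party_id + sqrt(d$income) + rnorm(n)

m_linear <- lm(prefs ~ party_id + income, data = d)          # one slope for income
m_factor <- lm(prefs ~ party_id + factor(income), data = d)  # k - 1 = 4 dummies

length(coef(m_linear))  # 3 coefficients: intercept, party_id, income
length(coef(m_factor))  # 6 coefficients: intercept, party_id, 4 income dummies
```

    The factor() wrapper tells lm() to expand income into k − 1 dummy variables (treatment contrasts), exactly the IncomeCat2…IncomeCat5 specification above.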

    But here’s the thing: in this example we are entirely uninterested in the coefficients c1, c2… we only care about the estimate of b. And this weekend, over beers and hamburgers, some wise friends asked me the following: what’s the point of treating U as a factor variable if we are indifferent to c? The question arose in the context of a discussion of Hainmueller et al’s analysis of interaction terms, which shows that standard advice assumes that interactions are linear (which is often not true, and can have pernicious consequences) and recommends exploring nonlinear relationships rather than imposing a linear functional form. One can do this very simply by “discretizing” continuous variables and treating them as factors, just as in the model above. 

    Although I have a good handle on why Hainmueller’s flexible nonlinear approach is useful for modeling interaction effects, I found myself at a loss to explain why the nonlinear approach would make sense in a non-interactive context in which U is nothing but a confounder. Why not just control for it? What’s the benefit of allowing that relationship to be nonlinear?

    As is often the case, a little bit of simulation can help to develop intuitions. So this is what I have done, prompted by some idle ruminations while watching TV over breakfast. The full exposition is below, but the TL;DR result is that treating a control variable U as a factor variable allows a standard OLS setup to estimate b without bias in contexts where both of the following are true: (1) the relationship between U and T is nonlinear, and (2) the relationship between U and Y is nonlinear. When only one (or neither) of those two conditions is true, treating U as continuous will work just fine, but there is little cost to modeling it as a factor variable. 

    The Setup

    Here is a simple simulation in R in which there is a binary causal variable of interest T that is a (possibly nonlinear) function of U, an ordinal confounder that has five values (1…5).  Y, in turn, is a function of T and a (possibly nonlinear) function of U.

      # n and the probabilities p1...p5, q1...q5 are set per scenario (e.g., n <- 1000)
      u <- rep(1:5, each = n / 5)  # ordinal confounder with five values

      # T is binary; P(T = 1) depends on U through p1...p5
      t <- rep(NA, n)
      t[u == 1] <- rbinom(n / 5, 1, p1)
      t[u == 2] <- rbinom(n / 5, 1, p2)
      t[u == 3] <- rbinom(n / 5, 1, p3)
      t[u == 4] <- rbinom(n / 5, 1, p4)
      t[u == 5] <- rbinom(n / 5, 1, p5)

      # Y is T (so beta = 1) plus a draw whose distribution depends on U through q1...q5
      y <- rep(NA, n)
      y[u == 1] <- t[u == 1] + rbinom(n / 5, 4, q1)
      y[u == 2] <- t[u == 2] + rbinom(n / 5, 4, q2)
      y[u == 3] <- t[u == 3] + rbinom(n / 5, 4, q3)
      y[u == 4] <- t[u == 4] + rbinom(n / 5, 4, q4)
      y[u == 5] <- t[u == 5] + rbinom(n / 5, 4, q5)

    In this setup, we capture the relationship between U and T with the parameters p1…p5. An example of a linear relationship between U and T is if p1 = .1, p2 = .2, p3 = .3, p4 = .4, and p5 = .5, in which the probability that T = 1 increases linearly across the values of U. 

    Here, by contrast, are examples of nonlinear relationships between T and U:

    1. p1 = .1, p2 = .1, p3 = .2, p4 = .2, and p5 = .9
    2. p1 = .6, p2 = .5, p3 = .2, p4 = .3, and p5 = .9

    In the former, the probability that T = 1 increases discontinuously at the highest value of U. In the latter, the probability that T = 1 is higher when U = 1 than when U = 2, 3, or 4, subsequently rising again when U = 5. 

    We capture the relationship between U and Y with the parameters q1…q5 in an analogous way. Y is our outcome of interest, which is always a linear function of T, with β = 1, plus a term that depends on U. Notice that if q1=q2=q3=q4=q5 then Y is independent of U (conditional on T); otherwise U confounds our estimates of β. Notice as well that we have a statistical model that looks pretty much like the motivating example of tax preferences, partisanship, and income, in which Y is a discrete dependent variable taking a handful of ordered values, much like a survey response on a Likert scale. 

    To imagine the nonlinear relationships that this model will capture, start with income (U). It could be that the probability that one is a Republican (T) rises linearly with income. But it could also be the case that the probability that one is a Republican is much higher at the highest income bracket than in any of the lower income brackets; or it could be that the probability that one is a Republican is actually a bit higher among those with low incomes than among those at the middle of the income distribution, rising again among those at the highest income brackets. 

    The same is true with views about tax cuts. It could be that the probability that one supports tax cuts (Y) rises linearly with income (U), but it could also be that preferences for tax cuts are substantially higher among the wealthiest than they are among others, or that there is some other relationship in the data. We suspect, though, that those who identify as Republicans (T) are more likely to support tax cuts (Y).*

    Estimation

    We are using ordinary least squares regression to estimate β.** Consider three different ways to estimate this relationship.

    1. Omit U entirely, which will certainly create bias unless q1=q2...=q5 (call this “No U”)
    2. Enter U as a single control variable (“Linear”)
    3. Enter U as a series of k-1 dummy variables, where k is the number of categories of U (“Factor”)

    Below I have plotted the results from estimating this model 250 times on randomly generated U, T, and Y with a sample size of 1000. I allow for nonlinear relationships between U and T as well as U and Y, calculating the bias of the estimates as b − β and plotting the distribution of estimates of bias across replications. The footer of the graph shows you the specific values of the p and q terms.

    Look first at the blue line, which shows the distribution of bias in b when using the Factor approach. This distribution is centered around 0, which is just what we’d expect if this method were unbiased. It is not surprising to see that the estimates represented by the black line, which omit U entirely, are highly biased, centered far from 0. But compare these to the red line: here, merely accounting for the confounder U using the Linear approach still yields biased estimates of the parameter of interest.

    What happens if we explore different forms of nonlinearity? The substantive conclusions remain the same.

    In this case, the relationship between U and Y is non-monotonic, not just nonlinear as above, but the conclusion remains the same. And just how much nonlinearity is needed to produce biased estimates? Here is a relatively mild case of nonlinear relationships between U and T and U and Y:

    When nonlinearity is relatively mild, the bias is relatively small, but even here it is clear that the Linear approach produces worse estimates than the Factor approach.

    As it turns out, however, nonlinearity in both the U,Y and the U,T relationships is essential for the Factor approach to be clearly superior to the Linear approach. Here is a case where the relationship between U and T is linear, but the relationship between U and Y is not:

    We see that both the Factor and Linear estimates appear to be unbiased, although the spread of the Factor estimates around 0 is smaller, suggesting that the Factor approach is more precise. And here is a case where the relationship between U and T is nonlinear, but the relationship between U and Y is not.

    In this case, both the Factor and Linear approaches appear to be unbiased, but the Factor approach is less precise than the Linear approach—at last, a point in favor of the Linear approach. And for completeness, here is the case in which both U and T, and U and Y, are linearly related.

    This last set of results tells us that if there are no nonlinearities at all, it is immaterial whether one models confounders as linear or not.

    Summary Conclusions

    The takeaway message from this simulation exercise is fairly simple. Modeling confounders as factor variables makes good sense if there is any possibility that the relationship between the confounder and the causal variable and the relationship between the confounder and the outcome variable are both nonlinear. If both are linear, it doesn’t much matter which you choose. If one relationship is linear but the other is nonlinear, the Linear approach can sometimes yield more precise estimates than the Factor approach, but the Factor approach is still unbiased.

    This is an argument for including confounders as factors as the default. There are costs to including long strings of dummied-out confounders in terms of degrees of freedom, but with sufficient sample size these costs are probably relatively minor. It would be interesting to explore whether one might specify the choice of linear versus factor specifications in terms of a bias-variance tradeoff.

    Bigger Picture: Nonparametric Identifiability versus Estimation

    Stepping back from the question of how best to model confounders, this exercise usefully illustrates the pedagogical limits of Directed Acyclic Graphs for applied causal inference research.

    Here is the point: to my understanding, every single data generating process described above would be modeled using the same DAG, with U → T, U → Y, and T → Y:

    The great benefit of DAGs is that they are extraordinary tools for ascertaining the nonparametric identifiability of causal relationships. Because our model holds that U -> T and U -> Y, we know that we must condition on U to identify the effect of T on Y. But moving from identification to estimation is not straightforward, and nothing about this DAG tells us about that relationship as an actual estimation problem.*** Thinking about nonparametric identifiability is crucially important for any causal quantity, but this case reinforces to me that in practice, there is still a devil in the statistical modeling details.

    Notes

    * One could also allow the relationship between partisanship and tax preferences to vary by income itself (Y = T + U + T×U), but that is the standard interaction effect case that Hainmueller et al have already studied.

    ** We could estimate β using a nonlinear ordered dependent variable approach, but that won’t make much difference.

    *** It could be that there does exist a DAG that differentiates among those cases, with implications for how we estimate them. But I am not aware of it (and would be pleased to learn).

  • Is Quantitative Description without Causal Reasoning Possible?

    This week saw the launch of an exciting new journal entitled the Journal of Quantitative Description: Digital Media. Although the bit after the colon delimits the topical scope of this particular journal, it is the bit before the colon that is most exciting and that has elicited wide commentary. JQD:DM promises to publish

    quantitative descriptive social science. It does not publish research that makes causal claims

    This is a big statement, because many if not all mainstream social science journals are increasingly consumed by a focus on causal inference using quantitative methods. To be fair, this has probably been true for a long time now. But the revolution in statistical methods for causal inference in the past forty years has given quantitative social scientists a very sophisticated toolkit for understanding the relationship between statistical procedures and causal claims, such that progress in the latter is now catching up with progress in the former.*

    I do not think that anyone seriously holds the position that only causal inference is important. Description has always been essential to the scientific and social scientific enterprise: what is the population of Israel? what is the behavior of the cardinal eating from my bird feeder? and so forth. Yet the task of quantitative description raises an interesting question about the role of causal reasoning in making theoretically relevant descriptive statements.

    I will make two assumptions as a starting point:

    1. quantitative description is always theoretical
    2. theoretically interesting tasks of quantitative description involve relating one variable to another variable.**

    These assumptions are not assumptions about quantitative methods themselves—one could always simply produce descriptive statistical correlations between, say, refrigerators per capita and infant mortality across Indonesian provinces—but rather about the types of quantitative descriptions that are held to advance social scientific knowledge. Assumption 1 tells us that we rely on theory to tell us what is potentially informative about a quantitative description, and Assumption 2 tells us that we should focus on what problems arise when we describe relations among variables.***

    Under these maintained assumptions, I think that it follows that all quantitative description is done either in the shadow of causal reasoning, or with implicit restrictions on the system of causal relations that the quantitative description partially captures.

    Let’s start with a classic example of what seems to be a good quantitative description: creating an index that measures a latent psychological construct. Bill Liddle, Saiful Mujani, and I did this for my 2018 book Piety and Public Opinion, creating what I called a “piety index” designed to capture individual piety across a sample of Indonesian survey respondents. We made this index from multiple variables, and used theory to restrict “what went into” this index, so Assumptions 1 and 2 hold. Isn’t this just descriptive? It is: but note that the grandfather of latent trait analysis, Spearman (1904), proceeded from a model in which the latent construct caused the observable indicators associated with it. This causal claim feels rather innocuous, but it is causal; and any attempt to relate an index of the form that I created with any sorts of other outcomes or correlates must confront some sort of causal model to be interpretable.
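    A toy version of Spearman’s measurement model makes the causal structure concrete: the latent trait causes the observable indicators, and an index built from those indicators recovers the latent cause. (All names, loadings, and values here are illustrative stand-ins, not the actual piety index.)

```r
set.seed(7)
n <- 2000
piety <- rnorm(n)  # latent construct: unobserved in real data

# Each observed indicator is *caused by* the latent trait, plus noise
x1 <- 0.8 * piety + rnorm(n)
x2 <- 0.6 * piety + rnorm(n)
x3 <- 0.7 * piety + rnorm(n)

# A simple index: average of the standardized indicators
index <- rowMeans(scale(cbind(x1, x2, x3)))
cor(index, piety)  # high: the index tracks the latent cause
```

    The index is “just descriptive,” but its interpretation rests on the causal arrows running from piety to x1, x2, and x3.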

    Turn to another example: the cross-national relationship between private gun ownership and state terror (a topic I first addressed thousands of mass shootings ago). There, I produced descriptive correlations between, well, state terror and private gun ownership, but took care to note that

    Of course, these are not estimates of the causal effect of gun ownership (or anything else) on state terror. These are conditional correlations, and there are plenty of reasons why we might believe that the causal relations here are more complicated than what this discussion has implied. 

    The point I was trying to raise is that we learn things from these correlations even when we are sure that they are not causal. This, I think, is related to the model that JQD:DM seeks to follow.

    But this is not a causation-free analysis! It is interesting only insofar as we can relate it to a causal question. We reason through the potential set of causal relations that could have produced that correlation to make sense of what it likely means. A long quote from a follow-up post makes the point (funnily enough, it anticipated JQD:DM):

    If I were writing an article for a good social science journal, I’d probably stop right here and abandon the project. Thankfully, we have eliminated some of the numerology from quantitative social science in the past two decades, meaning that we cannot wave our magic interpretive wand over a regression table to reach our preferred conclusion. If you want to claim to have identified “the effect of” gun ownership on freedom from state terror, partial correlations will no longer suffice.

    But we still learn policy-relevant things from these results even if they do not identify a causal relationship. The first point is to remember that the question of interest is not the average causal effect of gun ownership on state terror (which, for better or for worse, has become the question of interest for quantitative social science research). Instead, our policy question is more squishy: does such widespread gun ownership protect American citizens from tyranny? Here is what we have learned even without an estimate of a causal effect.

    1. American citizens aren’t as protected from state terror as we might think.

    2. Plenty of countries rate as highly as (or more highly than) the U.S. with lower levels of gun ownership.

    3. Plenty of countries with lower levels of gun ownership experience far more state terror.

    4. The partial correlation between gun ownership and state terror disappears when you take regime type and economic development into account.

    All of these data are hard to square with the idea that the ubiquity of firearms in the U.S. is protecting Americans from state terror. We can construct a theoretical world in which gun ownership at the levels that we see in the United States today is protecting us from tyranny, but that theoretical world must have a lot of curious features to it to also produce the results from yesterday. 

    Understanding what the conditional correlation could have possibly meant implied that we could imagine some sort of causal system from which the quantitative description—the correlation ρ(Y,X) is statistically significant, but the partial correlation ρ(Y,X|W) is not—emerged.
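    That pattern is easy to generate in a small simulation. Here, a sketch under assumed illustrative names: a confounder W (think regime type or development) drives both X (gun ownership) and Y (freedom from state terror), so the raw correlation is sizable but the partial association vanishes once W is conditioned on:

```r
set.seed(3)
n <- 500
w <- rnorm(n)      # confounder, e.g. regime type / development (illustrative)
x <- w + rnorm(n)  # e.g. gun ownership, driven by w
y <- w + rnorm(n)  # e.g. freedom from state terror, also driven by w

cor(x, y)                 # sizable raw correlation (around 0.5)
coef(lm(y ~ x + w))["x"]  # partial association shrinks toward zero
```

    The descriptive facts (a raw correlation, a vanishing partial correlation) only become informative once we imagine causal systems that could or could not have produced them.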

    There are other examples that I might provide, but I hypothesize that any quantitative description that is held to advance social scientific knowledge in the ways that the journal hopes will be either multivariate measurements of things or tantalizing correlations among things.

    Is this bad or wrong? Does it undermine the purpose of JQD:DM? In both cases the answer is no. I reach a different conclusion: that JQD:DM and any journal like it will always confront lurking criticisms that causal reasoning is somehow being smuggled into the quantitative descriptions that they publish. This is a fine problem to have, but I suspect that even a journal explicitly devoted to quantitative description will struggle to police the boundary between descriptive and causal inference.

    By way of conclusion, here is a speculative future for journals like JQD:DM. In many if not most cases, there is a lot to be learned from statistical correlations that cannot be given a strong causal interpretation. The standard in most quantitative social science is to target a causal parameter like an average treatment effect or a dose-response function****. The enterprise “fails” if the design does not allow for that target parameter to be identified, and as my example above of the paper I would abandon hints, researchers often will not even try if they know that it is unidentifiable.

    Another approach would be to identify a target parameter, a quantitative descriptive fact that is partially informative about that parameter, and a mapping between the two using assumptions and logical bounds. JQD:DM and journals like it might foreground this sort of approach to highlight what we learn from quantitative descriptive exercises. A loosely related approach such as that outlined in Little and Pepinsky (2021) might fit nicely under this model as well.

    NOTES

    * To make explicit what I mean in this sentence: we have long had sophisticated statistical tools, but without the theory of causality required to attribute causal meaning to them.

    ** Examples of univariate quantitative description would be finding answers to the questions of “how many balls are in that urn?” or “what is the GDP of Venezuela?”

    *** Importantly, observe that time is a variable. Describing how a single variable differs across time is a task that relates multiple variables to one another.

    **** Or, analogously, a sufficient statistic or an identification region.