Multiple Comparisons: Outcomes versus Measures

A recent post at Duck of Minerva by Peter Henne raised the question of multiple comparisons in international relations research. He poses the problem like this:

Imagine we’re writing a paper on interstate conflict. We could measure conflict onset, duration or intensity. We could measure intensity by an ordinal scale, the number of deaths, or other more complicated measures. We could include all dyads, politically-relevant dyads, dyads since 1945, 1989, etc. Ideally these choices would be based on our theory, but our theories are often not specific enough to specify such choices.

I find this striking, because even though the post is about “what the ‘multiple comparisons issue’ really is,” this is not how I customarily think of multiple comparisons. The above description captures a problem of uncertainty and/or researcher discretion in the measurement of one variable. Yet most overviews of the multiple comparisons problem describe uncertainty and/or researcher discretion in choosing which outcome to measure, or in measuring multiple outcomes.

The difference is subtle but important. Let’s take an example of the democratic peace in international relations. Imagine a well-designed (if implausible) experiment that randomly assigns half the world’s countries to be democracies. We then look for the consequences. We might be interested in the effect of democracy on war, but be uncertain about how to measure it. A continuous measure of people killed in conflict? A binary measure that is 1 if the conflict reaches some intensity, 0 otherwise? An ordinal scale of none-some-lots of war? Maybe our measures of people killed are biased? Maybe we have many such measures, each biased in a different but unknown way? We might try to use all of them as dependent variables to see how inferences change. That is uncertainty in the measurement of one variable. If we report only the ones that yield statistically significant results, then we have researcher discretion in the same.

The multiple comparisons problem, however, often describes something different. This is when we look for all sorts of outcomes of the experiment. Not just war (however measured) but also economic growth. Or tax collection. Or social rights. Or environmentalism. Or religiosity. Or any number of things. Maybe we try all of them as dependent variables to see what the experiment affected. That is uncertainty in choosing what outcome to measure. If we report only the ones that produced statistically significant results, then we have researcher discretion in the same.

The problem with the latter is that if you keep picking dependent variables, you will eventually find one that is correlated with the treatment just by chance. One might then infer, wrongly, that the treatment affected that outcome. We can address this problem by raising the threshold for statistical significance required to draw inferences about the treatment, correcting for the fact that one has looked at many outcomes. That is one solution to the multiple comparisons problem. Another is pre-registration.*
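To make that first solution concrete, here is a minimal sketch of the most common such adjustment, the Bonferroni correction. The choice of Bonferroni here is mine, for illustration; other adjustments exist.

```python
# A minimal sketch of the Bonferroni correction: to hold the family-wise
# error rate at alpha across m tests, each individual test must clear
# the stricter threshold alpha / m.
def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / m

# With 20 candidate outcomes at a 0.05 family-wise rate, a result counts
# as significant only if its p-value falls below 0.0025.
print(bonferroni_threshold(0.05, 20))  # 0.0025
```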

Does it matter whether the dependent variables are multiple measures of the same thing versus multiple different dependent variables? My intuition suggests that the answer is yes, but I could not figure out how, and this masterful review of the multiple comparisons problem does not distinguish between the two except indirectly, in the case where the dependent variables are correlated with one another (in other words, where the tests are not independent). So to get a handle on this, I created a little simulation that compares two types of analyses. I created a binary treatment variable T, and then tested it against some randomly generated dependent variables.

  1. In the first, there are twenty dependent variables that measure different things. I simulated these with twenty randomly generated dependent variables Y1, Y2, …, Y20.
  2. In the second, there are twenty different measures of a single dependent variable Y. One is Y itself. Another is a dichotomized version of Y. Still another is Y measured as an ordinal variable. Some others are Y plus varying amounts of random noise. And so forth: some measures are closer to Y than others, and some are dichotomous or ordinal variables. I got pretty creative here. The idea is that this reflects uncertainty in the measurement of a single dependent variable. (Both constructions are sketched in code below.)
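Here is a minimal sketch of how the two sets of dependent variables might be constructed. The sample size, the noise scales, and the particular transformations below are illustrative choices, not a record of the exact ones used.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # illustrative sample size

# Binary treatment, assigned independently of every outcome.
T = rng.binomial(1, 0.5, size=n)

# Scenario 1, "Many DVs": twenty unrelated outcomes Y1, ..., Y20.
many_dvs = [rng.normal(size=n) for _ in range(20)]

# Scenario 2, "One DV, Many Measures": twenty transformations of one Y.
Y = rng.normal(size=n)
measures = [
    Y,                                 # Y itself
    (Y > 0).astype(int),               # a dichotomized version of Y
    np.digitize(Y, [-1.0, 0.0, 1.0]),  # Y coarsened into an ordinal scale
]
# Noisy copies of Y: small scales stay close to Y, large ones drift away.
measures += [Y + rng.normal(scale=s, size=n) for s in np.linspace(0.25, 4.0, 17)]
```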

Importantly, each of these dependent variables was generated independently of T. So we know that the experiment did not cause any outcome. But if we select a standard of statistical significance of 95%, we should expect 5 out of every 100 random dependent variables to be correlated with T entirely by chance. Or, in this case, 1 out of the 20 that I created for the first analysis. And there would be only a 5% chance that the single dependent variable Y would be associated with the treatment variable in the second analysis. Any statistically significant correlation between the treatment variable and any dependent variable is therefore a false positive.
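The arithmetic behind that expectation, for the independent-tests case, is straightforward (the “One DV” scenario violates the independence assumption, which is exactly the point below):

```python
# Under the null, each independent test at alpha = 0.05 is a false
# positive with probability 0.05.
alpha, m = 0.05, 20
expected_false_positives = alpha * m        # 1.0: about 1 of the 20 DVs
prob_at_least_one = 1 - (1 - alpha) ** m    # ~0.642 across 20 independent tests
print(expected_false_positives, round(prob_at_least_one, 3))
```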

I then simulated these analyses 500 times, with fresh random draws of the treatment variable and the DVs each time. For each simulation, I counted how many false positives showed up in the analysis of the twenty dependent variables (which I’ll call the “Many DVs” scenario) and how many showed up in the analysis of the twenty measures of the single dependent variable (“One DV, Many Measures”).** The figure below compares the results.
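A compact version of the whole exercise might look like the following sketch. It uses two-sample t-tests throughout as a stand-in for the logits and ordered logits mentioned in the second footnote, and the sample size is again illustrative; the false-positive counting logic is the same.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_sims, alpha = 200, 500, 0.05  # n is illustrative; 500 matches the simulations

def n_false_positives(dvs, T):
    """Count DVs whose means differ 'significantly' across the two arms of T."""
    return sum(stats.ttest_ind(y[T == 1], y[T == 0]).pvalue < alpha for y in dvs)

many_dv_counts, one_dv_counts = [], []
for _ in range(n_sims):
    T = rng.binomial(1, 0.5, size=n)
    # Scenario 1: twenty unrelated DVs.
    many_dvs = [rng.normal(size=n) for _ in range(20)]
    # Scenario 2: twenty measures of one Y (abbreviated from the sketch above).
    Y = rng.normal(size=n)
    measures = [Y, (Y > 0).astype(float), np.digitize(Y, [-1.0, 0.0, 1.0]).astype(float)]
    measures += [Y + rng.normal(scale=s, size=n) for s in np.linspace(0.25, 4.0, 17)]
    many_dv_counts.append(n_false_positives(many_dvs, T))
    one_dv_counts.append(n_false_positives(measures, T))

print(np.median(many_dv_counts), np.median(one_dv_counts))
```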

The histograms show the distribution of false positives. The dotted lines are the median false positive rate across the 500 simulations. These results are quite revealing.

  1. In the “Many DVs” scenario, the modal outcome is that 1/20, or .05, of the results are significant at the 95% confidence level. This is just what the multiple comparisons literature predicts.
  2. In the “One DV, Many Measures” scenario, the results are very different. The most common outcome is that no results are significant. But when results are significant, they tend to be significant across many measures at once. That is the long right tail of the distribution.

These differences help to clarify what the problem of multiple comparisons is, and also how corrections for it might or might not work. Multiple tests of the same dependent variable, measured differently, might not increase our confidence in a result, because they are not independent tests: if the single underlying DV happens to be correlated with the treatment by chance, many of its measures will be too. A multiple comparisons adjustment will decrease the false positive rate in the “One DV, Many Measures” scenario, but this is a mechanical result, one that comes from simply raising the threshold for statistical significance.

By contrast, in the “Many DVs” scenario, adjustments for multiple comparisons respond directly to the fact that running many independent tests will eventually yield significant results simply by chance.
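One way to see the contrast is to compare the family-wise error rate (the chance of at least one false positive) in the two limiting cases: fully independent tests and fully dependent ones. This is a back-of-the-envelope sketch, not a re-analysis of the simulation above.

```python
# Family-wise error rate at alpha = 0.05 with m = 20 tests, in two limits.
alpha, m = 0.05, 20

# Independent tests ("Many DVs"): error chances compound across tests.
fwer_independent = 1 - (1 - alpha) ** m   # ~0.642

# Perfectly dependent tests ("One DV" with measures identical to Y):
# twenty tests are really one test in disguise.
fwer_dependent = alpha                    # 0.05

# A Bonferroni correction (alpha / m per test) caps the FWER at alpha in
# both cases, but in the fully dependent limit it over-corrects: the
# uncorrected FWER was already alpha.
print(round(fwer_independent, 3), fwer_dependent)
```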

NOTES

* Although in principle, one might also pre-register a fishing expedition (which my colleagues and I once termed “hypothesis trolling” or “hypothesis squatting”).

** I used logits and ordered logits where appropriate in the “One DV, Many Measures” tests.

Re-Reading Imagined Communities for the 1000th Time

Some books just stick with you. Benedict Anderson’s Imagined Communities: Reflections on the Origin and Spread of Nationalism is one of them. It was one of only two books that I was assigned to read in multiple classes in college. At Om Ben’s memorial service, my colleague Isaac Kramnick called it the “second-most important book ever written by a Cornell faculty member.”* And it’s a standard entry on any syllabus dealing with nationalism and national identity.

I recently assigned Imagined Communities for the N-th time in my own class, and I was struck—once again—by how dense it is. Every time I read it I discover something new, or remark upon a flippant turn of phrase or a spicy footnote that I hadn’t noticed before. This time was no different.

The standard two-word summary of Imagined Communities is “print capitalism.” This is shorthand for Anderson’s argument that the spread of mass literature in vernacular languages, motivated by the capitalist impulse to sell penny dreadfuls to as many people as possible, created the idea of a community united by a common language. But the last two times I re-read Imagined Communities, I was struck more by his focus on the idea of what he and others term “simultaneity.” That is, the idea of how a reader (or, more generally, a person) conceives of the temporal dimensions of the social world. Writes Anderson,

What has come to take the place of the mediaeval conception of simultaneity-along-time is, to borrow again from Benjamin, an idea of ‘homogeneous, empty time,’ in which simultaneity is, as it were, transverse, cross-time, marked not by prefiguring and fulfilment, but by temporal coincidence, and measured by clock and calendar.

I won’t pretend to have nailed down what specifically this distinction is,** but the idea is that mass vernacular fiction did not simply create a common experience of people-like-me-reading-things-that-only-we-can-read; the nature of the mass-market novel also shifted people’s cognitive maps from a medieval to a modern form. That’s much more than just “print capitalism.”

The other thing that I noted is Anderson’s gleeful recitation of racial epithets, designed to make the point that

it is remarkable how little that dubious entity known as ‘reverse racism’ manifested itself in the anticolonial movements.

He writes in a footnote

I have never heard of an abusive argot word in Indonesian or Javanese for either ‘Dutch’ or ‘white.’ Compare the Anglo-Saxon treasury…

and then goes on to list them. This is wonderfully interesting for two reasons. First off, in his posthumously published memoir, Anderson not only talks about the common “abusive argot word” in Indonesian for “white” (= bule) that everyone knows, but also claims to have invented it as a derogatory term for a Caucasian person.***

Second off, it is interesting because Caroline Hau—a very fine author of both fiction and nonfiction—recently gave a lecture at Cornell entitled “For Whom Are Southeast Asian Studies” in which she urged us to remember the audiences for whom our books are written. In Anderson’s own telling, Imagined Communities is for an English audience. Not English-speaking, but specifically from England. Perhaps that explains why one would argue such an odd and plainly untrue thing? Hard to say. But a Southeast Asian audience would happily supply him with the treasury of colonial racial epithets for colonizers.

NOTES

* Here is the winner for Kramnick’s most important Cornell faculty book.

** Anderson cites Erich Auerbach’s Mimesis and Walter Benjamin for these ideas.

*** He claims to have coined the term in the 1960s, to be clear, so the timing does not work (Imagined Communities is from 1983). As colorful as that linguistic history might be, I think he is plainly incorrect.