Replicability versus Robustness

There is a lot to like in Allan Dafoe’s discussion of the imperative of replicability in political science (ungated PDF, H/T the Monkey Cage). Yet some areas give me pause, in particular the notion of “robustness.”

Look at Appendix 2B, entitled “Minimal Robustness Standards.”

– 1: One or more key results were not robust to a sensible modification of an arbitrary aspect of the specification.

There is a troubling fuzziness here. It all comes down to what I am now calling referee degrees of freedom. What parts of a “specification” are “arbitrary,” and what modifications are “sensible”? We need answers to those questions before we can declare any results “not robust,” and there is certainly no consensus in political science about what kinds of modifications are sensible and what features of models are arbitrary, at least not in the abstract. For example, I’m pretty sure that logit regressions ought to yield the same substantive conclusions as probit regressions. But what about linear probability versus logit? Your favorite control variable? Fixed or random effects? A subset analysis? ATT versus ATE? And which part of a result must be robust: the sign? The effect size? The p-value? How different do these have to be before we conclude that a finding is not robust?
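To make the question concrete, here is a minimal sketch in Python (my own illustration, not anything from Dafoe’s paper), using simulated data and statsmodels: fit the “same” model with a logit and a probit link, then compare the sign, the average marginal effect, and the p-value, the usual candidates for what “the same substantive conclusion” might mean.

```python
# Toy illustration (not Dafoe's procedure): compare one result across
# logit and probit fits and ask on which dimensions it is "robust."
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
X = sm.add_constant(x)
# Simulate a binary outcome from a logistic data-generating process
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-(0.5 * x - 0.2)))).astype(int)

logit = sm.Logit(y, X).fit(disp=0)
probit = sm.Probit(y, X).fit(disp=0)

for name, res in [("logit", logit), ("probit", probit)]:
    ame = res.get_margeff().margeff[0]  # average marginal effect of x
    print(f"{name:6s} sign={np.sign(res.params[1]):+.0f} "
          f"AME={ame:.3f} p={res.pvalues[1]:.3f}")
# Raw coefficients differ by construction (different link scales); the
# sign, average marginal effect, and p-value are the quantities one
# would plausibly compare.
```

Which of those comparisons settles the robustness question, and with what tolerance, is exactly the judgment call that the minimal standard leaves open.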

This points to a generally sloppy way that the discipline discusses “robustness” in applied research. Let’s not forget, moreover, that there is a more widely accepted definition of “robust,” as in “robust statistics.” There, robustness is a property of a statistic itself, its insensitivity to outliers or to departures from distributional assumptions, not a property of a specification or a set of specifications.
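To make that contrast concrete, here is a toy numerical example (again my own, not from the paper) of robustness as a property of a statistic: the sample mean can be dragged arbitrarily far by a single bad observation, while the median barely moves.

```python
# "Robust" in the robust-statistics sense: a property of the estimator.
# The mean has breakdown point 0; the median tolerates up to 50% contamination.
import numpy as np

rng = np.random.default_rng(1)
clean = rng.normal(loc=10, scale=1, size=99)
contaminated = np.append(clean, 1e6)  # a single gross outlier

print(f"mean:   clean={clean.mean():8.2f}  contaminated={contaminated.mean():12.2f}")
print(f"median: clean={np.median(clean):8.2f}  contaminated={np.median(contaminated):12.2f}")
```

That sense of robustness is well defined; robustness to “sensible” modifications of “arbitrary” parts of a specification is not, at least not yet.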

Dafoe’s discussion covers very important issues in replicability, most importantly with respect to transparency. But robustness is conceptually distinct from replicability, and the discussion here ought to be seen as raising important questions rather than providing conclusive answers.