There has been a lot of attention recently to the perils of p-fishing in political science. P-fishing refers to the selective reporting of statistical results that cross some boundary of statistical significance (customarily, p < .05). The problem arises because researchers may run many more analyses than they actually report, and given model or specification uncertainty, it is easy to report only those results that are consistent with the inferences that the analyst wants to make. A good discussion of p-fishing may be found in this CEGA blog forum. The issue of p-fishing isn’t new, but the attention that it’s getting is. This goes along with renewed attention to publication bias and other related concerns.
Solutions to the p-fishing problem come in two flavors, one targeting temptations and the other incentives. One is pre-registration (see e.g. this EGAP initiative): specify publicly how you intend to analyze your data before you actually collect it. Think of pre-registration as a commitment device. The other is eliminating the incentives to p-fish by publishing null results and other so-called “non-findings” (as the new Journal of Experimental Political Science intends). Pre-registration gets rid of the temptation to p-fish, and publishing null results gets rid of the incentives to p-fish. Of course, we could implement both of these at the same time.
But what if we thought about the problem differently? What if we cautiously embraced some kinds of p-fishing, but chose to focus on the process rather than the catch?
I imagine the problem of p-fishing as something rather less ominous than actually methodically searching for significance by taking a mess of variables and sifting through all possible conditional correlations or subsetting the data to uncover some treatment effect (or not reporting the effects that were not found). Instead, it probably looks more like the following in many cases.
A Sketch of P-Fishing in Practice
1. Analyze data.
2. If this doesn’t generate statistically significant results, respecify. Stop and move on only if significant results cannot be obtained.
3. If a result comes out, poke around a little more. New specifications, analyses, sub-samples, and so on.
4. If result holds up to reasonable poking, shop it around.
5. Get feedback and new ideas for new analyses after shopping. This may mean from colleagues, from referee reports, from audiences at conference or workshop presentations, etc.
6. Incorporate suggestions into new analyses. If the results hold up sometimes but not always: only report the ones that fit, and develop a long argument about how the ones that don’t fit are inappropriate anyway. Pray for favorable referee treatment.
7. If results never hold up: submit anyway and pray for favorable referee treatment.
8. If results always hold up: celebrate.
To reiterate: this is not an endorsement of such a practice. It is a description of what I imagine the practice looks like, especially when using observational data.
The problem at Step 2 is publication bias: a result is only a result if it crosses some artificial threshold of statistical significance, so there is no professional incentive to report non-significant findings. The broader problem at Steps 2, 3, 6, 7, and 8 is researcher degrees of freedom: the analyst has all sorts of opportunities to choose a specification that generates those precious significance stars, and in doing so, s/he is focusing on getting the stars, not the validity of the findings. Since we never observe the true model that produces the data that we analyze (which is why we are in the business of making inferences) we cannot check any particular specification against the truth.
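To get a rough sense of how much room researcher degrees of freedom leave, here is a minimal sketch in Python. It assumes, purely for illustration, that an analyst tries k specifications whose test statistics are independent standard normal draws under a true null, and reports a finding if any of them crosses p < .05. The function names are hypothetical. Real specifications run on the same data are correlated, so the actual rate will differ from this idealized case.

```python
import random


def analytic_rate(k, alpha=0.05):
    """Chance that at least one of k independent null tests is 'significant'."""
    return 1.0 - (1.0 - alpha) ** k


def simulated_rate(k, n_sims=20000, z_crit=1.96, seed=0):
    """Monte Carlo check: share of simulated 'projects' with at least one |z| > z_crit.

    Assumption: each specification yields an independent N(0, 1) test
    statistic, i.e. there is no true effect and no overlap between tests.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        if any(abs(rng.gauss(0, 1)) > z_crit for _ in range(k)):
            hits += 1
    return hits / n_sims


for k in (1, 5, 10, 20):
    print(k, round(analytic_rate(k), 3), round(simulated_rate(k), 3))
```

Under these assumptions the analytic rate is about 0.40 at ten specifications and about 0.64 at twenty, even though nothing is going on in the data.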
My suggestion is that we might be focusing too much on the fact that results are no longer significant, and not enough on why and so what. If I discover that controlling for variable X2 wipes out the statistical significance of X1, or that my treatment effect T disappears in a subset of the analysis sample, then our task could very well be to explain why that is the case, rather than proudly (or dejectedly) throwing out our conclusions about X1 or T and moving on. That task itself generates knowledge, and yet it goes unrecorded.
Say we have a study claiming to show that public opinion affects tax policy. When public opinion in some jurisdiction is anti-tax, taxes get lower in that jurisdiction. The correlation is significant at some level. Then we learn that when we control for the state of the economy, public opinion no longer explains tax policy.
We tend to say “Ah, results not robust. Give up. Or (if submitted) reject.” But we could also look at the same process and conclude “oh, maybe public opinion is epiphenomenal on economic performance.” We could focus on why the estimates are no longer statistically significant: are coefficients changing, or standard errors, or both? Of course, none of this would be the end of the discussion, but that is true of any piece of research. My point is this: by focusing on what sorts of theories are consistent with the fact that results change across reasonable specifications—rather than either (1) committing to only doing one type of analysis or (2) publishing only the non-results—we may actually learn things that we otherwise wouldn’t be able to learn.
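The tax example can be made concrete with a small simulation. This is only a sketch under stated assumptions: the data are invented, the variable names and effect sizes are hypothetical, and the data-generating process builds in the "epiphenomenal" story by letting the economy drive both anti-tax opinion and tax policy while opinion has no direct effect. The controlled estimate uses the standard partialling-out (Frisch-Waugh-Lovell) result, so that the coefficient and the standard error can be inspected separately.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def residualize(y, x):
    """Residuals from an OLS regression of y on x with an intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return [(yi - y_bar) - slope * (xi - x_bar) for xi, yi in zip(x, y)]


def slope_and_se(y, x, control=None):
    """OLS slope on x and its standard error, optionally with one control."""
    n = len(y)
    if control is None:
        x_bar, y_bar = mean(x), mean(y)
        x_res = [xi - x_bar for xi in x]
        y_res = [yi - y_bar for yi in y]
        n_params = 2  # intercept + x
    else:
        # Partial the control (and intercept) out of both y and x.
        x_res = residualize(x, control)
        y_res = residualize(y, control)
        n_params = 3  # intercept + x + control
    sxx = sum(v * v for v in x_res)
    slope = sum(a * b for a, b in zip(x_res, y_res)) / sxx
    rss = sum((b - slope * a) ** 2 for a, b in zip(x_res, y_res))
    se = math.sqrt(rss / (n - n_params) / sxx)
    return slope, se


def simulate(n=2000, seed=1):
    """Hypothetical world: the economy drives both opinion and taxes."""
    rng = random.Random(seed)
    economy = [rng.gauss(0, 1) for _ in range(n)]
    anti_tax = [0.8 * e + rng.gauss(0, 0.6) for e in economy]
    tax = [-0.5 * e + rng.gauss(0, 1) for e in economy]  # no opinion term
    return economy, anti_tax, tax


economy, anti_tax, tax = simulate()
b_biv, se_biv = slope_and_se(tax, anti_tax)
b_ctl, se_ctl = slope_and_se(tax, anti_tax, control=economy)
print("no control:   b = %.3f, se = %.3f" % (b_biv, se_biv))
print("with economy: b = %.3f, se = %.3f" % (b_ctl, se_ctl))
```

In this constructed case both things happen at once: the coefficient on opinion shrinks toward zero, and its standard error grows because opinion and the economy are correlated. A result that loses significance only because the standard error grows, with the coefficient roughly unchanged, would point to a different story, which is why looking at the two pieces separately is informative.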
There are certainly some studies that do something like this. But it is too rare, probably as a result of publication bias and the incentives to generate statistically significant results. Explicit acknowledgment of the fragility of results is rare; much more effort goes into showing that results are not fragile.
But Of Course…
The different approach to p-fishing I have in mind here would require a lot of changes to how we report on research. First, note that it actually requires us to address both the temptation and incentives to p-fish: we have to accept that some inferences are misleading and that sometimes we haven’t learned anything. So we have to be willing to publish null results rather than disguising them.
It also requires something more. It requires us to acknowledge explicitly the iterative process of the research enterprise, with discovery and disappointment and exploration as a natural part of that process. I imagine that those of us who strive for a laboratory-scientific model of empirical research would not be too keen on this: aren’t we just recognizing how much our research departs from our ideals? Shouldn’t we be rooting these practices out, not embracing them?
I also think of science as a process of discovery…Every paper I have [written] has some really interesting robustness, extensions, follow-ups that I would have never thought about at the beginning.
Our publication practices should not be structured in ways that hide the scientific process, and I’d be interested to hear of ways to accomplish this. One is to embrace the strongest version of the Replication Standard, in which every keystroke and computer command is recorded from the first conceptualization of a research project until publication. Perhaps there are others…but if so, they probably require a lot more reforms to how we discuss and report research than just replication.