Let’s say we want to estimate a quantity μ. We form an estimate of that quantity, x₁, with a 95% confidence interval of x₁ ± 2. Let’s say we form another estimate, x₂ = x₁ + 4, with a confidence interval of x₂ ± 2. And then it is revealed to us that μ = x₁ + 2: the midpoint between the two estimates, lying within both intervals. Which estimate, x₁ or x₂, is the correct one? Can we infer that x₂ is correct?

It is easy to see that the answer to the first question is “we can’t tell; the data are equally consistent with both estimates.” The second question is more subtle, but the existence of x₁ suggests that we ought to be cautious about inferring that x₂ is somehow “correct.”
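The arithmetic behind the toy example can be checked directly. Here is a minimal sketch in Python, with illustrative numbers assumed for the exercise (two estimates four points apart, each with a roughly ±2 interval, and the truth revealed halfway between): if each estimate were an unbiased draw centered on the truth, the observed data are exactly as likely under either hypothesis.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd) distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Illustrative values: estimates four points apart, 95% CI of about
# +/-2 (so sd ~= 1), and the truth revealed halfway between them.
x1 = 50.0
x2 = x1 + 4
mu = x1 + 2
sd = 1.0

# The likelihood of observing each estimate, were it an unbiased draw
# centered on the truth, is identical by symmetry:
print(normal_pdf(x1, mu, sd), normal_pdf(x2, mu, sd))
```

The two printed densities are equal, which is all “equally consistent” means here: nothing in the revealed truth favors one estimate over the other.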

This toy example reveals something fundamentally rotten in the election polling postmortem.

Many polling pundits are arguing this week that the polls were “correct” in some sense because polling results produced estimates with confidence intervals (or credible intervals) that captured the final two-party vote share, either nationally or by state. Here is one example, but there are many others to be found. The toy example above makes clear that no such inference can be drawn. If we call the election poll aggregates x, the results from Tuesday’s election (call them μ) are equally consistent with the hypothesis that the aggregated estimates were perfectly unbiased and with the hypothesis that those estimates were biased upward by four points.

The general point is this. The confidence interval around the two-party vote share estimate from polls reflects the standard error of the…estimate from the polls. It is not a confidence interval that captures the actual two-party vote share, except under the hypothesis that the data generating process that produces the polls is the same data generating process that generates the vote. The same point holds for polling aggregates. We may not infer anything about the accuracy of the polls or the quality of the poll aggregates from the relationship between the election result and some confidence interval *except by maintaining that hypothesis*. If we instead maintain the alternative hypothesis that the polls were subject to substantial systematic error in modeling turnout and/or voter intentions, these results are equally consistent with many hypotheses about the size of that error.
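A small simulation makes the maintained-hypothesis point concrete (all numbers here are illustrative assumptions, not estimates of any real poll): when every poll in an aggregate shares a common systematic error, the aggregate’s nominal 95% interval, whose width reflects sampling error only, almost never covers the true vote share.

```python
import random

random.seed(0)

def coverage(bias, n_sims=5_000, n_polls=20, sampling_sd=3.0):
    """Fraction of simulations in which the poll aggregate's nominal
    95% CI covers the true vote share, when every poll shares `bias`."""
    true_share = 50.0
    hits = 0
    for _ in range(n_sims):
        # Each poll: truth + shared systematic error + sampling noise.
        polls = [true_share + bias + random.gauss(0, sampling_sd)
                 for _ in range(n_polls)]
        agg = sum(polls) / n_polls
        se = sampling_sd / n_polls ** 0.5   # reflects sampling error only
        if abs(agg - true_share) <= 1.96 * se:
            hits += 1
    return hits / n_sims

print(coverage(bias=0.0))  # close to 0.95: interval valid if polls unbiased
print(coverage(bias=4.0))  # close to 0.00: systematic error destroys coverage
```

The interval is “correct” only under the hypothesis of zero systematic error; under a four-point shared error it tells us almost nothing about where the vote share will land.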

This point has momentous implications for public opinion polling and for American democracy. If one judges the quality or correctness of polls by confidence-interval coverage, one might conclude that there is no need to reevaluate the polls themselves. Estimates of uncertainty are necessary in public opinion polling, but they also make it hard to diagnose fundamental, systematic error. The more informative way to proceed is to identify the error, as Sam Wang has done (“The business about 65%, 91%, 93%, 99% probability is not the main point”), and, going forward, to learn how to minimize it.

There is no way to avoid the secondary conclusion that this will be hard. As I wrote two days ago,

> Future aggregates for future elections by sites like 538 are going to use historical performance (i.e., prediction error today) to weight or “adjust” future polls. It is possible that some polls were more accurate than others because they had better models of turnout and voter intentions. It is also possible that all polls were just off (“correlated errors,” in the lingo), and some of these randomly happened to be less off than others. If the latter is true, then adjustments in the future will be worse than useless—they will be chasing noise.
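The correlated-errors scenario can be simulated directly. The sketch below uses assumed error sizes (a large shared error, small idiosyncratic noise) and is not a model of any real pollster: every pollster shares one systematic error per election, the pollster that happened to land closest in election one is crowned “best,” and we then check whether that label predicts anything in election two.

```python
import random

random.seed(1)

def chasing_noise(n_trials=5_000, n_pollsters=10, shared_sd=3.0, noise_sd=1.5):
    """Average absolute error in election 2 for (a) the pollster that
    looked best in election 1 and (b) a typical pollster, when errors
    are dominated by a shared, correlated component."""
    best_err = avg_err = 0.0
    for _ in range(n_trials):
        # Election 1: one shared error for everyone, plus small noise.
        shared = random.gauss(0, shared_sd)
        errs1 = [shared + random.gauss(0, noise_sd) for _ in range(n_pollsters)]
        best = min(range(n_pollsters), key=lambda i: abs(errs1[i]))
        # Election 2: a fresh shared error; the "best" label carries over.
        shared = random.gauss(0, shared_sd)
        errs2 = [shared + random.gauss(0, noise_sd) for _ in range(n_pollsters)]
        best_err += abs(errs2[best]) / n_trials
        avg_err += sum(abs(e) for e in errs2) / n_pollsters / n_trials
    return best_err, avg_err

best_err, avg_err = chasing_noise()
# The two averages come out roughly equal: when errors are correlated,
# last election's "best" pollster holds no real edge next time.
print(round(best_err, 2), round(avg_err, 2))
```

Under these assumptions, weighting future polls toward last cycle’s winner is exactly the noise-chasing the quoted passage warns about.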

The future of election polling is not “whose polling aggregation method had the greatest uncertainty?” The future is “whose polls are the most accurate, and how do we know?” Anyone who suggests otherwise is either confused, or trying to sell you something.