Malaysia’s GE13, Long Form Research Blogging Redux, and Statistics versus Econometrics

A little over two years ago, I wrote a post on “Long Form Research Blogging” related to my series of posts on Malaysia’s Thirteenth General Election. I wondered at the time if there would ever be a way to publish all of that in an academic journal.

Through an unusual set of coincidences, I have managed to cannibalize a good deal of those posts in an article that is now in print. What’s unusual about the process? Just that the article is a commentary on another article, published in the Journal of East Asian Studies, which draws what I think are some erroneous conclusions about the relationship between ethnicity and votes for the Barisan Nasional coalition. The full text is not yet available from the publisher’s website. But I have made a copy of the original article, my comment, and the authors’ rejoinder available here: PDF. This is of course only for your own personal use, dear reader.

I think that the contributions speak for themselves. But I do want to flag one issue in NRVP’s rejoinder which I find puzzling. They draw on a distinction between statistics and econometrics as described by Rob Hyndman, and describe me as taking a statistical approach.

The argument above that Pepinsky makes against our use of the fractional logit model is an example of the difference in disposition between taking either a theory-driven or data-driven approach. While largely similar, econometrics is predominantly theory driven while statistics tend to be data driven. Therefore, an econometrician develops a model based on economic (and other relevant) theories while a statistician may build a model after looking at datasets. The econometrician subsequently confronts the model with datasets to test the theory. The interested reader can refer to Rob Hyndman’s blog post for interesting insights into the differences between the two. In this context, it can be said that our econometric model is theory driven while Pepinsky’s model is data driven.

This is odd to me because I thought that I was taking the econometric approach, and they were taking the statistical approach! I mean, contrast that quote above with the following argument from NRVP about why they have opted to use ethnic population totals, which I argue are theoretically inappropriate as a substitute for ethnic population shares:

  • It would make the article too technical, distracting the reader from
    the political issues at hand.
  • Interpretation of isometric log-ratio transformed variables is difficult,
    even in linear regression models, thereby making it hard to
    make useful inferences.
  • No work has been done on how the isometric log-ratio transformation
    can be performed on quadratic variables and for interaction variables

In lieu of the above, we decided to go with ethnic population totals as our measure of ethnicity, as the sum constraint would at least somewhat be removed. However, we acknowledge that this is not the best way to model ethnicity, which Pepinsky has correctly and strongly pointed out. Nevertheless, in our opinion, it is the better choice to model the data.

That looks to me like NRVP are prioritizing statistical procedure over coherent theory. My view is that we should not do that.
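
For readers who want to see the sum constraint concretely, here is a minimal Python sketch. It is not code from the article, the comment, or the rejoinder. The district counts are made up, the groups are just labelled A, B, and C, and the function names `shares` and `ilr` are my own illustrative choices. It uses one standard isometric log-ratio basis (pivot coordinates) to show two things: shares always sum to one, and two districts with the same ethnic composition but different sizes have very different totals.

```python
import math

def shares(totals):
    """Convert group population totals into shares that sum to one."""
    district_size = sum(totals)
    return [t / district_size for t in totals]

def ilr(parts):
    """Isometric log-ratio transform using pivot coordinates.

    Maps D strictly positive parts to D - 1 unconstrained coordinates.
    Each coordinate is a scaled log-ratio of one part to the geometric
    mean of the parts that come after it.
    """
    logs = [math.log(p) for p in parts]
    coords = []
    for i in range(len(parts) - 1):
        rest = logs[i + 1:]
        k = len(rest)
        scale = math.sqrt(k / (k + 1))
        coords.append(scale * (logs[i] - sum(rest) / k))
    return coords

# Hypothetical districts: same ethnic composition, different sizes.
small = [6000, 3000, 1000]      # totals for groups A, B, C
large = [60000, 30000, 10000]   # ten times larger, same mix

small_shares = shares(small)
large_shares = shares(large)

# Shares carry the sum constraint: knowing two of them fixes the third.
print("small shares:", small_shares, "sum =", sum(small_shares))

# Totals differ by a factor of ten although the composition is the same.
print("small totals:", small)
print("large totals:", large)

# The ilr coordinates depend only on composition, so the two districts
# coincide, and there are two coordinates rather than three constrained shares.
print("small ilr:", ilr(small_shares))
print("large ilr:", ilr(large_shares))
```

The design point is that totals mix ethnic composition with district size, whereas a log-ratio transformation deals with the sum constraint while keeping composition as the quantity being measured.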

Readers who have slogged the whole way through this post might also be interested in Andrew Gelman’s thoughts on statistics versus econometrics.