I am 100% fascinated by this post by Jay Ulfelder on using text mining to code regime type. I love how it allows us to think concretely about fuzzy classifications, and to be explicit about uncertainty. If I had to guess, I’d say that this is the future of regime coding.
I was also a bit dismayed to see that some of my favorite countries score lower on the likelihood-of-democracy measure than I would like.

Image credit: Jay Ulfelder, from http://dartthrowingchimp.wordpress.com/2014/08/25/mining-texts-to-generate-fuzzy-measures-of-political-regime-type-at-low-cost/
What could explain this? Well, the one thing that we depend on in such exercises is an unbiased (or at least “randomly” biased) source of text. And it could be that there are features of countries that lead sources to discuss them differently. Let’s take an example: the killing of an unarmed black man by a police officer will almost certainly not affect the way that Freedom House discusses the United States. However, a police killing of an unarmed man in a country like Indonesia may affect the way that Freedom House discusses Indonesia. The same goes for violence against Sikhs in the U.S. versus violence against Christians, Ahmadis, and Shia in Indonesia. The list goes on.
Now, this need not undermine the exercise. My guess is that one could model the differences in the way that the underlying texts represent political conditions in the countries that they cover using a rich set of observables. If we hypothesize that the sources talk about rich countries differently, that they talk about Muslim countries differently, and that they talk about countries with histories of authoritarianism differently (so that the authoritarian histories have “slow decays”), then perhaps those country characteristics could be used as inputs to the classification algorithm. I don’t have access to the Ulfelder, Schrodt, and Ward paper, but I wonder if they’ve explored such issues.
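To make that idea concrete, here is a minimal sketch of what "observables as input" might look like. It is not the method in the Ulfelder, Schrodt, and Ward paper, which I have not seen. Everything in it is an assumption made up for illustration: the vocabulary, the three covariate names, the four toy "reports," and the choice of a plain logistic regression as the classifier.

```python
import math
from collections import Counter

# Hypothetical vocabulary and covariate names, chosen only for illustration.
VOCAB = ["free", "fair", "elections", "arrests", "censorship"]
COVARIATES = ["rich", "muslim_majority", "authoritarian_legacy"]


def featurize(text, covariates):
    """Concatenate term frequencies from a report with country-level observables."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) or 1
    text_part = [counts[word] / total for word in VOCAB]
    cov_part = [float(covariates[name]) for name in COVARIATES]
    return text_part + cov_part


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Predicted probability that the country-year is coded a democracy."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Entirely made-up toy data: (report text, country observables, democracy label).
toy = [
    ("free fair elections",
     {"rich": 1, "muslim_majority": 0, "authoritarian_legacy": 0}, 1),
    ("free fair elections arrests",
     {"rich": 0, "muslim_majority": 1, "authoritarian_legacy": 1}, 1),
    ("arrests censorship",
     {"rich": 0, "muslim_majority": 0, "authoritarian_legacy": 1}, 0),
    ("censorship arrests elections",
     {"rich": 1, "muslim_majority": 1, "authoritarian_legacy": 0}, 0),
]

X = [featurize(text, cov) for text, cov, _ in toy]
y = [label for _, _, label in toy]
weights, bias = fit_logistic(X, y)
probs = [predict_proba(weights, bias, x) for x in X]
```

The only design choice that matters here is in `featurize`: the country observables sit in the same feature vector as the text, so the classifier can learn that a given word carries different weight once the country's characteristics are accounted for. A more serious version would presumably interact the covariates with the text features rather than just appending them, and would let the authoritarian-legacy term decay over time.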
One thing is for sure: I wish I could raise these questions myself at their presentation. Unfortunately, I am a discussant on a panel at the same time, at the always popular 8AM slot. So, I look forward to reading more.
cpgearlywarning August 26, 2014
Hey, Tom. Thanks very much for discussing this work. I have quick responses to two concerns you raise.
First, on why some of the countries you know score so low on the likelihood-of-democracy measure, the key point here is that in our preliminary analysis, we defined democracy as a Polity score of 10. This sets the bar super-high and is not to be confused with the minimalist definitions that are typically used in binary democracy/autocracy data sets. As our paper notes, we plan to relax that criterion in future iterations, but the map represents an effort to identify a pretty stringent form.
Second, to your point about biases in the texts, you’re right that this should be a big concern, and that the State and Freedom House reports represent a very narrow range of views on this subject. To address this issue, we’re planning to expand the corpus in future iterations to include a much wider array of texts from a more diverse set of sources, eventually to include news stories as they are published. It will be interesting to see how the results change as we widen and move the lens around.
tompepinsky August 26, 2014
Sounds fascinating, and like some great possible extensions. I look forward to reading more.
puspini September 9, 2014
I would’ve wanted to attend this panel too!!