In my last post, I outlined an interpretation of the results of GE13 as reflecting primarily an urban-rural divide, an explanation that has in various ways been articulated as different from or even competing with the ethnicity-based analyses that I have offered. I discussed the challenges of separating ethnicity from region or urban/rural, not only empirically (they covary) but also conceptually (they covary *for a reason*).

In this post, I take a closer look at what the data say. Along the way, we will review some principles of causal explanation in the social sciences, to highlight what explanations such as mine are meant to do, and how we think about competing explanations for social phenomena. To preview: I do not believe that ethnicity and urban/rural are competing ways of looking at the results of the GE13 parliamentary races, and I can show that they both “matter” (in a sense that I will make precise later). But if we adopt the view that they must be competing interpretations, so that evidence in favor of one entails evidence against another, then ethnicity wins. Full stop.

Before proceeding, some caveats are in order.

- The data that I am using is still the preliminary data that was released the day after GE13. I have not yet scraped the official data from undi.info or any official website. If the numbers have changed in any appreciable way, my results might not hold.
- There are so many criticisms of the electoral process, including phantom voters, ballot box stuffing, and other shenanigans, that we should probably wonder if the official vote totals can be trusted. While it’s not true that the two-coalition vote shares are as random as darts on a dartboard, I take the possibility that the data are “cooked,” or at least gently simmered in places, very seriously. If anyone reading this has access to the voting-station level returns (I’m looking at you, poli sci PhDs who are now part of Malaysia’s parliament!) I am very happy to explore some election forensics with you.
- I have focused only on parliamentary races, even though I know that state assembly races may exhibit different patterns. I invite anyone who is seriously interested in engaging with these arguments, and who thinks that state elections are fundamentally different, to collect those data and analyze them.

Keeping those caveats in mind, there are far more analysts of Malaysian politics than there are social scientists who try to use Malaysian data to analyze GE13 systematically. Showing how ethnicity matters means different things depending on your perspective, so it’s worth being explicit about what I have in mind.

### Competing Explanations, Competing Hypotheses?

My analyses show that ethnicity predicts vote choice. This does not mean that *only* ethnicity predicts vote choice, or that ethnicity *always* predicts vote choice, in which case there are no other possible factors that could explain why any Malaysian votes for or against a BN candidate. It should go without saying that I also don’t have any patience for the perspective that we can make ethnic politics in Malaysia go away by studiously avoiding any discussion of ethnicity, or by asserting that we must adopt some alternative perspective.

Social scientists will understand this as an “effects of causes” approach rather than a “causes of effects” approach. This is very much mainstream political science. My analyses so far have only been able to show that an independent variable (ethnicity) explains a dependent variable (two-coalition vote shares). My analysis is rather indifferent to the fact that other scholars examine the other things that may explain vote choice in GE13.

But let me also be clear: ethnicity is an essential, fundamental factor in Malaysian politics. So I am transgressing the norms of mainstream comparative politics, and I am trying to make the case that ethnicity is something like a “master variable”—this is not a technical term—in Malaysian politics. Not *the* master variable, but *one* master variable. That is why it is so important that I show the amazingly tight quadratic fit between Malay population and BN vote share.

Taken together, the methods that I have used to study vote share are about showing the effects of causes. But I have discussed the results as if they are also making a broad point, about the causes of effects: why, in the broadest terms, GE13 turned out the way that it did.

So with that in mind, how do we analyze two different theories of what district characteristics explain BN vote share? It depends on the purpose of the analysis. If we want to show the effects of causes, we should find a measure of the urban/rural split across districts, and then see if that variable also explains BN share. We can also examine that measure next to our measure of ethnicity to see whether our rural variable explains any of the additional variation in BN vote share that ethnicity does not explain. This is one common interpretation of what multiple regression does: when viewed as a way to illustrate causal relationships (instead of just as a way to summarize partial correlations), multiple regression assumes that one set of outcomes can have multiple causes.
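To make the effects-of-causes logic concrete, here is a minimal sketch in Python of fitting an ethnicity-only model, a rural-only model, and a model with both. Everything in it is an assumption for illustration: the data are synthetic (not the GE13 returns), and the names `pct_malay`, `ln_area`, `bn_share` and `fit_ols` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins, NOT the GE13 data: percent Malay and log district
# area, built to be positively correlated as ethnicity and rurality are.
pct_malay = rng.uniform(5, 95, n)
ln_area = 0.03 * pct_malay + rng.normal(0, 1, n)
bn_share = (10 + 1.0 * pct_malay - 0.005 * pct_malay**2
            + 3.0 * ln_area + rng.normal(0, 5, n))


def fit_ols(y, *columns):
    """OLS of y on an intercept plus the given columns; returns (beta, R^2)."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, r2


# Ethnicity only, rural only, and both together.
_, r2_ethnicity = fit_ols(bn_share, pct_malay, pct_malay**2)
_, r2_rural = fit_ols(bn_share, ln_area)
beta_both, r2_both = fit_ols(bn_share, pct_malay, pct_malay**2, ln_area)

print(f"R2 ethnicity only: {r2_ethnicity:.2f}")
print(f"R2 rural only:     {r2_rural:.2f}")
print(f"R2 both:           {r2_both:.2f}")
print(f"ln_area coefficient, holding ethnicity fixed: {beta_both[3]:.2f}")
```

The point of the combined model is the last line: the coefficient on the rural measure is estimated while holding ethnic composition fixed, which is the sense in which the regression asks whether rurality explains variation that ethnicity does not.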

There is much less agreement about how to formally compare or adjudicate among different causes of effects. For some, the entire endeavor is ill-posed: what does it mean to assert that some explanation is *the cause* of some effect? One way to do this is to compare the extent to which two independent variables explain the variation in a dependent variable—in this case, do rural/urban differences explain more about the electoral results than ethnicity does? The problem that arises is that both variables may be pretty good at explaining variation.

Even so, there are various kinds of model selection procedures which can be used to select which model does “better” according to some metric. Recently, Imai and Tingley (2012) provided a very different way to think about this problem. We have two theories of what determines BN votes at the district level. These two theories imply two different hypotheses. The hypotheses are non-nested: one is not a subset of the other, which is what makes them mutually exclusive. Imai and Tingley propose that we can compare any set of theories with the proportion of the cases being analyzed that are “*statistically significantly consistent*” with one theory versus the other. Finite mixture models, as they discuss, are one way to calculate this.

This approach still assumes that the two models really do compete with one another, which remains a theoretical assumption rather than an empirical result. We will return to this at the end of this post. For now, let’s turn to the data.

### Results

I have described the data on ethnicity previously. How should we measure whether a district is urban or rural? I would argue that we should not think of districts as either urban or rural, but rather as varying along an urban-rural continuum. And a natural indicator for the extent to which a district is urban or rural is its size. I have calculated each district’s area using the maps that my research assistant Rachel Greenberg has created. (I’m not sure what exact distance metric this uses, but that does not matter because the areas are all calculated with the same map using a common metric.) Taking the natural log of the district’s area yields a nice continuous measure of how rural the district is.

Below are the results of six simple regressions, using data from the peninsula. The dependent variable in each is the BN two-coalition vote share, and Models 1-3 differ from Models 4-6 in that the latter include state fixed effects to control for any particularities of party politics in each state.

##### Table 1: Peninsular Districts

| | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 |
|---|---|---|---|---|---|---|
| Malay | 1.22* | | 0.98* | 0.76* | | 0.65* |
| | (0.11) | | (0.10) | (0.09) | | (0.08) |
| Malay Sq | -0.01* | | -0.01* | -0.00+ | | -0.00+ |
| | (0.00) | | (0.00) | (0.00) | | (0.00) |
| ln Area | | 5.68* | 2.97* | | 5.96* | 2.89* |
| | | (0.49) | (0.44) | | (0.63) | (0.41) |
| N | 164 | 164 | 164 | 164 | 164 | 164 |
| Adjusted R2 | 0.61 | 0.45 | 0.69 | 0.81 | 0.58 | 0.86 |
| State FEs | No | No | No | Yes | Yes | Yes |

Standard errors in parentheses. + p < 0.01, * p < 0.001.

The results show not only the extraordinary explanatory power of ethnicity, but also that a district’s degree of rurality is a strong predictor of BN vote share. But the most important results are in Models 3 and 6. They show that ethnicity remains a very strong predictor of BN vote share even when accounting for urban/rural cleavages. My previous findings about ethnicity cannot be reduced to an urban/rural cleavage.

These results also show that, indeed, rural districts were more supportive of the BN, even accounting for districts’ ethnic makeup. This gives important credence to scholars who focus on the urban/rural split in GE13.

In this and all previous analyses, I have focused exclusively on districts in peninsular Malaysia. However, I can tell a similar story when including all districts. First, the graphical presentations:

Now, the same models as before, except that I also include a dummy for whether the district is in the peninsula or not. None of these results change if that dummy is omitted.

##### Table 2: All Districts

| | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 |
|---|---|---|---|---|---|---|
| Malay | 1.13* | | 0.96* | 0.76* | | 0.71* |
| | (0.11) | | (0.11) | (0.10) | | (0.10) |
| Malay Sq | -0.01* | | -0.01* | -0.00+ | | -0.00+ |
| | (0.00) | | (0.00) | (0.00) | | (0.00) |
| Peninsula | -11.81* | -8.59* | -9.48* | -12.20 | -24.52 | -11.79 |
| | (1.48) | (1.81) | (1.46) | (7.24) | (10.56) | (7.04) |
| ln Area | | 5.30* | 2.40* | | 5.30* | 1.59* |
| | | (0.48) | (0.45) | | (0.58) | (0.45) |
| N | 221 | 221 | 221 | 221 | 221 | 221 |
| Adjusted R2 | 0.66 | 0.50 | 0.70 | 0.80 | 0.57 | 0.81 |
| State FEs | No | No | No | Yes | Yes | Yes |

Standard errors in parentheses. + p < 0.01, * p < 0.001.

The results are unchanged from before. Both ethnicity and urban/rural differences are very strong predictors of the BN vote share.

### Competing Theories Revisited

Despite these findings, it is possible to still maintain that explanations based on ethnicity versus urbanization really are mutually exclusive, and we have to choose one. So, we must compare Models 1-2, and 4-5, ignoring the fact that Models 3 and 6 exist.

The very simplest way to compare models is to compare the adjusted R2, or the percentage of the total variation in the dependent variable that is explained by the independent variables (with a penalty applied for more complex models which might overfit the data). It is worth pausing to emphasize that comparing R2 is *very bad statistical practice*, especially from an effects-of-causes perspective. (This is an obligatory shout-out to my intro to methods teacher.) That said, we see that in both Table 1 and Table 2, adjusted R2 is higher for Model 1 than Model 2, and Model 4 than Model 5.

If it is a head-to-head contest between ethnicity and urbanization, score one for ethnicity.
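For readers who want to see the complexity penalty spelled out, here is a minimal sketch of the standard adjusted R2 formula. The function name `adjusted_r2` and the input values are hypothetical; none of the numbers below come from the tables above.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
    where k counts predictors excluding the intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical inputs: the same raw R^2 is penalized more heavily
# when it was achieved with more predictors.
print(adjusted_r2(0.60, n_obs=164, n_predictors=1))
print(adjusted_r2(0.60, n_obs=164, n_predictors=2))
```

With the same raw R2, the model with more predictors receives the lower adjusted value, which is the whole of the "penalty" mentioned above.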

More sophisticated model selection procedures for non-nested hypotheses include comparisons of Information Criteria, the J test, and the Cox-Pesaran test. In all cases above, the Akaike Information Criterion and the Bayesian Information Criterion are lower in Model 1 than 2 and 4 than 5.

Score one more for ethnicity.
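As a rough illustration of what an information-criterion comparison involves, the sketch below computes AIC and BIC for two non-nested linear models under a Gaussian likelihood. The data are synthetic and the names (`gaussian_ic`, `x_strong`, `x_weak`) are hypothetical; this is not the code behind the results reported here.

```python
import numpy as np


def gaussian_ic(y, X):
    """AIC and BIC for an OLS model, using the Gaussian log-likelihood
    evaluated at the maximum-likelihood estimates."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    n_params = k + 1  # regression coefficients plus the error variance
    aic = 2 * n_params - 2 * loglik
    bic = n_params * np.log(n) - 2 * loglik
    return aic, bic


# Synthetic data (an assumption): one predictor matters more than the other.
rng = np.random.default_rng(1)
n = 200
x_strong = rng.normal(size=n)
x_weak = rng.normal(size=n)
y = 3.0 * x_strong + 1.0 * x_weak + rng.normal(size=n)

ones = np.ones(n)
aic_a, bic_a = gaussian_ic(y, np.column_stack([ones, x_strong]))
aic_b, bic_b = gaussian_ic(y, np.column_stack([ones, x_weak]))
print(f"Model A: AIC={aic_a:.1f}, BIC={bic_a:.1f}")
print(f"Model B: AIC={aic_b:.1f}, BIC={bic_b:.1f}")
```

Lower values indicate the preferred model. Because both models here have the same number of parameters, the ranking reduces to which one leaves the smaller residual sum of squares.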

The J test and Cox-Pesaran tests, interestingly, fail in this case because the *tests reject both models*. This can happen when both models fit the data well, as is the case here. No points awarded, and a big question mark raised once again over whether these hypotheses really are competing with one another.
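To see how a J test can reject both models, here is a minimal sketch of the Davidson-MacKinnon procedure: add the rival model's fitted values to the model being tested and check whether their coefficient is distinguishable from zero. The data are synthetic and built so that both predictors matter; the function names are hypothetical.

```python
import numpy as np


def ols(y, X):
    """Return OLS coefficients and their conventional standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = (resid @ resid) / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se


def j_test_t(y, X_null, X_alt):
    """t-statistic on the rival model's fitted values when they are
    added as an extra regressor to the model under test."""
    beta_alt, _ = ols(y, X_alt)
    fitted_alt = X_alt @ beta_alt
    beta, se = ols(y, np.column_stack([X_null, fitted_alt]))
    return beta[-1] / se[-1]


# Synthetic data (an assumption): two correlated predictors, both of
# which truly affect the outcome.
rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
z = 0.5 * x + np.sqrt(0.75) * rng.normal(size=n)
y = x + z + rng.normal(size=n)

ones = np.ones(n)
X_a = np.column_stack([ones, x])
X_b = np.column_stack([ones, z])
t_a = j_test_t(y, X_a, X_b)  # tests Model A against Model B
t_b = j_test_t(y, X_b, X_a)  # tests Model B against Model A
print(f"t-stat testing A: {t_a:.1f}; t-stat testing B: {t_b:.1f}")
```

When both t-statistics are large, each model is rejected in favor of the other, which is the signature of two explanations that both carry real information rather than one true model and one false one.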

Finally, the mixture model approach of Imai and Tingley (2012). I will focus here on the results from Table 2, Models 4 and 5. The technical details are here: most useful for our purposes are two quantities. First, the mean of the estimated prior probabilities that each observation is consistent with Model 4 or Model 5. Second, the number of observations that are statistically significantly consistent with Model 4 or 5. The results are in Table 3.

##### Table 3: Mixture Model Results

| | Prior Probability | Number of Observations |
|---|---|---|
| Model 4 (Ethnicity) | 0.884 | 207 |
| Model 5 (Rural) | 0.116 | 14 |

Clearly, more of the district vote totals are consistent with an explanation based on ethnicity than on urban/rural cleavages.

A final point for ethnicity.
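For intuition about what the mixture model is doing, here is a minimal sketch of a two-component finite mixture of regressions fit by EM, in which each component is one of the rival models. This is my own simplified illustration on synthetic data, not the implementation from Imai and Tingley (2012); in particular, the 0.5 cut-off used to count observations is a crude stand-in for their classification rule, and all names are hypothetical.

```python
import numpy as np


def normal_pdf(resid, sigma):
    return np.exp(-0.5 * (resid / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


def weighted_ols(y, X, w):
    """Weighted least squares; returns coefficients and residual s.d."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    resid = y - X @ beta
    sigma = np.sqrt(np.sum(w * resid**2) / np.sum(w))
    return beta, max(sigma, 1e-6)  # floor guards against a collapsing variance


def fit_mixture(y, designs, n_iter=200):
    """EM for a mixture whose components are rival regression models."""
    n = len(y)
    pi = np.full(len(designs), 1.0 / len(designs))
    fits = [weighted_ols(y, X, np.ones(n)) for X in designs]
    for _ in range(n_iter):
        # E-step: posterior probability that each observation belongs
        # to each model, given the current parameters.
        dens = np.column_stack([
            p * normal_pdf(y - X @ b, s)
            for p, X, (b, s) in zip(pi, designs, fits)
        ]) + 1e-300
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update mixing proportions and refit each model.
        pi = resp.mean(axis=0)
        fits = [weighted_ols(y, X, resp[:, j]) for j, X in enumerate(designs)]
    return pi, resp


# Synthetic data (an assumption): 80% of observations follow model A.
rng = np.random.default_rng(3)
n = 300
x, z = rng.normal(size=n), rng.normal(size=n)
from_a = rng.random(n) < 0.8
y = np.where(from_a, 2.0 * x, 2.0 * z) + rng.normal(0, 0.3, n)

ones = np.ones(n)
pi, resp = fit_mixture(y, [np.column_stack([ones, x]),
                           np.column_stack([ones, z])])
print("estimated mixing proportions:", np.round(pi, 3))
print("observations with posterior > 0.5 for model A:",
      int((resp[:, 0] > 0.5).sum()))
```

The estimated mixing proportions play the role of the "prior probability" column in Table 3, and the per-observation posteriors are what one would use to decide which districts are consistent with which model.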

To summarize, all of the evidence I have presented points to three conclusions.

- Both ethnicity and urbanization are very good predictors of the pattern of BN vote shares, not just in the peninsula but throughout Malaysia.
- Even though ethnicity and urbanization overlap, and even though that overlap has historical origins which explain why both matter for the evolution of the BN and its electoral base, the data do not allow us to conclude that one variable encompasses the other.
- Nevertheless, if we force our analysis to adjudicate between ethnicity and urbanization as competing explanations for BN vote shares, ethnicity wins. Every model, every time.

Dear Tom,

The adjusted R2 of model 4 is lower than model 5 in table one (peninsular) 0.58 vs 0.69.

Thanks for reading, Pak Lar. You’re right: I copied the wrong entries for some of the entries in Table 1 (I did the R2 rather than the adjusted R2). It’s fixed now.

Thanks again!

I am curious about your choice of quadratic models. I may be wrong, but on a causal basis a model that is linear in, say, %bumiputera could have a causal basis in Bumiputeras as a whole having a certain voting trend, while a model that is quadratic in %bumiputera would (for the most straightforward causal explanation) require Bumiputeras to make their choices not only based on their Bumiputera status but also on the proportion of other Bumiputeras in their constituency, which seems unlikely.

Is it standard practice to use quadratic models? What happens if the models are constrained to be linear in the variables of interest?

Thanks for reading and commenting, Shernren.

I include the quadratic term in models of %Malay / %bumiputera because it’s clear that the quadratic term fits the data really well; see the graphs here http://themonkeycage.org/2013/05/13/post-election-report-2013-malaysian-election-part-ii/.

But my hunch is that there is a theoretical reason as well. In most (but not all) heavily Malay districts, UMNO runs against PAS. PAS, as an Islamist party, attracts Malay votes in a way that neither PKR nor DAP do. So my hunch is that the quadratic is just capturing the presence of PAS in the most heavily Malay districts. There, in Malay-vs-Malay elections, the effect of more Malays as a fraction of the total population no longer generates more support for UMNO over PAS.

The data seems to bear this out: here is a result of a regression that replaces the quadratic Malay term with a PAS term (plus state fixed effects).

| | pctBN |
|---|---|
| malay | 0.53* |
| | (0.03) |
| PAS | -2.36+ |
| | (1.19) |
| N | 164 |
| r2_a | 0.81 |

So I don’t agree with the idea that the “quadratic in %bumiputera would … require Bumiputeras to make their choices not only based on their Bumiputera status but also on the proportion of other Bumiputeras in their constituency”. It just reflects the very clear tendency of PAS to run in the most heavily Malay districts.
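A small simulation can illustrate the logic of this reply. In the sketch below, the true relationship is linear in the Malay share with a downward shift in districts where PAS runs, and PAS is assumed to run only in heavily Malay districts. All numbers (the 70% threshold, the slope, the size of the shift) are invented for illustration and are not estimates from the GE13 data.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500

# Synthetic districts (assumptions throughout): PAS contests only the
# most heavily Malay seats, and its presence lowers BN's share there.
pct_malay = rng.uniform(0, 100, n)
pas_runs = (pct_malay > 70).astype(float)
bn_share = 20 + 0.5 * pct_malay - 10 * pas_runs + rng.normal(0, 3, n)

ones = np.ones(n)

# Model with a quadratic term and no PAS indicator.
X_quad = np.column_stack([ones, pct_malay, pct_malay**2])
beta_quad, *_ = np.linalg.lstsq(X_quad, bn_share, rcond=None)

# Model that is linear in percent Malay plus a PAS indicator.
X_pas = np.column_stack([ones, pct_malay, pas_runs])
beta_pas, *_ = np.linalg.lstsq(X_pas, bn_share, rcond=None)

print(f"quadratic coefficient (no PAS term): {beta_quad[2]:.4f}")
print(f"PAS coefficient (linear model):      {beta_pas[2]:.2f}")
```

In this setup no voter responds to the share of co-ethnics in the district, yet the quadratic term comes out negative because it absorbs the omitted party-competition variable.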

“(I’m looking at you, poli sci PhDs who are now part of Malaysia’s parliament!) ”

It should be PhD — I doubt there is more than one

Area vs. population? Interested to see whether you record the same results if you measure the urban/rural divide differently. Rather than area why not population density? Given the average size of a PR win constituency was 77,655 vs. BN average constituency size of 46,150 surely this makes sense? DAP internal figures distinguish between urban/semi-urban and rural and have BN winning 108 of 125 constituencies identified as rural. Given current demographic and urbanization trends I wonder how much the ethnic identifier alone remains the main determinant. From qualitative research and anecdotal evidence from my current trip to Malaysia the rural/urban narrative is one that has broad acceptance both within UMNO and among PR constituent parties.

I am sure that ethnicity is not the only factor that determines vote shares. But the role of ethnicity is undeniable, despite what party figures tell foreigners. When UMNO starts running against DAP in Chinese districts, and wins, then we will know that Malaysian politics has transcended ethnicity.

Population density would be more precise than just area. I do not have the population shares on hand, though. The next round of analysis will use that.
