Category: Language

  • ChatGPT-4 Knows Malaysia’s Ethnic Order

    I am writing a book about Malaysia’s ethnic order*, using the Malay world as a way to think more generally about social categories like ethnicity and how to conceptualize them. This book project combines all of my interests: maritime Southeast Asia, political economy, colonial history (Portuguese, Dutch, English, and Japanese), language, analytic philosophy, social ontology, diaspora studies, criticizing things for not really making sense, trying to subvert paradigms, making custom maps, and statistics. The last of these is the subject of this post.

    One of the tasks of my book is to demonstrate that my understanding of Malaysia’s ethnic order is not simply a projection of my own biases onto Malaysian history, society, and culture. To do this, I’ve been using survey data collected from Malaysians (and Indonesians… although that’s not really relevant for today’s post) to try to characterize the country’s ethnic order. An ethnic order, for me, is

    the set of beliefs, practices, and ancillary institutions that organize and sustain ethnic categories within a given political community

    Such an abstract concept is hard to characterize, and hard to survey people about without supplying exactly the answers that I hope that they will provide (thus invalidating the exercise). So what I have done is to survey Malaysians and ask them questions about what comes to mind when they think of various ethnic groups in the country. They provided open-ended responses, which I analyzed through a Structural Topic Model to estimate the relative salience of different types of responses for different types of ethnic groups.

    The result is an analysis that says that, for example, when responses are generally focusing on things like language and religion, they are more likely to be talking about Malays than Chinese. When they’re talking about economic matters, they are more likely to be talking about Chinese than Indians. And so forth. When the book is done, it will explain why such results are consistent with my particular approach to understanding Malayness in Malaysia.**

    But with ChatGPT-4 and other large language models (LLMs) completely blowing up everything, it occurred to me this morning that one might wonder what the computer says about my data. I was inspired to do this by reading this paper from Petter Törnberg, which used ChatGPT-4 to predict the partisan affiliation of the authors of tweets. I’m reporting here what I found from some initial playing around, and it is remarkable, scary, and open to a lot of interpretation (more on this at the end).

    Here is what I did: I logged into ChatGPT-4 and opened up my file containing the open-ended survey responses. I then typed the following at the prompt:

    I asked a Malaysian survey respondent the following question:

    Now I am going to ask you about several different ethnic groups. I would like to you to tell me two (2) things that come to mind when you think about these groups. There is no right or wrong answer; you may think of particular words or phrases, or perhaps nothing at all.

    I asked this about ethnic Malays, ethnic Chinese, and ethnic Indians in Malaysia. Imagine that the respondent gave the response “kurang rajin Islam.” Which of the three ethnic groups above do you think this response would be associated with?

    The words kurang rajin Islam are the responses that one survey respondent actually gave. It is a combination of two responses, because I asked the respondent this question twice (one response was kurang rajin and the other was Islam). After hitting Enter, ChatGPT-4 spit out the following:

    Based on the response “kurang rajin Islam,” it is likely associated with the ethnic Malay group in Malaysia.

    “Kurang rajin” translates to “less diligent” in English, and “Islam” refers to the religion of Islam. The phrase suggests a perception or association related to the Malay community and their commitment or dedication to their religious practices. It’s important to note that this response reflects the perspective of the individual respondent and should not be taken as a generalization for the entire Malay ethnic group.

    This is actually bananas. The computer correctly identified which group that response was associated with. And it kept doing it. I asked

    Imagine that the respondent gave the response “pemegang ekonomi.” Which of the three ethnic groups above do you think this response would be associated with?

    And the computer said

    The response “pemegang ekonomi” translates to “economic holders” or “economic owners” in English. Given this response, it is likely associated with the ethnic Chinese group in Malaysia.

    The phrase “pemegang ekonomi” suggests a perception or association related to the economic prowess or prominence of the ethnic Chinese community in Malaysia. It reflects the recognition of their significant contributions to the economic sector in the country. However, it is important to note that this response represents the viewpoint of the individual respondent and should not be taken as a generalization for the entire ethnic Chinese group.

    It gets worse, but I’ll spare you. And although it doesn’t always work, these initial forays show that LLMs are remarkably accurate at predicting what group a short survey response was referring to in this specific example. You can think of a lot of other ways that one might study social categories using these AI tools.
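
    The procedure described above, typed by hand into the chat window, could also be scripted. What follows is only a minimal sketch under stated assumptions, not what I actually ran: the names build_prompt, extract_group, and accuracy are hypothetical, and ask stands in for whatever function sends a prompt to a model and returns its reply as text. No particular API is assumed. The prompt wording is copied from the exchange above.

```python
from typing import Callable, Iterable, Optional, Tuple

# The three groups named in the prompt above.
GROUPS = ("Malay", "Chinese", "Indian")

# Survey question, copied verbatim from the prompt quoted in the post.
QUESTION = (
    "Now I am going to ask you about several different ethnic groups. "
    "I would like to you to tell me two (2) things that come to mind when "
    "you think about these groups. There is no right or wrong answer; you "
    "may think of particular words or phrases, or perhaps nothing at all."
)


def build_prompt(response: str) -> str:
    """Wrap one open-ended survey response in the prompt used in the post."""
    return (
        "I asked a Malaysian survey respondent the following question:\n\n"
        f"{QUESTION}\n\n"
        "I asked this about ethnic Malays, ethnic Chinese, and ethnic Indians "
        f"in Malaysia. Imagine that the respondent gave the response “{response}.” "
        "Which of the three ethnic groups above do you think this response "
        "would be associated with?"
    )


def extract_group(reply: str) -> Optional[str]:
    """Return the one group named as 'ethnic <group>' in a reply.

    Returns None when no group or more than one group is named. Matching on
    'ethnic malay' rather than 'malay' avoids false hits on 'Malaysia'.
    This is a crude heuristic (an assumption), not a robust parser.
    """
    lowered = reply.lower()
    named = [g for g in GROUPS if f"ethnic {g.lower()}" in lowered]
    return named[0] if len(named) == 1 else None


def accuracy(
    labelled: Iterable[Tuple[str, str]],
    ask: Callable[[str], str],
) -> float:
    """Share of (response, true_group) pairs for which the reply names true_group.

    `ask` is a hypothetical placeholder: any function from prompt to reply text.
    """
    pairs = list(labelled)
    hits = sum(extract_group(ask(build_prompt(text))) == truth for text, truth in pairs)
    return hits / len(pairs)
```

    With a real ask function plugged in, accuracy could be computed over every labelled response in the survey file rather than over a handful of hand-picked examples, which would put a number on “it doesn’t always work.”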

    I don’t know exactly what to do with these results, but I do know that they’re getting at something significant. For one, the AI developers are mostly trying to rein in LLMs to prevent them from saying racist, sexist, and otherwise biased things. But here is a case where we have the reverse: actual statements (and rather explicitly biased in both cases) by real humans that can be classified according to the best guess of who is being referred to. How can this be stopped?

    But I think that there’s another level at which to think about this result. That is, knowing that ChatGPT-4 is not thinking, but rather just doing a super-duper prediction model of what words are most likely to be sensible responses to other words, what could one learn from this? Is this evidence of something, and if so, of what? What people think? What people have written down? What Malaysians have written down? What researchers or journalists have written down?

    There are even more levels. What about the coarseness of my own prompt (I wrote Malaysia, when I really mean peninsular Malaysia)? Should I give feedback on these results, thus contributing to reinforcement learning? If I write up these results, would I then be further contributing to stereotypes by producing a text that will feed into the same LLMs to produce even more such results? What if my own ethical goals are not to reinforce, but rather to undermine the present social order?

    I don’t have any great answer to these questions. I am a pessimist about what LLMs will do to human society who nevertheless marvels at what they can do sometimes.

    NOTES

    * You can think of an ethnic order as like a “racial order.”

    ** But not Indonesia. That’s important. And also, not really even Sabah/Sarawak, this is about peninsular Malaysia. That’s also important.

  • The Proposal to Make Malay an Official ASEAN Language is not Crazy

    Last week the Malaysian Prime Minister Ismail Sabri Yaakob remarked that Malay has the potential to be an official language of Southeast Asia. This feels like a rather surreal moment in Malaysian politics, at least according to most of the popular media coverage I’ve seen. Like of course that’s not going to happen, right?

    I don’t think the idea of Malay as a working language for ASEAN is unrealistic. In fact, it would be entirely reasonable for ASEAN to designate one of the languages of Southeast Asia as an official or working language. And if you had to pick only one language from the ASEAN member states, it would definitely be Malay.

    But of course there would have to be a lot of interesting details to work out. So here’s how to think about what’s going on in Ismail Sabri’s comments.

    The first thing to note is that Ismail Sabri distinguishes explicitly between Bahasa Malaysia and Bahasa Melayu. The former is “the Malaysian language,” in the sense of the language of the state of Malaysia. The latter is “the Malay language,” which refers to a family of loosely related and mostly mutually intelligible dialects, many of which are creoles with various levels of official standardization and recognition. The generic term Bahasa Melayu would include the official languages of the states of Indonesia, Malaysia, and Brunei, the Indonesian regional languages of Jambi and Riau, various dialects of Malay spoken in peninsular Malaysia, and the Middle Indonesians of Jakarta, Medan, Makassar, Manado, Kupang, Ambon, and other cities. One often hears “Malay” used to describe the official language of Malaysia, but Ismail Sabri means Malay in the expansive, second sense.

    This means that Ismail Sabri is talking about the first language of hundreds of millions of people across at least seven countries. It is the majority language in Indonesia and Malaysia and Brunei, it is the national language in Singapore, it is a working language in Timor-Leste (a future ASEAN member), and it is a minority language in southern Thailand and the Philippines. No other single language is so important for so many countries in ASEAN; it’s not even close. It would be much more jarring to make such a case for Thai, Vietnamese, or Tagalog as a working language for ASEAN.

    Second, Ismail Sabri’s comments don’t seem to be saying that Malay would be the only official or working language aside from English. It seems reasonable to say that Malay might be one of many, just like German and French are the working languages of the European Union in addition to English.

    Third, the idea that there should be no other official languages for ASEAN besides English is a little colonial. It favors Singapore, and secondarily Malaysia and the Philippines. (This paragraph has levels of meaning that you may excavate at your leisure.)

    The argument against adopting Malay as an official language of ASEAN is that ASEAN is not yet at the stage where difficult, internally-focused conversations are feasible. That is, making the first language of the largest linguistic group the official language requires a level of social trust and long-term institutional commitment that ASEAN member states do not have.

    Put it this way: under what conditions would Vietnam agree to Malay as a working language for ASEAN? Only if it thought that there was a future in which either (1) Vietnamese would be afforded such status in exchange or (2) ASEAN was so dependent on its relations with the Malay-speaking world, and Vietnam so dependent on ASEAN, that this was in Vietnam’s long-term interest.

    One might think about this with reference to Bahasa Indonesia and Bahasa Malaysia, two standardized and official versions of Malay. Malay was a first language of a small minority of Indonesians at independence, although it was spoken as a second language by a larger minority (mostly in urban areas). Making Malay the official language of Indonesia proved to be unifying specifically because it did not favor any of the larger ethnic groups in Indonesia. In Malaysia, where Malay was the first language of a numerical majority of Malaysians at independence, making it the official language has proven divisive for the country’s linguistic minorities. Making Malay the lingua franca of ASEAN would probably be more divisive than unifying among ASEAN members. English, for better or for worse, isn’t so divisive.

    Of course, there is also the question of which Malay. The national languages of Malaysia and Indonesia are close enough that one can understand both with some work,* but they are not identical and differences of meaning can sometimes emerge.** But the Middle Indonesians and Malay dialects of peninsular Malaysia can be highly divergent from those national standards: I suspect that a speaker of Kupang Malay would not be able to converse easily with a speaker of Kelantan Malay. Making Malay the lingua franca of ASEAN would mean developing a generic version of Malay which is probably not any of the standardized forms that currently exist.

    And finally, if we’re going to talk about the political implications of making Malay a working language of ASEAN, we might also pause to consider the politics of Ismail Sabri’s own remarks. He was making them at a symposium on the Malay language being held by Malaysia’s own national language council, Dewan Bahasa dan Pustaka.*** I view Ismail Sabri’s comments as directed internally towards Malaysians, encouraging them to use Malay in Malaysia as the default working language in business, tech, and education. Such an argument wouldn’t be needed for any other national language in Southeast Asia, which is, itself, an interesting observation.

    NOTES

    * I tend to think of the difference between Bahasa Indonesia and Bahasa Malaysia as roughly equivalent to that between standard American English and Scots. If you speak one you can understand most of the other but you might prefer the subtitles when watching a TV show. It might be easier for non-native speakers like me to switch between them.

    ** Best example is kenyang, which means “full” or “stuffed” in Standard Indonesian, but in Bali Malay means something closer to “tumescent.” Cf. “fanny.”

    *** Official motto: Bahasa Jiwa Bangsa [= Language is the Soul of the Nation]. (Not ethnic group: nation.)