# The Social Media Data Generating Process

I had the opportunity to serve as a discussant for a panel on big data in Asian studies (abstracts in PDF) at the most recent meeting of the Association for Asian Studies. As it turns out, social media is one of the central sources of data for this rapidly developing new field (the other is machine coding of text). As I was thinking about how to discuss the various opportunities and challenges that come from using social media to study politics—and everyone involved was a political scientist of some sort—I found myself trying to sketch out a data generating process for social media data. I came up with four basic challenges that follow as a result of how I conceptualize this DGP.

The basic idea is that social media is a form of political participation. When you think about it that way, though, it’s immediately clear that it is only one of many forms of participation. We might share a link on Facebook, or retweet a hashtagged tweet, and that generates social media data. But we might also vote, or protest, or send a letter to a representative. These are also forms of participation, but ones that are not observed in the social media data. There are lots of forms of political participation; denote the collection of them as $\mathbf{D}$, and individual $i$’s participation across all of them as $\mathbf{D}_{i}$. Each social media datum from individual $i$ is then $D_{i} \in \mathbf{D}_{i}$.

The first challenge, then, is characterizing the relationship between $D_{i}$ and $\mathbf{D}_{i}$.

There is then a technology that censors some fraction of $D_{i}$. As a result of the censoring rule $r(\cdot )$, we actually observe $r(D_{i})$. In some cases the censoring rule is “no censoring,” so that $r(D_{i}) = D_{i}$. But in the interesting cases which animate many recent studies of social media and political participation in illiberal states, $r(\cdot )$ effectively deletes some fraction of the social media data. We especially worry about the possibility that the probability that $D_{i}$ is observed in $r(D_{i})$ is correlated with the content of $D_{i}$.

So the second challenge is characterizing $r(\cdot )$, and then knowing what to do about it.
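To see why content-correlated censoring matters, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: each post gets a hypothetical sensitivity score in $[0, 1]$, and the deletion probability rises linearly with that score (`base_delete` and `slope` are made-up parameters, not estimates from any real platform). The only point is that the average of what survives $r(\cdot)$ can differ from the average of what was posted.

```python
import random


def simulate_censoring(n=10_000, base_delete=0.1, slope=0.6, seed=0):
    """Return (all posts, posts that survive a content-dependent censor).

    Each post is reduced to one hypothetical 'sensitivity' score in [0, 1].
    A post is deleted with probability base_delete + slope * score, so more
    sensitive posts are more likely to disappear (an assumed rule).
    """
    rng = random.Random(seed)
    posts = [rng.random() for _ in range(n)]
    observed = [d for d in posts if rng.random() >= base_delete + slope * d]
    return posts, observed


def mean(xs):
    return sum(xs) / len(xs)


posts, observed = simulate_censoring()
print("share of posts observed:", len(observed) / len(posts))
print("mean sensitivity, all posts:     ", round(mean(posts), 3))
print("mean sensitivity, observed posts:", round(mean(observed), 3))
```

Setting `base_delete=0` and `slope=0` reproduces the "no censoring" case, where $r(D_{i}) = D_{i}$ and the two means coincide.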

Next, recall that social media is social. That means that we care about other actors $j$ who participate too, and we suspect that they act in reaction both to social media and to offline events. So we have $D_j\left ( r(D_i), \mathbf{D}_i \right ) \in \mathbf{D}_j\left ( r(D_i), \mathbf{D}_i \right )$.

The third challenge is to characterize this recursive system in which actors participate in response to both online participation and offline participation, and the online participation is potentially censored.

And finally, we suspect that governments choose $r(\cdot )$ based on how they observe people to be responding to others’ participation. When participation is not threatening to the government, perhaps there is no censoring, but this may change as social media has effects on actual politics (think Turkey and Twitter). So $r(\cdot )$ is itself a function of $\mathbf{D}_{j}$, generating another kind of recursion in this process: $D_j\left ( r_{\mathbf{D}_j}(D_i), \mathbf{D}_i \right ) \in \mathbf{D}_j\left ( r_{\mathbf{D}_j}(D_i), \mathbf{D}_i \right )$.

The fourth challenge is to characterize how governments deploy censorship in response to the anticipated consequences of social media as a form of political participation. Note in the final expression above that offline participation may change as a response to online censorship.
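The feedback between censorship and participation can be sketched as a toy recursion. This is an illustration under strong assumptions, not a model anyone has estimated: the censoring rate `r` is a single number, followers' participation is a fixed fraction (`response`) of what they can see plus a constant amount of offline activity (`offline`), and the government nudges `r` up or down by `step` depending on whether that participation crosses a hypothetical `threshold`. Holding `offline` fixed ignores the possibility that offline participation itself shifts in response to censorship.

```python
def simulate_feedback(rounds=10, posts=100.0, offline=10.0,
                      response=0.5, threshold=30.0, step=0.2):
    """Toy recursion in which the censoring rate reacts to participation.

    Returns a list of (round, r, visible_posts, follower_participation).
    All parameter names and values are hypothetical.
    """
    r = 0.0  # start from the "no censoring" case
    history = []
    for t in range(rounds):
        visible = posts * (1.0 - r)                # r(D_i): what survives censoring
        reaction = response * (visible + offline)  # D_j responds to online and offline activity
        history.append((t, r, visible, reaction))
        # The government tightens censorship when participation looks
        # threatening and relaxes it otherwise (an assumed decision rule).
        if reaction > threshold:
            r = min(1.0, r + step)
        else:
            r = max(0.0, r - step)
    return history


for t, r, visible, reaction in simulate_feedback():
    print(f"round {t}: r={r:.1f} visible={visible:.1f} participation={reaction:.1f}")
```

Even in this stripped-down version, the observed data at any point depend on the censoring rate, which in turn depends on earlier participation, so a snapshot of surviving posts cannot be read as a simple sample of $D_{i}$.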

Put together this way, I find it much easier to think about the opportunities and challenges from using social media. It’s also useful to see important new contributions in terms of how they make sense of each of these challenges. King, Pan, and Roberts (2013), for example, address the fourth challenge particularly well.