Author: tompepinsky

  • The Social Media Data Generating Process

    I had the opportunity to serve as a discussant for a panel on big data in Asian studies (abstracts in PDF) at the most recent meeting of the Association for Asian Studies. As it turns out, social media is one of the central sources of data for this rapidly developing new field (the other is machine coding of text). As I was thinking about how to discuss the various opportunities and challenges that come from using social media to study politics—and everyone involved was a political scientist of some sort—I found myself trying to sketch out a data generating process for social media data. I came up with four basic challenges that follow as a result of how I conceptualize this DGP.

    The basic idea is that social media is a form of political participation. When you think about it that way, though, it’s immediately clear that it is only one of many forms of participation. We might share a link on Facebook, or retweet a hashtagged tweet, and that generates social media data. But we might also vote, or protest, or send a letter to a representative. These are also forms of participation, but they are not observed in the social media data. There are lots of forms of political participation; denote the collection of them as \mathbf{D}, and individual i's collection as \mathbf{D}_{i}. Each social media datum from individual i is then D_{i} \in \mathbf{D}_{i}.

    The first challenge, then, is characterizing the relationship between D_{i} and \mathbf{D}_{i}.

    There is then a technology that censors some fraction of D_{i}. As a result of the censoring rule r(\cdot ), we actually observe r(D_{i}). In some cases the censoring rule is “no censoring,” so that r(D_{i}) = D_{i}. But in the interesting cases which animate many recent studies of social media and political participation in illiberal states, r(\cdot ) effectively deletes some fraction of the social media data. We especially worry about the possibility that the probability that D_{i} is observed in r(D_{i}) is correlated with the content of D_{i}.

    So the second challenge is characterizing r(\cdot ), and then knowing what to do about it.
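
    To see why content-dependent censoring matters, here is a minimal sketch in Python. It is not drawn from any of the papers on the panel. The function name, the two keep probabilities, and the share of "critical" posts are all hypothetical values chosen for illustration. Each post D_{i} is either critical or not, and the censoring rule r(\cdot ) keeps critical posts with a lower probability.

```python
import random


def simulate_censoring(n=100_000, share_critical=0.4,
                       keep_critical=0.3, keep_other=0.9, seed=0):
    """Compare the true share of critical posts with the share that
    survives a content-dependent censoring rule.

    All parameter values are illustrative assumptions, not estimates.
    """
    rng = random.Random(seed)
    # D_i: True means the post is critical of the government.
    posts = [rng.random() < share_critical for _ in range(n)]
    # r(D_i): a post survives with a probability that depends on its content.
    observed = [
        critical for critical in posts
        if rng.random() < (keep_critical if critical else keep_other)
    ]
    true_share = sum(posts) / len(posts)
    observed_share = sum(observed) / len(observed)
    return true_share, observed_share


true_share, observed_share = simulate_censoring()
print(f"true share critical:     {true_share:.3f}")
print(f"observed share critical: {observed_share:.3f}")
```

    Under these assumed numbers the observed share of critical posts falls well below the true share, so anything estimated from r(D_{i}) alone is biased. Setting both keep probabilities to 1 reproduces the "no censoring" case, r(D_{i}) = D_{i}.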

    Next, recall that social media is social. That means that we care about other actors j who participate too, and we suspect that they act in reaction both to social media and to offline events. So we have D_j\left ( r(D_i), \mathbf{D}_i \right ) \in \mathbf{D}_j\left ( r(D_i), \mathbf{D}_i \right ) .

    The third challenge is to characterize this recursive system in which actors participate in response to both online participation and offline participation, and the online participation is potentially censored.

    And finally, we suspect that governments choose r(\cdot ) based on how they observe people to be responding to others’ participation. When participation is not threatening to the government, perhaps there is no censoring, but this may change as social media has effects on actual politics (think Turkey and Twitter). So r(\cdot ) is itself a function of \mathbf{D}_{j}, generating another kind of recursion in this process: D_j\left ( r_{\mathbf{D}_j}(D_i), \mathbf{D}_i \right ) \in \mathbf{D}_j\left ( r_{\mathbf{D}_j}(D_i), \mathbf{D}_i \right ) .

    This fourth challenge is to characterize how governments deploy censorship in response to the anticipated consequences of social media as a form of political participation. Note in the final expression above that offline participation may change as a response to online censorship.
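
    The recursion can be made concrete with a deliberately crude toy model in Python. Everything in it is an assumption made for illustration: the function name, the threshold rule the government uses, the linear response of offline participation to visible posts, and every number. It is a sketch of the feedback loop, not a model of any actual government.

```python
def simulate_feedback(rounds=5, posts=100.0, response=0.02,
                      threshold=1.5, strict_rate=0.8):
    """Toy recursion between censorship and participation.

    Each round the government picks a censorship rate based on the
    offline participation it saw last round, and offline participation
    then responds to the posts that survive censorship.
    All functional forms and values are hypothetical.
    """
    offline = 1.0  # baseline offline participation
    history = []
    for _ in range(rounds):
        # r(.) depends on observed participation: censor only when threatened.
        rate = strict_rate if offline > threshold else 0.0
        visible = posts * (1.0 - rate)       # what survives r(.)
        offline = 1.0 + response * visible   # others react to visible posts
        history.append((rate, visible, offline))
    return history


for rate, visible, offline in simulate_feedback():
    print(f"censor rate={rate:.1f}  visible posts={visible:.0f}  "
          f"offline participation={offline:.2f}")
```

    With these made-up values the censorship rate switches on and off from round to round, because censoring lowers the participation that triggered it. The specific pattern is an artifact of the assumptions. What carries over is that neither r(\cdot ) nor participation can be treated as fixed while studying the other.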

    Put together this way, I find it much easier to think about the opportunities and challenges from using social media. It’s also useful to see important new contributions in terms of how they make sense of each of these challenges. King, Pan, and Roberts (2013), for example, address the fourth challenge particularly well.

  • Single-intervention Experiments and Causal Discovery

    Even in the easiest cases, there are generally a whole host of particular assumptions necessary to make causal inferences about anything. But normally, we think that the benefit of experiments is that they help us to isolate causal effects by manipulating one thing at a time. That is a key strength of experiments: regardless of the underlying causal structure, by manipulating one variable we can nevertheless isolate causal effects.

    Well, take a look at this:

    “in order to identify the causal structure by single-intervention experiments some additional parametric assumption beyond Markov, faithfulness and acyclicity is necessary. Alternatively, without additional assumptions, causal discovery requires a large set of very demanding experiments, each intervening on a large number of variables simultaneously.”

    Emphasis mine. This is from a fascinating short paper entitled “Experimental Indistinguishability of Causal Structures” by Frederick Eberhardt.

    I’m still working through the implications. My initial thought is that for the narrowest interpretation of experiments, where we define the thing that we care about as the ATE, and the ATE as the difference in average outcomes between treatment and control, this critique does not much matter. We have defined the object of interest as whatever happens when we intervene.
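
    A small simulation makes the distinction concrete. This is not Eberhardt's construction, only a toy example of my own. The two structures, the coefficients, and the function name are assumptions chosen so that the total effect of X on Y is the same in both. In one, X affects Y directly; in the other, X affects Y only through a mediator M.

```python
import random


def ate_by_intervention(structure, n=50_000, seed=0):
    """Estimate the ATE of X on Y by setting X to 1 and to 0.

    'direct':   Y = 1.0 * X + noise
    'mediated': M = 2.0 * X + noise, Y = 0.5 * M + noise
    Both hypothetical structures imply a total effect of 1.0.
    """
    rng = random.Random(seed)

    def outcome(x):
        noise = rng.gauss(0, 1)
        if structure == "direct":
            return 1.0 * x + noise
        if structure == "mediated":
            m = 2.0 * x + rng.gauss(0, 1)
            return 0.5 * m + noise
        raise ValueError(f"unknown structure: {structure}")

    treated = [outcome(1) for _ in range(n)]  # intervene: X = 1
    control = [outcome(0) for _ in range(n)]  # intervene: X = 0
    return sum(treated) / n - sum(control) / n


for name in ("direct", "mediated"):
    print(name, round(ate_by_intervention(name), 2))
```

    Both estimates come out close to 1.0. An experimenter who intervenes on X and records only Y learns the ATE, but cannot tell from it which of the two structures produced the data.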

    But if we really believe that the goal of experiments is to intervene in order to learn about the causal structure of the world more generally—and I have to believe that that is really the goal that justifies experimentation—then the object of interest is the relationship between “whatever happens when we intervene” and “causal structure of the world.” In that case, the implications are profound for our ability to learn about that relationship using experiments.