I’ve been exploring AI tools of various forms for more than a year now, mostly from the critical perspective of identifying things that they cannot do (such as draw maps), but also so that I can understand how they work. Just recently, inspired by Andrew Hall, Scott Cunningham, and Andrew Little, I’ve started experimenting with Claude Code. And frankly, Claude Code seems like a game-changer for social scientists.
I am now convinced that generative AI does represent a seismic shift for the practice of social science research, but perhaps not in the way many initially expected. For years, AI evangelists and corporate boosters have proclaimed that generative AI displays reasoning and intelligence, but after all this time, it is still not usable for serious research that involves generating novel prose.* But Claude Code really shines in a completely different domain: developing and executing code.
The Citation Hallucination Problem
Ask an LLM to provide citations supporting a theoretical argument in political science, and it will produce a mix of real and fabricated sources. The “hallucinated” citations often look entirely plausible, complete with realistic author names, journal titles, and publication years. But they do not exist. Or, if they do exist, they are scrambled up and inapposite for the claim being referenced.
This problem is well understood, but it’s worth stating clearly: when LLMs generate text about research, they are producing correct-sounding text based on their training data, not retrieving information from a database or matching a search string to a document. The result is prose that can sound compelling but is unreliable.
For social scientists accustomed to standard citation practices and evidence-based arguments, this creates an obvious problem. We can’t simply ask an LLM “What does the literature say about voter turnout?” and trust the response without extensive verification. I know for sure that people are using LLMs right now to generate literature reviews and reference lists. Trust me, I have played around quite a bit with these applications, using the latest and most powerful AI tools with enterprise subscriptions to see how they fare on real research tasks of the sort I encounter daily. The hallucination problem is real.**
Creating and Executing Code
But on the other hand… Upload a CSV file containing survey data to Claude Code and ask it to run a regression analysis and plot the quantities of interest, and it works. Not only does it work, it writes and executes code better and faster than I can on my own, producing reproducible, well-documented scripts to clean and analyze data.
I believe that the difference boils down to the distinction between task completion and information retrieval. When Claude Code writes and executes an analysis script, it is not trying to predict what sounds plausible based on its training data. Instead, it is applying logical rules and programming syntax to manipulate data according to what it understands from your instructions. Programming rules and syntax are easy to learn because they are consistent across its training data. The code it generates follows basic statistical principles, and it can be verified by executing the script. Claude Code is also remarkably good at automating away tasks like getting a Stata dataset into R, reading a codebook and cleaning data accordingly, and generating a report on what it has done.
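To give a concrete flavor, here is a minimal sketch of the kind of import-and-clean script I mean, assuming the haven package on the R side; the file name (survey.dta) and the variables (region, income) are invented placeholders, not anything from a real project:

```r
# A sketch of the kind of import-and-clean script Claude Code writes.
# "survey.dta" and all variable names are hypothetical placeholders.
library(haven)  # reads Stata .dta files into R
library(dplyr)

raw <- read_dta("survey.dta")

clean <- raw |>
  mutate(
    # convert Stata value labels into R factors
    region = as_factor(region),
    # recode codebook-style missing codes (e.g., 98/99) to NA
    income = if_else(income %in% c(98, 99), NA_real_, as.numeric(income))
  ) |>
  filter(!is.na(income))

write.csv(clean, "survey_clean.csv", row.names = FALSE)
```

Nothing here is hard, but it is exactly the kind of boilerplate that used to require remembering which package handles .dta files and how Stata encodes missing values.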
Ask Claude Code to examine the relationship between education levels and political participation in your dataset, for example, and it will:
- Load and examine your data structure
- Clean the data as needed
- Run regression models with the correct syntax, using whatever specification you give it (and if you don’t specify one, it will guess, usually correctly)
- Create appropriate visualizations
- Save publication-ready tables and figures (a rough sketch of such a script appears just below)
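For concreteness, here is roughly what that generated script might look like. This is a minimal sketch under assumptions, not a recommendation: every file and variable name (survey_clean.csv, participation, edu_years, age) is invented for illustration.

```r
# A condensed sketch of the workflow above; all file and variable
# names are hypothetical.
library(ggplot2)

dat <- read.csv("survey_clean.csv")
dat <- subset(dat, !is.na(participation) & !is.na(edu_years))

# Simple linear model: participation as a function of education,
# adjusting for age
m <- lm(participation ~ edu_years + age, data = dat)
summary(m)

# Predicted participation across the observed range of education,
# holding age at its mean
newdat <- data.frame(edu_years = seq(min(dat$edu_years), max(dat$edu_years)),
                     age = mean(dat$age, na.rm = TRUE))
newdat <- cbind(newdat, predict(m, newdata = newdat, interval = "confidence"))

p <- ggplot(newdat, aes(edu_years, fit)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2) +
  geom_line() +
  labs(x = "Years of education", y = "Predicted participation")

# Save the figure and a coefficient table
ggsave("participation_by_education.png", p, width = 6, height = 4)
write.csv(coef(summary(m)), "model_coefficients.csv")
```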
The distinction between knowledge retrieval and code execution is the distinction between LLMs and “agentic AI”—AI systems that can perform tasks and execute workflows rather than simply answering questions. I predict that this will lead to a major shift in how social scientists approach empirical research.
Consider a typical research workflow in the quantitative social sciences: you have a dataset or some batch of data, a research question, and an analysis plan. In 2022, implementing this analysis required some familiarity with statistical software, either a scripting language or a GUI. You had to know how to read your data into the statistical package and how to clean it for use. You also had to know how to represent your analysis plan in code. Claude Code makes all of this unnecessary. You don’t have to know R at all, or which package runs the ordered logit and how to specify that model; you just need R installed on your computer and an internet connection.
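To illustrate what you no longer have to memorize: in R, one standard answer to the ordered logit question is polr() from the MASS package. A minimal sketch, again with invented variable names (support_level, edu_years, age):

```r
# Ordered logit via MASS::polr; file and variable names are hypothetical.
library(MASS)

dat <- read.csv("survey_clean.csv")

# The outcome must be an ordered factor
dat$support_level <- factor(dat$support_level,
                            levels = c("low", "medium", "high"),
                            ordered = TRUE)

ologit <- polr(support_level ~ edu_years + age, data = dat, Hess = TRUE)
summary(ologit)  # Hess = TRUE makes standard errors available
```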
Want to test for heterogeneous treatment effects across demographic subgroups? Claude Code can write that code. Need to visualize a trend over time? Easy: you don’t need to remember how your software handles dates; Claude Code will figure it out. Require multiple robustness checks with different model specifications? Claude Code can generate dozens of variations in minutes and organize the results.
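Two quick sketches of what I mean, once more with invented variable names (outcome, treatment, gender): an interaction term for subgroup effects, and a loop over alternative specifications.

```r
# Heterogeneous treatment effects: let the effect vary by gender.
# All variable names here are hypothetical.
het <- lm(outcome ~ treatment * gender + age, data = dat)
summary(het)

# Robustness checks: re-run the model under several control sets
specs <- list(
  baseline = outcome ~ treatment,
  controls = outcome ~ treatment + age + gender,
  with_edu = outcome ~ treatment + age + gender + edu_years
)
models <- lapply(specs, lm, data = dat)

# Pull the treatment coefficient from each specification
sapply(models, function(m) coef(m)["treatment"])
```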
Interpretation is the Frontier
The frontier between information retrieval and code execution is interpretation. In my experience, Claude Code can infer your intentions and execute your tasks, but if allowed or invited to interpret the results, it goes astray. In several test cases, Claude Code seemed to be guessing what I wanted to hear, interpreting basic regression results incorrectly but provocatively. I conjecture that this is because it is hard for an LLM to learn the rules about how to interpret statistical output, even output which it generated itself, because of the poverty of the stimulus.
This is dangerous, because Claude Code and other agentic AI tools will interpret your findings, sometimes even if you do not ask them to. And although this is just a hunch, I think these tools’ interpretations may be a function of how you pose your prompts. These tools are just ripe for p-hacking of the worst form: the computer might generate the result that it “thinks” you want to hear, and if you’ve turned execution over to the computer, you as the author might not even know what it has done and what it has “chosen” not to do.
So What?
What does this all imply for social science practice? The state of the art is evolving extremely quickly, and I hadn’t even heard of Claude Code until last month. But I suspect that I will use Claude Code (or some other tool) extensively to accelerate data cleaning, visualization, and related tasks. I am very skeptical of agentic AI’s knowledge claims, reasoning, and ability to interpret what it produces.
The shorthand is, use agentic AI for tasks that involve following rules. Do not use agentic AI for tasks that generate answers, arguments, or interpretations.
My greater worries are about how the field responds. Yes, we will be faster at doing computer and data work, but there are basically no guardrails for how else these tools can be used. Let me be clear: with a dataset and a codebook and nothing else, I can have a research article draft in ten minutes, one that is probably executed correctly as a statistical matter but likely interpreted incorrectly as a substantive matter.
People will respond to this dramatic decrease in the cost of producing research articles. At the Journal of East Asian Studies, I have already seen at least a half-dozen papers written by generative AI. There will be many more. The ones I have seen have basic problems of interpretation that I can identify, and there is a characteristic “tell” for what a Claude Code paper looks like.***
The big-picture takeaway, though, is that as the cost of doing computer work continues to decline, the relative value of being able to read, interpret, and understand goes up. For social scientists, the relative value of knowledge and expertise is higher than it was five years ago, and the value of being able to code has plummeted.
NOTES
* It can try, and it can make things that look like high-quality novel prose. It will fool people, especially those eager to be fooled. That is the nub of the issue.
** An example: Reid, Anthony. 2019. “A Plural World of Knowledge: Is Southeast Asian Studies Losing its Autonomy?” Kyoto Review of Southeast Asia 25.
*** I won’t write it down here, because then it will be fixed.

