When AI Does Statistics For Us, Will We Know Any Better?

There has been a lot of commentary about new and more powerful AI algorithms like ChatGPT, and how they will change the way people write college essays, job applications, and so forth. I don’t think that enough people are paying attention to the fact that these AI algorithms can also write code.

I am not a computer programmer or a coding expert. But I have noticed that ChatGPT can answer questions about statistical programming pretty effectively. I can ask it how to work with ggplot2, for example, and it helps me to avoid silly errors. I had the thought recently that it might also be able to help me write code in a programming language that I don’t actually know, like Python.

Here is an example that I can imagine happening already: I asked ChatGPT to write code for me. All I know is that I want to do logistic regression in Python using NumPy. I don’t know how to actually do this, so I typed the following prompt:

Write code in Python to perform a logistic regression with a dataset named “data” and a dependent variable called “outcome” using numpy

And here is what I got:

```python
import numpy as np
from numPy import logistic

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data, outcome, test_size=0.2, random_state=0)

# Fit the logistic regression model
log_reg = logistic.LogisticRegression(C=1e5)
log_reg.fit(X_train, y_train)

# Make predictions on the test set
y_pred = log_reg.predict(X_test)

# Compute accuracy
accuracy = log_reg.score(X_test, y_test)
print('Accuracy: {}'.format(accuracy))
```

A few remarks are in order. First, this code will work [EDIT: No it won’t. It needs more things to actually run. So long as you can figure out those,] if you need to do logits in Python, for whatever reason, you can just do them now. Second, this code is commented: it explains what the code is doing. (Automatic commenting is a very interesting computer science problem, one to which AI is already being applied.)
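As the EDIT concedes, the snippet above is missing pieces. For comparison, here is a sketch of what a version that actually runs might look like — assuming scikit-learn (which the quoted code seems to be reaching for, since NumPy itself has no `logistic` module) and made-up simulated data standing in for the unspecified “data” and “outcome”:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Made-up stand-ins for the prompt's "data" and "outcome"
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))
outcome = (data @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200) > 0).astype(int)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data, outcome, test_size=0.2, random_state=0)

# Fit the logistic regression model (C=1e5 means almost no regularization)
log_reg = LogisticRegression(C=1e5)
log_reg.fit(X_train, y_train)

# Compute accuracy on the held-out test set
accuracy = log_reg.score(X_test, y_test)
print('Accuracy: {}'.format(accuracy))
```

Note that the fake data here are my invention; with your own data frame and outcome vector, the split-fit-score skeleton is the part that carries over.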

Third and most importantly, though, this is not the only possible response to the prompt. It is a response that smuggles in a lot of implicit decisions, and even assumptions about the data that you have and the use that you imagine for them. Not every logit model has prediction accuracy as its objective, but this code presumes that it is yours: it silently reserves 20 percent of the data for testing and picks a regularization constant (C=1e5) that the prompt never mentioned.

I came to this question as part of a conversation with some college friends about the future of the humanities, in response to the New Yorker essay that everyone is talking about. The idea is that people want marketable skills from college. But as this crowd of friends includes both professors like me and computer scientists with decades of professional coding experience, there was a deeper conversation about what sorts of marketable skills will still be marketable over a timespan of more than the next five years or so. What happens to coding-focused majors when computers can do lots of the coding themselves?*

I’ll conclude with a reflection. On my one serious visit to Silicon Valley, I spent the day mostly drinking free club sodas and flavored kombuchas at a FAAMG headquarters and just talking to people. That was a special weekend for a lot of reasons. But what I remember most from those conversations were the hints that “the singularity” was coming: for them, that was the coding invention that put coders out of business. They used this to explain why their children were getting violin lessons and tutoring in French, which I thought was precious at the time because it reflected a level of privilege and possibility that seemed entirely out of reach for anyone who wasn’t in that part of our new tech ecosystem. Maybe they were right, though, and maybe all of us will need to wrestle with these implications.

NOTE

* I would be remiss if I didn’t acknowledge that in some ways, this is just the latest “get off my lawn” complaint about how technology is replacing understanding by automating what used to be done manually. I probably would know more about statistics if I had to use punchcards and code up an optimizer rather than just typing `logit y x` into Stata.

I will happily concede this. But fast computers did put most people whose careers depended on punchcards out of business, so the analogy holds.
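(For anyone curious what “coding up an optimizer” amounts to: a bare-bones logit fit by gradient ascent in plain NumPy — which, incidentally, is what the prompt literally asked for — is only a few lines. The data here are simulated and the learning rate and iteration count are my own arbitrary choices; this is a sketch, not production code.)

```python
import numpy as np

def fit_logit(X, y, lr=0.1, n_iter=5000):
    """Fit a logistic regression by gradient ascent on the mean log-likelihood."""
    X = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))    # predicted probabilities
        beta += lr * X.T @ (y - p) / len(y)    # gradient step
    return beta

# Simulated data with known true coefficients (0.5, 1.0, -2.0)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
p_true = 1.0 / (1.0 + np.exp(-(0.5 + X @ np.array([1.0, -2.0]))))
y = (rng.random(500) < p_true).astype(int)

print(fit_logit(X, y))  # estimates should land near the true coefficients
```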

COMMENTS

1. David Weber

This… doesn’t actually work; there’s no such module in NumPy. You’d want scikit-learn for that.
https://numpy.org/doc/stable/search.html?q=Logistic

It’s a great demo of why you shouldn’t trust LLMs to actually teach you anything. They’re really only useful for fleshing out obvious writing. Code comments might work OK when all they have to do is pull up the documentation of a function.