This summer OpenAI released Code Interpreter, a plug-in for the popular ChatGPT tool that allows it to take in datasets, write and run Python code, and “create charts, edit files, perform math, etc.” It aims to be nothing short of the ideal statistical collaborator or research software engineer, providing the necessary skill and speed to overcome the limitations of one’s research program at a fraction of the price.

It is a bad omen, then, that while statisticians are known for pestering researchers with difficult but important questions like “What are we even trying to learn?”, Code Interpreter responds to even half-baked requests with a cheerful “Sure, I’d be happy to.” There are risks to working with a collaborator that has both extraordinary efficiency and an unmatched desire to please.


Let me state up front that AI’s benefits for science could be immense, with potentially transformative implications for the life and physical sciences. The democratization of data analysis represented by tools like Code Interpreter may be no exception. Giving all scholars access to advanced methods will open the doors to innovative research that would otherwise have been filed away as unachievable. Yet just as AI has the potential to accelerate the exceptional qualities of science, it is at risk of accelerating science’s many flaws.

One risk is widely — and rightfully — discussed: inaccuracy. The improvement of AI-driven software development will dramatically increase the speed and complexity of scientific programming, decrease the training required to write advanced code, and provide a veneer of authority to the output. Given that very few parties are incentivized to spend time and money on careful code review, it will be almost impossible to assess the accuracy of plausible-looking code that runs cleanly.

But there is another problem that is getting far less attention: Rapid, complex, and technically sound data analysis is insufficient — and sometimes antithetical — to the generation of real knowledge.


Indeed, for many of the commonly cited issues on the spectrum of scientific misconduct — think HARKing, p-hacking, and publication bias — the primary sources of friction for unaware (or unscrupulous) researchers are human constraints on speed and technical capacity. In the context of a discipline still grappling with those practices, AI tools that become more efficient at running complex scientific studies, more effective at writing them up, and more skilled at responding to users’ requests and feedback are liable to pollute the literature with compelling and elegantly presented false-positive results.
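The arithmetic behind this concern can be made concrete with a minimal sketch. It assumes that every analysis is an independent test of an effect that does not actually exist, and it uses the conventional significance threshold of 0.05. The function name is hypothetical and chosen only for illustration. Real analyses of a single dataset are rarely independent, so the numbers should be read as a rough guide rather than a precise estimate.

```python
def familywise_false_positive_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `num_tests` independent tests of a true
    null hypothesis comes out 'significant' at threshold `alpha`.

    Assumption: the tests are independent, which real analyses of one
    dataset usually are not.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


# Compare a researcher who runs one analysis with one who can cheaply run many.
for num_tests in (1, 5, 20, 100):
    rate = familywise_false_positive_rate(num_tests)
    print(f"{num_tests:>3} analyses: {rate:.1%} chance of at least one false positive")
```

Under these assumptions, the chance of at least one spurious "finding" climbs quickly as the number of analyses grows, which is why a tool that removes the time and skill needed to run each additional analysis also removes a brake on false positives.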

This misaligned optimization scheme is familiar to human scientists already; the Center for Open Science, to take one example, has spent a decade re-training scientific fields that overoptimized for productivity and prestige to begin rewarding rigor and reproducibility. These efforts have revealed scientific models that reorient the incentives placed on researchers. For example, registered reports — a form of scientific publication where manuscripts are submitted, reviewed, and accepted based solely on the proposed question and empirical approach — may provide a setting in which AI tools could advance biomedical knowledge rather than muddy it.

Yet new publication models cannot break academic science’s fixation on quantity. Without broader shifts in norms and incentives, the speed offered by AI tools could threaten the potential value of some of biomedical science’s most promising new offerings, like preprints and large public datasets. In this context, we may be forced to rely more and more heavily on meta-analyses (perhaps conducted by AI) but with less and less ability to factor expert judgment and methodological credibility into their evaluation.

Instead, it will be important to rethink how we reward the production of science. “Quality over quantity” is so trite as to be meaningless, but the continued importance of metrics like publication count and H-index (and even the persistent demand for paper mills) demonstrates that we have yet to fully embody its spirit.

There are clues to a potential path forward. Some existing norms and policies encourage researchers to focus on a few high-quality research outputs at key career stages. For example, in economics, faculty candidacy relies largely on a single “job market paper,” and in biomedicine, the Howard Hughes Medical Institute requests that scientists highlight five key articles on applications. More recently, the National Institute of Neurological Disorders and Stroke rolled out a series of rigor-focused grants aiming to support education and implementation. Such efforts can collectively shift incentive structures toward slow, rigorous, and ambitious work; more funding and gatekeeping bodies should consider moving in this direction.

In addition to aligning human systems, efforts to optimize AI-driven research tools towards both technical capacity and knowledge generation will be vital. Ongoing AI alignment programs focused on safety may offer clues for building responsible virtual collaborators. For example, efforts to improve the transparency and reproducibility of AI output could produce relevant insights. Yet the case of AI in scientific research is unique enough — for instance, an identical response to an identical prompt could be either valid or invalid depending on past conversations that are unavailable to the AI — that it likely requires its own set of solutions.

To meet this moment, we will need to build and support research programs aiming to understand and improve both the tools and researchers’ relationship to them. There are countless fields of study that touch these topics, including, though certainly not limited to, human-computer interaction, sociology of science, AI safety, AI ethics, and metascience. Collaboration and conversation across these domains would provide strong insight into the most fruitful paths forward.

Scientific institutions can and should help facilitate this work. In addition to supporting efforts to unlock AI’s potential benefits for scientific productivity, government and philanthropic funders should invest in research focused on understanding how AI can be steered towards effective generation of reliable and trustworthy knowledge; as argued above, these goals can often be at odds in the context of human social systems.

A good example of this kind of institutional support is the National Institute of Standards and Technology’s Trustworthy and Responsible AI Resource Center, which will soon pilot several initiatives aimed at providing a space for researchers to elicit and study AI’s real-world effects on users in a controlled environment. NSF’s Artificial Intelligence Research Institutes — some with a focus on human-AI collaboration — represent another promising approach.

In general, it is understandable that the conversation around AI-driven research tools is an optimistic one. An eager, technically skilled, and highly efficient collaborator is a dream for any scientist. Matching one to every scientist could be a dream for society. But to reach that goal, we need to remember that, sometimes, a perfect collaborator is too good to be true.

Jordan Dworkin is the program lead for metascience at the Federation of American Scientists, a nonprofit, nonpartisan policy research organization working to develop and implement innovative ideas in science and technology.
