Study Shows AI Can Pass The Turing Test More Reliably Than Humans
One of the inevitable side effects of the current AI boom has been a certain level of mistrust when it comes to communicating with machine intelligences. It's certainly not unfounded paranoia, either; AI has demonstrated on many occasions that it's prone to hallucinations and inaccuracies, that it can be easily tricked, and even that it may perform better if you're mean to it, according to Live Science. Now, a new paper has demonstrated that it may also be able to effectively impersonate human beings.
A preprint study (via arXiv) from researchers at Stony Brook University and UC San Diego has shown that, under certain conditions, AI can outperform humans in the Turing test, a classic (if somewhat antiquated) measure of a computer system's ability to mimic human behavior. However, it's not time to panic about a replicant/Cylon takeover just yet: these results were produced under a very specific set of circumstances and have limited applicability to real-world scenarios.
Methodology and results
Proposed by Alan Turing in 1950, the Turing test is nominally a way to evaluate how good a machine is at pretending to be human. In the classic three-party test, which the study replicated, a human judge is asked to converse with a computer and another human via text message. The judge then attempts to determine which of the participants is human and which is the machine.
The authors of the study ran two randomized, controlled, preregistered Turing-test experiments ("preregistered" meaning the parameters were set before the test was conducted, so that they couldn't be altered afterwards to fit the results). Each session lasted five minutes and involved a human judge talking with both an LLM and a human witness (a UCSD undergrad or a volunteer), then selecting which of the two was human. Over the course of the study, hundreds of human participants took part, and four AI systems were tested.
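The headline result comes from OpenAI's GPT-4.5 model, which tricked the judges into thinking it was the human 73% of the time. Because the judge picks between two participants, random guessing would put that figure at 50%. LLaMa-3.1-405B performed just above that chance level at 56%, while the other two systems were much less successful, with ELIZA (a rule-based chatbot dating back to the 1960s) at 23% and GPT-4o at 21%.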
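To get a feel for what "just above chance" means here, the percentages can be run through a simple exact binomial test against the 50% baseline. This is only a rough sketch: the article does not say how many conversations each system took part in, so the `TRIALS` value below is a hypothetical placeholder, and the function name `binomial_two_sided_p` is made up for this illustration. The study's own statistical analysis may well differ.

```python
from math import comb


def binomial_two_sided_p(wins: int, trials: int, chance: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely,
    under the `chance` baseline, than the observed number of wins.
    """
    pmf = [
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[wins]
    # Small relative tolerance so ties are not lost to floating-point rounding.
    total = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, total)


# HYPOTHETICAL sample size: the article does not report trials per system.
TRIALS = 100

# Win rates as reported in the article.
reported_rates = {
    "GPT-4.5": 0.73,
    "LLaMa-3.1-405B": 0.56,
    "ELIZA": 0.23,
    "GPT-4o": 0.21,
}

for name, rate in reported_rates.items():
    wins = round(rate * TRIALS)
    p_value = binomial_two_sided_p(wins, TRIALS)
    print(f"{name}: {wins}/{TRIALS} judged human, p = {p_value:.4g}")
```

The point of the exercise is that how far a rate sits from 50% only means something once the number of trials is known: the same 56% can be indistinguishable from guessing with a small sample and clearly different from it with a very large one.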
Some important caveats
Before you join Gen Z in its hatred of AI and rush your local data center with a hammer in hand, there are some important caveats to keep in mind. First, the Turing test isn't the most scientific of measures. It's extremely subjective, relying both on the judgment of the human evaluator and on the responses of the other human participant. It also reports only a single, binary metric: whether or not a machine was able to fool the human judge.
Also of note is that the best results (the 73% for GPT-4.5 and 56% for LLaMa-3.1-405B) came when the models were prompted to adopt a young, introverted, chronically-online persona. That suggests that a model's ability to pass may rely on being steered toward a specific conversational style rather than on general intelligence or innate adaptability. The takeaway is a narrow one: in short text conversations, under a specific experimental design, it's possible for an LLM to be mistaken for a person more often than the person it's paired against.
That said, as the authors point out, it does raise concerns about the growing risks around deception, trust, and the limits of human detection in casual conversation with AI (which has also been shown to affect your brain). It's also worth mentioning that an AI had already passed the Turing test as far back as early last year, and models have only continued to become more sophisticated since.