ChatGPT, an artificial intelligence chatbot from OpenAI, has been making significant waves in the technology landscape due to its extraordinary capabilities. This cutting-edge tool has captured the attention of technology giants and acclaimed writers, who commend it as a revolutionary development in AI.
The remarkable capabilities of ChatGPT have even led some experts to speculate that it may have achieved the long-coveted milestone of passing the Turing Test, a benchmark of whether a machine can exhibit behaviour indistinguishable from a human's. The AI model has showcased exceptional adeptness across a multitude of domains, including mathematics (89th percentile), law (90th percentile), and GRE verbal skills (99th percentile).
An interesting study conducted earlier this month by researchers at New York University's medical school lauded ChatGPT's capacity to dispense medical advice that closely mirrors that of human medical personnel. However, the reliability of ChatGPT in critical decision-making situations continues to be debated among some researchers.
Performance Inconsistencies of ChatGPT

A team consisting of Lingjiao Chen, Matei Zaharia, and James Zou from Stanford University and the University of California, Berkeley, expressed concerns similar to those of some users regarding the consistency and potential decline of ChatGPT's performance, as reported by Science X Network.
Their inquiry into the performance and behaviour of GPT-3.5 and GPT-4 revealed substantial fluctuations, with a notable decline in accuracy on particular tasks between March and June.
The researchers focused on assessing ChatGPT’s capabilities in solving mathematical problems and generating computer code. They found a dramatic drop in GPT-4’s accuracy rate for prime number problems, from 97.6% in March to a startling 2.4% by June.
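The prime-number evaluation can be pictured with a small scoring harness: pose a yes/no primality question for each number, parse the reply, and compare it against ground truth. This is an illustrative sketch, not the authors' actual code; the `ask_model` function below is a stand-in for a live GPT API call (here answered from ground truth so the harness can run), and the question wording and answer parsing are assumptions.

```python
def is_prime(n: int) -> bool:
    """Exact primality check by trial division (the ground truth)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def ask_model(question: str) -> str:
    """Stand-in for an LLM API call; a real harness would query GPT-4.
    Here we answer from ground truth purely to demonstrate the harness."""
    n = int(question.rsplit(" ", 1)[-1].rstrip("?"))
    return "Yes" if is_prime(n) else "No"

def primality_accuracy(numbers) -> float:
    """Fraction of numbers for which the model's Yes/No answer is correct."""
    correct = 0
    for n in numbers:
        reply = ask_model(f"Is this number prime: {n}?")
        predicted = reply.strip().lower().startswith("yes")
        correct += predicted == is_prime(n)
    return correct / len(numbers)

print(primality_accuracy(range(1, 101)))  # 1.0 with the ground-truth stand-in
```

Swapping `ask_model` for a real API call would reproduce the shape of the study's measurement: the dramatic March-to-June drop corresponds to this accuracy figure falling from 0.976 to 0.024.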
ChatGPT's utility in supporting programmers with coding and debugging tasks also faltered. GPT-4 produced accurate, ready-to-run scripts in more than 50% of instances in March. However, this figure fell sharply to just 10% by June. Meanwhile, GPT-3.5's performance saw a similar dip, from 22% in March to only 2% in June.
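The "ready-to-run" metric can be approximated by attempting to execute each generated snippet in an isolated namespace and counting the fraction that completes without an error. This is a minimal sketch, not the researchers' harness; the two toy "model outputs" below are invented examples, the second illustrating a snippet wrapped in a Markdown fence, which fails to run as-is.

```python
def is_directly_executable(code: str) -> bool:
    """Return True if the snippet compiles and runs without raising."""
    try:
        compiled = compile(code, "<generated>", "exec")
        exec(compiled, {"__name__": "__generated__"})  # fresh, isolated namespace
        return True
    except Exception:
        return False

def executable_fraction(snippets) -> float:
    """Share of snippets that are directly executable."""
    return sum(is_directly_executable(s) for s in snippets) / len(snippets)

# Two toy "model outputs": one runnable, one wrapped in a Markdown code fence,
# which is a syntax error when fed straight to the interpreter.
good = "print(sum(range(10)))"
fenced = "```python\nprint(sum(range(10)))\n```"
print(executable_fraction([good, fenced]))  # 0.5
```

Under a metric like this, even a correct answer counts as a failure if it arrives with extra decoration around the code, which is one plausible way an upgraded model could score far worse without becoming less capable.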
The reasons behind these variances remain unclear, but the researchers speculate that modifications and upgrades to the system could be factors. Understanding the cause behind such performance swings proves challenging due to these language models’ inherently complex and non-transparent nature.
Unsurprisingly, these inconsistencies have sparked theories, including allegations that OpenAI is experimenting with smaller large language models (LLMs) to reduce expenses. Some have even suggested that OpenAI might deliberately impair GPT-4 to encourage users to opt for GitHub's LLM-powered coding assistant, Copilot.
OpenAI has categorically denied these allegations. In a tweet, OpenAI's VP of Product, Peter Welinder, confirmed the organisation's ongoing commitment to enhancing ChatGPT, ensuring each subsequent version is superior to the last.
Still, the potential “drift” in model results continues to worry some observers, leading to calls for OpenAI to increase transparency. They suggest that revealing sources of training data, code, and other foundational aspects of GPT-4 could alleviate such concerns.