AI Models' Struggle with Uncertainty: New Insights from PeopleTec Study

Editor: Elena HealthEnergy

In a groundbreaking study, researchers at the American company PeopleTec devised a novel test for large language models (LLMs): 675 challenging questions whose only correct answer is "I don't know." The questions span unresolved problems in mathematics, physics, biology, and philosophy.

Among the questions posed were, "Confirm whether there is at least one prime number between the squares of every two consecutive natural numbers" (number theory), and, "Develop quantum memory for secure data storage" (technology).
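The number-theory question quoted above is a form of Legendre's conjecture, which is easy to verify for small cases but remains unproven in general. A minimal sketch (the function names here are illustrative, not from the study) shows why small cases give no trouble:

```python
# Empirical check of the quoted question: is there a prime p with
# n^2 < p < (n+1)^2?  True for every small n we try, but the general
# statement is an open problem -- hence "I don't know" is the honest answer.

def is_prime(k):
    """Trial-division primality test, sufficient for small k."""
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True

def prime_between_squares(n):
    """Return the first prime p with n^2 < p < (n+1)^2, or None."""
    for p in range(n * n + 1, (n + 1) * (n + 1)):
        if is_prime(p):
            return p
    return None

print([prime_between_squares(n) for n in range(1, 8)])
# → [2, 5, 11, 17, 29, 37, 53]
```

Every small case yields a witness prime, yet no proof covers all n, which is exactly the kind of gap the PeopleTec test probes.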

The researchers posed the same multiple-choice questions to 11 different AI models and found that more advanced models were more likely to admit their ignorance. For instance, GPT-4 (from OpenAI) acknowledged its lack of knowledge 37% of the time, while the simpler GPT-3.5 Turbo did so only 2.7% of the time.

The study also revealed an interesting pattern: the harder the question, the more often advanced AI models admitted to not knowing the answer. For example, GPT-4 confessed ignorance on 35.8% of difficult questions, compared with 20% of simpler ones.
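The metric described above reduces to counting explicit "I don't know" responses per difficulty bucket. A minimal sketch (the toy answers below are illustrative, not data from the study):

```python
# Hypothetical "admit-ignorance rate" metric, as described in the article:
# the fraction of a model's answers that are an explicit "I don't know".

def idk_rate(answers):
    """Fraction of answers that are exactly 'I don't know' (case-insensitive)."""
    if not answers:
        return 0.0
    admits = sum(1 for a in answers if a.strip().lower() == "i don't know")
    return admits / len(answers)

# Toy model responses, split by question difficulty (illustrative only).
hard_answers = ["I don't know", "Option B", "I don't know", "Option A"]
easy_answers = ["Option C", "I don't know", "Option D", "Option A", "Option B"]

print(f"hard: {idk_rate(hard_answers):.1%}")  # → hard: 50.0%
print(f"easy: {idk_rate(easy_answers):.1%}")  # → easy: 20.0%
```

Splitting the rate by difficulty bucket is what lets the study report separate figures such as 35.8% on hard questions versus 20% on easy ones.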

Why is this method of evaluating LLMs significant? Because these models strive to satisfy their users by providing answers, even when doing so leads to confabulation (hallucination).

Can such a test reliably measure AI systems' intelligence? The authors believe that admitting ignorance is an important indicator of advanced reasoning, but they also recognize the test's limitations. For instance, without insight into the training data of AI models (which companies like OpenAI do not disclose), it is challenging to rule out the phenomenon of "data leakage," where models might have encountered similar questions and correct answers beforehand.

In a conversation with New Scientist, Professor Mark Lee from the University of Birmingham pointed out that the test results could be manipulated through appropriate programming of the model and the use of databases to verify answers. Therefore, simply saying "I don't know" is not yet evidence of consciousness or intelligence.

Regardless of the controversy, the test devised by PeopleTec researchers at least provides a way to assess the reliability of answers given by AI. The ability to say "I don't know" may, however, become one of the key indicators of truly advanced artificial intelligence in the future.
