New Benchmark Reveals Limitations of AI in Historical Inquiry

Edited by: Veronika Nazarova

A team of researchers has developed a new benchmark, Hist-LLM, to evaluate how well three leading large language models (LLMs) answer historical questions: OpenAI's GPT-4, Meta's Llama, and Google's Gemini. The benchmark scores each model's answers against the Seshat Global History Databank, a comprehensive database of historical knowledge.
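To make the setup concrete, here is a minimal, illustrative sketch in Python of how a model's answers can be scored against reference records. The questions, reference answers, and the ask_model helper below are hypothetical placeholders, not the study's actual pipeline:

    # Illustrative sketch of benchmark scoring; not the Hist-LLM code.
    def ask_model(question: str) -> str:
        """Placeholder for a call to the LLM under evaluation;
        replace with a real API client."""
        return "yes"  # dummy response so the sketch runs end to end

    # Hypothetical question/reference pairs; real Hist-LLM items are
    # drawn from the Seshat Global History Databank.
    benchmark = [
        {"question": "example yes/no question 1", "reference": "yes"},
        {"question": "example yes/no question 2", "reference": "no"},
    ]

    # Accuracy = fraction of answers matching the reference.
    correct = sum(
        ask_model(item["question"]).strip().lower() == item["reference"]
        for item in benchmark
    )
    print(f"Accuracy: {correct / len(benchmark):.1%}")

With two items and a dummy model that always answers "yes", this prints 50.0%; swapping in a real client and real questions yields the kind of aggregate figure the study reports.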

The findings, presented at the AI conference NeurIPS, indicate that even the best-performing model, GPT-4 Turbo, achieved only about 46% accuracy, barely better than random guessing. Maria del Rio-Chanona, a co-author and associate professor at University College London, remarked, "LLMs, while impressive, still lack the depth of understanding required for advanced history."

Among the inaccuracies, GPT-4 Turbo stated that scale armor existed in ancient Egypt during a specific period, when in fact it appeared roughly 1,500 years later. The researchers suggest that LLMs struggle with nuanced historical questions because they rely on the most prominent historical data and extrapolate incorrectly from it.

The study also highlighted a performance gap: the OpenAI and Llama models underperformed on questions about regions such as sub-Saharan Africa, pointing to potential biases in their training data. Peter Turchin, the study's lead, emphasized that LLMs are not yet a substitute for human expertise in certain domains.

Despite these limitations, the researchers remain optimistic about the potential of LLMs to assist historians. They are refining the benchmark to include more diverse data and more complex questions, noting, "While our results highlight areas for improvement, they also underscore the potential for these models to aid in historical research."
