How intelligent is artificial intelligence?
Published: May 10, 2026 at 11:50 PM
News Article

Researchers have introduced a rigorous new assessment tool known as Humanity’s Last Exam (HLE) to evaluate the true capabilities of large language models. Published in Nature by the Center for AI Safety, Scale AI, and the HLE Contributors Consortium, the benchmark consists of 2,500 questions across mathematics, humanities, and natural sciences. Unlike previous evaluations, these questions feature definitive solutions that cannot be retrieved through simple internet searches.
Previous assessments suggested that AI systems were proficient at answering difficult questions on topics such as junk DNA, alternative splicing, and epigenetics. However, further analysis revealed that these systems primarily reproduce consensus views found online rather than verifying scientific accuracy. In many instances, the models failed to recognize ongoing controversies or relied on unreliable sources, highlighting a disconnect between their output and established scientific understanding.
The HLE benchmark was specifically designed to address this issue by utilizing questions developed by global subject-matter experts. Each question is structured for automated grading yet requires knowledge at the frontier of human understanding. State-of-the-art models demonstrated low accuracy and poor calibration on this test, indicating a marked gap between current technical capabilities and expert human performance on closed-ended academic tasks.
The public release of the dataset aims to inform research and policymaking by providing a clearer understanding of model limitations. While current models exceed 90 percent accuracy on popular benchmarks, the HLE results suggest that they function more as advanced search engines than as genuinely intelligent systems capable of independent verification.