Humanity’s Last Exam is the New Multimodal AI Benchmark

  • Published on January 24, 2025
  • In AI News

The last benchmark before AGI?


The AI field welcomes a new benchmark: Humanity’s Last Exam (HLE), introduced by the Center for AI Safety (CAIS) and Scale AI to test AI systems on expert-level knowledge. The dataset includes 3,000 questions crowdsourced from 1,000 contributors across 500 institutions in 50 countries, including professors and PhD holders. It covers mathematics, the humanities, and the natural sciences using a multi-format approach that includes text, diagrams, and images.
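
For those who want to examine the questions directly, the dataset is distributed through the Hugging Face Hub. The sketch below loads and inspects it with the `datasets` library; the dataset ID `cais/hle`, the split name, and the field names are assumptions and should be checked against the official release page.

```python
# A minimal sketch of pulling HLE for inspection, not an official recipe.
# Assumptions: the dataset is published on the Hugging Face Hub as
# "cais/hle" with a "test" split and fields such as "question" and
# "answer" -- verify the exact ID, split names, and schema against the
# official release before relying on them.
from datasets import load_dataset

hle = load_dataset("cais/hle", split="test")
print(f"{len(hle)} questions")        # the release reports ~3,000

sample = hle[0]                       # a single question record (a dict)
print(sample["question"][:300])       # field name is an assumption
print("Answer:", sample.get("answer", "<see dataset schema>"))
```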

The benchmark tested models like GPT-4o, Claude 3.5, and DeepSeek, and none scored above 10%, revealing their struggle with complex, interdisciplinary problems. Notably, DeepSeek R1, a cheaper and less powerful open-source model, outperformed the full o1 model, which is known for its reasoning abilities.
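
For a sense of how headline numbers like these are computed: a benchmark score is, at its simplest, the fraction of questions a model answers correctly. The toy sketch below illustrates that arithmetic only; it is not the official HLE grader, which handles multiple answer formats.

```python
# Toy accuracy calculation: not the official HLE grader, just an
# illustration of how a headline score like "under 10%" is derived
# from per-question correctness.
def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

# Hypothetical model outputs vs. reference answers:
print(accuracy(["Paris", "42", "mitosis"], ["Paris", "43", "meiosis"]))  # ~0.33
```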


HLE was created to address “benchmark saturation,” where AI models excel on standard tests but fail on novel challenges. 

“I wrote 5 questions in the new benchmark that even the top AI models score less than 10% on: Humanity’s Last Exam,” said Jeremy Nguyen on X. 

The project involved contributors from diverse academic and research backgrounds. Summer Yue, Scale AI’s Director of Research, said the benchmark was designed to push AI models to their reasoning limits.

Benchmarks in the AGI era

“Starting to see new well-built hard benchmarks in AI since almost everything else has already been exceeded. We now have this (with humanities questions), ARC-AGI 2, and Frontier Math. We also need some benchmarks for new knowledge creation rather than testing known problems,” wrote Wharton’s Ethan Mollick on X.

Last week, there were concerns about OpenAI’s involvement with FrontierMath. For context, in December, OpenAI announced its o3 models, reporting 25% accuracy on Epoch AI’s FrontierMath benchmark, a significant improvement over the roughly 2% achieved by earlier models.

Epoch AI recently clarified that OpenAI commissioned them to create 300 math questions for the FrontierMath benchmark. OpenAI owns these questions and has access to their statements and solutions, except for a 50-question private holdout set. 

The statement also noted that Epoch AI can evaluate and publish results on any model using the FrontierMath problem set but cannot share the questions or answers without OpenAI’s written permission.

“We can evaluate other models and have done so already. We will publish more results in the next few weeks, perhaps including DeepSeek’s,” clarified Epoch’s Tamay Besiroglu to AIM, addressing FrontierMath’s approach to evaluating models from other companies.

Regarding the holdout set, Epoch AI explained they are finalising a 50-question set for which OpenAI will only receive the problem statements, not the solutions.

AI evaluations largely remain underfunded, and tougher benchmarks are essential as we progress towards AGI. “Going forward, we will ensure all contributors have access to information about industry funding and data access agreements before participating and proactively publicly disclose benchmark sponsorship and data access agreements,” read Epoch’s statement.


Aditi Suresh

I hold a degree in political science, and am interested in how AI and online culture intersect. I can be reached at aditi.suresh@analyticsindiamag.com
