A large-scale test of 16 AI models on real-world documents revealed surprising results: expensive solutions don't always outperform their more affordable counterparts.
AI: Events
How to Measure Our Proximity to True AI: Google DeepMind Proposes a New Framework
Research
Google DeepMind has introduced a cognitive framework for assessing progress toward artificial general intelligence (AGI) and launched a Kaggle hackathon to develop relevant benchmarks.
AMD explains why comparing AI accelerators using a single performance metric is misleading and advocates for a multi-dimensional evaluation approach.
AI: Events
M4-RAG: When AI Seeks Answers in Images, Not Just Text, and Across Multiple Languages
Research
Researchers have introduced M4-RAG, a large-scale benchmark for evaluating systems that answer questions about images by drawing on external knowledge and operating in multiple languages.
Sber researchers have launched an open-source platform for objectively assessing how accurately AI models can predict chains of events over long time horizons.
Lab
A Voice at the Appointment: Why AI Can't Make Out the Doctor
Electrical Engineering and Systems Science
Researchers tested whether AI systems can comprehend real-world medical conversations – and the results delivered a harsh verdict for the entire industry.
Stanford researchers tested leading AI models on their ability to navigate space and found surprisingly poor results.
LightOn has released EDiTh, an open-source benchmark for testing corporate search on realistic documents without the risk of leaking confidential data.
AI: Events
OpenHands Index: How Developers Are Improving the Evaluation of AI Coding Agents
Research
The OpenHands team explains how their benchmark for evaluating AI agents works and why conventional metrics don't always reflect the true picture.