One of the biggest questions in the field of artificial intelligence sounds simple, yet remains difficult to answer: how do we know we are getting closer to true AI – the kind that can think as flexibly as a human? Google DeepMind has decided to tackle this question systematically by introducing its own conceptual framework for measuring progress toward artificial general intelligence (AGI).
AGI Isn't Just a Smart Program
Before we discuss measurements, it's worth clarifying what AGI is. In short, it's a hypothetical AI capable of solving any intellectual task as well as – or better than – a human. Not just playing chess or writing text, but truly any task, including those it has never encountered before.
Today's systems, even the most powerful language models, can do a lot – but they operate within fairly rigid frameworks. They excel at the tasks they were trained on but often falter where a human would easily adapt. The gap between “smart AI” and “true general intelligence” is therefore still huge, which is precisely why the question of how to measure progress across that gap is becoming more relevant.
Measuring Something That Doesn't Yet Exist Is No Easy Task
The problem is that we still don't have a universally accepted way to assess how close any given system is to AGI. Existing tests and benchmarks – that is, sets of tasks used to compare models – typically check for something specific: how well a model translates text, solves math problems, or writes code. But none of them provide a holistic picture.
This is where DeepMind is taking a step forward. The company is proposing a cognitive framework – a set of principles and categories that describe intelligence not by narrow skills, but by more fundamental cognitive abilities. Simply put, they want to measure not “what the model can do,” but “how it thinks and how flexibly.”
What Exactly Is DeepMind Proposing?
The approach is based on the idea that intelligence can be broken down into several key cognitive dimensions. This isn't just a list of skills – it's an attempt to describe the very structure of thought. Under the proposed system, the evaluation looks not only at whether the AI completed the task, but also how: did it use generalization, abstraction, reasoning, learning by analogy, and so on.
This approach allows progress to be tracked not as “leaps” from one high-profile result to another, but as a gradual movement across multiple dimensions simultaneously. This is closer to how scientists assess the development of intelligence in humans or animals – through a set of cognitive abilities rather than a single test.
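To make the idea of tracking progress along several dimensions at once more concrete, here is a minimal Python sketch. It is not DeepMind's framework or scoring method: the dimension names are simply borrowed from the abilities mentioned above, and the `CognitiveProfile` class, the `progress` function, and all scores are hypothetical values invented for illustration.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical dimension names, borrowed from the abilities the article lists.
DIMENSIONS = ("generalization", "abstraction", "reasoning", "analogy")


@dataclass(frozen=True)
class CognitiveProfile:
    """Scores in [0, 1] for one system, one score per cognitive dimension."""
    scores: dict[str, float]

    def overall(self) -> float:
        # A single aggregate number, shown only to contrast with the profile.
        return mean(self.scores[d] for d in DIMENSIONS)

    def weakest(self) -> str:
        # The dimension with the lowest score.
        return min(DIMENSIONS, key=lambda d: self.scores[d])


def progress(old: CognitiveProfile, new: CognitiveProfile) -> dict[str, float]:
    """Per-dimension change between two versions of a system."""
    return {d: round(new.scores[d] - old.scores[d], 3) for d in DIMENSIONS}


# Invented example scores for two versions of the same hypothetical system.
v1 = CognitiveProfile({"generalization": 0.40, "abstraction": 0.55,
                       "reasoning": 0.70, "analogy": 0.35})
v2 = CognitiveProfile({"generalization": 0.45, "abstraction": 0.60,
                       "reasoning": 0.90, "analogy": 0.35})

if __name__ == "__main__":
    print(progress(v1, v2))
    print("weakest dimension in v2:", v2.weakest())
```

In this toy example the aggregate score rises, yet the per-dimension view shows that almost all of the gain comes from reasoning while analogy does not move at all. That is the kind of detail a single headline number would hide.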
A Hackathon to Test the Theory in Practice
Along with the publication of the framework, DeepMind has launched a hackathon on the Kaggle platform. It's a competition for developers and researchers, where participants are asked to create specific evaluation tasks – benchmarks that align with the logic of the proposed conceptual system.
This is an interesting move. Instead of coming up with all the necessary tests on its own, DeepMind is effectively opening up the task to the wider community. A hackathon is a way to quickly gather a large number of ideas, select the best ones, and turn them into functional evaluation tools. In essence, the company is saying, “Here's the concept – help us fill it with concrete measurements.”
Kaggle is a competition platform popular among machine learning specialists. Its registered user base numbers in the millions of developers and researchers worldwide, so the initiative's reach is considerable.
Why This Matters to Everyone, Not Just DeepMind
At first glance, this might seem like an internal project of a major tech company. But in reality, the issue of AI evaluation standards affects everyone who works with or depends on these systems.
Without common criteria for progress, it's hard to compare different systems, hard to distinguish real achievements from marketing hype, and extremely difficult to explain to the public what is actually happening. Right now, each lab largely evaluates itself using the benchmarks where its own models perform best. This is not an ideal situation.
If DeepMind succeeds in proposing a sufficiently convincing framework – and getting the broader community involved in its development – it could be a step toward fairer and more comparable evaluations across the entire industry.
What Still Remains an Open Question
Of course, such initiatives are rarely accepted unanimously. The very concept of AGI remains contested: researchers mean different things by it, and there is still no single definition. Any framework for “measuring” it therefore rests on specific assumptions, which can be challenged.
Furthermore, there is a risk that the new tests will ultimately prove to be just as narrow as the previous ones – just more elegantly packaged. The history of AI benchmarks is full of examples where models quickly “saturated” a test without demonstrating any real generalized intelligence.
But the very fact that one of the world's leading AI labs has decided to approach the issue systematically and openly is already significant. We'll see what comes out of the hackathon and how other industry players react to the proposed framework.