Published on March 16, 2026

New Benchmark and Metric for Long-Term AI Event Forecasting

Sber Now Able to Verify if AI Truly Can Peer Into the Future

Sber researchers have launched an open-source platform for objectively assessing how accurately AI models predict chains of events over long horizons.

Event Source: SberLabs

Predicting the next step isn't all that difficult. It is far more challenging to foresee what will happen in a week or a month. This very distinction became the starting point for a study conducted by scientists at Sber's Center for Practical Artificial Intelligence.

Limitations of Current AI in Sequential Event Prediction

The Next Step Is Not Yet a Forecast

Every day, people leave digital footprints: paying for purchases, visiting websites, or booking doctor's appointments. All of this forms sequences governed by their own logic. Modern AI systems are quite adept at guessing a single next action – for instance, that a person who bought a laptop will soon buy a mouse. However, business and medicine need something more: knowing not just what will happen but also when, and forecasting an entire chain of events rather than a single one.

Until now, researchers lacked a unified way to verify how well a given model produces such long-term forecasts. Each team measured quality in its own way, which made results nearly impossible to compare.

HoTPP Benchmark and T-mAP Metric for Model Evaluation

A Yardstick That Didn't Exist Before

To remedy this, Sber researchers developed a benchmark – a standardized set of tests – called HoTPP (Horizon Temporal Point Process). This is an open platform: any team in the world can use it to test their model under unified rules.

The platform works with data from various fields: finance, e-commerce, and medicine. This is crucial, as an effective forecasting tool should not be limited to a single narrow niche.

Along with the benchmark, the authors proposed a new metric – T-mAP (Temporal mean Average Precision). Simply put, it evaluates a forecast on two criteria at once: whether the model correctly identified the type of event and whether it accurately predicted when the event would occur. Previously, these aspects were often evaluated separately, which gave an incomplete picture.
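
To make the idea of judging type and time together more concrete, here is a minimal Python sketch. It is not the T-mAP definition from the paper: as the name suggests, the published metric involves averaging precision over event types and is more involved than this. The `Event` class, the function names, and the `max_time_error` tolerance are all hypothetical and chosen only for illustration. A predicted event counts as a hit only if an actual event of the same type lies close enough in time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    time: float  # e.g. days since the start of the forecast horizon
    label: str   # event type, e.g. "laptop" or "doctor"


def match_events(predicted, actual, max_time_error):
    """Count predictions that pair with an actual event of the same type
    whose time differs by at most max_time_error (illustrative rule).
    Matching is greedy, and each actual event is used at most once."""
    unmatched = list(actual)
    hits = 0
    for p in sorted(predicted, key=lambda e: e.time):
        candidates = [
            a for a in unmatched
            if a.label == p.label and abs(a.time - p.time) <= max_time_error
        ]
        if candidates:
            closest = min(candidates, key=lambda a: abs(a.time - p.time))
            unmatched.remove(closest)
            hits += 1
    return hits


def precision_recall(predicted, actual, max_time_error):
    """Precision and recall of a forecast under the joint type-and-time rule."""
    hits = match_events(predicted, actual, max_time_error)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


actual = [Event(1.0, "laptop"), Event(3.0, "mouse"), Event(10.0, "doctor")]
predicted = [Event(1.2, "laptop"), Event(6.0, "mouse"), Event(9.5, "doctor")]

# The "mouse" prediction has the right type but is three days late,
# so with a one-day tolerance only two of the three predictions count.
print(precision_recall(predicted, actual, max_time_error=1.0))
```

Scoring type alone would rate this forecast as perfect, while scoring time alone would ignore whether the right events were predicted. The joint check exposes the late "mouse" prediction.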

Efficiency of Simple Statistical Methods vs Complex Neural Networks

More Complex Doesn't Mean Better

One of the study's most interesting results served as a sort of warning for the entire industry. It turned out that in long-term forecasting tasks, complex neural network models sometimes perform no better than simple statistical methods. In other words, increasing the number of parameters and complicating the architecture does not solve the problem by itself.
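
For a sense of how simple a "simple statistical method" can be, here is a hypothetical baseline sketched in Python. The article does not say which statistical methods the study compared against, so this is an assumption made purely for illustration: the function `frequency_baseline` repeats the most frequent event type from a customer's history at the average gap between past events.

```python
from collections import Counter


def frequency_baseline(history, horizon):
    """Forecast events for the next `horizon` time units.

    history: list of (time, label) pairs sorted by time.
    The forecast repeats the most common past label at the mean
    inter-event gap. No learning is involved; this is an illustrative
    baseline, not a method taken from the study.
    """
    if len(history) < 2:
        return []
    times = [t for t, _ in history]
    labels = [label for _, label in history]
    mean_gap = (times[-1] - times[0]) / (len(times) - 1)
    if mean_gap <= 0:
        return []
    top_label = Counter(labels).most_common(1)[0][0]

    forecast = []
    step = 1
    while times[-1] + step * mean_gap <= times[-1] + horizon:
        forecast.append((times[-1] + step * mean_gap, top_label))
        step += 1
    return forecast


history = [(0.0, "groceries"), (2.0, "groceries"), (4.0, "fuel"), (6.0, "groceries")]
print(frequency_baseline(history, horizon=5.0))
# [(8.0, 'groceries'), (10.0, 'groceries')]
```

A baseline like this has no parameters to train, which is what makes it a useful reference point: a neural model that cannot beat it over a long horizon is not yet adding value there.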

Another issue identified by the researchers is the so-called "collapse" of predictions. Complex models occasionally begin to output repetitive forecasts, ignoring rare but significant events. This is akin to a weather forecaster who promises "cloudy, no precipitation" every day: formally, they will be right in most cases, but they will miss critical weather anomalies.
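
One way to spot this kind of collapse is to compare how varied the predicted event types are with how varied the actual ones are. The Python sketch below is an illustrative diagnostic, not the procedure used in the study. The helper names `label_entropy` and `missed_types` are hypothetical, and the data mirrors the weather analogy above.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of the event-type distribution.
    A value of 0.0 means every event has the same type."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def missed_types(predicted_labels, actual_labels):
    """Event types that occurred but were never predicted."""
    return set(actual_labels) - set(predicted_labels)


actual = ["cloudy"] * 8 + ["storm", "snow"]
predicted = ["cloudy"] * 10  # a collapsed forecast: the same answer every day

print(label_entropy(actual) > label_entropy(predicted))  # True
print(sorted(missed_types(predicted, actual)))           # ['snow', 'storm']
```

Here the collapsed forecast names the right type on 8 days out of 10, yet its entropy is zero and it misses both rare events, which is exactly the failure a per-step accuracy score hides.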

As Andrey Savchenko, the center's Scientific Director, noted:

"Our benchmark and metric allow for an objective assessment of which AI model truly 'sees' the future and which merely makes a lucky guess at the next step. It is particularly important that we identified the problem of prediction 'collapse': complex models sometimes produce uniform forecasts, ignoring rare events. This discovery sets the stage for new research."

An additional result was a significant boost in computing speed: algorithmic optimizations made training and running the models ten times faster. This is a vital practical bonus: researchers will be able to conduct experiments faster, and companies will receive results sooner.

Applications of Temporal Point Process in Finance and Medicine

Where This Will Be Useful

The applications for such tools are quite diverse. Banks and fintech companies will be able to predict more accurately which transactions customers will make and when. Retailers and logistics companies can plan inventory more effectively by understanding not just demand but also its temporal structure. In healthcare, analyzing sequences of doctor visits can aid in the early diagnosis of diseases.

The paper based on the study's results has been accepted for publication in Neurocomputing – a prestigious journal in the field of neural networks, ranked in the first quartile (Q1) of scientific journals in its domain.

The authors hope that HoTPP will become a global standard for researchers – a tool that enables progress toward creating AI capable of truly understanding the uncertainty and complexity of the real world, rather than just guessing the next obvious event.

Original Title: Sber Researchers Present a Tool for Evaluating Long-Term Forecasts of AI Models
SberLabs (sberlabs.com): a Russian AI research lab of Sber developing models for business and scientific applications.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role: analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Analyzing the Original Publication and Writing the Text: Claude Sonnet 4.6 (Anthropic). The neural network studies the original material and generates a coherent text.
2. Translation into English: Gemini 3 Flash Preview (Google DeepMind).
3. Text Review and Editing: Gemini 2.5 Flash (Google DeepMind). Correction of errors, inaccuracies, and ambiguous phrasing.
4. Preparing the Illustration Description: DeepSeek-V3.2 (DeepSeek). Generating a textual prompt for the visual model.
5. Creating the Illustration: FLUX.2 Pro (Black Forest Labs). Generating an image based on the prepared prompt.