Published February 6, 2026

Perplexity Introduces Benchmark for Evaluating Deep AI Research Quality

The new DRACO benchmark evaluates how accurately, thoroughly, and objectively AI systems handle complex topic exploration across various fields of knowledge.

Event Source: Perplexity AI

When people talk about testing language models, tasks like solving math equations or answering reading comprehension questions usually come to mind. But what if you need to evaluate an AI's ability to truly research a topic – to seek out information, analyze it, and form a cohesive understanding? That is exactly why the Perplexity team created DRACO, a benchmark that puts models to the test using realistic research scenarios.

Why Standard AI Benchmarks Fall Short for Research Tasks

Most existing AI benchmarks operate on a "question-answer" basis: models are given a short text that is easy to check for correctness. This is convenient, but it doesn't reflect how people actually use neural networks for deep-dives into complex topics.

When you ask a model to unpack a complex issue – like comparing treatment approaches for a disease or assessing the economic fallout of an event – you expect a comprehensive analysis rather than a single sentence. Evaluating such work is much tougher: you have to check not only factual accuracy but also the depth of the coverage and the objectivity of the viewpoints presented.

How DRACO Evaluates AI Research Quality

Three Pillars of Deep Research

DRACO is built around three key characteristics that are critical for any high-quality research:

Accuracy – checks for the absence of factual errors. If an AI writes about a medical study, it's vital that the data, conclusions, and terminology are spot on.

Completeness – evaluates the scope of the topic. Is there enough information for the reader to get the full picture, or were crucial aspects missed?

Objectivity – analyzes the balance of the presentation. Are alternative viewpoints represented, or does the model lean toward one side while ignoring the rest?

These criteria sound simple, but measuring them in practice is a tall order. You can't just compare the text to an answer key like a high school math test.
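To make the three axes concrete, they can be pictured as a simple scoring record per evaluated response. The sketch below is a minimal illustration only, not Perplexity's actual schema; the class, field names, and equal weighting are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ResearchScore:
    """Hypothetical per-response scores on DRACO's three axes, each in [0, 1]."""
    accuracy: float      # absence of factual errors
    completeness: float  # coverage of the topic's crucial aspects
    objectivity: float   # balance of viewpoints presented

    def overall(self) -> float:
        # Equal weighting is an assumption; the article suggests the real
        # benchmark stresses different axes per domain (e.g. accuracy in
        # medicine, objectivity in politics).
        return (self.accuracy + self.completeness + self.objectivity) / 3


score = ResearchScore(accuracy=0.9, completeness=0.7, objectivity=0.8)
print(round(score.overall(), 3))  # → 0.8
```

Keeping the axes as separate fields, rather than a single grade, matches the article's point: a report can be factually accurate yet incomplete or one-sided, and the benchmark needs to see each failure mode on its own.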

Inside the Benchmark

DRACO includes prompts across four domains: medicine, science, finance, and politics. This split is intentional – it allows testing of how models adapt to the nuances of different subjects. In medicine, data accuracy is the priority; in politics, it's the balance of opinions; and in finance, it's the relevance of information.

Each prompt is designed to force the model into a deep-dive investigation. This isn't a "Who invented the telephone?" kind of question, but rather: "What are the current approaches to treating this disease, and what do the latest clinical trials say about their effectiveness?"

To evaluate the results, the Perplexity team developed a system that uses another language model as a "judge". This is a common practice in the industry today: one model generates the response, while a second assesses its quality based on specific parameters. Of course, this approach isn't perfect, but it allows the process to be automated and reproducible.
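The "model as judge" pattern described above boils down to two mechanical steps: assemble a grading prompt for the second model, then parse structured scores out of its free-text reply. The sketch below shows that shape under stated assumptions; the prompt wording, the 1-5 scale, and the "name: N" reply format are illustrative inventions, not Perplexity's actual implementation.

```python
CRITERIA = ("accuracy", "completeness", "objectivity")


def build_judge_prompt(question: str, report: str) -> str:
    """Assemble the instruction handed to the judge model (hypothetical format)."""
    rubric = "\n".join(f"- {c}: rate 1-5" for c in CRITERIA)
    return (
        "You are grading a research report.\n"
        f"Question: {question}\n"
        f"Report:\n{report}\n"
        "Score each criterion on its own line as 'name: N':\n"
        f"{rubric}"
    )


def parse_judge_reply(reply: str) -> dict[str, int]:
    """Extract 'criterion: score' lines from the judge's raw text reply."""
    scores = {}
    for line in reply.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() in CRITERIA and value.strip().isdigit():
            scores[name.strip().lower()] = int(value.strip())
    return scores


# In a real pipeline the reply would come from a model API call; here we
# parse a hand-written example to show the expected shape.
reply = "accuracy: 4\ncompleteness: 3\nobjectivity: 5"
print(parse_judge_reply(reply))  # → {'accuracy': 4, 'completeness': 3, 'objectivity': 5}
```

Forcing the judge into a fixed reply format is what makes the article's "automated and reproducible" claim work: the scores can be extracted and compared across models without a human reading every report.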

Why AI Research Evaluation Benchmarks Are Important

The emergence of DRACO is driven by the fact that more and more users are turning to AI not just for quick facts, but to dive deep into complex topics. Tools like "Deep Research" within Perplexity itself or similar solutions from competitors aim to do just that: help users get to the bottom of a problem by gathering and analyzing info from a multitude of sources.

But how do you objectively measure the effectiveness of such tools? Subjective feedback isn't enough for systematic technological growth. A benchmark, however, provides a way to track progress and compare different algorithms against one another.

Future Development of the DRACO Benchmark

DRACO is just the foundation, not the final word. The Perplexity team has openly stated that the benchmark will evolve: new disciplines will be added, criteria will be refined, and additional metrics may eventually appear.

Furthermore, the question remains as to how well the automated "model-as-a-judge" assessment aligns with human perceptions of quality. This is a well-known issue in the AI field: algorithms can still diverge from humans on what constitutes a "good" answer.

Nevertheless, the very existence of such a tool shows that the industry is moving toward more sophisticated and realistic methods for vetting AI. Now, it's not just about "right or wrong"; it's about how useful and reliable the results are for solving real-world tasks.

Original Title: Evaluating Deep Research Performance in the Wild with the DRACO Benchmark
Publication Date: Feb 6, 2026
Perplexity AI (research.perplexity.ai): a U.S.-based company developing an AI-powered search engine with source-based answers.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item was selected as an event important for understanding AI development. A processing framework was then defined: what needed clarification, what context to add, and where to place emphasis. This allowed a single announcement or update to be turned into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Analyzing the Original Publication and Writing the Text – Claude Sonnet 4.5 (Anthropic). The neural network studies the original material and generates a coherent text.

2. Translation into English – Gemini 3 Pro (Google DeepMind).

3. Text Review and Editing – Gemini 3 Flash Preview (Google DeepMind). Correction of errors, inaccuracies, and ambiguous phrasing.

4. Preparing the Illustration Description – DeepSeek-V3.2 (DeepSeek). Generating a textual prompt for the visual model.

5. Creating the Illustration – FLUX.2 Pro (Black Forest Labs). Generating an image based on the prepared prompt.
