Published on September 28, 2025

AI-powered video search by precise change descriptions

How to teach AI to search for videos by precise change descriptions – and why it matters more than you think

A deep dive into building a video search engine that doesn't just match keywords, but actually understands fine-grained descriptions of what's happening – and can pick the right clip out of millions.

Computer Science · 6–8 min read
Author: Dr. Sophia Chen

Imagine you're editing a film and looking for a very specific shot: not just «a child playing the piano», but precisely «a young child instead of an adult, with a teacher nearby and sheet music on the stand». Or maybe you're a content creator searching for a nature clip – not just any nature video, but «a lone tree on a hill, filmed with a static camera, with drifting clouds that create a sense of calm». Sounds like sci-fi? In reality, this is one of the hottest challenges in modern AI.

Precision video search revolution

Video search as an art of precision

Traditional video search works simply: you type in keywords, and the system shows you similar clips. But what if you don't just want to find a video – you need a modified version of it? Say you have a clip of a dancer in red clothes, but you need the exact same one in blue.

This task is called composed video retrieval (CoVR). Think of it as Google, but instead of typing keywords, you show an example video and explain exactly what should be changed. Like Hermione in Harry Potter, who always knew exactly what she was looking for in the Hogwarts library – not just «something about magic».

AI limitations in video search nuances

The problem: AI doesn't get the nuances

Current video search systems hit a major roadblock. They're fine with broad matches, but the moment you ask for details – they stumble.

Take a simple case. You have a video of a man playing the piano. A regular system, when asked «as a child», might return any random video with kids – sometimes not even music-related. But what you really want is the same setup, only with a child at the piano instead of an adult.

The issue is that traditional methods work like a translator who knows the words but not the meaning of the sentence. AI sees «child» and «piano» as separate islands, not as a coherent picture with precise changes.

Detailed descriptions in video retrieval

The revolution is in the details

Researchers came up with a new approach. Instead of vague one-liners about changes, they use rich, detailed descriptions of what exactly should be altered in a video.

Picture the difference between «make the background green» and «add a tranquil street scene with a lone tree on a grassy hill, use a static camera to capture the subtle motion of the tree and clouds, creating a sense of calm and natural beauty». The first is like a text message. The second – like a technical spec sheet.

The new dataset, Dense-WebVid-CoVR, contains 1.6 million examples with these detailed instructions. On average, a video description runs to 81 words and a change request to 31 words. That's about seven times more detail than earlier datasets offered!
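To make that concrete, here is a minimal sketch of what a single example might look like. The class name, field names, file paths and texts are all hypothetical, invented for illustration; the article does not describe the dataset's actual file format.

```python
from dataclasses import dataclass


@dataclass
class CoVRSample:
    """One example; every field name here is a hypothetical choice."""
    query_video: str        # ID or path of the source clip
    video_description: str  # detailed description of the source clip
    modification_text: str  # detailed description of the requested change
    target_video: str       # ID or path of the clip that should be retrieved


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


# Made-up sample, far shorter than the dataset averages quoted above.
sample = CoVRSample(
    query_video="clips/adult_pianist.mp4",
    video_description="A man in a dark suit plays a grand piano in a bright room.",
    modification_text=(
        "Show a young child at the piano instead of an adult, "
        "with a teacher nearby and sheet music on the stand."
    ),
    target_video="clips/child_pianist.mp4",
)

print(word_count(sample.video_description), word_count(sample.modification_text))
```

A real change request in the dataset would average around 31 words, so it would carry noticeably more detail than this toy one.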

How composed video retrieval works

How it works: the architecture of understanding

The system runs like a trio of instruments in an orchestra:

Visual encoder – like an artist, watching the original video and sketching its digital portrait. Instead of analyzing every single frame, it uses the middle frame – efficient and precise.

Text encoder – like a literary critic, reading the description and extracting meaning. It builds a semantic map of what's happening in the scene.

Reasoning encoder – the star player. Like a director, it merges visuals with text instructions and builds a unified understanding of what needs to be retrieved.

The key innovation: all three components work together, not in isolation. Earlier systems paired elements step by step – video with text, then video with edits, then text with edits. The new approach merges everything into one «brain center».
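The flow above can be sketched in a few lines of Python. This is only a toy under loud assumptions: the embeddings are made-up three-number vectors, and the `fuse` function simply averages them as a stand-in for the learned reasoning encoder, which in the real system is a trained neural network. All function names are hypothetical.

```python
import math


def middle_frame(frames):
    """Pick the middle frame of a clip, as the article describes."""
    return frames[len(frames) // 2]


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(visual, description, modification):
    """Toy stand-in for the reasoning encoder: average the three embeddings.
    Assumption: the real model learns this fusion instead of averaging."""
    mean = [(v + d + m) / 3.0 for v, d, m in zip(visual, description, modification)]
    return l2_normalize(mean)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank(query_embedding, database):
    """Return candidate IDs sorted by similarity to the query, best first."""
    return sorted(database, key=lambda vid: cosine(query_embedding, database[vid]),
                  reverse=True)


# Made-up data for illustration only.
frames = ["frame0", "frame1", "frame2", "frame3", "frame4"]
chosen = middle_frame(frames)        # the frame a visual encoder would look at
visual_emb = [1.0, 0.0, 0.0]         # pretend embedding of that frame
description_emb = [0.8, 0.2, 0.0]    # pretend embedding of the video description
modification_emb = [0.0, 1.0, 0.0]   # pretend embedding of the change request

query = fuse(visual_emb, description_emb, modification_emb)
database = {
    "child_at_piano": [0.6, 0.4, 0.0],
    "kids_playing_football": [0.0, 0.3, 0.9],
    "adult_at_piano": [1.0, 0.0, 0.0],
}
ranking = rank(query, database)
print(chosen, ranking)
```

The point of the sketch is the shape of the pipeline: one fused query vector is compared against every candidate clip, and the closest match wins.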

Behind the scenes of AI video search

The math behind the curtain

The system learns through contrastive training – think of it as a «spot the difference» game. It's fed correct request-result pairs and incorrect ones, and learns to tell them apart.

The formula may look intimidating, but the core idea is simple: maximize similarity for correct pairs, minimize it for wrong ones. Like a trained sommelier distinguishing fine wine from a fake.
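The article does not reproduce the formula itself. As an assumption, a common form of contrastive objective (often called InfoNCE) looks like the one below, where q_i is the fused query, v_i is its correct target video, sim is a similarity score such as cosine similarity, B is the batch size and τ is the temperature. The symbols are chosen here for illustration and may differ from the original paper.

```latex
\mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B}
  \log \frac{\exp\bigl(\mathrm{sim}(q_i, v_i)/\tau\bigr)}
            {\sum_{j=1}^{B} \exp\bigl(\mathrm{sim}(q_i, v_j)/\tau\bigr)}
```

The numerator rewards a high score for the correct pair; the denominator penalizes high scores for every other video in the batch.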

The temperature parameter τ = 0.07 controls how «confident» the system is. Too high – overly cautious. Too low – overly cocky.
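To see what the temperature does, here is a small sketch with made-up similarity scores. It assumes a softmax-based contrastive loss, which is a common choice but is not spelled out in the article; the function names are hypothetical.

```python
import math


def softmax_with_temperature(similarities, tau):
    """Turn similarity scores into probabilities; smaller tau sharpens them."""
    scaled = [s / tau for s in similarities]
    peak = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_loss(similarities, correct_index, tau=0.07):
    """Negative log-probability assigned to the correct candidate."""
    probs = softmax_with_temperature(similarities, tau)
    return -math.log(probs[correct_index])


# Made-up scores: index 0 is the correct video, index 1 a close distractor.
sims = [0.80, 0.70, 0.20]

sharp = softmax_with_temperature(sims, tau=0.07)  # the value quoted in the article
soft = softmax_with_temperature(sims, tau=1.0)    # a much higher temperature

print("tau=0.07:", [round(p, 3) for p in sharp])
print("tau=1.0: ", [round(p, 3) for p in soft])
```

With the low temperature, the small gap between 0.80 and 0.70 turns into a clear preference for the correct video; with the high one, the three candidates look almost equally plausible.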

Performance of video search AI

Results: the numbers speak

The system delivers impressive benchmarks:

  • Recall@1 (best match accuracy): 71.3% vs. 67.9% for the top competitor
  • Processing speed: 3× faster than previous solutions
  • Net gain: 3.4 percentage points on Recall@1, the key metric

In practice? Out of 10 searches, the system gets the right video on the first try 7 times. For AI, that's top-notch.
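For the curious, Recall@K is simple to compute: it is the share of searches where the correct video appears among the top K results. The sketch below uses made-up rankings that mirror the «7 out of 10» picture; the function name and data are illustrative, not taken from the paper.

```python
def recall_at_k(rankings, targets, k=1):
    """Fraction of queries whose correct target is within the top-k results."""
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return hits / len(targets)


# Ten made-up searches: the correct clip is ranked first in 7 of them
# and second in the remaining 3.
targets = [f"target_{i}" for i in range(10)]
rankings = [
    [t, "other_clip"] if i < 7 else ["other_clip", t]
    for i, t in enumerate(targets)
]

print("Recall@1:", recall_at_k(rankings, targets, k=1))
print("Recall@2:", recall_at_k(rankings, targets, k=2))
```

Raising K can only keep the score the same or increase it, which is why Recall@1 is the strictest version of the metric.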

Real-world testing of video retrieval AI

Testing in the wild

The team tested the system not just on synthetic data, but on real-world cases:

EgoCVR dataset – first-person videos where timing is crucial. The system nailed it in zero-shot mode (no extra training).

Composed image retrieval – the same approach adapted for still images. On the CIRR dataset, it hit 56.30% accuracy, outperforming rivals.

Fashion items – searching for clothes with modifications. On FashionIQ, the system successfully retrieved dresses, shirts, and tops with the exact tweaks requested.

Data quality in AI video search

The secret sauce: data quality

Half the success lies in meticulous data prep. Researchers manually reviewed all 3,000 test samples. Like proofreading a critical book – every word must fit.

The quality-control pipeline had seven steps:

  • Side-by-side video comparison
  • Contextual consistency check
  • Action and object validation
  • Temporal alignment check
  • Description completeness review
  • Clarity and brevity check
  • Automatic filtering of low-quality samples

Practical applications of advanced video search

Real-world applications

Where can this already be useful?

Film production: Directors and editors can instantly find the exact shots they need. Hours of footage reduced to seconds of search.

Education: Teachers can locate videos with specific scenarios. «Find a video of a chemical reaction – not in a test tube, but in an industrial reactor».

Content marketing: Creators can grab source clips with precise mood and style requirements.

Archives and libraries: Digital archives can offer sharper, more contextual search across historical footage.

Challenges and future of AI video search

Limits and what's next

Of course, the system isn't flawless. Around 2–3% of the training modification texts contain small inaccuracies. But tests show this barely impacts overall performance.

Main limitations:

  • High computational cost
  • Dependence on quality descriptions
  • Reliance on pre-trained models
  • Currently monolingual (English only)

Future of AI-driven content search

A look ahead

This technology paves the way for smarter content search. Imagine a search engine that understands not just words, but also context, mood, and style.

Next steps in development:

  • Multilingual support
  • Handling live video and streams
  • Integration with auto-editing tools
  • Search by emotional context

Importance of intelligent video search

Why it matters

We live in an era of information overload. Every minute, 500 hours of video land on YouTube. Without intelligent search, most of it remains untapped.

The new approach to composed video retrieval isn't just a technical upgrade. It's a leap toward more intuitive human–machine interaction – where AI grasps not only what we're searching for, but why.

After all, as the saying goes: «AI is like a child – it repeats our mistakes, but learns faster». And the clearer we explain the task, the better it performs.

See you in the future – where search will feel as natural as conversation! 🚀

Original Title: Beyond Simple Edits: Composed Video Retrieval with Dense Modifications
Article Publication Date: Aug 19, 2025
Original Article Authors: Omkar Thawakar, Dmitry Demidov, Ritesh Thawkar, Rao Muhammad Anwer, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan

From Research to Understanding

How This Text Was Created

This material is based on a real scientific study, not generated “from scratch.” At the beginning, neural networks analyze the original publication: its goals, methods, and conclusions. Then the author creates a coherent text that preserves the scientific meaning but translates it from academic format into clear, readable exposition – without formulas, yet without loss of accuracy.

Neural Networks Involved in the Process

We show which models were used at each stage – from research analysis to editorial review and illustration creation. Each neural network performs a specific role: some handle the source material, others work on phrasing and structure, and others focus on the visual representation. This ensures transparency of the process and trust in the results.

  1. Research Summarization: DeepSeek-V3 (DeepSeek) – highlighting key ideas and results
  2. Creating Text from Summary: Claude Sonnet 4 (Anthropic) – transforming the summary into a coherent explanation
  3. Translation into English: GPT-5 (OpenAI)
  4. Creating Illustration: Phoenix 1.0 (Leonardo AI) – generating an image based on the prepared prompt
