How to teach AI to search for videos by precise change descriptions – and why it matters more than you think

A deep dive into building a video search engine that doesn’t just match keywords, but actually understands fine-grained descriptions of what’s happening – and can pick the right clip out of millions.

Computer Science
Author: Dr. Sophia Chen · Reading time: 6–8 minutes

Explaining AI mistakes: 78% · Accessible for everyone: 85% · Engineering depth: 91%
Original title: Beyond Simple Edits: Composed Video Retrieval with Dense Modifications
Publication date: Aug 19, 2025

Imagine you’re editing a film and looking for a very specific shot: not just «a child playing the piano», but precisely «a young child instead of an adult, with a teacher nearby and sheet music on the stand.» Or maybe you’re a content creator searching for a nature clip – not just any nature video, but «a lone tree on a hill, filmed with a static camera, with drifting clouds that create a sense of calm.» Sounds like sci-fi? In reality, this is one of the hottest challenges in modern AI.

Video search as an art of precision

Traditional video search works simply: you type in keywords, and the system shows you similar clips. But what if you don’t just want to find a video – you need a modified version of it? Say you have a clip of a dancer in red clothes, but you need the exact same one in blue.

This task is called composed video retrieval (CoVR). Think of it as Google, but instead of typing keywords, you show an example and explain what exactly should be changed. Like Hermione in Harry Potter, who always knew exactly what she was looking for in the Hogwarts library – not just «something about magic.»

The problem: AI doesn’t get the nuances

Current video search systems hit a major roadblock. They’re fine with broad matches, but the moment you ask for details – they stumble.

Take a simple case. You have a video of a man playing the piano. Given the modification «as a child», a conventional system might return any random video with kids – sometimes not even music-related. But what you really want is the same setup, only with a child at the piano instead of an adult.

The issue is that traditional methods work like a translator who knows the words but not the meaning of the sentence. AI sees «child» and «piano» as separate islands, not as a coherent picture with precise changes.

The revolution is in the details

Researchers came up with a new approach. Instead of vague one-liners about changes, they use rich, detailed descriptions of what exactly should be altered in a video.

Picture the difference between «make the background green» and «add a tranquil street scene with a lone tree on a grassy hill, use a static camera to capture the subtle motion of the tree and clouds, creating a sense of calm and natural beauty.» The first is like a text message. The second – like a technical spec sheet.

The new dataset, Dense-WebVid-CoVR, contains 1.6 million examples with these detailed instructions. On average: 81 words per video description and 31 words for the change request. That’s seven times richer than older systems!

How it works: the architecture of understanding

The system runs like a trio of musicians playing in sync:

Visual encoder – like an artist, watching the original video and sketching its digital portrait. Instead of analyzing every single frame, it uses the middle frame – efficient and precise.

Text encoder – like a literary critic, reading the description and extracting meaning. It builds a semantic map of what’s happening in the scene.

Reasoning encoder – the star player. Like a director, it merges visuals with text instructions and builds a unified understanding of what needs to be retrieved.

The key innovation: all three components work together, not in isolation. Earlier systems paired elements step by step – video with text, then video with edits, then text with edits. The new approach merges everything into one «brain center.»
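The joint design above can be sketched in a few lines of numpy. This is a minimal toy illustration, not the paper's implementation: the encoders, the token table, and the fusion matrix `W` are all stand-ins I made up to show the shape of the idea – one video embedding (from the middle frame), one text embedding, and a single joint projection instead of pairwise combinations.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy embedding size; real models use hundreds of dimensions

# Toy token-embedding table standing in for a pretrained text model.
token_table = rng.normal(size=(100, DIM))

def visual_encoder(frame_features):
    """Embed the video via its middle frame, as described above."""
    middle = frame_features[len(frame_features) // 2]
    return middle / (np.linalg.norm(middle) + 1e-8)

def text_encoder(token_ids):
    """Mean-pool token embeddings into one normalized text vector."""
    emb = token_table[token_ids].mean(axis=0)
    return emb / (np.linalg.norm(emb) + 1e-8)

# Hypothetical "reasoning" step: one joint projection over both inputs.
W = rng.normal(size=(DIM, 2 * DIM))

def fuse(video_emb, text_emb):
    """Merge visuals and text into a single query embedding."""
    joint = np.concatenate([video_emb, text_emb])
    out = W @ joint
    return out / (np.linalg.norm(out) + 1e-8)

# Build one query from a reference clip plus a dense modification text.
frames = rng.normal(size=(16, DIM))   # 16 pre-extracted frame features
mod_tokens = np.array([3, 17, 42])    # toy token ids for the modification
query = fuse(visual_encoder(frames), text_encoder(mod_tokens))

# Retrieval: cosine similarity against a gallery of candidate videos.
gallery = rng.normal(size=(5, DIM))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
best = int(np.argmax(gallery @ query))
```

In a real system the random matrices would be replaced by trained transformer weights, but the flow – encode, fuse once, rank by similarity – stays the same.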

The math behind the curtain

The system learns through contrastive training – think of it as a «spot the difference» game. It’s fed correct request-result pairs and incorrect ones, and learns to tell them apart.

The formula may look intimidating, but the core idea is simple: maximize similarity for correct pairs, minimize it for wrong ones. Like a trained sommelier distinguishing fine wine from a fake.

The temperature parameter τ = 0.07 controls how sharply the system separates matches from non-matches. Set it too high and the scores flatten out – overly cautious. Too low and they become spiky – overconfident.
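The «spot the difference» game above is the standard InfoNCE-style contrastive loss. Here is a small numpy sketch of that idea – the batch, the embeddings, and the dimensions are toy values of my own choosing, but the mechanics (scale similarities by τ, treat the diagonal as the correct pairs) match the general recipe:

```python
import numpy as np

def info_nce_loss(query_embs, target_embs, tau=0.07):
    """Contrastive loss: row i of query_embs should match row i of
    target_embs; every other row in the batch is a negative."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    t = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)
    logits = (q @ t.T) / tau                       # temperature-scaled sims
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Correct pairs sit on the diagonal; minimize their negative log-prob.
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(1)
targets = rng.normal(size=(4, 16))
aligned = targets + 0.01 * rng.normal(size=targets.shape)  # matched pairs
shuffled = targets[::-1]                                    # mismatched pairs

good = info_nce_loss(aligned, targets)
bad = info_nce_loss(shuffled, targets)
```

When queries line up with their targets the loss is near zero; shuffle the pairing and it shoots up – exactly the signal training pushes on.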

Results: the numbers speak

The system delivers impressive benchmarks:

  • Recall@1 (best match accuracy): 71.3% vs. 67.9% for the top competitor
  • Processing speed: 3× faster than previous solutions
  • +3.4% boost on the key metric

In practice? Out of 10 searches, the system gets the right video on the first try 7 times. For AI, that’s top-notch.
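Recall@1 itself is a simple metric to compute. The sketch below shows how it falls out of a similarity matrix; the three-query example is invented for illustration, not taken from the paper's benchmark:

```python
def recall_at_k(similarity, ground_truth, k=1):
    """Fraction of queries whose correct item appears in the top-k results.

    similarity: (num_queries x num_gallery) list of score rows
    ground_truth: for each query, the index of its correct gallery item
    """
    hits = 0
    for scores, correct in zip(similarity, ground_truth):
        top_k = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
        hits += correct in top_k
    return hits / len(ground_truth)

sims = [
    [0.9, 0.1, 0.3],   # query 0: correct item 0 ranked first  -> hit
    [0.2, 0.4, 0.8],   # query 1: correct item 1 ranked second -> miss at k=1
    [0.1, 0.2, 0.7],   # query 2: correct item 2 ranked first  -> hit
]
truth = [0, 1, 2]
r1 = recall_at_k(sims, truth, k=1)   # 2 of 3 queries hit
```

A Recall@1 of 71.3% means that for roughly 7 queries out of 10, the single top-ranked video is the right one.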

Testing in the wild

The team tested the system not just on synthetic data, but on real-world cases:

Ego-CVR dataset – first-person videos where timing is crucial. The system nailed it in zero-shot mode (no extra training).

Composed image retrieval – adapted for still images. On the CIRR dataset, it hit 56.30% accuracy, outperforming rivals.

Fashion items – searching for clothes with modifications. On FashionIQ, the system successfully retrieved dresses, shirts, and tops with the exact tweaks requested.

The secret sauce: data quality

Half the success lies in meticulous data prep. Researchers manually reviewed all 3,000 test samples. Like proofreading a critical book – every word must fit.

The quality-control pipeline had seven steps:

  • Side-by-side video comparison
  • Contextual consistency check
  • Action and object validation
  • Temporal alignment check
  • Description completeness review
  • Clarity and brevity check
  • Automatic filtering of low-quality samples
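The final step – automatic filtering – can be imagined as a set of cheap text checks that run before any human sees a sample. The rule set below is entirely hypothetical (the paper does not publish its exact filter), but it shows the kind of completeness and brevity checks such a pipeline might apply:

```python
def passes_auto_filter(description, modification,
                       min_desc_words=20, min_mod_words=5):
    """Hypothetical low-quality-sample filter: reject texts that are too
    short to be a dense description, or a modification that merely
    repeats the description verbatim."""
    desc_words = description.split()
    mod_words = modification.split()
    if len(desc_words) < min_desc_words or len(mod_words) < min_mod_words:
        return False
    if modification.strip().lower() == description.strip().lower():
        return False
    return True

dense_desc = ("a tranquil street scene with a lone tree on a grassy hill "
              "filmed with a static camera capturing the subtle motion of "
              "the tree and the drifting clouds above the quiet road")
good_mod = "replace the adult pianist with a young child at the piano"
weak_mod = "as a child"

ok = passes_auto_filter(dense_desc, good_mod)      # dense enough -> keep
rejected = passes_auto_filter(dense_desc, weak_mod)  # too terse -> drop
```

Thresholds like `min_desc_words` would be tuned against the dataset's actual averages (81 words per description, 31 per modification).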

Real-world applications

Where can this already be useful?

Film production: Directors and editors can instantly find the exact shots they need. Hours of footage reduced to seconds of search.

Education: Teachers can locate videos with specific scenarios. «Find a video of a chemical reaction – not in a test tube, but in an industrial reactor.»

Content marketing: Creators can grab source clips with precise mood and style requirements.

Archives and libraries: Digital archives can offer sharper, more contextual search across historical footage.

Limits and what’s next

Of course, the system isn’t flawless. Around 2–3% of the training modification texts contain small inaccuracies. But tests show this barely impacts overall performance.

Main limitations:

  • High computational cost
  • Dependence on quality descriptions
  • Reliance on pre-trained models
  • Currently monolingual (English only)

A look ahead

This technology paves the way for smarter content search. Imagine a search engine that understands not just words, but also context, mood, and style.

Next steps in development:

  • Multilingual support
  • Handling live video and streams
  • Integration with auto-editing tools
  • Search by emotional context

Why it matters

We live in an era of information overload. Every minute, 500 hours of video land on YouTube. Without intelligent search, most of it remains untapped.

The new approach to composed video retrieval isn’t just a technical upgrade. It’s a leap toward more intuitive human–machine interaction – where AI grasps not only what we’re searching for, but why.

After all, as the saying goes: «AI is like a child – it repeats our mistakes, but learns faster.» And the clearer we explain the task, the better it performs.

See you in the future – where search will feel as natural as conversation! 🚀

Original authors: Omkar Thawakar, Dmitry Demidov, Ritesh Thawkar, Rao Muhammad Anwer, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan