How to Teach a Drone to Understand Human Speech: From Pixel to Flight

Researchers have developed See, Point, Fly, a system that lets drones navigate anywhere on simple voice commands – no task-specific training or mountains of labeled data required.

Computer Science
Author: Dr. Kim Lee
Reading time: 12–18 minutes

Original title: See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation
Publication date: Sep 26, 2025

Imagine telling a drone, «Fly to that tree behind the red car» – and it just flies. It doesn't ask for coordinates, require a pre-programmed route, or get confused by obstacles. It sees the world through its camera, understands your words, and translates them into a sequence of movements. Sound like science fiction? Until recently, it was. But a team of researchers has created a system called See, Point, Fly (SPF) that does exactly that. And the most amazing part? It doesn't need months of training on specialized data. 🚁

The Problem: When Drones Don't Speak Plain English

Drones are already delivering packages, filming movies, and patrolling areas. But there's a catch: most of them follow pre-programmed routes or are controlled by a human with a remote. Asking a drone to «find the person in the blue jacket and follow them» was a task that, until recently, seemed nearly impossible.

Why? Because it requires three things at once:

  • Visual Perception – The drone has to see the world and understand where the tree is in the image versus the parked car.
  • Language Understanding – It must parse your command, even if you phrase it unclearly.
  • Action Planning – It needs to turn all this information into concrete movements: forward two meters, left, up.

The classic approach was like teaching a kid to drive by showing them thousands of examples: «Here, we turn left; here, we brake.» Scientists created huge datasets of flight trajectories in which every drone action was manually labeled, and the model learned to repeat those patterns. The problem? A drone trained only in an apartment would get lost outdoors. One trained in daylight would struggle at dusk. There was never enough data, and the world is just too diverse.

The Revolution: Language Models Learned to See

In the last couple of years, a breakthrough happened. Multimodal language models emerged – systems that understand both text and images simultaneously. Remember how GPT-4V or Gemini can look at a photo and describe what’s happening? That's a VLM – a Vision-Language Model.

These models were trained on massive internet datasets: millions of images with captions, scene descriptions, and instructions. They've seen everything from cat selfies to photos of space. And they learned not just to recognize objects, but to reason about them. They can explain why a person in a picture looks sad. They can suggest how to get from point A to point B on a map.

The natural thought followed: what if we used these models to control drones? Let the VLM look at the camera feed, read the human's command, and tell the drone where to fly!

But There's a Snag

The first attempts were naive. Researchers simply asked the VLM to generate text commands like «forward 2 meters» or «turn left 45 degrees.» Imagine this: the neural network sees an image, processes the instruction «fly to the tree», and spits out an answer in words.

It's like asking a friend in the passenger seat for directions, but they can only say, «A little to the left.» What does «a little» mean? 10 degrees? 30? Or a full 90?

The problems were obvious:

  • The set of commands was fixed and limited.
  • Precision was lacking – «roughly over there» isn't a viable instruction for an aircraft.
  • The VLM had to think in terms of «text about actions» rather than visual coordinates.

And this is where the creators of SPF pulled off a brilliant trick.

The Key Idea: Don't Tell, Point

The authors of SPF thought: why force the model to describe in words where to fly? It can see the picture! Just let it point to a spot on it. As if you were pointing a finger and saying, «Right there.»

It seems obvious, but it changes everything.

Instead of generating text like «fly forward 1.5 meters and slightly left», the model simply places a dot on the image: here's the pixel with coordinates (320, 240). From there, it's pure math. We know the drone's camera parameters, we know its field of view. So, we can convert a point on a flat image into a 3D vector: how far to fly forward, left-right, and up-down.

It's the difference between «tell me the directions to the subway» and «show me on the map.» The second one is more concrete, more precise, and easier to understand.

How It Works: A Step-by-Step Breakdown

Let's walk through a scenario. You're in Namsan Park in Seoul, you launch your drone, and you say, «Fly to the bench near the fountain.»

Step 1. The Drone Sees the World

The drone's camera sends an image to the SPF system. The picture shows trees, paths, people, a fountain in the distance, and a bench next to it.

Step 2. The VLM Analyzes the Scene

The language model receives two inputs: the image and your command. It starts to reason: «Okay, I need to find a bench. I see several. But the instruction says 'near the fountain.' There's the fountain, so I need the bench next to it.»

Step 3. The Model Places a Dot

The VLM doesn't generate text like «fly 3 meters north-northwest.» It outputs a structured response: pixel coordinates (e.g., x=450, y=300) and an approximate depth estimate – how far away the object is. The depth is discrete: «near», «medium», or «far.»
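The structured reply can be a tiny JSON object. Here is a minimal parsing sketch; the exact schema (field names, the three depth labels, and the scores they map to) is an assumption for illustration, not the paper's actual format.

```python
import json

# Hypothetical depth labels mapped to a score in [0, 1]; these values are
# illustrative, not taken from the paper.
DEPTH_LABELS = {"near": 0.0, "medium": 0.5, "far": 1.0}

def parse_vlm_response(raw: str) -> tuple[int, int, float]:
    """Extract (x, y, depth score) from a reply like
    {"x": 450, "y": 300, "depth": "medium"}."""
    data = json.loads(raw)
    x, y = int(data["x"]), int(data["y"])
    depth = DEPTH_LABELS[data["depth"].lower()]
    return x, y, depth

print(parse_vlm_response('{"x": 450, "y": 300, "depth": "medium"}'))  # → (450, 300, 0.5)
```

Keeping the output this constrained is what makes the rest of the pipeline pure geometry rather than text interpretation.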

Step 4. Conversion to a 3D Command

Now for the math. We have a point on a 2D image and a distance estimate. Using the camera's intrinsic parameters (camera intrinsics), such as focal length and sensor size, we convert this into a 3D vector: «fly 2.1 meters forward, 0.3 meters to the right, and maintain current altitude.»
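As a sketch of that conversion, assuming a standard pinhole camera model (focal lengths `fx`, `fy` and principal point `cx`, `cy` in pixels – the calibration values below are invented, not a real drone's):

```python
import math

def pixel_to_direction(x, y, fx, fy, cx, cy):
    """Back-project a pixel through a pinhole camera model into a unit
    direction vector in the camera frame (z forward, x right, y down)."""
    dx = (x - cx) / fx          # horizontal offset in normalized image coords
    dy = (y - cy) / fy          # vertical offset
    norm = math.sqrt(dx * dx + dy * dy + 1.0)
    return dx / norm, dy / norm, 1.0 / norm

def pixel_to_waypoint(x, y, distance_m, fx, fy, cx, cy):
    """Scale the unit direction by an estimated distance to get a 3D
    waypoint (in meters) relative to the drone."""
    ux, uy, uz = pixel_to_direction(x, y, fx, fy, cx, cy)
    return ux * distance_m, uy * distance_m, uz * distance_m

# A pixel at the image center maps straight ahead:
print(pixel_to_waypoint(320, 240, 2.0, fx=460.0, fy=460.0, cx=320.0, cy=240.0))
# → (0.0, 0.0, 2.0)
```

Off-center pixels tilt the vector left/right and up/down, which is exactly the «show me on the map» intuition expressed as geometry.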

Step 5. Commands for the Motors

A drone isn't a remote-controlled car you can just push in the right direction. It has four rotors (if it's a quadcopter), and you need to calculate the speed for each one. The SPF system converts the 3D vector into control commands: roll, pitch, and thrust. It's like telling the drone, «Tilt forward a bit, give the right rotors more power, and increase the overall lift.»
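To make that last translation concrete, here is a toy proportional mapping from a displacement vector to attitude commands. The gains, tilt limit, and hover-thrust value are all invented for illustration – a real flight controller does far more (PID loops, motor mixing), and this bears no relation to the paper's actual control stack.

```python
def clamp(v, lo, hi):
    """Limit v to the interval [lo, hi]."""
    return max(lo, min(hi, v))

def vector_to_commands(forward_m, right_m, up_m,
                       tilt_gain=5.0, thrust_gain=0.3, max_tilt_deg=15.0):
    """Toy proportional mapping from a 3D displacement (meters) to high-level
    commands: pitch tilts the drone forward, roll tilts it sideways, and
    thrust offsets hover to climb or descend. Gains are illustrative only."""
    pitch_deg = clamp(tilt_gain * forward_m, -max_tilt_deg, max_tilt_deg)
    roll_deg = clamp(tilt_gain * right_m, -max_tilt_deg, max_tilt_deg)
    thrust = clamp(0.5 + thrust_gain * up_m, 0.0, 1.0)  # 0.5 ≈ hover
    return pitch_deg, roll_deg, thrust

print(vector_to_commands(2.1, 0.3, 0.0))
```

The tilt limit is the part doing real work here: it keeps an over-ambitious waypoint from flipping the drone.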

Step 6. The Drone Moves

The motors engage, and the drone flies in the specified direction. But it doesn't just fly and forget! This isn't a one-time command. A moment later (say, every 0.5 seconds), the camera takes another picture, and the entire cycle repeats. This is called closed-loop control. It's like when you're walking to a store: you don't just memorize the direction at the start; you constantly look around and correct your path.
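The whole see–point–fly cycle can be sketched as a loop. The camera, VLM, and drone interfaces below are stand-in callables, not a real drone API:

```python
import time

def control_loop(get_frame, query_vlm, send_command, reached_target,
                 period_s=0.5, max_steps=100):
    """Closed-loop control sketch: each iteration grabs a fresh frame, asks
    the VLM for the next waypoint, and issues one short motion command.
    Replanning every period_s seconds is what lets the drone correct itself."""
    for _ in range(max_steps):
        frame = get_frame()          # see: latest camera image
        waypoint = query_vlm(frame)  # point: VLM picks the next waypoint
        send_command(waypoint)       # fly: execute one short motion
        if reached_target():
            return True
        time.sleep(period_s)         # replan at a fixed cadence
    return False
```

Because the plan is recomputed from a fresh image every cycle, drift, wind gusts, and even a moving target are absorbed automatically.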

Smart Details: How to Make the Flight Smooth and Safe

The creators of SPF thought through several key features that turn the basic idea into a working system.

Adaptive Step Sizing

If the drone sees that the target is far away and the path is clear, why crawl in baby steps? SPF can adaptively change its step size. In open spaces, it takes long «leaps»; near obstacles, it makes cautious, short movements. This is implemented through a non-linear function that converts the model's depth estimate into a real flight distance.

The formula looks something like this (don't be scared):

actual_distance = base_distance × (1 + α × predicted_depth)

Where α is a coefficient that adjusts the «aggressiveness.» The result? In experiments, this nearly halved task completion time without sacrificing safety.
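In code, the scaling is a one-liner. The base distance and α below are illustrative defaults, not the paper's tuned constants:

```python
def adaptive_step(predicted_depth, base_distance=0.5, alpha=3.0):
    """Scale the commanded step with the model's depth score
    (0 = near, 1 = far), per the formula above."""
    return base_distance * (1.0 + alpha * predicted_depth)

# Near targets get short, cautious steps; far ones get long leaps:
print(adaptive_step(0.0))  # → 0.5 (meters)
print(adaptive_step(1.0))  # → 2.0 (meters)
```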

On-the-Fly Obstacle Avoidance

You don't want your drone crashing into a tree. SPF integrates a simple object detection system directly into the planning process. Before the VLM picks a point, the system checks: «Hey, does this path lead straight into a wall?» If so, the model reconsiders its options, excluding dangerous zones.

The beauty of this is that it's not a separate, heavyweight obstacle detector. Modern VLMs can detect objects themselves – this is called open-vocabulary detection. You can ask the model, «Show me all obstacles in the image», and it will find not only standard ones like «car, tree, building» but also any other objects you can name with words.
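One way to fold that obstacle check into the same VLM query is simply to ask for it in the prompt. The wording and JSON schema below are assumptions for illustration – the paper does not publish its exact prompt here:

```python
def build_prompt(instruction, obstacle_hints):
    """Compose a single prompt that asks the VLM to do open-vocabulary
    obstacle detection and waypoint selection in one pass. The obstacle
    hints are free-form words, not pre-trained detector classes."""
    obstacles = ", ".join(obstacle_hints)
    return (
        f"Instruction: {instruction}\n"
        f"First list bounding boxes of any obstacles ({obstacles}, or anything "
        "else blocking flight). Then output the target waypoint as JSON "
        '{"x": <px>, "y": <px>, "depth": "near|medium|far"}, choosing a point '
        "outside all obstacle boxes."
    )

print(build_prompt("fly to the bench near the fountain", ["tree", "lamp post"]))
```

Because the obstacle list is just text, «construction crane» or «clothesline» works as well as «car» – no retraining required.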

Dynamic Targets

What if the target is moving? For example, the task: «Follow the person in the yellow jacket.» Since SPF operates in a closed loop, constantly updating its observations, it handles this naturally. At each step, the model re-evaluates the target's position. If the person moves left, the drone corrects its trajectory. It's like playing tag: you don't calculate the entire route in advance but react to the current position of the person you're chasing.

The Tests: Simulation and Reality

The researchers tested SPF in two environments.

The Virtual World

First up was the DRL (Drone Racing League) simulator, where thousands of scenarios can be tested safely. In this virtual environment, they set up various tasks:

  • Simple Navigation: «Fly to the blue cube.»
  • Obstacle Avoidance: «Get to the flag without hitting the barriers.»
  • Long Routes: «First to the tree, then to the bench, then to the building.»
  • Reasoning Tasks: «Find the person who needs help» (the drone has to understand that a person lying on the ground likely needs it).
  • Searching for Unseen Objects: The target isn't initially in view, and the drone must explore the area.
  • Following Moving Targets.

The results? SPF achieved a success rate of 93.9%. This means it correctly completed nearly 94 out of 100 tasks without crashing and while reaching the target.

Compare that to its competitors:

  • PIVOT (the previous top VLM method): 28.7%
  • TypeFly (a system where the VLM generates text commands): 0.9%

A nearly 65 percentage point difference isn't an improvement; it's a quantum leap.

Real-World Drones

Simulations are great, but the real world is far more complex. You have unpredictable lighting, wind, communication delays, and imperfect cameras. The researchers took standard DJI Tello EDU drones (inexpensive models used by students for projects) and flew them both indoors and outdoors.

The tasks were made more challenging: different lighting conditions (bright daylight, dusk, artificial light), varying obstacle densities, and both static and moving objects.

The result: a 92.7% success rate. Almost as good as in the simulation!

For comparison:

  • PIVOT in the real world: 23.5%
  • TypeFly: a near-total failure.

The performance in complex categories was especially impressive. In tasks requiring long routes and reasoning, SPF scored above 90%, while its competitors barely broke the 20% mark.

Why It Works: The Anatomy of Success

The researchers ran a series of experiments, disabling different components of the system to understand what makes SPF so effective.

The Output Format Is Everything

They compared three options:

  1. The VLM generates text commands («forward 2m, left 1m»).
  2. The VLM chooses from 8 predefined directions (forward, back, left, etc.).
  3. SPF: The VLM outputs 2D point coordinates.

The third option won by a huge margin. Structured spatial markup proved to be leagues more accurate than any text-based approximation.

Universal Across Different VLMs

They tested SPF with various language models: Gemini, GPT-4V, Claude, and Llama. And guess what? All of them showed high success rates (from 85% to 94%). This means the approach works not because of the magic of one specific model, but because the problem is framed correctly. The SPF architecture is like a good translator that works with any language.

Adaptivity vs. Fixed Steps

An experiment with the adaptive step scaling turned off showed that without it, the drone either crawls too slowly (with fixed short steps) or risks crashing (with fixed long steps). The adaptive approach cut task completion time by a factor of 1.8 with the same success rate.

Built-in vs. Specialized Detection

Some systems use separate object detectors (like YOLO) for obstacle avoidance. SPF integrates this task into the VLM. The advantages:

  • Lower latency (no need to run two models in parallel).
  • Flexibility: you can detect any object you can describe in words, not just pre-trained categories.
  • Simpler architecture.

Limitations: When the Magic Fails

Like any research, there are nuances. The authors are honest about them.

Perception Errors

VLMs can make mistakes, especially with small or distant objects. If the target is a tiny item 30 meters away, the drone's camera might not capture enough detail for the model to get it right. This isn't a flaw in the SPF architecture but rather a limitation of current-generation VLMs and camera resolutions.

Sensitivity to Phrasing

If a command is ambiguous, the model might misinterpret it. «Fly to the tree» – which one, if there are ten? A human would understand from context, but a VLM can sometimes get lost. The solution: more precise instructions or an additional dialogue with the user for clarification.

Inference Delays

VLMs are large neural networks, and getting a response takes time. In the current implementation, SPF updates its plan every 0.5 seconds. This is fine for most tasks, but in fast-changing scenarios (like following a car), it might be too slow. However, as hardware gets faster and models become more optimized, this problem will likely solve itself over time.

Why This Matters: A Look into the Future

SPF is more than just a cool research paper. It's a proof of concept that shifts the paradigm.

The Democratization of Drones

Previously, creating an autonomous drone for a specific task required:

  • Collecting a dataset (thousands of flight hours).
  • Training a model (expensive GPUs, weeks of computation).
  • Expertise in machine learning.

With SPF, you need a standard drone with a camera, access to a VLM API, and a few lines of code. It's like the shift from writing complex programs to talking with ChatGPT: the barrier to entry has collapsed.

New Applications

Imagine:

  • A firefighter assistance drone: «Find people in the building and deliver this message to them.»
  • A drone photographer: «Follow the group of tourists and take their picture in front of the Bongeunsa Temple.»
  • A delivery drone: «Take this package to the third-floor balcony of the building with the red roof.»

These are all normal human instructions, with no route programming required.

A Philosophical Point

There's something profound in the idea of «pointing to a spot.» It's a return to our intuitive interaction with the world. We humans constantly use visual cues: we point with our fingers, a nod, our gaze. SPF brings this natural ease to the human-machine relationship.

Code is poetry, just in a different language. And in SPF's case, this poetry isn't written in lines of text, but in coordinates in space. The algorithm doesn't describe the world in words; it sees its structure. The neural network doesn't generate commands; it shows intent.

What's Next?

The researchers are already working on improvements:

  • Boosting perception accuracy with multimodal inputs (like adding depth sensors).
  • Reducing latency by distilling large VLMs into more compact models.
  • Developing active exploration strategies (where the drone decides where to fly to better understand an unknown environment).

The project is open-source, and the code is available. This means anyone can take SPF, adapt it to their own tasks, and expand its capabilities.

Ultimately, SPF is another step toward a world where technology understands us naturally. Where the barrier of programming between intent and action disappears. Where a drone isn't a complex gadget that requires a 200-page manual, but a partner that gets it in an instant.

Or, more accurately, with just a glance. 👁️✨

Original authors: Chih Yao Hu, Yang-Sen Lin, Yuna Lee, Chih-Hai Su, Jie-Ying Lee, Shr-Ruei Tsai, Chin-Yang Lin, Kuan-Wen Chen, Tsung-Wei Ke, Yu-Lun Liu