Typically, computer vision systems are built like a pipeline: one module 'looks' at the image, another processes the text, a third combines the results, and a fourth handles post-processing. Every new failure means a new module. The more complex the task, the longer the chain.
The Falcon Vision team from the Technology Innovation Institute (TII, Abu Dhabi) decided to see if all this could be replaced by a single model. The result is Falcon Perception, a unified neural network with 0.6 billion parameters that understands both image and text together from the very first layer, without being split into separate blocks.
Why Combine Vision and Language into a Single Model?
The task Falcon Perception solves sounds simple: give the model an image and a text description – for example, 'a red mug to the left of the laptop' – and get a precise object mask in return. Simply put, the model must not only find the object but also trace its outline.
In traditional systems, the image passes through one encoder and the text through another, and the two outputs are merged only afterwards, in a separate fusion step. This approach works, but it scales poorly and accumulates errors at each handoff between components.
Falcon Perception takes a different path: the image is broken down into small patches (fragments), and these patches, along with text tokens, are fed into a single, shared sequence. The transformer processes them simultaneously in a unified parameter space. This is called early fusion – the combination happens not after processing, but at the very beginning.
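A minimal sketch of how such a shared sequence could be assembled, in plain Python. The patch size, the `patchify` and `build_sequence` helpers, the 'img'/'txt' tags, and the patches-before-text ordering are illustrative assumptions, not the model's actual preprocessing.

```python
from typing import List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values


def patchify(image: Image, patch: int) -> List[List[int]]:
    """Split an image into non-overlapping patch x patch tiles (row-major), each flattened."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(tile)
    return patches


def build_sequence(image: Image, text_tokens: List[str], patch: int) -> List[Tuple[str, object]]:
    """Early fusion: image patches and text tokens go into ONE sequence (assumed order: image first)."""
    visual = [("img", p) for p in patchify(image, patch)]
    textual = [("txt", t) for t in text_tokens]
    return visual + textual


image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 toy image
sequence = build_sequence(image, ["red", "mug"], patch=2)
print([kind for kind, _ in sequence])  # ['img', 'img', 'img', 'img', 'txt', 'txt']
```

In a real model each patch and each token would be projected to an embedding vector before entering the transformer; the sketch stops at the point where both modalities share one sequence.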
How the Model 'Sees' and 'Reads' at the Same Time
Images and text are structured differently. Pixels live in a two-dimensional space, and each region is best interpreted when the model sees its context from all sides at once. Text, on the other hand, is read sequentially: each word builds on the previous one.
To account for this difference, Falcon Perception uses a hybrid attention mask. Visual tokens (image patches) interact with each other in both directions, allowing the model to see the entire picture at once. Text and task tokens work differently: each of them 'sees' everything that came before it, including the visual context, but not what comes after. This allows the same network to function simultaneously as both a visual encoder and a language model.
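The mask described above can be sketched in a few lines of Python. The function name `hybrid_attention_mask`, the layout with visual tokens first, and the choice to let each text token attend to itself are assumptions made for illustration.

```python
from typing import List


def hybrid_attention_mask(num_visual: int, num_text: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    Positions 0..num_visual-1 are image patches; the rest are text/task tokens.
    """
    total = num_visual + num_text
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_visual:
                # Visual tokens: bidirectional attention among all visual tokens.
                mask[i][j] = j < num_visual
            else:
                # Text/task tokens: causal attention over everything up to themselves,
                # which includes the whole visual prefix.
                mask[i][j] = j <= i
    return mask


mask = hybrid_attention_mask(num_visual=3, num_text=2)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
# 111..
# 111..
# 111..
# 1111.
# 11111
```

The top-left block is fully filled (bidirectional vision), while the bottom rows form the familiar lower-triangular pattern of a language model.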
Chain-of-Perception: First Where, Then What
Identifying objects in an image is a task with a variable number of answers. One picture might have no matching objects, while another could have several hundred. Generating them one token at a time is too slow, especially when it comes to detailed masks.
To address this, the developers proposed the Chain-of-Perception interface: the model describes each object in three steps – first predicting the object's center, then its size, and finally generating a special segmentation token. This token, by interacting with the image's visual features, is transformed into a full-size binary mask.
This order is not accidental: the model first determines where the object is, then how large it is, and only then does it draw the precise outline. This reduces ambiguity and makes mask prediction more stable.
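As a rough sketch of the three-step description, the example below stores a center, a size, and a segmentation-token embedding per object, then turns the embedding into a binary mask. The source only says the token 'interacts' with visual features; the dot product plus threshold used here, the `PerceivedObject` class, and `decode_mask` are assumptions chosen as one simple way to do that.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PerceivedObject:
    center: Tuple[float, float]  # step 1: where the object is (x, y)
    size: Tuple[float, float]    # step 2: how large it is (width, height)
    seg_embedding: List[float]   # step 3: embedding of the segmentation token


def decode_mask(obj: PerceivedObject,
                pixel_features: List[List[List[float]]],
                threshold: float = 0.0) -> List[List[int]]:
    """Mark a pixel 1 when the dot product of its feature with the seg embedding exceeds the threshold."""
    mask = []
    for row in pixel_features:
        mask.append([
            1 if sum(a * b for a, b in zip(obj.seg_embedding, feat)) > threshold else 0
            for feat in row
        ])
    return mask


# Toy 2x3 feature map with 2-dimensional features per pixel (invented values).
pixel_features = [
    [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
    [[1.0, -1.0], [-1.0, 0.0], [0.2, 0.1]],
]
obj = PerceivedObject(center=(0.5, 0.5), size=(1.0, 1.0), seg_embedding=[1.0, -0.5])
print(decode_mask(obj, pixel_features))  # [[1, 1, 0], [1, 0, 1]]
```

The point of the design is that the expensive part, the dense mask, costs one token in the sequence rather than one token per pixel.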
PBench: A Benchmark That Doesn't Forgive Vague Results
Existing benchmarks for such systems have long since saturated: models consistently score 90% or higher on them. Even so, it is often unclear why a model made a mistake: was it unable to read the text on an object, did it misunderstand spatial relationships, or did it simply get confused in a crowd of similar items?
The team introduced its own diagnostic benchmark, PBench. It groups tasks by the capability they require: object attributes, text recognition in images (OCR), spatial constraints, relations between objects, and dense scenes with a large number of instances. Each example tests exactly one capability, with no mixing. The result is not a single score but a profile that shows where the model is confident and where it struggles.
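To illustrate the difference between a single score and a per-capability profile, here is a small sketch. The capability labels follow the article, but the `capability_profile` helper and every score in `results` are invented toy values, not PBench data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def capability_profile(results: List[Tuple[str, float]]) -> Dict[str, float]:
    """Average per-example scores separately for each capability."""
    buckets = defaultdict(list)
    for capability, score in results:
        buckets[capability].append(score)
    return {cap: sum(scores) / len(scores) for cap, scores in buckets.items()}


# Toy results: each example is tagged with exactly one capability.
results = [
    ("attributes", 1.0),
    ("attributes", 0.5),
    ("ocr", 0.0),
    ("spatial", 1.0),
    ("ocr", 1.0),
]

profile = capability_profile(results)
overall = sum(score for _, score in results) / len(results)
print(profile)  # {'attributes': 0.75, 'ocr': 0.5, 'spatial': 1.0}
print(overall)  # 0.7
```

The overall figure hides that the toy model is weakest on OCR; the profile makes it visible.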
How It Was Trained: Three Stages and 54 Million Images
The training of Falcon Perception consists of three stages. In the first, the model learns to list objects in a scene and simultaneously specify their locations, building a general understanding of what is happening in the image. In the second, the queries become independent: the model no longer sees adjacent questions and learns to answer each one separately, as happens in practice. The third stage is a short fine-tuning process for working with very dense scenes, where a single image can contain hundreds of objects.
The training dataset includes 54 million images, 195 million positive examples, and 488 million 'hard' negative examples – cases where an object looks similar to the one requested but is not it. That works out to roughly 2.5 hard negatives for every positive, and the ratio matters: the model must be able to confidently say 'no,' not just draw masks wherever something is found.
Before the main training, the model was initialized through distillation from two 'teacher' models with different specializations: one strong in local visual features (useful for segmentation) and the other in language alignment (useful for understanding open-ended text queries).
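The article does not say how the two teachers were combined. One common pattern, shown here purely as an assumption, is to sum two feature-matching losses, one per teacher. The names `mse` and `two_teacher_loss`, the equal default weights, and the use of mean squared error are all hypothetical.

```python
from typing import List


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equally long feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def two_teacher_loss(student_local: List[float], teacher_local: List[float],
                     student_lang: List[float], teacher_lang: List[float],
                     w_local: float = 0.5, w_lang: float = 0.5) -> float:
    """Weighted sum of two distillation terms (hypothetical formulation).

    teacher_local: features from the teacher strong in local visual detail.
    teacher_lang:  features from the teacher strong in language alignment.
    """
    return (w_local * mse(student_local, teacher_local)
            + w_lang * mse(student_lang, teacher_lang))


loss = two_teacher_loss([1.0, 2.0], [1.0, 4.0], [0.0], [1.0])
print(loss)  # 1.5
```

Minimizing such a loss would pull the student toward both specializations at once before the main training begins.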
Results: The Gap Widens Where It's Difficult
On the SA-Co benchmark, which evaluates mask quality in an open vocabulary, Falcon Perception scores 68.0 Macro-F1 compared to 62.3 for SAM-3. The lead is particularly noticeable in categories with rich attributes (+8.2 points), food and beverages (+12.2), and sports equipment (+4.0).
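For readers unfamiliar with the metric: macro-averaged F1 computes F1 separately per category and then takes the plain mean, so small categories count as much as large ones. The sketch below shows the generic definition with invented counts; how SA-Co matches predicted masks to ground truth is not described here and is not modeled.

```python
from typing import Dict, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Unweighted mean of per-category F1 scores."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


# Toy (tp, fp, fn) counts per category; not benchmark data.
counts = {
    "food": (8, 2, 2),    # F1 = 16 / 20 = 0.8
    "sports": (5, 5, 0),  # F1 = 10 / 15 = 0.666...
}
print(round(macro_f1(counts), 4))  # 0.7333
```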
There is also a weak point: presence calibration – the model's ability to confidently state that the requested object is not in the picture. On the MCC metric, Falcon Perception currently lags behind SAM-3 (0.64 versus 0.82). This has been identified as the main area for improvement.
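MCC is the Matthews correlation coefficient, a score between -1 and 1 that only rewards a classifier for getting both classes right. The sketch below applies the standard formula to the presence question ('is the requested object in the image?'). The confusion-matrix counts are invented for illustration.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Standard MCC from a binary confusion matrix; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy counts: tp = object present and reported, tn = object absent and nothing returned,
# fp = object absent but a mask was produced, fn = object present but missed.
print(matthews_corrcoef(tp=40, tn=40, fp=10, fn=10))  # 0.6
```

A model that draws a mask for every query would collect many false positives on absent objects, which drags MCC down even if its masks for present objects are excellent.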
On PBench, the picture is more interesting. For simple objects, the gap between the models is small. But as soon as the queries become more complex – requiring reading text on an object, considering spatial relationships, or understanding who is interacting with whom – the advantage of early fusion becomes significant. In the most difficult category, Dense, the 0.6B-parameter Falcon Perception scores 72.6 points compared to 8.9 for Qwen3-VL-30B, a model with 50 times as many parameters.
One of the reasons for this lead in dense scenes is architectural: the autoregressive interface can generate an arbitrary number of objects, whereas systems with a fixed number of object 'slots' simply run out of slots when there are too many.
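A toy contrast between the two designs, under simplifying assumptions: the 'decoders' below just copy objects from a ground-truth list, each object is assumed to cost three tokens (center, size, segmentation token), and the function names and budgets are hypothetical.

```python
from typing import List

TOKENS_PER_OBJECT = 3  # assumed: one token each for center, size, and segmentation


def fixed_slot_decoder(scene_objects: List[str], num_slots: int) -> List[str]:
    """A decoder with a fixed number of slots can report at most num_slots objects."""
    return scene_objects[:num_slots]


def autoregressive_decoder(scene_objects: List[str], context_budget: int) -> List[str]:
    """An autoregressive decoder keeps emitting objects until its token budget is spent."""
    max_objects = context_budget // TOKENS_PER_OBJECT
    return scene_objects[:max_objects]


scene = [f"object_{i}" for i in range(250)]  # toy dense scene with 250 instances
print(len(fixed_slot_decoder(scene, num_slots=100)))           # 100
print(len(autoregressive_decoder(scene, context_budget=600)))  # 200
print(len(autoregressive_decoder(scene, context_budget=900)))  # 250
```

The slot count is baked into the architecture, while the token budget is a runtime setting: raising it is enough to cover the whole scene.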
Falcon OCR: The Same Principle, but for Documents
In parallel with Falcon Perception, the team introduced Falcon OCR, a 0.3-billion-parameter model for text recognition in documents. It uses the same early fusion architecture but was trained from scratch for OCR tasks: parsing multi-column documents, mathematical formulas, tables, handwritten text, and complex layouts.
Training it separately from scratch was a conscious choice. The features needed for character recognition (subtle differences between glyphs, letter strokes) are fundamentally different from the features useful for object segmentation. Therefore, initialization from 'vision' teachers would not have helped here.
On the olmOCR benchmark, Falcon OCR scores 80.3 points, which is within 1.7 points of the best system in the comparison. Moreover, in the 'multi-column documents' (87.1%) and 'tables' (90.3%) categories, the model takes first place. On OmniDocBench, the result is 88.64, surpassing the scores of DeepSeek OCR v2, GPT 5.2, and Mistral OCR 3.
The model's compactness (0.3B versus the typical 0.9–3B parameters of competitors) directly affects processing speed: in measurements on a single A100-80GB GPU under high load, Falcon OCR surpasses comparable open models in throughput. This makes it a practical option for mass document processing.
The Main Idea Is Not in the Architectural Details
The authors themselves state this directly, alluding to the famous 'bitter lesson' thesis in machine learning: most long-term gains come from data, computation, and the training signal – not from complicating the architecture.
Falcon Perception is intentionally designed to be minimalistic: one backbone, one family of tasks, and small, specialized heads only where outputs are continuous and dense. If you need to improve understanding, add more images with complex queries. If you need better language performance, mix in text data. To scale to dense scenes, increase the context length. The architecture does not block any of these paths.
Both releases – Falcon Perception and Falcon OCR – are open for use and research.