When we think of AI-powered robots, we usually picture something large, connected to a powerful server somewhere in the cloud. But what if a robot needs to operate autonomously – without a constant internet connection, without a powerful GPU nearby, running directly 'onboard'? This is precisely the challenge that engineers at NXP and Hugging Face tackled, detailing their results in an in-depth technical blog post.
This isn't just an abstract experiment. It's a practical guide on how to take a modern AI model for robot control, train it on your own data, and run it on a small embedded device – one that can fit inside the robot's actual chassis.
What is VLA and Why is it Needed?
To understand what this is all about, we need to break down the term. VLA stands for Vision-Language-Action. It's a type of AI model that can perceive an image from a camera, understand a text command, and, based on both, decide on a physical action – for example, where to move a robotic arm or how to pick up an object.
In practice: you tell the robot, 'pick up the red block,' it looks around with its camera, finds the block, and picks it up. The model simultaneously 'sees,' 'understands,' and 'acts' – hence the name.
Such models already exist and show impressive results in laboratory settings. The problem is that they typically require significant computational resources. Running them on a small embedded chip is another story entirely.
Collecting Data by Hand
Every AI model learns from data. For robotic systems, this means recorded examples of the robot performing tasks: camera footage, joint positions, and control commands. The more numerous and diverse the examples, the better the model understands what is required of it.
The project used a custom, hand-recorded dataset. An operator controlled a robotic manipulator, demonstrating the desired behavior, while the system recorded everything. This approach is called learning from demonstration – the model watches how a human performs the task and learns to replicate that behavior.
An important point: the data was recorded in a standardized format compatible with the Hugging Face ecosystem. This means it can be reused, shared with the community, and applied with other tools without needing additional conversion.
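To make the idea of a recorded demonstration concrete, here is a minimal sketch of what one episode might contain. The class and field names (`Frame`, `Episode`, `joint_positions`, and so on) are illustrative assumptions, not the actual schema of the Hugging Face format used in the project, and the recorder produces placeholder values instead of reading a real camera or robot.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Frame:
    """One recorded timestep. Field names are illustrative assumptions."""
    timestamp_s: float
    image: List[List[int]]        # stand-in for a camera frame (rows of pixel values)
    joint_positions: List[float]  # measured joint angles
    action: List[float]           # command the operator sent at this timestep


@dataclass
class Episode:
    """One demonstration of a task, from start to finish."""
    task: str                     # natural-language instruction for the episode
    frames: List[Frame] = field(default_factory=list)

    def duration_s(self) -> float:
        """Elapsed time between the first and last recorded frame."""
        if len(self.frames) < 2:
            return 0.0
        return self.frames[-1].timestamp_s - self.frames[0].timestamp_s


def record_dummy_episode(task: str, num_frames: int, fps: float) -> Episode:
    """Build a placeholder episode; a real recorder would read camera and robot."""
    episode = Episode(task=task)
    for i in range(num_frames):
        episode.frames.append(
            Frame(
                timestamp_s=i / fps,
                image=[[0] * 4 for _ in range(4)],
                joint_positions=[0.0] * 6,
                action=[0.0] * 6,
            )
        )
    return episode


episode = record_dummy_episode("pick up the red block", num_frames=31, fps=30.0)
print(len(episode.frames), round(episode.duration_s(), 3))
```

Keeping the text instruction, the camera image, the joint state, and the commanded action together in each record is what later lets a model learn the mapping from 'what was seen and asked' to 'what was done.'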
Fine-Tuning: When a Pre-trained Model is Just the Beginning
Training a model from scratch is expensive and time-consuming. That's why the project used an approach called fine-tuning: taking an already trained model that has general capabilities and 'sharpening' it for a specific task and a specific robot.
It's similar to how an experienced chef, who knows how to cook a wide variety of dishes, works in a specific restaurant for a few weeks to get accustomed to its menu, equipment, and presentation style. They already have the basic skills – they're just adapting.
In this case, the team started with the SmolVLM model – a compact multimodal model from Hugging Face that can work with images and text. It was fine-tuned on the custom recorded data, with a 'head' added to predict the robot's actions. The result was a model that understands natural language commands, analyzes the camera image, and outputs control signals for the manipulator.
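The idea of adding an action 'head' on top of a pre-trained model and fitting it to demonstrations can be sketched in a few lines. This is a deliberately tiny stand-in, not the project's training code: `frozen_backbone` is a hypothetical placeholder for the pre-trained model, the head is a plain linear layer, the demonstrations are synthetic, and keeping the backbone fixed is an assumption made here for simplicity.

```python
import random


def frozen_backbone(observation):
    """Stand-in for the pre-trained model: fixed features, never updated."""
    x, y = observation
    return [x, y, 1.0]  # last entry acts as a bias term


def predict(head, features):
    """Linear action head: one weight row per action dimension."""
    return [sum(w * f for w, f in zip(row, features)) for row in head]


def fine_tune_head(demos, action_dim, epochs=200, lr=0.1):
    """Fit the head to demonstrations by gradient descent on squared error."""
    feature_dim = len(frozen_backbone(demos[0][0]))
    head = [[0.0] * feature_dim for _ in range(action_dim)]
    for _ in range(epochs):
        for observation, target in demos:
            features = frozen_backbone(observation)
            output = predict(head, features)
            for a in range(action_dim):
                error = output[a] - target[a]
                for f in range(feature_dim):
                    head[a][f] -= lr * error * features[f]
    return head


# Synthetic "demonstrations": (observation, action the operator took).
rng = random.Random(0)
demos = []
for _ in range(50):
    obs = (rng.uniform(-1, 1), rng.uniform(-1, 1))
    demos.append((obs, [0.5 * obs[0] + 0.1, -0.25 * obs[1]]))

head = fine_tune_head(demos, action_dim=2)
print(predict(head, frozen_backbone((0.2, 0.4))))
```

Only the head's weights change during training; the general-purpose features come from the pre-trained part, which is the essence of the chef analogy above.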
The Hardest Part – Fitting it on a Tiny Chip
This is where things get really interesting from an engineering perspective. Even a VLA that is compact by large-model standards still puts a serious load on an embedded device. Smartphones, and specialized robotics boards even more so, have far less compute and memory than cloud servers.
To get the model running on the target platform – the NXP i.MX 95 processor – it had to be significantly optimized. The techniques used included:
- Quantization – reducing the numerical precision of the values within the model. Roughly speaking, instead of very precise numbers, rounded values are used, which reduces the model's size and speeds up calculations, typically with only a small loss of quality.
- Hardware-specific compilation – the model is converted into a format optimized specifically for the architecture of the chip being used, allowing it to perform calculations as efficiently as possible.
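The quantization step in the list above can be illustrated with a minimal sketch. It assumes symmetric 8-bit integer quantization with a single scale per tensor, which is one common scheme; the source does not say which scheme or bit width the project actually used, and the weight values here are made up.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: floats -> (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


weights = [0.82, -0.31, 0.05, -1.27, 0.64]  # made-up example weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(max_error <= scale / 2 + 1e-12)
```

Each weight is stored as a one-byte integer plus one shared scale, so a tensor originally held in 32-bit floats shrinks to roughly a quarter of its size, and the rounding error per value is bounded by half a quantization step.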
As a result, they succeeded in running the model directly on the device – without the cloud, without an external server. The robot receives a command, processes the image, and makes its decision locally.
Why 'On-Device' Matters
The question may arise: why go to all this trouble? After all, you could just send data to the cloud and get a response from there.
There are several reasons. First, latency. For robots, especially those operating in real time, even a few dozen milliseconds of delay can be critical. Local processing removes the network round trip from the control loop.
Second, reliability. A robot in a factory or out in the field doesn't always have a stable network connection. If the intelligence is right on board, a loss of connectivity doesn't halt its operation.
Third, privacy and security. Data from cameras and sensors doesn't leave for external servers – it's processed locally.
This is especially relevant for industrial robotics, autonomous vehicles, medical devices, and other fields where reliability and autonomy are not just conveniences, but requirements.
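The latency argument can be turned into simple arithmetic. The sketch below checks whether inference plus any network round trip fits inside one control period. All numbers are hypothetical placeholders chosen for illustration, not measurements from the project.

```python
def control_period_ms(rate_hz):
    """Time available per control step at a given loop rate."""
    return 1000.0 / rate_hz


def fits_budget(inference_ms, network_round_trip_ms, rate_hz):
    """True if one decision can be produced within one control period."""
    return inference_ms + network_round_trip_ms <= control_period_ms(rate_hz)


RATE_HZ = 20.0  # hypothetical control loop rate (50 ms per step)

# Hypothetical on-device case: slower inference, but no network hop.
on_device = fits_budget(inference_ms=40.0, network_round_trip_ms=0.0, rate_hz=RATE_HZ)

# Hypothetical cloud case: faster inference, plus a network round trip.
cloud = fits_budget(inference_ms=10.0, network_round_trip_ms=60.0, rate_hz=RATE_HZ)

print(on_device, cloud)
```

With these placeholder values the cloud model computes faster yet still misses the deadline, because the round trip alone exceeds the control period. The point is the shape of the calculation, not the specific figures.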
An Open Approach: You Can Replicate It
One of the notable aspects of this project is its openness. The authors didn't just share their results; they described the entire process: how the data was recorded, how the fine-tuning was performed, and what optimizations were applied and why.
The tools and data formats used are based on open standards from the Hugging Face ecosystem. This means that a team working on their own robot can use this experience as a foundation, without reinventing the wheel. Recording your own demonstrations, fine-tuning the model, optimizing for your hardware – the entire pathway is now documented.
This isn't a revolution, but it is a significant practical contribution: previously, this kind of knowledge was concentrated in the closed labs of large corporations, and now it's becoming more accessible.
Where This Can All Be Useful
Embedded AI for robots isn't just about industrial manipulators. We're talking about a wide range of devices: assistant robots, autonomous drones, maintenance systems, logistics and warehouse robots, and educational platforms.
In all these cases, there's a common requirement: the device must operate autonomously, react quickly, and not depend on a constant server connection. This is precisely what the described project demonstrates.
Of course, for now, we are talking about relatively simple tasks – grasping and moving objects in a controlled environment. We're still a long way from a fully autonomous robot capable of handling unpredictable environments. But the direction is clearly set: compact, autonomous, trained on real-world data – all on a device the size of a small board.