For the last few years, the conversation about AI has almost always been about the cloud: powerful models live on servers, requests are sent there, and answers come back. But the picture is gradually changing. More and more developers and companies want AI to work directly on the device – without sending data externally, without network round-trips, and without depending on the internet.
Google has taken a step in this direction by expanding its family of open Gemma models, and NVIDIA has joined the effort to ensure the models run efficiently across a wide range of its own hardware – from compact embedded modules to personal supercomputers.
What Is Gemma and Why Is It Needed?
Gemma is a family of open language models from Google, designed for local execution. Simply put, these are models that you can download and run on your own machine – on a work computer, a specialized device, or a powerful workstation – without connecting to the cloud.
The new additions to the lineup cover a wide range: from the very compact E2B and E4B to the heavier 26B and 31B. The numbers roughly reflect the size of the model in billions of parameters – the larger the model, the richer its capabilities tend to be, but the higher the hardware requirements.
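The link between model size and hardware requirements can be made concrete with a back-of-the-envelope estimate. The sketch below assumes the number in each model name is its parameter count in billions and counts only the memory needed for the weights; activations, context cache, and runtime overhead are ignored, and the bit widths are illustrative assumptions rather than published figures.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: the number in the model name is the parameter count in billions.
# Ignores activations, context cache, and runtime overhead.

NOMINAL_PARAMS_BILLION = {"E2B": 2, "E4B": 4, "26B": 26, "31B": 31}


def weights_gb(params_billion: float, bits_per_param: int) -> float:
    """Return the approximate weight size in gigabytes (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


if __name__ == "__main__":
    for name, size in NOMINAL_PARAMS_BILLION.items():
        full = weights_gb(size, 16)   # 16-bit weights
        small = weights_gb(size, 4)   # 4-bit quantized weights
        print(f"{name}: ~{full:.1f} GB at 16-bit, ~{small:.1f} GB at 4-bit")
```

Under these assumptions, the 31B weights alone take about 62 GB at 16-bit precision and about 15.5 GB at 4-bit, which helps explain why the larger models target workstation-class hardware.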
The models support more than just text. Gemma can work with images, video, and audio; recognize objects; process documents; and understand speech. You can mix text and images in a single prompt in any order – this is called multimodal input. Other declared features include solving complex reasoning tasks, assistance with writing and debugging code, and support for over 35 languages 'out of the box' (though pre-training was done on over 140 languages).
Small but Swift: E2B and E4B
The most compact models in the family – E2B and E4B – are designed to operate in resource-constrained environments. They are intended for so-called edge devices: small, specialized modules installed where local data processing is needed – in industrial equipment, embedded systems, and similar solutions.
The key here is complete autonomy. No internet, minimal latency, real-time operation. Such models, for example, are well-suited for on-device object recognition or voice control.
26B and 31B: For Those Who Want More
The larger models – 26B and 31B – are geared towards complex tasks: advanced reasoning, working with code, and so-called agentic scenarios. In short, agentic AI means the model doesn't just answer questions but independently plans and executes a sequence of actions: opening files, accessing tools, and launching tasks.
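The plan-and-execute pattern described here can be illustrated with a toy loop. This is only a sketch: `fake_model`, the two tools, and the file names are hypothetical stand-ins, and a real agent would replace `fake_model` with a call to a language model.

```python
# Minimal agent loop: the "model" picks a tool, the loop runs it, and the
# result is fed back until the model says it is done.
# Everything here is hypothetical: fake_model stands in for a real LLM call.
from typing import Callable


def list_files(_: str) -> str:
    """Pretend to list the files in a working directory."""
    return "report.txt, notes.md"


def read_file(name: str) -> str:
    """Pretend to read a file; returns a placeholder string."""
    return f"<contents of {name}>"


TOOLS: dict[str, Callable[[str], str]] = {
    "list_files": list_files,
    "read_file": read_file,
}


def fake_model(history: list[str]) -> tuple[str, str]:
    """Scripted planner: returns (action, argument) based on the steps so far."""
    step = len(history)
    if step == 0:
        return ("list_files", "")
    if step == 1:
        return ("read_file", "report.txt")
    return ("finish", "Summary based on report.txt")


def run_agent(max_steps: int = 5) -> tuple[str, list[str]]:
    """Run the loop until the model finishes or the step limit is reached."""
    history: list[str] = []
    for _ in range(max_steps):
        action, arg = fake_model(history)
        if action == "finish":
            return arg, history
        result = TOOLS[action](arg)
        history.append(f"{action}({arg!r}) -> {result}")
    return "stopped: step limit reached", history


if __name__ == "__main__":
    answer, trace = run_agent()
    for line in trace:
        print(line)
    print(answer)
```

The step limit is a deliberate design choice: an agent that acts on its own needs a hard stop in case the model never decides it is finished.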
These models are optimized to run on NVIDIA RTX GPUs – the same ones found in gaming and professional PCs – as well as on the DGX Station. The DGX Station is a personal computer from NVIDIA, marketed as a 'personal supercomputer for AI.' By the standards of home and office hardware, this is a very powerful machine designed specifically for such tasks.
Agentic AI on Your Desktop
The new models' compatibility with the OpenCLAW platform deserves special attention. OpenCLAW is an application for building local AI assistants that run continuously in the background. Such an assistant can read your files, monitor open applications, and automate routine tasks – all locally, without sending data to the cloud.
Simply put, imagine an assistant that knows what project you are currently working on, sees your documents, and can carry out your requests without needing extra explanation. This is precisely the scenario for which the 26B and 31B models are designed, paired with OpenCLAW on RTX computers and the DGX Station.
Why NVIDIA and How It Works in Practice
NVIDIA didn't just 'allow' Gemma to run on its GPUs – the company actively participated in optimizing the models. The result: Gemma runs efficiently across the entire range of NVIDIA hardware, from the compact Jetson Orin Nano embedded modules to RTX GPUs in standard PCs and the DGX Station.
For those who want to try the models themselves, several local deployment options are available, including tools such as Ollama and llama.cpp. Unsloth, in turn, offers optimized, 'lightweight' versions of the models, as well as the ability to fine-tune them for specific tasks through its own Unsloth Studio interface.
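As an illustration of what local deployment looks like in practice, the sketch below sends a prompt to an Ollama server running on the same machine. It assumes Ollama is listening on its default port 11434 and that a model has already been pulled; the tag `gemma-example` used in the usage note is a placeholder, not a real model name, so check the Ollama library for the actual tag.

```python
# Sketch: query a locally running Ollama server over its HTTP API.
# Assumptions: Ollama listens on its default port (11434) and the requested
# model has already been pulled onto this machine.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the server running, a call such as `generate("gemma-example", "Explain edge devices in one sentence.")` would return the model's reply as a string. The request never leaves localhost, which is the point of running the model locally.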
What Else Is Happening in the Ecosystem
In parallel with the release of Gemma, a series of related updates has appeared in the NVIDIA ecosystem. NVIDIA introduced NemoCLAW, an open-source software stack that improves OpenCLAW on NVIDIA devices by strengthening security and expanding support for local models.
The company Accomplish.ai announced a free version of its desktop AI agent, Accomplish FREE. It uses open models, runs them locally on RTX GPUs, and dynamically redistributes the workload between local hardware and the cloud as needed. All this requires no additional configuration or API keys.
Other models that have received optimization for local agents on RTX devices include NVIDIA Nemotron 3 Nano 4B, Nemotron 3 Super 120B, as well as the Qwen 3.5 and Mistral Small 4 models.
Where Local AI Is Headed
What is happening now is a gradual shift in the center of gravity. AI is ceasing to be an exclusively cloud-based story and is beginning to live on users' devices. This changes a lot: it becomes possible to work with personal data without transferring it to third parties, reliance on a stable internet connection is reduced, and task execution latency decreases.
Gemma, paired with NVIDIA hardware, is one of the most concrete examples of how this idea is being put into practice right now. Open models available for local execution on consumer hardware are no longer a concept of the future, but a working tool that can be tried today.
However, the question of the real barrier to entry remains open. The 26B and 31B models, despite optimization, still require quite powerful hardware. For now, they are more of a tool for developers and tech-savvy users than something the general public will run daily on an average laptop. But compact options like E2B and E4B show that the industry is actively working to lower this barrier.