Most discussions about AI focus on cloud services: a model resides in a data center, you send a request, and you get a response. However, a parallel effort has been under way for several years: running neural networks directly on a phone, laptop, or small computer, with no internet connection and no third-party servers. With the release of the Gemma 4 family of models, Google has taken a significant step in this direction.
What's in the Release
Gemma 4 isn't a single model but a family of four variants aimed at different tasks and devices. Two of them, E2B and E4B, are designed specifically for smartphones: they are compact enough to run autonomously, without a network connection. The other two – larger models with 26 and 31 billion parameters – target PCs and laptops but also run locally, without the cloud.
In short: for the first time, the Gemma lineup includes models that can actually fit on a standard phone and can do more than just answer text-based questions.
What These Models Can Do
All four Gemma 4 variants are multimodal – they understand not only text but also images and video. The compact versions (E2B and E4B) go even further: they also process audio. Simply put, such a model can listen, watch, and read – all directly on the device, without sending data anywhere.
This opens up some very specific use cases: offline speech recognition, photo analysis without uploading to the cloud, and an assistant that works even without an internet connection. For those who value data privacy or simply lack a stable connection, this is a significant advantage.
It's also worth noting that Gemma 4 was designed from the ground up for so-called “agentic scenarios”: cases where the model doesn't just answer a question but carries out a sequence of actions – for example, finding information, processing it, and producing a structured result. To that end, the model has native support for calling external functions and for returning output in a structured format.
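To make the function-calling idea concrete, here is a minimal sketch of what the host application's side of such a loop might look like. It assumes the model emits a tool call as a JSON object of the form {"name": ..., "arguments": {...}}; that shape, the get_weather tool, and the dispatch helper are all hypothetical and are not taken from Gemma 4's documentation. The model's output is simulated with a fixed string so the example runs offline.

```python
import json

# Hypothetical tool registry: names and signatures are illustrative only.
def get_weather(city: str) -> dict:
    """Stand-in for a real lookup; returns canned data so the sketch runs offline."""
    return {"city": city, "temp_c": 21}

TOOLS = {"get_weather": get_weather}

def dispatch(model_output: str) -> dict:
    """Parse a JSON tool call emitted by a model and run the matching function.

    The {"name": ..., "arguments": {...}} shape is an assumption made for
    this sketch, not Gemma 4's documented format.
    """
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        # Return a structured error instead of raising, so the caller
        # can feed it back to the model.
        return {"error": f"unknown tool: {name}"}
    result = TOOLS[name](**call.get("arguments", {}))
    return {"tool": name, "result": result}

# Simulated model output; a real model would generate this string.
simulated = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
print(json.dumps(dispatch(simulated)))
```

In an actual agent loop, the structured result would be passed back to the model as context for its next step. The point of structured output is that the application can parse it without guessing.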
Size Matters – But Not Always the One in the Name
One of the more interesting aspects of Gemma 4 is the structure of its 26-billion-parameter model, which uses what's known as a “Mixture of Experts” (MoE) architecture. It sounds complicated, but the idea is simple: the model is large, yet only a small portion of it – about 4 billion of the 26 billion parameters – is active at any one time. In MoE designs this selection is typically made anew for each token the model processes, not once per request. It's like having a team of 26 specialists, but for any given task, only the four who are needed at that moment step up.
Thanks to this, the model runs faster and requires fewer resources than one might expect from its full size.
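The "only a few specialists step up" idea can be sketched as top-k routing: a small router scores every expert for the current token, and only the k highest-scoring experts are evaluated. The sketch below is a toy in pure Python. The expert count (8), the number of active experts (2), the vector size, and the scaling "experts" are illustrative assumptions and say nothing about how Gemma 4 is actually configured.

```python
import math
import random

def top_k_route(scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected scores only, so they sum to 1.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(scores[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_weights, k):
    """Run one token through a toy MoE layer; only k experts are evaluated."""
    # Router: one linear score per expert.
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    routed = top_k_route(scores, k)
    out = [0.0] * len(x)
    for idx, weight in routed:
        y = experts[idx](x)  # the other experts are never called
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, [idx for idx, _ in routed]

# Toy setup: 8 experts, 2 active per token (illustrative sizes, not Gemma 4's).
random.seed(0)
DIM, NUM_EXPERTS, TOP_K = 4, 8, 2

def make_expert(scale):
    return lambda x: [scale * xi for xi in x]  # stand-in for a feed-forward block

experts = [make_expert(s + 1) for s in range(NUM_EXPERTS)]
router_weights = [[random.uniform(-1, 1) for _ in range(DIM)]
                  for _ in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
output, used = moe_forward(token, experts, router_weights, TOP_K)
print(f"experts used: {sorted(used)} of {NUM_EXPERTS}")
```

Because the unselected experts are skipped entirely, the compute per token scales with the active experts, not with the total number.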
The larger 31B model is structured differently – all its parameters are active simultaneously – but it achieves higher scores on independent benchmarks. According to the Arena AI Text leaderboard, it ranked third among open models, trailing only larger competitors.
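For a rough sense of scale, the arithmetic below compares the two larger models using only the parameter counts quoted above. The bit widths (16, 8, and 4 bits per weight) are assumptions chosen for illustration, not published Gemma 4 formats, and the estimate covers weights only, ignoring activations and cache. Note that in a typical MoE deployment all experts still have to be stored, so the savings show up mainly in compute per token, not in storage.

```python
def weights_gib(params_billion, bits_per_weight):
    """Approximate memory for weights only, in GiB (ignores activations and cache)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

MODELS = {
    "26B MoE": {"total_b": 26, "active_b": 4},     # figures quoted in the article
    "31B dense": {"total_b": 31, "active_b": 31},  # dense: every parameter is active
}

for name, m in MODELS.items():
    share = m["active_b"] / m["total_b"]
    # Bit widths are hypothetical quantization levels, for illustration only.
    sizes = ", ".join(
        f"{bits}-bit: {weights_gib(m['total_b'], bits):.1f} GiB"
        for bits in (16, 8, 4)
    )
    print(f"{name}: {share:.0%} of parameters active; weights at {sizes}")
```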
Why This Is More Than Just Another Release
Gemma is an open family of models: the weights are published under the Apache 2.0 license, which allows for virtually unrestricted commercial use. This is important because most powerful models at this level are either closed-source or have restrictions on their use in products.
The compact E2B and E4B versions were developed in collaboration with Qualcomm and MediaTek – the manufacturers of processors found in most modern Android smartphones. This means the models are optimized for real-world hardware, not just theoretically capable of fitting into the required memory space.
Since the release of the first Gemma generation, models in this family have been downloaded over 400 million times, and the community has created over 100,000 modifications based on them. Gemma 4 builds on that accumulated experience: what worked, what was missing, and which use cases proved to be in demand.
What Remains Behind the Scenes
Despite the appeal of the “AI right on your phone” idea, it's worth keeping a few things in mind.
First, compact models are always a compromise. E2B and E4B are great for basic tasks, but you shouldn't expect the same level of reasoning from them as from the 31B version. Google itself admits that on certain benchmarks, the smallest model underperforms the previous 27-billion-parameter Gemma 3.
Second, the technical documentation was not fully published at the time of release, so independent verification of the models' capabilities is still to come and should not be treated as an established fact.
Third, the on-device AI market itself is still taking shape. There are competing solutions – such as Qwen 3, which the larger Gemma 4 models are compared against – and it's too early to say that one approach has definitively won out over another.
Nevertheless, the direction is clear: powerful language models are becoming smaller, cheaper to run, and closer to the end device. Gemma 4 is one of the most compelling arguments that this path is now very much a reality.