If you've been following the open-source AI model market, the pattern over the past few years has been consistent: major companies release powerful models, but the most capable ones need expensive hardware and, in practice, run mostly in the cloud. With its new Gemma 4 lineup, Google is attempting to shift this balance – and judging by the initial results, it is making headway.
What Exactly Happened?
On April 2, Google DeepMind unveiled Gemma 4, the fourth generation of its open-source language model series. This isn't just one model, but a family of four variants designed for different tasks and devices. All are released under the Apache 2.0 license, meaning they can be freely used in commercial projects with minimal restrictions.
Gemma 4 is built on the same research and technology as Gemini 3, Google's flagship closed-source model. Simply put, the open-source version has incorporated the advancements of its proprietary counterpart.
Four Sizes for Different Tasks
The family is divided into four models:
- E2B – The most compact, with about 2.3 billion active parameters. It runs on a smartphone or a single-board computer and supports audio input.
- E4B – Slightly larger, with about 4.5 billion active parameters. Also designed to run on-device, including on Android phones.
- 26B MoE – A model with a “Mixture of Experts” architecture: of its 26 billion total parameters, only about 4 billion are active at any one time. This conserves compute without a significant loss in quality.
- 31B Dense – The family's flagship, with 31 billion parameters, all active simultaneously. It ranks third among open models on the international Arena AI Text leaderboard.
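The “Mixture of Experts” idea in the 26B entry can be illustrated with a toy sketch. This is a generic top-k router written for illustration, not Gemma 4's actual routing code; the expert functions, router weights, and `top_k` value are all made up. The point is only that the experts left out of the top k are never evaluated.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_weights, experts, top_k=2):
    """Route a scalar input x to the top_k highest-scoring experts only."""
    # Router: one score per expert (a simple linear score, for illustration).
    scores = [w * x for w in router_weights]
    # Indices of the top_k experts; all other experts are skipped entirely.
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Renormalise the gate weights over the chosen experts.
    gates = softmax([scores[i] for i in chosen])
    output = sum(g * experts[i](x) for g, i in zip(gates, chosen))
    return output, chosen

# Eight toy "experts", each a different scalar function (hypothetical).
experts = [lambda x, k=k: (k + 1) * x for k in range(8)]
router_weights = [0.1, -0.3, 0.8, 0.2, -0.5, 0.9, 0.0, 0.4]

output, chosen = moe_forward(2.0, router_weights, experts, top_k=2)
toy_active_share = len(chosen) / len(experts)   # 2 of 8 experts ran

# The article's figures: about 4 billion of 26 billion parameters in use.
gemma_active_share = 4 / 26

print(f"toy model: experts {chosen} ran, {toy_active_share:.0%} of the total")
print(f"26B MoE: about {gemma_active_share:.0%} of parameters in use")
```

In the toy run only two of eight experts do any work, which mirrors the compute saving described in the list above, where roughly 15% of the parameters are in use at once.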
Running the two larger models requires a powerful GPU, such as an Nvidia H100. The compact E2B and E4B models were developed in partnership with Qualcomm and MediaTek and are specifically optimized for mobile processors, allowing them to use memory and power efficiently.
Not Just Text: Audio, Images, and Video
All four models can work not only with text but also with images and video. The compact E2B and E4B models also support audio input, which opens up the possibility of on-device speech recognition without sending data to a server.
An important technical detail here is that the models can process images with variable aspect ratios and flexibly adjust how much “attention” to devote to an image. This allows them to strike a balance between speed and quality depending on the task – for instance, quickly processing a low-resolution image or meticulously analyzing a detailed one.
What Is This Actually Useful For?
Gemma 4 was designed from the ground up for agentic scenarios – situations where the AI doesn't just answer a question but independently executes a sequence of actions, such as calling tools, retrieving data, and making decisions. To facilitate this, the models natively support structured output and external function calls.
In short, this is more than just a chatbot. It's a foundation for building autonomous assistants that can, for example, independently gather information from various sources and present a formatted result – all without requiring constant human intervention at every step.
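To make the function-calling idea concrete, here is a minimal sketch of the dispatch step in an agent loop. It assumes the model emits a JSON object with `name` and `arguments` fields; that shape, the `get_weather` tool, and the canned model output are all hypothetical and stand in for whatever format a real Gemma 4 runtime uses. No model is actually called.

```python
import json

def get_weather(city: str) -> dict:
    """Hypothetical tool: returns canned data instead of calling a real API."""
    return {"city": city, "temp_c": 18}

# Registry of tools the agent is allowed to call.
TOOLS = {"get_weather": get_weather}

def dispatch(model_output: str) -> dict:
    """Parse a JSON tool call emitted by the model and run the matching tool."""
    call = json.loads(model_output)      # structured output: must be valid JSON
    name = call["name"]
    args = call.get("arguments", {})
    if name not in TOOLS:                # never run a tool that is not registered
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**args)

# Stand-in for what a model might emit (assumed format).
fake_model_output = '{"name": "get_weather", "arguments": {"city": "Zurich"}}'
result = dispatch(fake_model_output)
print(result)
```

In a full agent, the tool result would be fed back to the model as a new message so it can decide on the next step; the sketch stops at a single call.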
Additionally, the models show significant progress in mathematical reasoning and precise instruction following. They support over 140 languages, and the context window is up to 128,000 tokens for the compact versions and up to 256,000 for the larger ones. For context, at a rough 0.75 English words per token, 128,000 tokens comes to about 96,000 words, roughly the length of one full-length novel.
Why “On-Device” Matters
Most powerful AI models operate in the cloud: a request is sent to a server, processed, and a response is returned. While convenient, this creates a dependency on an internet connection, adds latency, and raises privacy concerns, as data leaves the user's device.
Models that run locally – directly on a smartphone or laptop – are free from these issues. They work offline, respond quickly, and don't transmit any data externally. This is precisely why the compact Gemma 4 variants appeal not only to enthusiasts but also to corporate developers who require control over their data.
Even the larger models in the family, for all their power, can fit on a single GPU. This sets them apart from some competitors that need multi-GPU clusters.
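A back-of-the-envelope calculation shows why a single GPU is plausible. The sketch below counts weight storage only, assuming 2 bytes per parameter for bf16, 1 for int8, and 0.5 for 4-bit formats; it ignores the KV cache, activations, and runtime overhead, so real requirements are higher. The function name and the choice of precisions are illustrative.

```python
# Bytes needed to store one parameter at each precision (weights only).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billion: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * BYTES_PER_PARAM[dtype]

# Parameter counts as given in the article.
for name, size in [("26B MoE", 26), ("31B Dense", 31)]:
    for dtype in BYTES_PER_PARAM:
        print(f"{name} {dtype}: {weight_memory_gb(size, dtype):.1f} GB")
```

At bf16 the 31B model's weights come to about 62 GB, which fits within an 80 GB accelerator with some room to spare. Note that for a Mixture-of-Experts model, all 26 billion parameters still have to be held in memory even though only a fraction is active at a time; the saving is in compute, not in storage.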
Context: The Ecosystem Is Already Huge
Since the release of the first-generation Gemma, developers have downloaded the models in the family over 400 million times and created more than 100,000 custom modifications based on them. This indicates that Gemma is not just a technological showcase but a tool actively used by a large community.
According to researchers at Google DeepMind, the team deliberately focused on maximizing “intelligence per parameter” – in other words, achieving the smartest possible model at the minimum size. Judging by its position on independent leaderboards, they succeeded: the flagship 31B model competes with models up to 20 times its size.
Architecturally, Gemma 4 is intentionally designed to be compatible with the broadest possible range of platforms and tools, which simplifies integration and lowers the barrier to entry for developers. The models also hold up well under quantization, a “compression” process that stores weights at lower numerical precision so that they can run on even more modest hardware with minimal loss of quality.
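As a rough illustration of what quantization does, here is a minimal sketch of symmetric absmax int8 quantization on a handful of made-up weights. This is a textbook scheme chosen for clarity, not the method used for any official Gemma 4 checkpoint, and the weight values are invented.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    # Fall back to 1.0 if all weights are zero, to avoid dividing by zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.97, -0.04, 0.33]   # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q)
print(f"scale={scale:.5f}, max error={max_error:.5f}")
```

Each weight now needs one byte instead of two or four, at the cost of a small rounding error bounded by half the scale. Production schemes typically use per-channel or per-block scales to keep that error lower.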
Overall, Gemma 4 is Google's attempt to provide developers with a serious tool that requires neither expensive infrastructure nor gated access. Whether it succeeds in the long run will become clear over time, but the initial signs are compelling.