If you've been following the development of open language models, recent months have made one thing clear: the line between what's available only in the cloud of major companies and what can be run locally is becoming increasingly blurred. The new release from Google DeepMind confirms this.
The Gemma 4 family of models is now available to a wide audience. The models are distributed under the Apache 2.0 license, which means they can be freely used, modified, and integrated into personal and commercial projects.
Not Just Text: Images, Video, and Audio in a Single Model
Gemma 4 consists of multimodal models. Simply put, they can process not only text but also images, video, and audio. The models always generate text as output, but what they can take as input has expanded significantly.
All model variants in the family support images and text. The smaller models – E2B and E4B – also process audio. Video is supported by all sizes, although the larger versions do not process the audio track from videos.
In practice, this means the model can, for example, describe the content of a photo, answer questions about an audio recording, recognize objects in an image and return their coordinates, transcribe speech, or write HTML code from a webpage screenshot. During testing with pre-release versions, researchers were able to achieve good results without any additional model tuning – something that is difficult to replicate in itself.
Four Sizes – From “On Your Phone” to “A Serious Server”
The Gemma 4 family includes four variants: E2B, E4B, 26B/A4B (a sparse architecture model where about 4 billion parameters are active at any given time), and 31B (a dense model). All variants are released in both a base version and a version fine-tuned for dialogue.
The two smaller variants are designed to run directly on-device – on a smartphone, laptop, or other local hardware. The two larger ones are intended for server infrastructure or cloud computing.
As for quality, the 31B model achieved a calculated score of 1452 on the LMArena text benchmark, and the 26B/A4B scored 1441. For comparison, this is on par with models like GLM-5 or Kimi K2.5, but with significantly fewer parameters. The size-to-performance ratio for Gemma 4 looks very compelling.
How It Works – A Brief Look at the Architecture
You don't need to dive into the details to use the model. But if you're curious about what makes it so efficient, here are the key ideas.
The model combines two types of attention mechanisms: local (analyzing the immediate context) and global (covering the entire text). This allows it to work efficiently with long texts without wasting excess computational resources.
One interesting feature is the so-called Per-Layer Embeddings (PLE). In standard models, each token (a conventional unit of text) receives a single numerical representation at the input, which is then used at all processing levels. PLE adds a small additional signal for each layer separately – it's as if the model receives refined information about the token exactly when needed, rather than all at once at the beginning. This adds minimal overhead to memory.
Another optimization is the Shared KV Cache. The last few layers of the model do not compute their own intermediate states but instead reuse previously calculated ones. This reduces memory consumption and speeds up generation, especially when working with long texts. The impact on quality is minimal.
Run Anywhere: From Browsers to Apple Silicon
From day one, Gemma 4 is supported by a wide range of tools for running models. This is important: a new model often appears before developers' favorite tools can support it, which creates friction. The situation is different here.
The model works with transformers, llama.cpp (including compatibility with LM Studio, Jan, and local agents), MLX on Apple Silicon devices, mistral.rs (a Rust implementation), and directly in the browser via WebGPU. ONNX checkpoints are also available for running on edge devices.
For those who want to connect the model to a local assistant agent, Gemma 4 is compatible with openclaw, hermes, pi, and open code – all via a local server based on llama.cpp.
Fine-Tuning: From a Driving Simulator to Your Own Scenario
Gemma 4 supports fine-tuning – that is, tailoring the model for a specific task. This is available through TRL, and as part of the release, TRL has been updated: during training, the model can now receive images back from tools, not just text.
As a demonstration, a training scenario was prepared where Gemma 4 learns to drive a car in the CARLA simulator: the model sees the road through a camera, makes decisions, and learns from the results. After training, the model consistently avoids pedestrians. The same principle applies to robotics, browser control, and other interactive scenarios.
Fine-tuning is also available through the Vertex AI cloud platform, with an example of extending function calling capabilities with fixed visual and audio modules. For those who prefer a graphical interface, Unsloth Studio is supported – either locally or via Google Colab.
What This Means for Those Working with AI
Gemma 4 is not an experimental prototype or a demo. It's a full-fledged family of models that can be used right now: run locally, fine-tune for specific tasks, and integrate into agent systems.
The open license resolves typical questions about usage restrictions. Out-of-the-box support for multimodality – images, audio, and video – expands the range of tasks without needing to combine several separate models. And on-device availability means it's applicable not just where a GPU server is present.
Many open questions remain: the training data and recipe have not been disclosed, and its performance on specialized domains has yet to be tested by the community. But Gemma 4 is off to a convincing start.