Published on April 3, 2026

Gemma 4: Google DeepMind's Multimodal AI Models for On-Device Use

Gemma 4: Google DeepMind's Multimodal AI That Runs Directly On-Device

Google DeepMind has released Gemma 4 – an open family of multimodal models that process text, images, video, and audio directly on-device.

Products / Technical context 5 – 7 minutes min read
Event Source: Hugging Face 5 – 7 minutes min read

If you've been following the development of open language models, recent months have made one thing clear: the line between what's available only in the cloud of major companies and what can be run locally is becoming increasingly blurred. The new release from Google DeepMind confirms this.

The Gemma 4 family of models is now available to a wide audience. The models are distributed under the Apache 2.0 license, which means they can be freely used, modified, and integrated into personal and commercial projects.

Gemma 4: Multimodal AI for Images, Video, and Audio

Not Just Text: Images, Video, and Audio in a Single Model

Gemma 4 consists of multimodal models. Simply put, they can process not only text but also images, video, and audio. The models always generate text as output, but what they can take as input has expanded significantly.

All model variants in the family support images and text. The smaller models – E2B and E4B – also process audio. Video is supported by all sizes, although the larger versions do not process the audio track from videos.

In practice, this means the model can, for example, describe the content of a photo, answer questions about an audio recording, recognize objects in an image and return their coordinates, transcribe speech, or write HTML code from a webpage screenshot. During testing with pre-release versions, researchers were able to achieve good results without any additional model tuning – something that is difficult to replicate in itself.

Gemma 4 Models: Sizes from On-Device to Server Deployment

Four Sizes – From “On Your Phone” to “A Serious Server”

The Gemma 4 family includes four variants: E2B, E4B, 26B/A4B (a sparse architecture model where about 4 billion parameters are active at any given time), and 31B (a dense model). All variants are released in both a base version and a version fine-tuned for dialogue.

The two smaller variants are designed to run directly on-device – on a smartphone, laptop, or other local hardware. The two larger ones are intended for server infrastructure or cloud computing.

As for quality, the 31B model achieved a calculated score of 1452 on the LMArena text benchmark, and the 26B/A4B scored 1441. For comparison, this is on par with models like GLM-5 or Kimi K2.5, but with significantly fewer parameters. The size-to-performance ratio for Gemma 4 looks very compelling.

Gemma 4 Architecture: Understanding Its Efficiency

How It Works – A Brief Look at the Architecture

You don't need to dive into the details to use the model. But if you're curious about what makes it so efficient, here are the key ideas.

The model combines two types of attention mechanisms: local (analyzing the immediate context) and global (covering the entire text). This allows it to work efficiently with long texts without wasting excess computational resources.

One interesting feature is the so-called Per-Layer Embeddings (PLE). In standard models, each token (a conventional unit of text) receives a single numerical representation at the input, which is then used at all processing levels. PLE adds a small additional signal for each layer separately – it's as if the model receives refined information about the token exactly when needed, rather than all at once at the beginning. This adds minimal overhead to memory.

Another optimization is the Shared KV Cache. The last few layers of the model do not compute their own intermediate states but instead reuse previously calculated ones. This reduces memory consumption and speeds up generation, especially when working with long texts. The impact on quality is minimal.

Running Gemma 4: Tool Support and Compatibility

Run Anywhere: From Browsers to Apple Silicon

From day one, Gemma 4 is supported by a wide range of tools for running models. This is important: a new model often appears before developers' favorite tools can support it, which creates friction. The situation is different here.

The model works with transformers, llama.cpp (including compatibility with LM Studio, Jan, and local agents), MLX on Apple Silicon devices, mistral.rs (a Rust implementation), and directly in the browser via WebGPU. ONNX checkpoints are also available for running on edge devices.

For those who want to connect the model to a local assistant agent, Gemma 4 is compatible with openclaw, hermes, pi, and open code – all via a local server based on llama.cpp.

Fine-Tuning Gemma 4: Customization for Specific Tasks

Fine-Tuning: From a Driving Simulator to Your Own Scenario

Gemma 4 supports fine-tuning – that is, tailoring the model for a specific task. This is available through TRL, and as part of the release, TRL has been updated: during training, the model can now receive images back from tools, not just text.

As a demonstration, a training scenario was prepared where Gemma 4 learns to drive a car in the CARLA simulator: the model sees the road through a camera, makes decisions, and learns from the results. After training, the model consistently avoids pedestrians. The same principle applies to robotics, browser control, and other interactive scenarios.

Fine-tuning is also available through the Vertex AI cloud platform, with an example of extending function calling capabilities with fixed visual and audio modules. For those who prefer a graphical interface, Unsloth Studio is supported – either locally or via Google Colab.

Gemma 4 Significance for AI Developers

What This Means for Those Working with AI

Gemma 4 is not an experimental prototype or a demo. It's a full-fledged family of models that can be used right now: run locally, fine-tune for specific tasks, and integrate into agent systems.

The open license resolves typical questions about usage restrictions. Out-of-the-box support for multimodality – images, audio, and video – expands the range of tasks without needing to combine several separate models. And on-device availability means it's applicable not just where a GPU server is present.

Many open questions remain: the training data and recipe have not been disclosed, and its performance on specialized domains has yet to be tested by the community. But Gemma 4 is off to a convincing start.

Original Title: Welcome Gemma 4: Frontier multimodal intelligence on device
Publication Date: Apr 2, 2026
Hugging Face huggingface.co A U.S.-based open platform and company for hosting, training, and sharing AI models.
Previous Article Qwen3.6-Plus: Alibaba's New Model on the Path to True AI Agents Next Article Google Vids: Free AI Video and Music Generation – What's New in the Editor

Related Publications

You May Also Like

Explore Other Events

Events are only part of the bigger picture. These materials help you see more broadly: the context, the consequences, and the ideas behind the news.

Alibaba has introduced Qwen3.5, the first model in the Qwen3 family, adept at processing text, images, and audio natively, without needing additional adapters.

Alibaba Cloudwww.alibabacloud.com Feb 17, 2026

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1.
Claude Sonnet 4.6 Anthropic Analyzing the Original Publication and Writing the Text The neural network studies the original material and generates a coherent text

1. Analyzing the Original Publication and Writing the Text

The neural network studies the original material and generates a coherent text

Claude Sonnet 4.6 Anthropic
2.
Gemini 2.5 Pro Google DeepMind step.translate-en.title

2. step.translate-en.title

Gemini 2.5 Pro Google DeepMind
3.
Gemini 2.5 Flash Google DeepMind Text Review and Editing Correction of errors, inaccuracies, and ambiguous phrasing

3. Text Review and Editing

Correction of errors, inaccuracies, and ambiguous phrasing

Gemini 2.5 Flash Google DeepMind
4.
DeepSeek-V3.2 DeepSeek Preparing the Illustration Description Generating a textual prompt for the visual model

4. Preparing the Illustration Description

Generating a textual prompt for the visual model

DeepSeek-V3.2 DeepSeek
5.
FLUX.2 Pro Black Forest Labs Creating the Illustration Generating an image based on the prepared prompt

5. Creating the Illustration

Generating an image based on the prepared prompt

FLUX.2 Pro Black Forest Labs

Want to dive deeper into the world
of neuro-creativity?

Be the first to learn about new books, articles, and AI experiments
on our Telegram channel!

Subscribe