Published on April 3, 2026

Gemma 4: Google DeepMind's Multimodal AI Models for On-Device Use

Gemma 4: Google DeepMind's Multimodal AI That Runs Directly On-Device

Google DeepMind has released Gemma 4 – an open family of multimodal models that process text, images, video, and audio directly on-device.

Products / Technical context 5 – 7 minutes min read

Event Source: Hugging Face 5 – 7 minutes min read

If you've been following the development of open language models, recent months have made one thing clear: the line between what's available only in the cloud of major companies and what can be run locally is becoming increasingly blurred. The new release from Google DeepMind confirms this.

The Gemma 4 family of models is now available to a wide audience. The models are distributed under the Apache 2.0 license, which means they can be freely used, modified, and integrated into personal and commercial projects.

Gemma 4: Multimodal AI for Images, Video, and Audio

Not Just Text: Images, Video, and Audio in a Single Model

Gemma 4 consists of multimodal models. Simply put, they can process not only text but also images, video, and audio. The models always generate text as output, but what they can take as input has expanded significantly.

All model variants in the family support images and text. The smaller models – E2B and E4B – also process audio. Video is supported by all sizes, although the larger versions do not process the audio track from videos.

In practice, this means the model can, for example, describe the content of a photo, answer questions about an audio recording, recognize objects in an image and return their coordinates, transcribe speech, or write HTML code from a webpage screenshot. During testing with pre-release versions, researchers were able to achieve good results without any additional model tuning – something that is difficult to replicate in itself.

Gemma 4 Models: Sizes from On-Device to Server Deployment

Four Sizes – From “On Your Phone” to “A Serious Server”

The Gemma 4 family includes four variants: E2B, E4B, 26B/A4B (a sparse architecture model where about 4 billion parameters are active at any given time), and 31B (a dense model). All variants are released in both a base version and a version fine-tuned for dialogue.

The two smaller variants are designed to run directly on-device – on a smartphone, laptop, or other local hardware. The two larger ones are intended for server infrastructure or cloud computing.

As for quality, the 31B model achieved a calculated score of 1452 on the LMArena text benchmark, and the 26B/A4B scored 1441. For comparison, this is on par with models like GLM-5 or Kimi K2.5, but with significantly fewer parameters. The size-to-performance ratio for Gemma 4 looks very compelling.

Gemma 4 Architecture: Understanding Its Efficiency

How It Works – A Brief Look at the Architecture

You don't need to dive into the details to use the model. But if you're curious about what makes it so efficient, here are the key ideas.

The model combines two types of attention mechanisms: local (analyzing the immediate context) and global (covering the entire text). This allows it to work efficiently with long texts without wasting excess computational resources.

One interesting feature is the so-called Per-Layer Embeddings (PLE). In standard models, each token (a conventional unit of text) receives a single numerical representation at the input, which is then used at all processing levels. PLE adds a small additional signal for each layer separately – it's as if the model receives refined information about the token exactly when needed, rather than all at once at the beginning. This adds minimal overhead to memory.

Another optimization is the Shared KV Cache. The last few layers of the model do not compute their own intermediate states but instead reuse previously calculated ones. This reduces memory consumption and speeds up generation, especially when working with long texts. The impact on quality is minimal.

Running Gemma 4: Tool Support and Compatibility

Run Anywhere: From Browsers to Apple Silicon

From day one, Gemma 4 is supported by a wide range of tools for running models. This is important: a new model often appears before developers' favorite tools can support it, which creates friction. The situation is different here.

The model works with transformers, llama.cpp (including compatibility with LM Studio, Jan, and local agents), MLX on Apple Silicon devices, mistral.rs (a Rust implementation), and directly in the browser via WebGPU. ONNX checkpoints are also available for running on edge devices.

For those who want to connect the model to a local assistant agent, Gemma 4 is compatible with openclaw, hermes, pi, and open code – all via a local server based on llama.cpp.

Fine-Tuning Gemma 4: Customization for Specific Tasks

Fine-Tuning: From a Driving Simulator to Your Own Scenario

Gemma 4 supports fine-tuning – that is, tailoring the model for a specific task. This is available through TRL, and as part of the release, TRL has been updated: during training, the model can now receive images back from tools, not just text.

As a demonstration, a training scenario was prepared where Gemma 4 learns to drive a car in the CARLA simulator: the model sees the road through a camera, makes decisions, and learns from the results. After training, the model consistently avoids pedestrians. The same principle applies to robotics, browser control, and other interactive scenarios.

Fine-tuning is also available through the Vertex AI cloud platform, with an example of extending function calling capabilities with fixed visual and audio modules. For those who prefer a graphical interface, Unsloth Studio is supported – either locally or via Google Colab.

Gemma 4 Significance for AI Developers

What This Means for Those Working with AI

Gemma 4 is not an experimental prototype or a demo. It's a full-fledged family of models that can be used right now: run locally, fine-tune for specific tasks, and integrate into agent systems.

The open license resolves typical questions about usage restrictions. Out-of-the-box support for multimodality – images, audio, and video – expands the range of tasks without needing to combine several separate models. And on-device availability means it's applicable not just where a GPU server is present.

Many open questions remain: the training data and recipe have not been disclosed, and its performance on specialized domains has yet to be tested by the community. But Gemma 4 is off to a convincing start.

#event #technical context #neural networks #ai development #engineering #infrastructure #open language models #multimodal models #multimodal ai

Link to Original: https://huggingface.co/blog/gemma4

Original Title: Welcome Gemma 4: Frontier multimodal intelligence on device

Publication Date: Apr 2, 2026

Hugging Face huggingface.co A U.S.-based open platform and company for hosting, training, and sharing AI models.

Previous Article Qwen3.6-Plus: Alibaba's New Model on the Path to True AI Agents Next Article Google Vids: Free AI Video and Music Generation – What's New in the Editor

Gemma 4: Google DeepMind's Multimodal AI Models for On-Device Use

Gemma 4: Multimodal AI for Images, Video, and Audio

Gemma 4 Models: Sizes from On-Device to Server Deployment

Gemma 4 Architecture: Understanding Its Efficiency

Running Gemma 4: Tool Support and Compatibility

Fine-Tuning Gemma 4: Customization for Specific Tasks

Gemma 4 Significance for AI Developers

Related Publications

Qwen3.5: The First Natively Multimodal Model

Qwen3.6-Plus: Alibaba's New Model on the Path to True AI Agents

Liquid AI Releases LFM2-24B, Its Largest Language Model – And It Runs on a Regular Laptop

From Source to Analysis

Neural Networks Involved in the Process

1. Analyzing the Original Publication and Writing the Text

2. step.translate-en.title

3. Text Review and Editing

4. Preparing the Illustration Description

5. Creating the Illustration