When Google releases a new family of open models, the question "but what can we run this on?" arises almost immediately. With Gemma 4, AMD aimed to address this in advance: support for the entire new model lineup was available on release day. This doesn't just cover server hardware but also consumer GPUs and laptop processors.
Gemma 4 is a family of four open models from Google, varying in size and architecture. The most compact model has approximately 2 billion active parameters, while the largest has 31 billion. Some models are built using a classic "dense" architecture, while others use a "Mixture of Experts" approach. In simple terms, the model activates only the necessary portion of its "knowledge" depending on the task, which helps save computational resources.
The models are multimodal: they work with text, images, and some variants also handle audio. The context window reaches 256,000 tokens – an impressive amount, roughly equivalent to several thick novels. Claimed strengths include understanding 140 languages, handling code, recognizing text and objects in images, and accepting voice input.
Compared to the previous generation, Gemma 3, the architecture has been redesigned to improve efficiency and quality when handling long contexts. The modules for image and audio processing have also been updated. Collectively, this makes Gemma 4 an interesting option for so-called "agentic scenarios", where the model doesn't just answer questions but independently executes chains of actions.
From Data Center to Laptop – Everything is Covered
AMD has announced support for Gemma 4 across three tiers of its product line:
- Instinct GPUs – server accelerators for data centers and corporate infrastructure;
- Radeon GPUs – graphics cards for workstations and home PCs;
- Ryzen AI – processors for AI laptops, including those with a dedicated Neural Processing Unit (NPU).
Support is implemented through several popular tools: LM Studio for easy local execution, as well as a number of open-source projects aimed at developers.
Running in the Cloud and on Servers
For server-side scenarios, Gemma 4 can be deployed using two main frameworks: vLLM and SGLang. Both are optimized for high performance when serving many concurrent requests, which is crucial for production environments.
vLLM supports several generations of Instinct and Radeon GPUs. SGLang is tailored for top-tier server accelerators from the MI300X, MI325X, and MI35X series. Notably, the entire Gemma 4 lineup – including the MoE architecture models – fits on a single MI300X accelerator with its 192 GB of memory, even with the full context window. For higher-load scenarios, multiple accelerators can be used in parallel.
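The claim that the 31B model fits on a single 192 GB accelerator can be sanity-checked with back-of-envelope arithmetic. The sketch below is not an official AMD figure: it assumes bf16 weights (2 bytes per parameter) and ignores quantization, and the helper name is illustrative.

```python
# Rough memory estimate: do 31B parameters in bf16 fit in 192 GB?
def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight footprint in GB (1 GB taken as 10**9 bytes)."""
    return params_billions * 1e9 * bytes_per_param / 1e9

largest = weight_memory_gb(31)  # bf16 weights for the largest model
print(f"~{largest:.0f} GB of weights, leaving ~{192 - largest:.0f} GB "
      "for the KV cache and activations")
# ~62 GB of weights, leaving ~130 GB for the rest of the footprint
```

The remaining headroom is what allows the full 256K context window's KV cache to coexist with the weights on one device; heavier request loads would spill onto additional accelerators.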
Running on Personal Hardware – Easier Than You Think
For those who want to run Gemma 4 locally – on their personal computer or laptop – AMD offers two paths.
The first is through LM Studio. This is an application with a graphical user interface that allows you to download and run the model in just a few clicks. It works with Ryzen AI and Ryzen AI Max processors, as well as Radeon and Radeon PRO cards. For full acceleration, up-to-date AMD Software: Adrenalin Edition drivers are required.
The second path is through Lemonade Server. This is a more flexible option for those who want to interact with the model via an API compatible with the OpenAI format. Lemonade supports acceleration on both the GPU via ROCm and the NPU in Ryzen AI processors.
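Because the API follows the OpenAI format, any OpenAI-compatible client can talk to it. The sketch below only builds the request body; the base URL and model name are placeholders, not values confirmed by the article – check your Lemonade Server configuration for the real ones.

```python
import json

# Assumed local endpoint; the actual host, port, and path may differ.
BASE_URL = "http://localhost:8000/api/v1"

def build_chat_request(model: str, user_message: str) -> dict:
    """Build the JSON body for a POST to {BASE_URL}/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
    }

body = build_chat_request("gemma-4-e4b", "Summarize this paragraph ...")
print(json.dumps(body, indent=2))
```

The same payload works unchanged whether Lemonade routes inference to the GPU via ROCm or to the NPU, which is the point of standardizing on the OpenAI wire format.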
The NPU: A Story in Itself
The Neural Processing Unit (NPU) in Ryzen AI processors is a specialized chip within the processor, designed specifically for neural network tasks. It consumes significantly less power than a GPU, which is critical for a laptop's battery life.
Support for Gemma 4 on the NPU will arrive with the next Ryzen AI software update. Initially, two compact models will be available: Gemma 4 E2B and E4B. For developers, this support will be implemented through interfaces like OnnxRuntime, simplifying integration into their own applications.
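In ONNX Runtime terms, targeting the NPU versus the GPU comes down to which execution provider an application requests. The pure-Python sketch below shows that selection logic; the provider names follow common ONNX Runtime conventions, but the exact provider exposed for the Ryzen AI NPU is an assumption, not something stated in the article.

```python
# Pick the best available ONNX Runtime execution provider, preferring
# the NPU, then the ROCm GPU, then the CPU. Provider names are assumed.
def pick_provider(available: list[str],
                  preferred: tuple[str, ...] = (
                      "VitisAIExecutionProvider",  # assumed NPU provider
                      "ROCMExecutionProvider",     # GPU via ROCm
                      "CPUExecutionProvider",      # universal fallback
                  )) -> str:
    """Return the first preferred provider present in `available`."""
    for name in preferred:
        if name in available:
            return name
    return "CPUExecutionProvider"

# With onnxruntime installed, `available` would come from
# onnxruntime.get_available_providers().
print(pick_provider(["CPUExecutionProvider", "ROCMExecutionProvider"]))
# prints "ROCMExecutionProvider"
```

An application written this way degrades gracefully: the same code runs on a machine without an NPU, just on a slower (and more power-hungry) device.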
Why This Matters for Users
"Day-one" support is not just a marketing gimmick. Previously, users and developers often had to wait weeks or even months for a new model to appear in a user-friendly tool or to work on specific hardware. In this case, AMD synced up with Google's release in advance.
For the average user, this means they can try out the new model immediately – via LM Studio, without waiting for patches or updates. For developers, it means they can start building their own projects on Gemma 4 right away, without worrying about the supporting infrastructure lagging behind.
The open weights of Gemma 4, combined with broad hardware support, make it a viable option for those who want to run powerful language models locally – without cloud dependency and without needing a server rack on hand.