Published February 6, 2026

Reducing Embedding Model Memory Usage on AMD Ryzen AI NPU

How to Curb the "Appetites" of Embedding Models on AMD Ryzen AI

AMD has introduced a simple method for compressing embedding models for local NPUs: converting weights from FP32 to BF16 format using just a few lines of Python code.

Tags: Technical context, Development. Source: AMD. Reading time: 4–5 minutes.


What's the Deal with Memory?

Embedding models – the ones that transform text into numerical vectors for searching or comparison – usually run on GPUs or CPUs. But if you want to run them locally on a laptop powered by an AMD Ryzen AI processor, a not-so-obvious problem arises: the integrated neural processing unit (NPU) has tight constraints on memory capacity.

Simply put, the model might just fail to fit into the allocated space. This is especially true if it's saved in ONNX format with FP32 weights – meaning 32-bit floating-point numbers. This is the standard, but it's often overkill.


The Idea: Swap FP32 for BF16

AMD suggested a specific way to tackle this: convert the weights from FP32 to BF16 (bfloat16). This is a 16-bit format that cuts the model size roughly in half while maintaining a wide dynamic range of values, which is crucial for computational stability.

The bottom line is that for most embedding-related tasks, this loss of precision isn't a dealbreaker. The model will perform almost exactly as before, but it will take up half as much space in the NPU memory.
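The range argument is easy to verify with a small numpy experiment. This is an illustrative sketch, not part of AMD's materials: numpy has no native bfloat16 type, so BF16 is emulated here by truncating the lower 16 bits of the FP32 bit pattern.

```python
import numpy as np

# A large value near the top of the FP32 range.
x = np.array([3e38], dtype=np.float32)

# Emulate BF16 by keeping only the upper 16 bits of the FP32 bit pattern:
# BF16 shares FP32's 8 exponent bits, so the magnitude survives.
bf16_bits = (x.view(np.uint32) >> 16).astype(np.uint16)
restored = (bf16_bits.astype(np.uint32) << 16).view(np.float32)
print(np.isfinite(restored[0]))  # BF16 keeps the value finite

# FP16 has only 5 exponent bits (max ~65504), so the same value overflows.
with np.errstate(over="ignore"):
    print(np.isfinite(x.astype(np.float16)[0]))  # overflows to infinity
```

The same number that overflows FP16 survives BF16 with only a small relative error, which is exactly why BF16 is the safer 16-bit choice for neural-network weights.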


How It's Done

In their technical materials, AMD shared working Python code. The logic behind the process is straightforward:

  • take the ONNX model;
  • iterate through all the weights;
  • if a weight is stored in FP32 format, convert it to BF16;
  • save the modified model.

The code consists of just a few functions: one reads the weights, another transforms them via bitwise operations (FP32 → BF16), and a third writes the changes back. No messy dependencies – just standard libraries like numpy and tools for working with the ONNX format.
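The bitwise step can be sketched as follows. This is a minimal illustration under stated assumptions, not AMD's exact code: the function name `float32_to_bfloat16` matches the one in AMD's snippet, and round-to-nearest-even is applied to the discarded bits (plain truncation would also work).

```python
import numpy as np

def float32_to_bfloat16(fp32_data: np.ndarray) -> np.ndarray:
    """Convert an FP32 array to BF16 bit patterns stored as uint16.

    BF16 is simply the upper 16 bits of FP32 (same sign and exponent,
    7 mantissa bits), so conversion is a rounded right shift.
    """
    bits = np.ascontiguousarray(fp32_data, dtype=np.float32).view(np.uint32)
    # Round to nearest, ties to even, on the 16 low bits being discarded.
    rounding_bias = ((bits >> 16) & 1) + 0x7FFF
    return ((bits + rounding_bias) >> 16).astype(np.uint16)
```

For example, `float32_to_bfloat16(np.array([1.0], dtype=np.float32))` yields `[0x3F80]`, which is exactly 1.0 in BF16.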

An example from the publication demonstrates how the script identifies weights, prints their data type, converts, and saves them:

print(f"\nFound weight: {weight_name}")
print(f"Original data type: {onnx...}")
print("\nConverting FP32 to BF16"...)
bf16_data_uint16 = float32_to_bfloat16(fp32_data)

It's not magic; it's the automation of a process that could be done manually if one had the urge to dive deep into the ONNX file structure.

Why This Matters Specifically for Ryzen AI

The NPU in AMD Ryzen AI processors is a dedicated block for accelerating neural network tasks. It's more energy-efficient than a GPU but has a limited amount of available memory. If a model "doesn't fit", it's impossible to run it on the NPU – forcing you to use the CPU or GPU, which negates the benefits of a specialized chip.

Converting to BF16 allows you to fit more models or work with larger architectures without overstepping memory limits. This is particularly relevant for BCE models and other embedding models frequently used in search, classification, or text-matching tasks.


What's Left Behind the Scenes

AMD hasn't provided specific benchmarks – neither for model quality after conversion nor for inference speed. This is more of a proof-of-concept than a turnkey solution with guaranteed results.

It's also not entirely clear how universal this method is. For some models, the loss of precision might become noticeable, especially if the task requires high sensitivity to the smallest differences in data. In such cases, you'll have to run manual tests and evaluate the changes in metrics.
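One straightforward check is to embed the same inputs with both model variants and compare the vectors by cosine similarity. The sketch below simulates the effect on a synthetic 768-dimensional vector (a common embedding size, assumed here for illustration); in practice you would compare real model outputs.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def bf16_roundtrip(x: np.ndarray) -> np.ndarray:
    """Simulate BF16 storage: truncate to the upper 16 bits, expand back."""
    bits = x.astype(np.float32).view(np.uint32)
    return ((bits >> 16) << 16).view(np.float32)

rng = np.random.default_rng(0)
embedding = rng.standard_normal(768).astype(np.float32)
degraded = bf16_roundtrip(embedding)
print(cosine_similarity(embedding, degraded))  # expect a value very close to 1.0
```

If the similarity on real inputs stays near 1.0 and downstream retrieval metrics barely move, the conversion is safe for that model; a noticeable drop is the signal to stay in FP32.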

Another nuance is technical support. The code is presented as an example rather than an official software product. If the ONNX format updates or the structure of a specific model changes, the script might require manual adjustments.

The Bottom Line

This is a practical solution for those planning to run embedding models on the NPUs of AMD Ryzen AI processors who are facing a memory shortage. The FP32 → BF16 conversion doesn't require deep expertise and takes only a few minutes if you have the original model in ONNX format.

The approach isn't exactly new, but AMD has confirmed its viability for their hardware and provided ready-to-use code that is easy to adapt for current tasks. For developers working with local AI, this is a useful tool – provided the model "remains viable" after conversion.

Original title: Practical Technique for Reducing Memory Usage of BCE Models on Ryzen AI
Publication date: Feb 6, 2026
Source: AMD (www.amd.com), an international company manufacturing processors and computing accelerators for AI workloads.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.5 (Anthropic): analyzing the original publication and writing the text. The neural network studies the original material and generates a coherent text.
2. Gemini 3 Pro (Google DeepMind): translation into English.
3. Gemini 3 Flash Preview (Google DeepMind): text review and editing. Correction of errors, inaccuracies, and ambiguous phrasing.
4. DeepSeek-V3.2 (DeepSeek): preparing the illustration description. Generating a textual prompt for the visual model.
5. FLUX.2 Pro (Black Forest Labs): creating the illustration. Generating an image based on the prepared prompt.
