Memory Constraints in AMD Ryzen AI NPU for Embedding Models
What's the Deal with Memory?
Embedding models – the ones that transform text into numerical vectors for searching or comparison – usually run on GPUs or CPUs. But if you want to run them locally on a laptop powered by an AMD Ryzen AI processor, a not-so-obvious problem arises: the integrated neural processing unit (NPU) has tight constraints on memory capacity.
Simply put, the model might just fail to fit into the allocated space. This is especially true if it's saved in ONNX format with FP32 weights – meaning 32-bit floating-point numbers. This is the standard, but it's often overkill.
Converting FP32 to BF16 to Reduce Model Size
The Idea: Swap FP32 for BF16
AMD suggested a specific way to tackle this: convert the weights from FP32 to BF16 (bfloat16). This is a 16-bit format that cuts the model size roughly in half while maintaining a wide dynamic range of values, which is crucial for computational stability.
The bottom line is that for most embedding-related tasks, this loss of precision isn't a dealbreaker. The model will perform almost exactly as before, but it will take up half as much space in the NPU memory.
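The trade-off is easy to see with a quick numpy experiment (a sketch, not AMD's code): BF16 keeps FP32's full 8-bit exponent, so very large and very small values survive the round trip, while the mantissa shrinks from 23 bits to 7.

```python
import numpy as np

# BF16 is simply the upper 16 bits of an FP32 value: same sign bit and
# 8-bit exponent, but only 7 mantissa bits instead of 23.
def fp32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Truncate FP32 values to their upper 16 bits (the BF16 bit pattern)."""
    return (x.astype(np.float32).view(np.uint32) >> np.uint32(16)).astype(np.uint16)

def bf16_bits_to_fp32(bits: np.ndarray) -> np.ndarray:
    """Re-expand BF16 bit patterns to FP32 by zero-filling the low 16 bits."""
    return (bits.astype(np.uint32) << np.uint32(16)).view(np.float32)

vals = np.array([1e-30, 3.14159274, 1e30], dtype=np.float32)
roundtrip = bf16_bits_to_fp32(fp32_to_bf16_bits(vals))
# The full dynamic range survives; only low-order mantissa bits are lost,
# so the relative error stays below 2**-7 (under 1%).
print(vals)
print(roundtrip)
```

For embedding vectors, which are compared by relative similarity rather than exact values, an error on that scale is usually invisible in downstream metrics.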
Converting ONNX Model Weights from FP32 to BF16
How It's Done
In their technical materials, AMD shared working Python code. The logic behind the process is straightforward:
- take the ONNX model;
- iterate through all the weights;
- if a weight is stored in FP32 format, convert it to BF16;
- save the modified model.
The code consists of just a few functions: one reads the weights, another transforms them via bitwise operations (FP32 → BF16), and a third writes the changes back. No heavy dependencies – just numpy and the standard tooling for the ONNX format.
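The steps above can be sketched in a few lines. The snippet below is a self-contained illustration of that three-function pipeline, with a plain dict standing in for the parsed ONNX graph; the names (`read_fp32_weights`, `float32_to_bfloat16`, `write_weights_back`) are illustrative, not AMD's actual identifiers.

```python
import numpy as np

def read_fp32_weights(model: dict) -> dict:
    """Collect only the weights stored as 32-bit floats."""
    return {name: w for name, w in model["weights"].items()
            if w.dtype == np.float32}

def float32_to_bfloat16(fp32_data: np.ndarray) -> np.ndarray:
    """Bitwise FP32 -> BF16: keep only the upper 16 bits of each value."""
    return (fp32_data.view(np.uint32) >> np.uint32(16)).astype(np.uint16)

def write_weights_back(model: dict, converted: dict) -> None:
    """Store the converted tensors and update their recorded dtype."""
    for name, bits in converted.items():
        model["weights"][name] = bits
        model["dtypes"][name] = "bfloat16"

# Toy "model": one FP32 weight matrix plus a tensor that is not FP32.
model = {
    "weights": {
        "encoder.weight": np.ones((4, 4), dtype=np.float32),
        "token_ids": np.arange(4, dtype=np.int64),
    },
    "dtypes": {"encoder.weight": "float32", "token_ids": "int64"},
}

converted = {name: float32_to_bfloat16(w)
             for name, w in read_fp32_weights(model).items()}
write_weights_back(model, converted)
# The FP32 weight now occupies half the bytes; non-FP32 tensors are untouched.
```

On a real model the same loop would run over the graph's initializers via the onnx package, and the saved file shrinks to roughly half the original weight size.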
An example from the publication demonstrates how the script identifies weights, prints their data type, converts, and saves them:
```python
print(f"\nFound weight: {weight_name}")
print(f"Original data type: {onnx...}")
print("\nConverting FP32 to BF16...")
bf16_data_uint16 = float32_to_bfloat16(fp32_data)
```
It's not magic; it's the automation of a process that could be done manually if one had the urge to dive deep into the ONNX file structure.
Why This Matters Specifically for Ryzen AI
The NPU in AMD Ryzen AI processors is a dedicated block for accelerating neural network tasks. It's more energy-efficient than a GPU but has a limited amount of available memory. If a model "doesn't fit", it's impossible to run it on the NPU – forcing you to fall back to the CPU or GPU, which negates the benefits of a specialized chip.
Converting to BF16 allows you to fit more models, or larger architectures, without overstepping memory limits. This is particularly relevant for BCE models and other embedding models frequently used in search, classification, or text-matching tasks.
Limitations and Considerations of BF16 Conversion
What's Left Behind the Scenes
AMD hasn't provided specific benchmarks – neither for model quality after conversion nor for inference speed. This is more of a proof of concept than a turnkey solution with guaranteed results.
It's also not entirely clear how universal this method is. For some models, the loss of precision might become noticeable, especially if the task requires high sensitivity to the smallest differences in data. In such cases, you'll have to run manual tests and evaluate the changes in metrics.
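One simple manual test is to embed the same inputs with the FP32 and converted models and compare the resulting vectors. Since running a real model is out of scope here, the sketch below simulates BF16 storage by truncating an FP32 embedding vector; the measured similarity drop reflects the representation error alone, not the full end-to-end effect.

```python
import numpy as np

def simulate_bf16(x: np.ndarray) -> np.ndarray:
    """Simulate BF16 storage: zero out the low 16 bits of each FP32 value."""
    return (x.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)
fp32_vec = rng.standard_normal(768).astype(np.float32)  # a typical embedding size
bf16_vec = simulate_bf16(fp32_vec)

sim = cosine(fp32_vec, bf16_vec)
print(f"cosine similarity after BF16 truncation: {sim:.6f}")
```

For a real evaluation you would still compare retrieval or classification metrics on a held-out set: a near-1.0 cosine similarity on raw vectors does not by itself guarantee unchanged ranking quality.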
Another nuance is technical support. The code is presented as an example rather than an official software product. If the ONNX format updates or the structure of a specific model changes, the script might require manual adjustments.
The Bottom Line
This is a practical solution for those planning to run embedding models on the NPUs of AMD Ryzen AI processors who are facing a memory shortage. The FP32 → BF16 conversion doesn't require deep expertise and takes only a few minutes if you have the original model in ONNX format.
The approach isn't exactly new, but AMD has confirmed its viability on their hardware and provided ready-to-use code that is easy to adapt to your own tasks. For developers working with local AI, this is a useful tool – provided the model's quality holds up after conversion.