Memory Constraints in AMD Ryzen AI NPU for Embedding Models
What's the Deal with Memory?
Embedding models – the ones that transform text into numerical vectors for searching or comparison – usually run on GPUs or CPUs. But if you want to run them locally on a laptop powered by an AMD Ryzen AI processor, a not-so-obvious problem arises: the integrated neural processing unit (NPU) has tight constraints on memory capacity.
Simply put, the model might just fail to fit into the allocated space. This is especially true if it's saved in ONNX format with FP32 weights – meaning 32-bit floating-point numbers. This is the standard, but it's often overkill.
Converting FP32 to BF16 to Reduce Model Size
The Idea: Swap FP32 for BF16
AMD suggested a specific way to tackle this: convert the weights from FP32 to BF16 (bfloat16). This is a 16-bit format that cuts the model size roughly in half while maintaining a wide dynamic range of values, which is crucial for computational stability.
The bottom line is that for most embedding-related tasks, this loss of precision isn't a dealbreaker. The model will perform almost exactly as before, but it will take up half as much space in the NPU memory.
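The trade-off is easy to see with a quick numpy experiment (a sketch, not AMD's code): BF16 keeps FP32's full 8-bit exponent, so very large and very small values survive the round trip, while the mantissa shrinks from 23 bits to 7.

```python
import numpy as np

# BF16 is simply the upper 16 bits of an FP32 value: same sign bit and
# 8-bit exponent, but only 7 mantissa bits instead of 23.
def fp32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Truncate FP32 values to their upper 16 bits (the BF16 bit pattern)."""
    return (x.astype(np.float32).view(np.uint32) >> np.uint32(16)).astype(np.uint16)

def bf16_bits_to_fp32(bits: np.ndarray) -> np.ndarray:
    """Re-expand BF16 bit patterns to FP32 by zero-filling the low 16 bits."""
    return (bits.astype(np.uint32) << np.uint32(16)).view(np.float32)

vals = np.array([1e-30, 3.14159274, 1e30], dtype=np.float32)
roundtrip = bf16_bits_to_fp32(fp32_to_bf16_bits(vals))
# The full dynamic range survives; only low-order mantissa bits are lost,
# so the relative error stays below 2**-7 (under 1%).
print(vals)
print(roundtrip)
```

For embedding vectors, which are compared by relative similarity rather than exact values, an error on that scale is usually invisible in downstream metrics.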
Converting ONNX Model Weights from FP32 to BF16
How It's Done
In their technical materials, AMD shared working Python code. The logic behind the process is straightforward:
- take the ONNX model;
- iterate through all the weights;
- if a weight is stored in FP32 format, convert it to BF16;
- save the modified model.
The code consists of just a few functions: one reads the weights, another transforms them via bitwise operations (FP32 → BF16), and a third writes the changes back. No heavy dependencies – just numpy and the standard tooling for the ONNX format.
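The steps above can be sketched in a few lines. The snippet below is a self-contained illustration of that three-function pipeline, with a plain dict standing in for the parsed ONNX graph; the names (`read_fp32_weights`, `float32_to_bfloat16`, `write_weights_back`) are illustrative, not AMD's actual identifiers.

```python
import numpy as np

def read_fp32_weights(model: dict) -> dict:
    """Collect only the weights stored as 32-bit floats."""
    return {name: w for name, w in model["weights"].items()
            if w.dtype == np.float32}

def float32_to_bfloat16(fp32_data: np.ndarray) -> np.ndarray:
    """Bitwise FP32 -> BF16: keep only the upper 16 bits of each value."""
    return (fp32_data.view(np.uint32) >> np.uint32(16)).astype(np.uint16)

def write_weights_back(model: dict, converted: dict) -> None:
    """Store the converted tensors and update their recorded dtype."""
    for name, bits in converted.items():
        model["weights"][name] = bits
        model["dtypes"][name] = "bfloat16"

# Toy "model": one FP32 weight matrix plus a tensor that is not FP32.
model = {
    "weights": {
        "encoder.weight": np.ones((4, 4), dtype=np.float32),
        "token_ids": np.arange(4, dtype=np.int64),
    },
    "dtypes": {"encoder.weight": "float32", "token_ids": "int64"},
}

converted = {name: float32_to_bfloat16(w)
             for name, w in read_fp32_weights(model).items()}
write_weights_back(model, converted)
# The FP32 weight now occupies half the bytes; non-FP32 tensors are untouched.
```

On a real model the same loop would run over the graph's initializers via the onnx package, and the saved file shrinks to roughly half the original weight size.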
An example from the publication demonstrates how the script identifies weights, prints their data type, converts, and saves them:
```python
print(f"\nFound weight: {weight_name}")
print(f"Original data type: {onnx...}")
print("\nConverting FP32 to BF16...")
bf16_data_uint16 = float32_to_bfloat16(fp32_data)
```
It's not magic; it's the automation of a process that could be done manually if one had the urge to dive deep into the ONNX file structure.
Why This Matters Specifically for Ryzen AI
The NPU in AMD Ryzen AI processors is a dedicated block for accelerating neural network tasks. It's more energy-efficient than a GPU but has a limited amount of available memory. If a model "doesn't fit", it's impossible to run it on the NPU – forcing you to fall back to the CPU or GPU, which negates the benefits of a specialized chip.
Converting to BF16 allows you to fit more models, or larger architectures, without overstepping memory limits. This is particularly relevant for BCE models and other embedding models frequently used in search, classification, or text-matching tasks.
Limitations and Considerations of BF16 Conversion
What's Left Behind the Scenes
AMD hasn't provided specific benchmarks – neither for model quality after conversion nor for inference speed. This is more of a proof of concept than a turnkey solution with guaranteed results.
It's also not entirely clear how universal this method is. For some models, the loss of precision might become noticeable, especially if the task requires high sensitivity to the smallest differences in data. In such cases, you'll have to run manual tests and evaluate the changes in metrics.
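One simple manual test is to embed the same inputs with the FP32 and converted models and compare the resulting vectors. Since running a real model is out of scope here, the sketch below simulates BF16 storage by truncating an FP32 embedding vector; the measured similarity drop reflects the representation error alone, not the full end-to-end effect.

```python
import numpy as np

def simulate_bf16(x: np.ndarray) -> np.ndarray:
    """Simulate BF16 storage: zero out the low 16 bits of each FP32 value."""
    return (x.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)
fp32_vec = rng.standard_normal(768).astype(np.float32)  # a typical embedding size
bf16_vec = simulate_bf16(fp32_vec)

sim = cosine(fp32_vec, bf16_vec)
print(f"cosine similarity after BF16 truncation: {sim:.6f}")
```

For a real evaluation you would still compare retrieval or classification metrics on a held-out set: a near-1.0 cosine similarity on raw vectors does not by itself guarantee unchanged ranking quality.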
Another nuance is technical support. The code is presented as an example rather than an official software product. If the ONNX format updates or the structure of a specific model changes, the script might require manual adjustments.
The Bottom Line
This is a practical solution for those planning to run embedding models on the NPUs of AMD Ryzen AI processors who are facing a memory shortage. The FP32 → BF16 conversion doesn't require deep expertise and takes only a few minutes if you have the original model in ONNX format.
The approach isn't exactly new, but AMD has confirmed its viability on their hardware and provided ready-to-use code that is easy to adapt to your own tasks. For developers working with local AI, this is a useful tool – provided the model's quality holds up after conversion.