Published February 26, 2026

Fine-Tuning Small Language Models with Synthetic Data and LuminaSFT

How to Make Small Language Models Think Better: AMD's Experience with Synthetic Data

AMD has introduced LuminaSFT, an approach that uses synthetic data to fine-tune small language models and achieve surprisingly high performance.

Development · Source: AMD · Reading time: 6–9 minutes

Large language models – GPT-like giants with tens or hundreds of billions of parameters – have long been the benchmark in the world of AI. But over the last couple of years, another question has become increasingly prominent: does a model have to be huge to be useful?

Small language models – commonly known as SLMs – are becoming increasingly common in real-world projects. They are cheaper to run, respond faster, and can be executed directly on a user's device without connecting to the cloud. Simply put: they cost less and, with the right configuration, can rival their larger counterparts on specific tasks.

And that 'right configuration' caveat is exactly what the development discussed here is about.

Challenges of Collecting High-Quality Training Data for LLMs

The Problem: Good Data Is Expensive

For a model to perform well on a specific task – like answering questions in a particular domain or conducting a dialogue in a desired style – it needs additional training on task-specific examples: questions and correct answers, instructions and their executions, or dialogues, depending on the task. This process is called fine-tuning.

The problem is that good training data is often scarce. Collecting it manually is time-consuming and expensive. Bringing in experts for labeling is even more costly. And even when data is available, it might be proprietary, unstructured, or not diverse enough.

This is where the idea of synthetic data comes in: what if we could generate the necessary training examples using a larger model? Take a large model, ask it to create 'question-answer' training pairs, filter for high-quality ones, and use them to fine-tune a smaller model. It sounds like a reasonable plan – and this is exactly what the approach called LuminaSFT, described by the AMD team, implements.
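The 'ask a large model to create question-answer pairs' step can be sketched in a few lines. This is only an illustration of the idea, not AMD's implementation: the `ask_teacher` function below is a hypothetical stub standing in for a real large-model API call, and the JSON pair format is an assumption.

```python
# Sketch of the "teacher generates training pairs" idea.
# ask_teacher() is a stub: in practice it would call a large hosted model.
import json

def ask_teacher(prompt: str) -> str:
    # Hypothetical stand-in for a large-model API call; returns a canned reply.
    canned = {
        "Write one question about GPU memory and answer it as JSON.":
            '{"question": "What is VRAM used for?", '
            '"answer": "It stores model weights and activations during inference."}'
    }
    return canned[prompt]

def generate_pair(topic_prompt: str) -> dict:
    """Ask the teacher for a question-answer pair and validate its structure."""
    raw = ask_teacher(topic_prompt)
    pair = json.loads(raw)
    # Reject malformed generations early instead of letting them into the dataset.
    assert {"question", "answer"} <= pair.keys()
    return pair

pair = generate_pair("Write one question about GPU memory and answer it as JSON.")
print(pair["question"])
```

In a real pipeline this loop would run thousands of times over varied topic prompts, with the filtering stage described below deciding which generations survive.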

What Is LuminaSFT and How Does It Work

LuminaSFT is not a standalone model or a product, but a methodology: a set of approaches for generating synthetic data to fine-tune small language models and doing it in a way that ensures the result is truly high-quality.

The key idea is simple: a large model acts as a 'teacher.' It generates a variety of tasks and the correct answers to them – which then become training examples for the small 'student' model. The goal here is not just to generate heaps of text, but to obtain examples that will genuinely improve the model's behavior.

Within LuminaSFT, this process is structured into several steps:

  • Generating diverse instructions. The large model creates a wide range of tasks – varying in type, complexity, and subject matter. This is crucial to prevent the small model from being 'pigeonholed' into a single scenario after training.
  • Filtering and quality assessment. Not all generated examples are equally useful. Some are too simple, inaccurate, or repetitive. Therefore, the data goes through a selection process, keeping only what genuinely contributes to the training.
  • Fine-tuning the small model. The filtered examples are used for fine-tuning. Afterward, the small model starts to perform better on the specific tasks for which the data was prepared.
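The three steps above can be sketched as one small pipeline. Everything here is illustrative: `teacher` is a stub for the large model, and the quality heuristics (deduplication plus a minimum answer length) are my own simple stand-ins for whatever filters the actual methodology uses.

```python
# Illustrative sketch of the generate -> filter -> fine-tune pipeline.
def teacher(seed: str) -> list[tuple[str, str]]:
    # Stub for step 1: a large model generating candidate (instruction, answer) pairs.
    return [
        ("Explain what fine-tuning is.",
         "Fine-tuning adapts a pretrained model to a specific task."),
        ("Explain what fine-tuning is.",       # exact duplicate -> should be dropped
         "Fine-tuning adapts a pretrained model to a specific task."),
        ("Hi", "Hi"),                          # trivially short -> should be dropped
    ]

def quality_filter(pairs: list[tuple[str, str]]) -> list[dict]:
    """Step 2: keep only novel, non-trivial examples (simple stand-in heuristics)."""
    seen, kept = set(), []
    for instr, ans in pairs:
        if instr in seen or len(ans.split()) < 5:
            continue
        seen.add(instr)
        kept.append({"instruction": instr, "response": ans})
    return kept

dataset = quality_filter(teacher("general knowledge"))
# Step 3 would serialize `dataset` (e.g., as JSONL) and hand it to an SFT trainer.
print(len(dataset))
```

Of the three candidates, only the first survives: the duplicate and the trivially short exchange are filtered out before fine-tuning ever sees them.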

All of this is implemented on AMD GPUs and the ROCm platform – AMD's open software stack for GPU computing, which serves as an alternative to NVIDIA's better-known CUDA.

Benefits and Risks of Using Synthetic Data for Model Training

Why Synthetic Data Isn't Just a 'Generate and Forget' Process

Synthetic data is a topic that has been discussed more and more frequently lately. And it has both obvious advantages and hidden pitfalls.

The advantage is obvious: there's no need to hire labelers or search for rare datasets; you can quickly obtain data for a specific task. This is especially valuable when you need to fine-tune a model for a narrow domain – such as medical consultations or technical support, where high-quality open-source data is almost non-existent.

But there are risks. If the large model generates examples with errors, the small model will learn them. If the data lacks diversity, the model will become predictable and inflexible. If there is any bias embedded in the synthetic data, it will be passed on to the fine-tuned model.

This is precisely why LuminaSFT places so much emphasis on diversity and filtering. The authors of the approach specifically designed the generation process so that the examples would differ from each other in style and content – this reduces the risk of the model simply 'memorizing' templates instead of learning to solve problems.
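One simple way to enforce that kind of diversity is to reject a candidate instruction that is too similar to anything already kept. The sketch below is my illustration of the principle, not AMD's actual filter; it uses Python's standard-library `difflib.SequenceMatcher`, and the 0.8 similarity threshold is an arbitrary assumption.

```python
# Illustrative diversity filter: drop near-duplicate instructions.
from difflib import SequenceMatcher

def is_diverse(candidate: str, kept: list[str], threshold: float = 0.8) -> bool:
    """Return True if the candidate differs enough from everything kept so far."""
    return all(
        SequenceMatcher(None, candidate.lower(), k.lower()).ratio() < threshold
        for k in kept
    )

kept: list[str] = []
for instr in [
    "Summarize this article in two sentences.",
    "Summarize this article in 2 sentences.",   # near-duplicate of the first
    "Translate the following text into French.",
]:
    if is_diverse(instr, kept):
        kept.append(instr)

print(kept)
```

The near-duplicate is rejected while the genuinely different instruction passes. Production filters typically go further (embedding similarity, quality scoring by another model), but the principle is the same.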

Performance of Small Language Models Compared to Large Models

Small Doesn't Mean Weak

The results demonstrated by the approach look convincing. Small models, fine-tuned on synthetic data using the LuminaSFT methodology, show performance comparable to that of larger models on a range of tasks – especially where the training data was well-tailored to a specific domain.

This is an important point. A small language model is not a 'stripped-down' version of a large one, but a separate tool with its own advantages. If it's well-configured for a specific task, it can perform just as well while being significantly cheaper to operate.

To put it simply: you don't always need a sports car. Sometimes a well-tuned city hatchback handles its route better – and uses less gas.

Practical Applications and Target Audience for LuminaSFT

Who Is This For in Practice

LuminaSFT is not a tool for the end-user, but a methodology for those who develop or customize language models for specific tasks.

In short, it's potentially useful for:

  • teams looking to deploy their own language model without incurring massive computational costs;
  • developers working in niche subject areas who can't find a suitable open-source dataset;
  • those already using or considering AMD hardware for AI – and who want to understand what can be done with this equipment.

This last point is important for context. AMD is actively developing its AI division, and publications like the one on LuminaSFT are both a demonstration of technical capabilities and an attempt to attract developers to its ecosystem. It's a subtle, yet quite transparent, move.

Limitations and Challenges of the Synthetic Data Approach

What's Left Out of the Picture

The methodology appears sound, but open questions remain.

First, the quality of synthetic data directly depends on the quality of the 'teacher' model. If the large model makes mistakes on a certain topic, these errors will end up in the training set. No amount of filtering can guarantee one hundred percent purity.

Second, this approach works well when the task is clearly defined. The more vague or broad the model's application area is, the more difficult it becomes to select the right set of synthetic examples.

Third, it is still a resource-intensive process. Generating synthetic data using a large model costs money and time. The expense simply shifts – from manual labeling to computation. For some, this is more cost-effective; for others, it's not.

But overall, the direction is clear and logical: using large models as a tool to create knowledge for smaller ones. This is not a new idea in machine learning – similar approaches (sometimes called 'knowledge distillation') have been around for a long time. LuminaSFT is a specific implementation of this logic, tailored for modern open-source models and AMD's hardware platform.

And if methodologies like this become simpler and more accessible, small models will become an increasingly serious alternative for those who are unwilling or unable to pay for access to the computational giants.

Original Title: LuminaSFT: Generating Synthetic Fine-Tuning Data for Small Language Models – ROCm Blogs
Publication Date: Feb 24, 2026
Source: AMD (www.amd.com), an international company manufacturing processors and computing accelerators for AI workloads.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.6 (Anthropic) – Analyzing the original publication and writing the text: the neural network studies the original material and generates a coherent text.
2. Gemini 2.5 Pro (Google DeepMind) – Translation into English.
3. Gemini 2.5 Flash (Google DeepMind) – Text review and editing: correction of errors, inaccuracies, and ambiguous phrasing.
4. DeepSeek-V3.2 (DeepSeek) – Preparing the illustration description: generating a textual prompt for the visual model.
5. FLUX.2 Pro (Black Forest Labs) – Creating the illustration: generating an image based on the prepared prompt.
