How to Teach a Computer to Convert MRI to CT: The Neural Networks That See Bones Where They're Not Supposed to Be

A new neural network architecture converts MRI and cone-beam CT scans into high-quality CT images, enabling doctors to plan radiation therapy with greater precision.

Electrical Engineering & System Sciences
Author: Dr. Anna Muller · Reading time: 12–18 minutes

Original title: Deep Learning-Based Cross-Anatomy CT Synthesis Using Adapted nnResU-Net with Anatomical Feature Prioritized Loss
Publication date: Sep 26, 2025

Imagine you're preparing for an important presentation. You have two drafts: one with a great structure but no details, and another that's detailed but poorly organized. What you need is a third version that combines the strengths of both. This is roughly the challenge radiologists face when planning radiation therapy for cancer patients.

They need two types of images. Magnetic Resonance Imaging (MRI) shows soft tissues and tumors with incredible clarity – it's like an HD photo of your internal anatomy. But to calculate the radiation dosage, you need a Computed Tomography (CT) scan. It shows how dense your tissues are and how X-rays will pass through them. Without this, you can't accurately calculate where to direct the radiation.

The problem is that performing both an MRI and a CT for every patient means extra radiation exposure, time, and cost. So, what if we could teach a computer to create a CT scan from an MRI? It sounds like science fiction, but that's exactly what researchers are working on, and a new solution has recently emerged that outperforms previous methods.

Why Is This So Difficult?

Let's break down why you can't just take an MRI and "convert" it into a CT scan, like changing a photo's file format.

An MRI shows how hydrogen atoms in your tissues react to a magnetic field. This provides an excellent picture of soft tissues – muscles, tumors, the brain. But it contains no information whatsoever about tissue density, which is essential for calculating radiation dosage. It’s like being given a city map with street names but no building heights; for some tasks, it's simply not enough.

A CT scan, on the other hand, is based on X-rays. It's great at showing bones and tissue density, but soft tissues are much less visible. Then there's Cone-Beam CT (CBCT) – a simplified version used right in the radiation therapy room to verify the patient's position. But its quality is mediocre, it's full of artifacts, and it’s not suitable for precise calculations.

The task, then, is this: take an image from one world (MRI or CBCT) and create an image from another (high-quality CT), ensuring that all anatomical details remain in place while information about tissue density appears where it didn't exist before.

Neural Networks Learn to See What Isn't There

Over the last few years, a quiet revolution has taken place in medical imaging. Deep learning – a technology where a neural network learns to find patterns in massive amounts of data – has become the primary tool for processing medical images.

One of the most successful architectures is called nnU-Net. It's not just a neural network but an entire system that automatically configures itself for your data. You upload your images, specify what you want as the output, and nnU-Net decides on its own what patch size to use, how many layers are needed, and how to set up the training process. It was originally designed for segmentation – the task of outlining the liver, heart, or a tumor on a scan. But it turned out that with minor modifications, it works perfectly for image translation, too.

The new study used two versions of this architecture: a standard one and a residual one.

The standard version is straightforward: an MRI goes in, and a synthetic CT comes out. About thirty million parameters are adjusted to make the result as similar as possible to a real CT.

The residual version is more sophisticated. It doesn't try to create a CT from scratch but instead learns to find the difference between the input and the target images. It's like being asked to draw a portrait, but instead of starting with a blank canvas, you're given a photo and told, "Just show me what needs to be changed." This approach allows the network to focus on the truly important details. The residual version has more parameters – about fifty-seven million – but this enables it to better capture fine anatomical structures.
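The residual idea can be reduced to a few lines. Below is a minimal pure-Python sketch (the function names and the toy "network" are illustrative, not the paper's code): the network predicts only the correction, and the output is the input plus that correction.

```python
def synthesize_residual(predict_residual, mri):
    """Residual formulation: instead of generating the CT from scratch,
    the network predicts the difference between input and target,
    and the final output is input + residual."""
    residual = predict_residual(mri)
    return [x + r for x, r in zip(mri, residual)]

# Toy stand-in for a trained network: its "residual" brightens every
# voxel by a constant, which is all it would need to learn if the CT
# were uniformly brighter than the MRI.
toy_network = lambda image: [10.0 for _ in image]

mri = [1.0, 2.0, 3.0]
synthetic_ct = synthesize_residual(toy_network, mri)
# synthetic_ct == [11.0, 12.0, 13.0]
```

Because the output starts from the input rather than from zero, the network only has to model what actually changes between modalities.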

A Loss Function That Understands Anatomy

But this raises a question: how do you explain to a neural network what’s important? Typically, simple metrics are used, like the mean absolute error of pixel brightness. If the predicted image differs from the real one by ten brightness units in every pixel, the network gets penalized. The smaller the deviation, the better.
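That "simple metric" is the L1 loss, and it is short enough to write out in full. A minimal pure-Python sketch (list-based images, for illustration only):

```python
def l1_loss(predicted, target):
    """Mean absolute error: the average per-voxel brightness deviation
    between the synthetic and the real CT. Lower is better."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)

# An image that is off by 10 brightness units everywhere has L1 loss 10.
real_ct      = [100.0, 150.0, 200.0]
synthetic_ct = [110.0, 160.0, 210.0]
loss = l1_loss(synthetic_ct, real_ct)
# loss == 10.0
```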

The problem is that such a metric doesn't understand anatomy. It doesn't care whether the network made an error in the brightness of soft tissue or bone. But for a doctor, the difference is huge. If the border between the liver and the kidney is blurred by a millimeter, it can be critical. If a bone is slightly less bright but its shape is correct, it's not as serious.

The solution was found in what are known as perceptual loss functions. The idea is to compare not the pixels themselves, but the features extracted from the images by another neural network – one that's already been trained to understand anatomy.

The researchers took a pre-trained segmentation network from the TotalSegmentator project. This network can recognize dozens of organs and structures: the liver, lungs, bones, and major blood vessels. It looks at a CT scan and understands: here are the ribs, here is the heart, and here is the spine.

The new loss function, named AFP (Anatomical Feature-Prioritized), works like this: the real and synthetic CT scans are passed through this segmentation network, and the features extracted at different levels are compared. If the features are similar, it means the synthesis network has reproduced the anatomy correctly. If not, it gets penalized.
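The mechanics of AFP can be sketched in a few lines of pure Python. Here `extract_features` is a hypothetical stand-in for TotalSegmentator's frozen encoder; the real loss compares multi-channel 3D feature maps, not flat lists:

```python
def afp_loss(extract_features, real_ct, synthetic_ct):
    """AFP sketch: pass both images through a frozen segmentation
    network and compare the features it extracts at several depths.
    Similar features at every level -> the anatomy was reproduced."""
    real_levels = extract_features(real_ct)
    synth_levels = extract_features(synthetic_ct)
    total = 0.0
    for real_f, synth_f in zip(real_levels, synth_levels):
        total += sum(abs(a - b) for a, b in zip(real_f, synth_f)) / len(real_f)
    return total / len(real_levels)

# Toy extractor with two "levels": the image mean and the image maximum.
toy_extractor = lambda img: [[sum(img) / len(img)], [max(img)]]

perfect = afp_loss(toy_extractor, [1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # 0.0
shifted = afp_loss(toy_extractor, [1.0, 2.0, 3.0], [2.0, 3.0, 4.0])  # 1.0
```

Crucially, the segmentation network's weights stay frozen: it acts as a fixed "anatomy critic", and only the synthesis network is updated.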

This is similar to how a teacher grades an essay: it's not just about grammar (pixel brightness), but also about meaning (anatomical accuracy). AFP forces the network to do more than just copy brightness values; it has to actually understand what is where.

How It Works in Practice

A multi-center dataset called SynthRAD2025 was used for training. It included images of three anatomical regions: head and neck, thorax, and abdomen. All scans were pre-aligned to reduce complexity for the network.

The images were divided into 3D patches – small cubes. Imagine slicing a loaf of bread, and then dicing each slice. Only instead of bread, you have volumetric medical data. The patch size was selected automatically based on the anatomical region.

MRIs were normalized so that in every case, the mean brightness was zero and the standard deviation was one. For CT and CBCT scans, extreme values were first clipped (to remove outliers), and then normalization was also applied. This is a standard procedure that helps the network learn more effectively.
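Both normalization schemes are one-liners in spirit. A pure-Python sketch (the clipping bounds below are illustrative, not the paper's values):

```python
def zscore(image):
    """MRI-style normalization: shift and scale so the result has
    zero mean and unit standard deviation."""
    n = len(image)
    mean = sum(image) / n
    std = (sum((x - mean) ** 2 for x in image) / n) ** 0.5
    return [(x - mean) / std for x in image]

def clip_then_zscore(image, lo, hi):
    """CT/CBCT-style normalization: clip extreme values first to
    remove outliers, then apply the same z-score normalization."""
    clipped = [min(max(x, lo), hi) for x in image]
    return zscore(clipped)

normed = zscore([10.0, 20.0, 30.0])          # mean 0, std 1
ct_normed = clip_then_zscore([-2000.0, 0.0, 5000.0], lo=-1000.0, hi=3000.0)
```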

The training was lengthy. The standard network trained for a thousand epochs, and the residual one for fifteen hundred. An epoch is one full pass of all the training data through the network. After that, the best models were fine-tuned for another five hundred epochs, this time with the AFP loss added. So, first the network learned to simply reproduce brightness, and then it learned to understand anatomy.

An interesting detail: the researchers intentionally did not use additional augmentations – artificially expanding the dataset with rotations, flips, and noise. This was done to ensure the results were as reproducible as possible and not dependent on random effects.

During the inference stage, when the trained network creates synthetic CTs for new patients, a clever trick was used. The image was divided into overlapping patches, each patch was processed separately, and then the results in the overlapping regions were averaged. It’s like having three people paint different parts of a picture and then blending their brushstrokes where their sections meet – this way, you avoid sharp seams.
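The overlap-and-average trick is easiest to see in one dimension. A minimal sketch (the real pipeline works on 3D patches and typically weights the overlap, e.g. with a Gaussian; here we use plain averaging):

```python
def sliding_inference(process, signal, patch, stride):
    """Patch-based inference with overlap averaging (1-D sketch):
    run `process` on overlapping windows, accumulate predictions,
    and divide by how many windows covered each position. The
    averaging in overlap regions is what removes sharp seams."""
    acc = [0.0] * len(signal)
    cnt = [0] * len(signal)
    start = 0
    while True:
        end = min(start + patch, len(signal))
        out = process(signal[start:end])
        for i, v in zip(range(start, end), out):
            acc[i] += v
            cnt[i] += 1
        if end == len(signal):
            break
        start += stride
    return [a / c for a, c in zip(acc, cnt)]

# Sanity check with an identity "network": overlap averaging must
# reproduce the input exactly, seams or not.
result = sliding_inference(lambda p: p, [1.0, 2.0, 3.0, 4.0, 5.0],
                           patch=3, stride=2)
# result == [1.0, 2.0, 3.0, 4.0, 5.0]
```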

The Results: Numbers and Images

The models were evaluated using two groups of metrics.

The first group was intensity metrics. These show how accurately the network reproduced brightness levels: Mean Absolute Error (MAE), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index (SSIM). Here, the models trained only with L1 loss – that simple metric comparing pixel brightness directly – came out on top. This makes sense: if a network is trained to minimize brightness deviation, it will perform best on that specific metric.
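Of these intensity metrics, PSNR is the least self-explanatory, so here is a minimal pure-Python sketch of it (SSIM is more involved and omitted; `data_range` is the assumed dynamic range of the images):

```python
import math

def psnr(predicted, target, data_range):
    """Peak Signal-to-Noise Ratio in decibels: the ratio of the
    squared dynamic range to the mean squared error. Higher means
    the synthetic image is closer to the real one."""
    mse = sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(data_range ** 2 / mse)

score = psnr([0.0, 10.0], [10.0, 0.0], data_range=255.0)  # about 28 dB
```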

The second group was segmentation metrics. Here, the synthetic and real CTs were passed through TotalSegmentator, their organ masks were generated, and then the masks were compared. The two main metrics were the Dice coefficient (which shows how well the masks overlap) and the Hausdorff distance (which shows the maximum deviation of the boundaries). Here, the models with AFP were the winners. They better reproduced the shape and position of organs, the boundaries were sharper, and the bones were more distinct.
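Both segmentation metrics have compact definitions. A pure-Python sketch on simplified inputs (masks as sets of voxel indices, boundaries as 1-D point lists; real evaluations use 3D coordinates):

```python
def dice(mask_a, mask_b):
    """Dice coefficient on binary masks given as sets of voxel
    indices: 2|A∩B| / (|A| + |B|). 1.0 means perfect overlap."""
    if not mask_a and not mask_b:
        return 1.0
    return 2 * len(mask_a & mask_b) / (len(mask_a) + len(mask_b))

def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two boundary point sets:
    the worst-case distance from a point in one set to the nearest
    point of the other. Sensitive to even a single stray boundary."""
    def directed(src, dst):
        return max(min(abs(p - q) for q in dst) for p in src)
    return max(directed(points_a, points_b), directed(points_b, points_a))

d = dice({1, 2, 3}, {2, 3, 4})   # 2*2 / (3+3) ≈ 0.667
h = hausdorff([0, 1], [0, 5])    # the stray point at 5 dominates: 4
```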

But the truly interesting part is when you look at the images themselves.

When converting MRI to CT, the models with AFP produced much sharper bones. The shoulder blades, ribs, and spine – all were clearly visible, almost like on a real CT scan. The models without AFP produced blurrier bones, even though the overall tissue brightness might have been closer to the original.

The residual architecture amplified this effect. It was even better at handling details – small bone structures, organ boundaries. It's the difference between a photo from a standard phone and one from a professional camera: the overall contours are the same, but the level of detail is completely different.

When converting CBCT to regular CT, the task was more challenging. Cone-beam tomography is full of artifacts – false streaks, blurs, and noise. The network had to not only translate the image but also "clean it up." And once again, the combination of the residual architecture with AFP delivered the best results. Pathologies – like tumors or altered tissues – were reproduced more accurately, and organ boundaries didn't blur.

Why Metrics Can Be Deceiving

This presents a paradox. The models with AFP score worse on MAE and PSNR but better on segmentation metrics and visual quality. How is that possible?

It's because different metrics optimize for different things. MAE and PSNR look at each pixel individually and calculate how much it differs from the original. If a bone is shifted by one pixel but its brightness is correct, these metrics will register an error. If the bone is in the right place but is slightly brighter or darker, they will also register an error. To them, both scenarios are equally bad.
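A tiny numerical example makes the asymmetry concrete. Treat a "bone" as a bright bump in a 1-D brightness profile (illustrative values only):

```python
def mae(a, b):
    """Pixel-wise mean absolute error."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

bone         = [0, 0, 100, 100, 0, 0]
bone_shifted = [0, 100, 100, 0, 0, 0]   # same shape, shifted one pixel
bone_dimmer  = [0, 0, 95, 95, 0, 0]     # right place, slightly darker

# MAE punishes the one-pixel shift about 20x harder than the small
# brightness change, even though the shifted bone is the clinically
# dangerous one.
# mae(bone, bone_shifted) ≈ 33.3
# mae(bone, bone_dimmer)  ≈ 1.67
```

To a pixel-wise metric, geometry errors and brightness errors are interchangeable; to a treatment planner, they are not.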

But for a doctor, they are not. For radiation therapy planning, it is critical that organs are in their correct locations. If the liver on a synthetic CT is shifted by three millimeters, it's a disaster – the radiation could miss its target. If the liver is in the right place but its average brightness differs by five Hounsfield units, it’s not ideal, but it’s workable.

AFP teaches the network to do exactly this: preserve the anatomical structure, not just copy brightness values. That's why its intensity metrics are slightly worse, but its clinical applicability is higher.

This is similar to car navigation systems. One might show the shortest route by distance, while another shows the fastest route by time. Technically, the first is more "accurate" (meters don't lie), but the second is more useful because it accounts for traffic, stoplights, and real-world conditions.

Limitations and Artifacts

Of course, nothing is perfect. The models with AFP sometimes produced «checkerboard» artifacts – abrupt transitions in brightness in some areas. This is related to how convolutional layers work in the decoder. The researchers identified the cause: transposed convolutions, which increase image resolution, create these artifacts when used with perceptual loss functions.

They found a simple solution: replacing the transposed convolutions with trilinear interpolation. This mitigated the problem but didn't eliminate it completely. Future work will focus on further optimizing the decoder architecture – perhaps using hybrid approaches that retain the benefits of AFP while minimizing artifacts.

Another challenge is the training time. Fifteen hundred epochs plus five hundred epochs of fine-tuning requires dozens of hours on powerful GPUs. But this is a one-time cost. Once trained, the network is fast: creating a synthetic CT for a new patient takes only a few minutes.

Why This Matters for Real-World Medicine

Let's go back to the beginning. Why is all this necessary?

Imagine a patient with a brain tumor. They get an MRI to see the exact tumor boundaries – it might be close to critical structures, and the doctor needs maximum clarity. But for radiation therapy planning, a CT is also needed to calculate how X-rays will pass through the skull and tissues. This means the patient must undergo both scans. That’s more time, additional radiation, and extra costs.

Now, imagine you could perform only the MRI and then generate an artificial CT – one so accurate that a doctor could confidently plan the treatment. The patient receives less radiation exposure, the process is faster, and the clinic saves resources.

Or consider another scenario: cone-beam CT. It's done in the radiation therapy room before each session to ensure the patient is positioned correctly. But the quality of these scans is low, and you can't recalculate the radiation dose based on them. If you could "enhance" a CBCT to a diagnostic-quality level, you could adapt the treatment plan during the course of therapy, accounting for changes in the tumor and organ positions.

This is precisely what these neural networks are for. This isn't an academic exercise; it's a tool that could change the practice of radiation therapy.

What's Next

The proposed method is not the final destination but rather a solid platform for future development. The combination of nnU-Net's automatic configuration, the benefits of residual learning, and an anatomically-aware loss function has proven its effectiveness. The network performs reliably across different anatomical regions, doesn't require complex augmentations, and doesn't break on new data.

The next steps are clear. The decoder needs to be optimized to completely eliminate artifacts. It’s possible to experiment with hybrid loss functions – perhaps adding elements of generative adversarial networks (GANs), where one network creates images and another tries to distinguish them from real ones. We could also try transformers – an architecture showing impressive results in natural image and text processing.

But the most important part is already done: it has been proven that deep learning can solve the task of cross-modality medical image synthesis at a clinically significant level. This is no longer a theory but a working technology that is ready to enter real-world practice.

There was a time when planning radiation therapy took weeks and required multiple scans. Then came automatic contouring systems. Now, the era of synthetic images is arriving – when a single scan is sufficient for tasks that once required two or three. The goal is to make medical imaging as reliable as a utility, always available and accurate.

This isn't an overnight revolution. But every step like this brings us closer to a future of medicine where technology works for precision, speed, and patient comfort. Where a doctor gets all the necessary data from a single scan. Where a neural network sees bones where they aren't physically visible – and does it so well that it can be trusted.

Original authors: Javier Sequeiro González, Arthur Longuefosse, Miguel Díaz Benito, Álvaro García Martín, Fabien Baldacci


