Published on March 11, 2026

A Voice at the Appointment: Why AI Can't Make Out the Doctor

Researchers tested whether AI systems can comprehend real-world medical conversations – and the results delivered a harsh verdict for the entire industry.

Electrical Engineering & System Sciences | 11–16 min read
Author: Dr. Alexey Petrov
«What hooked me about this story wasn't the technical side – that was all predictable. It was something else: twelve teams, six to eight weeks of work, the best available tools – and it still wasn't enough. This isn't a failure; it's an honest calibration. These are the kinds of results I find truly valuable: not the ones that show 'how well everything works,' but those that reveal just how far we still have to go. I wonder how many commercial products in this field would pass the same test – and be willing to publish the results.» – Dr. Alexey Petrov

Picture this. A rural medical camp somewhere in India. A young doctor is holding a consultation. The patient is speaking a mix of Hindi and a local dialect, a generator is humming nearby, a child is crying in the next room, and a staff member interrupts mid-conversation to double-check a medication dosage. In twenty minutes, the doctor has to ask about symptoms, explain the treatment, write a prescription, and remember key details for the medical record.

Now for the question: can a speech recognition program handle this? Not in a soundproof studio, not in the quiet of an office – but right here, in this chaos?

In 2023–2024, a group of researchers decided to find an honest answer to this question. They organized a competition called DISPLACE-M, which stands for «DIarization and Speech Processing for LAnguage understanding in Conversational Environments – Medical.» Behind the complex name lies a simple and brutal task: take real medical conversations, give them to development teams from around the world, and see what happens.

The results were instructive.

Why Do We Even Need an AI That Listens to Doctors?

Let's start with the obvious. A doctor spends a huge amount of time not on treatment, but on documentation. Notes, charts, referrals, discharge summaries – all of this is work that could technically be done automatically if a system knew how to listen to a conversation and extract the necessary information.

A 2021 World Health Organization report states it directly: most existing telemedicine solutions can't work with live speech. They don't understand natural dialogue, they can't distinguish a patient's question from a doctor's answer, and they are unable to automatically generate a concise summary of the appointment. This means that even the most advanced telemedicine platform still requires manual labor where it could theoretically be automated.

For wealthy countries with a single official language and high-quality microphones, this might be a solvable problem – though it's still a struggle. But let's take India. A country with twenty-two constitutionally recognized languages and hundreds of dialects. Where a patient might use three languages in a single sentence. Where appointments are often conducted in conditions far from studio-quality. Here, the task becomes fundamentally different.

It was for this exact scenario that DISPLACE-M was created. The organizers collaborated with HealthQuad – a major investor in digital health in India – and the startup PhonicAI, which provided medical records for evaluation. The task was framed honestly: not to create a polished prototype, but to test how well real systems perform on real data.

What Data Formed the Core of the Challenge?

The dataset collected for DISPLACE-M is perhaps the most valuable thing to come out of the entire project.

It contains 35 hours of live medical conversations. No actors, no scripts, no studio recordings. Real appointments: telemedicine centers, mobile medical camps, home visits. Patients and healthcare workers speak as they normally do – interrupting each other, switching from one language to another, using slang and local expressions. Of these, 25 hours were provided for training and system development, while 10 hours were reserved for a «blind» test, meaning the teams saw these recordings for the first time and couldn't fine-tune their systems for them.

Every conversation was manually annotated. This means that real people listened to all 35 hours and added labels: who is speaking at any given moment, what exactly is being said word-for-word, what each fragment is about, and what the brief summary of the entire conversation is. This is a colossal effort, and it's what makes this dataset a useful tool for evaluating artificial intelligence systems.

Why is such annotation so important? Because without it, you can't assess how well a system is working. If a program claims the doctor said one thing and the patient another, you need something to compare it against. The ground-truth annotation is that very «correct version» against which errors are measured.

The Four Tasks That Needed to Be Solved

The competition was divided into four separate tasks. Each represented a different level of complexity, and all were linked in a single chain: the output of one task became the input for the next.

Task One: Who Is Speaking

This is called speaker diarization. Roughly speaking, the task is this: listen to the recording and label each segment – is it the doctor speaking or the patient?

Sounds simple. In practice, it's not. When two people talk at the same time, when one cuts the other off mid-word, when background noise overlaps with speech – the algorithm starts to get confused. The error is measured by the DER (Diarization Error Rate) metric: the lower the number, the more accurately the system identified who spoke when.
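To make the metric concrete, here is a minimal sketch of DER under simplifying assumptions. DER is conventionally the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech time. The function name `frame_der` and the frame-by-frame label format are illustrative choices, not the challenge's scoring tool. The sketch also assumes that hypothesis labels are already mapped to reference labels and that only one speaker is active per frame, which sidesteps exactly the overlapping speech that makes the real task hard.

```python
def frame_der(reference, hypothesis):
    """Frame-level DER for two equal-length label sequences.

    Each element is a speaker label (e.g. "doctor") or None for non-speech.
    Assumes hypothesis labels are already mapped to reference labels and
    that at most one speaker is active in any frame.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must have the same number of frames")
    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1
            if hyp is None:
                missed += 1          # speech the system did not detect
            elif hyp != ref:
                confusion += 1       # speech assigned to the wrong speaker
        elif hyp is not None:
            false_alarm += 1         # non-speech labeled as speech
    if speech == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / speech


ref = ["doctor", "doctor", "patient", "patient", None, "doctor"]
hyp = ["doctor", "patient", "patient", None, "patient", "doctor"]
print(frame_der(ref, hyp))  # one confusion, one miss, one false alarm over 5 speech frames
```

Note that DER can exceed 1.0, because false alarms are counted against a denominator that only includes reference speech.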

Task Two: What Was Said

This is Automatic Speech Recognition (ASR) – converting audio into text. It's a task we're all familiar with from voice assistants on our phones. Only here, it's significantly more complex: medical terminology, multiple languages at once, noise, accents.

The error is measured by the tcpWER metric (time-constrained minimum-permutation word error rate) – a word-level error rate that accounts for timestamps and for which speaker each word is attributed to. Simply put: how many words did the system get wrong relative to the total number of words in the ground-truth transcription?
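The "simply put" version can be sketched in a few lines. The code below computes the plain word error rate: the minimum number of word substitutions, insertions, and deletions needed to turn the system output into the reference, divided by the reference length. It is not tcpWER itself, since it ignores timestamps and speakers; the function name `word_error_rate` and the example sentences are invented for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Plain WER: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word deleted
                curr[j - 1] + 1,     # extra hypothesis word inserted
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# A drug name split into two wrong words costs one substitution plus one insertion.
print(word_error_rate("take amlodipine once daily",
                      "take amlo pine once daily"))
```

A single mangled medication name is enough to push the error on this four-word instruction to 50%, which hints at why medical vocabulary weighs so heavily on the score.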

Task Three: What Was Discussed

Let's say we already have the text. Now we need to understand its structure: this fragment is about symptoms, this one about the diagnosis, this one about instructions for taking medication. This is topic classification, and it's needed to automatically structure the appointment record.

Quality is assessed using standard classification metrics, including the F1-score – the harmonic mean of precision and recall, which penalizes both false alarms and missed fragments.
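As a rough illustration of how such a score is computed, the sketch below derives a per-topic F1 from precision and recall and then averages across topics (macro averaging). The topic labels and the choice of macro averaging are assumptions made for the example; the text does not say which label set or averaging scheme the challenge used.

```python
def f1_per_class(true_labels, predicted_labels):
    """Return {label: F1} computed from per-class precision and recall."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for label in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(true_labels, predicted_labels)
    return sum(scores.values()) / len(scores)


# Hypothetical topic labels for five conversation fragments.
truth = ["symptoms", "symptoms", "diagnosis", "medication", "medication"]
preds = ["symptoms", "diagnosis", "diagnosis", "medication", "symptoms"]
print(f1_per_class(truth, preds))
print(macro_f1(truth, preds))
```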

Task Four: The Summary

The final task is the one closest to practical application. Take the entire conversation and condense it into a short summary: what is bothering the patient, what did the doctor say, what was prescribed. This is the text that could then be automatically entered into a medical record.

The quality of the summary is measured by the ROUGE-L metric, which scores the generated text by the longest sequence of words it shares, in the same order, with a human-written reference summary.
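A simplified version of this comparison fits in a short function. The sketch below finds the longest common subsequence (LCS) of words between a reference and a candidate summary and turns it into a balanced F-measure. Lowercasing, whitespace tokenization, and equal weighting of precision and recall are assumptions of this example; the exact scoring settings used in the challenge are not given in the text.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """ROUGE-L as a balanced F-measure over LCS precision and recall."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented example summaries; the shared subsequence is "patient fever for days".
print(rouge_l("patient reports fever for three days",
              "patient has fever for two days"))
```

The example also shows the metric's blind spot: "three days" versus "two days" costs only one word of overlap, even though in a medical record it is a real factual error.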

How the Baseline Systems Were Designed

The organizers didn't just throw the participants into the deep end. They provided baseline systems – starter solutions for each of the four tasks. This is important because it allows for a fair comparison: how much were the teams able to improve upon the starting point?

For diarization, the baseline system worked roughly like this: first, it determined where in the recording there was speech versus silence or noise. Then, for each speech fragment, a voice «fingerprint» – a mathematical representation of a specific person's voice characteristics – was calculated. These fingerprints were then clustered by similarity: similar voices belong to one speaker, dissimilar ones to different speakers. This approach works reasonably well in ideal conditions. In real-world ones, it's significantly worse.
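The clustering step can be illustrated with a toy example. The sketch below assumes the voice «fingerprints» have already been computed as numeric vectors, and groups them greedily by cosine similarity: each segment joins the most similar existing cluster or starts a new one. The greedy strategy, the threshold of 0.8, and the three-dimensional vectors are all illustrative assumptions; the text does not specify which clustering algorithm the baseline used, and real speaker embeddings have hundreds of dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_embeddings(embeddings, threshold=0.8):
    """Greedy clustering: one speaker index per segment embedding."""
    centroids = []  # running mean embedding of each cluster
    counts = []     # number of segments in each cluster
    labels = []
    for emb in embeddings:
        best, best_sim = None, -1.0
        for idx, centroid in enumerate(centroids):
            sim = cosine(emb, centroid)
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is not None and best_sim >= threshold:
            n = counts[best]
            # Update the running mean with the new segment.
            centroids[best] = [(c * n + e) / (n + 1)
                               for c, e in zip(centroids[best], emb)]
            counts[best] = n + 1
            labels.append(best)
        else:
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels


# Four toy segment embeddings: segments 0 and 2 are one voice, 1 and 3 another.
segments = [[0.9, 0.1, 0.0], [0.1, 0.95, 0.05],
            [0.85, 0.15, 0.05], [0.05, 0.9, 0.1]]
print(cluster_embeddings(segments))
```

With clean toy vectors the grouping is trivial. In a field recording, noise and overlapping voices pull the embeddings of different speakers closer together, and any fixed threshold starts to merge or split speakers incorrectly.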

For speech recognition, they took multilingual neural network models pre-trained on large volumes of audio data and fine-tuned them on the DISPLACE-M medical recordings. Additionally, noise suppression was applied, along with artificial «contamination» of the training data with noise to get the model accustomed to difficult conditions.
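The artificial «contamination» is usually done by adding recorded noise to clean speech at a chosen signal-to-noise ratio (SNR). The sketch below shows the arithmetic on plain lists of samples; the function name `mix_at_snr`, the synthetic sine-wave "speech", and the 10 dB target are assumptions for the example, not details taken from the baseline.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the result has the requested SNR (dB)."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both be non-silent")
    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)); solve for gain.
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


random.seed(0)
# One second of a 220 Hz tone at 16 kHz stands in for speech.
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [random.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=10)
```

During training, the SNR is typically drawn at random for each example so the model sees a range of conditions rather than a single noise level.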

For topic classification, language models were used – systems capable of converting text into numerical vectors that preserve the semantic relationships between words. Based on these vectors, a classifier was trained to assign text fragments to the correct topics.

For summarization, sequence-to-sequence models were employed – architectures that read a long text and generate a short one. Specifically, variants of BART and T5 were used, neural network models developed around 2019–2020 that have become the standard for text summarization tasks.

What the Results Showed

Twelve teams from different countries spent six to eight weeks working on improving the systems. This represents a serious amount of effort. And the results did indeed turn out better than the baselines – sometimes significantly so.

In diarization, the top teams used more powerful voice «fingerprints» – in particular, the ECAPA-TDNN architecture, introduced in 2020 for speaker verification. This yielded a noticeable improvement. They also experimented with jointly identifying speech segments and speakers in a single step, instead of two separate ones.

In speech recognition, improvements were achieved by using larger pre-trained models and fine-tuning them on Indian accents and medical vocabulary. The SpecAugment technique was used – randomly masking parts of the audio spectrum during training, which makes the model more robust to real-world distortions.
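The masking idea behind SpecAugment can be sketched without any audio library. The code below takes a spectrogram stored as a list of rows (one row per frequency bin, one column per time frame) and zeroes out one random band of frequencies and one random band of time frames. The function name, the mask sizes, and the choice to mask with zeros are assumptions for the example; the original SpecAugment recipe also includes time warping, which is omitted here.

```python
import random


def spec_augment(spectrogram, max_freq_mask=2, max_time_mask=3, rng=None):
    """Return a copy with one frequency band and one time band set to zero.

    spectrogram: list of rows, one row per frequency bin,
                 one column per time frame.
    """
    rng = rng or random.Random()
    n_freq, n_time = len(spectrogram), len(spectrogram[0])
    out = [row[:] for row in spectrogram]  # leave the input untouched

    # Frequency mask: zero `width` consecutive frequency bins.
    width = rng.randint(0, min(max_freq_mask, n_freq))
    start = rng.randint(0, n_freq - width)
    for i in range(start, start + width):
        out[i] = [0.0] * n_time

    # Time mask: zero `width` consecutive time frames in every bin.
    width = rng.randint(0, min(max_time_mask, n_time))
    start = rng.randint(0, n_time - width)
    for row in out:
        for j in range(start, start + width):
            row[j] = 0.0
    return out


# A dummy 8-bin by 10-frame spectrogram filled with ones.
spec = [[1.0] * 10 for _ in range(8)]
augmented = spec_augment(spec, rng=random.Random(0))
```

Because the masks are redrawn for every training example, the model cannot rely on any single frequency region or moment in time, which is roughly what a humming generator or a dropped audio packet forces on it in the field.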

In topic classification, language models fine-tuned on specialized medical texts showed good results. Some teams tried training a system on several related tasks simultaneously, which helped the model generalize its knowledge better.

In summarization, teams experimented with more powerful architectures, as well as a hybrid approach: first extracting key phrases and sentences, and then generating the summary based on them.

But here is the main conclusion, which the organizers stated directly: none of the systems reached a level sufficient for real-world application in medicine. Even the best results remained far from the accuracy that would be acceptable if the system were actually influencing medical documentation or helping a doctor make decisions.

Why Is This So Difficult?

Here's an analogy. Imagine you're asked to transcribe the speech of two people talking softly in a noisy café, periodically switching from Hindi to a local dialect and back, using professional jargon, and sometimes interrupting each other. And you have to not just write down the words, but also understand the conversation's structure and draft a report at the end. Difficult? This is exactly what is required of the system.

The researchers identified several key reasons why the task proved so hard.

The first is the complexity of real-world data. Spontaneous speech, noise, and overlapping speech are all fundamentally different from the conditions in which speech systems are typically trained and tested. Most public benchmarks use much «cleaner» data, and systems that perform well on them see a sharp drop in quality when faced with a true field recording.

The second reason is the language barrier. Existing multilingual models are trained on data where Indian languages are disproportionately underrepresented compared to English. And code-switching – mixing languages within a single sentence – is a whole separate linguistic reality for which specialized tools are almost nonexistent.

The third reason is the medical context. General-purpose language models don't know that «amlodipine» is a medication name and not a random string of sounds. Without specialized training on medical data, a system will consistently make mistakes on terms that appear in every other sentence a doctor says.

The fourth reason is systemic interdependence. An error in the first step (incorrectly identifying who is speaking) accumulates and is amplified at each subsequent step. If the diarization confuses the doctor and the patient, the transcript will attribute words to the wrong speaker, the topic classification will work from a distorted transcript, and the final summary may be meaningless. This is a fundamental difference from tasks where each step is independent.

What This Means in Practice

The honest answer to the question «When will AI be able to properly understand a doctor during an appointment?» is this: not anytime soon, and it will require serious work, not just another iteration of existing models.

What specifically is needed? First, far more annotated medical data in Indian languages. The DISPLACE-M dataset is an important step, but 35 hours of recordings is catastrophically insufficient for training reliable systems. For comparison, major English-language datasets for speech recognition are measured in thousands of hours.

Second, we need models specifically designed to handle code-switching, where multiple languages are combined in a single sentence. This isn't some exotic phenomenon; it's the daily reality of millions of medical interactions.

Third, we need a deeper understanding of the acoustics of real medical spaces: how to process specific types of noise, how to deal with echo in small rooms, how to work with recordings made on cheap microphones.

And finally, we need an honest standard for evaluation. DISPLACE-M is precisely an attempt to create such a standard. It's not about showing impressive numbers on clean data, but about testing systems on what they will actually have to work with. This is painful for developers, but it's essential for progress.

Why This Matters More Than It Seems

It's easy to dismiss this: so AI can't understand a doctor's conversation in an Indian medical camp – who cares? In reality, a lot of people should.

The shortage of healthcare professionals is a global problem. According to the WHO, the world could face a shortfall of about 10 million healthcare workers by 2030. Systems that can automate documentation and routine conversation analysis could free up doctors from paperwork and allow them to see more patients. But only if these systems work reliably – and don't generate errors in medical records.

This is why the bar must be set high. This is why an honest benchmark on real field data is more important than a flashy demo with studio recordings.

DISPLACE-M has shown that the work is just beginning. And that, paradoxically, is good news – because now, at least, it's clear where the real front line of the work is.

Original Title: Benchmarking Speech Systems for Frontline Health Conversations: The DISPLACE-M Challenge
Article Publication Date: Mar 3, 2026
Original Article Authors: Dhanya E, Ankita Meena, Manas Nanivadekar, Noumida A, Victor Azad, Ashwini Nagaraj Shenoy, Pratik Roy Chowdhuri, Shobhit Banga, Vanshika Chhabra, Chitralekha Bhat, Shareef babu Kalluri, Srikanth Raj Chetupalli, Deepu Vijayasenan, Sriram Ganapathy

From Research to Understanding

How This Text Was Created

This material is based on a real scientific study, not generated “from scratch.” At the beginning, neural networks analyze the original publication: its goals, methods, and conclusions. Then the author creates a coherent text that preserves the scientific meaning but translates it from academic format into clear, readable exposition – without formulas, yet without loss of accuracy.

Neural Networks Involved in the Process

We show which models were used at each stage – from research analysis to editorial review and illustration creation. Each neural network performs a specific role: some handle the source material, others work on phrasing and structure, and others focus on the visual representation. This ensures transparency of the process and trust in the results.

1. Research Summarization: highlighting key ideas and results (Gemini 2.5 Flash, Google DeepMind)
2. Creating Text from Summary: transforming the summary into a coherent explanation (Claude Sonnet 4.6, Anthropic)
3. Translation into English (Gemini 2.5 Pro, Google DeepMind)
4. Editorial Review: correcting errors and clarifying conclusions (Gemini 2.5 Flash, Google DeepMind)
5. Preparing Description for Illustration: generating a textual prompt for the visual model (DeepSeek-V3.2, DeepSeek)
6. Creating Illustration: generating an image based on the prepared prompt (FLUX.2 Pro, Black Forest Labs)
