If you've ever chosen a service for automatic speech-to-text or subtitling, you've likely come across the acronym WER – Word Error Rate. It's the primary benchmark in the industry: the lower the WER, the more accurate the model. Makes sense, right? In practice – not quite.
What Is WER and Why Does Everyone Focus on It?
Simply put, WER is the share of words a model got wrong: words it replaced, dropped, or added, divided by the number of words actually spoken. If 100 words were spoken in the audio and the transcription has 5 such errors, the WER is 5%. It sounds like a fair way to measure quality.
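To make the counting concrete, here is a minimal sketch of the conventional calculation: the word-level edit distance between the reference and the model's output, divided by the number of reference words. The function name `wer` and the sample sentences are illustrative, and the sketch assumes simple lowercasing and whitespace splitting; real evaluation tools apply more careful text normalization.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dist[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)


# Made-up example: one substituted word ("jumped") and one dropped word ("the").
reference = "the quick brown fox jumps over the lazy dog today"
hypothesis = "the quick brown fox jumped over lazy dog today"
print(f"WER: {wer(reference, hypothesis):.0%}")  # WER: 20%
```

Because added words count as errors too, the result can exceed 100% when the output is much longer than the reference.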
This is why speech model developers are eager to publish this metric. It's understandable, numerical, and easy to compare. But behind this simplicity lie several serious problems.
The Same Text, Different Answers
The first trap is in the calculation principle itself. WER compares the model's output with a “reference” transcript created by a human. But people transcribe differently.
For example, the phrase “well, basically, yeah” could be written out in full, with or without the commas, or shortened to just “yeah” – depending on who created the reference transcript and what the instructions were. If the model transcribes it differently, it's counted as an error, even if the meaning is correct.
The result: two equally accurate services can get completely different WER scores simply because the reference text was written in one style and not another.
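A small sketch of this effect, using the phrase from the example above. The two reference transcripts are hypothetical, the `wer` helper is illustrative rather than taken from any library, and all texts are assumed to be already lowercased with punctuation removed.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


model_output = "well basically yeah"

verbatim_reference = "well basically yeah"  # annotator kept every filler word
clean_reference = "yeah"                    # annotator was told to drop fillers

print(wer(verbatim_reference, model_output))  # 0.0
print(wer(clean_reference, model_output))     # 2.0
```

The model's output is identical in both runs. Only the annotation style changed, and the score moved from 0% to 200%.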
Test Sets Aren't Real Life
The second problem is what the models are tested on. Most public benchmarks use standard datasets: lecture recordings, news programs, and prepared speeches. The speech in them is clear, the pace is moderate, and the accent is neutral.
The real world is different. There's cafe noise, phone calls, regional accents, people who speak quickly, pause mid-sentence, or interrupt each other. On such audio, the WER of a “benchmark top performer” might turn out to be far from impressive.
In short: performing well on a standard test is not the same as working well with your specific content.
Punctuation, Casing, and “Invisible” Errors
WER is typically calculated without considering capitalization and punctuation. This seems reasonable – why penalize a model for writing “Moscow” in lowercase? But there's a downside.
Imagine a transcript of a business meeting where all names are in lowercase, there isn't a single comma, and periods are placed randomly. The WER might be perfect, but the text is unreadable and unusable without manual editing.
In tasks where structure and formatting are important – such as meeting minutes, subtitles, or medical records – this nuance becomes critical.
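The sketch below shows why such errors stay invisible. It assumes a typical normalization step (lowercasing and stripping ASCII punctuation) applied before scoring; the `normalize` helper and both sample transcripts are made up for illustration.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, remove ASCII punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


readable = "Anna, please send the report to Mr. Lee by Friday. Thanks!"
raw = "anna please send. the report to mr lee by friday thanks"

# After normalization the two word sequences are identical,
# so a WER computed on them would report zero errors.
print(normalize(readable) == normalize(raw))  # True
```

If formatting matters for your use case, it has to be checked on the raw output, before any normalization is applied.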
One Missed Word and Everything Falls Apart
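Another feature of WER is that it treats all errors as equal. A missed conjunction like “and” and a missed patient's name in a medical document are identical units in the formula. But the consequences are completely different.
Furthermore, a single slip can cost more than one error. If the model fuses two words into one or splits one word in two, WER counts a substitution plus a deletion or insertion, even though a reader would see one mistake. And when a scoring tool compares transcripts position by position rather than aligning them first, one dropped word can “shift” the count so that the following words are marked as errors too – even if the model recognized them correctly. In those cases the metric penalizes the model not just once, but in a cascade.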
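How hard one slip is penalized depends on how the two transcripts are lined up. The sketch below contrasts a naive position-by-position comparison with the edit-distance alignment that WER is conventionally based on. Both function names and the sample sentences are illustrative assumptions, not part of any specific tool.

```python
def positional_errors(ref: list[str], hyp: list[str]) -> int:
    """Naive count: compare word i with word i, plus any length difference."""
    mismatches = sum(r != h for r, h in zip(ref, hyp))
    return mismatches + abs(len(ref) - len(hyp))


def aligned_errors(ref: list[str], hyp: list[str]) -> int:
    """Word-level edit distance: fewest substitutions, deletions, insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # match or substitution
        prev = curr
    return prev[-1]


ref = "please send the report to the whole team".split()
dropped = "please the report to the whole team".split()     # "send" is missing
merged = "please send the report to the wholeteam".split()  # two words fused

print(positional_errors(ref, dropped), aligned_errors(ref, dropped))  # 7 1
print(positional_errors(ref, merged), aligned_errors(ref, merged))    # 2 2
```

With alignment, the dropped word costs one error instead of seven. The fused word still costs two either way, which is worth keeping in mind when a score looks worse than the transcript reads.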
What Can Be Done About It
Does all this mean that WER is useless? No. It's still a convenient way to quickly compare models under the same conditions. The problem isn't the metric itself, but how it's used: as the sole and final argument.
Here are a few practical guidelines for those choosing a transcription tool:
- Test it on your own audio. Public benchmarks are someone else's data under someone else's conditions. Upload real recordings from your use case and compare the results manually.
- Focus on what matters most to you. If you need readable subtitles, check the punctuation and structure. If names and terms are important, check their accuracy separately.
- Consider the speech type. Conversational speech, interviews, business calls, medical consultations – these are all different operating modes for a model.
- Don't rely on a single number. WER is just one indicator, not the whole picture of the model.
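As one way to check names and terms separately, the sketch below lists which expected terms are missing from a transcript. Everything in it is an assumption made for illustration: the helper names, the term list, and the sample transcript are invented, and exact word matching is only a starting point, since it will not catch spelling variants.

```python
import string


def tokens(text: str) -> list[str]:
    """Lowercase, remove ASCII punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def contains_term(words: list[str], term_words: list[str]) -> bool:
    """True if term_words occurs as a contiguous run inside words."""
    n = len(term_words)
    return any(words[i:i + n] == term_words for i in range(len(words) - n + 1))


def missed_terms(key_terms: list[str], transcript: str) -> list[str]:
    """Return the key terms that do not appear in the transcript."""
    words = tokens(transcript)
    return [t for t in key_terms if not contains_term(words, tokens(t))]


# Invented example: terms you expect to hear in the recording.
key_terms = ["Okafor", "metformin", "500 mg"]
transcript = "Doctor Okafor prescribed met forming, 500 mg twice a day."

print(missed_terms(key_terms, transcript))  # ['metformin']
```

A transcript like this one can score well on WER overall and still miss the one word that matters, which is why a separate term check is worth running alongside the headline number.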
Why This Matters Right Now
Speech technologies are now being integrated into a wide variety of products, from online meeting tools to customer service systems and medical documentation. As the stakes get higher, so does the cost of choosing the wrong tool – all because someone only compared a single number in a marketing chart.
Model developers themselves – notably, the AssemblyAI team that raised this issue – admit that the industry needs fairer, more multidimensional ways to assess quality. WER will remain part of this toolkit, but it shouldn't be the only yardstick.
This isn't a call to distrust numbers. It's a reminder that behind every metric is a methodology – and you should understand it before making decisions based on it.