Published on March 19, 2026

How AI Learns to 'Hear' What Matters: Extracting Data from Live Speech in Real Time

We explore how modern speech recognition systems have learned to extract specific data – phone numbers, addresses, and emails – from conversations on the fly.

Development · 5–8 min read

Event Source: AssemblyAI

Imagine a call center operator talking to a customer. The customer gives their address and phone number, and the system records this data in the correct fields as it is spoken. No manual entry, no need to ask again. This is exactly the problem that real-time entity extraction from live speech sets out to solve.

It might sound like a technical detail, but behind it lies a whole class of practical problems where the speed and accuracy of conversation processing directly impact the outcome.

What Are «Entities» and Why Extract Them

In the context of speech and text processing, entities are specific, structured units of information: names, addresses, phone numbers, email addresses, dates, amounts, and the like. Simply put, they are everything in a conversation that carries specific factual weight and can be used directly.

When a person says, «Sign me up, my phone number is plus seven nine hundred twelve...» – the system needs to do more than just convert sound to text. It must also understand that the sequence of digits is a phone number and not, say, a product code or a date. This is the task of entity extraction.

In offline mode – when a recording already exists and can be analyzed in its entirety – this is a well-understood problem. The difficulty arises when dealing with a live stream: the conversation isn't over, words are coming in continuously, and a decision must be made right now, without waiting for a pause.

Why Real Time Is a Separate Challenge

When a system works with a finished recording, it has the full context: it can see the beginning and end of a phrase, go back, and double-check. In streaming mode, this isn't possible. Words arrive one by one, and decisions have to be made on the fly: is this part of a phone number or just some numbers in the middle of a sentence?

Moreover, live speech is unpredictable. People misspeak, pause in unexpected places, and say a phone number with an intonation that doesn't match how it's typically read aloud. The system must handle all of this – without slowing down the conversation with noticeable delays.

This is precisely why streaming entity extraction is not just «the same thing, but faster.» It's a distinct engineering challenge.

Three Types of Data That Are Hardest to «Catch on the Fly»

Email addresses, phone numbers, and physical addresses are perhaps the most finicky categories for real-time recognition. Let's look at why.

Email Addresses

When someone dictates an email address aloud, they usually say it in parts: «john at gmail dot com.» The system must understand that «at» is the @ symbol, «dot» is «.», and assemble it all into a readable address. People say this differently: some say «at», others use a regional equivalent, and some just pause where the symbol should be.
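
As a rough illustration of the assembly step, here is a minimal Python sketch. It assumes the spoken email has already been isolated from the rest of the transcript, and the `SPOKEN_SYMBOLS` table and the `normalize_spoken_email` function are hypothetical names introduced for this example, not part of any particular product. A real system would also need locale-specific variants and a way to tell the symbol «at» from the ordinary word.

```python
import re
from typing import Optional

# Hypothetical mapping of spoken tokens to symbols (illustrative, not exhaustive).
SPOKEN_SYMBOLS = {
    "at": "@",
    "dot": ".",
    "underscore": "_",
    "dash": "-",
    "hyphen": "-",
}

# Deliberately simple shape check: local part, "@", domain with at least one dot.
EMAIL_PATTERN = re.compile(r"^[a-z0-9._+-]+@[a-z0-9-]+(\.[a-z0-9-]+)+$")


def normalize_spoken_email(phrase: str) -> Optional[str]:
    """Join spoken tokens into an email address, or return None if the
    result does not look like one."""
    tokens = phrase.lower().split()
    candidate = "".join(SPOKEN_SYMBOLS.get(tok, tok) for tok in tokens)
    return candidate if EMAIL_PATTERN.match(candidate) else None


print(normalize_spoken_email("john at gmail dot com"))  # john@gmail.com
print(normalize_spoken_email("call me at noon"))        # None
```

The final shape check is what keeps a phrase like «call me at noon» from being accepted as an address just because it contains the word «at».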

Phone Numbers

Phone numbers are dictated differently even within the same country: in groups of two or three digits, as a whole, with or without the country code. The system must be able to assemble them from fragments without confusing them with other numerical sequences in the speech.
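
One way to picture the assembly step is the Python sketch below. It is a simplified illustration under stated assumptions: it handles only English digit-by-digit dictation plus «double» and «triple», and does not cover grouped forms such as «nine hundred twelve». The function name `assemble_phone_number` and the 7-digit lower bound are assumptions for this example; the 15-digit upper bound follows the E.164 maximum length.

```python
from typing import List, Optional

DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9",
}
REPEATS = {"double": 2, "triple": 3}


def assemble_phone_number(
    tokens: List[str], min_digits: int = 7, max_digits: int = 15
) -> Optional[str]:
    """Build a digit string from spoken tokens; return None if any token
    is not numeric or the length is implausible for a phone number."""
    digits: List[str] = []
    prefix = ""
    repeat = 1
    for raw in tokens:
        tok = raw.lower()
        if tok == "plus" and not digits:
            prefix = "+"                      # leading international prefix
        elif tok in REPEATS:
            repeat = REPEATS[tok]             # applies to the next digit word
        elif tok in DIGIT_WORDS:
            digits.append(DIGIT_WORDS[tok] * repeat)
            repeat = 1
        else:
            return None                       # not a pure digit sequence
    number = "".join(digits)
    if not (min_digits <= len(number) <= max_digits):
        return None
    return prefix + number


spoken = "plus one four one five double five five zero one two three"
print(assemble_phone_number(spoken.split()))  # +14155550123
```

The length check is a crude guard against treating short numeric sequences, such as a quantity or a house number, as a phone number.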

Physical Addresses

This is perhaps the most complex case. An address isn't just a set of words; it's a structure: street, house number, apartment, city, postal code. In live speech, a person might list these components in any order, omit details they consider obvious, or add clarifications along the way. Recognizing where an address begins and ends is a non-trivial task in itself.
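
Because components can arrive in any order and some may never arrive, it can help to model an address as a record with optional fields that is filled in incrementally. The Python sketch below assumes some upstream step has already labeled each fragment with a component name; the `PartialAddress` class and its field names are hypothetical and chosen to mirror the list above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class PartialAddress:
    """An address that is filled in piece by piece as components are heard."""
    street: Optional[str] = None
    house_number: Optional[str] = None
    apartment: Optional[str] = None
    city: Optional[str] = None
    postal_code: Optional[str] = None

    def set_component(self, name: str, value: str) -> None:
        # Accept components in any order; a later value overwrites an earlier
        # one, which covers clarifications made along the way.
        if name not in {f.name for f in fields(self)}:
            raise ValueError(f"unknown address component: {name}")
        setattr(self, name, value.strip())

    def missing(self) -> List[str]:
        # Components not heard yet, in field-definition order.
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


addr = PartialAddress()
addr.set_component("city", "Berlin")
addr.set_component("street", "Hauptstrasse")
print(addr.missing())  # ['house_number', 'apartment', 'postal_code']
```

Tracking what is still missing gives the application a basis for deciding whether the address is complete or whether a follow-up question is needed.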

How It Works in Practice

At the core of such systems is a combination of two processes: first, speech is converted to text (this is called transcription), and then a model searches the text for entities – that is, it determines which fragment is a phone number, which is an address, and which is an email.

In streaming mode, both processes must run in parallel and with minimal latency. The system receives audio in chunks, transcribes them on the fly, and simultaneously analyzes the incoming text for significant fragments.

It's also important to consider that partial transcription results – that is, words the system hasn't fully «heard» yet – can change as new data arrives. This means an extracted entity may sometimes need to be refined or corrected when the next audio fragment comes in.
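
The refinement of entities across partial results can be sketched as follows. This is a minimal Python illustration, not any vendor's API: it assumes the transcriber delivers the current utterance as `(text, is_final)` updates, that digits already appear as numerals in the text, and that a simple regular expression stands in for the entity model. The `StreamingExtractor` class is a hypothetical name.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Stand-in for a real entity model: a run of 7 to 15 digits, optional "+".
PHONE_RE = re.compile(r"\+?\d{7,15}")


@dataclass
class StreamingExtractor:
    committed: List[str] = field(default_factory=list)  # from finalized text
    tentative: List[str] = field(default_factory=list)  # from latest partial

    def on_transcript(self, text: str, is_final: bool) -> None:
        """Re-run extraction on every update of the current utterance."""
        found = PHONE_RE.findall(text)
        if is_final:
            # The text will no longer change, so the entities are kept.
            self.committed.extend(found)
            self.tentative = []
        else:
            # A partial result replaces whatever was extracted before.
            self.tentative = found


ex = StreamingExtractor()
ex.on_transcript("my number is 4155550", is_final=False)
print(ex.tentative)   # ['4155550']
ex.on_transcript("my number is 4155550123", is_final=True)
print(ex.committed)   # ['4155550123']
```

Keeping tentative and committed entities apart lets the interface show a value early while still allowing it to be corrected once the final text arrives.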

Where It's Already Being Used

There are many practical scenarios. Here are a few where this technology already makes sense or is actively in use:

Call centers and support services. Automatically recording a customer's contact information during a conversation, without operator intervention.
Medical appointments and consultations. A doctor or assistant dictates patient data aloud, and the system immediately structures it into a record.
Voice assistants. When you ask an assistant to «save an address» or «record a number», this is the mechanism working behind the scenes.
Dispatch services. An emergency line operator takes a call, and the system records the address and contact details in the background, saving critical seconds.

What Still Remains a Challenge

Despite the progress, the technology has understandable limitations.

Accents and dialects still pose difficulties, especially when a person pronounces numbers or special characters unconventionally. Background noise also has an impact: in a crowded place or with a poor connection, transcription becomes less accurate, which in turn affects data extraction.

Another problem is contextual ambiguity. The phrase «call me on eight nine hundred twelve» might be the start of a phone number in one context, but just a number in another. The system needs to rely on context, and in real time, the context is always incomplete.

Finally, different languages and data formats require separate configuration. A phone number in Russia and one in Germany look and sound different, and there is no universal solution yet – systems are most often tailored to a specific market or format.
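
Per-market tailoring often comes down to configuration. The Python sketch below shows one possible shape for it; the `PhoneFormat` class, the `PHONE_FORMATS` table, and the `to_international` function are hypothetical. The country codes (7 for Russia, 49 for Germany) and the 10-digit Russian national number are standard, while the German length bounds are an illustrative assumption, since German numbers vary in length.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PhoneFormat:
    country_code: str
    min_national_digits: int
    max_national_digits: int


PHONE_FORMATS = {
    "RU": PhoneFormat("7", 10, 10),
    "DE": PhoneFormat("49", 6, 13),  # assumed bounds, for illustration only
}


def to_international(national_digits: str, locale: str) -> Optional[str]:
    """Validate a national number against the locale's length rules and
    return it with the country code, or None if it does not fit."""
    fmt = PHONE_FORMATS.get(locale)
    if fmt is None or not national_digits.isdigit():
        return None
    if not (fmt.min_national_digits <= len(national_digits) <= fmt.max_national_digits):
        return None
    return f"+{fmt.country_code}{national_digits}"


print(to_international("9121234567", "RU"))  # +79121234567
print(to_international("912", "RU"))         # None
```

Adding a market then means adding a table entry rather than changing the extraction logic, though spoken-form differences between languages would still need separate handling.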

Why This Field Will Continue to Evolve

The demand for automating routine data operations isn't going away – quite the opposite. The more interactions shift to a voice format (call centers, voice assistants, dictation), the more acute the need for a system that doesn't just hear, but also understands the structure of what is said.

Entity extraction from live speech is one of those cases where the difference between «almost works» and «works reliably» is very noticeable in practice. An error in a single digit of a phone number or a misrecognized postal code in an address can render the data useless.

This is why accurately and quickly extracting structured data from streaming audio remains an area of active development. It's not the most visible technology, but many everyday services depend on it for their reliability.

Link to Original: https://www.assemblyai.com/blog/real-time-entity-extraction-from-audio

Original Title: Real-time entity extraction from speech: Capturing emails, phone numbers, and addresses in live audio

Publication Date: Mar 18, 2026

AssemblyAI www.assemblyai.com A U.S.-based AI company developing speech recognition and audio intelligence models, providing developer APIs for transcription, voice analysis, and voice-driven applications.

