When a language model processes a query, a great deal of data moves between servers behind the scenes. The more complex the task, the more data has to move. The Perplexity AI team set out to see whether this process could be sped up, and it turned out to be quite possible: let servers talk directly to each other, bypassing the extra links in the chain.
Why Language Models Need Faster Data Transfer Between Servers
The Problem at Hand
Modern language models don't run on a single machine; they are distributed across multiple servers. When a model needs to perform computations or transfer data from one part of the system to another, the server's central processing unit (CPU) is usually involved: like a dispatcher, it receives the data, processes it, and sends it on its way. This works, but it creates a bottleneck: the CPU spends its time moving information around instead of doing actual computation.
This is felt most sharply in newer model architectures: for instance, when one neural network generates draft answers while another checks and refines them, or when a system queries external sources while generating text. In these scenarios, servers exchange data very actively, and every transfer goes through the CPU.
What RDMA Is and Why It Matters
RDMA stands for Remote Direct Memory Access, and the name describes exactly what it does. Simply put, it is a way of transferring data in which one server writes information directly into the memory of another without involving that server's processor. Imagine that instead of passing a letter through a secretary, you just place it directly on your colleague's desk.
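To make the secretary analogy concrete, here is a toy simulation in Python. It is not real RDMA code and uses no RDMA library: the Host class, register_region, receive_via_cpu and direct_write are all hypothetical names invented for this illustration. It only shows the difference in who does the work: on the conventional path the receiving CPU handles the transfer, while on the direct path the data lands in a pre-registered memory region and the receiver's CPU is never invoked.

```python
class Host:
    """Toy stand-in for a server: some memory plus a counter of CPU involvement."""

    def __init__(self, name, mem_size):
        self.name = name
        self.memory = bytearray(mem_size)
        self.cpu_interrupts = 0   # times this host's CPU had to handle a transfer
        self.registered = {}      # key -> (offset, length) regions opened for remote writes

    def register_region(self, key, offset, length):
        """Expose a slice of memory for direct remote writes (hypothetical API)."""
        if offset < 0 or offset + length > len(self.memory):
            raise ValueError("region outside host memory")
        self.registered[key] = (offset, length)

    def receive_via_cpu(self, offset, payload):
        """Conventional path: the receiving CPU is woken up and copies the data itself."""
        self.cpu_interrupts += 1
        self.memory[offset:offset + len(payload)] = payload


def direct_write(target, key, payload):
    """One-sided write: data lands in a registered region, the target CPU is not involved."""
    offset, length = target.registered[key]
    if len(payload) > length:
        raise ValueError("payload larger than registered region")
    target.memory[offset:offset + len(payload)] = payload


receiver = Host("server-b", mem_size=64)
receiver.register_region(key="kv-cache", offset=16, length=32)

receiver.receive_via_cpu(0, b"via cpu")            # goes through the "secretary"
direct_write(receiver, "kv-cache", b"direct")      # placed straight on the "desk"

print(receiver.cpu_interrupts)          # 1: only the conventional transfer touched the CPU
print(bytes(receiver.memory[16:22]))    # b'direct'
```

In real hardware the network card performs the write and enforces access to registered memory; the sketch only mirrors that division of labor.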
The technology isn't new – it has been used in high-performance computing for a long time – but its application for language model systems is only just starting to gain momentum. The reason is that data exchange patterns in AI systems differ from classic supercomputer tasks.
RDMA Applications in Speculative Decoding and RAG Systems
How It Works in Practice
The Perplexity team has developed a tool that allows servers in a language model system to transfer data to each other directly, or "point-to-point", as engineers call it. This is particularly useful in two situations.
The first is speculative decoding. The idea is that a small, fast model quickly drafts a continuation several tokens ahead, and a larger model checks the draft and keeps only the part it agrees with. These models usually reside on different servers and need to constantly exchange intermediate results. With RDMA, this exchange happens much faster.
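A minimal sketch of one common variant, greedy draft-and-verify, may help show why the two models exchange data so often. Everything here is an assumption made for illustration: speculative_step, draft_next and target_next are hypothetical names, and the two "models" are tiny functions rather than neural networks. Production systems typically verify all drafted tokens in a single batched pass of the large model; the loop below does it one position at a time for readability.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of greedy speculative decoding; returns the tokens accepted this round.

    draft_next and target_next each map a list of tokens to the next token.
    """
    # 1. The small model drafts k tokens ahead.
    draft = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # 2. The large model checks each drafted position in order.
    accepted = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)   # replace the first mismatch and stop
            return accepted
        accepted.append(token)

    # 3. Every draft token matched, so the large model adds one token of its own.
    accepted.append(target_next(prefix + accepted))
    return accepted


TARGET_TEXT = "the cat sat on the mat".split()

def target_next(tokens):
    # Hypothetical "large" model: always continues the reference sentence.
    return TARGET_TEXT[len(tokens)] if len(tokens) < len(TARGET_TEXT) else "<eos>"

def draft_next(tokens):
    # Hypothetical "small" model: agrees with the large one except at position 3.
    return "in" if len(tokens) == 3 else target_next(tokens)

print(speculative_step(["the"], draft_next, target_next, k=4))
# ['cat', 'sat', 'on']
```

Each round involves one trip from the draft model to the large model and back. When the two live on different servers, that round trip is exactly the transfer that a direct path shortens.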
The second is retrieval-augmented generation (RAG), where the model queries an external knowledge base while forming its response. For example, it searches documents for up-to-date information to give a more accurate answer. This, too, calls for very low-latency data transfer between system components.
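The retrieve-then-generate flow can be sketched in a few lines. This is a deliberately simplified illustration: retrieve and build_prompt are hypothetical helpers, and the word-overlap scoring stands in for the vector search a real system would use. In a distributed deployment the retrieve step would be a call to another server, which is where transfer latency enters the picture.

```python
def retrieve(query, documents, top_k=1):
    """Rank documents by shared words with the query (a stand-in for real vector search)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, documents):
    """Attach the retrieved passages to the question before it goes to the model."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"


docs = [
    "RDMA lets one server write into another server's memory",
    "Speculative decoding pairs a small draft model with a large one",
]
prompt = build_prompt("what is speculative decoding", docs)
print(prompt)
```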
Performance Benefits and Latency Reduction with RDMA
What It Delivers in Numbers
Researchers conducted tests and found that direct data transfer significantly reduces latency. In some scenarios, wait times are cut several-fold compared to the standard approach where data passes through the CPU. This isn't just an abstract improvement; the user actually gets a response from the model much faster.
The difference is especially noticeable when transferring large volumes of data or in long chains of servers. The more complex the system architecture, the greater the payoff from using direct transfer.
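A back-of-the-envelope model shows why the payoff grows with both data volume and chain length. The numbers below are placeholders picked to keep the arithmetic simple; they are not measurements from the Perplexity tests, and the function names and the assumption of two extra memory copies per hop on the CPU path are illustrative only.

```python
def cpu_path_latency(size_bytes, hops, link_bw, copy_bw, cpu_overhead, copies=2):
    """Each hop pays CPU handling, `copies` extra memory copies, and the wire transfer."""
    per_hop = cpu_overhead + copies * size_bytes / copy_bw + size_bytes / link_bw
    return hops * per_hop


def direct_path_latency(size_bytes, hops, link_bw, nic_overhead):
    """Each hop pays a small fixed setup cost and the wire transfer only."""
    return hops * (nic_overhead + size_bytes / link_bw)


# Placeholder parameters (bytes per second and seconds), NOT measured values.
LINK_BW = 10e9
COPY_BW = 10e9
CPU_OVERHEAD = 20e-6
NIC_OVERHEAD = 2e-6


def saving(size_bytes, hops):
    """Seconds saved by the direct path under the placeholder parameters above."""
    slow = cpu_path_latency(size_bytes, hops, LINK_BW, COPY_BW, CPU_OVERHEAD)
    fast = direct_path_latency(size_bytes, hops, LINK_BW, NIC_OVERHEAD)
    return slow - fast


for size, hops in [(1_000_000, 1), (1_000_000, 4), (100_000_000, 1)]:
    print(f"{size:>11} bytes, {hops} hop(s): saves {saving(size, hops) * 1e6:.0f} microseconds")
```

In this model the saving per hop is the overhead difference plus the cost of the avoided copies, so it scales with the number of hops and with the payload size, which matches the qualitative claim above.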
Future of RDMA Implementation in AI Infrastructure
Why It Matters Now
Language models are growing in scale, and the tasks they solve are becoming more diverse. It's no longer enough to just take text and generate a response. Modern systems combine multiple models, access external data, and use specialized components for different subtasks. All of this requires an active exchange of information between servers.
Classical data transfer methods are becoming a bottleneck for such scenarios. RDMA is one way to remove this hurdle. It's not the only way, of course, but it is quite elegant: the technology is already proven in other fields; it just needs to be properly adapted to the specifics of how neural networks operate.
What's Next
The Perplexity team has made the results of their work open-source so that other researchers and engineers can implement this approach. This isn't a final solution for all communication problems in AI systems, but it's a major step toward creating an efficient infrastructure.
It is quite likely that we will see more solutions like this in the near future: the industry is actively seeking ways to make models faster and cheaper without sacrificing quality. Direct server-to-server data transfer is one of those opportunities that is gradually becoming an industry standard.