The Perplexity team has published an article detailing how they managed to adapt training technology for trillion-parameter models to run on the AWS cloud platform. Long story short: they took an existing approach that was hardwired to NVIDIA hardware and rewrote it to work efficiently on Amazon's standard network infrastructure.
The Trillion-Parameter Problem
Modern large language models just keep growing. A couple of years ago, a model with 100–200 billion parameters was considered massive; now we are talking about a trillion or more. The problem is that such models physically do not fit into the memory of a single GPU – not even the most powerful one.
Therefore, models have to be "spread out" across multiple devices. But once the GPU count reaches hundreds or thousands, another complication arises: the devices must constantly exchange data with one another. If that link is slow, the entire training process turns into an endless waiting game.
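A quick back-of-the-envelope calculation shows why a single GPU is out of the question. The byte counts below are common rules of thumb (bf16 weights plus Adam-style optimizer state), not figures from Perplexity's article:

```python
# Rough memory estimate for a 1-trillion-parameter model.
# Illustrative numbers: bf16 weights (2 bytes/param) plus Adam-style
# optimizer state (~12 extra bytes/param is a common rule of thumb).

PARAMS = 1_000_000_000_000          # 1T parameters
BYTES_WEIGHTS = 2                   # bf16 weights
BYTES_OPTIMIZER = 12                # fp32 master copy + two Adam moments

GIB = 1024 ** 3
weights_gib = PARAMS * BYTES_WEIGHTS / GIB
total_gib = PARAMS * (BYTES_WEIGHTS + BYTES_OPTIMIZER) / GIB

GPU_MEMORY_GIB = 80                 # e.g. a single 80 GB accelerator
min_gpus = -(-total_gib // GPU_MEMORY_GIB)   # ceiling division

print(f"weights alone: {weights_gib:.0f} GiB")          # ~1863 GiB
print(f"with optimizer state: {total_gib:.0f} GiB")     # ~13039 GiB
print(f"GPUs needed just to hold the state: {min_gpus:.0f}")
```

Even before activations and gradients enter the picture, the weights alone are more than twenty times the capacity of an 80 GB card, so sharding across hundreds of devices is unavoidable.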
How This Is Usually Solved
NVIDIA offers a technology called NVLink for such tasks. It is a specialized high-speed bus that connects GPUs within a single server or between servers. It is fast, but there is a catch: it is a proprietary solution that requires specific hardware and has poor compatibility with other platforms.
There is an open-source framework called Megatron-LM from NVIDIA, which can train huge models by distributing them across many GPUs. However, it was originally designed specifically for NVLink. If you do not have access to this technology, you are, roughly speaking, "out of the game".
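To make the communication requirement concrete, here is a toy, framework-free sketch of the tensor parallelism that Megatron-LM popularized: a weight matrix is split column-wise across "GPUs", each shard computes a partial output, and the partial outputs must then be gathered over the interconnect. All names and numbers are illustrative:

```python
# Toy tensor parallelism: split a weight matrix by columns across devices,
# compute partial outputs locally, then gather them over the network.
# Pure-Python stand-in for what NVLink/NCCL does at scale.

def matmul(x, w):
    """x: list of rows, w: list of rows -> x @ w."""
    cols = len(w[0])
    return [[sum(xi[k] * w[k][j] for k in range(len(w))) for j in range(cols)]
            for xi in x]

def split_columns(w, parts):
    """Split w column-wise into `parts` equal shards."""
    n = len(w[0]) // parts
    return [[row[i * n:(i + 1) * n] for row in w] for i in range(parts)]

x = [[1.0, 2.0]]                     # one activation row
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(w, parts=2)   # each "GPU" holds half the columns
partials = [matmul(x, s) for s in shards]

# all-gather step: concatenate the partial outputs from every device
gathered = [sum((p[0] for p in partials), [])]
assert gathered == matmul(x, w)      # matches the unsharded computation
print(gathered)
```

The gather step is the part that hits the interconnect on every layer of every forward pass, which is why its speed dominates training throughput.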
What Perplexity Did
The Perplexity team decided to break this dependency. They rewrote parts of Megatron-LM so that the framework could operate via AWS EFA (Elastic Fabric Adapter) – Amazon's networking technology that provides high-speed communication between servers in the cloud. EFA works over standard cloud networking interfaces and is not tied to a specific GPU vendor's interconnect.
Now, trillion-parameter models can be trained on standard AWS cloud instances without NVIDIA's specialized interconnect hardware. This makes the process more flexible: you can rent capacity from Amazon, train the model, and not worry about the infrastructure being locked into a single vendor.
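For context, NCCL traffic can already be routed over EFA via AWS's aws-ofi-nccl libfabric plugin. A typical environment setup on an EFA-enabled instance looks roughly like the sketch below; the variable names come from libfabric/NCCL conventions, not from Perplexity's article, and exact values depend on driver and plugin versions:

```shell
# Illustrative only: point NCCL at EFA through the aws-ofi-nccl plugin
# (exact variables depend on libfabric/plugin/driver versions).
export FI_PROVIDER=efa              # tell libfabric to use the EFA provider
export FI_EFA_USE_DEVICE_RDMA=1     # enable GPUDirect RDMA where supported
export NCCL_DEBUG=INFO              # check the logs that EFA is actually used
# the plugin's libnccl-net.so must be on the library path:
export LD_LIBRARY_PATH=/opt/aws-ofi-nccl/lib:$LD_LIBRARY_PATH
```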
Why This Matters 🤔
First, it lowers the barrier to entry. While training ultra-large models previously required either purchasing expensive servers with NVLink support or renting them from a handful of providers, it is now possible to use publicly available cloud infrastructure.
Second, it is a matter of portability. When a framework works with only one technology, you are effectively held hostage to it. If a better offer from another cloud provider appears tomorrow, moving the training process there would be difficult or even impossible. Perplexity's solution makes development less dependent on a specific supplier.
Third, it opens up new opportunities for researchers and smaller teams who may not have the budget for exclusive hardware but do have access to major cloud platforms.
Under the Hood
Without diving too deep into the technical weeds: the primary work involved replacing the communication layer. Megatron-LM uses NCCL (NVIDIA Collective Communications Library) – a library for data exchange between GPUs. This library is optimized for NVLink and can perform poorly over other types of interconnect.
The Perplexity team adapted the framework to use AWS EFA efficiently. According to them, this required rethinking some data distribution and synchronization algorithms, but they eventually achieved performance sufficient for training models at a trillion-parameter scale.
Limitations and Questions
It is important to understand that this is not a silver bullet. Perplexity does not claim that their approach is faster or more efficient than training via NVLink. Rather, it is a compromise: you gain greater flexibility and hardware independence, but you might sacrifice some raw performance.
There also remains the question of how easily this approach scales to other cloud platforms. AWS EFA is still a proprietary solution from one specific provider. If someone wants to repeat a similar trick on Google Cloud or Azure, additional adaptation for their network protocols will be required.
Finally, Perplexity's article is more of a description of a concept and an architectural approach than a ready-to-use open-source tool. It is still unclear whether the company plans to release the code to the public or if it will remain an internal development.
What This Means for the Industry
Perplexity's work shows that dependency on closed technologies is not a "death sentence". Even in resource-intensive tasks like training trillion-parameter models, one can find paths toward greater openness and cross-platform compatibility.
This is especially relevant now, as the cost of training neural networks continues to rise and competition between cloud giants intensifies. The ability to choose a platform without being tied to specific hardware could be a deciding factor for many developers.
It remains to be seen whether other companies follow this example and how widely the approach takes root in the industry in the coming years.