When people talk about artificial intelligence, the conversation usually revolves around the models themselves: which one is smarter, which responds faster, which writes better code. But behind the scenes of this race, there's a completely different kind of work – meticulous, subtle, and yet fundamental. It's carried out by people who focus on what are known as kernels – low-level software components that directly control how GPUs perform computations.
This is precisely what the team at Together AI does. And if you've heard of FlashAttention or ThunderKittens, know that both are their handiwork.
Why Do We Need “Kernels” in the First Place?
Simply put, a GPU is an extremely powerful piece of hardware, but to make it operate at full capacity for AI tasks, someone has to write specialized low-level code that tells the chip what to do and in what order. These routines are called kernels: functions that run on the GPU itself and are executed by many threads in parallel.
Most AI system developers work a level or two higher up: they use pre-built libraries and frameworks that already handle all this “manual” work with the hardware. But someone has to write those libraries, too. Someone has to ensure that new GPU architectures are used as efficiently as possible, rather than leaving half of the chip's potential idle.
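To make the idea concrete, here is a minimal sketch in plain Python that imitates the kernel programming model. It is not real GPU code: the names `vector_add_kernel` and `launch` are hypothetical, and the "threads" run one after another in a loop, whereas a GPU would run them in parallel. What it does show is the typical shape of a kernel: a function written from the point of view of a single thread, launched over a grid of thread indices.

```python
def vector_add_kernel(thread_id, a, b, out):
    """Work done by a single 'thread': handle exactly one element."""
    if thread_id < len(out):  # guard: the grid may be larger than the data
        out[thread_id] = a[thread_id] + b[thread_id]


def launch(kernel, num_blocks, threads_per_block, *args):
    """Hypothetical launcher: runs every thread in the grid sequentially.

    A real GPU would execute these threads in parallel.
    """
    for block in range(num_blocks):
        for thread in range(threads_per_block):
            global_id = block * threads_per_block + thread
            kernel(global_id, *args)


a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [10.0, 20.0, 30.0, 40.0, 50.0]
out = [0.0] * len(a)

# 2 blocks of 4 threads = 8 threads covering 5 elements;
# the bounds check in the kernel makes the 3 extra threads do nothing.
launch(vector_add_kernel, 2, 4, a, b, out)
print(out)
```

Even in this toy form, the questions a kernel author has to answer are visible: how the work is split across threads, how many threads to launch, and what happens at the edges of the data.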
The Together AI team works on precisely this intermediate layer – between the hardware and the models that run on that hardware.
FlashAttention: When Optimization Changes Everything
One of the team's most famous projects is FlashAttention. In short, it's a way to significantly speed up one of the key operations in modern language models: the attention mechanism. This operation is crucial for the model to “understand” relationships between words and parts of a text, but it is also one of the most resource-intensive.
FlashAttention reimagined how this operation is executed on the GPU: instead of constantly moving data between the GPU's large but slow main memory and its small but fast on-chip memory, the algorithm processes the computation in blocks, keeping data in fast memory for as long as possible and never writing out the full attention matrix. The output is the same as with standard attention, but it is computed noticeably faster and with less memory.
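The core trick can be illustrated with a small sketch in plain Python. This is not the FlashAttention implementation, which is a carefully tuned GPU kernel; it only demonstrates the blockwise "running softmax" idea for a single query vector, with the usual scaling factor left out for brevity. The function names and `block_size` are illustrative assumptions. The blockwise version visits keys and values one block at a time and keeps only a running maximum, a running normalizer, and a running output, yet matches the straightforward computation.

```python
import math


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def naive_attention(q, keys, values):
    """Reference: score every key, softmax over all scores, mix the values."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def blockwise_attention(q, keys, values, block_size=2):
    """Same result, but the full list of scores is never stored."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0], [0.5, 0.5]]
values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0],
          [1.0, 0.0, 1.0], [2.0, 2.0, 2.0]]

print(naive_attention(q, keys, values))
print(blockwise_attention(q, keys, values, block_size=2))
```

On a GPU, each block would be sized to fit in fast on-chip memory, which is where the savings in data movement come from.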
This is more than just a technical detail. FlashAttention has influenced how many modern models are structured and has become one of those “quiet” inventions that are almost invisible to the end user but are critically important for the entire industry.
ThunderKittens: The Tool That Builds Tools
Another project from the team, ThunderKittens, addresses a broader challenge. Writing efficient kernels by hand is extremely difficult: it requires a deep understanding of a specific GPU's architecture, careful tracking of how data moves inside the chip, and consideration of dozens of constraints. This is work that demands niche expertise and takes a lot of time.
ThunderKittens is a framework of sorts that simplifies the process of writing these kernels. It is embedded in CUDA C++ and provides more user-friendly building blocks, organized around small tiles of data, without sacrificing performance. To put it simply: where writing a good kernel once required a highly specialized expert with vast experience, ThunderKittens lowers this barrier to entry.
This matters because GPUs are constantly updated and new architectures emerge, meaning kernels must be adapted for new hardware each time. A tool that makes this process faster and more accessible holds real practical value for the entire industry.
The Gap Between Theory and Practice
There's an interesting phenomenon in the GPU world: manufacturers publish impressive performance numbers for their chips, and these numbers are real, but they are only achievable under ideal conditions. In real-world AI applications, hardware often operates at just 30–50% of its potential, sometimes even less.
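A rough way to put a number on this gap is to divide the work a kernel actually got done per second by the peak the manufacturer advertises. The sketch below does this for a matrix multiplication; the function names, the timing, and the peak figure are all hypothetical values chosen for illustration, not measurements of any real chip.

```python
def matmul_flops(m, n, k):
    """Floating-point operations for an (m x k) by (k x n) matrix multiply.

    By convention, each of the m*n outputs is counted as k multiplies
    and k adds.
    """
    return 2 * m * n * k


def utilization(flops_done, seconds, peak_flops_per_second):
    """Fraction of the advertised peak that was actually achieved."""
    achieved = flops_done / seconds
    return achieved / peak_flops_per_second


# Hypothetical numbers, for illustration only.
work = matmul_flops(4096, 4096, 4096)
measured_seconds = 1e-3   # assumed measured kernel time
advertised_peak = 300e12  # assumed peak: 300 TFLOP/s

fraction = utilization(work, measured_seconds, advertised_peak)
print(f"achieved {fraction:.0%} of peak")
```

With these assumed inputs the result lands inside the 30–50% range mentioned above; a faster kernel for the same workload would shorten the measured time and push the fraction closer to 1.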
The work of the Together AI team is all about closing this gap. Each optimization, each improvement to a kernel, is a step toward making real-world performance approach the theoretical maximum. And at a time when the cost of computation remains one of the primary constraints on AI development, this work has a direct impact on what is even possible.
Why This Matters Beyond Just One Company
Together AI positions itself as an open platform: a significant portion of its work is released as open source. FlashAttention and ThunderKittens are available to everyone and are already being used in research and products worldwide.
This creates a fascinating model: a small team of highly specialized experts builds infrastructure that is then used by the entire industry. Major labs, startups, and academic researchers all rely, to some extent, on the work done by teams like this.
In other words, progress in AI doesn't just depend on who is designing new model architectures or compiling datasets. It also depends on those who make sure it all works efficiently on real hardware. And teams like this one are a vital part of that chain.
What's Next?
As GPUs become increasingly complex and models grow larger, work at the kernel level only becomes more challenging. New chips introduce new capabilities – and new limitations that need to be considered. Meanwhile, the demand for efficiency grows: training and running large models remains costly, and any improvement in hardware utilization has a direct effect on the economics of the entire sector.
In this respect, teams that work at the intersection of hardware and software are unlikely to become redundant. On the contrary, their role is only set to grow as AI systems become more complex and large-scale.
This is the part of the industry that rarely makes the news. But it is largely what determines how quickly and cheaply the models used by millions of people every day actually run.