Published on April 2, 2026

How Together AI optimizes GPUs for faster AI models

The People Making GPUs Run Incredibly Fast: Inside the Together AI Team

How a small research team turns the theoretical potential of GPUs into real-world performance for AI systems – the story of the Together AI team.

Infrastructure, 5–7 min read

Event Source: Together.ai

When people talk about artificial intelligence, the conversation usually revolves around the models themselves: which one is smarter, which responds faster, which writes better code. But behind the scenes of this race, there's a completely different kind of work – meticulous, subtle, and yet fundamental. It's carried out by people who focus on what are known as kernels – low-level software components that directly control how GPUs perform computations.

This is precisely what the team at Together AI does. And if you've heard of FlashAttention or ThunderKittens, know that it's their handiwork.

Why kernels are essential for AI GPU performance

Why Do We Need “Kernels” in the First Place?

Simply put, a GPU is an extremely powerful piece of hardware, but to make it run at full capacity on AI workloads, someone has to write special low-level code that tells the chip what to do and in what order. These pieces of code are known as kernels.

Most AI system developers work a level or two higher up: they use pre-built libraries and frameworks that already handle all this “manual” work with the hardware. But someone has to write those libraries, too. Someone has to ensure that new GPU architectures are utilized as efficiently as possible – not leaving half of the chip's potential to sit idle.

The Together AI team works on precisely this intermediate layer – between the hardware and the models that run on that hardware.

FlashAttention: how optimization boosts language models

FlashAttention: When Optimization Changes Everything

One of the team's most famous projects is FlashAttention. In short, it's a way to significantly speed up one of the key operations in modern language models: the attention mechanism. This operation is crucial for the model to “understand” relationships between words and parts of a text, but it is also one of the most resource-intensive.

FlashAttention reimagined how this operation is executed on the GPU. Instead of repeatedly moving data between the GPU's large but slow main memory and its small but fast on-chip memory, the algorithm splits the computation into blocks and reorders it so that each block stays in fast memory for as long as possible. The result is a tangible speedup and lower memory usage.

This is more than just a technical detail. FlashAttention has influenced how many modern models are structured and has become one of those “quiet” inventions that are almost invisible to the end user but are critically important for the entire industry.
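The reordering described above can be illustrated with a minimal sketch in plain Python. It is not the FlashAttention kernel itself, which runs on the GPU and also tiles the queries. It only shows the core idea under simplified assumptions: keys and values are processed one block at a time, and a running maximum, running sum, and running output replace the full row of attention scores. The function names and the `block` parameter are illustrative choices, not part of any library.

```python
import math

def naive_attention(q, k, v):
    """Reference version: builds the full score row for every query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out

def tiled_attention(q, k, v, block=2):
    """Blockwise version: only one block of K/V is 'in fast memory' at a time."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dim_v = len(v[0])
    out = []
    for qi in q:
        running_max = float("-inf")
        running_sum = 0.0
        acc = [0.0] * dim_v
        for start in range(0, len(k), block):
            k_blk = k[start:start + block]
            v_blk = v[start:start + block]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale everything accumulated under the previous maximum.
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [a * correction + sum(w * vj[c] for w, vj in zip(weights, v_blk))
                   for c, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out

# Toy inputs, chosen only for illustration.
q = [[1.0, 0.0], [0.5, 0.5]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
diff = max(abs(a - b)
           for ra, rb in zip(naive_attention(q, k, v), tiled_attention(q, k, v))
           for a, b in zip(ra, rb))
print("largest difference between the two versions:", diff)
```

Both functions compute the same result up to floating-point rounding. The difference is that the blockwise version never holds more than one block of scores at once, which is what makes it possible to keep the working set in fast memory.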

ThunderKittens: a framework for efficient kernel development

ThunderKittens: The Tool That Builds Tools

Another project from the team, ThunderKittens, addresses a broader challenge. Writing efficient kernels by hand is extremely difficult: it requires a deep understanding of a specific GPU's architecture, careful tracking of how data moves inside the chip, and consideration of dozens of constraints. This is work that demands niche expertise and takes a lot of time.

ThunderKittens is a framework of sorts that simplifies the process of writing these kernels. It provides more user-friendly building blocks without sacrificing performance. To put it simply: where writing a good kernel once required a highly specialized expert with vast experience, ThunderKittens lowers this barrier to entry.

This matters because GPUs are constantly updated and new architectures emerge, meaning kernels must be adapted for new hardware each time. A tool that makes this process faster and more accessible holds real practical value for the entire industry.
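To give a feel for what "building blocks" means here, the sketch below composes a matrix multiplication from three small tile-level helpers. It is written in plain Python and does not use ThunderKittens or reproduce its API. The names `load_tile`, `mma_tile`, `store_tile`, and the `TILE` size are hypothetical, picked only to show how a kernel can be expressed in terms of tiles rather than individual elements.

```python
TILE = 2  # hypothetical tile edge, kept tiny so the example is easy to follow

def load_tile(matrix, row, col):
    """Copy one TILE x TILE block out of a larger matrix."""
    return [r[col:col + TILE] for r in matrix[row:row + TILE]]

def mma_tile(acc, a, b):
    """Accumulate the product a @ b into acc (all TILE x TILE blocks)."""
    for i in range(TILE):
        for j in range(TILE):
            acc[i][j] += sum(a[i][p] * b[p][j] for p in range(TILE))

def store_tile(matrix, row, col, tile):
    """Write a TILE x TILE block back into a larger matrix."""
    for i in range(TILE):
        matrix[row + i][col:col + TILE] = tile[i]

def tiled_matmul(a, b):
    """Multiply square matrices whose size is a multiple of TILE."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for row in range(0, n, TILE):
        for col in range(0, n, TILE):
            acc = [[0.0] * TILE for _ in range(TILE)]
            for inner in range(0, n, TILE):
                mma_tile(acc, load_tile(a, row, inner), load_tile(b, inner, col))
            store_tile(c, row, col, acc)
    return c

# Toy input: multiplying by the identity should return the original matrix.
a = [[float(4 * i + j) for j in range(4)] for i in range(4)]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
print(tiled_matmul(a, identity) == a)
```

The outer function reads almost like a description of the algorithm, while the details of moving and multiplying blocks live in the helpers. On a GPU, those helpers are where the hardware-specific work would go.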

Bridging the gap between theoretical and actual GPU performance in AI

The Gap Between Theory and Practice

There's an interesting phenomenon in the GPU world: manufacturers publish impressive performance numbers for their chips, and these numbers are real, but they are only achievable under ideal conditions. In real-world AI applications, hardware often operates at just 30–50% of its potential, sometimes even less.

The work of the Together AI team is all about closing this gap. Each optimization, each improvement to a kernel, is a step toward making real-world performance approach the theoretical maximum. And at a time when the cost of computation remains one of the primary constraints on AI development, this work has a direct impact on what is even possible.
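The gap can be made concrete with a little arithmetic. The sketch below counts the floating-point operations in a matrix multiplication and compares the achieved rate with a peak rate. All numbers in it (the peak throughput, the matrix size, and the measured time) are hypothetical and chosen only for illustration; they do not describe any particular GPU or benchmark.

```python
def matmul_flops(m, n, k):
    """Floating-point operations for an (m x k) @ (k x n) matrix multiply."""
    return 2 * m * n * k  # one multiply and one add per inner-product term

def utilization(flops, seconds, peak_flops_per_second):
    """Fraction of the peak throughput that was actually achieved."""
    achieved = flops / seconds
    return achieved / peak_flops_per_second

# Hypothetical figures, not measurements.
PEAK = 1.0e15        # assumed peak rate: 1,000 TFLOP/s
ELAPSED = 2.75e-3    # assumed wall-clock time of the multiply, in seconds

flops = matmul_flops(8192, 8192, 8192)
print(f"utilization: {utilization(flops, ELAPSED, PEAK):.0%}")
```

With these assumed inputs the multiply lands at roughly 40% of peak, inside the 30–50% range mentioned above. Halving the elapsed time through a better kernel would double that figure without changing the hardware.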

The broader impact of optimizing AI hardware

Why This Matters Beyond Just One Company

Together AI positions itself as an open platform: a significant portion of its work is released as open source. FlashAttention and ThunderKittens are available to everyone and are already being used in research and products worldwide.

This creates a fascinating model: a small team of highly specialized experts builds infrastructure that is then used by the entire industry. Major labs, startups, and academic researchers all rely, to some extent, on the work done by teams like this.

In other words, progress in AI doesn't just depend on who is designing new model architectures or compiling datasets. It also depends on those who make sure it all works efficiently on real hardware. And teams like this one are a vital part of that chain.

The future of kernel optimization in AI and hardware

What's Next?

As GPUs become increasingly complex and models grow larger, work at the kernel level only becomes more challenging. New chips introduce new capabilities, along with new limitations that need to be considered. Meanwhile, the demand for efficiency grows: training and running large models remain costly, and any improvement in hardware utilization has a direct effect on the economics of the entire sector.

In this respect, teams that work at the intersection of hardware and software are unlikely to become redundant. On the contrary, their role is only set to grow as AI systems become more complex and large-scale.

This is the part of the industry that rarely makes the news. But it is largely what determines how quickly and cheaply the models used by millions of people every day actually run.


Link to Original: https://www.together.ai/blog/inside-the-together-ai-kernels-team

Original Title: Inside the Together AI kernels team

Publication Date: Apr 1, 2026

Together.ai (www.together.ai): a U.S.-based platform for running and scaling open AI models.
