Published on April 2, 2026

The People Making GPUs Run Incredibly Fast: Inside the Together AI Team

How a small research team turns the theoretical potential of GPUs into real-world performance for AI systems – the story of the Together AI team.

Infrastructure, 5–7 min read
Event Source: Together.ai

When people talk about artificial intelligence, the conversation usually revolves around the models themselves: which one is smarter, which responds faster, which writes better code. But behind the scenes of this race, there's a completely different kind of work – meticulous, subtle, and yet fundamental. It's carried out by people who focus on what are known as kernels – low-level software components that directly control how GPUs perform computations.

This is precisely what the team at Together AI does. And if you've heard of FlashAttention or ThunderKittens, both are this team's handiwork.

Why Do We Need “Kernels” in the First Place?

Simply put, a GPU is an extremely powerful piece of hardware, but to make it run at full capacity on AI workloads, someone has to write special low-level code that tells the chip what to do and in what order. These pieces of code are known as kernels.
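To make the idea concrete, here is a minimal sketch in plain Python of what a kernel looks like conceptually: one small function that every GPU thread runs on its own piece of the data. The `launch` helper, the function names, and the block and thread counts are illustrative assumptions. A real kernel would be written in a language such as CUDA, and its threads would run in parallel rather than in a loop.

```python
# A toy model of a GPU kernel launch, written in plain Python.
# On a real GPU the "threads" below would run in parallel, not one by one.


def vector_add_kernel(thread_id, a, b, out):
    """Work done by a single thread: add one pair of elements."""
    if thread_id < len(out):  # guard: the grid may be larger than the data
        out[thread_id] = a[thread_id] + b[thread_id]


def launch(kernel, num_blocks, threads_per_block, *args):
    """Hypothetical launcher: call the kernel once per (block, thread) pair."""
    for block in range(num_blocks):
        for thread in range(threads_per_block):
            kernel(block * threads_per_block + thread, *args)


a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [10.0, 20.0, 30.0, 40.0, 50.0]
out = [0.0] * len(a)

# Two blocks of four threads cover five elements; the extra threads do nothing.
launch(vector_add_kernel, 2, 4, a, b, out)
print(out)  # [11.0, 22.0, 33.0, 44.0, 55.0]
```

The hard part of real kernel work is everything this toy leaves out: deciding how threads are grouped, which memory each value lives in, and in what order the data is moved.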

Most AI system developers work a level or two higher up: they use pre-built libraries and frameworks that already handle all this “manual” work with the hardware. But someone has to write those libraries, too. Someone has to ensure that new GPU architectures are used as efficiently as possible, rather than leaving half of the chip's potential idle.

The Together AI team works on precisely this intermediate layer – between the hardware and the models that run on that hardware.

FlashAttention: When Optimization Changes Everything

One of the team's most famous projects is FlashAttention. In short, it's a way to significantly speed up one of the key operations in modern language models: the attention mechanism. This operation is crucial for the model to “understand” relationships between words and parts of a text, but it is also one of the most resource-intensive.

FlashAttention reimagined how this operation is executed on the GPU: instead of repeatedly moving data between the GPU's large but slower main memory and its small, fast on-chip memory, the algorithm processes the computation in blocks so that data stays in fast memory for as long as possible. The result is a measurable speedup and lower memory usage, because the full table of attention scores never has to be written out.
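The reordering can be sketched in plain Python. This is a simplified illustration of the underlying idea (visiting keys and values block by block while maintaining a running softmax), not the FlashAttention implementation itself. The function names, the block size, and the sample data are assumptions made for this example.

```python
import math


def naive_attention(Q, K, V):
    """Reference version: compute a full row of scores, then apply softmax."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block_size=2):
    """Visit K/V in blocks, keeping only a running max, sum, and output per query."""
    d = len(Q[0])
    out = []
    for q in Q:
        running_max = -math.inf
        running_sum = 0.0
        acc = [0.0] * len(V[0])  # unnormalized weighted sum of V rows so far
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                      for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale everything accumulated under the previous maximum.
            scale = math.exp(running_max - new_max)
            running_sum *= scale
            acc = [a * scale for a in acc]
            for s, v in zip(scores, v_blk):
                w = math.exp(s - new_max)
                running_sum += w
                acc = [a + w * vj for a, vj in zip(acc, v)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out


# Small made-up inputs: 3 queries, 5 keys/values.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[0.5, -1.0], [2.0, 0.3], [-0.7, 1.2], [1.5, 1.5], [0.0, -2.0]]
V = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [3.0, 3.0, 0.5],
     [-2.0, 0.5, 1.0], [4.0, -1.0, 0.0]]

reference = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, block_size=2)
max_diff = max(abs(x - y) for r1, r2 in zip(reference, tiled)
               for x, y in zip(r1, r2))
print("largest difference between the two versions:", max_diff)
```

The two functions should agree up to floating-point rounding. The difference is that the tiled version only ever holds `block_size` scores at a time, which is what lets a real kernel keep its working set in fast on-chip memory.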

This is more than just a technical detail. FlashAttention has influenced how many modern models are structured and has become one of those “quiet” inventions that are almost invisible to the end user but are critically important for the entire industry.

ThunderKittens: The Tool That Builds Tools

Another project from the team, ThunderKittens, addresses a broader challenge. Writing efficient kernels by hand is extremely difficult: it requires a deep understanding of a specific GPU's architecture, careful tracking of how data moves inside the chip, and consideration of dozens of constraints. This is work that demands niche expertise and takes a lot of time.

ThunderKittens is a framework of sorts that simplifies the process of writing these kernels. It provides more user-friendly building blocks without sacrificing performance. To put it simply: where writing a good kernel once required a highly specialized expert with vast experience, ThunderKittens lowers this barrier to entry.
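The "building blocks" idea can be illustrated with a small Python sketch in which a matrix multiplication is assembled from three tile-level helpers. This is not ThunderKittens' API: the helper names (`load_tile`, `mma_tile`, `store_tile`) and the tile size are hypothetical, chosen only to show how a kernel author can think in terms of small tiles instead of individual elements.

```python
TILE = 2  # hypothetical tile edge; tiles on real hardware are larger


def load_tile(matrix, row, col):
    """Copy a TILE x TILE block out of a larger matrix."""
    return [r[col:col + TILE] for r in matrix[row:row + TILE]]


def mma_tile(acc, a, b):
    """Multiply-accumulate on small square tiles: acc += a @ b."""
    for i in range(TILE):
        for j in range(TILE):
            acc[i][j] += sum(a[i][k] * b[k][j] for k in range(TILE))


def store_tile(matrix, row, col, tile):
    """Write a TILE x TILE block back into a larger matrix."""
    for i in range(TILE):
        matrix[row + i][col:col + TILE] = tile[i]


def tiled_matmul(A, B):
    """Multiply square matrices whose size is a multiple of TILE."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for row in range(0, n, TILE):
        for col in range(0, n, TILE):
            acc = [[0.0] * TILE for _ in range(TILE)]
            for k in range(0, n, TILE):
                mma_tile(acc, load_tile(A, row, k), load_tile(B, k, col))
            store_tile(C, row, col, acc)
    return C


A = [[float(4 * i + j) for j in range(4)] for i in range(4)]
B = [[float((i + 2 * j) % 5) for j in range(4)] for i in range(4)]
C = tiled_matmul(A, B)
print(C[0])
```

Once the tile-level helpers are fast and correct, the outer loops read almost like the mathematical definition, which is the kind of simplification a kernel framework aims for.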

This matters because GPUs are constantly updated and new architectures emerge, meaning kernels must be adapted for new hardware each time. A tool that makes this process faster and more accessible holds real practical value for the entire industry.

The Gap Between Theory and Practice

There's an interesting phenomenon in the GPU world: manufacturers publish impressive performance numbers for their chips, and these numbers are real, but they are only achievable under ideal conditions. In real-world AI applications, hardware often operates at just 30–50% of its potential, sometimes even less.
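One common way to reason about this gap is the roofline model: a computation can run no faster than the chip's peak arithmetic rate, and no faster than memory can feed it data. The sketch below applies that idea. The hardware figures are hypothetical round numbers, not the specifications of any real GPU, and the function names are assumptions made for this example.

```python
def attainable_tflops(peak_tflops, bandwidth_tb_per_s, flops_per_byte):
    """Roofline estimate: the lower of the compute ceiling and the memory ceiling."""
    compute_ceiling = peak_tflops
    # TB/s multiplied by FLOPs per byte gives TFLOP/s.
    memory_ceiling = bandwidth_tb_per_s * flops_per_byte
    return min(compute_ceiling, memory_ceiling)


def utilization(achieved_tflops, peak_tflops):
    """Fraction of the advertised peak that is actually reached."""
    return achieved_tflops / peak_tflops


PEAK_TFLOPS = 1000.0      # hypothetical advertised peak
BANDWIDTH_TB_PER_S = 3.0  # hypothetical memory bandwidth

# Arithmetic intensity: how many operations are done per byte moved.
for intensity in (50, 150, 500):
    ceiling = attainable_tflops(PEAK_TFLOPS, BANDWIDTH_TB_PER_S, intensity)
    share = utilization(ceiling, PEAK_TFLOPS)
    print(f"{intensity:>4} FLOPs/byte -> at most {ceiling:.0f} TFLOP/s ({share:.0%} of peak)")
```

With these made-up numbers, a kernel that performs 150 operations per byte moved cannot exceed 45% of peak no matter how well it is written. Raising that ratio by moving less data is exactly what optimizations of this kind aim for.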

The work of the Together AI team is all about closing this gap. Each optimization, each improvement to a kernel, is a step toward making real-world performance approach the theoretical maximum. And at a time when the cost of computation remains one of the primary constraints on AI development, this work has a direct impact on what is even possible.

Why This Matters Beyond Just One Company

Together AI positions itself as an open platform: a significant portion of its work is released as open source. FlashAttention and ThunderKittens are available to everyone and are already being used in research and products worldwide.

This creates a fascinating model: a small team of highly specialized experts creates infrastructural solutions that are then used by the entire industry. Major labs, startups, and academic researchers all rely, to some extent, on the work done by teams like this.

In other words, progress in AI doesn't just depend on who is designing new model architectures or compiling datasets. It also depends on those who make sure it all works efficiently on real hardware. And teams like this one are a vital part of that chain.

What's Next?

As GPUs become increasingly complex and models grow larger, work at the kernel level only becomes more challenging. New chips introduce new capabilities – and new limitations that need to be considered. Meanwhile, the demand for efficiency grows: training and running large models remain costly, and any improvement in hardware utilization has a direct effect on the economics of the entire sector.

In this respect, teams that work at the intersection of hardware and software are unlikely to become redundant. On the contrary, their role is only set to grow as AI systems become more complex and large-scale.

This is the part of the industry that rarely makes the news. But it is largely what determines how fast and economically the models used by millions of people every day actually run.

Original Title: Inside the Together AI kernels team
Publication Date: Apr 1, 2026
Source: Together.ai (www.together.ai), a U.S.-based platform for running and scaling open AI models.
From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Analyzing the Original Publication and Writing the Text: Claude Sonnet 4.6 (Anthropic). The neural network studies the original material and generates a coherent text.
2. Translation into English: Gemini 2.5 Pro (Google DeepMind).
3. Text Review and Editing: Gemini 2.5 Flash (Google DeepMind). Correction of errors, inaccuracies, and ambiguous phrasing.
4. Preparing the Illustration Description: DeepSeek-V3.2 (DeepSeek). Generating a textual prompt for the visual model.
5. Creating the Illustration: FLUX.2 Pro (Black Forest Labs). Generating an image based on the prepared prompt.
