Published February 13, 2026


AI Agents Write CUDA Kernels: GPT and Claude Learn to Generate GPU Code

Two AI agents can now generate optimized CUDA kernels straight from a task description, speeding up GPU operations. Let's dive into what this means for people working with models.

Categories: Technical Context, Development
Source: Hugging Face · Reading time: 6–8 minutes

Typically, if you want to accelerate data processing on a GPU, you either need to use pre-built libraries or write low-level CUDA code yourself. The second option requires serious expertise: you need to understand the GPU architecture, manage memory, and handle thread synchronization. This is a specialized profession, and not every model developer knows how to do it.

Now, there's another way: describe the task in natural language, and an AI agent will generate an optimized CUDA kernel for you. Two such agents are already available – one uses GPT-4o, the other Claude 3.5 Sonnet. Both are integrated into the Hugging Face ecosystem and accessible through the Transformers interface.


What Is a CUDA Kernel and Why Write One

When you work with a neural network, most of the computations happen on the GPU. Libraries like PyTorch or cuDNN provide ready-made operations: matrix multiplications, convolutions, activations. They work fast, but they are general-purpose. If you have a specific task – for example, you need to combine several operations into one or implement a non-standard function – pre-built blocks can be inefficient.

In such cases, you write your own CUDA kernel – a function that runs directly on the GPU and does exactly what you need. This can provide a significant speed boost, especially if the operation is repeated frequently. But writing such code requires a deep understanding of the hardware and the C++ language.
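To make the fusion argument concrete, here is a plain-Python sketch (deliberately not GPU code) contrasting two separate passes over the data with one fused pass. On a GPU, the fused version is what a custom kernel buys you: each element is read from and written to global memory once instead of twice.

```python
import math

def gelu(x):
    # tanh approximation of GELU, common in transformer implementations
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def scale_then_gelu_two_passes(xs, scale):
    # Pass 1: scale every element (materializes an intermediate array).
    scaled = [x * scale for x in xs]
    # Pass 2: apply the activation (reads the intermediate back).
    return [gelu(x) for x in scaled]

def scale_then_gelu_fused(xs, scale):
    # One pass: each element is read once, transformed, and written once.
    # This per-element body is roughly what one GPU thread would execute.
    return [gelu(x * scale) for x in xs]
```

The two functions compute identical results; only the number of trips over the data differs, which is exactly the cost that matters on memory-bound GPU workloads.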


How the Agents Work

Both agents are structured similarly. You describe the operation in text – for example, “implement a LayerNorm layer with GELU activation.” The agent analyzes the request, generates CUDA code, compiles it, and returns a ready-to-use function that can be called from Python.
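As a concrete reference for the example request above, here is a plain-Python version of LayerNorm followed by GELU. This is a sketch of the arithmetic any generated kernel would need to reproduce, not the agents' actual output.

```python
import math

def layernorm_gelu(xs, eps=1e-5):
    # LayerNorm: center by the mean, scale by the standard deviation.
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    normed = [(x - mean) / math.sqrt(var + eps) for x in xs]
    # GELU (tanh approximation) applied elementwise to the normalized values.
    return [
        0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
        for x in normed
    ]
```

A fused CUDA kernel would do all of this in one launch: a reduction for the mean and variance, then the normalization and activation, without writing the intermediate normalized tensor back to global memory.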

Internally, the process looks like this: the agent first generates the kernel code, then tries to compile it. If compilation fails or the results are incorrect, the agent receives an error message and attempts to fix the code. This is an iterative process – the agent may make several attempts until it gets a working version.
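The iterative loop described above can be sketched as follows. Both `ask_model` and `compile_and_test` are placeholders introduced for illustration (standing in for the real LLM call and the nvcc compile-and-check step); only the control flow is meant to match the description.

```python
def generate_kernel(task, ask_model, compile_and_test, max_attempts=5):
    """Iterative generate -> compile -> fix loop, as described in the text.

    ask_model(prompt) returns candidate kernel source code;
    compile_and_test(source) returns (ok, error_message).
    Both are stand-ins for the real model and toolchain.
    """
    prompt = task
    for attempt in range(max_attempts):
        source = ask_model(prompt)
        ok, error = compile_and_test(source)
        if ok:
            return source
        # Feed the compiler or correctness error back to the model and retry.
        prompt = f"{task}\nPrevious attempt failed with:\n{error}\nFix the code."
    raise RuntimeError(f"no working kernel after {max_attempts} attempts")
```

The key design point is that the error message becomes part of the next prompt, so each retry is conditioned on what actually went wrong rather than starting from scratch.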

At their core are language models: GPT-4o for one agent, Claude 3.5 Sonnet for the other. Both models can generate code and work with technical descriptions, but their approaches differ slightly. Claude shows more stable results on tasks involving matrix operations, while GPT-4o is sometimes faster at handling non-standard requests.


What You Can Already Do

The agents can create kernels for basic operations: normalization, activations, element-wise transformations, and simple matrix operations. They handle typical tasks encountered when working with transformers or convolutional networks.

For example, you can ask it to implement a custom activation function that isn't in the standard library, or combine several operations into one to avoid extra memory accesses. The agent will generate code that does exactly that, and you can use it like a regular function in PyTorch.

Important: the agents do not replace manual optimization. The code they generate works, but it's not always maximally efficient. If you need performance on par with industrial-grade libraries, you'll likely need to refine it. But for prototyping, experiments, or tasks where speed is not critical, it's a perfectly viable tool.


What Are the Limitations

First – reliability. The agent might generate code with errors, especially if the task is vaguely formulated or requires non-trivial logic. Sometimes, it can't compile the result even after several attempts. In such cases, you have to either clarify the request or fix the code manually.

Second – performance. The agent doesn't know all the intricacies of a specific GPU's architecture. It might miss optimization opportunities, for example, by not using shared memory effectively or failing to account for data alignment. The generated code usually runs slower than what an experienced CUDA programmer would produce.
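Given these reliability and performance caveats, it is worth checking any generated kernel against a trusted baseline before relying on it. A minimal stdlib-only harness might look like this; the two callables are stand-ins for the generated implementation and a reference one, not a real agent API.

```python
import timeit

def check_against_baseline(candidate, baseline, inputs, tol=1e-6, repeats=100):
    """Compare a generated implementation against a reference implementation.

    Raises ValueError if outputs diverge beyond tol;
    otherwise returns the candidate/baseline runtime ratio
    (below 1.0 means the candidate is faster).
    """
    for x in inputs:
        got, want = candidate(x), baseline(x)
        if abs(got - want) > tol:
            raise ValueError(f"mismatch on {x}: {got} vs {want}")
    t_candidate = timeit.timeit(lambda: [candidate(x) for x in inputs], number=repeats)
    t_baseline = timeit.timeit(lambda: [baseline(x) for x in inputs], number=repeats)
    return t_candidate / t_baseline
```

In practice the correctness check matters more than the timing: a generated kernel that is slow is merely unhelpful, while one that is silently wrong can poison an entire training run.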

Third – task complexity. The agents handle relatively simple operations. If you need to implement a complex algorithm with non-trivial thread management or a multi-level memory hierarchy, the agent will likely fail without significant assistance.


Who Might Find This Useful

First and foremost – researchers and model developers who don't specialize in low-level programming. If you're working with PyTorch and need to quickly test an idea that requires a non-standard operation, an agent can save you time. Instead of studying CUDA or searching for a pre-existing implementation, you just describe the task and get working code.

It's also useful for learning. You can see how the agent implements a particular operation and use that as a starting point to understand CUDA mechanics. Of course, the generated code isn't always perfect, but it can show you the basic structure and logic.

For tasks where maximum performance is crucial – like in production or when training large models – agents don't yet replace manual work. But they can speed up prototyping and lower the entry barrier for those who haven't worked with GPU programming before.


What This Means in a Broader Context

This is another example of how language models are starting to assist with technical tasks that previously required narrow specialization. We've already seen AI assistants write Python code, generate SQL queries, and help with debugging. Now, they've reached low-level programming.

Of course, this doesn't mean CUDA programmers are no longer needed. Auto-generated code doesn't yet match the quality of professionally written kernels. But tools like these can change how tasks are distributed: routine operations are delegated to the agent, while specialists focus on truly complex optimization.

Another point is accessibility. Previously, creating custom kernels was the domain of a small group of developers. Now, it's becoming more accessible. While the results won't always be optimal, the barrier to entry is lowered. This can speed up experiments and allow more people to try non-standard approaches.


Is It Worth a Try?

If you work with models and encounter situations where pre-built operations aren't suitable, it's worth a try. Both agents are available through Transformers and are easy to run. Don't expect perfect results on the first try, but for rapid prototyping, it's a perfectly viable option.

If you're just starting to get the hang of GPU programming, the agents can help you understand the basic principles. You'll see how CUDA kernels are structured and can experiment with different operations without having to immediately dive into hundreds of pages of documentation.

For tasks where performance is critical, agents won't replace manual work just yet. But they can be useful during the research phase, when iteration speed is more important than the absolute efficiency of the code.

Original Title: Custom Kernels for All from Codex and Claude
Source: Hugging Face (huggingface.co), a U.S.-based open platform for hosting, training, and sharing AI models.

From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Analyzing the Original Publication and Writing the Text — Claude Sonnet 4.5 (Anthropic). The neural network studies the original material and generates a coherent text.

2. Translation into English — Gemini 2.5 Pro (Google DeepMind).

3. Text Review and Editing — Gemini 2.5 Flash (Google DeepMind). Correction of errors, inaccuracies, and ambiguous phrasing.

4. Preparing the Illustration Description — DeepSeek-V3.2 (DeepSeek). Generating a textual prompt for the visual model.

5. Creating the Illustration — FLUX.2 Pro (Black Forest Labs). Generating an image based on the prepared prompt.
