Hardware accelerator design is traditionally an area that requires significant expertise. A single incorrectly set parameter in the circuit description can cost several days of debugging. This is precisely why any reduction in the barrier to entry here holds real value. A recent competitive experiment demonstrated that language models can become full-fledged partners in this work – not replacing the engineer, but significantly speeding up their process.
The Challenge Tackled by the Students
The task was to optimize the HMAC-SHA-256 cryptographic algorithm – a widely used method for data authentication. As part of a student competition with an LLM optimization track, participants took an existing implementation of this algorithm for an UltraScale+ series programmable logic device (FPGA) and aimed to make it faster using language models. The development environment was Vitis HLS 2025.2, and the AI advisors were the gpt-5-turbo and claude-sonnet-4 models.
The initial implementation ran in 880 clock cycles at a clock period of about 4 ns, approximately 3.54 microseconds per operation. After two weeks of work, the team achieved a 2.22x speedup and a 52% reduction in latency. The chip's resources remained almost untouched: logic utilization did not exceed 2% of the available capacity.
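The baseline figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above and assumes that the speedup applies to the same per-operation latency. Under that assumption a 2.22x speedup corresponds to roughly a 55% reduction, so the reported 52% presumably refers to a slightly different measurement; that reading is an assumption, not something the authors state.

```python
# Figures taken from the text above.
BASELINE_CYCLES = 880
BASELINE_LATENCY_US = 3.54
SPEEDUP = 2.22

# Clock period implied by the two baseline figures, in nanoseconds.
clock_period_ns = BASELINE_LATENCY_US * 1000 / BASELINE_CYCLES

# Latency after the speedup, ASSUMING it applies to this same metric.
optimized_latency_us = BASELINE_LATENCY_US / SPEEDUP

# Latency reduction that a 2.22x speedup implies on a single metric.
implied_reduction = 1 - 1 / SPEEDUP

print(f"implied clock period: {clock_period_ns:.2f} ns")
print(f"optimized latency:    {optimized_latency_us:.2f} us")
print(f"implied reduction:    {implied_reduction:.1%}")
```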
What Actually Happens Inside SHA-256
In short: the SHA-256 algorithm takes input data in 512-bit blocks and processes each block through three sequential stages. First, the data is padded to the required format; then, a 64-word schedule is expanded from the initial 16 words; and finally, 64 rounds of computation are run, folding the block into the running hash state. HMAC adds another layer on top of this: the key is mixed with the data in a specific way, and the hash is computed twice, first over the key-mixed message and then over the result of that first pass combined with the key again.
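The two-pass structure of HMAC is easier to see in software than in a hardware description. The following is a minimal reference sketch in Python built on the standard `hashlib` module; it illustrates the standard HMAC construction and is not the team's hardware code. The function name `hmac_sha256` is chosen here for illustration.

```python
import hashlib
import hmac

BLOCK_SIZE = 64  # SHA-256 block size in bytes (512 bits)

def hmac_sha256(key: bytes, message: bytes) -> bytes:
    """HMAC-SHA-256 written out explicitly to show the two hash passes."""
    # Keys longer than one block are hashed first; shorter keys are zero-padded.
    if len(key) > BLOCK_SIZE:
        key = hashlib.sha256(key).digest()
    key = key.ljust(BLOCK_SIZE, b"\x00")

    inner_key = bytes(b ^ 0x36 for b in key)  # key XOR ipad
    outer_key = bytes(b ^ 0x5C for b in key)  # key XOR opad

    # First pass: hash the key-mixed message.
    inner_digest = hashlib.sha256(inner_key + message).digest()
    # Second pass: hash the key-mixed result of the first pass.
    return hashlib.sha256(outer_key + inner_digest).digest()

# Cross-check against the standard library implementation.
reference = hmac.new(b"key", b"message", hashlib.sha256).digest()
print(hmac_sha256(b"key", b"message") == reference)
```

A software model like this is also a convenient source of expected values when checking a hardware implementation.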
In the original hardware implementation, the three stages were connected by queues (FIFOs) and marked as parallel. However, there was an architectural problem: the on-paper parallelism didn't work in practice. The second stage couldn't start until the first one had processed an entire block, and the third couldn't start until the second had finished its part. The queues between them added overhead without providing any real temporal overlap. The third stage (64 rounds of computation) took up 58% of the total execution time and was the obvious bottleneck.
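Because the stages effectively ran one after another, Amdahl's law gives a quick estimate of what accelerating the third stage alone could achieve. The sketch below is an added illustration rather than part of the authors' analysis; it uses only the 58% share quoted above and assumes the stages do not overlap.

```python
def overall_speedup(fraction: float, stage_speedup: float) -> float:
    """Amdahl's law: speedup of the whole when one fraction is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / stage_speedup)

# Share of execution time spent in the 64-round stage (from the text).
ROUNDS_FRACTION = 0.58

# Overall gain if only the rounds stage became twice as fast.
halved = overall_speedup(ROUNDS_FRACTION, 2.0)

# Limit as the rounds stage speedup grows without bound.
ceiling = 1.0 / (1.0 - ROUNDS_FRACTION)

print(f"rounds stage twice as fast: {halved:.2f}x overall")
print(f"upper bound from this stage alone: {ceiling:.2f}x")
```

Under these assumptions, halving only the dominant stage yields about 1.4x, which suggests why restructuring the whole pipeline mattered as much as speeding up the rounds.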
How the Conversation with AI Began
The first prompt to the language model was intentionally open-ended. The engineer provided the full source code and the synthesis report, and then asked:
"You are a senior Vitis HLS engineer and hardware architect. Analyze this code and tell me in what directions it can be optimized and how realistic this is."
The model returned a structured analysis: it pointed out the pseudo-parallelism of the first two stages, identified the third stage as the dominant one in terms of latency, and suggested several avenues for improvement. The key point here is not SHA-256 itself. Given the synthesis report along with the code, the language model can reproduce the bottleneck analysis that previously required an experienced engineer to read the latency tables by hand.
The Four-Phase Workflow that Works
During the experiment, a stable work pattern emerged, which the authors describe as a four-phase cycle. It is repeated at each optimization step and, judging by the results, is highly portable to other projects – not just SHA-256.
Load the Context
The first phase is to provide the model with maximum information about the project's current state: source code, synthesis report, and target device specifications. A common mistake is to ask for optimization advice without specific numbers. Without them, the model defaults to general recommendations instead of addressing the real problem.
Explore Options
The second phase is to ask for several strategies with a clear explanation of the trade-offs. A prompt that worked well looked like this:
"Based on this synthesis report and device constraints, propose several optimization strategies. For each, explain the expected latency improvement, resource cost, and potential risks."
In response, the model proposed five strategies – from full combinational unrolling (maximum performance, but unrealistic for timing) to two-way parallelism (a good boost with acceptable resource costs). The engineer chose two-way parallelism after verifying that the logic resource utilization would remain within the device's budget.
Generate Code Precisely
The third phase is to formulate the code generation prompt as specifically as possible. Prompts that worked well were those that named a specific action, stated the goal, explicitly defined what should not be touched, and warned about known pitfalls:
"Merge generateMsgSchedule and PreProcessing into a single function to improve parallelism. Generate the combined function and the updated top-level function. Do not modify other function calls. Be careful about directive conflicts and invalid connections."
This level of specificity consistently yielded better results than open-ended prompts like "make it faster."
Verify and Provide Feedback
The fourth phase is to run synthesis and simulation, and if something goes wrong, turn the error messages into the next prompt. For example:
"Synthesis is reporting errors [XFORM 203-313] and [RTGEN 206-102]. Fix the conflicts."
The model diagnosed the causes: a conflict between two directives applied at the same level of the hierarchy, and an unused dataflow stream left over after merging the functions. The corrected version was sent for re-synthesis, and the cycle continued. Typically, one optimization step took two to three iterations.
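The step of turning error messages into the next prompt is mechanical enough to script. The sketch below is a hypothetical helper, not the authors' tooling: the function names, the `ERROR:` line prefix, and the sample log are all assumptions made for illustration, and the message text is deliberately left out. Only the bracketed ID format comes from the prompt quoted above.

```python
import re

# Matches bracketed message IDs of the form quoted above, e.g. [XFORM 203-313].
MESSAGE_ID = re.compile(r"\[([A-Z_]+ \d+-\d+)\]")

def error_ids(log_text: str) -> list[str]:
    """Return unique message IDs from lines that start with 'ERROR' (assumed prefix)."""
    ids: list[str] = []
    for line in log_text.splitlines():
        if not line.lstrip().startswith("ERROR"):
            continue
        for message_id in MESSAGE_ID.findall(line):
            if message_id not in ids:
                ids.append(message_id)
    return ids

def feedback_prompt(log_text: str) -> str:
    """Build a follow-up prompt, or an empty string when no errors are found."""
    ids = error_ids(log_text)
    if not ids:
        return ""
    listed = " and ".join(f"[{message_id}]" for message_id in ids)
    return f"Synthesis is reporting errors {listed}. Fix the conflicts."

# Hypothetical log excerpt; real tool output will differ.
SAMPLE_LOG = """\
INFO: (hypothetical non-error line)
ERROR: [XFORM 203-313] (message text omitted)
ERROR: [RTGEN 206-102] (message text omitted)
"""

print(feedback_prompt(SAMPLE_LOG))
```

In practice the full error text, not just the IDs, would likely be worth passing along, since it gives the model more to work with.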
Where AI Helps and Where It Fails
Over three major optimization iterations, a clear picture of the language models' strengths and weaknesses emerged.
The models excelled at recognizing architectural anti-patterns: they saw that a parallel execution directive was applied to stages with sequential dependencies and suggested removing the unnecessary layer of indirection. They also noticed that the recurrence depth in the schedule generation was 16 words, not 64, and proposed using a circular buffer – a technically correct and efficient solution that requires an understanding of the algorithm.
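The circular-buffer idea follows from the SHA-256 schedule recurrence, in which each new word depends only on words at most 16 positions back. The Python model below is a sketch of that idea, not the generated HLS code; the function names are illustrative. It compares a full 64-word array against a version that keeps only 16 words of storage.

```python
MASK = 0xFFFFFFFF  # keep arithmetic within 32 bits

def rotr(x: int, n: int) -> int:
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK

def sigma0(x: int) -> int:
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)

def sigma1(x: int) -> int:
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)

def schedule_full(block_words):
    """Reference version: store all 64 schedule words."""
    w = list(block_words)
    for t in range(16, 64):
        w.append((sigma1(w[t - 2]) + w[t - 7] + sigma0(w[t - 15]) + w[t - 16]) & MASK)
    return w

def schedule_circular(block_words):
    """Yield W[0..63] while storing only the 16 most recent words."""
    buf = list(block_words)
    for t in range(64):
        if t >= 16:
            # Slot t % 16 still holds W[t-16]; it is overwritten with W[t].
            buf[t % 16] = (
                sigma1(buf[(t - 2) % 16])
                + buf[(t - 7) % 16]
                + sigma0(buf[(t - 15) % 16])
                + buf[(t - 16) % 16]
            ) & MASK
        yield buf[t % 16]

words = [(0x01234567 * (i + 1)) & MASK for i in range(16)]
print(list(schedule_circular(words)) == schedule_full(words))
```

Because each word is produced just as a round needs it, the schedule and the rounds can share one loop instead of communicating through a 64-entry array.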
The first version of the generated code was functional in about 80% of cases – structurally correct, with the right interfaces and directive placement. The remaining 20% required manual correction.
Consistent failures occurred in three situations. First, the models violated rules for combining directives – for example, applying a pipeline directive at the same hierarchy level as a dataflow directive, which the tool does not allow. This is a documented tool limitation, but it seems to be poorly represented in the models' training data. Second, after merging functions, the models left dead code – stream duplication utilities that no longer had two consumers and caused connection errors. Third, the boundary conditions in the algorithmic logic – such as the data padding paths in SHA-256 – required careful manual verification using test vectors.
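The padding paths are a good example of why boundary conditions need explicit tests. In SHA-256 padding, a 55-byte message still fits in a single block, while a 56-byte message spills into a second one. The sketch below is a software model of the standard padding rule, written for illustration; the function name is an assumption, and it is not the team's test bench.

```python
def sha256_pad(message: bytes) -> bytes:
    """Pad a message to a multiple of 64 bytes using the SHA-256 padding rule."""
    bit_length = len(message) * 8
    padded = message + b"\x80"                      # mandatory 1 bit, then zeros
    padded += b"\x00" * ((56 - len(padded)) % 64)   # fill up to 56 mod 64
    padded += bit_length.to_bytes(8, "big")         # 64-bit big-endian length
    return padded

# Message lengths around the block boundary and the resulting block counts.
for length in (0, 55, 56, 63, 64):
    blocks = len(sha256_pad(b"a" * length)) // 64
    print(f"{length:2d} bytes -> {blocks} block(s)")
```

Lengths of 55, 56, 63, and 64 bytes exercise each distinct path, which makes them natural candidates for a test set.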
The authors note that these weaknesses are not fundamental limitations of language models. They reflect a lack of specialized context in the training data. Structured prompts already reduce the frequency of errors, and further improvement is possible by connecting specialized agents with knowledge bases on specific development tools.
The Final Result
The final architecture combined two transformations. First, two of the three functions were merged into one: the queue between them was eliminated, reducing the three sequential stages to two. Second, two-way parallelism: a special module distributes data blocks alternately between two identical computational paths, each containing the merged pre-processing stage and its own instance of the compression function. Another module collects the results in the correct order.
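The distribute-and-collect scheme can be modeled in software to check that results come back in the original order. The sketch below assumes the work items are independent of one another and uses a plain SHA-256 call as a stand-in for each hardware path; all function names are hypothetical. It runs sequentially, so it models only the ordering, not the timing.

```python
import hashlib
from collections import deque

def dispatch(items, lanes=2):
    """Distribute items alternately (round-robin) across the lanes."""
    queues = [deque() for _ in range(lanes)]
    for index, item in enumerate(items):
        queues[index % lanes].append(item)
    return queues

def process_lane(queue):
    """Stand-in for one computational path: hash each item in arrival order."""
    return deque(hashlib.sha256(item).digest() for item in queue)

def collect(result_queues, count):
    """Read results back in the same round-robin order used by dispatch."""
    lanes = len(result_queues)
    return [result_queues[i % lanes].popleft() for i in range(count)]

def hash_all(items, lanes=2):
    queues = dispatch(items, lanes)
    results = [process_lane(queue) for queue in queues]
    return collect(results, len(items))

items = [bytes([i]) * (i + 1) for i in range(5)]
expected = [hashlib.sha256(item).digest() for item in items]
print(hash_all(items) == expected)
```

Since the collector reads in the same round-robin order as the distributor, no sequence tags are needed as long as each lane preserves arrival order.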
Each of the two paths processes one block in 72 + 69 = 141 cycles. Because the two paths operate in parallel, the effective processing time per block is halved. After accounting for the overhead from distribution, collection, and the HMAC wrapper, the final speedup was 2.22x for the entire operation.
Correctness was verified at each stage through co-simulation using the official test vectors from the FIPS 180-4 standard. The chip's resource consumption remained well below 2% of its available capacity – the speedup was achieved through architectural restructuring, not through the 'brute force' of additional hardware.
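A software golden model is a common way to produce or confirm expected values for such a check. The sketch below verifies a hash function against two widely published NIST example messages for SHA-256; whether these are exactly the vectors the team used is an assumption. The second message is 56 bytes long, so it also exercises the two-block padding path.

```python
import hashlib

# Widely published NIST SHA-256 example messages and their digests.
VECTORS = {
    b"abc":
        "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad",
    b"abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq":
        "248d6a61d20638b8e5c026930c3e6039a33ce45964ff2167f6ecedd419db06c1",
}

def failing_vectors(hash_fn):
    """Return the messages for which hash_fn does not match the expected digest."""
    return [
        message
        for message, expected_hex in VECTORS.items()
        if hash_fn(message).hex() != expected_hex
    ]

def software_sha256(message: bytes) -> bytes:
    """Reference implementation used here as the function under test."""
    return hashlib.sha256(message).digest()

print(failing_vectors(software_sha256))
```

In a real flow, `hash_fn` would wrap the simulated hardware output rather than the reference library.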
What This Means for the Future of Hardware Design
The authors estimate that interactions with the AI accounted for about 60% of the intellectual work; the remaining 40% was human-led: validating synthesis results, debugging, and decision-making. The entire process took about two weeks.
The proposed four-phase cycle is not specific to SHA-256. It is a general approach for any hardware core where an engineer can rely on actual results from synthesis and simulation. In such a cycle, the language model is most valuable as a tool for accelerating architectural reasoning and generating draft code, while the human is responsible for the final decisions: managing constraints, validating with development tools, and ensuring end-to-end correctness.
Hardware development still requires rigorous physical verification, so language models expand the possibilities for architectural exploration but do not replace the final validation. Even at their current level, they have already lowered the barrier to entry and compressed the optimization cycle so much that engineers without deep FPGA experience were able to achieve results that previously required years of specialization.