Imagine a doctor who sees a patient once a year. The patient complains of fatigue, the doctor nods, orders some tests – and a few weeks later, it turns out the problem actually started eight months ago. Treating it now is much harder than if the patient had been wearing a fitness tracker on their wrist, recording vitals every day.
This is exactly what the traditional approach to performance testing on supercomputers looks like. You develop a program, run it on a supercomputer, get the numbers – and then forget about it until the next big test. Somewhere in between, something went wrong: the compiler was updated, the job scheduler settings were changed, a library version was swapped. The program suddenly starts running 30% slower. Who's to blame? When did it happen? It's a mystery.
The team developing the exaCB framework decided to put that very 'fitness tracker' on one of Europe's most powerful supercomputers – the JUPITER system. And here's what came of it.
What Is "Exascale" and Why Does It Matter
Before diving into exaCB, you need to understand the scale of the task. The word "exascale" refers to a computing system capable of performing 10^18 floating-point operations per second. That's a quintillion operations. To put that in perspective: if every person on Earth performed one calculation per second, it would take all of humanity roughly four years of non-stop counting to do what this computer does in a single second.
JUPITER is an exascale system deployed at the Jülich Supercomputing Centre in Germany. It was created to solve problems in climatology, materials science, molecular biology, nuclear physics, and dozens of other scientific fields. On a system like this, hundreds of applications run simultaneously, written by different teams in different programming languages, with different algorithms and requirements.
This is where the real engineering nightmare begins. How do you ensure that all these applications run correctly, efficiently, and don't degrade over time? How do you catch the moment when one of them starts consuming twice the energy without delivering any more results?
CI/CD: A Developer's Tool Comes to the World of Supercomputers
In the world of standard software development, there's a long-standing practice called Continuous Integration and Continuous Delivery (CI/CD). Roughly speaking, it's an assembly line that automatically checks the code every time a developer makes a change. You write a new function, and the system automatically compiles it, runs tests, checks if anything broke, and reports back.
It's like the automatic spell checker in your word processor: it works constantly in the background and flags errors immediately, not after you've already sent the email.
The problem is that standard CI/CD systems check if the code works correctly, but not how efficiently it works. A program might run correctly, but do so twice as slowly as before. Or it might consume significantly more energy. On a laptop, this is an annoyance. In the world of supercomputers, where you're dealing with millions of CPU hours and megawatts of electricity, it's a catastrophe.
This is precisely why the concept of continuous benchmarking (CB) emerged – an extension of CI/CD that adds constant monitoring of performance and energy consumption to the correctness checks.
exaCB is a framework (a set of tools and rules) for organizing continuous benchmarking on exascale systems. It was developed in preparation for JUPITER's launch and was applied as part of the JUREAP (JUPITER Research and Early Access Program) – a kind of 'early access period' where scientific teams could test their applications on the system before its full deployment.
The main idea behind exaCB is simple: Let every application run on the supercomputer automatically record its performance data into a unified database, where it can be retrieved, compared, and analyzed at any time.
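To make this concrete, here is a minimal sketch of what "recording a run into a unified database" could look like. InfluxDB accepts records in its text-based line protocol; the application name, tag names, and metric names below are illustrative, not exaCB's actual schema.

```python
import time

def to_line_protocol(app, metrics, tags=None):
    """Serialize one benchmark run as an InfluxDB line-protocol record.

    Metric and tag names here are illustrative, not exaCB's real schema.
    """
    tags = tags or {}
    tag_str = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(metrics.items()))
    ts = time.time_ns()  # InfluxDB timestamps default to nanosecond precision
    return f"{app}{tag_str} {field_str} {ts}"

record = to_line_protocol(
    "gromacs",
    {"runtime_s": 612.4, "energy_j": 8.1e6},
    tags={"system": "jupiter", "nodes": 64},
)
print(record)
```

One such line per run, accumulated over months, is what turns isolated measurements into a trend you can actually query.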
Sounds logical. But, as always, the devil is in the details.
Architecture: How It Works on the Inside
The exaCB architecture resembles a well-organized newsroom. There are correspondents – the applications collecting 'news' (performance data). There are editors – parsers that standardize this data. There's an archive – an InfluxDB database where everything is stored. And there are the Grafana dashboards – the storefront where you can see the whole picture: trends, anomalies, and comparisons.
More specifically, the system consists of several components:
- Benchmark Repository – A centralized repository that stores configurations, run scripts, and parameters for all applications. Think of it as a shared 'cookbook': here's how to run this application, here's what to measure, and here's where to send the results.
- CI/CD Pipelines – Automated processes based on GitLab CI that run on a schedule or when code changes. They can interact with the Slurm job scheduler, which manages the job queues on the supercomputer.
- Metric Collectors – Modules that gather data from various sources: execution time, power consumption, memory load, I/O operations. This is done using specialized tools like Score-P (for profiling parallel applications), LIKWID (for hardware counters), and Perf (for Linux kernel-level profiling).
- Results Database – InfluxDB, a database optimized for time-series data. It's perfectly suited for the task of 'recording a measurement result with a timestamp.'
- Visualization System – Grafana with pre-configured dashboards that allow users to see performance trends, compare applications, and spot deviations.
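The flow between these components can be sketched as a toy pipeline: a CI job produces a raw result, a parser (the "editor") extracts metrics, and a store (standing in for InfluxDB) archives them. All class and field names here are illustrative, not exaCB's real interfaces.

```python
from dataclasses import dataclass

@dataclass
class RunResult:          # what a CI job hands over after Slurm finishes
    app: str
    exit_code: int
    raw_output: str

def parse_metrics(result: RunResult) -> dict:
    """The 'editor': pull key=value pairs the application printed at the end."""
    metrics = {}
    for line in result.raw_output.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            metrics[key.strip()] = float(value)
    return metrics

class MetricStore:
    """The 'archive': an in-memory stand-in for InfluxDB in this sketch."""
    def __init__(self):
        self.points = []

    def write(self, app: str, metrics: dict):
        self.points.append((app, metrics))

result = RunResult("icon", 0, "runtime_s=512.0\nflops=3.2e15\n")
store = MetricStore()
store.write(result.app, parse_metrics(result))
print(store.points)
```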
The Main Innovation: The Maturity Ladder
If exaCB had required every team to immediately implement the full suite of monitoring tools, connect all the counters, and ensure perfect reproducibility, most teams would have simply refused to participate. It would be like demanding that an amateur runner show up and run a marathon on day one.
Instead, the creators of exaCB came up with four levels of integration maturity, which they named CoL (Continuity Levels). Each successive level adds more detail and complexity, but you can start with the simplest one.
CoL 0: "At Least It Runs"
The zeroth level is literally just a check to see if the application compiles and runs without errors. It records the execution time and the return code (success or failure). No complex tools, minimal requirements.
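The whole of CoL 0 fits in a few lines. A minimal sketch (the function name and report fields are hypothetical, not exaCB's API) that runs a command and records only the wall time and return code:

```python
import subprocess
import sys
import time

def col0_check(cmd):
    """CoL 0: does it run at all? Record only wall time and the return code."""
    start = time.perf_counter()
    proc = subprocess.run(cmd, capture_output=True)
    return {
        "elapsed_s": time.perf_counter() - start,
        "return_code": proc.returncode,
        "success": proc.returncode == 0,
    }

# Stand-in workload: in practice this would be an `sbatch`/`srun` submission.
report = col0_check([sys.executable, "-c", "print('hello from the benchmark')"])
print(report)
```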
It's like the first visit to the doctor: the pulse is there, blood pressure is measured, the patient is alive – that's a good start.
CoL 1: Basic Performance Metrics
The first level starts collecting quantitative data: execution time, throughput, floating-point operations per second (FLOPS). The data is written to the database and becomes available for trend analysis. Now you can see: 'A week ago, this application ran in 10 minutes, and now it takes 15.'
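The "10 minutes then, 15 minutes now" observation is exactly what a simple regression check automates. A sketch, assuming a hypothetical threshold-over-median rule (the real trend analysis lives in the dashboards and may differ):

```python
def flag_regression(history, latest, threshold=0.2):
    """Flag a run whose time exceeds the recent median by more than `threshold`."""
    baseline = sorted(history)[len(history) // 2]  # median of past runtimes
    slowdown = latest / baseline - 1.0
    return slowdown > threshold, slowdown

history = [600, 598, 612, 605, 601]   # seconds, recent runs
flagged, slowdown = flag_regression(history, latest=900)
print(flagged, round(slowdown, 2))    # → True 0.5
```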
CoL 2: Energy Efficiency and Detailed Monitoring
The second level adds energy consumption measurement and deeper profiling: cache load, memory usage, vector operations. This allows you to answer not just 'how fast?' but also 'at what cost?'
One of the most interesting insights gained during JUREAP is related to this level: it turned out that peak power consumption does not always coincide with peak performance. An application can run fast but be extremely wasteful – or run slower but with a much better ratio of results to energy spent.
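The "fast but wasteful" distinction falls out of two simple ratios. A sketch with made-up numbers, just to show that the two metrics can rank the same pair of runs in opposite orders:

```python
def efficiency(flops_total, runtime_s, avg_power_w):
    """Two complementary views of the same run: speed, and results per joule."""
    energy_j = avg_power_w * runtime_s
    return {
        "flops_per_s": flops_total / runtime_s,
        "flops_per_joule": flops_total / energy_j,
    }

fast = efficiency(1e18, runtime_s=600, avg_power_w=5e6)    # fast but hungry
slow = efficiency(1e18, runtime_s=900, avg_power_w=2.5e6)  # slower, frugal
print(fast["flops_per_s"] > slow["flops_per_s"])           # fast wins on speed
print(fast["flops_per_joule"] < slow["flops_per_joule"])   # slow wins per joule
```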
CoL 3: Reproducibility and Full Context
The highest level is scientific rigor in the truest sense. Absolutely everything is recorded: compiler versions, optimization flags, node configurations, environment variables, library versions. The result is a complete 'passport' for each run, allowing it to be reproduced a year later on a different system to obtain comparable data.
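A minimal version of such a "passport" can be collected in a few lines. The fields below are illustrative; exaCB's CoL 3 records considerably more (optimization flags, node configurations, full module environments):

```python
import json
import os
import platform
import sys

def run_passport(extra=None):
    """Capture enough context to identify how a run was produced."""
    passport = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "hostname": platform.node(),
        "env": {k: os.environ[k]
                for k in ("OMP_NUM_THREADS", "SLURM_JOB_ID")
                if k in os.environ},
    }
    passport.update(extra or {})
    return passport

print(json.dumps(
    run_passport({"compiler": "gcc 13.2", "flags": "-O3 -march=native"}),
    indent=2,
))
```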
This is exactly what is usually missing from scientific publications: 'We ran this on a cluster and got these results' – but who can reproduce that measurement, and when?
JUREAP: The Testing Ground
The JUREAP program became the ideal testing ground for exaCB. Within this program, more than 70 scientific applications from a wide range of fields – from molecular dynamics to climate models – were integrated into the continuous benchmarking system.
For each team, the integration process looked something like this:
- Create a repository with configuration files for exaCB, describing how to build and run the application.
- Write build and run scripts (exaCB provided templates, so there was no need to reinvent the wheel).
- Connect monitoring tools as the team became ready (CoL 1, 2, or 3).
- Configure the output data format to JSON so the exaCB parser can automatically extract key metrics.
- Set up a GitLab CI pipeline for automated runs.
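Step 4 in the list above amounts to a small contract between application and parser. A hypothetical example (the application name, metric names, and `metrics` key are invented for illustration, not exaCB's actual format):

```python
import json

# The application dumps its key metrics to a JSON document at the end of a run…
app_output = json.dumps({
    "app": "nek5000",
    "metrics": {"runtime_s": 742.1, "dofs_per_s": 1.9e9},
})

# …and a generic parser picks them up without knowing anything app-specific.
def extract_metrics(raw: str) -> dict:
    return json.loads(raw)["metrics"]

print(extract_metrics(app_output))
```

The point of the contract is that one parser serves seventy-plus applications: each team only has to emit this shape once.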
Importantly, not all applications reached the same CoL, and that was perfectly okay. Some teams stuck to the zeroth level, and even this provided valuable data on baseline functionality on the new system. Others went all the way to full monitoring with energy metrics and detailed profiling.
What Was Discovered: Real-World Findings
The data collected by exaCB during JUREAP led to several concrete and practically significant discoveries.
I/O Bottlenecks
A number of applications showed an unexpected, significant drop in performance – not due to the computations themselves, but because of slow data reading and writing to disk. Without constant monitoring, this might have gone unnoticed: the application was 'working' with no errors, but was spending 40% of its time waiting on the file system.
Visualization in Grafana made this problem obvious, allowing teams to optimize file operations even before starting full-scale work on the system.
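The underlying check is straightforward once runs are broken into phases. A sketch, assuming hypothetical phase names and a naming convention (`io_*`) that the real instrumentation would not necessarily use:

```python
def io_fraction(phase_times):
    """Share of wall time spent in I/O phases."""
    total = sum(phase_times.values())
    io = sum(t for phase, t in phase_times.items() if phase.startswith("io_"))
    return io / total

run = {"compute": 540.0, "io_read": 220.0, "io_write": 140.0}
print(round(io_fraction(run), 2))  # → 0.4, i.e. 40% of the run waits on files
```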
The Impact of Slurm Scheduler Settings
Slurm is the supercomputer's 'dispatcher,' deciding which job to run on which node, how to allocate resources, and in what order to process the queue. It turned out that different Slurm configurations produced significantly different results for the same applications – not only in terms of performance but also energy consumption.
This discovery allowed for the optimization of scheduler settings for specific classes of tasks.
Cross-Application Analysis: Common Patterns
One of the most interesting benefits of a unified database is the ability to compare applications from completely different fields. In JUREAP, it was found that several applications using similar numerical algorithms (e.g., iterative linear solvers) exhibited similar performance and power consumption patterns. This means an optimization found for one application is potentially applicable to others.
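A cross-application query of this kind is easy to sketch once every run carries a tag for its algorithm family. The application names, tags, and energy figures below are entirely made up for illustration:

```python
from collections import defaultdict
from statistics import median

# (app, algorithm-family tag, energy per run in joules) — invented data
runs = [
    ("appA", "iterative_solver", 3.1e9),
    ("appB", "iterative_solver", 2.9e9),
    ("appC", "spectral",         1.2e9),
    ("appD", "iterative_solver", 3.0e9),
]

by_family = defaultdict(list)
for app, family, energy_j in runs:
    by_family[family].append(energy_j)

# Typical energy per run for each family of numerical algorithms
profile = {family: median(vals) for family, vals in by_family.items()}
print(profile)
```

With a shared database this kind of grouping is a single query; with scattered per-team spreadsheets it is effectively impossible.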
It is precisely this kind of systemic knowledge that can only be accumulated with a common data infrastructure – and it's precisely this that distinguishes exaCB from scattered, one-off tests.
The Challenges They Faced
It would be unfair not to mention the difficulties. Implementing exaCB in a real-world exascale environment exposed several serious engineering problems.
Heterogeneity of Applications. Seventy-plus applications means seventy-plus different stories: different programming languages, build systems, dependencies, and output formats. exaCB's modular architecture and incremental approach helped mitigate this problem, but it couldn't be eliminated entirely – each integration required individual attention.
Infrastructure Scalability. When dozens of applications run regularly and each run generates hundreds of metrics, the data volumes quickly become significant. InfluxDB handles this task but requires proper configuration of storage schemas and aggregation policies.
Reproducibility in an Unstable Environment. A supercomputer is a living system: firmware gets updated, library versions change, maintenance is performed. Ensuring full reproducibility in such conditions is extremely difficult. CoL 3 requires meticulous recording of the entire context, which demands discipline and extra effort from the teams.
The Human Factor. Technical tools are only as good as the people using them. Persuading teams to regularly update configurations, monitor data quality, and act on detected regressions is not an engineering challenge, but an organizational one. The incremental approach and ready-made templates lowered the barrier to entry but didn't remove it completely.
Why This Matters Beyond Supercomputers
You might think this is all a story about very expensive and highly specialized machines that has nothing to do with regular software development. But that's not the case.
The principles implemented in exaCB are universal:
- Continuous measurement is better than periodic. The more frequently you measure performance, the earlier you detect problems and the cheaper they are to fix.
- A unified data format opens up new questions. When all results are stored in one place and in one format, you can ask questions that were simply impossible to ask before, like, 'Which of our applications perform worst after the compiler update?'
- Incremental adoption works better than revolutionary change. Demanding 'do everything right from the start' kills adoption. The ability to start small and gradually increase complexity works.
- Performance and energy efficiency are different things. Fast doesn't mean frugal. In an era where the cost of electricity is an increasingly significant factor in operating computing systems, the ability to measure both parameters simultaneously is not a luxury, but a necessity.
The transition to exascale systems in the 2020s exposed a problem that, on a smaller scale, was merely an inconvenience: without systematic and continuous performance monitoring, it is impossible to manage complex software ecosystems. exaCB is one of the first practical answers to this challenge, tested not in a lab but on a real system with real applications and real development teams.
A supercomputer is like a complex organism that needs regular check-ups. And good monitoring is not bureaucracy, but an honest conversation about what's really happening on the inside.