Published on March 31, 2026

exaCB: How to Teach a Supercomputer to Monitor Its Own 'Health'

How the exaCB continuous benchmarking system helps monitor the performance of dozens of scientific applications on the exascale supercomputer JUPITER.

Computer Science · 10–15 minute read
Author: Dr. Sophia Chen
"As I was finishing this text, one thought kept nagging me: we're building systems capable of a quintillion operations per second, but until now, we haven't had a proper tool to simply ask them, 'How are you doing?' exaCB is, in essence, the first systematic answer to this question in the exascale world. It worries me a little that the insights about the mismatch between peak power consumption and peak performance are only emerging now – how many extra megawatts were wasted before, simply because no one was looking? I'd like to believe that this incremental approach will take root not only in HPC but in the broader engineering culture as well." – Dr. Sophia Chen

Imagine a doctor who sees a patient once a year. The patient complains of fatigue, the doctor nods, orders some tests – and a few weeks later, it turns out the problem actually started eight months ago. Treating it now is much harder than if the patient had been wearing a fitness tracker on their wrist, recording vitals every day.

This is exactly what the traditional approach to performance testing on supercomputers looks like. You develop a program, run it on a supercomputer, get the numbers – and then forget about it until the next big test. Somewhere in between, something went wrong: the compiler was updated, the job scheduler settings were changed, a library version was swapped. The program suddenly starts running 30% slower. Who's to blame? When did it happen? It's a mystery.

The team developing the exaCB framework decided to put that very 'fitness tracker' on one of Europe's most powerful supercomputers – the JUPITER system. And here's what came of it.

What Is "Exascale" and Why Does It Matter

Before diving into exaCB, you need to understand the scale of the task. The word "exascale" refers to a computing system capable of performing 10¹⁸ floating-point operations per second. That's a quintillion operations. To put that in perspective: if every person on Earth performed one calculation per second, it would take all of humanity over four years to do what this computer does in a single second.
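
The back-of-the-envelope arithmetic is easy to check. A minimal sketch (the population figure of roughly 8 billion is an assumption):

```python
# Rough scale comparison: an exascale machine vs. all of humanity
# computing by hand. ~8 billion people is an assumption.
EXAFLOPS = 10**18               # operations per second of an exascale system
population = 8 * 10**9          # people on Earth, one operation per second each

seconds = EXAFLOPS / population           # humanity's time for one machine-second
years = seconds / (365 * 24 * 3600)
print(f"{years:.1f} years")               # -> ~4.0 years
```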

JUPITER is an exascale system deployed at the Jülich Supercomputing Centre in Germany. It was created to solve problems in climatology, materials science, molecular biology, nuclear physics, and dozens of other scientific fields. On a system like this, hundreds of applications run simultaneously, written by different teams in different programming languages, with different algorithms and requirements.

This is where the real engineering nightmare begins. How do you ensure that all these applications run correctly, efficiently, and don't degrade over time? How do you catch the moment when one of them starts consuming twice the energy without delivering any more results?

CI/CD: A Developer's Tool Comes to the World of Supercomputers

In the world of standard software development, there's a long-standing practice called Continuous Integration and Continuous Delivery (CI/CD). Roughly speaking, it's an assembly line that automatically checks the code every time a developer makes a change. You write a new function, and the system automatically compiles it, runs tests, checks if anything broke, and reports back.

It's like the automatic spell checker in your word processor: it works constantly in the background and flags errors immediately, not after you've already sent the email.

The problem is that standard CI/CD systems check if the code works correctly, but not how efficiently it works. A program might run correctly, but do so twice as slowly as before. Or it might consume significantly more energy. On a laptop, this is an annoyance. In the world of supercomputers, where you're dealing with millions of CPU hours and megawatts of electricity, it's a catastrophe.

This is precisely why the concept of continuous benchmarking (CB) emerged – an extension of CI/CD that adds constant monitoring of performance and energy consumption to the correctness checks.

Meet exaCB

exaCB is a framework (a set of tools and rules) for organizing continuous benchmarking on exascale systems. It was developed in preparation for JUPITER's launch and was applied as part of the JUREAP (JUPITER Research and Early Access Program) – a kind of 'early access period' where scientific teams could test their applications on the system before its full deployment.

The main idea behind exaCB is simple: every run of every application on the supercomputer automatically records its performance data in a unified database, where it can be retrieved, compared, and analyzed at any time.

Sounds logical. But, as always, the devil is in the details.

Architecture: How It Works on the Inside

The exaCB architecture resembles a well-organized newsroom. There are correspondents – the applications collecting 'news' (performance data). There are editors – parsers that standardize this data. There's an archive – an InfluxDB database where everything is stored. And there are the Grafana dashboards – the storefront where you can see the whole picture: trends, anomalies, and comparisons.

More specifically, the system consists of several components:

  • Benchmark Repository – A centralized repository that stores configurations, run scripts, and parameters for all applications. Think of it as a shared 'cookbook': here's how to run this application, here's what to measure, and here's where to send the results.
  • CI/CD Pipelines – Automated processes based on GitLab CI that run on a schedule or when code changes. They can interact with the Slurm job scheduler, which manages the job queues on the supercomputer.
  • Metric Collectors – Modules that gather data from various sources: execution time, power consumption, memory load, I/O operations. This is done using specialized tools like Score-P (for profiling parallel applications), LIKWID (for hardware counters), and perf (for Linux kernel-level profiling).
  • Results Database – InfluxDB, a database optimized for time-series data. It's perfectly suited for the task of 'recording a measurement result with a timestamp' (a minimal write sketch follows this list).
  • Visualization System – Grafana with pre-configured dashboards that allow users to see performance trends, compare applications, and spot deviations.
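
To make the data flow concrete, here is a minimal sketch of what a metric collector's final step might look like, using the official influxdb-client Python package. The endpoint, bucket, tag, and field names are illustrative assumptions, not exaCB's actual schema:

```python
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Hypothetical connection details -- not exaCB's real endpoint or schema.
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="hpc")
write_api = client.write_api(write_options=SYNCHRONOUS)

# One benchmark run becomes one time-stamped point: tags identify the run,
# fields carry the measured values.
point = (
    Point("benchmark_run")
    .tag("application", "my_cfd_solver")   # illustrative application name
    .tag("system", "jupiter")
    .field("runtime_s", 612.4)
    .field("energy_kj", 8350.0)
)
write_api.write(bucket="exacb-results", record=point)
client.close()
```

Once every application writes points of this shape, Grafana dashboards and cross-application queries come essentially for free.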

The Main Innovation: The Maturity Ladder

If exaCB had required every team to immediately implement the full suite of monitoring tools, connect all the counters, and ensure perfect reproducibility, most teams would have simply refused to participate. It would be like asking an amateur runner to start with a marathon.

Instead, the creators of exaCB came up with four levels of integration maturity, which they named CoL (Continuity Levels). Each successive level adds more detail and complexity, but you can start with the simplest one.

CoL 0: "At Least It Runs"

The zeroth level is literally just a check to see if the application compiles and runs without errors. It records the execution time and the return code (success or failure). No complex tools, minimal requirements.

It's like the first visit to the doctor: the pulse is there, blood pressure is measured, the patient is alive – that's a good start.
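
In code, CoL 0 amounts to little more than a timed subprocess call. A minimal sketch (the launch command is a placeholder; in practice it would be an srun or sbatch line):

```python
import subprocess
import time

# CoL 0: does the application run, and how long does it take?
# "./my_app" is a placeholder for the real launch command.
start = time.monotonic()
result = subprocess.run(["./my_app"], capture_output=True, text=True)
elapsed = time.monotonic() - start

record = {
    "return_code": result.returncode,   # 0 means success
    "wall_time_s": round(elapsed, 2),
    "passed": result.returncode == 0,
}
print(record)  # in exaCB this record would be pushed to the database instead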

CoL 1: Basic Performance Metrics

The first level starts collecting quantitative data: execution time, throughput, floating-point operations per second (FLOPS). The data is written to the database and becomes available for trend analysis. Now you can see: 'A week ago, this application ran in 10 minutes, and now it takes 15.'
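
The real value of CoL 1 data is trend analysis. A minimal sketch of a regression check against recent history (the 10% tolerance and the shape of the history are assumptions):

```python
from statistics import median

def flag_regression(history_s: list[float], latest_s: float,
                    tolerance: float = 0.10) -> bool:
    """Flag a run as a regression if it is more than `tolerance`
    slower than the median of recent runs."""
    baseline = median(history_s)
    return latest_s > baseline * (1 + tolerance)

# A week of ~10-minute runs, then a 15-minute run: clearly flagged.
print(flag_regression([600, 605, 598, 602, 610], 900))  # True
```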

CoL 2: Energy Efficiency and Detailed Monitoring

The second level adds energy consumption measurement and deeper profiling: cache load, memory usage, vector operations. This allows you to answer not just 'how fast?' but also 'at what cost?'

One of the most interesting insights gained during JUREAP is related to this level: it turned out that peak power consumption does not always coincide with peak performance. An application can run fast but be extremely wasteful – or run slower but with a much better ratio of results to energy spent.
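
The "at what cost?" question reduces to a simple ratio once both quantities are in the database. A sketch of the kind of derived metric CoL 2 enables (all numbers are illustrative):

```python
# Two hypothetical runs of the same job: faster is not automatically cheaper.
runs = [
    {"name": "aggressive", "runtime_s": 500.0, "avg_power_w": 9000.0},
    {"name": "balanced",   "runtime_s": 650.0, "avg_power_w": 5500.0},
]

for run in runs:
    energy_kj = run["runtime_s"] * run["avg_power_w"] / 1000  # E = P * t
    print(f'{run["name"]}: {energy_kj:,.0f} kJ to solution')
# aggressive: 4,500 kJ -- faster, but more energy overall
# balanced:   3,575 kJ -- slower, yet cheaper per result
```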

CoL 3: Reproducibility and Full Context

The highest level is scientific rigor in the truest sense. Absolutely everything is recorded: compiler versions, optimization flags, node configurations, environment variables, library versions. The result is a complete 'passport' for each run, allowing it to be reproduced a year later on a different system to obtain comparable data.
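
Such a run "passport" is straightforward to capture. A minimal sketch using only the Python standard library; which fields exaCB actually records, and how, is beyond this sketch:

```python
import json
import os
import platform
import subprocess
from datetime import datetime, timezone

# Capture the execution context so the run can be reproduced later.
passport = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "hostname": platform.node(),
    "kernel": platform.release(),
    "python": platform.python_version(),
    # Compiler version, as reported by the toolchain itself:
    "gcc_version": subprocess.run(
        ["gcc", "--version"], capture_output=True, text=True
    ).stdout.splitlines()[0],
    # Environment variables that affect the run (illustrative selection):
    "env": {k: v for k, v in os.environ.items()
            if k.startswith(("OMP_", "SLURM_"))},
}
print(json.dumps(passport, indent=2))
```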

This is exactly what is usually missing from scientific publications: 'We ran this on a cluster and got these results' – but who could ever reproduce that measurement, and under what conditions?

JUREAP: The Testing Ground

The JUREAP program became the ideal testing ground for exaCB. Within this program, more than 70 scientific applications from a wide range of fields – from molecular dynamics to climate models – were integrated into the continuous benchmarking system.

For each team, the integration process looked something like this:

  1. Create a repository with configuration files for exaCB, describing how to build and run the application.
  2. Write build and run scripts (exaCB provided templates, so there was no need to reinvent the wheel).
  3. Connect monitoring tools as the team became ready (CoL 1, 2, or 3).
  4. Configure the output data format to JSON so the exaCB parser can automatically extract key metrics (a minimal sketch follows this list).
  5. Set up a GitLab CI pipeline for automated runs.
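
Step 4 is the glue between an application and the database: the application emits its metrics as JSON, and a parser picks out the keys it knows. A minimal sketch of both sides (the field names are illustrative assumptions, not exaCB's actual schema):

```python
import json

# What an application might print at the end of a run (illustrative keys).
raw_output = """
{
  "application": "my_climate_model",
  "runtime_s": 1834.2,
  "throughput": {"value": 41.7, "unit": "simulated_days/hour"},
  "energy_kj": 22150.0
}
"""

# The parser side: extract only the metrics the database schema expects.
data = json.loads(raw_output)
metrics = {
    "runtime_s": data["runtime_s"],
    "throughput": data["throughput"]["value"],
    "energy_kj": data.get("energy_kj"),   # optional at lower CoLs
}
print(metrics)
```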

Importantly, not all applications reached the same CoL, and that was perfectly okay. Some teams stuck to the zeroth level, and even this provided valuable data on baseline functionality on the new system. Others went all the way to full monitoring with energy metrics and detailed profiling.

What Was Discovered: Real-World Findings

The data collected by exaCB during JUREAP led to several concrete and practically significant discoveries.

I/O Bottlenecks

A number of applications showed an unexpected, significant drop in performance – not due to the computations themselves, but because of slow data reading and writing to disk. Without constant monitoring, this might have gone unnoticed: the application was 'working' with no errors, but was spending 40% of its time waiting on the file system.

Visualization in Grafana made this problem obvious, allowing teams to optimize file operations even before starting full-scale work on the system.
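
Detecting this kind of bottleneck requires nothing exotic once wall time is broken down by category. A sketch of the check (the source of the breakdown and the alert threshold are illustrative assumptions):

```python
# Hypothetical per-run time breakdown, e.g. assembled from profiler output.
breakdown = {"compute_s": 540.0, "mpi_s": 120.0, "io_s": 440.0}

total = sum(breakdown.values())
io_fraction = breakdown["io_s"] / total
if io_fraction > 0.30:   # alert threshold is a tunable assumption
    print(f"I/O-bound: {io_fraction:.0%} of wall time spent on file operations")
```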

The Impact of Slurm Scheduler Settings

Slurm is the supercomputer's 'dispatcher,' deciding which job to run on which node, how to allocate resources, and in what order to process the queue. It turned out that different Slurm configurations produced significantly different results for the same applications – not only in terms of performance but also energy consumption.

This discovery allowed for the optimization of scheduler settings for specific classes of tasks.

Cross-Application Analysis: Common Patterns

One of the most interesting benefits of a unified database is the ability to compare applications from completely different fields. In JUREAP, it was found that several applications using similar numerical algorithms (e.g., iterative linear solvers) exhibited similar performance and power consumption patterns. This means an optimization found for one application is potentially applicable to others.
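
With everything in one InfluxDB instance, a cross-application comparison is a single Flux query away. A sketch using the same influxdb-client package as above (bucket, measurement, and field names remain illustrative assumptions):

```python
from influxdb_client import InfluxDBClient

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="hpc")

# Mean energy per run, grouped by application, over the last 30 days.
flux = '''
from(bucket: "exacb-results")
  |> range(start: -30d)
  |> filter(fn: (r) => r._measurement == "benchmark_run" and r._field == "energy_kj")
  |> group(columns: ["application"])
  |> mean()
'''
for table in client.query_api().query(flux):
    for record in table.records:
        print(record.values["application"], record.get_value())
client.close()
```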

It is precisely this kind of systemic knowledge that can only be accumulated with a common data infrastructure – and it's precisely this that distinguishes exaCB from scattered, one-off tests.

The Challenges They Faced

It would be unfair not to mention the difficulties. Implementing exaCB in a real-world exascale environment exposed several serious engineering problems.

Heterogeneity of Applications. Seventy-plus applications means seventy-plus different stories: different programming languages, build systems, dependencies, and output formats. exaCB's modular architecture and incremental approach helped mitigate this problem, but it couldn't be eliminated entirely – each integration required individual attention.

Infrastructure Scalability. When dozens of applications run regularly and each run generates hundreds of metrics, the data volumes quickly become significant. InfluxDB handles this task but requires proper configuration of storage schemas and aggregation policies.

Reproducibility in an Unstable Environment. A supercomputer is a living system: firmware gets updated, library versions change, maintenance is performed. Ensuring full reproducibility in such conditions is extremely difficult. CoL 3 requires meticulous recording of the entire context, which demands discipline and extra effort from the teams.

The Human Factor. Technical tools are only as good as the people using them. Persuading teams to regularly update configurations, monitor data quality, and act on detected regressions is not an engineering challenge, but an organizational one. The incremental approach and ready-made templates lowered the barrier to entry but didn't remove it completely.

Why This Matters Beyond Supercomputers

You might think this is all a story about very expensive and highly specialized machines that has nothing to do with regular software development. But that's not the case.

The principles implemented in exaCB are universal:

  • Continuous measurement is better than periodic. The more frequently you measure performance, the earlier you detect problems and the cheaper they are to fix.
  • A unified data format opens up new questions. When all results are stored in one place and in one format, you can ask questions that were simply impossible to ask before, like, 'Which of our applications perform worst after the compiler update?'
  • Incremental adoption works better than revolutionary change. Demanding 'do everything right from the start' kills adoption. The ability to start small and gradually increase complexity works.
  • Performance and energy efficiency are different things. Fast doesn't mean frugal. In an era where the cost of electricity is an increasingly significant factor in operating computing systems, the ability to measure both parameters simultaneously is not a luxury, but a necessity.

The transition to exascale systems in the 2020s exposed a problem that, on a smaller scale, was merely an inconvenience: without systematic and continuous performance monitoring, it is impossible to manage complex software ecosystems. exaCB is one of the first practical answers to this challenge, tested not in a lab but on a real system with real applications and real development teams.

A supercomputer is like a complex organism that needs regular check-ups – and good monitoring is not bureaucracy, but an honest conversation about what's really happening on the inside.

Original Title: exaCB: Reproducible Continuous Benchmark Collections at Scale Leveraging an Incremental Approach
Article Publication Date: Mar 23, 2026
Original Article Authors: Jayesh Badwaik, Mathis Bode, Michal Rajski, Andreas Herten