Published on April 9, 2026

How to Detect AI-Generated Text Using Digital Watermarks

Digital Fingerprints: How to Spot AI-Generated Text

A deep dive into how the 'Gumbel watermarking' method embeds an invisible trace into AI text – and why it's so tricky to find.

Computer Science, 10 – 15 min read
Author: Dr. Rafael Santos
«When I finished reading this study, I couldn't shake the feeling that we're just beginning the conversation about how trust works in a world where text can be written by both humans and machines. The math here is beautiful – almost like a well-structured rhythm section – but one question keeps nagging at me: how quickly will practice outpace theory if language models start 'learning' to bypass such detectors? This isn't paranoia, just an honest technical question for which we don't have an answer yet.» – Dr. Rafael Santos

Imagine you're hearing a samba. Even if the musicians are playing softly, the rhythm always has a characteristic pattern that a trained ear can use to say, without a doubt: 'That's samba, not something else.' Now, imagine the opposite task: to take a piece of text and figure out if it was written by a human or a language model – and to do so without having a 'trained ear' on hand.

This is precisely the problem addressed by the field of digital watermarks for texts created by large language models. In 2022, Scott Aaronson proposed an elegant approach that became known as Gumbel watermarking. And the study I want to talk about takes a step forward: it proposes a more accurate and simpler way to detect this watermark. Sound technical? I promise to explain it in a way that makes everything click – through metaphors, analogies, and a bit of Brazilian passion for numbers.

Why Do We Need Watermarks in Text Anyway?

In the world of images, watermarks have been around for a long time. Photo banks place a semi-transparent logo over an image, and just like that, authorship is protected. But with text, it's more complicated. Text is a sequence of words, and you can't just 'draw' a logo over it. Any change you make must be unnoticeable to the reader but detectable by a machine.

Why is this necessary? There are several reasons. First, the question of authorship: if a large language model generates millions of texts a day, it's important to have a tool that can determine who, or what, is behind a specific message. Second, it's a matter of trust: newsrooms, educational institutions, and publishing platforms are all interested in being able to distinguish between human and machine-generated text. Third, it's a question of accountability: if a model generates harmful content, a watermark helps trace the source.

Early attempts to embed hidden signals in text were like an old Brazilian game: swap in an inconspicuous synonym, slightly alter the sentence order, or tinker with the punctuation. Such methods are a form of linguistic steganography. The problem was that they either disrupted the natural flow of the language or were easily defeated: a little rephrasing was enough to lose the signal.

With the advent of powerful language models, a need arose for fundamentally different approaches – ones that embed the signal not in specific words, but in the statistics of word choice. And this is where things get really interesting.

How Gumbel Watermarking Works: A Simple Explanation

To understand Gumbel watermarking, you first need to understand how a language model chooses the next word.

Imagine the model is a DJ at a carnival. In front of him is a huge stack of vinyl records – these are all the words in the vocabulary, which can number in the tens of thousands. For each word, he has a score: how well this word fits next in the current sentence. Some words get high scores, others low. The DJ takes these scores, applies a specific formula (known in science as 'softmax'), turns them into probabilities, and randomly picks a record. Not always the most popular one, because randomness makes the text lively and diverse.
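To make the DJ's routine concrete, here is a minimal Python sketch of softmax sampling. The four-word vocabulary and its scores are invented for illustration; a real model scores tens of thousands of tokens.

```python
import math
import random

def softmax(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    top = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(vocab, scores, rng):
    """Pick one token at random, weighted by its softmax probability."""
    probs = softmax(scores)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Toy vocabulary and scores (hypothetical values, not from any real model).
vocab = ["samba", "bossa", "forró", "tango"]
scores = [2.0, 1.0, 0.5, -1.0]

probs = softmax(scores)
rng = random.Random(0)
picks = [sample_next_token(vocab, scores, rng) for _ in range(1000)]
```

The highest-scoring word is picked most often, but not every time, which is exactly the "lively randomness" described above.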

Now let's add Gumbel watermarking. Before making a choice, the DJ draws a special random number for each record, a so-called Gumbel random variable. It acts as a kind of mathematical 'noise' that is added to each score. The noise is random but follows a known law: the Gumbel distribution, which the mathematician Emil Gumbel described back in the mid-20th century in the context of extreme value theory.

But watermarking isn't just about adding noise. The key is that one specific record at any given moment receives a small additional 'bonus' – a small constant, denoted in formulas as c. Which record gets this bonus is determined by a special hash function: essentially, it's a deterministic but outwardly random way to choose a 'favorite' based on the preceding context.

The constant c is small. So small that a reader will never notice the difference: the text remains natural, lively, and grammatically correct. But statistically – and this is where the magic of numbers begins – the tokens that received this bonus will appear in the text slightly more often than they would have without it.

It's as if the DJ, unbeknownst to the crowd, were subtly reaching for one particular stack of records more often. A single choice looks random. But if you look at hundreds and thousands of choices, a pattern emerges.
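The picture above can be sketched in a few lines of Python. This follows the simplified description in this article, not necessarily the exact construction in the paper: the names `favorite_index` and `watermarked_choice`, the SHA-256 hash, the key string, and the flat toy scores are all assumptions made for illustration.

```python
import hashlib
import math
import random

def favorite_index(context, vocab_size, key="demo-key"):
    """Deterministically pick the 'favorite' token position from the context."""
    text = key + "|" + " ".join(context)
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % vocab_size

def gumbel_noise(rng):
    """One standard Gumbel sample: -log(-log(U)), with U uniform on (0, 1)."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def watermarked_choice(scores, context, c, rng):
    """Add Gumbel noise to every score, a bonus c to the favorite, take the argmax."""
    fav = favorite_index(context, len(scores))
    noisy = [s + gumbel_noise(rng) + (c if i == fav else 0.0)
             for i, s in enumerate(scores)]
    chosen = max(range(len(scores)), key=noisy.__getitem__)
    return chosen, fav

def favorite_rate(c, trials=4000, seed=0):
    """Fraction of steps on which the favorite wins, for four equally scored tokens."""
    rng = random.Random(seed)
    scores = [0.0, 0.0, 0.0, 0.0]
    hits = 0
    for t in range(trials):
        chosen, fav = watermarked_choice(scores, ["step", str(t)], c, rng)
        hits += chosen == fav
    return hits / trials

baseline_rate = favorite_rate(0.0)  # no bonus: the favorite wins about 1 time in 4
marked_rate = favorite_rate(1.0)    # exaggerated bonus so the effect is easy to see
```

With c = 0 the favorite is just one of four equal records. With a bonus it wins noticeably more often, and a real scheme would use a much smaller c than this demo does.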

What Came Before: Other Approaches to Watermarking

Alongside Gumbel watermarking, other schemes exist. One of the best-known approaches is described in a paper by Kirchenbauer et al. (2023). The idea there is different: the vocabulary is randomly divided into two lists – 'green' and 'red.' During generation, the model is instructed to prefer tokens from the 'green' list. Detection then boils down to a simple check: if 'green' words appear statistically more often than normal in the text, then we have a watermarked text.

This is elegant and relatively simple. But this approach has a vulnerability: the division into lists itself can become known to an attacker, who can then purposefully wash out the watermark by replacing 'green' words with 'red' ones.
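For comparison, the green-list check can be sketched as a one-proportion z-test. This is a simplified illustration: the per-token hash, the key, and the helper names are assumptions, and `gamma` stands for the share of the vocabulary placed on the green list.

```python
import hashlib
import math

def is_green(prev_token, token, gamma=0.5, key="demo-key"):
    """Pseudo-randomly put `token` on the green list, seeded by the previous token."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma

def green_z_score(tokens, gamma=0.5, key="demo-key"):
    """z-score of the green-token count against the 'no watermark' expectation."""
    pairs = list(zip(tokens, tokens[1:]))
    n = len(pairs)
    greens = sum(is_green(prev, tok, gamma, key) for prev, tok in pairs)
    return (greens - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

def make_green_text(length, vocab, gamma=0.5, key="demo-key"):
    """Toy 'generator' that always prefers a green token when one exists."""
    tokens = [vocab[0]]
    for _ in range(length):
        prev = tokens[-1]
        greens = [w for w in vocab if is_green(prev, w, gamma, key)]
        tokens.append(greens[0] if greens else vocab[0])
    return tokens

vocab = [f"w{i}" for i in range(20)]   # hypothetical 20-word vocabulary
marked = make_green_text(100, vocab)
z_marked = green_z_score(marked)
```

A text with no watermark should score near zero on average, while a text that keeps landing on green tokens produces a large positive z-score.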

Gumbel watermarking works differently: it manipulates not the choice from lists, but the random sampling process itself. This makes it theoretically more robust – but also more difficult to detect. That's why the question of a good detector for the Gumbel scheme remained open.

The New Detector: Simple as a Good Pass

The researchers proposed a detector that can be described as follows: if watermarking was applied, the tokens that received the bonus c will 'win' the competition against neighboring tokens slightly more often than they should have by pure probability.

The detection algorithm works step-by-step.

  1. Identify 'bonus' tokens. For each word in the analyzed text, the same hash function used during watermarking is applied. This allows us to know which word in each position should have received the bonus if watermarking was used.
  2. Get baseline scores. Using the same language model that supposedly generated the text, scores (logits) for all possible words are calculated for each position – that is, what the model 'thought' about the probability of each word in that context.
  3. Compare the reality with the expectation. For each 'bonus' token, we look at how much its actual score exceeds that of the nearest competitor without a bonus. If there is a watermark, this excess should be systematically positive.
  4. Apply a statistical test. The accumulated differences across the entire text are checked with a standard statistical test – for example, a z-test. If the average deviation differs significantly from zero, the text is likely watermarked.

Nothing extra. No complex neural networks on top of neural networks. No fine-tuning for a specific model. It's like a good pass in football: not flashy, but precise and timely.
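The four steps can be sketched roughly as follows. To be clear, this is a simplified stand-in and not the paper's exact statistic: instead of score margins, it counts how often the 'bonus' token actually appears and compares that with the model's baseline probability for it. The function names, the threshold of 4, and the flat toy logits are all assumptions.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def detection_z_score(observed, favorites, logits_per_position):
    """
    observed:            index of the token actually present at each position
    favorites:           index of the token the hash marks at each position (step 1)
    logits_per_position: baseline model scores at each position (step 2)
    """
    excess = 0.0
    variance = 0.0
    for token, fav, logits in zip(observed, favorites, logits_per_position):
        p_fav = softmax(logits)[fav]
        # Step 3: did the favorite appear more often than its baseline chance?
        excess += (1.0 if token == fav else 0.0) - p_fav
        variance += p_fav * (1.0 - p_fav)
    if variance == 0.0:
        return 0.0
    return excess / math.sqrt(variance)  # step 4: a z-statistic

def looks_watermarked(z, threshold=4.0):
    """Flag the text when the z-score clears a (hypothetical) threshold."""
    return z > threshold

# Toy data: 200 positions, four equally likely tokens at each one.
flat = [[0.0, 0.0, 0.0, 0.0] for _ in range(200)]
favorites = [i % 4 for i in range(200)]
z_marked = detection_z_score(favorites, favorites, flat)  # favorite always appears
z_plain = detection_z_score([0] * 200, favorites, flat)   # favorite appears 1 time in 4
```

When the favorite shows up exactly as often as chance predicts, the score sits at zero. When it shows up every time, the score is far above any sensible threshold.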

The Math Behind the Scenes: Why Gumbel?

Why is the Gumbel distribution so convenient in this scheme? It's not a random choice.

The Gumbel distribution appears in mathematics wherever you need to describe the maximum of many random variables. This is exactly what a language model does when choosing a token: it searches for the maximum among the scores of all possible words. It turns out that if you add a random variable from a Gumbel distribution to each score, the probability that the i-th token will be the maximum is exactly equal to the softmax probability – the standard way of choosing the next word in language models. This mathematical identity makes the Gumbel scheme theoretically clean: the watermark is embedded into a process that is already 'Gumbelian' by nature.
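This identity is often called the Gumbel-max trick, and it is easy to check numerically. The sketch below uses invented toy scores and compares how often each token wins the "score plus Gumbel noise" contest with its softmax probability.

```python
import math
import random

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def standard_gumbel(rng):
    """One standard Gumbel sample: -log(-log(U)), with U uniform on (0, 1)."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def gumbel_argmax(scores, rng):
    """Add independent Gumbel noise to each score and return the winning index."""
    noisy = [s + standard_gumbel(rng) for s in scores]
    return max(range(len(scores)), key=noisy.__getitem__)

scores = [2.0, 1.0, 0.5, -1.0]  # toy logits, hypothetical
trials = 20000
rng = random.Random(0)

wins = [0] * len(scores)
for _ in range(trials):
    wins[gumbel_argmax(scores, rng)] += 1

empirical = [w / trials for w in wins]
expected = softmax(scores)
largest_gap = max(abs(e - x) for e, x in zip(empirical, expected))
```

The gap between the two lists shrinks as the number of trials grows, which is what the identity predicts.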

When the constant c is added to one of the tokens, it shifts its position in the sorted list of scores. The researchers showed that this shift can be precisely described mathematically through the properties of order statistics of the Gumbel distribution. This is what allows for a rigorous proof that the proposed detector is nearly optimal.
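As a worked illustration of my own (not a formula quoted from the paper): let s_j be the scores, p_i the baseline softmax probability of the favored token i, and c the bonus. Adding c to s_i and choosing by softmax, or equivalently by the Gumbel maximum, changes that token's probability to p_i':

```latex
\[
p_i' \;=\; \frac{e^{s_i + c}}{e^{s_i + c} + \sum_{j \neq i} e^{s_j}}
     \;=\; \frac{e^{c}\, p_i}{1 - p_i + e^{c}\, p_i},
\qquad
p_i' \;\approx\; p_i + c\, p_i (1 - p_i) \quad \text{for small } c .
\]
```

For example, with p_i = 0.25 and c = 0.1, the favored token's probability rises only to about 0.269. A single choice barely changes, but the shift adds up over many tokens.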

What does 'nearly optimal' mean? It means that among all possible detectors that are unaware of the model's internal workings (so-called 'model-agnostic' detectors), the proposed algorithm minimizes two types of errors simultaneously. The first is a false negative: when a watermarked text is not recognized as such. The second is a false positive: when ordinary human text is mistakenly identified as watermarked. It has been theoretically proven that as the text volume increases, the detector approaches the Bayesian optimum – that is, the limit of accuracy that is achievable under the given conditions.
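The trade-off between the two error types can be made tangible with a threshold on a z-score. The sketch below assumes the test statistic is approximately standard normal for unwatermarked text, which is my simplifying assumption for illustration and not a claim from the paper.

```python
import math

def false_positive_rate(threshold):
    """One-sided tail probability P(Z > threshold) for a standard normal Z."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))

# Stricter thresholds mean fewer false alarms, but weak watermarks are missed more often.
rates = {t: false_positive_rate(t) for t in (1.645, 3.0, 4.0)}
```

A threshold near 1.645 flags roughly one unwatermarked text in twenty, while a threshold of 4 flags only a few per hundred thousand.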

How It Was Tested in Practice

Theory is one thing, but science demands practical verification. The researchers ran experiments with GPT-2, a well-known language model that has been studied thoroughly enough to serve as a test bed. Texts were generated in two ways: with Gumbel watermarking at different values of the constant c, and with no watermarking at all.

The new detector was then applied to these texts. The results were convincing: even with very small values of c, where the watermark had almost no effect on the text quality, the detector confidently distinguished between watermarked and non-watermarked texts. Standard classification quality metrics were measured:

  • Precision – what proportion of texts identified as watermarked actually are.
  • Recall – what proportion of truly watermarked texts were correctly identified.
  • F1-score – the harmonic mean of the previous two indicators, a generalized quality score.

On all three metrics, the new detector surpassed or matched existing methods, especially at small values of c.
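For readers who want to see the three metrics computed, here is a small sketch. The labels are invented for illustration and are not results from the study.

```python
def precision_recall_f1(predicted, actual):
    """Compute precision, recall and F1 from parallel lists of booleans."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)      # true positives
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)  # false alarms
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)  # missed watermarks
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: 4 watermarked texts followed by 4 ordinary ones.
actual = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False]

precision, recall, f1 = precision_recall_f1(predicted, actual)
```

Here the detector catches three of the four watermarked texts and raises one false alarm, so all three metrics come out at 0.75.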

Resistance to attacks was also tested. What happens if someone tries to 'wash out' the watermark – for example, by rephrasing the text, deleting random fragments, or substituting individual words? The detector showed sufficient robustness in scenarios where the attacker does not know the exact parameters of the watermarking. Of course, if the adversary has full knowledge of the scheme and complete control over the generation process, the situation changes. But this is more of a theoretical scenario than a practical one.

Another important practical conclusion: the longer the text, the more reliably the detector works. This is predictable from a theoretical point of view – more data means more reliable statistics – but it's nice to see it confirmed experimentally. For short texts of a few sentences, the detector is less stable; for long ones, it works confidently.
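The effect of length follows from basic statistics: if each token adds a small, roughly independent contribution with mean μ and standard deviation σ, the z-score over n tokens grows like √n · μ / σ. The sketch below uses invented values for μ and σ, purely to show the shape of the relationship.

```python
import math

def expected_z(n_tokens, mean_excess, std_excess):
    """Expected z-score after n tokens, assuming independent per-token contributions."""
    return math.sqrt(n_tokens) * mean_excess / std_excess

def tokens_needed(target_z, mean_excess, std_excess):
    """Smallest text length at which the expected z-score reaches target_z."""
    return math.ceil((target_z * std_excess / mean_excess) ** 2)

# Hypothetical per-token signal: a tiny mean shift buried in much larger noise.
mean_excess, std_excess = 0.05, 0.45

short_z = expected_z(100, mean_excess, std_excess)   # a few sentences
long_z = expected_z(2000, mean_excess, std_excess)   # a long article
needed = tokens_needed(4.0, mean_excess, std_excess)
```

With these made-up numbers, a hundred tokens are nowhere near enough to clear a threshold of 4, while a couple of thousand are.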

Limitations and Being Honest with Ourselves

Any good research paper is honest about its limitations. This study is no exception.

The first and main limitation is the assumption of token independence. In their theoretical proofs, the authors assume that each token is chosen independently of all others – the so-called i.i.d. (independent and identically distributed) assumption. In real language models, this is, of course, not the case: words in a text are deeply interconnected, and the model considers the entire preceding context. The authors honestly call this an idealization needed for rigorous mathematical proofs. In practice, as experiments show, the detector works well even without this condition – but the theoretical guarantee of strict optimality relies on it.

The second limitation is the need for access to the model. To calculate the baseline token scores, the detector needs the same model that generated the text – or at least access to its interface for calculating next-word probabilities. This is not always possible, especially if the model is proprietary or its parameters are unknown. The authors note that in some cases, API access is sufficient, rather than the model itself – but this still imposes practical limitations.

The third limitation is resistance to targeted attacks. If an adversary knows the watermarking scheme and has full control over the generation process, they can develop ways to neutralize the watermark. There is no absolute protection – and this must be accepted as a fact.

Why This Matters and What's Next

Watermarks for AI texts are not an academic game of numbers. They are a tool whose development will shape how transparent and accountable the environment is in which language models operate alongside humans.

As language models become more powerful and accessible – a consistent trend observed from 2018 to 2023 and one that appears to be continuing – the line between human and machine-generated text becomes increasingly blurred. Tools that allow us to find that line are becoming more and more valuable.

Gumbel watermarking, combined with the new detector, is one of the most theoretically sound approaches in this field at the time of the study's publication. It is simple to implement, model-agnostic, and approaches a mathematically provable optimum. It is not the final word, but it is a very clear and solid step in the right direction.

Among the directions for future research, the authors name three key areas:

  • Relaxing the assumption of token independence by developing a theory that accounts for the real interdependence of words in a text.
  • Creating detectors that are resistant to more sophisticated attacks without sacrificing generation quality.
  • Investigating more complex watermarking schemes capable of embedding not just a binary signal ('watermarked / not watermarked') but richer information – such as a specific model identifier or a timestamp.

Algorithms aren't better than us – they're just different. But sometimes that 'difference' allows them to notice what we miss: the faint rhythm of a watermark in thousands of random words. Like a seasoned musician who recognizes the samba even in the noise of a crowd.

Original Title: Refined Detection for Gumbel Watermarking
Article Publication Date: Mar 31, 2026
Original Article Author: Tor Lattimore

From Research to Understanding

How This Text Was Created

This material is based on a real scientific study, not generated “from scratch.” At the beginning, neural networks analyze the original publication: its goals, methods, and conclusions. Then the author creates a coherent text that preserves the scientific meaning but translates it from academic format into clear, readable exposition – without formulas, yet without loss of accuracy.

Neural Networks Involved in the Process

We show which models were used at each stage – from research analysis to editorial review and illustration creation. Each neural network performs a specific role: some handle the source material, others work on phrasing and structure, and others focus on the visual representation. This ensures transparency of the process and trust in the results.

  1. Research Summarization: Gemini 2.5 Flash (Google DeepMind). Highlighting key ideas and results.
  2. Creating Text from Summary: Claude Sonnet 4.6 (Anthropic). Transforming the summary into a coherent explanation.
  3. Translation into English: Gemini 2.5 Pro (Google DeepMind).
  4. Editorial Review: Gemini 2.5 Flash (Google DeepMind). Correcting errors and clarifying conclusions.
  5. Preparing Description for Illustration: DeepSeek-V3.2 (DeepSeek). Generating a textual prompt for the visual model.
  6. Creating Illustration: FLUX.2 Pro (Black Forest Labs). Generating an image based on the prepared prompt.