Allow me to begin with a small paradox I discovered while studying modern media economics. For decades, publishers built complex monetization systems: banners, subscriptions, paywalled articles, sponsored content. All of this worked as long as the reader was human – a creature capable of seeing ads, clicking links, and feeling a pang of guilt for not paying for a subscription. But what happens when the “reader” becomes a machine? A machine that has no need for ads, does not subscribe to newsletters, and devours thousands of articles per second with the diligence of a medieval scribe, only a million times faster?
This is precisely the question at the heart of the study I want to discuss. The paper proposes a model called pay-per-crawl, in which an AI crawler pays for the pages it fetches, and an algorithm named LM Tree, designed to solve a problem that at first glance seems like simple accounting but turns out to be a genuine philosophical puzzle about the nature of value.
Let's step back into history – just for a moment, as I always like to do, because the past has a way of explaining the present better than any analyst.
In the 18th century, Parisian newspapers earned their keep from subscribers and from those who paid for individual issues from street vendors. The value of information was tied to a specific reader, to their willingness to part with a coin. In the 20th century, television and radio arrived – and the model was turned on its head: content became “free” for the audience, while advertisers paid to reach that very audience. The internet initially replicated this model: banners, clicks, impressions.
But with the advent of large language models – systems trained on vast amounts of text – an entirely new entity emerged: the AI crawler. This is a program that crawls publishers' websites and collects content to train or operate artificial intelligence. It doesn't watch ads. It doesn't click the “subscribe” button. It simply takes the text – and leaves.
Publishers found themselves in the position of a library owner whose books are read by thousands of visitors, yet none of them pay an entrance fee because the entrance physically has no cash register. A new cash register was needed. And not just any cash register – a smart one, capable of setting different prices for different books.
A Problem You Can't Solve with a Ruler
At first, the solution seems obvious: divide all content into categories and assign each its own price. Cheaper for short news, more expensive for analysis. This is how most publishers operate, if they consider segmentation at all.
But this is where the real complexity begins. Imagine a major German tech publisher – the very subject of this study – with nearly nine thousand articles. It has eight editorial categories: “artificial intelligence,” “mobile devices,” “cybersecurity,” and so on. Logical, right?
But within the “artificial intelligence” category, you might find an article on “What is a neural network” written at the level of a high school paper alongside a deep dive into the architecture of transformer models, written by an engineer with twenty years of experience. For an AI crawler training a model, these two articles have wildly different values. The first is worth pennies. The second, gold. But they share the same editorial category.
The same thing happens in reverse. An article about smartphones might contain a unique analysis of market trends that proves more valuable than most pieces from the “analysis” section. The categories designed for a human reader do not reflect the value for a machine consumer. It's like trying to judge a wine by the color of its label rather than by its taste.
A fixed price for the entire archive is also a dead end. Set it too high, and crawlers will pass on the cheap content that was still worth selling at a low price. Set it too low, and you'll miss out on significant revenue from premium materials. This is a classic pricing dilemma, known to economists since Adam Smith, only in a new technical guise.
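The dilemma is easy to see in a few lines of arithmetic. The sketch below is a toy illustration, not the study's method: it assumes a crawler buys an article only when its willingness to pay is at least the posted price, and the willingness-to-pay values are hypothetical numbers invented for the example.

```python
def revenue_at_price(price, wtp):
    """Revenue when every article is offered at `price` and an article sells
    only if its willingness to pay (WTP) is at least that price."""
    return price * sum(1 for w in wtp if w >= price)


def best_single_price(wtp):
    """Try each observed WTP value as the posted price and return the
    (price, revenue) pair with the highest revenue."""
    candidates = sorted(set(wtp))
    return max(((p, revenue_at_price(p, wtp)) for p in candidates),
               key=lambda pair: pair[1])


# Hypothetical WTP values in arbitrary units (not data from the study):
# six cheap articles and two premium ones.
wtp = [1, 1, 1, 1, 1, 1, 10, 10]

low = revenue_at_price(1, wtp)     # everything sells, premium is underpriced
high = revenue_at_price(10, wtp)   # only the two premium articles sell
segmented = sum(wtp)               # each article priced at its own WTP

print(low, high, segmented)        # 8 20 26
print(best_single_price(wtp))      # (10, 20)
```

In this toy archive the best single price is 10, which abandons the six cheap articles entirely; pricing the two groups separately recovers the remaining 6 units of revenue.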
The Tree That Thinks
This is where LM Tree enters the stage – an algorithm whose name stands for “Language Model Tree.” To understand how it works, I propose an analogy from horticulture.
Imagine you are an expert sorter at a large apple warehouse. Before you are thousands of apples, and you need to price each one. Pricing each apple individually would be madness. So you begin with questions: “Is this apple sour or sweet?” Sour ones go one way, sweet ones the other. Then, you ask the next question within each group: “Is it large or small?” And so on, until you have several distinct groups, each of which can be assigned a reasonable price.
LM Tree does exactly the same thing – only with texts, and with a large language model acting as that expert sorter.
The algorithm begins by looking at the entire archive as a single whole and asks the first question: “What feature best divides the content into expensive and cheap categories?” The language model, after being fed the titles and descriptions of the articles, along with information on which ones crawlers “bought” more readily at a given price, offers hypotheses. For example: “Articles discussing the ethical aspects of AI with in-depth technical analysis” versus “brief news updates on the release of new devices.” The algorithm checks how much this division increases potential revenue, and if the result is positive, it locks in the split.
The process is then repeated within each of the two resulting groups. And again. And again. Until the tree stops “growing” – that is, until further splitting no longer yields a significant increase in revenue.
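The split-check-repeat loop described above can be sketched as a small greedy recursion. This is a simplified illustration under stated assumptions, not the authors' implementation: the candidate questions are a fixed hand-written list standing in for the language model's hypotheses, each article carries a known willingness to pay (`wtp`), and the names `grow`, `best_revenue`, and `min_gain` are hypothetical.

```python
Q_DEPTH = "Does the article offer in-depth analysis?"
Q_AI = "Is the article about AI?"


def best_revenue(wtp):
    """Best revenue from one posted price over a group of articles."""
    if not wtp:
        return 0
    return max(p * sum(1 for w in wtp if w >= p) for p in set(wtp))


def grow(articles, questions, min_gain):
    """Split on the yes/no question that raises revenue most, then recurse.
    Stop when no split gains at least `min_gain`."""
    base = best_revenue([a["wtp"] for a in articles])
    best = None
    for q in questions:
        yes = [a for a in articles if q in a["answers"]]
        no = [a for a in articles if q not in a["answers"]]
        if not yes or not no:
            continue  # the question does not separate this group
        gain = (best_revenue([a["wtp"] for a in yes])
                + best_revenue([a["wtp"] for a in no]) - base)
        if best is None or gain > best[0]:
            best = (gain, q, yes, no)
    if best is None or best[0] < min_gain:
        return {"leaf": [a["id"] for a in articles], "revenue": base}
    _, q, yes, no = best
    rest = [x for x in questions if x != q]
    return {"question": q,
            "yes": grow(yes, rest, min_gain),
            "no": grow(no, rest, min_gain)}


def total_revenue(tree):
    """Sum the revenue of all leaves."""
    if "leaf" in tree:
        return tree["revenue"]
    return total_revenue(tree["yes"]) + total_revenue(tree["no"])


# Hypothetical articles; "answers" holds the questions answered "yes".
ARTICLES = [
    {"id": "a1", "wtp": 10, "answers": {Q_DEPTH, Q_AI}},
    {"id": "a2", "wtp": 9, "answers": {Q_DEPTH}},
    {"id": "a3", "wtp": 2, "answers": {Q_AI}},
    {"id": "a4", "wtp": 1, "answers": set()},
    {"id": "a5", "wtp": 2, "answers": {Q_AI}},
]

tree = grow(ARTICLES, [Q_DEPTH, Q_AI], min_gain=2)
print(tree["question"])     # Does the article offer in-depth analysis?
print(total_revenue(tree))  # 22, versus 18 for the best single price
```

In this toy data the topical question about AI adds only one unit of revenue at every level, so the tree stops after a single split on analytical depth, which mirrors the stopping rule described in the text.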
An important detail: the algorithm operates purely on simple feedback – “bought” or “not bought.” No complex ratings, no questionnaires, no manual labor from editors. Just the binary signal of the market, multiplied by the power of a language model.
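How a binary signal can be enough to set a price is worth a short sketch. The article does not spell out how the study turns “bought” and “not bought” into revenue estimates, so the approach below is an assumption made for illustration: for each posted price, estimate the purchase rate from the log and multiply it by the price. The function name `price_from_feedback` and the log entries are hypothetical.

```python
from collections import defaultdict


def price_from_feedback(log):
    """`log` is a list of (posted_price, bought) pairs for one segment.
    Returns the price with the highest expected revenue per offer,
    plus the per-price estimates."""
    offers = defaultdict(int)
    sales = defaultdict(int)
    for price, bought in log:
        offers[price] += 1
        sales[price] += int(bought)
    # Expected revenue per offer = price * observed purchase rate.
    expected = {p: p * sales[p] / offers[p] for p in offers}
    best = max(sorted(expected), key=expected.get)
    return best, expected


# Hypothetical log: four offers at each of three prices.
log = (
    [(1, True)] * 4
    + [(5, True)] * 2 + [(5, False)] * 2
    + [(10, False)] * 4
)

best, expected = price_from_feedback(log)
print(best)      # 5
print(expected)  # {1: 1.0, 5: 2.5, 10: 0.0}
```

Nothing in the log says why a crawler declined an offer; the purchase rate at each price is the only input, which is the point the paragraph above makes.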
Numbers That Make You Think
The study was conducted on data from a major German tech publisher. The authors had access to 8,939 articles and over 80,000 requests from AI crawlers. The willingness to pay for each article was calculated based on real traffic data – meaning this is not a theoretical model in a vacuum, but an attempt to approximate real market conditions as closely as possible.
The results were quite telling:
- Compared to a single fixed price for the entire archive, LM Tree delivered a 65% increase in revenue.
- Compared to a simple two-category split (“premium” and “standard”), it showed a 47% increase.
- And here's the most intriguing part: compared to the publisher's own eight-segment editorial taxonomy, the increase was 40%.
That last point deserves special attention. The publisher had spent years building its system of categories, relying on editorial experience, common sense, and an understanding of its audience. And an algorithm that never read a single one of these articles “like a human” surpassed this system by forty percent. Why?
Because the editors created categories for people. And crawlers are not people. What seems valuable to a journalist or a reader (“exclusive interview,” “on-site report”) is not necessarily valuable to a system searching for dense, structured, technically rich data for training. LM Tree managed to pinpoint this exact difference – and monetize it.
What the Algorithm “Saw” That People Missed
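The three percentages reported above can be put on a common scale with a little arithmetic. The sketch assumes that all three lifts refer to the same LM Tree revenue on the same data, which the article implies but does not state outright; the single fixed price is normalized to 1.0.

```python
# Reported lifts of LM Tree over each baseline (from the figures above).
lifts = {
    "single fixed price": 0.65,
    "two-category split": 0.47,
    "editorial taxonomy": 0.40,
}

# LM Tree revenue with the single fixed price normalized to 1.0.
lm_tree = 1.0 + lifts["single fixed price"]

# If LM Tree = (1 + lift) * baseline, then baseline = LM Tree / (1 + lift).
implied = {name: lm_tree / (1 + lift) for name, lift in lifts.items()}

for name, value in implied.items():
    print(f"{name}: {value:.3f}")
# single fixed price: 1.000
# two-category split: 1.122
# editorial taxonomy: 1.179
```

Read this way, the publisher's eight categories already earn roughly 18% more than a single price, and LM Tree adds a further 40% on top of that.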
One of the most fascinating aspects of the study is which features LM Tree deemed significant.
The high-value category included articles that combined technical analysis with a discussion of broader implications – for example, ethical or strategic ones. Articles that didn't just report a fact but interpreted it in context. Materials that possessed analytical depth, not just a list of a new gadget's specifications.
The low-value category included brief news updates, reviews of specific hardware models without strategic context, and materials where the information was ephemeral.
Notably, these divisions cut across the editorial categories. A deep analytical article about smartphones ended up in the same bucket as an analysis of cloud computing – and both were priced higher than a superficial text from the “artificial intelligence” section. The algorithm saw the quality of thought, not the thematic label.
This reminds me of a famous paradox from auction history. At a Sotheby's auction in 1987, Van Gogh's “Irises” sold for $53.9 million – a record sum for the time. Yet decades earlier, the same painting had been valued far more modestly. The value hadn't changed. The mechanism for identifying it had. LM Tree does exactly the same thing: it doesn't create value, but it “finds” it where traditional tools couldn't see it.
Interpretability as an Unexpected Virtue
It's worth mentioning what the authors call the system's “interpretability.” In a world where machine learning algorithms increasingly resemble black boxes – “we don't know why the model made this particular decision, but it did” – LM Tree works differently.
Each split in the tree is a clear question formulated in human language. “Does the article contain an analysis of corporate strategies?” “Is this a review of a specific product?” “Does the text discuss long-term trends?” At any moment, the publisher can look at the tree and understand why one article costs more than another. This isn't just a convenience – it's fundamentally important for trusting the system.
Imagine an auditor reviewing a company's tax return. They can accept the result if they can follow the logic of each step. But if they're just given a number that emerged from the depths of a neural network without any explanation, their trust in it will be significantly lower. LM Tree is closer to the first scenario.
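The auditor's walk through the logic can be made concrete. The sketch below is an illustration only: the questions are taken from the examples above, but the tree shape, the prices, and the function name `explain` are hypothetical, and the yes/no answers are supplied by hand where a language model would supply them in practice.

```python
Q_STRATEGY = "Does the article contain an analysis of corporate strategies?"
Q_REVIEW = "Is this a review of a specific product?"

# A hypothetical pricing tree; prices are in arbitrary units.
TREE = {
    "question": Q_STRATEGY,
    "yes": {"price": 8.0},
    "no": {
        "question": Q_REVIEW,
        "yes": {"price": 1.0},
        "no": {"price": 3.0},
    },
}


def explain(tree, answers):
    """Walk from the root to a leaf and return the price together with
    the list of questions and answers that led to it."""
    steps = []
    node = tree
    while "question" in node:
        branch = "yes" if answers[node["question"]] else "no"
        steps.append(f'{node["question"]} {branch}')
        node = node[branch]
    return node["price"], steps


# A hypothetical gadget review with no strategic analysis.
price, steps = explain(TREE, {Q_STRATEGY: False, Q_REVIEW: True})
print(price)  # 1.0
for step in steps:
    print(step)
```

Every price comes with the short list of questions that produced it, which is exactly what a publisher would need to check or challenge a pricing decision.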
It would be unfair to conclude without mentioning the limitations. The study's authors themselves honestly acknowledge several important caveats.
First, the willingness-to-pay data was modeled based on existing crawler traffic, not derived from actual transactions. The pay-per-crawl market, as described in the study, is in its early stages of formation, and there is not yet enough historical data on real prices and purchases. This means the figures – compelling as they are – remain a theoretical estimate.
Second, the algorithm was tested on a single publisher in one country and one subject area. How well it would work for, say, a news agency or a medical portal remains an open question.
Third, there is the interesting problem of temporal decay. The value of an article is not constant: a deep analysis of a technology that was relevant in 2023 might turn into a historical artifact by 2027. A system that cannot account for this dynamic risks becoming obsolete along with its segments.
And finally, the most delicate question: what if different AI systems value different content? A crawler training a model for medical diagnostics and a crawler collecting data for financial analysis may have completely different requirements for the same text. LM Tree, in its current configuration, does not differentiate between types of buyers – this is a direction for future research.
Why This Matters – And Not Just for Publishers
One might think that all this is a niche story about media monetization, interesting only to the editors-in-chief and CFOs of tech publications. But I am convinced there is something more at play here.
We are witnessing the formation of a fundamentally new market – a market for data as a raw material for training artificial intelligence. And this market raises questions that economists will be debating for a long time. How is the value of information determined in an era when its main consumer is not a human, but a machine? What is fair compensation for intellectual labor when its results are used to create systems that could potentially replace that very labor? Who ultimately wins – the publisher who learns to sell for more, or the AI company that gets the data it needs anyway?
LM Tree is not the answer to these questions. It is a tool that takes one specific step: it helps publishers stop selling everything at a single price when the difference in value is obvious to everyone but the price tag. This is a modest but real step forward.
The history of money knows many examples of how a new mechanism for assessing value radically changed the balance of power. The emergence of futures contracts in the 17th century allowed grain merchants to finally manage a risk they had previously only feared. The advent of credit ratings in the 20th century changed who gets access to capital and on what terms. LM Tree is a much more modest invention. But the principle is the same: a new way to measure value changes who receives it.
And in that sense, an algorithm trained to bargain over articles about smartphones may turn out to be a small but symptomatic page in the long history of humanity's endless reinvention of how to agree on a price.