When an AI agent answers a question, evaluating it seems easy enough: a correct answer is good, a mistake is bad. But if you dig a little deeper, it turns out that «looking okay» and «working reliably» are two very different things. This gap is exactly where the story of NOVA begins – an evaluation system that LightOn is building around its product.
Why Do We Even Need an Evaluation System – Isn't Intuition Enough?
Imagine asking a corporate AI assistant something based on internal documents and receiving a coherent, confident response. The problem is that a confident-sounding text and a factually correct text are not the same thing. Language models are experts at convincingly articulating things that aren't even in the source material.
Simply put: without measurement, it is impossible to distinguish real improvement from a stroke of luck or an unnoticed regression. That is why LightOn developed NOVA – a set of tools and approaches that measure specific metrics at every stage of the system's operation instead of relying on an «impression.»
Layer by Layer: How Honest Evaluation Works
Most modern AI agents for document processing work roughly like this: the system searches for the necessary fragments in a knowledge base and then generates an answer based on them. This approach is called RAG – «Retrieval-Augmented Generation.» It sounds simple, but in practice, there are many stages in this chain where something can go wrong.
Search and Answer – The Most Obvious Level
The first question is: did the system find the right documents? And the second: does the answer match what is written in them?
With search, things are relatively clear – there are established metrics that show how accurately the system finds relevant fragments. Assessing the quality of the answer is harder. For years, researchers have tried to solve this: developing text-comparison algorithms and analyzing phrasing similarity. But when the same thought can be expressed in a dozen different ways, such approaches often fail.
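To make the «established metrics» for search concrete, here is a minimal sketch of two common ones, recall@k and reciprocal rank. The function names and the chunk IDs are hypothetical, and the source does not say which search metrics NOVA actually computes.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(set(relevant))


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical example: chunk IDs returned by the search, best first.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = ["c2", "c4"]

print(recall_at_k(retrieved, relevant, k=3))  # 0.5
print(reciprocal_rank(retrieved, relevant))   # 0.5
```

Averaging these values over a labeled set of queries gives one number per metric that can be tracked from one system version to the next.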
The «AI evaluating AI» method is popular now, where one language model acts as a judge for another. But there are pitfalls here too: such «judges» tend to prefer long, confident-sounding answers and can give inconsistent ratings from one run to the next. Asking a model to give a score from 1 to 10 is essentially the same subjective «matter of taste», just dressed up as numbers.
NOVA uses a different approach: instead of one «judge» evaluating everything at once, several highly specialized ones are involved. One checks if the model hallucinated facts missing from the sources. Another ensures the system correctly refuses to answer if the required information isn't there. Each evaluates a specific aspect based on clear criteria. The key observation: one «mega-judge» trying to cover everything performs worse than a group of narrow specialists.
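The idea of several narrow judges can be sketched as a set of independent checks, each returning a pass/fail verdict for a single criterion. This is an illustration only, not NOVA's implementation: the `Sample` structure, the judge names, and the string-based heuristics are all assumptions, and in a real system each judge would more likely be a language model call with a narrowly scoped prompt.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    question: str
    sources: List[str]   # retrieved passages given to the model
    answer: str          # what the model produced
    answerable: bool     # ground truth: do the sources contain the answer?


# Hypothetical phrases that mark a refusal to answer.
REFUSAL_MARKERS = ("i don't know", "not in the provided documents")


def is_refusal(answer: str) -> bool:
    text = answer.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def judge_refusal(sample: Sample) -> bool:
    """Pass if the system refuses exactly when the sources lack the answer."""
    return is_refusal(sample.answer) != sample.answerable


def judge_grounding(sample: Sample) -> bool:
    """Toy stand-in for a hallucination check: every number in the answer
    must appear somewhere in the sources."""
    source_text = " ".join(sample.sources)
    numbers = [tok.strip(".,") for tok in sample.answer.split()
               if tok.strip(".,").isdigit()]
    return all(num in source_text for num in numbers)


JUDGES: Dict[str, Callable[[Sample], bool]] = {
    "refusal": judge_refusal,
    "grounding": judge_grounding,
}


def evaluate(sample: Sample) -> Dict[str, bool]:
    """Run every narrow judge and report one verdict per criterion."""
    return {name: judge(sample) for name, judge in JUDGES.items()}


sample = Sample(
    question="How many offices does the company have?",
    sources=["The company operates 12 offices across Europe."],
    answer="The company has 15 offices.",
    answerable=True,
)
print(evaluate(sample))  # {'refusal': True, 'grounding': False}
```

Because each verdict is tied to one criterion, a failure points directly at what went wrong, which a single 1-to-10 score cannot do.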
Reranking – Does It Actually Help?
Many modern systems add an intermediate step between search and generation: a more powerful model reorders the retrieved fragments so that the most relevant ones come first. In theory, this should improve quality. In practice, the gain has to be verified, because the step adds latency; if it brings no measurable improvement, it is nothing more than extra overhead. NOVA compares search quality before and after this stage to see the real picture, not the assumed one.
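A before/after comparison of this kind can be sketched as follows. Precision@k is used here only as an example metric, and the document IDs and the latency figure are made up for illustration; the source does not specify which metric NOVA uses for this comparison.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Share of the top-k results that are relevant."""
    top_k = ranked_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)


def rerank_gain(before, after, relevant_ids, k):
    """Difference in precision@k between the reranked and the original order."""
    return (precision_at_k(after, relevant_ids, k)
            - precision_at_k(before, relevant_ids, k))


# Hypothetical query: the same five candidates, before and after reranking.
relevant = {"d1", "d4"}
before = ["d3", "d1", "d5", "d4", "d2"]
after = ["d1", "d4", "d3", "d5", "d2"]

gain = rerank_gain(before, after, relevant, k=2)
added_latency_ms = 120  # hypothetical measured cost of the reranking step

print(f"precision@2 gain: {gain:+.2f} for {added_latency_ms} ms extra latency")
# precision@2 gain: +0.50 for 120 ms extra latency
```

Reporting the gain next to the added latency keeps the trade-off visible: a gain near zero means the step only costs time.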
Document Preparation – The Invisible Point of Failure
Before a document enters the search index, it must be processed: text has to be recognized, content extracted from PDFs, and the result split into appropriately sized chunks. This stage usually stays behind the scenes, but it is often where critically important information is lost.
LightOn notes that many cases of model «hallucinations» actually turned out to be parsing issues rather than model errors: the model simply wasn't provided with the necessary content and worked with what it had. It's like blaming a chef for a dish's bad taste without noticing the ingredients were already spoiled in the warehouse. Therefore, in NOVA, the quality of document processing is a full-fledged metric, not a secondary parameter.
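One simple way to turn document processing into a number is to check whether facts known to be in the original document survive parsing and chunking. The sketch below is an assumption about how such a check could look, not a description of NOVA's metric; `parsing_coverage` and the sample facts are hypothetical.

```python
def parsing_coverage(expected_facts, chunks):
    """Fraction of expected facts found verbatim (case-insensitive) in any chunk."""
    if not expected_facts:
        return 1.0
    haystack = [chunk.lower() for chunk in chunks]
    found = [fact for fact in expected_facts
             if any(fact.lower() in chunk for chunk in haystack)]
    return len(found) / len(expected_facts)


# Hypothetical document: one table cell was dropped during PDF extraction.
expected = ["Warranty: 24 months", "Max load: 85 kg", "Voltage: 230 V"]
chunks = [
    "Warranty: 24 months from the date of purchase.",
    "Max load: 85 kg per shelf.",
]

print(round(parsing_coverage(expected, chunks), 2))  # 0.67
```

A low score here means that an answer missing the voltage would be a parsing failure, not a model hallucination.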
Agent Decisions – The Layer That Determines Everything
In a simple system, every request follows the same path. In more complex ones, the agent first decides: do I even need to look for something? In which source? How should I rephrase the question? This is a separate level that also requires evaluation. A mistake at this stage undermines everything downstream – even a perfectly tuned search won't help if the agent decided to look in the wrong place.
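Evaluating this level can start with something as simple as comparing the agent's chosen action with a labeled expected action. The sketch below assumes a hypothetical label set (source names plus «no_search») and is not taken from NOVA.

```python
from collections import Counter


def routing_report(cases):
    """cases: list of (expected_action, chosen_action) pairs.
    Returns overall accuracy and a count of each kind of mistake."""
    correct = sum(1 for expected, chosen in cases if expected == chosen)
    mistakes = Counter((expected, chosen)
                       for expected, chosen in cases if expected != chosen)
    accuracy = correct / len(cases) if cases else 0.0
    return accuracy, mistakes


# Hypothetical labels: which source the agent should query, or "no_search".
cases = [
    ("hr_docs", "hr_docs"),
    ("no_search", "no_search"),
    ("legal_docs", "hr_docs"),
    ("legal_docs", "legal_docs"),
]

accuracy, mistakes = routing_report(cases)
print(accuracy)        # 0.75
print(dict(mistakes))  # {('legal_docs', 'hr_docs'): 1}
```

The breakdown of mistakes matters as much as the accuracy: it shows which sources the agent confuses, and therefore which downstream failures are not the search's fault.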
Public Rankings Are a Hypothesis, Not a Verdict
In the industry, it is common to compare models using public benchmarks – standardized test sets for the objective measurement of capabilities. LightOn actively participates in this: monitoring benchmark quality, fixing errors in existing ones, and releasing its own.
However, a public ranking answers the question «how good is this model in controlled conditions», not «how successfully does it work in your specific system with your documents and queries.» That is why at LightOn, any new model undergoes not only public tests but also an internal set of checks on real-world data. If a model leads the rankings but regresses on documents containing tables, the regression is caught before the model ever reaches the final product.
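Catching a regression that affects only one kind of document requires breaking results down by slice instead of looking at a single average. A minimal sketch, with hypothetical slice names and results:

```python
from collections import defaultdict


def accuracy_by_slice(results):
    """results: list of (slice_name, is_correct) pairs -> {slice: accuracy}."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
    for slice_name, is_correct in results:
        totals[slice_name][0] += int(is_correct)
        totals[slice_name][1] += 1
    return {name: correct / total for name, (correct, total) in totals.items()}


def regressions(baseline, candidate, tolerance=0.0):
    """Slices where the candidate scores below the baseline by more than tolerance."""
    return sorted(name for name in baseline
                  if candidate.get(name, 0.0) < baseline[name] - tolerance)


# Hypothetical internal results for the current and the candidate model.
current = accuracy_by_slice([("prose", True), ("prose", True),
                             ("tables", True), ("tables", True)])
new = accuracy_by_slice([("prose", True), ("prose", True),
                         ("tables", True), ("tables", False)])

print(regressions(current, new))  # ['tables']
```

In this toy data the candidate's overall accuracy is still 75%, which could look acceptable on its own; only the per-slice view shows that the entire loss comes from tables.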
Evaluation Is a Constant Process, Not a Finish Line
The temptation is great: set up the system once, run a check, get good results, and forget about it. But systems are not static. New data sources appear with documents the processing pipeline wasn't tuned for. Different model versions may react differently to the same prompts. Users ask questions that weren't in the test sets.
Software development reached this conclusion long ago: the earlier you catch an error, the cheaper it is to fix. The same principle applies here. At LightOn, every significant change – a new model, a document chunking strategy, or a prompt template – goes through NOVA before deployment. This makes it possible to notice, before users start to complain, that a new model has become wordier and is slowing the system down.
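A pre-deployment check of this kind can be expressed as a gate that compares a candidate configuration against the current baseline. The metric names, values, and budgets below are hypothetical and only illustrate the mechanism; they are not NOVA's actual thresholds.

```python
def gate(baseline, candidate, max_relative_increase):
    """Return the metrics where the candidate exceeds the baseline by more
    than the allowed relative increase. Lower is better for every metric."""
    failures = []
    for metric, limit in max_relative_increase.items():
        allowed = baseline[metric] * (1 + limit)
        if candidate[metric] > allowed:
            failures.append(metric)
    return failures


# Hypothetical averages measured on the same evaluation set.
baseline = {"answer_tokens": 180, "latency_s": 2.0, "hallucination_rate": 0.04}
candidate = {"answer_tokens": 260, "latency_s": 2.1, "hallucination_rate": 0.04}

# Hypothetical budgets: how much worse each metric may get before a release is blocked.
budgets = {"answer_tokens": 0.20, "latency_s": 0.10, "hallucination_rate": 0.0}

print(gate(baseline, candidate, budgets))  # ['answer_tokens']
```

Here the candidate is blocked for verbosity alone, even though its hallucination rate is unchanged, which is the kind of side effect that is easy to miss without an explicit check.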
At the same time, evaluation is not just about quality control. LightOn uses NOVA as a foundation for automatically optimizing the system's configuration: first an improvement process is run, then its result is validated with a full evaluation run. The evaluation system becomes more than just a filter; it becomes a tool that makes the product better.
Cost and Payoff
Building such an infrastructure is an investment. It requires time, expertise, and a willingness to slow down for the sake of quality. But the investment pays off: there are fewer arguments over what «seems» better, the number of critical failures decreases, and iterations speed up because it becomes clear exactly what to look at.
In short: you cannot improve what you do not measure. And without properly structured measurements, you might spend a long time improving the wrong things.