Imagine a giant library. Not an ordinary one, but a place where the books are written in dozens of specialized languages, some pages are torn out and hidden in neighboring buildings, and the keys to certain cabinets are held by people who left long ago. This is roughly what biomedical scientific literature looks like to a researcher trying to use someone else's data for a new discovery.
Every year, tens of thousands of articles are published on what are known as -omics studies – a broad family of approaches that allow for the simultaneous study of thousands of molecules in a living organism. Genomics reads DNA, transcriptomics reads RNA, proteomics reads proteins, and metabolomics reads the products of metabolism. Each such article is a treasure trove of information. But to reuse this information? It's nearly impossible without a colossal investment of time and effort.
Why Is the Data There, But Unusable?
It would seem that in the era of open science, everything should be accessible. And technically, it is. Authors deposit their data in public repositories. Entire databases exist where results from proteomic, genomic, and other studies are stored. But here's the catch: to use someone else's data for a new analysis, you first have to find it. Then you have to understand exactly how it was collected. Then download it in the right format. Then process it with the correct tool. And then ensure your result is comparable to what the original authors obtained.
At each of these steps, a researcher faces obstacles. The dataset identifier might be mentioned in passing in a footnote on page twelve. The processing method might be described in supplementary materials attached as a separate PDF file. The analysis code might be on GitHub, with the link buried in the text of an article written three years ago.
This isn't anyone's malicious intent. It's simply a historically developed ecosystem where each element was created independently and for different purposes. The journal article is for humans. The data repository is for machines. The supplementary materials are for the especially persistent. As a result, the information is there, but it's fragmented, like a jigsaw puzzle scattered across three different boxes.
Enter the Agents 🤖
This is the very problem that a team of researchers tried to solve by developing a system of so-called AI agents for working with -omics data. The publication describing this system appeared in 2024–2025 and offers perhaps one of the most concrete and functional answers to the question: what if we delegate all this routine work to artificial intelligence?
Not just a language model that answers questions, but a real team of specialized software agents, each capable of doing something specific – finding articles, extracting the necessary information from them, downloading data, running analyses, and comparing results.
To continue the library metaphor: this isn't just a smart reader who quickly flips through books. It's an entire team – a bibliographer, an archivist, a lab assistant, and an analyst – all working in concert after receiving a single, simple task from you in natural language.
How It Works: A Team of Specialists Inside a Computer
At the core of the system is a so-called planner agent. It receives a task formulated in natural language, for example: “Find studies on the proteomics of liver fibrosis, download the data, and check which proteins are behaving abnormally.” The planner then takes this task and breaks it down into specific steps, selecting the appropriate tool for each one.
In the described system, the role of the planner is played by the GPT-4 language model from OpenAI – one of the most powerful models available at the time of development, with advanced reasoning abilities and a capacity for understanding complex instructions. But the planner is just the “brain.” The real work is done by specialized tools to which it has access.
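One way to picture what the planner hands off is an ordered list of tool calls. The sketch below is illustrative only: the `Step` structure and the tool names are hypothetical, and in the described system the steps are generated by the language model rather than hard-coded.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One planned action: which tool to call and with what arguments."""
    tool: str                                  # hypothetical tool name
    args: dict = field(default_factory=dict)   # arguments passed to the tool


def example_plan(topic: str) -> list[Step]:
    """Return a hand-written plan of the kind a planner might produce.

    A real planner would derive these steps from the user's request;
    they are fixed here purely to show the shape of the output.
    """
    return [
        Step("search_articles", {"query": f"{topic} proteomics"}),
        Step("extract_metadata", {"fields": ["omics_type", "organism", "dataset_id"]}),
        Step("download_dataset", {"source": "repository named in the article"}),
        Step("run_pipeline", {"pipeline": "MaxQuant-Perseus"}),
        Step("compare_results", {"against": "published protein list"}),
    ]


plan = example_plan("liver fibrosis")
for number, step in enumerate(plan, start=1):
    print(number, step.tool, step.args)
```

Each step names a tool and its inputs, which is all the planner needs to pass work along to the specialized modules described next.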
Tool One: Searching and Reading Articles
The agent can access the PubMed and PubMed Central databases – the world's largest repositories of biomedical literature – and download the full texts of articles. Not just the abstracts, but the full texts in a machine-readable format. This allows it to analyze the article's content at the level of individual paragraphs and tables.
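The article does not say how the agent talks to PubMed Central. One plausible route is NCBI's public E-utilities interface, so the sketch below assumes that route and only builds a search URL without sending it. The function name and the default of 20 results are choices made for this example.

```python
from urllib.parse import urlencode

# Base address of NCBI's public E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pmc_search_url(query: str, max_results: int = 20) -> str:
    """Build an ESearch URL that looks up `query` in PubMed Central."""
    params = {
        "db": "pmc",            # search PubMed Central (full texts)
        "term": query,          # free-text query
        "retmax": max_results,  # maximum number of article IDs to return
        "retmode": "json",      # ask for a JSON response
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_pmc_search_url("liver fibrosis proteomics")
print(url)
```

The response to such a request is a list of article IDs, which a second call can turn into full texts.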
Tool Two: Extracting Metadata
After receiving the article text, the agent launches a specialized metadata extractor. This module can “read” the scientific text and pull structured information from it: what type of -omics analysis was used, on what biological material, under what experimental conditions, and – most importantly – where the data is stored.
It is at this stage that the system searches for dataset identifiers. In the world of proteomics, for instance, there is an international repository called ProteomeXchange, where each deposited dataset receives a unique code like PXD000000. The agent is able to find these codes in the article text – even if they are hidden in supplementary materials or mentioned in passing in the “Data Availability” section.
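The simplest part of this job, spotting codes of the form PXD plus six digits, can be sketched with a regular expression. This is a minimal illustration, not the extractor from the described system, which has to cope with far messier input. The accession numbers in the sample text are made up.

```python
import re

# ProteomeXchange accessions: the prefix "PXD" followed by six digits.
PXD_PATTERN = re.compile(r"\bPXD\d{6}\b")


def find_pxd_ids(text: str) -> list[str]:
    """Return unique PXD accessions in order of first appearance."""
    seen: list[str] = []
    for match in PXD_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


sample = (
    "Data Availability: raw files were deposited to ProteomeXchange "
    "via PRIDE (PXD012345). See also PXD012345 and PXD067890."
)
ids = find_pxd_ids(sample)
print(ids)
```

A pattern like this only works when the identifier is written out in the expected form. Codes split across lines, or mentioned only in a separate supplementary file, need more than a regular expression.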
The accuracy of this search in the described experiments was 80%. This means that in eight out of ten cases, the agent correctly found and extracted the dataset identifier. That is short of perfect, but consider the context: we are talking about the automated processing of unstructured scientific text written by hundreds of different authors in a wide variety of formats.
Tool Three: Downloading Data
With the dataset identifier in hand, the agent connects to the corresponding repository and downloads the necessary files. In the case of proteomics, these could be “raw” mass spectrometry data – huge files containing information about the thousands of molecular fragments detected by the instrument in a biological sample.
Tool Four: Analyzing Data in a Container
This is where things get particularly interesting. Downloading the data is only half the battle. Mass spectrometry data cannot simply be “opened and viewed.” It needs to be processed with specialized software – for example, the widely used proteomics package MaxQuant. This is a complex, multi-step process that results in a table with quantitative estimates of protein abundance in each sample.
To ensure the reproducibility of this process, the system's authors used the concept of containerization. Imagine that each analytical tool is packed into a sealed box along with everything it needs – the right operating system libraries, all dependencies, and settings. You open this box, put the data inside, close it – and you get the result. Wherever and whenever you run the process, the same input should yield the same result. This is fundamentally important for science, where reproducibility is one of the cornerstones.
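To make the “sealed box” idea concrete, here is a sketch of how such a run could be assembled, assuming Docker as the container runtime (the article does not name one). The image name, folder, and parameter file are hypothetical, and the code only builds the command without executing it.

```python
import shlex


def container_command(image: str, data_dir: str, params_file: str) -> list[str]:
    """Assemble a `docker run` command for one analysis step.

    Nothing is executed here; the function only builds the argument list.
    """
    return [
        "docker", "run",
        "--rm",                      # remove the container when it exits
        "-v", f"{data_dir}:/data",   # mount the host data folder at /data
        image,                       # image pinned to a fixed version tag
        f"/data/{params_file}",      # argument handed to the tool inside
    ]


cmd = container_command(
    image="example/proteomics-pipeline:1.0",  # hypothetical image name
    data_dir="/home/user/PXD012345",          # hypothetical download folder
    params_file="params.xml",                 # hypothetical settings file
)
print(shlex.join(cmd))
```

Pinning the image to an explicit version tag is what keeps the “box” the same from one run to the next.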
The agent's access to these “boxes” is managed through a special protocol – MCP (Model Context Protocol). In essence, this is a standardized way for the language model to call external tools and receive results from them. The agent “tells” the server, “Run this pipeline with this data,” and gets the analysis results back.
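Under the hood, MCP messages follow the JSON-RPC 2.0 format, and a tool invocation uses the `tools/call` method. The sketch below shows what such a request could look like. The tool name `run_pipeline` and its argument are hypothetical, since the article does not list the tools the server exposes.

```python
import json


def tool_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message, indent=2)


request = tool_call_request(
    request_id=1,
    tool="run_pipeline",                    # hypothetical tool name
    arguments={"dataset_id": "PXD012345"},  # made-up accession
)
print(request)
```

The server replies with a message carrying the same `id`, so the agent can match each result to the request that produced it.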
The Liver Fibrosis Experiment: What Happened in Practice
The researchers gave the agents a specific task to test the system in action: find articles on the proteomics of liver fibrosis, download the data, reprocess it using a standardized method, and compare the results with what was reported in the original publications.
Liver fibrosis is a condition in which normal organ tissue is replaced by scar, or connective, tissue. This occurs in response to long-term damage – for example, from chronic viral hepatitis or alcohol abuse. Understanding which proteins behave abnormally during this process is important for developing diagnostic markers and new treatments.
The agents went through the entire process: they found articles in PubMed Central, extracted the dataset identifiers, downloaded the raw mass spectrometry data, and reprocessed it using a containerized MaxQuant-Perseus pipeline. They then performed a differential expression analysis – that is, they determined which proteins showed a statistically significant change in their levels in diseased tissue compared to healthy tissue.
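For a single protein, the core of a differential expression test can be sketched in a few lines: how large is the change, and how large is it relative to the scatter between samples? The example below is a simplified stand-in, not the Perseus procedure used in the described pipeline. The intensity values are made up, and a real analysis would also compute p-values and correct them for the thousands of proteins tested at once.

```python
import math
from statistics import mean, variance


def log2_fold_change(disease: list[float], healthy: list[float]) -> float:
    """log2 of the ratio of mean intensities (disease over healthy)."""
    return math.log2(mean(disease) / mean(healthy))


def welch_t(disease: list[float], healthy: list[float]) -> float:
    """Welch's t statistic for two groups with possibly unequal variances."""
    standard_error = math.sqrt(
        variance(disease) / len(disease) + variance(healthy) / len(healthy)
    )
    return (mean(disease) - mean(healthy)) / standard_error


# Made-up intensities for one protein, three samples per group.
disease = [8.0, 8.4, 7.6]
healthy = [2.0, 2.2, 1.8]

print("log2 fold change:", round(log2_fold_change(disease, healthy), 2))
print("Welch t statistic:", round(welch_t(disease, healthy), 2))
```

A protein is usually called “altered” only when both numbers are large: a big fold change with a small t statistic may just be noise.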
The result: a 63% overlap with the findings published by the original authors. In other words, the system, operating automatically, independently found nearly two-thirds of the proteins that the original authors had labeled as “altered.”
Why not one hundred percent? It's important to understand a few things here. First, the scientific analysis of proteomic data isn't a single straight road but a whole network of intersections, where you can turn differently at each one. Minor differences in processing parameters – which search algorithm to use, how to filter data, what statistical threshold to consider significant – can substantially alter the final list of proteins. Second, method descriptions in scientific articles are often incomplete. Authors frequently omit some details of their analysis, assuming their colleagues will understand anyway. Under these circumstances, 63% is not a weak result but rather an honest assessment of how well the system copes with the reality of imperfect data.
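The article does not spell out how the overlap was calculated. A natural reading is the share of proteins from the published list that the automated reanalysis also flagged, and the sketch below assumes that definition. The protein lists are made up for illustration.

```python
def overlap_fraction(reanalysis: set[str], published: set[str]) -> float:
    """Share of published proteins that the reanalysis also reports."""
    if not published:
        raise ValueError("published protein list is empty")
    return len(reanalysis & published) / len(published)


# Made-up protein lists, for illustration only.
published = {"COL1A1", "COL3A1", "FN1", "LUM", "TIMP1", "VIM"}
reanalysis = {"COL1A1", "COL3A1", "FN1", "LUM", "APOA1"}

fraction = overlap_fraction(reanalysis, published)
print(f"{fraction:.0%} of published proteins recovered")
```

Note that this measure ignores proteins found only by the reanalysis, such as `APOA1` here. A symmetric measure like the Jaccard index would count them against the score.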
Comparing Studies: When One Plus One Is Greater Than Two
The second scenario demonstrated by the authors is even more interesting from a scientific standpoint. The agents were tasked not just with reproducing a single experiment, but with finding several similar studies, assessing their compatibility, and performing a combined analysis.
This is what scientists call a meta-analysis – the systematic combination of data from several independent studies to obtain more robust and generalizable conclusions. Traditionally, a meta-analysis requires months of work: manually finding all suitable articles, reading them, extracting the data, standardizing the formats, and only then beginning the analysis.
In the described system, the agent uses a special semantic similarity tool. Its principle of operation is reminiscent of how recommendation algorithms on streaming services find similar movies: instead of comparing texts by keywords, the system converts article descriptions into numerical vectors – mathematical “fingerprints” of meaning – and compares them. Articles with similar “fingerprints” are likely studying similar phenomena.
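Comparing two “fingerprints” usually comes down to a single number. The sketch below uses cosine similarity, a common choice for this task. The article does not say which measure the system uses, so treat that as an assumption, and the three-dimensional vectors are toy values (real embeddings have hundreds of dimensions).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "fingerprints" for three article descriptions.
study_a = [0.9, 0.1, 0.3]
study_b = [0.8, 0.2, 0.4]   # points in nearly the same direction as study_a
study_c = [0.1, 0.9, 0.0]   # points elsewhere

print(round(cosine_similarity(study_a, study_b), 3))
print(round(cosine_similarity(study_a, study_c), 3))
```

Here `study_a` and `study_b` score far higher than `study_a` and `study_c`, which is the signal the agent would use to group studies together.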
After finding a group of semantically similar studies, the agent runs a data compatibility checker: can these datasets be combined at all? This is a crucial question. If one group of scientists worked with mouse tissue and another with human biopsies, directly combining the data could yield meaningless results.
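In its simplest form, a compatibility check compares a few metadata fields and reports where two studies differ. The sketch below is a minimal illustration. The `StudyMeta` structure, the choice of fields, and the accessions are hypothetical, and the checker in the described system may look at much more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudyMeta:
    """Minimal metadata extracted from one article (hypothetical fields)."""
    accession: str
    organism: str
    tissue: str
    omics_type: str


# Fields that must match before two datasets are pooled.
REQUIRED_MATCH = ("organism", "tissue", "omics_type")


def mismatches(a: StudyMeta, b: StudyMeta) -> list[str]:
    """Return the names of fields on which the two studies disagree."""
    return [
        name for name in REQUIRED_MATCH
        if getattr(a, name).strip().lower() != getattr(b, name).strip().lower()
    ]


mouse_1 = StudyMeta("PXD000001", "Mus musculus", "liver", "proteomics")
mouse_2 = StudyMeta("PXD000002", "mus musculus", "Liver", "proteomics")
human_1 = StudyMeta("PXD000003", "Homo sapiens", "liver", "proteomics")

print(mismatches(mouse_1, mouse_2))  # empty list: safe to combine
print(mismatches(mouse_1, human_1))  # lists the conflicting field
```

Returning the list of conflicting fields, rather than a bare yes or no, tells the researcher why two datasets were kept apart.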
In the liver fibrosis case, the system successfully found several studies that used the same experimental model – the so-called CCl4-induced fibrosis model in mice, in which fibrosis is triggered by carbon tetrachloride. This model is replicated in labs around the world as a standard, which makes the data from such studies comparable.
The combined analysis identified a group of proteins and molecular pathways that consistently change in liver fibrosis, regardless of the specific lab or the year the experiment was conducted. These are proteins associated with the remodeling of the extracellular matrix – a kind of “scaffolding” that supports cells in the tissue – as well as proteins involved in inflammatory responses and lipid metabolism. All of this aligns well with what is known about the mechanisms of fibrosis development and adds confidence in the validity of the identified patterns.
Why This Matters: From Static Pages to Living Knowledge
This whole story isn't just a demonstration that AI can read articles and push buttons for scientists. Behind it lies a deeper idea that could change the very structure of how scientific knowledge is produced.
Today, a published scientific article is, in a sense, a dead artifact. It is written, peer-reviewed, published, and sits in a database. Researchers cite it, read it, and sometimes reproduce its experiments. But in itself, it is passive. It cannot answer a new question. It cannot merge with another article to produce a result that is not contained in either one.
The described system takes a step toward changing this. An article ceases to be just text and becomes a starting point for automated research. Data can be extracted from it, reprocessed, and compared with data from hundreds of other articles – all without human intervention at every step.
This is especially important in the context of scientific reproducibility. Since the 2010s, the scientific community has been actively discussing the so-called “reproducibility crisis”: it turns out that a significant portion of published results cannot be reproduced in subsequent attempts. The reasons vary – statistical errors, inadequate descriptions of methods, unconscious researcher bias. An automated system that can reprocess original data using a standardized protocol and compare the result with the published one becomes a powerful tool for verifying scientific claims.
An Honest Conversation About Limitations
The system's authors do not hide its weaknesses, and this in itself is a sign of a mature approach to science.
First, the entire system depends on the quality of the base language model. GPT-4 is a powerful tool, but it is not a biologist. It can misinterpret specific terminology, confuse similar concepts, or miss important nuances in method descriptions.
Second, 80% accuracy in identifying datasets means that in 20% of cases, the system either fails to find the data or finds the wrong data. This is acceptable for large-scale automated reviews but requires caution when working with specific studies.
Third, the system is currently focused mainly on major public repositories – PRIDE, GEO, ProteomeXchange. It may simply fail to find data from lesser-known or more specialized repositories.
Fourth, the more abstract and complex the query, the harder it is for the agents to handle. “Find all proteins associated with liver fibrosis” is a fairly concrete task. “Identify the molecular mechanisms that explain why some patients are resistant to fibrosis” is a different level of complexity, requiring genuine biological reasoning, not just information extraction.
What's Next?
The authors see several key directions for the system's development.
The first is the ability to save successful analysis “workflows.” If an agent solves a task optimally once, why not record that workflow and share it with other researchers? It's like saving a step-by-step recipe instead of reinventing it from scratch every time.
The second is active human participation in the process. Not just “ask a question and get an answer,” but the ability to intervene at any step, correct the agent's decision, and add context it lacks. Science, after all, is a dialogue, not a machine's monologue.
The third is expansion to other data types. -Omics is just one, albeit vast, area of biomedicine. Clinical data, medical imaging, results from new compound screenings – all of these also deserve a similar approach.
The fourth is integration with biomedical knowledge graphs. Imagine an agent that not only reads articles but also understands that protein A interacts with protein B, which in turn regulates gene C. Such a “knowledge base of connections” would allow for building hypotheses that are not contained in any single article but arise precisely from the combination of multiple sources.
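The protein A, protein B, gene C example can be sketched as a tiny graph in which each fact comes from a different source and a search stitches them into a chain. This is a toy illustration of the idea, not a design taken from the authors. The entities and relation names are placeholders.

```python
from collections import deque

# Each entry: entity -> list of (relation, target) facts.
# In practice, each fact could come from a different article.
graph = {
    "protein A": [("interacts_with", "protein B")],
    "protein B": [("regulates", "gene C")],
}


def find_chain(graph: dict, start: str, goal: str):
    """Breadth-first search for a chain of facts linking start to goal.

    Returns a list of (source, relation, target) triples, or None if
    no chain exists.
    """
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, chain = queue.popleft()
        if node == goal:
            return chain
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, chain + [(node, relation, neighbor)]))
    return None


for source, relation, target in find_chain(graph, "protein A", "gene C"):
    print(source, relation, target)
```

Neither fact alone links protein A to gene C. The link appears only when the two are combined, which is exactly the kind of hypothesis no single article contains.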
Over billions of years of evolution, nature has learned to store information in DNA and pass it from generation to generation – a brilliant system of knowledge storage and reproduction in its own right. Science is structured similarly: each study records a new fragment of understanding that future generations of scientists must then read. But while nature needs millions of years for the “reading” and “using” of its records, we have a chance to do it much faster – if, of course, we learn how to build the right readers.
The work on -omics agents is one of the first convincing steps in this direction. Not the final answer, but a well-posed question. And in science, as we know, a well-posed question is already half the answer.