When it comes to training language models, data quality often proves more important than quantity. But where can you find representative examples for niche tasks – for instance, for analyzing medical records or technical documentation? You could label them manually, but that is time-consuming and expensive. You could ask GPT-4 to generate them, but the result would be unpredictable. Or, you could structure the data so that it is correct by design.
That is exactly what SyGra Studio – a platform from the ServiceNow AI research group – is designed for. It lets you create synthetic training data using knowledge graphs as a foundation. In a nutshell: you describe the structure of the data you want, and the system generates text examples that follow it.
Why Use Graphs for Text Generation
Usually, synthetic data is created like this: take a large model, give it a prompt like "generate 1000 examples of medical questions", and hope for the best. The problem is that the model can repeat itself, drift off-topic, or simply hallucinate facts.
SyGra Studio offers a different path. Instead of relying on the neural network's creativity, you first create a knowledge graph – a formal structure where entities (for example, "patient", "diagnosis", "medication") and the relationships between them ("prescribed", "contraindicated") are fixed. It is similar to a database schema, but for semantic relationships.
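To make the "database schema for semantic relationships" idea concrete, here is a minimal sketch in plain Python. It is not SyGra Studio's own format or API; the `Relation` class, the relation names, and the helper functions are all hypothetical and only illustrate how a fixed set of entity types and relations lets you reject combinations the schema does not allow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One allowed edge in the schema: subject type -> predicate -> object type."""
    subject: str
    predicate: str
    obj: str


# Hypothetical schema for the medical example; names are illustrative only.
ENTITY_TYPES = {"patient", "diagnosis", "medication"}
RELATIONS = {
    Relation("patient", "has", "diagnosis"),
    Relation("medication", "prescribed_for", "diagnosis"),
    Relation("medication", "contraindicated_with", "medication"),
}


def is_valid(subject: str, predicate: str, obj: str) -> bool:
    """Return True only if the schema declares this exact relation."""
    return Relation(subject, predicate, obj) in RELATIONS


def undeclared_types() -> set[str]:
    """Entity types used in relations but missing from ENTITY_TYPES."""
    used = {r.subject for r in RELATIONS} | {r.obj for r in RELATIONS}
    return used - ENTITY_TYPES
```

With this schema, `is_valid("medication", "prescribed_for", "diagnosis")` holds, while a combination such as `("patient", "prescribed_for", "medication")` is rejected because nothing in the schema declares it.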
Then the platform uses this graph as a skeleton: it "understands" which combinations are valid and which are not, and generates examples that fit within the specified logic. The result is a kind of controlled randomness: diversity is preserved, while errors that come from invalid combinations of entities are ruled out by construction.
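The "controlled randomness" described above can be sketched in a few lines. This is an illustration under stated assumptions, not the platform's actual generator: the facts, the relation names, the templates, and `build_prompts` are hypothetical, and the entity names are deliberate placeholders rather than real medical data. The point is that randomness only chooses among facts that already exist in the graph, so a prompt can never mention a combination the graph does not contain.

```python
import random

# Hypothetical instance-level facts taken from a graph: (subject, relation, object).
# Names are placeholders, not real medical knowledge.
FACTS = [
    ("drug_a", "treats", "condition_x"),
    ("drug_a", "contraindicated_with", "drug_b"),
    ("drug_c", "treats", "condition_y"),
]

# One prompt template per relation type (also an assumption for this sketch).
TEMPLATES = {
    "treats": "Write a patient question about using {s} for {o}.",
    "contraindicated_with": "Write a patient question about taking {s} together with {o}.",
}


def build_prompts(n: int, seed: int = 0) -> list[str]:
    """Sample n facts from the graph and turn each into a generation prompt."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    prompts = []
    for _ in range(n):
        subject, relation, obj = rng.choice(FACTS)
        prompts.append(TEMPLATES[relation].format(s=subject, o=obj))
    return prompts
```

Each prompt would then be passed to a language model, which supplies the wording while the graph supplies the content.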
How It Works in Practice
SyGra Studio consists of several components. The first is a graph editor, where you can visually build a data structure or upload an existing one. The second is a generator that turns the graph into text examples using a language model. The third is a set of tools for validation and filtering: they allow you to assess the diversity of the generated data and ensure the absence of repetitions or logical inconsistencies.
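As a rough idea of what the validation and filtering step involves, the sketch below removes repeated examples and computes a simple diversity score. It assumes nothing about SyGra Studio's own tooling: `normalize`, `deduplicate`, and `distinct_n` are hypothetical helpers, and the distinct-n ratio (unique n-grams divided by total n-grams) is just one common way to measure diversity.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def deduplicate(examples: list[str]) -> list[str]:
    """Keep the first occurrence of each example, comparing normalized text."""
    seen = set()
    unique = []
    for example in examples:
        key = normalize(example)
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


def distinct_n(examples: list[str], n: int = 2) -> float:
    """Share of unique word n-grams among all n-grams; higher means more varied."""
    total = 0
    unique = set()
    for example in examples:
        tokens = normalize(example).split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0
```

A score close to 1.0 means almost every n-gram is new, while a low score suggests the model keeps reusing the same phrasing.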
The platform supports various task formats. You can generate question-answer pairs for fine-tuning, examples for classification, or data for entity extraction from text. All of this is configured through the interface – writing code is not required, though experienced users can plug in their own scripts.
An important nuance: SyGra Studio is not tied to a specific model. You can use different LLMs for generation – from open-source to proprietary. The graph sets the structure, and the model handles the linguistic phrasing.
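One way to picture this separation is a small interface that any model backend can satisfy. The sketch below is an assumption made for illustration, not SyGra Studio's real plugin mechanism: `TextModel`, `EchoModel`, and `phrase_fact` are hypothetical names, and `EchoModel` is a stand-in that lets the code run without calling any LLM.

```python
from typing import Protocol


class TextModel(Protocol):
    """Anything with a complete() method can act as the phrasing backend."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Offline stand-in; a real backend would call an open or proprietary LLM."""

    def complete(self, prompt: str) -> str:
        return f"[draft] {prompt}"


def phrase_fact(model: TextModel, subject: str, relation: str, obj: str) -> str:
    """The graph supplies the fact; the model only decides how to word it."""
    prompt = f"Rephrase as a natural question: {subject} {relation} {obj}"
    return model.complete(prompt)
```

Swapping the model then means passing a different object to `phrase_fact`; the graph side of the pipeline stays untouched.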
Who Can Benefit
The first obvious audience is developers training models for highly specialized tasks. Suppose you are building a tech support chatbot. You have a knowledge base of products, but you do not have thousands of examples of exactly how people phrase their questions. You can build a "product → feature → problem → solution" graph and generate training dialogues based on it.
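The "product → feature → problem → solution" chain can be sketched as a small graph that is walked from the product down to each solution. Everything here is invented for illustration: the product and problem names, the `EDGES` layout, and the `paths_from` and `to_dialogue_seed` helpers are hypothetical and say nothing about how SyGra Studio stores its graphs.

```python
# Hypothetical support graph: each key maps to its children on the next level.
EDGES = {
    "Router X": ["Wi-Fi", "Firmware update"],            # product -> features
    "Wi-Fi": ["Connection drops"],                       # feature -> problems
    "Firmware update": ["Update stuck at 50%"],
    "Connection drops": ["Change the channel"],          # problem -> solutions
    "Update stuck at 50%": ["Retry over a wired link"],
}


def paths_from(node: str) -> list[list[str]]:
    """Return every path from node down to a leaf (a solution)."""
    children = EDGES.get(node, [])
    if not children:
        return [[node]]
    return [[node] + rest for child in children for rest in paths_from(child)]


def to_dialogue_seed(path: list[str]) -> dict[str, str]:
    """Turn one product/feature/problem/solution path into a dialogue outline."""
    product, feature, problem, solution = path
    return {
        "user_intent": f"{problem} with {feature} on {product}",
        "expected_resolution": solution,
    }


seeds = [to_dialogue_seed(path) for path in paths_from("Router X")]
```

Each seed fixes what the dialogue is about and how it should end; a language model would then write the actual user and agent turns.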
The second scenario is research: you need to test a hypothesis about a model's behavior on a specific type of data, but real examples are scarce or hard to collect. The graph allows you to control exactly which patterns enter the dataset and analyze the model's reaction to them.
Third is data augmentation. If you already have a labeled dataset but its volume is insufficient, SyGra Studio can help expand it while preserving the original relationship structure.
Limitations and Considerations
As with any tool, there are limitations here. First is the construction of the graph itself. If you work in a field where connections between concepts are non-obvious or controversial, creating a correct structure can be difficult. A graph is a simplification of reality, and it is important to be aware of what exactly you are simplifying.
Second, generation quality still depends on the language model. The graph constrains which facts and combinations can appear, but it does not guarantee stylistic diversity or natural wording. If the model is prone to clichéd phrases, this will be reflected in the result.
Third is scalability. For small-scale tasks, the platform works well, but if millions of examples with high variability are required, the process can become resource-intensive – both in terms of generation time and API call costs.
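A quick back-of-the-envelope estimate shows how these costs grow with dataset size. The function below is a simplification with hypothetical parameters: it assumes one API call per example, a flat price per 1,000 tokens, and a fixed latency per call, none of which come from SyGra Studio or any particular provider.

```python
def estimate_run(
    num_examples: int,
    tokens_per_example: int,
    price_per_1k_tokens: float,
    seconds_per_call: float,
    parallel_calls: int = 1,
) -> tuple[float, float]:
    """Return (total cost, wall-clock hours) for one generation run.

    Assumes one call per example and perfectly even parallelism.
    """
    total_tokens = num_examples * tokens_per_example
    cost = total_tokens / 1000 * price_per_1k_tokens
    hours = num_examples * seconds_per_call / parallel_calls / 3600
    return cost, hours


# Illustrative inputs only; substitute your own provider's numbers.
cost, hours = estimate_run(
    num_examples=1_000_000,
    tokens_per_example=500,
    price_per_1k_tokens=0.002,
    seconds_per_call=2.0,
    parallel_calls=10,
)
```

With these made-up inputs the run comes to roughly 1,000 units of cost and about 56 hours of wall-clock time, which is why filtering and retries at this scale need to be budgeted up front.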
Availability and Usage 🔧
SyGra Studio has been released to the public. It can be tested via a web interface on Hugging Face Spaces or deployed locally – the code is published on GitHub. The documentation includes examples for various domains: from medicine to finance.
The platform is under active development, so the interface and functionality may change. However, the core idea – using structure to control generation – is already viable and open for experimentation.
If you need synthetic data with predictable logic, this is a practical way to get it. The tool is not universal, but it is a good fit for tasks where the relationships between concepts can be written down explicitly.