Most people familiar with image generation know about Stable Diffusion – a family of open-source models that turn text descriptions into pictures. One of the most actively developed forks in this family is Illustrious XL, and it has now received two significant updates at once: versions 3.0 and 3.5-vpred.
In short: the model can now work with significantly higher resolutions and understands human language much better.
Previously, most models based on Stable Diffusion XL were tailored for specific resolutions – typically around 1024×1024 pixels. Going beyond these limits was difficult: the model would either start to "blur" or produce artifacts.
Illustrious XL 3.0–3.5 is trained to work with resolutions ranging from 256 to 2048 pixels per side – without being strictly tied to a specific size. This means the model can generate both small sketches and detailed, high-quality images, behaving predictably in both cases. Such flexibility is not a given for architectures of this kind.
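To make the resolution range concrete, here is a minimal sketch of a helper that picks a width and height for a given aspect ratio and pixel budget. The 256 to 2048 limits come from the description above. The multiple-of-8 rounding is an assumption based on the 8x downsampling of the Stable Diffusion XL latent space, and the names `snap` and `pick_size` are purely illustrative, not part of any Illustrious XL tooling.

```python
import math

# Assumptions for illustration: the 256-2048 px range comes from the article;
# the multiple-of-8 rule reflects the 8x downsampling of the SDXL latent space.
MIN_SIDE = 256
MAX_SIDE = 2048
STEP = 8


def snap(value: float) -> int:
    """Round to the nearest multiple of STEP, then clamp into the supported range."""
    snapped = int(round(value / STEP)) * STEP
    return max(MIN_SIDE, min(MAX_SIDE, snapped))


def pick_size(aspect_ratio: float, target_pixels: float) -> tuple[int, int]:
    """Return (width, height) close to the requested aspect ratio and pixel count."""
    if aspect_ratio <= 0 or target_pixels <= 0:
        raise ValueError("aspect_ratio and target_pixels must be positive")
    width = math.sqrt(target_pixels * aspect_ratio)
    height = width / aspect_ratio
    return snap(width), snap(height)


if __name__ == "__main__":
    print(pick_size(1.0, 1024 * 1024))      # square image at the classic SDXL budget
    print(pick_size(16 / 9, 2048 * 1152))   # widescreen image near the upper limit
```

Requests outside the range are clamped rather than rejected, so a very small or very large pixel budget still yields a size the model is said to support.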
"Understanding" a Prompt Is Not the Same as Processing It
The second and, perhaps, more interesting part of the update concerns how the model perceives text.
In most image generation systems, a text prompt is processed by a special component – the text encoder. It "translates" words into numerical representations that then guide the drawing process. The problem is that this component has historically been quite limited: it struggles with long descriptions, doesn't quite grasp semantic nuances, and has difficulty maintaining relationships between multiple objects in a single prompt.
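One reason long descriptions are hard can be shown with a toy sketch. The CLIP-style encoders used in Stable Diffusion XL read a fixed window of 77 tokens per pass, two of which are start and end markers. Everything else below is a simplification: real encoders use byte-pair tokenization rather than whitespace splitting, and the function names are hypothetical.

```python
# Toy illustration only: real CLIP encoders use byte-pair tokenization and
# learned embeddings. The 77-token window matches CLIP; the rest is simplified.
CONTEXT_WINDOW = 77
RESERVED = 2  # start-of-text and end-of-text markers


def toy_tokenize(prompt: str) -> list[str]:
    """Split on whitespace and commas as a stand-in for a real subword tokenizer."""
    return prompt.lower().replace(",", " , ").split()


def fit_to_window(tokens: list[str]) -> tuple[list[str], list[str]]:
    """Return (kept, dropped) tokens for a single encoder pass."""
    limit = CONTEXT_WINDOW - RESERVED
    return tokens[:limit], tokens[limit:]


if __name__ == "__main__":
    long_prompt = ", ".join(f"detail number {i}" for i in range(40))
    kept, dropped = fit_to_window(toy_tokenize(long_prompt))
    print(f"kept {len(kept)} tokens, dropped {len(dropped)}")
```

Many user interfaces work around this window by splitting a prompt into chunks and encoding each one separately, which is part of why relationships between distant parts of a prompt tend to get lost.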
In version 3.5-vpred, the developers conducted extensive joint training of two model components at once – the text encoder and the main generation network. Simply put, they were trained together, not separately. According to the developers, the result is prompt comprehension comparable to what small language models demonstrate.
What does this mean in practice? The model handles prompts with many details, conditions, or relationships between objects better. For example, if you describe a scene with several characters interacting in a specific setting, the model is more likely to reproduce exactly what you intended, rather than something approximate.
Why Compare an Image Generator to a Language Model at All?
This is an important point that deserves a separate explanation.
Language models (like those used in chatbots) are designed to capture meaning, context, and dependencies between words on multiple levels. They "think" about text structurally. Image generators were traditionally not designed for this – their text component was more like a dictionary than a tool for comprehension.
When the creators of Illustrious XL say they have reached the level of "miniature language models" in terms of prompt comprehension, they are referring to this very gap. The model has come closer to truly reading the description rather than just matching words to images.
What This Means for Those Who Work with Generation
For artists and designers working with such tools, the update brings several practical implications.
- High resolution "out of the box" reduces the need for additional upscaling steps – the process of artificially enlarging an image after generation.
- Improved language understanding means fewer iterations: you don't need to "tweak" the prompt as meticulously to fit the model's limitations.
- Flexibility in resolution opens up possibilities for a wider range of tasks – from quick sketches to final visuals.
At the same time, it's important to understand that we are still talking about a model based on the Stable Diffusion XL architecture – that is, a system geared towards a specific style and set of tasks. It is not a universal tool, which means the results will depend on how well a specific task aligns with what the model was trained on.
Illustrious XL is being developed as an open-source model, which means it can be downloaded, modified, and integrated into one's own pipelines. Unlike closed commercial solutions, this allows for local operation without sending requests to third-party servers.
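As a rough sketch of what local operation can look like, the example below uses the Hugging Face `diffusers` library. Several things here are assumptions rather than facts from the release: the checkpoint filename is hypothetical, the step count and guidance scale are placeholder values, and the scheduler override assumes a v-prediction checkpoint needs `prediction_type="v_prediction"` set explicitly.

```python
from pathlib import Path

# Hypothetical local checkpoint path; replace with wherever the weights were saved.
CHECKPOINT = Path("models/illustrious-xl-v3.5-vpred.safetensors")


def build_request(prompt: str, width: int = 1536, height: int = 1536, steps: int = 28) -> dict:
    """Collect and validate generation arguments before loading any model."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    for side in (width, height):
        if side % 8 != 0 or not 256 <= side <= 2048:
            raise ValueError(f"side {side} must be a multiple of 8 within 256-2048")
    return {
        "prompt": prompt,
        "width": width,
        "height": height,
        "num_inference_steps": steps,  # placeholder value, tune to taste
        "guidance_scale": 5.0,         # placeholder value, tune to taste
    }


def generate(prompt: str, out_file: str = "out.png") -> None:
    """Run one generation locally; heavy imports are deferred until this is called."""
    import torch
    from diffusers import EulerDiscreteScheduler, StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_single_file(str(CHECKPOINT), torch_dtype=torch.float16)
    # Assumption: a v-prediction checkpoint needs the scheduler told so explicitly.
    pipe.scheduler = EulerDiscreteScheduler.from_config(
        pipe.scheduler.config, prediction_type="v_prediction"
    )
    pipe.to("cuda")
    image = pipe(**build_request(prompt)).images[0]
    image.save(out_file)
```

Calling `generate("two knights playing chess in a rainy courtyard")` would then run entirely on the local machine, provided the weights are present and a CUDA-capable GPU is available.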
The combination of open weights, high-resolution support, and improved language understanding makes the 3.5-vpred version one of the more technically advanced options in the open-source generative model ecosystem today.
The question that remains open is how well the improved language understanding will perform on a wide variety of real-world prompts, and not just on the scenarios the creators tested during development. As always, only time will tell.