When you develop with language models, sooner or later you run into the same question: how do you avoid paying for speed where it isn't needed, without making users wait several seconds for a response they expect immediately? This isn't a theoretical dilemma; it's a very practical problem faced by almost everyone building products on top of large language models.
Google has decided to give developers the power to choose for themselves. The Gemini API now features two new operating modes: Flex and Priority. Simply put, you can now explicitly specify what's more important in a given case: saving money or getting a response as quickly as possible.
What Is Flex and What Is It For?
Flex is a lower-cost mode but without strict guarantees on response time. Requests in this mode are processed when Google's infrastructure has spare capacity. In short: you get in line, but you pay less.
This doesn't mean you'll have to wait for hours for a response. Rather, the system doesn't immediately reserve resources for your request; it processes it when convenient based on the overall load. For most background tasks, this is perfectly acceptable.
What tasks are a good fit for Flex? Anything that doesn't require an immediate response to the user:
- Batch processing of documents;
- Automatic generation of drafts, reports, or summaries;
- Data analysis during off-hours;
- Tasks that run on a schedule, not on a click.
If a developer is building, for example, a system that processes thousands of texts overnight to prepare a summary by morning, Flex is designed for exactly that purpose.
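That overnight scenario can be sketched in a few lines. Note that Google has not published the actual request field for selecting a mode, so the `tier` label below is purely illustrative, and the batching logic is a minimal sketch rather than a real Gemini API call:

```python
from dataclasses import dataclass

@dataclass
class BatchJob:
    texts: list[str]
    tier: str  # hypothetical tier label ("flex" or "priority"); the real API field is not yet documented

def make_overnight_batches(texts: list[str], batch_size: int = 100) -> list[BatchJob]:
    """Split a large corpus into batches tagged for low-cost Flex processing.

    Each batch is queued for off-peak hours rather than sent immediately,
    which is exactly the trade-off Flex is meant for.
    """
    return [
        BatchJob(texts=texts[i:i + batch_size], tier="flex")
        for i in range(0, len(texts), batch_size)
    ]

# 2,500 customer reviews become 25 Flex-tagged batches for the night run.
jobs = make_overnight_batches([f"review {n}" for n in range(2500)], batch_size=100)
print(len(jobs), jobs[0].tier)  # → 25 flex
```

A scheduler (cron, Cloud Scheduler, or similar) would then submit these batches during off-hours and collect the results by morning.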
Priority: When Speed Is Key
Priority is the opposite: requests receive preferential access to computing resources, which means more predictable, lower response times. As a result, it costs more.
Priority mode is ideal for situations where a user expects a real-time response: chatbots, live assistants, interactive interfaces, and tools integrated into a workflow that need to react without noticeable delays.
Essentially, it's the same Gemini, but with a guarantee that your request won't end up at the back of the queue during moments of peak load.
Why This Is Important Now
The reality is that using powerful language models in real-world products is not free. And as models become more complex (Google just recently released Gemini 1.5 Pro with significantly improved reasoning abilities), the cost of computation grows with them.
For small teams and startups, this becomes a real barrier: you want to build complex systems, but the budget isn't unlimited. Flex lowers this barrier – at least for those tasks where latency is not critical.
In parallel, Google is continuing to improve the efficiency of its systems. For example, the new TurboQuant algorithm significantly reduces memory consumption for AI models. The introduction of Flex and Priority is another step in the same direction: making AI more accessible without sacrificing quality where quality is critical.
How It Works in Practice: A Simple Example
Imagine a team building a service to automatically process customer feedback. Some of this feedback needs to be handled immediately – for instance, if a user submits a complaint and is waiting for a reply in a chat. This is a job for Priority. Other tasks – like gathering weekly statistics, generating an analytical summary, or preparing a digest – can wait until night. This is where Flex comes in.
Previously, you would have had to either pay the maximum rate for everything or build your own load management logic. Now, this can be done directly at the API request level simply by choosing the appropriate mode.
What Remains Unknown
Google has not yet revealed the exact numbers: how much cheaper Flex is than Priority, what the guaranteed response times are for each mode under various loads, and whether there are limits on the volume of Flex requests during peak hours. These are crucial details for anyone planning to build products with specific SLAs.
It's also unclear how exactly Flex will perform during periods of high overall load on Google's infrastructure – specifically, whether the delays will be predictable.
Overall, the introduction of two distinct modes is a step towards greater transparency and flexibility. Developers can now decide what to pay for, instead of receiving a one-size-fits-all service at a one-size-fits-all price. For an industry where inference cost is a key factor when choosing a platform, this is a tangible change.