Published on April 3, 2026

Gemini API: Google Introduces Flex and Priority Modes for Cost and Speed

Cheap and Fast at the Same Time: Google Changes Its Approach to the Gemini API

Google has added two new request processing modes to the Gemini API – Flex and Priority – allowing developers to choose between speed and cost.

Event Source: Google

When developing with language models, sooner or later, you run into the same question: how can you avoid paying for speed where it's not needed, while also not making users wait several seconds for a response they expect immediately? This isn't a theoretical dilemma; it's a very practical problem faced by almost everyone creating products based on large language models.

Google has decided to give developers the power to choose for themselves. The Gemini API now features two new operating modes: Flex and Priority. Simply put, you can now explicitly specify what's more important in a given case: saving money or getting a response as quickly as possible.

What Is Flex and What Is It For?

Flex is a lower-cost mode but without strict guarantees on response time. Requests in this mode are processed when Google's infrastructure has spare capacity. In short: you get in line, but you pay less.

This doesn't mean you'll have to wait for hours for a response. Rather, the system doesn't immediately reserve resources for your request; it processes it when convenient based on the overall load. For most background tasks, this is perfectly acceptable.

What tasks are a good fit for Flex? Anything that doesn't require an immediate response to the user:

  • Batch processing of documents;
  • Automatic generation of drafts, reports, or summaries;
  • Data analysis during off-hours;
  • Tasks that run on a schedule, not on a click.

If a developer is building, for example, a system that processes thousands of texts overnight to prepare a summary by morning, Flex is designed for exactly that purpose.
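Google has not yet published the exact request syntax for selecting a mode, so the sketch below is only illustrative: the `service_tier` field and its `"flex"` value are hypothetical placeholders for whatever parameter the API ultimately exposes. The shape of the overnight batch job, however, follows the use case described above.

```python
# Sketch of an overnight batch job that tags every request as low-priority.
# NOTE: "service_tier" is a hypothetical parameter name -- Google has not
# published the actual field for choosing Flex vs. Priority.

def build_flex_request(document: str, model: str = "gemini-2.5-flash") -> dict:
    """Build one request payload for background summarization under Flex."""
    return {
        "model": model,
        "contents": [
            {"role": "user", "parts": [{"text": f"Summarize:\n{document}"}]}
        ],
        # Hypothetical: accept queuing in exchange for a lower price.
        "config": {"service_tier": "flex"},
    }

def build_batch(documents: list[str]) -> list[dict]:
    """Queue a pile of documents for off-peak processing."""
    return [build_flex_request(doc) for doc in documents]

if __name__ == "__main__":
    batch = build_batch(["Customer review #1", "Customer review #2"])
    print(len(batch))
```

Since nothing here is latency-sensitive, the batch can simply be submitted in the evening and collected in the morning; no per-request resource reservation is needed.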

Priority: When Speed Is Key

Priority is the opposite. Here, requests get priority access to computing resources, which means more predictable and lower response times. As a result, it costs more.

Priority mode is ideal for situations where a user expects a real-time response: chatbots, live assistants, interactive interfaces, and tools integrated into a workflow that need to react without noticeable delays.

Essentially, it's the same Gemini, but with a guarantee that your request won't end up at the back of the queue during moments of peak load.

Why This Is Important Now

The reality is that using powerful language models in real-world products is not free. And as models become more complex (Google's Gemini 1.5 Pro, for example, brought significantly improved reasoning abilities), the cost of computation grows with them.

For small teams and startups, this becomes a real barrier: you want to build complex systems, but the budget isn't unlimited. Flex lowers this barrier – at least for those tasks where latency is not critical.

In parallel, Google is continuing to improve the efficiency of its systems. For example, the new TurboQuant algorithm significantly reduces memory consumption for AI models. The introduction of Flex and Priority is another step in the same direction: making AI more accessible without sacrificing quality where quality is critical.

How It Works in Practice: A Simple Example

Imagine a team building a service to automatically process customer feedback. Some of this feedback needs to be handled immediately – for instance, if a user submits a complaint and is waiting for a reply in a chat. This is a job for Priority. Other tasks – like gathering weekly statistics, generating an analytical summary, or preparing a digest – can wait until night. This is where Flex comes in.

Previously, you would have had to either pay the maximum rate for everything or build your own load management logic. Now, this can be done directly at the API request level simply by choosing the appropriate mode.
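The feedback-service scenario above boils down to a small routing decision: user-facing tasks go to Priority, background tasks go to Flex. A minimal sketch of that logic follows; the task names and the `"priority"`/`"flex"` tier identifiers are illustrative assumptions, since Google has not published the exact mode values.

```python
# Minimal routing sketch: pick a processing tier per task type.
# Task names and tier strings are illustrative, not official API values.

INTERACTIVE_TASKS = {"chat_reply", "live_complaint", "assistant_turn"}
BACKGROUND_TASKS = {"weekly_stats", "digest", "summary_report"}

def choose_tier(task: str) -> str:
    """Return the processing tier for a given task type."""
    if task in INTERACTIVE_TASKS:
        return "priority"  # a user is waiting: pay for predictable latency
    if task in BACKGROUND_TASKS:
        return "flex"      # can wait for spare capacity: pay less
    raise ValueError(f"unknown task type: {task}")

if __name__ == "__main__":
    print(choose_tier("live_complaint"))  # priority
    print(choose_tier("digest"))          # flex
```

The point of the new modes is that this decision now travels with each API request instead of requiring a separate, self-built load-management layer.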

What Remains Unknown

Google has not yet revealed the exact numbers: how much cheaper Flex is than Priority, what the guaranteed response times are for each mode under various loads, and whether there are limits on the volume of Flex requests during peak hours. These are crucial details for anyone planning to build products with specific SLAs.

It's also unclear how exactly Flex will perform during periods of high overall load on Google's infrastructure – specifically, whether the delays will be predictable.

Overall, the introduction of two distinct modes is a step towards greater transparency and flexibility. Developers can now decide what to pay for, instead of receiving a one-size-fits-all service at a one-size-fits-all price. For an industry where inference cost is a key factor when choosing a platform, this is a tangible change.

Original Title: New ways to balance cost and reliability in the Gemini API
Publication Date: Apr 2, 2026
Source: Google (blog.google), an international technology company developing digital services, cloud platforms, and AI technologies for search, advertising, productivity, and consumer products.


From Source to Analysis

How This Text Was Created

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Analyzing the Original Publication and Writing the Text — Claude Sonnet 4.6 (Anthropic): studies the original material and generates a coherent text.

2. Translation into English — Gemini 2.5 Pro (Google DeepMind).

3. Text Review and Editing — Gemini 2.5 Flash (Google DeepMind): correction of errors, inaccuracies, and ambiguous phrasing.

4. Preparing the Illustration Description — DeepSeek-V3.2 (DeepSeek): generating a textual prompt for the visual model.

5. Creating the Illustration — FLUX.2 Pro (Black Forest Labs): generating an image based on the prepared prompt.
