Published February 15, 2026

Tencent Releases the Most Compact Language Model: 0.3 Billion Parameters in 600 MB

The Chinese company has open-sourced the HY-1.8B-2Bit model with 2-bit quantization – it weighs less than many mobile apps.

Category: Development · Event Source: Tencent · Reading Time: 3–4 minutes

Tencent has open-sourced a new language model, HY-1.8B-2Bit, built around radical compression: the 1.8 billion parameter model is packed so tightly that its file size matches that of an uncompressed model with only 0.3 billion parameters. It weighs about 600 MB – less than some mobile games or messaging apps.
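The quoted figures are easy to sanity-check. Below is a minimal sketch of the arithmetic (illustrative only: a real checkpoint also stores quantization scales and typically keeps some tensors at higher precision, which would plausibly account for the gap between the raw 2-bit weight size and the quoted 600 MB):

```python
# Back-of-envelope size arithmetic for the figures quoted above.

def model_size_mb(n_params: float, bits_per_param: int) -> float:
    """Raw weight storage in megabytes (no metadata overhead)."""
    return n_params * bits_per_param / 8 / 1e6

# 1.8B parameters stored at 2 bits each:
print(model_size_mb(1.8e9, 2))    # 450.0 -> raw 2-bit weights

# A 0.3B-parameter model stored at standard 16-bit precision:
print(model_size_mb(0.3e9, 16))   # 600.0 -> matches the quoted 600 MB
```

In other words, "equivalent to a 0.3B model" refers to a 0.3B model stored at ordinary 16-bit precision, which occupies exactly 600 MB.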

This is the first instance of industrial-scale use of 2-bit quantization for models designed to run on end-user devices – smartphones, tablets, and laptops. Typically, less aggressive compression is used for such tasks, but here the developers went a step further.

What Is Quantization and Why Is It Needed?

Language models are, in essence, huge tables of numbers that determine how the model processes text. The larger the model, the more memory it requires and the slower it runs on standard devices.

Quantization is a method of compressing a model. Instead of storing each number with high precision (e.g., 16 or 32 bits), it can be represented with lower precision – say, in 8, 4, or even 2 bits. It's like reducing the color depth in an image: the picture remains recognizable but takes up less space.
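The precision tradeoff can be shown with a toy example. The sketch below uses simple uniform (min–max) quantization with a single scale per tensor; production schemes, presumably including Tencent's, are more sophisticated (per-group scales, calibration), but the effect of lowering the bit width is the same:

```python
import numpy as np

def quantize(w: np.ndarray, bits: int):
    """Map floats onto 2**bits evenly spaced levels (one scale per tensor)."""
    levels = 2 ** bits - 1
    scale = (w.max() - w.min()) / levels
    codes = np.round((w - w.min()) / scale).astype(np.uint16)
    return codes, scale, w.min()

def dequantize(codes, scale, zero_point):
    """Recover approximate floats from the stored integer codes."""
    return codes * scale + zero_point

w = np.array([-0.8, -0.1, 0.05, 0.9])
for bits in (8, 4, 2):
    codes, scale, zero = quantize(w, bits)
    err = np.abs(dequantize(codes, scale, zero) - w).max()
    print(f"{bits}-bit: max reconstruction error {err:.4f}")
```

Fewer bits means coarser levels: the 8-bit version reproduces the weights almost exactly, while the 2-bit version forces every weight onto one of just four values.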

Two-bit quantization is a very aggressive form of compression. Each number from the model is encoded with just two bits, allowing for four possible values. This severely limits precision, but with the right approach, the model can maintain its functionality.
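Tencent has not published its exact storage format, but the space savings of 2-bit codes come from standard bit packing: four 2-bit values fit into a single byte. A minimal sketch:

```python
import numpy as np

def pack_2bit(codes: np.ndarray) -> np.ndarray:
    """Pack 2-bit codes (values 0..3) four per byte; length must be a multiple of 4."""
    c = codes.reshape(-1, 4).astype(np.uint8)
    return c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)

def unpack_2bit(packed: np.ndarray) -> np.ndarray:
    """Recover the original 2-bit codes from the packed bytes."""
    shifts = np.array([0, 2, 4, 6])
    return ((packed[:, None] >> shifts) & 0b11).reshape(-1)

codes = np.array([3, 0, 1, 2, 2, 2, 0, 3], dtype=np.uint8)
packed = pack_2bit(codes)
assert packed.nbytes == codes.size // 4        # eight codes -> two bytes
assert (unpack_2bit(packed) == codes).all()    # round trip is lossless
```

On top of the packed codes, a real format also stores the per-group scales needed to map each 2-bit code back to a floating-point weight.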

What Can HY-1.8B-2Bit Do?

The model is designed to run on consumer devices without an internet connection or access to cloud servers. This means it can process text locally – useful for applications where privacy is important or a stable connection isn't available.

According to the developers, HY-1.8B-2Bit maintains sufficient performance for basic language tasks: answering questions, generating text, and understanding context. Moreover, the model runs quickly even on devices with modest specifications.

Of course, such compactness comes at a cost. Two-bit quantization inevitably reduces the model's quality compared to the full-sized version. But for tasks where speed and small size are more important than absolute accuracy, this tradeoff is justified.

Why Tencent's Compact Model Matters for AI

Until now, most language models running on user devices used 4- or 8-bit quantization. Two-bit versions existed in research papers but had not been applied in real-world products.

Tencent has presented the first industrial-scale example of this approach. It opens up possibilities for wider AI adoption on resource-constrained devices – inexpensive smartphones, smart speakers, and wearable electronics.

Open-sourcing the model means that other developers can use it in their projects, adapt it for specific tasks, or study the quantization methods applied by the Tencent team.

Future of 2-Bit Quantization in Language Models

The release of HY-1.8B-2Bit is more of a technological demonstration than a finished product for the mass user. The model shows that 2-bit quantization works not only in theory but also in practice.

The question remains how widely this approach will be adopted. For many tasks, 4-bit quantization offers a better balance between size and quality. But where every megabyte counts, 2-bit models could be in high demand.

In any case, this is another step towards making language models more accessible and enabling them to run locally, without reliance on cloud services.

Original Title: 首个产业级2Bit量化新突破,腾讯混元开源0.3B端侧模型 (First industrial-grade 2-bit quantization breakthrough: Tencent Hunyuan open-sources a 0.3B on-device model)
Publication Date: Feb 9, 2026
Tencent (hunyuan.tencent.com) – a Chinese technology conglomerate developing AI for social platforms, gaming, cloud, and digital services.

From Source to Analysis

This material is not a direct retelling of the original publication. First, the news item itself was selected as an event important for understanding AI development. Then a processing framework was set: what needs clarification, what context to add, and where to place emphasis. This allowed us to turn a single announcement or update into a coherent and meaningful analysis.

Neural Networks Involved in the Process

We openly show which models were used at different stages of processing. Each performed its own role — analyzing the source, rewriting, fact-checking, and visual interpretation. This approach maintains transparency and clearly demonstrates how technologies participated in creating the material.

1. Claude Sonnet 4.5 (Anthropic) – Analyzing the Original Publication and Writing the Text: the neural network studies the original material and generates a coherent text.

2. Gemini 2.5 Pro (Google DeepMind) – Translation into English.

3. Gemini 2.5 Flash (Google DeepMind) – Text Review and Editing: correction of errors, inaccuracies, and ambiguous phrasing.

4. DeepSeek-V3.2 (DeepSeek) – Preparing the Illustration Description: generating a textual prompt for the visual model.

5. FLUX.2 Pro (Black Forest Labs) – Creating the Illustration: generating an image based on the prepared prompt.

