Inception Launches Mercury 2, the Fastest Reasoning LLM — 5x Faster Than Leading Speed-Optimized LLMs, with Dramatically Lower Inference Cost

Author: Business Wire

PALO ALTO, Calif.: Inception, the company behind the first commercial diffusion large language models (dLLMs), today announced the launch of Mercury 2, the fastest reasoning LLM and first reasoning dLLM. Mercury 2 delivers 5x faster performance while reducing the latency and cost barriers that have limited real‑world deployment of reasoning systems.

Mercury 2 models are available today via the Inception API.

Every major LLM in production today, including GPT, Claude, and Gemini, relies on the same core mechanism: autoregressive generation. They produce text sequentially. One. Token. At. A. Time. This approach imposes a hard ceiling: speed is ultimately bounded by the serial nature of generation, and the constraint worsens as reasoning depth increases, driving up serving costs and driving down responsiveness. Working within that ceiling, the industry has largely taken three paths to improve speed: specialized chips, optimized serving stacks, and model compression, the last of which trades capability for speed. Leading labs and infrastructure providers have poured billions into these efforts to squeeze performance gains out of the same token-by-token generation loop.

Inception took a fundamentally different path - one rooted in diffusion, the same technical approach behind modern image and video generation systems, now applied to language. Mercury 2 advances that diffusion foundation into production-grade reasoning and sets a new performance standard for speed-optimized LLMs, delivering cost-efficient reasoning at 1,000 tokens per second throughput with performance on par with Claude 4.5 Haiku and GPT 5.2 Mini. The result is throughput and responsiveness that come from the model itself, enabling fast, scalable inference.

How dLLMs work

Instead of predicting the next token in a sequence, Mercury 2 starts with a rough sketch of the full output and iteratively refines it through a process called denoising - across many tokens in parallel. Each pass through the model modifies and improves multiple tokens simultaneously, so a single neural network evaluation produces far more useful work per step. The speed advantage comes from the model itself, not from specialized hardware. And because the model refines iteratively rather than committing to each token permanently, it can correct errors mid-generation.
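A minimal sketch of the idea, with a hypothetical toy model standing in for the real network: generation starts from a fully masked draft of the whole output, and each pass proposes tokens for many positions at once, so the number of model calls can be far smaller than the number of tokens.

```python
# Toy sketch (hypothetical, not Mercury 2's actual algorithm): diffusion-style
# denoising refines a full draft in parallel rather than emitting one token
# per model call.
MASK = None
passes = 0

def toy_model(draft):
    # Stand-in "model": each pass confidently fills half of the remaining
    # masked positions, proposing token i at position i.
    global passes
    passes += 1
    masked = [i for i, t in enumerate(draft) if t is MASK]
    fill = set(masked[: max(1, len(masked) // 2)])
    return [i if (i in fill or t is not MASK) else MASK
            for i, t in enumerate(draft)]

def denoise_generate(model, length, max_steps):
    draft = [MASK] * length             # rough "sketch" of the full output
    for _ in range(max_steps):
        proposals = model(draft)        # one pass covers every position
        # Commit proposals for still-masked slots; a real dLLM can also
        # revise already-filled tokens, correcting errors mid-generation.
        draft = [p if t is MASK else t for t, p in zip(draft, proposals)]
        if MASK not in draft:
            break
    return draft

out = denoise_generate(toy_model, length=8, max_steps=8)
# Eight tokens emerged from four model passes, not eight.
```

Mercury 2's actual refinement schedule and error-correction mechanism are not public; the sketch only shows why parallel refinement changes the cost model, with useful work per pass scaling across many positions instead of one.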

“Reasoning models are only as useful as their ability to run in production,” said Stefano Ermon, CEO and co-founder of Inception. “For the past few years, we’ve seen incredible progress in model capability, but much less progress in making that capability usable in low-latency use cases. With Mercury 2, we’ve built a system where high-quality reasoning runs fast enough and efficiently enough for real-time applications. When you get speed, cost, and quality working together, you unlock entirely new possibilities - and that’s what excites us most.”

In standard benchmarks, consistent with Artificial Analysis’s methodology, Mercury 2 achieves approximately 1,000 tokens per second of output throughput, compared with Claude 4.5 Haiku Reasoning at approximately 89 tokens per second and GPT-5 Mini at approximately 71 tokens per second. On quality benchmarks, Mercury 2 scored 91.1 on AIME 2025, 73.6 on GPQA, 71.3 on IFBench, 67.3 on LiveCodeBench, 38.4 on SciCode, and 52.9 on Tau2.

These scores place Mercury 2 within competitive range of Claude 4.5 Haiku and GPT 5.2 Mini on quality, while delivering roughly 10x the throughput.

“Most teams treat inference as an optimization exercise around the autoregressive stack, but Inception started from a more fundamental place: diffusion for language,” said Tim Tully, partner at Menlo Ventures. “Mercury 2 shows what happens when that foundation is paired with a serious approach to reasoning and deployment, not just demos. We believe Inception’s diffusion-based roadmap has the potential to reset expectations for how fast and scalable reasoning models can be.”

Building on Inception’s diffusion-first foundation, Mercury 2 is already being deployed across a range of production workflows.

Across these production workflows, Mercury 2 has demonstrated three concrete advantages that matter in deployment: lower end-to-end latency, reduced inference cost at comparable quality, and improved output reliability through iterative refinement during generation. In practice, that means faster loops without compounding delays, fewer retries and fallbacks, and more predictable performance when workloads scale.

“As a people-first fund, we are proud to be the inception investor in Inception and thrilled by the progress this exceptional team has made. While the industry has spent billions optimizing around the same autoregressive architecture, Inception had the conviction to pursue a fundamentally different foundation - diffusion for language,” said Navin Chaddha, Managing Partner, Mayfield. “Mercury 2 proves that bet out, delivering production-grade reasoning at the speed and cost that real-world deployment actually demands.”

Mercury 2 also enables capabilities that are difficult to achieve with strictly sequential generation. Iterative refinement supports in-generation error correction and more controllable outputs, including structured responses for agent orchestration, code edits, and function calling, which helps teams maintain consistency and oversight as they move from prototypes to production.

Inception was founded by researchers from Stanford, UCLA, and Cornell who contributed to foundational work in diffusion models and other core AI techniques, including flash attention, decision transformers, and direct preference optimization. CEO Stefano Ermon is a co-inventor of the diffusion methods widely used in modern image and video generation systems.

Inception is hiring across research, engineering, and go-to-market roles. To learn more, visit the careers page.

About Inception

Inception develops diffusion-based large language models (dLLMs) designed for efficient, low-latency AI applications. While traditional autoregressive LLMs generate text sequentially, Inception’s diffusion-based models generate outputs in parallel, enabling faster inference and improved reliability for real-world use cases. Based in Palo Alto, California, Inception is backed by Menlo Ventures, Mayfield, Innovation Endeavors, M12 (Microsoft’s venture capital fund), Snowflake Ventures, Databricks Ventures, and individual backers including Andrew Ng and Andrej Karpathy. For more information, visit www.inceptionlabs.ai.
