Inception Launches Mercury 2, the Fastest Reasoning LLM — 5x Faster Than Leading Speed-Optimized LLMs, with Dramatically Lower Inference Cost

Author: Business Wire

PALO ALTO, Calif.: Inception, the company behind the first commercial diffusion large language models (dLLMs), today announced the launch of Mercury 2, the fastest reasoning LLM and first reasoning dLLM. Mercury 2 delivers 5x faster performance while reducing the latency and cost barriers that have limited real‑world deployment of reasoning systems.

Mercury 2 models are available today via the Inception API.

Every major LLM in production today, including GPT, Claude, and Gemini, relies on the same core mechanism: autoregressive generation. They produce text sequentially. One. Token. At. A. Time. This approach imposes a hard ceiling: speed is ultimately bounded by the serial nature of generation, and the constraint worsens as reasoning depth increases, driving up serving costs and driving down responsiveness. Working within that ceiling, the industry has largely taken three paths to improve speed: specialized chips, optimized serving stacks, and model compression, the last of which trades capability for speed. Leading labs and infrastructure providers have poured billions into these efforts to squeeze performance gains out of the same token-by-token generation loop.

Inception took a fundamentally different path - one rooted in diffusion, the same technical approach behind modern image and video generation systems, now applied to language. Mercury 2 advances that diffusion foundation into production-grade reasoning and sets a new performance standard for speed-optimized LLMs, delivering cost-efficient reasoning at 1,000 tokens per second throughput with performance on par with Claude 4.5 Haiku and GPT 5.2 Mini. The result is throughput and responsiveness that come from the model itself, enabling fast, scalable inference.

How dLLMs work

Instead of predicting the next token in a sequence, Mercury 2 starts with a rough sketch of the full output and iteratively refines it through a process called denoising - across many tokens in parallel. Each pass through the model modifies and improves multiple tokens simultaneously, so a single neural network evaluation produces far more useful work per step. The speed advantage comes from the model itself, not from specialized hardware. And because the model refines iteratively rather than committing to each token permanently, it can correct errors mid-generation.
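A minimal sketch of the idea, with a hypothetical toy model standing in for the real network: generation starts from a fully masked draft of the whole output, and each pass proposes tokens for many positions at once, so the number of model calls can be far smaller than the number of tokens.

```python
# Toy sketch (hypothetical, not Mercury 2's actual algorithm): diffusion-style
# denoising refines a full draft in parallel rather than emitting one token
# per model call.
MASK = None
passes = 0

def toy_model(draft):
    # Stand-in "model": each pass confidently fills half of the remaining
    # masked positions, proposing token i at position i.
    global passes
    passes += 1
    masked = [i for i, t in enumerate(draft) if t is MASK]
    fill = set(masked[: max(1, len(masked) // 2)])
    return [i if (i in fill or t is not MASK) else MASK
            for i, t in enumerate(draft)]

def denoise_generate(model, length, max_steps):
    draft = [MASK] * length             # rough "sketch" of the full output
    for _ in range(max_steps):
        proposals = model(draft)        # one pass covers every position
        # Commit proposals for still-masked slots; a real dLLM can also
        # revise already-filled tokens, correcting errors mid-generation.
        draft = [p if t is MASK else t for t, p in zip(draft, proposals)]
        if MASK not in draft:
            break
    return draft

out = denoise_generate(toy_model, length=8, max_steps=8)
# Eight tokens emerged from four model passes, not eight.
```

Mercury 2's actual refinement schedule and error-correction mechanism are not public; the sketch only shows why parallel refinement changes the cost model, with useful work per pass scaling across many positions instead of one.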

“Reasoning models are only as useful as their ability to run in production,” said Stefano Ermon, CEO and co-founder of Inception. “For the past few years, we’ve seen incredible progress in model capability, but much less progress in making that capability usable in low-latency use cases. With Mercury 2, we’ve built a system where high-quality reasoning runs fast enough and efficiently enough for real-time applications. When you get speed, cost, and quality working together, you unlock entirely new possibilities - and that’s what excites us most.”

In standard benchmarks, consistent with Artificial Analysis’s methodology, Mercury 2 achieves approximately 1,000 tokens per second of output throughput, compared with Claude 4.5 Haiku Reasoning at approximately 89 tokens per second and GPT-5 Mini at approximately 71 tokens per second. On quality benchmarks, Mercury 2 scored 91.1 on AIME 2025, 73.6 on GPQA, 71.3 on IFBench, 67.3 on LiveCodeBench, 38.4 on SciCode, and 52.9 on Tau2.

These scores place Mercury 2 within competitive range of Claude 4.5 Haiku and GPT 5.2 Mini on quality, while delivering roughly 10x the throughput.

“Most teams treat inference as an optimization exercise around the autoregressive stack, but Inception started from a more fundamental place: diffusion for language,” said Tim Tully, partner at Menlo Ventures. “Mercury 2 shows what happens when that foundation is paired with a serious approach to reasoning and deployment, not just demos. We believe Inception’s diffusion-based roadmap has the potential to reset expectations for how fast and scalable reasoning models can be.”

Building on Inception’s diffusion-first foundation, Mercury 2 is already being deployed across a range of production workflows.

Across these production workflows, Mercury 2 has demonstrated three concrete advantages that matter in deployment: lower end-to-end latency, reduced inference cost at comparable quality, and improved output reliability through iterative refinement during generation. In practice, that means faster loops without compounding delays, fewer retries and fallbacks, and more predictable performance when workloads scale.

“As a people-first fund, we are proud to be the inception investor in Inception and thrilled by the progress this exceptional team has made. While the industry has spent billions optimizing around the same autoregressive architecture, Inception had the conviction to pursue a fundamentally different foundation - diffusion for language,” said Navin Chaddha, Managing Partner, Mayfield. “Mercury 2 proves that bet out, delivering production-grade reasoning at the speed and cost that real-world deployment actually demands.”

Mercury 2 also enables capabilities that are difficult to achieve with strictly sequential generation. Iterative refinement supports in-generation error correction and more controllable outputs, including structured responses for agent orchestration, code edits, and function calling, which helps teams maintain consistency and oversight as they move from prototypes to production.

Inception was founded by researchers from Stanford, UCLA, and Cornell who contributed to foundational work in diffusion models and other core AI techniques, including flash attention, decision transformers, and direct preference optimization. CEO Stefano Ermon is a co-inventor of the diffusion methods widely used in modern image and video generation systems.

Inception is hiring across research, engineering, and go-to-market roles. To learn more, visit the careers page.

About Inception

Inception develops diffusion-based large language models (dLLMs) designed for efficient, low-latency AI applications. While traditional autoregressive LLMs generate text sequentially, Inception’s diffusion-based models generate outputs in parallel, enabling faster inference and improved reliability for real-world use cases. Based in Palo Alto, California, Inception is backed by Menlo Ventures, Mayfield, Innovation Endeavors, M12 (Microsoft’s venture capital fund), Snowflake Ventures, Databricks Ventures, and individual backers including Andrew Ng and Andrej Karpathy. For more information, visit www.inceptionlabs.ai.
