PALO ALTO, Calif.: Inception, the company behind the first commercial diffusion large language models (dLLMs), today announced the launch of Mercury 2, the fastest reasoning LLM and first reasoning dLLM. Mercury 2 delivers 5x faster performance while reducing the latency and cost barriers that have limited real‑world deployment of reasoning systems.
Mercury 2 models are available today via the Inception API.
Every major LLM in production today, including GPT, Claude, and Gemini, relies on the same core mechanism: autoregressive generation. They produce text sequentially, one token at a time. This approach has a low ceiling because speed is ultimately bounded by the serial nature of generation, and the constraint worsens as reasoning depth increases, driving up serving costs and driving down responsiveness. Constrained by this ceiling, the industry has largely taken three paths to improve speed: specialized chips, optimized serving stacks, and model compression, which trades capability for speed. Leading labs and infrastructure providers have poured billions into these efforts to squeeze performance gains out of the same token-by-token generation loop.
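To make the bottleneck concrete, the serial decode loop can be sketched in a few lines. This is an illustrative toy, not any vendor's implementation; toy_next_token is a stand-in for a full forward pass of an autoregressive model.

```python
# Toy sketch of the autoregressive loop described above: each new token
# requires a full model evaluation that depends on all previous tokens,
# so the work is inherently serial. All names here are illustrative.
def toy_next_token(context):
    """Stand-in for one full forward pass of an autoregressive LLM."""
    return f"tok{len(context)}"

context, max_new_tokens = ["<prompt>"], 5
for _ in range(max_new_tokens):
    # One model call yields exactly one token; step N cannot begin
    # until step N-1 has finished, which caps throughput.
    context.append(toy_next_token(context))
print(" ".join(context[1:]))
```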
Inception took a fundamentally different path - one rooted in diffusion, the same technical approach behind modern image and video generation systems, now applied to language. Mercury 2 advances that diffusion foundation into production-grade reasoning and sets a new performance standard for speed-optimized LLMs, delivering cost-efficient reasoning at 1,000 tokens per second throughput with performance on par with Claude 4.5 Haiku and GPT-5.2 Mini. The result is throughput and responsiveness that come from the model itself, enabling fast, scalable inference.
How dLLMs work
Instead of predicting the next token in a sequence, Mercury 2 starts with a rough sketch of the full output and iteratively refines it through a process called denoising. Each pass through the model updates many tokens in parallel, so a single neural network evaluation produces far more useful work per step. The speed advantage comes from the model itself, not from specialized hardware. And because the model refines iteratively rather than committing to each token permanently, it can correct errors mid-generation.
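The contrast with the serial loop above can also be sketched in a few lines. This is a toy illustration of confidence-based parallel denoising under assumed mechanics; Mercury 2's actual architecture and refinement schedule are not described in the announcement.

```python
# Toy sketch of diffusion-style text generation: the whole sequence is
# drafted at once and refined over a few denoising passes. Illustrative
# only; the vocabulary, schedule, and denoiser are placeholders.
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "quickly", "red"]
MASK = "<mask>"
SEQ_LEN, NUM_STEPS = 8, 4

def toy_denoiser(tokens):
    """Stand-in for one neural network evaluation: proposes a token and
    a confidence score for every position in parallel."""
    return [(random.choice(VOCAB), random.random()) for _ in tokens]

# Start from a fully masked "rough sketch" of the output.
seq = [MASK] * SEQ_LEN
for step in range(NUM_STEPS):
    proposals = toy_denoiser(seq)  # one pass updates many positions at once
    # Write in the most confident fraction this step; later passes can
    # still revise earlier choices, enabling mid-generation error fixes.
    ranked = sorted(range(SEQ_LEN), key=lambda i: -proposals[i][1])
    for i in ranked[: SEQ_LEN * (step + 1) // NUM_STEPS]:
        seq[i] = proposals[i][0]
    print(f"step {step + 1}: {' '.join(seq)}")
```

Each iteration touches many positions per model call, which is where the claimed per-step efficiency over one-token-per-call decoding comes from.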
“Reasoning models are only as useful as their ability to run in production,” said Stefano Ermon, CEO and co-founder of Inception. “For the past few years, we've seen incredible progress in model capability, but much less progress in making that capability usable in low-latency use cases. With Mercury 2, we've built a system where high-quality reasoning runs fast enough and efficiently enough for real-time applications. When you get speed, cost, and quality working together, you unlock entirely new possibilities - and that's what excites us most.”
In standard benchmarks, consistent with Artificial Analysis’s methodology, Mercury 2 achieves approximately 1,000 tokens per second output throughput, compared with Claude 4.5 Haiku Reasoning at approximately 89 tokens per second and GPT-5 Mini at approximately 71 tokens per second. On quality benchmarks, Mercury 2 scored 91.1 on AIME 2025, 73.6 on GPQA, 71.3 on IFBench, 67.3 on LiveCodeBench, 38.4 on SciCode, and 52.9 on Tau2.
These scores place Mercury 2 within competitive range of Claude 4.5 Haiku and GPT-5.2 Mini on quality, while delivering roughly 10x the throughput.
“Most teams treat inference as an optimization exercise around the autoregressive stack, but Inception started from a more fundamental place: diffusion for language,” said Tim Tully, partner at Menlo Ventures. “Mercury 2 shows what happens when that foundation is paired with a serious approach to reasoning and deployment, not just demos. We believe Inception’s diffusion-based roadmap has the potential to reset expectations for how fast and scalable reasoning models can be.”
Building on Inception's diffusion-first foundation, Mercury 2 is built for a range of production workflows. Across these workflows, Mercury 2 has demonstrated three concrete advantages that matter in deployment: lower end-to-end latency, reduced inference cost at comparable quality, and improved output reliability through iterative refinement during generation. In practice, that means faster loops without compounding delays, fewer retries and fallbacks, and more predictable performance when workloads scale.
“As a people-first fund, we are proud to be the inception investor in Inception and thrilled by the progress this exceptional team has made. While the industry has spent billions optimizing around the same autoregressive architecture, Inception had the conviction to pursue a fundamentally different foundation - diffusion for language,” said Navin Chaddha, Managing Partner, Mayfield. “Mercury 2 proves that bet out, delivering production-grade reasoning at the speed and cost that real-world deployment actually demands.”
Mercury 2 also enables capabilities that are difficult to achieve with strictly sequential generation. Iterative refinement supports in-generation error correction and more controllable outputs, including structured responses for agent orchestration, code edits, and function calling, which helps teams maintain consistency and oversight as they move from prototypes to production.
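As a rough idea of how a structured, function-calling-style request might look, here is a minimal sketch using an OpenAI-compatible Python client. The base URL, model name, and JSON-mode support are assumptions for illustration, not confirmed details of the Inception API.

```python
# Hypothetical sketch of requesting a structured response suitable for
# agent orchestration or function calling. Endpoint, model identifier,
# and response_format support are assumptions, not documented behavior.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inceptionlabs.ai/v1",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="mercury-2",  # hypothetical model identifier
    messages=[
        {"role": "system",
         "content": 'Reply only with JSON: {"tool": str, "args": dict}.'},
        {"role": "user", "content": "Look up the weather in Palo Alto."},
    ],
    response_format={"type": "json_object"},  # JSON mode, if supported
)
print(resp.choices[0].message.content)
```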
Inception was founded by researchers from Stanford, UCLA, and Cornell who contributed to foundational work in diffusion models and other core AI techniques, including flash attention, decision transformers, and direct preference optimization. CEO Stefano Ermon is a co-inventor of the diffusion methods widely used in modern image and video generation systems.
Inception is hiring across research, engineering, and go-to-market roles. To learn more, visit the careers page.
About Inception
Inception develops diffusion-based large language models (dLLMs) designed for efficient, low-latency AI applications. While traditional autoregressive LLMs generate text sequentially, Inception’s diffusion-based models generate outputs in parallel, enabling faster inference and improved reliability for real-world use cases. Based in Palo Alto, California, Inception is backed by Menlo Ventures, Mayfield, Innovation Endeavors, M12 (Microsoft’s venture capital fund), Snowflake Ventures, Databricks Ventures, and individual backers including Andrew Ng and Andrej Karpathy. For more information, visit www.inceptionlabs.ai.
Source: Business Wire