PALO ALTO, Calif.: Inception, the company behind the first commercial diffusion large language models (dLLMs), today announced the launch of Mercury 2, the fastest reasoning LLM and first reasoning dLLM. Mercury 2 delivers 5x faster performance while reducing the latency and cost barriers that have limited real‑world deployment of reasoning systems.
Mercury 2 models are available today via the Inception API.
Every major LLM in production today, including GPT, Claude, and Gemini, relies on the same core mechanism: autoregressive generation. They produce text sequentially, one token at a time. This approach has a low ceiling because speed is ultimately bounded by the serial nature of generation, and the constraint worsens as reasoning depth increases, driving up serving costs and driving down responsiveness. Constrained by this ceiling, the industry has largely taken three paths to improve speed: specialized chips, optimized serving stacks, and model compression, which trades capability for speed. Leading labs and infrastructure providers have poured billions into these efforts to squeeze performance gains out of the same token-by-token generation loop.
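To make the bottleneck concrete, the serial decode loop can be sketched in a few lines. This is an illustrative toy, not any vendor's implementation; toy_next_token is a stand-in for a full forward pass of an autoregressive model.

```python
# Toy sketch of the autoregressive loop described above: each new token
# requires a full model evaluation that depends on all previous tokens,
# so the work is inherently serial. All names here are illustrative.
def toy_next_token(context):
    """Stand-in for one full forward pass of an autoregressive LLM."""
    return f"tok{len(context)}"

context, max_new_tokens = ["<prompt>"], 5
for _ in range(max_new_tokens):
    # One model call yields exactly one token; step N cannot begin
    # until step N-1 has finished, which caps throughput.
    context.append(toy_next_token(context))
print(" ".join(context[1:]))
```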
Inception took a fundamentally different path - one rooted in diffusion, the same technical approach behind modern image and video generation systems, now applied to language. Mercury 2 advances that diffusion foundation into production-grade reasoning and sets a new performance standard for speed-optimized LLMs, delivering cost-efficient reasoning at 1,000 tokens per second throughput with performance on par with Claude 4.5 Haiku and GPT-5.2 Mini. The result is throughput and responsiveness that come from the model itself, enabling fast, scalable inference.
How dLLMs work
Instead of predicting the next token in a sequence, Mercury 2 starts with a rough sketch of the full output and iteratively refines it through a process called denoising. Each pass through the model updates many tokens in parallel, so a single neural network evaluation produces far more useful work per step. The speed advantage comes from the model itself, not from specialized hardware. And because the model refines iteratively rather than committing to each token permanently, it can correct errors mid-generation.
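The contrast with the serial loop above can also be sketched in a few lines. This is a toy illustration of confidence-based parallel denoising under assumed mechanics; Mercury 2's actual architecture and refinement schedule are not described in the announcement.

```python
# Toy sketch of diffusion-style text generation: the whole sequence is
# drafted at once and refined over a few denoising passes. Illustrative
# only; the vocabulary, schedule, and denoiser are placeholders.
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "quickly", "red"]
MASK = "<mask>"
SEQ_LEN, NUM_STEPS = 8, 4

def toy_denoiser(tokens):
    """Stand-in for one neural network evaluation: proposes a token and
    a confidence score for every position in parallel."""
    return [(random.choice(VOCAB), random.random()) for _ in tokens]

# Start from a fully masked "rough sketch" of the output.
seq = [MASK] * SEQ_LEN
for step in range(NUM_STEPS):
    proposals = toy_denoiser(seq)  # one pass updates many positions at once
    # Write in the most confident fraction this step; later passes can
    # still revise earlier choices, enabling mid-generation error fixes.
    ranked = sorted(range(SEQ_LEN), key=lambda i: -proposals[i][1])
    for i in ranked[: SEQ_LEN * (step + 1) // NUM_STEPS]:
        seq[i] = proposals[i][0]
    print(f"step {step + 1}: {' '.join(seq)}")
```

Each iteration touches many positions per model call, which is where the claimed per-step efficiency over one-token-per-call decoding comes from.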
“Reasoning models are only as useful as their ability to run in production,” said Stefano Ermon, CEO and co-founder of Inception. “For the past few years, we've seen incredible progress in model capability, but much less progress in making that capability usable in low-latency use cases. With Mercury 2, we've built a system where high-quality reasoning runs fast enough and efficiently enough for real-time applications. When you get speed, cost, and quality working together, you unlock entirely new possibilities - and that's what excites us most.”
In standard benchmarks, consistent with Artificial Analysis’s methodology, Mercury 2 achieves approximately 1,000 tokens per second output throughput, compared with Claude 4.5 Haiku Reasoning at approximately 89 tokens per second and GPT-5 Mini at approximately 71 tokens per second. On quality benchmarks, Mercury 2 scored 91.1 on AIME 2025, 73.6 on GPQA, 71.3 on IFBench, 67.3 on LiveCodeBench, 38.4 on SciCode, and 52.9 on Tau2.
These scores place Mercury 2 within competitive range of Claude 4.5 Haiku and GPT-5.2 Mini on quality, while delivering roughly 10x the throughput.
“Most teams treat inference as an optimization exercise around the autoregressive stack, but Inception started from a more fundamental place: diffusion for language,” said Tim Tully, partner at Menlo Ventures. “Mercury 2 shows what happens when that foundation is paired with a serious approach to reasoning and deployment, not just demos. We believe Inception’s diffusion-based roadmap has the potential to reset expectations for how fast and scalable reasoning models can be.”
Building on Inception's diffusion-first foundation, Mercury 2 is built for a range of production workflows. Across these workflows, Mercury 2 has demonstrated three concrete advantages that matter in deployment: lower end-to-end latency, reduced inference cost at comparable quality, and improved output reliability through iterative refinement during generation. In practice, that means faster loops without compounding delays, fewer retries and fallbacks, and more predictable performance when workloads scale.
“As a people-first fund, we are proud to be the inception investor in Inception and thrilled by the progress this exceptional team has made. While the industry has spent billions optimizing around the same autoregressive architecture, Inception had the conviction to pursue a fundamentally different foundation - diffusion for language,” said Navin Chaddha, Managing Partner, Mayfield. “Mercury 2 proves that bet out, delivering production-grade reasoning at the speed and cost that real-world deployment actually demands.”
Mercury 2 also enables capabilities that are difficult to achieve with strictly sequential generation. Iterative refinement supports in-generation error correction and more controllable outputs, including structured responses for agent orchestration, code edits, and function calling, which helps teams maintain consistency and oversight as they move from prototypes to production.
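As a rough idea of how a structured, function-calling-style request might look, here is a minimal sketch using an OpenAI-compatible Python client. The base URL, model name, and JSON-mode support are assumptions for illustration, not confirmed details of the Inception API.

```python
# Hypothetical sketch of requesting a structured response suitable for
# agent orchestration or function calling. Endpoint, model identifier,
# and response_format support are assumptions, not documented behavior.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inceptionlabs.ai/v1",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="mercury-2",  # hypothetical model identifier
    messages=[
        {"role": "system",
         "content": 'Reply only with JSON: {"tool": str, "args": dict}.'},
        {"role": "user", "content": "Look up the weather in Palo Alto."},
    ],
    response_format={"type": "json_object"},  # JSON mode, if supported
)
print(resp.choices[0].message.content)
```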
Inception was founded by researchers from Stanford, UCLA, and Cornell who contributed to foundational work in diffusion models and other core AI techniques, including flash attention, decision transformers, and direct preference optimization. CEO Stefano Ermon is a co-inventor of the diffusion methods widely used in modern image and video generation systems.
Inception is hiring across research, engineering, and go-to-market roles. To learn more, visit the careers page.
About Inception
Inception develops diffusion-based large language models (dLLMs) designed for efficient, low-latency AI applications. While traditional autoregressive LLMs generate text sequentially, Inception’s diffusion-based models generate outputs in parallel, enabling faster inference and improved reliability for real-world use cases. Based in Palo Alto, California, Inception is backed by Menlo Ventures, Mayfield, Innovation Endeavors, M12 (Microsoft’s venture capital fund), Snowflake Ventures, Databricks Ventures, and individual backers including Andrew Ng and Andrej Karpathy. For more information, visit www.inceptionlabs.ai.
Source: Business Wire