Mercury: Commercial-scale diffusion language model

379 HyprMusic 166 4/30/2025, 9:51:10 PM inceptionlabs.ai ↗

Comments (166)

inerte · 1d ago
Not sure if I would trade off speed for accuracy.

Yes, it's incredibly boring to wait for the AI Agents in IDEs to finish their job. I get distracted and open YouTube. Once I gave Cline a prompt so big and complex that it spent 2 straight hours writing code.

But after these 2 hours I spent 16 more tweaking and fixing all the stuff that wasn't working. I now realize I should have done things incrementally even when I have a pretty good idea of the final picture.

I've been more and more only using the "thinking" models: o3 in ChatGPT, and Gemini / Claude in IDEs. They're slower, but usually get it right.

But at the same time I am open to the idea that speed can unlock new ways of using the tooling. It would still be awesome to basically just have a conversation with my IDE while I am manually testing the app. Or combine really fast models like this one with a "thinking background" one that would run for seconds/minutes and try to catch the bugs left behind.

I guess only giving it a try will tell.

XenophileJKO · 1d ago
So my personal belief is that diffusion models will enable higher degrees of accuracy. This is because, unlike an auto-regressive model, a diffusion model can adjust a whole block of tokens when it encounters some kind of disjunction.

Think of the old example where an auto-regressive model would output: "There are 2 possibilities.." before it really enumerated them. Often the model has trouble overcoming the bias and will hallucinate a response to fit the preceding tokens.

Chain of thought and other approaches help overcome this and other issues by incentivizing validation, etc.

With diffusion, however, it is easier for the rest of the generated answer to change that set of tokens to match the actual number of possibilities enumerated.

This is why I think you'll see diffusion models be able to do some more advanced problem solving with a smaller number of "thinking" tokens.

pama · 21h ago
Unfortunately the intuition and the math proofs so far suggest that autoregressive training learns the joint distribution of probabilistic streams of tokens much better than diffusion models do or ever will. My intuitive take is that the conditional probability distribution of decoder-only autoregressive models is at just the right level of complexity for probabilistic models to learn accurately enough.

Intuitively (and simplifying things at the risk of breaking rigor), diffusion (or masked) models occasionally have to issue tokens with less information and thus higher variance than a pure autoregressive model would, so the joint distribution, i.e. the probability of the whole sentence/answer, will be lower, and thus diffusion models will never get precise enough. Of course, during generation the sampling techniques influence the above simplified idea dramatically, and the typical randomized sampling for next-token prediction is suboptimal and could in principle be beaten by a carefully designed block diffusion sampler in some contexts, though I haven't seen real examples of it yet.

But the key ideas of the above scribbles are still valid: autoregressive models will always be better (or at least equal) probabilistic models of sequential data than diffusion models. So diffusion models mostly offer a tradeoff of performance vs quality. Sometimes there is a lot of room for that tradeoff in practice.
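
(A rough way to see the contrast being described, in symbols; this is a sketch by the editor, not from the comment. An autoregressive model trains directly on the exact chain-rule factorization of the joint, while masked/discrete diffusion language models are typically trained on a variational lower bound of the joint likelihood. Schematically, with \tilde{x} a noised/masked copy of x, and the bound shown for a single noise level even though the real objective averages over noise levels:)

    % autoregressive: exact log-likelihood
    \log p_\theta(x_{1:N}) = \sum_{t=1}^{N} \log p_\theta(x_t \mid x_{<t})

    % masked diffusion: trained on a lower bound (ELBO)
    \log p_\theta(x_{1:N}) \;\ge\; \mathbb{E}_{q(\tilde{x} \mid x)}\!\left[\log p_\theta(x \mid \tilde{x})\right] \;-\; D_{\mathrm{KL}}\!\left(q(\tilde{x} \mid x)\,\|\,p(\tilde{x})\right)
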
niemandhier · 19h ago
This is tremendously interesting!

Could you point me to some literature? Especially regarding mathematical proofs of your intuition?

I’d like to recalibrate my priors to align better with current research results.

GistNoesis · 15h ago
From the mathematical point of view the literature is about the distinction between a "filtering" distribution and a "smoothing" distribution. The smoothing distribution is strictly more powerful.

In theory, intuitively, the smoothing distribution has access to all the information that the filtering distribution has, plus some additional information, and therefore its achievable minimum (loss) is at or below that of the filtering distribution.

In practice, because the smoothing input space is much bigger, keeping the same number of parameters we may not reach a better score: with diffusion we are tackling a much harder problem (the whole problem), whereas with autoregressive models we are taking a shortcut, which happens to be one that humans are probably biased toward too (communication evolved so that it can be serialized to be exchanged orally).
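
(For readers unfamiliar with the terminology, which comes from state-space models; a sketch, not from the comment: the filtering distribution conditions only on the past, the smoothing distribution conditions on past and future, so per token the latter can only be sharper or equal, since conditioning never increases entropy:)

    p_{\text{filter}}(x_k \mid x_{1:k-1})                      % past only (autoregressive view)
    p_{\text{smooth}}(x_k \mid x_{1:k-1},\, x_{k+1:N})          % past and future (masked/diffusion view)
    H(x_k \mid x_{1:k-1}, x_{k+1:N}) \;\le\; H(x_k \mid x_{1:k-1})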

pama · 8h ago
Although what you say about smoothing vs filtering is true in principle, for conditional generation of the eventual joint distribution starting from the same condition and using an autoregressive vs a diffusive LLM, it is the smoothing distribution that has less power. In other words, during inference, starting from J tokens, writing token number K is of course better with diffusion if you also have some given tokens after token K and up to the maximal token N. However, if your input is fixed (tokens up to J) and you have to predict those additional tokens (from J+1 to N), you are solving a harder problem and have a lower joint probability at the end of the inference for the full generated sequence from J+1 up to N.
pama · 18h ago
I am still jetlagged and not sure what the most helpful reference would be. Maybe start from the block diffusion paper I recommended in a parallel thread and trace your way up/down from there. The logic leading to Eq 6 is a special case of such a math proof.

https://openreview.net/forum?id=tyEyYT267x

kmacdough · 18h ago
What are the barriers to mixed architecture models? Models which could seamlessly pass from autoregressive to diffusion, etc.

Humans can integrate multiple sensory processing centers and multiple modes of thought all at once. It's baked into our training process (life).

pama · 18h ago
The human processing is still autoregressive, but using multiple parallel synchronized streams. There is no problem with such an approach and my best guess is that in the next year we will see many teams training models using such tricks for generating reasoning traces in parallel.

The main concern is taking a single probabilistic stream (e.g. a book) and comparing autoregressive modelling of it with diffusive modelling of it.

Regarding mixing diffusion and autoregressive—I was at ICLR last week and this work is probably relevant: https://openreview.net/forum?id=tyEyYT267x

cchance · 8h ago
Maybe diffusion for "thoughts" and autoregressive for output :S
efavdb · 1d ago
Suggests an opportunity for hybrids, where the diffusion model might be responsible for the large-scale structure of the response and the next-token model for filling in details. Sort of like a multi-scale model in dynamics simulations.
AlexCoventry · 1d ago
> it can adjust a whole block of tokens when it encounters some kind of disjunction.

This is true in principle for general diffusion models, but I don't think it's true for the noise model they use in Mercury (at least, going by a couple of academic papers authored by the Inception co-founders.) Their model generates noise by masking a token, and once it's masked, it stays masked. So the reverse-diffusion gets to decide on the contents of a masked token once, and after that it's fixed.
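
(A toy sketch of the absorbing noise process described above, just to make the "once masked, stays masked; once filled, stays fixed" point concrete. `denoise_fn` is a stand-in for the learned model, not anything from Inception's papers:)

    import random

    MASK = "<mask>"

    def forward_absorb(tokens, noise_level):
        # Forward process: each position is independently replaced by MASK with
        # probability `noise_level`. A masked position never turns into a
        # different token as noise increases -- the mask state is absorbing.
        return [MASK if random.random() < noise_level else t for t in tokens]

    def reverse_step(tokens, denoise_fn):
        # One reverse step: only currently-masked positions may be filled in
        # (denoise_fn may also leave them masked for a later step). Positions
        # that already hold a real token are frozen and never revised.
        return [denoise_fn(tokens, i) if t == MASK else t
                for i, t in enumerate(tokens)]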

freeqaz · 1d ago
Here are two papers linked from Inception's site:

1. Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution - https://arxiv.org/abs/2310.16834

2. Simple and Effective Masked Diffusion Language Models - https://arxiv.org/abs/2406.07524

AlexCoventry · 1d ago
Thanks, yes, I was thinking specifically of "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution". They actually consider two noise distributions: one with uniform sampling for each noised token position, and one with terminal masking (the Q^{uniform} and Q^{absorb}). However, the terminal-masking system is clearly superior in their benchmarks.

https://arxiv.org/pdf/2310.16834#page=6

macleginn · 1d ago
The exact types of path dependencies in inference on text-diffusion models look like an interesting research project.
AlexCoventry · 22h ago
Yes, the problem is coming up with a noise model where reverse diffusion is tractable.
XenophileJKO · 1d ago
Thank you, I'll have to read the papers. I don't think I have read theirs.
fizx · 1d ago
Once that auto-regressive model goes deep enough (or uses "reasoning"), it actually has modeled what possibilities exist by the time it's said "There are 2 possibilities.."

We're long past that point of model complexity.

klipt · 1d ago
But as everyone knows, computer science has two hard problems: naming things, cache invalidation, and off by one errors.
tyre · 1d ago
Check out RooCode if you haven’t. There’s an orchestrator mode that can start with a big model to come up with a plan and break down, then spin out small tasks to smaller models for scoped implementation.
danenania · 1d ago
If you’re open to a terminal-based approach, this is exactly what my project Plandex[1] focuses on—breaking up and completing large tasks step by step.

1 - https://github.com/plandex-ai/plandex

jillesvangurp · 18h ago
I think speed and convenience are essential. I use chat gpt desktop for coding. Not because it's the best but because it's fast and easy and doesn't interrupt my flow too much. I mostly stick to the 4o model. I only use the o3 model when I really have to. Because at that point getting an answer is slooooow. 4o is more than good enough most of the time.

And more importantly it's a simple option+shift+1 away. I simply type something like "fix that" and it has all the context it needs to do its thing. Because it connects to my IDE and sees my open editor and the highlighted line of code that is bothering me. If I don't like the answer, I might escalate to o3 sometimes. Other models might be better but they don't have the same UX. Claude desktop is pretty terrible, for example. I'm sure the model is great. But if I have to spoon feed it everything it's going to annoy me.

What I'd love is for smaller faster models to be used by default and for them to escalate to slower more capable models on a need to have basis only. Using something like o3 by default makes no sense. I don't want to have to think about which model is optimal for what question. The problem of figuring out what model is best to use is a much simpler one than answering my questions. And automating that decision opens the doors to having a multitude of specialized models.

matznerd · 17h ago
You're missing that Claude desktop has MCP servers, which can extend it to do a lot more, including much better real life "out of the box" uses. You can do things like use Obsidian as a filesystem or connect to local databases to really extend the abilities. You can also read and write to github directly and bring in all sorts of other tools.
amelius · 1d ago
Wouldn't it be possible to trade speed back for accuracy, e.g. by asking the model to look at a problem from different angles, let it criticize its own output, etc.?
nowittyusername · 22h ago
Just have it sample for longer, or create a simple workflow that uses a Monte Carlo tree search approach. I don't see why this won't improve accuracy. I would love to see someone run tests to see how accurate the model is compared to similar-parameter models in a per-time-block benchmark. Like if it can get the same accuracy as a similar-parameter autoregressive model but with half the speed, you already have a winner, besides the other advantages of a diffusion-based model.
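
(The simplest version of "sample for longer" is best-of-N with some scorer; a minimal sketch, where `generate` and `score` are hypothetical hooks for the model and whatever verifier you trust, e.g. a test runner or a reward model:)

    def best_of_n(prompt, generate, score, n=8):
        # Spend the speed surplus on extra samples, then keep the candidate
        # the scorer likes best. `generate` and `score` are placeholders.
        candidates = [generate(prompt) for _ in range(n)]
        return max(candidates, key=score)
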
kadushka · 1d ago
AI field desperately needs smarter models - not faster models.
geysersam · 22h ago
Definitely needs faster and cheaper models. Fast and cheap models could replace software in tons of situations. Imagine a vending machine or a mobile game or a word processor where basically all logic is implemented as a prompt to an LLM. It would serve as the ultimate high-level programming language.
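
(Taken literally, "all logic as a prompt" would look something like this sketch; `llm` is a placeholder for whatever fast, cheap model the comment imagines, and the prompt/JSON format is invented for illustration:)

    import json

    def vending_machine_step(state, customer_input, llm):
        # The LLM is the controller: it gets the machine state plus the
        # customer's request and replies with an action and the new state.
        prompt = (
            "You control a vending machine.\n"
            f"State: {json.dumps(state)}\n"
            f"Customer: {customer_input}\n"
            'Reply with JSON: {"action": "...", "new_state": {...}}'
        )
        return json.loads(llm(prompt))
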
janalsncm · 21h ago
I think natural language to code is the right abstraction. Easy enough barrier to entry but still debuggable. Debugging why an LLM randomly gives you Mountain Dew instead of Sprite if you have a southern accent sounds like a nightmare.
geysersam · 21h ago
I'm not sure it would be that hard to debug. Make sure you can reproduce the llm state (by storing the random seed for the session, or something like that) and then ask it "why did you just now give that customer mountain dew when they ordered sprite?"
otabdeveloper4 · 20h ago
> and then ask it "why did you just now give that customer mountain dew when they ordered sprite?"

Worse than useless for debugging.

An LLM can't think and doesn't have capabilities for self-reflection.

It will just generate a plausible stream of tokens in reply that may or may not correspond to the real reason why.

geysersam · 16h ago
Of course an LLM can't think. But that doesn't mean it can't answer simple questions about the output that was produced. Just try it out with ChatGPT when you have time. Even if it's not perfectly accurate, it's still useful for debugging.

Just think about it as a human employee. Can they always say why they did what they did? Often, but not always. Sometimes you will have to work to figure out the misunderstanding.

otabdeveloper4 · 11h ago
> it's still useful for debugging

How so? What the LLM says is whatever is more likely given the context. It has no relation to the underlying reality whatsoever.

geysersam · 10h ago
Not sure what you mean by "relation to the underlying reality". The explanation is likely to be correlated with the underlying reason for the answer.

For example, here is a simple query:

> I put my bike in the bike stand, I removed my bike from the bike stand and biked to the beach, then I biked home, where is my bike. Answer only with the answer and no explanation

> Chatgpt: At home

> Why did you answer that?

> I answered that your bike is at home because the last action you described was biking home, which implies you took the bike with you and ended your journey there. Therefore, the bike would logically be at home now.

Do you doubt that the answer would change if I changed the query to make the final destination be "the park" instead of "home"? If you don't doubt that, what do you mean that the answer doesn't correspond to the underlying reality? The reality is the answer depends on the final destination mentioned, and that's also the explanation given by the LLM, clearly the reality and the answers are related.

janalsncm · 6h ago
You need to find an example of the LLM making a mistake. In your example, ChatGPT answered correctly. There are many examples online of LLMs answering basic questions incorrectly, and then the person asking the LLM why it did so. The LLM response is usually nonsense.

Then there is the question of what you would do with its response. It’s not like code where you can go in and update the logic. There are billions of floating point numbers. If you actually wanted to update the weights you’ll quickly find yourself fine-tuning the monstrosity. Orders of magnitude more work than updating an “if” statement.

janalsncm · 20h ago
Why not just store the state in the code and debug as usual, perhaps with LLM assistance? At least that’s tractable.
suddenlybananas · 17h ago
Why on earth would you implement a vending machine using an LLM?
K0balt · 16h ago
The same reason we make the butter dish suffer from existential angst.
jacob019 · 15h ago
Because it's easy and cheap. Like how many products use a Raspberry Pi or ESP32 when an ATtiny would do.
kadushka · 14h ago
How in the world is this easy and cheap? Are you planning to run this LLM inside the vending machine? Or are you planning to send those prompts to a remote LLM somewhere?
geysersam · 12h ago
The premise here is that the model runs fast and cheap. With the current state of the technology running a vending machine using an LLM is of course absurd. The point is that accuracy is not the only dimension that brings qualitative change to the kind of applications that LLMs are useful for.
kadushka · 10h ago
Running a vending machine using an LLM is absurd not because we can't run LLMs fast or cheap enough - it's because LLMs are not reliable, and we don't know yet how to make them more reliable. Our best LLM - o3 - doubled the previous model (o1) hallucination rate. OpenAI says it hallucinated a wrong answer 33% of the time in benchmarks. Do you want a vending machine that screws up 33% of the time?

Today, the accuracy of LLMs is by far a bigger concern (and a harder problem to solve) than its speed. If someone releases a model which is 10x slower than o3, but is 20% better in terms of accuracy, reliability, or some other metric of its output quality, I'd switch to it in a heartbeat (and I'd be ready to pay more for it). I can't wait until o3-pro is released.

K0balt · 9h ago
You could run a 3B model on 200 dollars worth of hardware and it would do just fine, 100 percent of the time, most of the time. I could definitely see someone talking it out of a free coke now and then though.

With vending machines costing 2-5k, it’s not out of the question, but it’s hard to imagine the business case for it. Maybe the tantalizing possibility of getting a free soda would attract traffic and result in additional sales from frustrated grifters? Idk.

Wazako · 1d ago
Yet DeepSeek has shown that more dialogue increases quality. Increasing speed is therefore important if you need thinking models.
guiriduro · 20h ago
If you have much more speed in the available time, for an activity like coding, you could use that for iteration, writing more tests and satisfying them, especially if you can pair that with a concurrent test runner to provide feedback. I'm not sure the end result would be lower scoring/smartness than an LLM could achieve in the same duration.
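
(A minimal sketch of that iterate-against-the-tests loop; `generate_patch` and `apply_patch` are hypothetical hooks, and the pytest command is just an example test runner:)

    import subprocess

    def iterate_until_green(task, generate_patch, apply_patch, max_rounds=5):
        feedback = ""
        for _ in range(max_rounds):
            patch = generate_patch(task + feedback)   # fast model proposes a change
            apply_patch(patch)
            result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
            if result.returncode == 0:
                return patch                          # tests pass, done
            # feed the failures back in and try again
            feedback = "\n\nTest output:\n" + result.stdout[-2000:]
        return None                                   # give up after max_rounds
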
kadushka · 15h ago
> I'm not sure the end result would be lower scoring/smartness than an LLM could achieve in the same duration.

It probably wouldn’t with current models. That’s exactly why I said we need smarter models - not more speed. Unless you want to “use that for iteration, writing more tests and satisfying them, especially if you can pair that with a concurrent test runner to provide feedback.” - I personally don’t.

otabdeveloper4 · 21h ago
LLMs can't think, so "smarter" is not possible.
IshKebab · 20h ago
They can by the normal English definitions of "think" and "smart". You're just redefining those words to exclude AI because you feel threatened by it. It's tedious.
otabdeveloper4 · 11h ago
Incorrect. LLMs have no self-reflection capability. That's a key prerequisite for "thinking". ("I think, therefore I am.")

They are simple calculators that answer with whatever tokens are most likely given the context. If you want reasonable or correct answers (rather than the most likely) then you're out of luck.

IshKebab · 8h ago
It is not a key prerequisite for "thinking". It's "I think therefore I am" not "I am self-aware therefore I think".

In the 90s if your cursor turned into an hourglass and someone said "it's thinking" would you have pedantically said "NO! It is merely calculating!"

Maybe you would... but normal people with normal English would not.

baq · 20h ago
Technically correct and completely beside the point.
K0balt · 16h ago
People can't fly.
kazinator · 1d ago
> Not sure if I would trade off speed for accuracy.

Are you, though?

There are obvious examples of obtaining speed without losing accuracy, like using a faster processor with bigger caches, or more processors.

Or optimizing something without changing semantics, or the safety profile.

Slow can be unreliable; a 10 gigabit ethernet can be more reliable than a 110 baud acoustically-coupled modem in mean time between accidental bit flips.

Here, the technique is different, so it is apples to oranges.

Could you tune the LLM paradigm so that it gets the same speed, and how accurate would it be?

cedws · 15h ago
You left an LLM to code for two hours and then were surprised when you had to spend a significant amount of time more cleaning up after it?

Is this really what people are doing these days?


g-mork · 1d ago
The excitement for me is the implications for lower-energy models. Tech like this could thoroughly break the Nvidia stranglehold, at least for some segments.
thomastjeffery · 10h ago
Accuracy is a myth.

These models do not reason. They do not calculate. They perform no objectivity whatsoever.

Instead, these models show us what is most statistically familiar. The result is usually objectively sound, or at least close enough that we can rewrite it as something that is.

otabdeveloper4 · 21h ago
> I now realize I should have done things incrementally even when I have a pretty good idea of the final picture.

Or just save yourself the time and money and code it yourself like it's 2020.

(Unless it's your employer paying for this waste, in which case go for it, I guess.)

kittikitti · 1d ago
I don't use the best available models for prototyping because it can be expensive or more time-consuming. This innovation makes prototyping faster, and practicing prompts on slightly lower-accuracy models can provide more realistic expectations.
dmos62 · 1d ago
If the benchmarks aren't lying, Mercury Coder Small is as smart as 4o-mini and costs the same, but is an order of magnitude faster when outputting (unclear if pre-output delay is notably different). Pretty cool. However, I'm under the impression that 4o-mini was superseded by 4.1-mini and 4.1-nano for all use cases (correct me if I'm wrong). Unfortunately they didn't publish comparisons with the 4.1 line, which feels like an attempt to manipulate the optics. Or am I misreading this?

Btw, why call it "coder"? 4o-mini level of intelligence is for extracting structured data and basic summaries, definitely not for coding.

kmacdough · 19h ago
It appears to be purpose-trained for coding. They also have a generalist model, but that's not the one being compared.

I agree, the comparison is dated, cherry-picked and doesn't reference the thinking models people do use for coding.

But it's also a bit of a new architecture in early stages of development/testing. Comparing against other small non-thinking models is a good step. It demonstrates the strategy is viable and worth exploring. Time will tell its value. Perhaps a guiding LLM could lean on diffusion to speed up generation. Perhaps we'll see more mixed-architecture models. Perhaps diffusion beats out current LLMs, but from my armchair this seems unlikely.

g-mork · 1d ago
There are some open weight attempts at this around too: https://old.reddit.com/r/LocalLLaMA/search?q=diffusion&restr...

Saw another on Twitter past few days that looked like a better contender to Mercury, doesn't look like it got posted to LocalLLaMa, and I can't find it now. Very exciting stuff

freeqaz · 1d ago
This video showing how diffusion models generate text is mesmerizing to look at! (Comment in top thread linked in your search results.)

https://www.reddit.com/media?url=https://i.redd.it/xci0dlo7h...

falcor84 · 1d ago
That seems fake - diffusion models should evolve details over time, right? This one just fills in the blanks gradually, like an old progressive JPEG.

EDIT: This video in TFA was actually a much cooler demonstration - https://framerusercontent.com/assets/YURlGaqdh4MqvUPfSmGIcao...

m-hodges · 1d ago
It fails the MU Puzzle¹ by violating rules:

To transform the string "AB" to "AC" using the given rules, follow these steps:

1. *Apply Rule 1*: Add "C" to the end of "AB" (since it ends in "B"). - Result: "ABC"

2. *Apply Rule 4*: Remove the substring "CC" from "ABC". - Result: "AC"

Thus, the series of transformations is: - "AB" → "ABC" (Rule 1) - "ABC" → "AC" (Rule 4)

This sequence successfully transforms "AB" to "AC".

¹ https://matthodges.com/posts/2025-04-21-openai-o4-mini-high-...

schappim · 1d ago
It's nice to see a team doing something different.

The cost[1] is US$1.00 per million output tokens and US$0.25 per million input tokens. By comparison, Gemini 2.5 Flash Preview charges US$0.15 per million tokens for text input and $0.60 (non-thinking) output[2].

Hmmm... at those prices they need to focus on markets where speed is especially important, eg high-frequency trading, transcription/translation services and hardware/IoT alerting!

1. https://files.littlebird.com.au/Screenshot-2025-05-01-at-9.3...

2. https://files.littlebird.com.au/pb-IQYUdv6nQo.png

kmacdough · 17h ago
I would be extremely hesitant to assume a direct relationship between pricing and cost. A behemoth like Google is very willing to take significant losses for years to grow market share. Back in 2014-2015 Uber often charged less than the Boston subway, but it always cost them MUCH more under the hood. AFAIK they're still not profitable.

Chinese companies will be similarly eager for market share, but not everyone has the access to the same raw capital.

loufe · 14h ago
Absolutely this. Gemini is amazing, but I'm under no illusions that their principal goal right now is to boost their database of high-quality training data with free access via AI Studio. That said, custom silicon, with internal model teams collaborating to exploit that hardware's idiosyncrasies, must be a massive advantage as well.
dvdhs · 23h ago
Not sure how HFTs are relevant here
KingMob · 21h ago
HFT is limited by time in how much processing it can do. In theory, a super-fast dLLM would enable them to incorporate information sources in their decision-making that were previously too high-level. E.g., imagine using wire reports to predict an arbitrage opportunity that doesn't even exist yet (I dunno, not an HFT guy).

In practice, iiuc, HFT still happens within 10s of milliseconds, and I doubt even current dLLM is THAT fast.

jbellis · 1d ago
What is the price on Mercury Mini?
vlovich123 · 21h ago
I just tried giving it a coding snippet that has a bug. ChatGPT & Claude found the bug instantly. Mercury fails to find it even after several reprompts (it's hallucinating). On the upside it is significantly faster. That's promising, since the edge for ChatGPT and Claude is the prolonged time and energy they've spent building training infrastructure, tooling, datasets, etc. to pump out models with high task performance.
kmacdough · 18h ago
Keep in mind this release was never intended to prove superiority. Rather, it shows an alternative structure with some promising performance characteristics. More work needs to be done to show real application, but this is very valuable learning.

That's part of the reason to compare against older, smaller models since they're at a more comparable stage of development.

vlovich123 · 13h ago
I agree. As I was trying to imply, I think if you integrated this structure into OpenAI’s or Claude’s stack, you’d get a vastly cheaper model that’s significantly faster with similar task performance (modulo the structural task performance parts that are hard to port to this new architecture). The point about quality was also intended to temper some of the excitement about the scores published on the page.
twotwotwo · 1d ago
It's kind of weird to think that in a coding assistant, an LLM is regularly asked to produce a valid block of code top to bottom, or repeat a section of code with changes, when that's not what we do. (There are other intuitively odd things about this, like the amount of compute spent generating 'easy' tokens, e.g. repeating unchanged code.) Some of that might be that models are just weird and intuition doesn't apply. But maybe the way we do it--jumping around, correcting as we go, etc.--is legitimately an efficient use of effort, and a model could do its job better, with less effort, or both if it too used some approach other than generating the whole sequence start-to-finish.

There's already stuff in the wild moving that direction without completely rethinking how models work. Cursor and now other tools seem to have models for 'next edit' not just 'next word typed'. Agents can edit a thing and then edit again (in response to lints or whatever else); approaches based on tools and prompting like that can be iterated on without the level of resources needed to train a model. You could also imagine post-training a model specifically to be good at producing edit sequences, so it can actually 'hit backspace' or replace part of what it's written if it becomes clear it wasn't right, or if two parts of the output 'disagree' and need to be reconciled.

From a quick search it looks like https://arxiv.org/abs/2306.05426 in 2023 discussed backtracking LLMs and https://arxiv.org/html/2410.02749v3 / https://github.com/upiterbarg/lintseq trained models on synthetic edit sequences. There is probably more out there with some digging. (Not the same topic, but the search also turned up https://arxiv.org/html/2504.20196 from this Monday(!) about automatic prompt improvement for an internal code-editing tool at Google.)
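
(For concreteness, one trivial form an "edit sequence" output could take: spans to replace rather than a full retyped file. Purely illustrative; this is not the format used by any of the papers or tools mentioned:)

    def apply_edits(text, edits):
        # Each edit is (start, end, replacement); apply from the end so
        # earlier offsets stay valid.
        for start, end, replacement in sorted(edits, reverse=True):
            text = text[:start] + replacement + text[end:]
        return text

    src = "def add(a, b):\n    return a - b\n"
    i = src.index("-")
    print(apply_edits(src, [(i, i + 1, "+")]))  # fix the operator without regenerating the file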

vineyardmike · 19h ago
> an LLM is regularly asked to produce a valid block of code top to bottom, or repeat a section of code with changes, when that's not what we do.

Eh, it's mostly what we do. We don't re-type everything every time, but we do type top-to-bottom when we type. As you later mentioned, "next edit" models really strike that balance, and they're like 50% of the value I derive from a tool like Cursor.

I'd love to see more diff-outputs instead of "retyping" everything (with a nice UI for the humans). I suspect that part of the reason we have these "inhuman" actions is that the chat interface we've been using has led to certain outputs being more desirable due to the medium.

NitpickLawyer · 20h ago
Looks interesting, and my intuition is that code is a good application of diffusion LLMs, especially if they get support for "constrained generation", as there's already plenty of tooling around code (linters and so on).

Something I don't see explored in their presentation is the ability of the model to recover from errors / correct itself. SotA LLMs shine at this; a few back-and-forths w/ Sonnet / Gemini Pro / etc. really solve most problems nowadays.

jonplackett · 1d ago
Ok. My go to puzzle is this:

You have 2 minutes to cool down a cup of coffee to the lowest temp you can

You have two options:

1. Add cold milk immediately, then let it sit for 2 mins.

2. Let it sit for 2 mins, then add the cold milk.

Which one cools the coffee to the lowest temperature and why?

And Mercury gets this right - while as of right now ChatGPT 4o gets it wrong.

So that’s pretty impressive.

twic · 1d ago
Depends on the shape of the cup! You can contrive a cup shaped like an exponentially flaring horn, where adding the milk increases the volume a little, which massively increases the surface area, and so leads to faster cooling. Or you can have a cup with a converging top, like a brandy glass, where adding the milk reduces the surface area, and makes cooling even slower.
j_bum · 22h ago
And add the twist: what if the cup is made from a highly conductive material?
IshKebab · 20h ago
Clever, but it's implicitly a normal cylindrical cup.
jefftk · 1d ago
Claude 3.7 gets it exactly right:

To determine which option cools coffee the most, I'll analyze the heat transfer physics involved. The key insight is that the rate of heat loss depends on the temperature difference between the coffee and the surrounding air. When the coffee is hotter, it loses heat faster. Option 1 (add milk first, then wait):

- Adding cold milk immediately lowers the coffee temperature right away

- The coffee then cools more slowly during the 2-minute wait because the temperature difference with the environment is smaller

Option 2 (wait first, then add milk):

- The hot coffee cools rapidly during the 2-minute wait due to the large temperature difference

- Then the cold milk is added, creating an additional temperature drop at the end

Option 2 will result in the lowest final temperature. This is because the hotter coffee in option 2 loses heat more efficiently during the waiting period (following Newton's Law of Cooling), and then gets the same cooling benefit from the milk addition at the end. The mathematical principle behind this is that the rate of cooling is proportional to the temperature difference, so keeping the coffee hotter during the waiting period maximizes heat loss to the environment.

kazinator · 1d ago
That's totally cribbed from some discussion that occurred in its training.
Nevermark · 1d ago
As opposed to humans, who all derive the physics of heat transfer independently when given a question like this?

Not picking on you - this brings up something we could all get better at:

There should be a "First Rule of Critiquing Models": Define a baseline system to compare performance against. When in doubt, or for general critiques of models, compare to real world random human performance.

Without a real practical baseline to compare with, it's too easy to fall into subjective or unrealistic judgements.

"Second Rule": Avoid selectively biasing judgements by down selecting performance dimensions. For instance, don't ignore difference in response times, grammatical coherence, clarity of communication, and other qualitative and quantitative differences. Lack of comprehensive performance dimension coverage is like comparing runtimes of runners, without taking into account differences in terrain, length of race, altitude, temperature, etc.

It is very easy to critique. It is harder to critique in a way that sheds light.

accrual · 9h ago
> It is very easy to critique. It is harder to critique in a way that sheds light.

Well said. This is the sort of ethos I admire and aspire to on HN.

fph · 18h ago
This exact problem was in Martin Gardner's column for Scientific American in the 1970s. There are surely references all over the internet.
mhh__ · 1d ago
So is my knowledge of Newton's law of cooling
kazinator · 1d ago
If an LLM has only that knowledge and nothing else (pieces of text saying that heat transfer is proportional to some function of the temp difference), such that it is not trained on any texts that give problems and solutions in this area, it will not work this out, since it has nothing to generate tokens from.

Also, your knowledge doesn't come from anywhere near having scanned terabytes of text, which would take you multiple lifetimes of full time work.

mhh__ · 17h ago
We get way more info than LLMs do, just not solely from text
suddenlybananas · 17h ago
You have not read every accessible piece of text in existence.
mhh__ · 11h ago
There is more to life than just text, e.g. this is part of LeCun's argument against LLMs
suddenlybananas · 5h ago
LeCun's argument is based on a bad interpretation of how data is processed by the optic nerve; we don't receive that much raw data.

What we do have, is billions of years of evolution that has given a lot of innate knowledge which means we are radically more capable than LLMs despite having little data.

kazinator · 10h ago
There is more to text than just predicting tokens based on a vast volume of text.

There isn't an argument "against LLMs" as such; the argumentation is more oriented against the hype and incessant promotion of AI.

jonplackett · 20h ago
If it was just ‘in the training data’ they’d all get it right.

But they don’t.

kazinator · 12h ago
I don't think that can be postulated as a law, because they are a kind of lossy compression. Different lossy compressions will lose different details.
krackers · 1d ago
Hmm, a good nerd-snipe puzzle. I was never very good at physics, so hopefully someone can check my work... assuming upon mixing the coffee is at Tc and the milk at Tm, and simplifying to assume equivalent mass & specific heat, we have (Tf - Tc) = -(Tf - Tm) => Tf = (Tc+Tm)/2, which is intuitive (upon mixing we get the average temperature).

On the assumption that the cold milk is always at a fixed temperature until it's mixed in, the temperature of the coffee at the point of mixing is the main factor. Before and after, it follows Newton's law of cooling. So we're comparing something like Tenv + [(Tc+Tm)/2 - Tenv]e^(-2) vs (Tenv + [Tc - Tenv]e^(-2) + Tm)/2. The latter is greater than the former only when Tm > Tenv (the milk isn't cold); in other words it's better to let the coffee cool as much as possible before mixing, assuming the milk is colder than the environment.

Another interesting twist is to consider the case where the milk isn't kept at a fixed temperature but is also subject to warming (it's taken out of the fridge). Then the former equation is unchanged but the latter becomes (Tenv + [Tc - Tenv]e^(-2) + Tenv + [Tm - Tenv]e^(-2))/2. But this is equivalent to the former equation, so in this case it doesn't matter when you mix it.

Not 100% confident in either analysis, but I wonder if there's a more intuitive way to see it. I also don't know if deviating from the assumption of equivalent mass & specific heat changes the analysis (it might lead to a small range where, for the fixed case, situation 1 is better?). It's definitely not "intuitive" to me.
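
(A quick numerical check of this, under the same assumptions of equal amounts and milk held at a fixed temperature; the specific numbers are arbitrary:)

    import math

    T_coffee, T_milk, T_env = 90.0, 5.0, 20.0   # degrees C
    k, t = 0.5, 2.0                              # cooling constant (1/min), wait time (min)

    def cool(T0, minutes):
        # Newton's law of cooling: T(t) = T_env + (T0 - T_env) * exp(-k t)
        return T_env + (T0 - T_env) * math.exp(-k * minutes)

    mix_first  = cool((T_coffee + T_milk) / 2, t)    # option 1: add milk, then wait
    wait_first = (cool(T_coffee, t) + T_milk) / 2    # option 2: wait, then add milk
    print(f"mix first: {mix_first:.1f} C, wait first: {wait_first:.1f} C")
    # With milk colder than the room, waiting first always ends up cooler;
    # if the milk instead warms toward room temperature at the same rate, the two tie.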

stavros · 19h ago
There is a more intuitive way: Adding the milk later has a chance to get the cup below room temperature, whereas adding it early never will.
maytc · 1d ago
That example is probably in the training data?

The puzzle assumes that the room temperature is greater than the cold milk's temperature. When I added that the room temperature is, say, -10 °C, Mercury fails to see the difference.

djrj477dhsnv · 23h ago
Under any reasonable assumptions for the size and shape of the cup, the amount of coffee, the makeup of the air, etc., the room being -10c won't change the result.

It would only matter if the air were able to cool the coffee to a temperature less than that of the milk in under 2 minutes.

byearthithatius · 1d ago
Token-based Gemini, after 4 seconds of thinking:

Okay, let's break this down using the principle of heat transfer (specifically Newton's Law of Cooling):

    Heat loss is faster when the temperature difference is greater. A very hot object loses heat to cool surroundings much faster than a warm object.

    Option 1: Add milk immediately.

        You instantly lower the coffee's temperature by mixing it with cold milk.

        Now, this warmer (not hot) mixture sits for 2 minutes. Because the temperature difference between the mixture and the room is smaller, it cools down more slowly over those 2 minutes.

    Option 2: Let it sit for 2 mins, then add milk.

        The very hot coffee sits for 2 minutes. Because the temperature difference between the hot coffee and the room is large, it loses heat more quickly during these 2 minutes.

        After 2 minutes of rapid cooling, you then add the cold milk, lowering the temperature further.
Conclusion:

To get the coffee to the lowest temperature, you should choose Option 2: Let it sit for 2 mins, then add the cold milk.

drusepth · 1d ago
Is there a trick question in here, where the cold milk is trending closer to room temperature over the two minutes waited?
firejake308 · 1d ago
The rate of change of temperature is proportional to the difference in temperature. Adding the cold milk later lets you juice a larger temperature differential out of the 2-minute waiting phase. So you should add the milk later.
adrianmonk · 1d ago
> the cold milk later

I think what the other person is asking about is: can you be sure that the milk is (as) cold later?

There's a lot of discussion about what happens to the temperature of the coffee during the 2 minutes. What happens to the temperature of the milk during that same time?

Where is the milk stored? Do you grab it out of the refrigerator the moment you add it to the coffee? Or the cold milk sitting out on the countertop getting warmer? If so, how rapidly?

fc417fc802 · 1d ago
It's a safe bet that freshly brewed coffee is much farther from room temperature than refrigerated milk is. However deriving properties related to that symmetry (or lack thereof) would make an excellent question for an exam in an introductory class.
FilosofumRex · 23h ago
The two options are equivalent, since the final (equilibrium) temp of an adiabatic system (coffee + Milk + room) must be the same - ie it's the total amount of heat transferred that matters, and not the rates of heat transfer.

If the system is not adiabatic, i.e. the room is not big enough to remain at constant temp, or equilibrium is not achieved in 2 mins of cooling, then the puzzle statement must specify all three initial temps to be well posed.

fc417fc802 · 23h ago
What gave you the idea that this is an adiabatic system? It's a cup of freshly brewed coffee on your kitchen counter. Equilibrium will certainly not be achieved within 2 minutes. Even if it were, different schemes would reach it at different time points.

It is fundamentally a question about rate of energy transfer.

The thing to notice about the symmetric system is that both items will experience the same rate of transfer. However there's presumably more coffee in your coffee than there is milk so it's not actually symmetric. If you adjust the parameters for volume and specific heat to make the final mixed product symmetric then it no longer matters when you do the mixing.

crazygringo · 1d ago
For me, ChatGPT (the free version, GPT-4o mini I believe?) gets it right, choosing option 2 because the coffee will cool faster due to the larger temperature difference.

Unless there's a gotcha somewhere in your prompt that I'm missing, like what if the temperature of the room is hotter than the coffee, or so cold that the coffee becomes colder than the milk, or something?

I would be surprised if any models get it wrong, since I assume it shows up in training data a bunch?

jonplackett · 20h ago
This is what I got from full-fat 4o. Maybe thinking less helps!

ChatGPT:

Option 1 — Add the cold milk immediately — will result in a lower final temperature after 2 minutes.

Why: • Heat loss depends on the temperature difference between the coffee and the environment (usually room temperature). • If you add the milk early, the overall temperature of the coffee-milk mixture is reduced immediately. This lowers the average temperature over the 2 minutes, so less heat is lost to the air. • If you wait 2 minutes to add the milk, the hotter coffee loses more heat to the environment during those 2 minutes, but when you finally add the milk, it doesn’t cool it as much because the coffee’s already cooler and the temp difference between the milk and the coffee is smaller.

Summary: • Adding milk early = cooler overall drink after 2 minutes. • Adding milk late = higher overall temp after 2 minutes, because more heat escapes during the time the coffee is hotter.

Want me to show a simple simulation or visualisation of this?

crazygringo · 14h ago
Oof. I wonder what makes it so bad?

In my experience LLM's tend to be pretty good at basic logic as long as they understand the domain well enough.

I mean, it even gets it right at first -- "This lowers the average temperature over the 2 minutes, so less heat is lost to the air." -- but then it seems to get conceptually confused about heat loss vs cooling, which is surprising.

emmelaich · 23h ago
I had it write a Python program to calculate disk usage by directory -- basically a `du` clone. It was astonishingly fast (2s) and correct. I've tried other models which have gotten it wrong, been slow, and ignored my instructions to use topdown=False in the call to walk().
adammarples · 1d ago
If you let it sit for 2 minutes your time is up and you don't have time to add the cold milk
selcuka · 1d ago
By this logic you can't let it sit for 2 mins after you add the cold milk either, so both options are invalid.

In math/science questions some things are assumed to be (practically impossibly) instant.

cratermoon · 1d ago
> My go to puzzle is this:

> Mercury gets this right - while as of right now ChatGPT 4o get it wrong.

This is so common a puzzle it's discussed all over the internet. It's in the data used to build the models. What's so impressive about a machine that can spit out something easily found with a quick web search?

jonplackett · 20h ago
Just that what I thought would be better models don’t do it right.

I was expecting this model to be nowhere near ChatGPT

Although someone above is saying 4o-mini got it right so maybe it’s meaningless. Or maybe thinking less helps…

cratermoon · 10h ago
There is sufficient stochasticity in LLMs to invalidate most comparisons at this level. Minor changes in the prompt text, or even re-running the same model, will produce different results (depending on temperature and other parameters), much less different models.

Try re-running your test on the same model multiple times with the identical prompt, or varying the prompt. Depending on how much context the service you choose is keeping for you across a conversation, the behavior can change. Something as simple as prompting an incorrect response with a request to try again because the result was wrong can give different results.

Statistically, the model will eventually hit on the right combination of vectors and generate the right words from the training set, and as I noted before, this problem has a very high probability of being in the training data used to build all the models easily available.

freeqaz · 1d ago
Anybody able to get the "View Technical Report" button at the bottom to do anything? I was curious to glean more details but it doesn't work on either of my devices.

I'm curious what level of detail they're comfortable publishing around this, or are they going full secret mode?

albertzeyer · 10h ago
It links to this file: https://drive.google.com/file/d/1j1ofmm8iBaVreGC5TSF1oLsrOqB...

But everything past the first page seems to be missing from this PDF? There is just an abstract and a (partial) outline.

jtonz · 1d ago
I would be interested to see how people apply this as a coding assistant. For me, its application in solutioning seems very strong, particularly vibe coding, and potentially agentic coding. One of my main gripes with LLM-assisted coding is that getting output which catches all the scenarios I envision takes multiple attempts at refining my prompt, requiring regeneration of the output. Iterations are slow and often painful.

With the speed this can generate its solutions, you could have it loop through attempting the solution, feeding itself the output (including any errors found), and going again until it builds the "correct" solution.

bayesianbot · 23h ago
I basically did this with aider and Gemini 2.5 a few days ago and was blown away. Basically talked about the project structure, let it write the final plan to the file CONVENTIONS.md that gets automatically attached to the context, then kept asking "What should we do next" until tests were ready, and then I just ran a loop where it modifies the code and I press Return to run the tests and add the output to the prompt and let it go again.

About 10 000 lines of code, and I only intervened a few times, to revert few commits and once to cut a big file to smaller ones so we could tackle the problems one by one.

I did not expect LLMs to be able to do this so soon. But I just commented to say this about aider - the iteration loop really was mostly me pressing Return. Especially in the navigator mode PR, as it automatically looked up the correct files to attach to the context.

jbellis · 1d ago
Unfortunately a 4o mini level of intelligence just isn't enough to make this work, no matter how many iterations you let it try.
parsimo2010 · 1d ago
This sounds like a neat idea but it seems like bad timing. OpenAI just released token-based image generation that beats the best diffusion image generation. If diffusion isn't even the best at generating images, I don't know if I'm going to spend a lot of time evaluating it for text.

Speed is great but it doesn't seem like other text-based model trends are going to work out of the box, like reasoning. So you have to get dLLMs up to the quality of a regular autoregressive LLM and then you need to innovate more to catch up to reasoning models, just to match the current state of the art. It's possible they'll get there, but I'm not optimistic.

jonplackett · 1d ago
The reason image-1 is so good is because it’s the same model doing the talking and the image making.

I wonder if the same would be true for a multi-modal diffusion model that can now also speak?

freeqaz · 1d ago
Facebook has their Chameleon model from 2023 that was in this space. Ancient now.

There is also this GitHub project that I played with a while ago that's trying to do this. https://github.com/GAIR-NLP/anole

Are there any OSS models that follow this approach today? Or are we waiting for somebody to hack that together?

orbital-decay · 1d ago
Does it beat them because it's a transformer, or because it's a much larger end-to-end model with higher quality multimodal training?
scratchyone · 1d ago
I wonder if it benefits because it can attend to individual tokens of the prompt while generating, compared to typical diffusion models that just get a static vector embedding of the prompt.
StriverGuy · 12h ago
Related paper discussing diffusion models from 2 months ago: https://arxiv.org/abs/2502.09992
jakeinsdca · 1d ago
I just tried it and it was able to perfectly generate a piece of code for me that I needed for generating a 12-month rolling graph based on a list of invoices, and it seemed a bit easier and faster than ChatGPT.
agnishom · 20h ago
I'd hope that with diffusion, it would be able to go back and forth between parts of the output to adjust issues with part of the output which it had previously generated. This would not be possible with a purely sequential model.

However,

> Prompt: Write a sentence with ten words which has exactly as many r’s in the first five words as in the last five

>

> Response: Rapidly running, rats rush, racing, racing.

rfv6723 · 20h ago
Why would this not be possible with an autoregressive model?

o4 mini

https://chatgpt.com/share/681315c2-aa90-800d-b02d-c3ba653281...

agnishom · 17h ago
Thanks! I didn't know o4 was autoregressive!
tzury · 21h ago
Would have been nice if, along with this demo video[1] comparing the speed of 3 models, they had shared the artifacts as well, so we could compare quality.

[1] https://framerusercontent.com/assets/cWawWRJn8gJqqCGDsGb2gN0...

pants2 · 1d ago
This is awesome for the future of autocomplete. Current models aren't fast enough to give useful suggestions at the speed that I type - but this certainly is.

That said, token-based models are currently fast enough for most real-time chat applications, so I wonder what other use-cases there will be where speed is greatly prioritized over smarts. Perhaps trading on Trump tweets?

mlsu · 1d ago
It seems that with this technique you could not possibly do "chain of thought." That technique seems unique to auto-regressive architecture. Right?


badmonster · 1d ago
1000+ tokens/sec on H100s, a 5–10x speedup over typical autoregressive models — and without needing exotic hardware like Groq or Cerebras - impressive
lostmsu · 14h ago
Would batch inference increase throughput further? Or does it already peak the FLOPS?
carterschonwald · 1d ago
I actually just tried it. And I’m very impressed. Or at least it’s reasonable code to start with for nontrivial systems.
ZeroTalent · 19h ago
Look into groq.com, guys. Some good models at similar speed to Inception Labs.
sujayk_33 · 18h ago
That's faster inference because of the hardware (LPUs); here the question is about architectures (AR vs. diffusion).
rfv6723 · 17h ago
SRAM doesn't scale with advanced semiconductor nodes.

Groq is heading to a dead end.

strangescript · 1d ago
Speed is great, but you have to set the bar a little higher than last year's tiny models
byearthithatius · 1d ago
Interesting approach. However, I never thought of auto-regression as being _the_ current issue with language modeling. If anything, it seems the community was generally surprised just how far next-"token" prediction took us. Remember back when we did char-generating RNNs and were impressed they could make almost coherent sentences?

Diffusion is an alternative, but I am having a hard time understanding the whole "built-in error correction" thing; that sounds like marketing BS. Both approaches replicate probability distributions, which will be naturally error-prone because of variance.

nullc · 1d ago
Consider the entropy of the distribution of token X in these examples:

"Four X"

and

"Four X and seven years ago".

In the first case X could be pretty much anything, but in the second case we both know the only likely completion.

So it seems like there would be a huge advantage in not having to run autoregressively. But in practice it's less significant than you might imagine, because the AR model can internally model the probability of X conditioned on the stuff it hasn't output yet, and in fact, because without reinforcement the training causes it to converge on the target probability of the whole output, the AR model must do some form of lookahead internally.

(That said, RLHF seems to break this product-of-probabilities property pretty badly, so maybe it will be the case that diffusion will suffer less intelligence loss ::shrugs::.)
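
(In symbols, a sketch of the point about X: conditioning on the tokens that come after it can collapse its entropy, which is exactly the information an AR model only has access to implicitly via its own lookahead:)

    H\!\left(X \mid \text{"Four"}\right) \;\gg\; H\!\left(X \mid \text{"Four"},\ \text{"and seven years ago"}\right) \approx 0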

ttctciyf · 20h ago
> in the second case we both know the only likely completion.

You two may, but I don't. 'Decades'? 'Months'? 'Wives'? 'Jobs'? 'Conservative PMs'?

orbital-decay · 1d ago
Diffusion models are built around this type of internal lookahead from the start (accurate near prediction, progressively less accurate far prediction, step forward, repeat). They just do it in the coarse-to-fine direction, i.e. in a different dimension, and had more thought put into shortcuts and speed-accuracy tradeoffs in this process. RL is also used with both types of models. It's not immediately obvious that one must necessarily be more efficient.
byearthithatius · 1d ago
Both are distributions conditional on the context in which they were requested, so like you said in the second paragraph, the difference is not significant. I see what you mean though, and maybe there are use cases where diffusion is preferable. To me it seems the context conditional and internal model are sufficient, so this problem doesn't really occur.
nullc · 1d ago
::nods:: in the case of diffusion though "conditional on its own (eventual) output" is more transparent and explicit.

As an example of one place that might make a difference is that some external syntax restriction in the sampler is going to enforce the next character after a space is "{".

Your normal AR LLM doesn't know about this restriction and may pick the tokens leading up to the "{" in a way which is regrettable given that there is going to be a {. The diffusion model, OTOH, can avoid that error.

In the case where there isn't an artificial constraint on the sampler, this doesn't come up, because when it's outputting the earlier tokens the AR model knows in some sense about its own probability of outputting a { later on.

But in practice pretty much everyone engages in some amount of sampler twiddling, even if just cutting off low probability tokens.

As far as the internal model being sufficient, clearly it is or AR LLMs could hardly produce coherent English. But although it's sufficient it may not be particularly training or weight efficient.

I don't really know how these diffusion text models are trained so I can't really speculate, but it does seem to me that getting to make multiple passes might allow it less circuit depth. I think of it in terms of: every AR step must expend effort predicting something about the next few steps in order to output something sensible here, and this has to be done over and over again, even though it doesn't change.

nullc · 1d ago
Totally separate from this line of discussion is that if you want to use an LLM for, say, copyediting it's pretty obvious to me how a diffusion model could get much better results.

Like if you take your existing document and measure the probability of each actual word vs an AR model's output, various words are going to show up as erroneously improbable even when the following text makes them obvious. A diffusion model should just be able to score the entire text conditioned on the entire text, rather than just the text in front of it.

jph00 · 1d ago
The linked page only compares to very old and very small models. But the pricing is higher even than the latest Gemini Flash 2.5 model, which performs far better than anything they compare to.
freeqaz · 1d ago
Their pockets are probably not as deep as Google's in terms of willingness to burn cash for market share.

If speed is your most important metric, I could still see there being a niche for this.

From a pure VC perspective though, I wonder if they'd be better off Open Sourcing their model to get faster innovation + centralization like Llama has done. (Or Mistral with keeping some models private, some public.)

Use it as marketing, get your name out there, and have people use your API when they realize they don't want to deal with scaling AI compute themselves lol

vineyardmike · 18h ago
> The linked page only compares to very old and very small models.

They're comparing against the fastest models. That's why smaller models are shown.

jbellis · 1d ago
Sort of. The benchmarks showing Flash 2.5 doing really well are benchmarking its thinking mode, which is 4x more expensive than Mercury here
NitpickLawyer · 23h ago
Is cost really the main differentiator here, tho? "Solving" coding seems like the holy grail atm (and I agree, it can enable a bunch of things once that's done) and "traditional, organic, human fed code" is pretty expensive atm, so does cost really matter now?

Put another way, how much would company x be willing to spend on "here's a repo, here are the tests, here is the speed now, make this faster while still passing all the tests". If it "solves" something in cudnn that makes it 10% faster, how much would nvidia pay for this? 1m$? 10m$?

jph00 · 21h ago
Flash 2.5 without thinking mode is also exceptionally good fwiw.
echelon · 1d ago
There are so many models. Every single day half a dozen new models land. And even more papers.

It feels like models are becoming fungible apart from the hyperscaler frontier models from OpenAI, Google, Anthropic, et al.

I suppose VCs won't be funding many more "labs"-type companies or "we have a model" as the core value prop companies? Unless it has a tight application loop or is truly unique?

Disregarding the team composition, research background, and specific problem domain - if you were starting an AI company today, what part of the stack would you focus on? Foundation models, AI/ML infra, tooling, application layer, ...?

Where does the value accrue? What are the most important problems to work on?

vessenes · 21h ago
Word on the street is a lot of money is going into vertical application AI companies this season. Makes sense - the bitter lesson means capturing a market and proprietary data is a good play, while frontier models keep getting better at using what you (and only you) own.
good-luck86523 · 16h ago
Everyone will just switch to LibreOffice and Hetzner.

High tech US service industry exports are cooked.

stats111 · 1d ago
Can't use the Mercury name Sir. It's a bank!
kittikitti · 1d ago
This is genius! There are tradeoffs between diffusion and autoregressive models in image generation, so why not use diffusion models in text generation? Excited to see where this ends up, and I wouldn't be surprised if we saw these types of models appear in future updates to popular families like Llama or Qwen.
moralestapia · 1d ago
>Mercury is up to 10x faster than frontier speed-optimized LLMs. Our models run at over 1000 tokens/sec on NVIDIA H100s, a speed previously possible only using custom chips.

This means on custom chips (Cerebras, Graphcore, etc...) we might see 10k-100k tokens/sec? Amazing stuff!

Also of note, funny how text generation started w/ autoregression/tokens and diffusion seems to perform better, while image generation went the opposite way.

gitroom · 19h ago
This convo has me rethinking how much speed actually matters vs just getting stuff right - do you think most problems are just about better habits or purely tooling upgrades at this point?
mackepacke · 1d ago
Nice
marcyb5st · 1d ago
Super happy to see something like this getting traction. As someone who is trying to reduce my carbon footprint, I sometimes feel bad about asking any model to do something trivial. With something like this, perhaps the guilt will lessen.
whall6 · 1d ago
If you live in the U.S., marginal electricity demand during the day is almost invariably met with solar or wind (solar typically runs at a huge surplus on sunny days). Go forth and AI in peace, marcyb5st.
marcyb5st · 1d ago
Thanks! That helps somewhat. However, it feels like that's just part of the story.

If I remember correctly, hyperscalers put their green agendas in stasis now that LLMs are around, and that makes me believe there is a CO2 cost associated.

Still, any improvement is a good news and if diffusion models replace autoregressive models we can invest that surplus in energy in something else useful for the environment.

kuhewa · 1d ago
This made me wonder - do any cloud compute systems have an option to time jobs or use physical resources geographically based on surplus power availability to minimise emissions?

I reckon it might incidentally happen if optimising for the cost of power, depending on how correlated that is with the carbon intensity of power generation, which admittedly I haven't thought through.

mmoskal · 1d ago
To put this into perspective, driving for an hour in an electric car (15 kW avg consumption) consumes about as much energy as 50,000 ChatGPT queries [0]. Running your laptop for an hour would be around 100 queries.

[0] https://epoch.ai/gradient-updates/how-much-energy-does-chatg...
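
(The arithmetic behind those ratios, assuming the roughly 0.3 Wh per query estimate from the linked post and a ~30 W laptop; both figures are assumptions about what the comment is relying on:)

    15\,\mathrm{kW} \times 1\,\mathrm{h} = 15{,}000\,\mathrm{Wh};\quad 15{,}000\,\mathrm{Wh} \,/\, 0.3\,\mathrm{Wh\ per\ query} \approx 50{,}000\ \text{queries}
    30\,\mathrm{W} \times 1\,\mathrm{h} = 30\,\mathrm{Wh};\quad 30 \,/\, 0.3 = 100\ \text{queries}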

ris · 1d ago
Please see yesterday's https://simonwillison.net/2025/Apr/29/chatgpt-is-not-bad-for... instead of propagating the hand-wringing.