Defeating Nondeterminism in LLM Inference

118 points · jxmorris12 · 39 comments · 9/10/2025, 5:26:08 PM · thinkingmachines.ai

Comments (39)

riazrizvi · 16m ago

Natural language is ambiguous. It needs to be. I think the approach here of trying to figure out how to make circles into squares, and argue why circles should be squares, is misguided.

Discussions of this type are going to eventually morph into better understanding of how to accept ambiguity and randomness in language, and further shape it with other larger sub-patterns beyond the little proto-grammars that the QKV projection matrices extract.

lsy · 2h ago

Fixing "theoretical" nondeterminism for a totally closed individual input-output pair doesn't solve the two "practical" nondeterminism problems, where the exact same input gives different results given different preceding context, and where a slightly transformed input doesn't give a correctly transformed result.

Until those are addressed, fixing closed-system nondeterminism doesn't really help except in cases where a lookup table would do just as well. You can't use "correct" unit tests or evaluation sets to prove anything about inputs you haven't tested.

kazinator · 34m ago

There is no such thing as "exactly the same input, but with different preceding context". The preceding context is input!

If you were to obtain exactly the same output for a given input prompt, regardless of context, then that would mean that the context is being ignored, which is indistinguishable from the session not maintaining any context such that each prompt is in a brand new empty context.

Now what some people want is requirements like:

- The different wording of a prompt with exactly the same meaning should not change anything in the output; e.g. whether you say "What is the capital of France" or "What is France's capital" the answer should be verbatim identical.

- Prior context should not change responses in ways that don't have any interaction with the context. For instance, if the prompt is "what is 2 + 2", the answer should always be the same, unless the context instructs the LLM that 2 + 2 is to be five.

These kinds of requirements betray a misunderstanding of what these LLMs are.

saagarjha · 2h ago

This is really useful in reproducing bugs.

jll29 · 2h ago

Sometimes, the reason for non-determinism is implementation-specific. For instance, in GPT-2's source code (I haven't checked other model versions), setting the temperature in the GUI does not lead to a value of 0 but "epsilon" (a very small value larger than 0), to avoid a division by zero error in the code, which makes sense.
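To make the epsilon idea concrete, here is a minimal sketch. It is not GPT-2's actual code: the function name `softmax_with_temperature` and the `EPSILON` value are assumptions chosen for illustration. Clamping the temperature keeps the division well defined, and at such a tiny temperature the distribution collapses onto the highest logit.

```python
import math

EPSILON = 1e-6  # hypothetical floor; the real value is implementation-specific


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / T), clamping T so that T == 0 never divides by zero."""
    t = max(temperature, EPSILON)
    peak = max(logits)
    # Subtracting the peak keeps every exponent <= 0, so exp() cannot overflow.
    weights = [math.exp((x - peak) / t) for x in logits]
    total = math.fsum(weights)
    return [w / total for w in weights]


probs = softmax_with_temperature([2.0, 1.0, 0.5], temperature=0.0)
print(probs)  # nearly all of the mass lands on index 0
```

In this sketch a requested temperature of 0 behaves like greedy decoding, but it still goes through the sampling path rather than a plain argmax.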

For many applications, non-determinism implies "useless". This has been a long standing issue with LDA topic models. In particular in the legal, financial and regulatory domains, if a method is not deterministic, it may be illegal to use it or it may lead to follow-on requirements that one does not want (e.g. all screens shown to humans must be preserved to be able to go back and reconstruct what exactly happened to a particular user in a particular second).

jasonjmcghee · 1h ago

I love high quality blog post style research discussion - Anthropic has been leading the charge with this recently and it's great to see it spreading. OpenAI was also doing this during all the RL research days.

paulbjensen · 17m ago

It reminded me of this wonderful talk by the late Joe Armstrong (Erlang's creator): https://www.youtube.com/watch?v=lKXe3HUG2l4

Great post.

mg · 2h ago

I really hope we will get deterministic LLMs in the future. Even if it causes slightly slower response times.

Nondeterminism is what currently keeps me from working with other developers.

As I wrote in "Prompt Coding" [1], these days I am not looking for good code. I am looking for prompts that create good code. But how do you share prompts among developers when they produce different code every time? You cannot simply state "Here, I found a prompt that makes gpt-5-2025-08-07 output a solution with all the desired attributes".

Similar with images. At the moment, for most image models, you cannot outsource the task of writing prompts that create the desired images. Because most image models will not create the same image when given the same prompt and parameters.

[1]: https://www.gibney.org/prompt_coding

p1necone · 34m ago

Surely if you end up relying on a given prompt to produce the exact same code every time you should instead just check that code into source control the first time you generate it?

A deterministic LLM isn't going to behave appreciably differently from a non deterministic one if your input or context varies by even a tiny bit (pun intended) each time.

khimaros · 1h ago

i tried to create a makefile driven workflow based on this idea and ended up with https://github.com/khimaros/enc -- it suffers from the issues you raised

i'm hoping that it becomes more useful as models improve and become more reliable at producing working code (though determinism would be great for improving prompts).

eldenring · 1h ago

Very impressive! I guess this still wouldn't affect their original example

> For example, you might observe that asking ChatGPT the same question multiple times provides different results.

even with 0.0 temperature due to MOE models routing at a batch level, and you're very unlikely to get a deterministic batch.

> Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.

The router also leaks batch-level information across sequences.

boroboro4 · 1h ago

> even with 0.0 temperature due to MOE models routing at a batch level, and you're very unlikely to get a deterministic batch.

I don’t think this is correct - MoE routing happens on a per-token basis. It can be non-deterministic and batch-related if you try to balance out your experts' load within a batch, but that’s a performance optimization (just like everything in the blog post) and not the way models are trained to work.

eldenring · 47m ago

Ah interesting, good point. So I guess expert-choice routing leaks across the batch. Now I'm not sure.
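The distinction in this sub-thread can be sketched with a toy router. Everything here is an assumption for illustration: the scores are made up, and `token_choice` and `expert_choice` are hypothetical helpers, not any model's real routing code. With token-choice routing a token's experts depend only on its own scores, while with expert-choice routing (each expert picks its favourite tokens from the batch) the same token can land on a different expert depending on its batch neighbours.

```python
def token_choice(scores, k=1):
    """Each token independently keeps its k highest-scoring experts."""
    return [
        sorted(range(len(row)), key=lambda e: -row[e])[:k]
        for row in scores
    ]


def expert_choice(scores, capacity=1):
    """Each expert keeps the `capacity` tokens in the batch that score highest for it."""
    n_tokens, n_experts = len(scores), len(scores[0])
    assigned = [[] for _ in range(n_tokens)]
    for e in range(n_experts):
        ranked = sorted(range(n_tokens), key=lambda t: -scores[t][e])
        for t in ranked[:capacity]:
            assigned[t].append(e)
    return assigned


token_a = [0.6, 0.4]                 # hypothetical router scores for experts 0 and 1
batch_1 = [token_a, [0.9, 0.1]]      # same token, different neighbour
batch_2 = [token_a, [0.2, 0.8]]

print(token_choice(batch_1)[0], token_choice(batch_2)[0])    # [0] [0]
print(expert_choice(batch_1)[0], expert_choice(batch_2)[0])  # [1] [0]
```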

measurablefunc · 3h ago

I think this means that the results might also be non-deterministic across hardware revisions, because I don't think they verified that the kernels will work the same on different GPU & TPU versions. How do they know that the compiler will not re-order the operations behind their back?

saagarjha · 2h ago

Yes, there’s usually no guarantee on how different hardware does operations (for example, even if the hardware is correctly rounding intermediate results, different hardware may use different tile sizes). The reproducibility here is for runs on the same machine.

Compilers can also reorder operations, but in practice this is rarely an issue because kernels typically synchronize frequently, which limits the compiler's ability to reorder things. This isn’t to say it doesn’t happen, but when it does it’s most likely because the compiler itself changed, since the code a given compiler generates is generally run-to-run identical.

AlotOfReading · 2h ago

You can prevent reordering with sufficient amounts of compiler abuse.

With revisions, you're trying to ensure a consistent floating point environment where the operations used are deterministic, and used in the same order with the same inputs. The best way to do that is to use operations that adhere to a mostly deterministic standard like IEEE-754.

reliabilityguy · 2h ago

> will not re-order the operations behind their back?

Valid point. Floating point summation is not always associative.
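A quick illustration of why reordering matters, using ordinary Python floats (IEEE-754 doubles): swapping the two operands of a single addition gives the same result, but regrouping a chain of additions changes which intermediate value gets rounded, so the final result can differ.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # add a and b first
right = a + (b + c)  # add b and c first

print(left == right)   # False: regrouping changed the rounded result
print(a + b == b + a)  # True: swapping the operands of one addition is harmless
```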

TimorousBestie · 3h ago

Ensuring the same floating-point algorithm workload behaves exactly the same on two distinct workstations is a heck of a lot of work that almost no one is willing to pay for.

measurablefunc · 2h ago

Not only that, but heterogeneous clusters (inevitable at a large enough scale) will also have non-deterministic outputs. So it's great that they wrote kernels to make the forward pass deterministic, but getting rid of nondeterminism entirely at data center scale would mean that they'd also have to do this type of work across cluster nodes to maintain "cluster" invariance & not just batch invariance.

syntaxing · 1h ago

Super interesting. For those unaware, this is the company that Mira Murati (OpenAI's former CTO) started.

htrp · 36m ago

Do we know what Thinking Machines does yet?

threeducks · 46m ago

It should also be noted that PyTorch has a page about reproducibility: https://docs.pytorch.org/docs/stable/notes/randomness.html

TL;DR

Seed your PRNGs and call torch.use_deterministic_algorithms(True) to get the deterministic kernels. They may be slightly slower, but in practice, you probably will not notice.

Note that results will still differ between different drivers and GPUs. It would be great if NVIDIA tried harder in that regard.

sudohalt · 36m ago

cool project but if this is what you are producing with $2 billion funding, i doubt you will survive. This is the type of article a grad student would write over a weekend.

lrvick · 2h ago

Job one is to have every bit of software involved also be deterministic, which stagex takes care of.

I had no problem getting deterministic LLM outputs when I experimented with this 6 months ago.

Run two of these with the same prompts and same seed and you get the same results.

Obviously in GPU clusters with different hardware things get more complicated.

https://git.distrust.co/public/llmshell

spindump8930 · 2h ago

That's not what this is about.

"I had no problem getting deterministic LLM outputs when I experimented with this 6 months ago" looks like you're using llama-cpp in that repo. This is about vllm serving many requests at once, at long sequence lengths.

> As it turns out, our request’s output does depend on the parallel user requests. Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.

Your situation isn't really comparable.
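A toy sketch of what a lack of batch invariance can look like numerically. This is not the article's kernels: `chunked_sum` is a hypothetical stand-in for a reduction whose split size changes with the batch, and the data is chosen to make the effect obvious. The values are identical in both calls; only the way the reduction is split differs.

```python
def naive_sum(values):
    """Left-to-right float accumulation, with no compensation."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunk_size):
    """Sum each chunk separately, then sum the partial results."""
    partials = [
        naive_sum(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return naive_sum(partials)


data = [1.0, 1e17, -1e17, 1.0]
print(chunked_sum(data, 4))  # 1.0 (one chunk)
print(chunked_sum(data, 2))  # 0.0 (two chunks, same data)
```

Neither result is the exact answer (2.0). The point is only that the split strategy, which a server may choose based on the current load, changes the bits that come out.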

saagarjha · 2h ago

What’s stagex?

cubefox · 2h ago

His solution still relies on greedy (temperature 0) sampling, which is probably not optimal for model performance on various tasks. For example, Gemini 2.5 uses temperature 1 by default. But deterministic inference with temperature >0 can still be achieved by using pseudorandom sampling with a fixed seed.

red2awn · 1h ago

Conceptually setting temperature to be >0 doesn't actually introduce any non-determinism. If your sampler is seeded then it will always choose the same next token. Higher temperature only flattens the logit distribution.
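A small sketch of both claims, using only the standard library. The logits, the seed and the helper names `softmax` and `sample` are illustrative assumptions. A higher temperature lowers the top probability (a flatter distribution), and a seeded sampler returns the same token on every call.

```python
import math
import random


def softmax(logits, temperature):
    """Temperature-scaled softmax; temperature must be > 0."""
    peak = max(logits)
    weights = [math.exp((x - peak) / temperature) for x in logits]
    total = math.fsum(weights)
    return [w / total for w in weights]


def sample(logits, temperature, seed):
    """Draw one token index from the softmax using a private, seeded generator."""
    rng = random.Random(seed)
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [3.0, 1.0, 0.2]
print(max(softmax(logits, 0.5)), max(softmax(logits, 2.0)))  # top probability shrinks as T grows
print(sample(logits, 1.0, seed=1234) == sample(logits, 1.0, seed=1234))  # True
```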

mynameismon · 1h ago

The point of the blog is that even with supposedly deterministic sampling, non-determinism creeps in. This in turn has disastrous effects in very real experiments.

cubefox · 1h ago

My point is that greedy sampling is not just not sufficient but also not necessary for deterministic inference.

TNDnow · 2h ago

Who needs a working product when you can spend all day designing the most WEWORK looking website and slap some pseud slop on it. It's like crypto "startups" but it's not even fun.

nowittyusername · 1h ago

I am baffled that I still run into these statements years after LLMs have been around. LLMs are deterministic and always have been. The reason people are having issues with them is that they are basing their assumptions on API-based experiments. Like my man, how can you be making these statements when you haven't done the due diligence of running the LLM on your own hardware with all of the variables locked down and accounted for? If you do just that, it becomes clear that they are deterministic, and most of the time the reason you see non-deterministic behavior is that you have not controlled for a variable, usually prompt caching, batch processing or some other obvious variable. Now, this is about deterministic behavior within the same system. You might get different answers when running on a different GPU, but at least on the same system the behavior is 100% identical if you account for all server startup flags and properly account for things like prompt caching, slot contamination, etc.

Voloskaya · 1h ago

I suggest you look up the name of the main author of TFA before assuming they don’t know what they are talking about.

This is literally one of the most knowledgeable people on the topic. I think you are the one who hasn’t peeled enough layers to connect with what they are saying.

golol · 1h ago

Hold on a second. A transformer produces deterministically a probability distribution over the token alphabet from the context. Then one samples from this distribution. This is random and meant to be random.

nowittyusername · 37m ago

The sampling process isn't random. If you sample with identical sampling parameters and identical values for said parameters, you will always get the same results. You only start getting "non deterministic" behavior when you start using more complex systems outside the scope of your control, like multi-GPU systems and batch processing. One LLM sampled with prompt caching off and batch processing off will always generate the same results if all values are the same.

oasisaimlessly · 1h ago

It's possible to deterministically sample from a probability distribution. For example, just seed your RNG with a constant, or with the SHA256 hash of the context.
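One way this could look, as a hedged sketch: the helper names `seed_from_context` and `sample_index` are hypothetical, and taking the first 8 bytes of the digest as the seed is an arbitrary choice. The same context always yields the same draw, with no global seed to manage.

```python
import hashlib
import random


def seed_from_context(context: str) -> int:
    """Derive an integer seed from the SHA-256 hash of the context text."""
    digest = hashlib.sha256(context.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sample_index(context: str, probs):
    """Sample one index from `probs`, with randomness fixed by the context."""
    rng = random.Random(seed_from_context(context))
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.5, 0.3, 0.2]  # made-up next-token distribution
first = sample_index("What is the capital of France?", probs)
second = sample_index("What is the capital of France?", probs)
print(first == second)  # True: same context, same draw
```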

golol · 51m ago

Well yes, you can "hack" the pseudorandom number generator, but... that's not really the point when talking about determinism in LLMs is it? I mean the mathematical idea of the standard LLM is certainly truly random.

tossandthrow · 1h ago

The article literally justifies this in the second paragraph.

nowittyusername · 1h ago

I suppose I have issues with the way "determinism" is used in the title of this article. It can mean different things to different people, and in my mind "Defeating Nondeterminism in LLM Inference" frames it as an actual issue with LLM inference. But it's not; it's an issue with LLM inference at large scale, with more complex parts such as multi-GPU inference systems, batching and other mechanisms. It is not an issue when using an LLM without those more complex parts. Stating it this way muddies the signal and gives a false sense that this is a fundamental issue with the architecture, when it's an issue of systems at scale...
