Just looked in the parts drawer at home and don't seem to have a $25,000 GPU for some inexplicable reason.
Kurtz79 · 1d ago
Does it even make sense calling them 'GPUs' (I just checked NVIDIA product page for the H100 and it is indeed so)?
There should be a quicker way to differentiate between 'consumer-grade hardware that is mainly meant to be used for gaming and can also run LLM inference in a limited way' and 'business-grade hardware whose main purpose is AI training or running inference for LLMs'.
blitzar · 1d ago
We are fast approaching the return of the math coprocessor. In fashion they say that trends tend to reappear roughly every two decades; it's overdue.
egorfine · 1d ago
Yeah I would love for Nvidia to introduce a faster update cycle to their hardware, so that we'll have models like "H201", "H220", etc.
I think it will also make sense to replace "H" with a brand number, sort of like they already do for consumer GPUs.
So then maybe one day we'll have a math coprocessor called "Nvidia 80287".
beAbU · 1d ago
I remember building high-end workstations for a summer job in the 2000s, where I had to fit Tesla cards in the machines. I don't remember what their device names were, we just called them tesla cards.
"Accelerator card" makes a lot of sense to me.
WithinReason · 1d ago
It's called a tensor core and it's in most GPUs
genewitch · 1d ago
"GPGPU" was something from over a decade ago; for general purpose GPU computing
hnuser123456 · 1d ago
Yeah, Crysis came out in 2007 and could run physics on the GPU.
AlphaSite · 1d ago
I think apple calls them NPUs and Broadcom calls them XPUs. Given they’re basically the number 2 and 3 accelerator manufacturers one of those probably works.
washadjeffmad · 1d ago
I just specify SXM (node) when I want to differentiate from PCIe. We have H100s in both.
codedokode · 1d ago
By the way I wonder, what has more performance, a $25 000 professional GPU or a bunch of cheaper consumer GPUs costing $25 000 in total?
omneity · 1d ago
Consumer GPUs in theory, and by a large margin (10x 5090s will eat an H100's lunch, with 6 times the bandwidth, 3x the VRAM and a relatively similar compute ratio), but your bottleneck is the interconnect, and that is intentionally crippled to avoid Beowulf GPU clusters eating into their datacenter market.
Last consumer GPU with NVLink was the RTX 3090. Even the workstation-grade GPUs lost it.
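https://forums.developer.nvidia.com/t/rtx-a6000-ada-no-more-...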
H100s also have custom async WGMMA instructions, among other things. From what I understand, the async instructions at least formalize the notion of pipelining, which engineers were already using implicitly, because optimizing memory accesses effectively means overlapping them in that kind of parallel manner.
addandsubtract · 1d ago
We could call the consumer ones GFX cards, and keep GPU for the matrix multiplying ones.
beAbU · 1d ago
GPU stands for "graphics processing unit" so I'm not sure how your suggestion solves it.
Maybe renaming the device to an MPU, where the M stands for "matrix/math/mips" would make it more semantically correct?
rebolek · 1d ago
I think that G was changed to "general", so now it's "general processing unit".
rpdillon · 1d ago
This doesn't seem to be true at all. It's a highly specialized chip for doing highly parallel operations. There's nothing general about it.
I looked around briefly and could find no evidence that it's been renamed. Do you have a source?
fouc · 1d ago
CPU is already the general (computing) processing unit so that wouldn't make sense
amelius · 1d ago
Well, does it come with graphics connectors?
OliverGuy · 1d ago
Nope, doesn't have any of the required hardware to even process graphics iirc
diggan · 1d ago
Although the RTX Pro 6000 is not consumer-grade, it does come with graphics ports (four DisplayPorts) and does render graphics like a consumer card :) So it seems the difference between the segments is becoming smaller, not bigger.
simpleintheory · 1d ago
That’s because it’s intended as a workstation GPU not one used in servers
diggan · 1d ago
Sure, but it still sits in the 'business-grade hardware whose main purpose is AI training or running inference for LLMs' segment the parent mentioned, yet it has graphics connectors, so all I'm saying is that looking at that alone won't tell you which segment a GPU goes into.
namibj · 12h ago
I'd like to point at the first-revision AMD MI50/MI60 cards, which were at the time the most powerful GPUs on the market, at least by memory bandwidth.
I'd define a GPU as "can output a contemporary display connector signal and is more than just a RAMDAC/framebuffer-to-cable translator, starting with even just some 2D blitting acceleration".
dougSF70 · 1d ago
With Ollama I got the 20B model running on 8 TitanX cards (2015). Ollama distributed the model so that the 15GB of VRAM required was split evenly across the 8 cards. The tok/s were faster than reading speed.
Aurornis · 1d ago
For the price of 8 decade old Titan X cards, someone could pick up a single modern GPU with 16GB or more of RAM.
Aurornis · 1d ago
They’re widely available to rent.
Unless you’re running it 24/7 for multiple years, it’s not going to be cost effective to buy the GPU instead of renting a hosted one.
For personal use you wouldn’t get a recent generation data center card anyway. You’d get something like a Mac Studio or Strix Halo and deal with the slower speed.
varispeed · 1d ago
I rented H100 for training a couple of times and I found that they couldn't do training at all. Same code worked fine on Mac M1 or RTX 5080, but on H100 I was getting completely different results.
So I wonder what I could be doing wrong. In the end I just use RTX 5080 as my models fit neatly in the available RAM.
* by not working at all, I mean the scripts worked, but results were wrong. As if H100 couldn't do maths properly.
philipkiely · 1d ago
This comment made my day ty! Yeah definitely speaking from a datacenter perspective -- fastest piece of hardware I have in the parts drawer is probably my old iPhone 8.
vonneumannstan · 1d ago
>Just looked in the parts drawer at home and don't seem to have a $25,000 GPU for some inexplicable reason.
It just means you CAN buy one if you want, as in they're in stock and "available", not that you can necessarily afford one.
lopuhin · 1d ago
you can rent them for less than $2/h in a lot of places (maybe not in the drawer)
KolmogorovComp · 1d ago
available != cheap
blitzar · 1d ago
available
/əˈveɪləbl/
adjective: available
able to be used or obtained; at someone's disposal
swexbe · 1d ago
You can rent one from most cloud providers for a few bucks an hour.
koakuma-chan · 1d ago
Might as well just use openai api
ekianjo · 1d ago
that's not the same thing at all
poly2it · 1d ago
That depends on your intentions.
blueboo · 1d ago
You might find $2.50 in change to use one for an hour though
wcallahan · 1d ago
I just used GPT-OSS-120B on a cross Atlantic flight on my MacBook Pro (M4, 128GB RAM).
A few things I noticed:
- it’s only fast with small context windows and small total token context; once past ~10k tokens you’re basically queueing everything for a long time
- MCPs/web search/url fetch have already become a very important part of interacting with LLMs; when they’re not available the LLM utility is greatly diminished
- a lot of CLI/TUI coding tools (e.g., opencode) were not working reliably offline at this time with the model, despite being set up prior to going offline
That’s in addition to the other quirks others have noted with the OSS models.
XCSme · 1d ago
I know there was a downloadable version of Wikipedia (not that large). Maybe soon we'll have a lot of data stored locally and expose it via MCP, then the AIs can do "web search" locally.
I think 99% of web searches lead to the same 100-1k websites. I assume it's only a few GBs to have a copy of those locally, though this raises copyright concerns.
Aurornis · 1d ago
The mostly static knowledge content from sites like Wikipedia is already well represented in LLMs.
LLMs call out to external websites when something isn’t commonly represented in training data, like specific project documentation or news events.
XCSme · 1d ago
That's true, but the data is only approximately represented in the weights.
Maybe it's better to have the AI only "reason", and somehow instantly access precise data.
adsharma · 1d ago
What use cases will gain from this architecture?
XCSme · 17h ago
Data processing, tool calling, agentic use. Those are also the main use-cases outside "chatting".
Even though LM Studio uses llama.cpp as a runtime, the performance differs between them. With LM Studio 0.3.22 Build 2 with the CUDA llama.cpp (Linux) v1.45.0 runtime I get ~86 tok/s on an RTX Pro 6000, while with llama.cpp compiled from 1d72c841888 (Aug 7 10:53:21 2025) I get ~180 tok/s, almost 100 tokens per second more, both running lmstudio-community/gpt-oss-120b-GGUF.
esafak · 1d ago
Is it always like this or does it depend on the model?
diggan · 1d ago
Depends on the model. Each runner needs to implement support when there are new architectures, and they all seemingly focus on different things. As far as I've gathered so far, vLLM focuses on inference speed, SGLang on parallelizing across multiple GPUs, Ollama on being as fast out the door with their implementation as possible, sometimes cutting corners, and llama.cpp sits somewhere in-between Ollama and vLLM. Then LM Studio seems to lag slightly behind with their llama.cpp usage, so I'm guessing that's the difference between LM Studio and building llama.cpp from source today.
fouc · 1d ago
What was your iogpu.wired_limit_mb set to? By default only ~70% or ~90GB of your RAM will be available to your GPU cores unless you change your wired limit setting.
mich5632 · 1d ago
I think this is the difference between compute-bound prefill (a CPU has a high bandwidth/compute ratio) and decode.
The time to first token is below 0.5s - even for a 10k context.
MoonObserver · 1d ago
M2 Max processor.
I saw 60+ tok/s on short conversations, but it degraded to 30 tok/s as the conversation got longer.
Do you know what actually accounts for this slowdown? I don’t believe it was thermal throttling.
summarity · 1d ago
Physics: You always have the same memory bandwidth. The longer the context, the more bits will need to pass through the same pipe. Context is cumulative.
VierScar · 1d ago
No I don't think it's the bits. I would say it's the computation. Inference requires performing a lot of matmul, and with more tokens the number of computation operations increases exponentially - O(n^2) at least. So increasing your context/conversation will quickly degrade performance
I seriously doubt it's the throughput of memory during inference that's the bottleneck here.
MereInterest · 1d ago
Nitpick: O(n^2) is quadratic, not exponential. For it to “increase exponentially”, n would need to be in the exponent, such as O(2^n).
esafak · 1d ago
To contrast with exponential, the term is power law.
zozbot234 · 1d ago
Typically, the token generation phase is memory-bound for LLM inference in general, and this becomes especially clear as context length increases (since the model's parameters are a fixed quantity.) If it was pure compute bound there would be huge gains to be had by shifting some of the load to the NPU (ANE) but AIUI it's just not so.
summarity · 1d ago
It literally is. LLM inference is almost entirely memory bound. In fact for naive inference (no batching), you can calculate the token throughput just based on the model size, context size and memory bandwidth.
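A rough back-of-the-envelope sketch of that calculation (the numbers here are illustrative, not measured):

    # Naive single-stream decode: every new token re-reads all active weights
    # (plus the growing KV cache), so memory bandwidth sets the ceiling.
    def max_tokens_per_sec(active_param_bytes, kv_cache_bytes, mem_bw_bytes_per_s):
        bytes_read_per_token = active_param_bytes + kv_cache_bytes
        return mem_bw_bytes_per_s / bytes_read_per_token

    # e.g. ~5B active params at 4 bits (~2.5 GB read per token) on ~500 GB/s,
    # ignoring the KV cache: an upper bound of ~200 tok/s, less in practice,
    # and it only drops as the KV cache grows with context.
    print(max_tokens_per_sec(2.5e9, 0, 500e9))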
zozbot234 · 1d ago
Prompt pre-processing (before the first token is output) is raw compute-bound. That's why it would be nice if we could direct llama.cpp/ollama to run that phase only on iGPU/NPU (for systems without a separate dGPU, obviously) and shift the whole thing over to CPU inference for the latter token-generation phase.
(A memory-bound workload like token gen wouldn't usually run into the CPU's thermal or power limits, so there would be little or no gain from offloading work to the iGPU/NPU in that phase.)
torginus · 1d ago
Inference takes quadratic amount of time wrt context size
gigatexal · 1d ago
M3 Max 128GB here and it’s mad impressive.
I'm spec'ing out a Mac Studio with 512GB of RAM because I can window shop and wish, but I think the trend for local LLMs is getting really good.
Do we know WHY openAI even released them?
diggan · 1d ago
> Do we know WHY openAI even released them?
Regulations, and trying to earn the goodwill of developers using local LLMs, something that has been slowly eroding since it's been a while (GPT-2, 2019) since they last released weights to the public.
Epa095 · 1d ago
If the new gpt 5 is actually better, then this oss version is not really a threat to Openai's income stream, but it can be a threat to their competitors.
lavezzi · 23h ago
> Do we know WHY openAI even released them?
Enterprises can now deploy them on AWS and GCP.
zackify · 1d ago
You didn’t even mention how it’ll be on fire unless you use low power mode.
Yes all this has been known since the M4 came out. The memory bandwidth is too low.
Try using it with real tasks like Cline or opencode; the context gets long and it's too slow to be practical
Aurornis · 1d ago
> Yes all this has been known since the M4 came out. The memory bandwidth is too low.
The M4 Max with 128GB of RAM (the part used in the comment) has over 500GB/sec of memory bandwidth.
zackify · 23h ago
Which is incredibly slow when you’re over 20k context
radarsat1 · 1d ago
How long did your battery last?!
woleium · 1d ago
Planes have power sockets now, but I do wonder how much jet fuel a whole plane of GPUs would consume in electricity (assuming the system can handle it, which seems unlikely) and air conditioning.
TimBurman · 1d ago
That's an interesting question. According to Rich and Greg's Airplane Page[1], the A320 has three generators rated for 90kVA continuous each, one per engine and a third in the auxiliary power unit that isn't normally deployed. Cruising demand is around 140 kVA of the 180 kVA supplied by the engines, leaving 40 kVA to spare. The A380 has six similar generators, two in reserve. They give the percentages so you could calculate how much fuel each system is consuming.
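[1] https://alverstokeaviation.blogspot.com/2016/03/
This page also has a rendered image of the generator:
https://aviation.stackexchange.com/questions/43490/how-much-...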
> Inspired by GPUs, we parallelized this effort across multiple engineers. One engineer tried vLLM, another SGLang, and a third worked on TensorRT-LLM. We were able to quickly get TensorRT-LLM working, which was fortunate as it is usually the most performant inference framework for LLMs.
> TensorRT-LLM
It is usually the hardest to set up correctly and is often out of date regarding the relevant architectures. It also requires compiling the model on the exact same hardware-drivers-libraries stack as your production environment, which is a great pain in the rear end to say the least. Multimodal setups have also been a disaster - at least for a while - when it was near-impossible to make them work even for mainstream models, like the multimodal Llamas. The big question is whether it's worth it, since running GPT-OSS-120B on an H100 using vLLM is flawless in comparison - and the throughput stays at 130-140 t/s for a single H100. (It's also somewhat of a clickbait title - I was expecting to see 500 t/s for a single GPU, when in fact it's just a tensor-parallel setup)
It's also funny that they went for a separate release of TRT-LLM just to make sure that gpt-oss would work correctly; TRT-LLM is a mess
philipkiely · 1d ago
TRT-LLM has its challenges from a DX perspective and yeah for Multi-modal we still use vLLM pretty often.
But for the kind of traffic we are trying to serve -- high volume and latency sensitive -- it consistently wins head-to-head in our benchmarking and we have invested a ton of dev work in the tooling around it.
sarthaksoni · 1d ago
Reading this made me realize how easy it is to set up GPT-OSS 20B in comparison. I had it running on my Mac in five minutes, thanks to Llama.
DrPhish · 1d ago
It's also easy to do 120b on CPU if you have the resources. I had 120b running on my home LLM CPU inference box in just as long as it took to download the GGUFs, git pull and rebuild llama-server.
I had it running at 40t/s with zero effort and 50t/s with a bit of tweaking.
It's just too bad that even the 120b isn't really worth running compared to the other models that are out there.
It really is amazing what ggerganov and the llama.cpp team have done to democratize LLMs for individuals that can't afford a massive GPU farm worth more than the average annual salary.
wkat4242 · 1d ago
What hardware do you have? 50tk/s is really impressive for cpu.
DrPhish · 1d ago
2xEPYC Genoa w/768GB of DDR5-4800 and an A5000 24GB card.
I built it in January 2024 for about $6k and have thoroughly enjoyed running every new model as it gets released. Some of the best money I’ve ever spent.
testaburger · 1d ago
Which specific model EPYCs? And if it's not too much to ask, which motherboard and power supply? I'm really interested in building something similar
* Gigabyte MZ73-LM1 with two AMD EPYC GENOA 9334 QS 64c/128t
* 24 sticks of M321R4GA3BB6-CQK 32GB DDR5-4800 RDIMM PC5-38400R
* 24GB A5000
Note that the RAM price almost doubled since Jan 2024
fouc · 1d ago
I've seen some mentions of pure-CPU setups being successful for large models, using old EPYC/Xeon workstations off eBay with 40+ CPUs. Interesting approach!
wkat4242 · 1d ago
Wow nice!! That's a really good deal for that much hardware.
How many tokens/s do you get for DeepSeek-R1?
DrPhish · 1d ago
Thanks, it was a bit of a gamble at the time (lots of dodgy ebay parts), but it paid off.
R1 starts at about 10t/s on an empty context but quickly falls off. I'd say the majority of my tokens are generating around 6t/s.
Some of the other big MoE models can be quite a bit faster.
I'm mostly using QwenCoder 480b at Q8 these days for 9t/s average. I've found I get better real-world results out of it than K2, R1 or GLM4.5.
ekianjo · 1d ago
that's an r/localllama user right there
SirMaster · 1d ago
I'm getting 20 tokens/sec on the 120B model with a 5060Ti 16GB and a regular desktop Ryzen 7800x3d with 64GB of DDR5-6000.
wkat4242 · 1d ago
Wow that's not bad. It's strange, for me it is much much slower on a Radeon Pro VII (also 16GB, with a memory bandwidth of 1TB/s!) and a Ryzen 5 5600 with also 64GB. It's basically unworkably slow. Also, I only get 100% CPU when I check ollama ps, the GPU is not being used at all :( It's also counterproductive because the model is just too large for 64GB.
I wonder what makes it work so well on yours! My CPU isn't much slower and my GPU probably faster.
magicalhippo · 1d ago
AMD basically decided they wanted to focus on HPC and data center customers rather than consumers, and so GPGPU driver support for consumer cards has been non-existent or terrible[1].
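[1]: https://github.com/ROCm/ROCm/discussions/3893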
Why is it hard to set up LLMs? You can just ask an LLM to do it for you, no? If this relatively simple task is already too much for LLMs then what good are they?
diggan · 1d ago
In the case of the GPT-OSS models, the worst (most time-consuming) part of supporting them is the new format they've been trained with, "OpenAI harmony". In my own clients I couldn't just replace the model and call it a day, and I'm still working on getting them to work correctly with tool calling...
CraigRood · 1d ago
I was playing with it yesterday and every single session gave me factually incorrect information.
Speed and ease of use is one thing, but it shouldn't be at the cost of accuracy.
OliverGuy · 1d ago
If you are trying to get facts out of an LLM you are using it wrong. If you want a fact, it should use a tool (e.g. web search, RAG, etc.) to get the information that contains the fact (Wikipedia page, documentation, etc.) and then parse that document for the fact and return it to you.
LoganDark · 1d ago
120B is pretty easy to run too, if you have enough memory.
tmshapland · 1d ago
Such a fascinating read. I didn't realize how much massaging needed to be done to get the models to perform well. I just sort of assumed they worked out of the box.
acters · 1d ago
Personally, I think bigger companies should be more proactive and work with some of the popular inference engine devs to get their special snowflake LLM working before it gets released. I guess it is all very much experimental at the end of the day. Those devs are doing God's work for us to use on our budget-friendly hardware choices.
mutkach · 1d ago
This is a good take, actually. GPT-OSS is not much of a snowflake (judging by the model's architecture card at least) but TRT-LLM treats every model like that - there is too much hardcode - which makes it very difficult to just use it out-of-the-box for the hottest SotA thing.
diggan · 1d ago
> GPT-OSS is not much of a snowflake
Yeah, according to the architecture it doesn't seem like a snowflake, but they also decided to invent a new prompting/conversation format (https://github.com/openai/harmony) which definitely makes it a bit of a snowflake today, can't just use what worked a couple of days ago, but everyone needs to add proper support for it.
diggan · 1d ago
This is literally what they did for GPT-OSS, seems there was coordination to support it on day 1 with collaborations with OpenAI
eric-burel · 1d ago
SMEs are starting to want local LLMs and it's a nightmare to figure out what hardware would work for which models. I am asking devs in my hometown to literally visit their installs to figure out combos that work.
CMCDragonkai · 1d ago
Are you installing them onsite?
eric-burel · 1d ago
Some are asking that yeah but I haven't run an install yet, I am documenting the process. This is a last resort, hosting on European cloud is more efficient but some companies don't even want to hear about cloud hosting.
lagrange77 · 1d ago
While you're here..
Do you guys know a website that clearly shows which open-source LLM models run on / fit into a specific GPU (setup)?
The best heuristic I could find for the necessary VRAM is Number of Parameters × (Precision / 8) × 1.2, from here [0].
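[0] https://medium.com/@lmpo/a-guide-to-estimating-vram-for-llms...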
Yeah, we have tried to build calculators before; it just depends on so much.
Your equation is roughly correct, but I tend to multiply by a factor of 2 not 1.2 to allow for highly concurrent traffic.
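As a rough sketch of that heuristic (the overhead factor is the judgment call: ~1.2 for a single user, ~2 for concurrent serving):

    # VRAM estimate: parameters x bytes per parameter x overhead
    def est_vram_gb(params_billions, bits_per_param, overhead=1.2):
        return params_billions * (bits_per_param / 8) * overhead

    print(est_vram_gb(120, 4))        # ~72 GB: a 120B model at 4-bit, light overhead
    print(est_vram_gb(120, 4, 2.0))   # ~120 GB with the 2x factor for concurrent traffic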
diggan · 1d ago
Maybe I'm spoiled by having great internet connection, but I usually download the weights and try to run them via various tools (llama.cpp, LM Studio, vLLM and SGLang typically) and see what works. There seems to be so many variables involved (runners, architectures, implementations, hardware and so on) that none of the calculators I've tried so far been accurate, both in the way that they've over-estimated and under-estimated what I could run.
So in the end, trying to actually run them seems to be the only fool-proof way of knowing for sure :)
lagrange77 · 1d ago
Thanks for your answers!
While it is seemingly hard to calculate, maybe one should just make a database website that tracks specific setups (model, exact variant/quantisation, runner, hardware) where users can report which combinations they got running (or not), along with metrics like tokens/s.
Visitors could then specify their runner and hardware and filter for a list of models that would run on that.
diggan · 1d ago
Yeah, what you're suggesting sounds like it could be more useful than the "generalized calculators" people are currently publishing and using.
reactordev · 1d ago
Hugging Face has this built in if you care to fill out your software and hardware profile here:
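https://huggingface.co/settings/local-apps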
Then on the model pages, it will show you whether you can use it.
diggan · 1d ago
Interesting, never knew about that! I filled out my details, then went to https://huggingface.co/openai/gpt-oss-120b but I'm not sure if I see any difference? Where is it supposed to show if I can run it or not?
reactordev · 1d ago
You’ll see green check next to models you can use on the model card.
For those kind of models, you know if you can run them. :D
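https://huggingface.co/unsloth/gpt-oss-20b-GGUF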
Also most of the times they are split up and, sometimes, you’ll get an indicator on the splits.
It’s still a work in progress to check all hardware and model format compatibility but it’s a great start until GGUF becomes the standard.
eric-burel · 1d ago
"Encourage Open-Source and Open-Weight AI" is the part just after "Ensure that Frontier AI Protects Free Speech and American Values" in America's AI Action Plan. I know this is not rational but OpenAI OSS models kinda give me chills as I am reading the Plan in parallel.
Anyway I like seeing oss model providers talking about hardware, because that's a limiting point for most developers that are not familiar with this layer.
geertj · 1d ago
> Ensure that Frontier AI Protects Free Speech and American Values
I am in the early phases of collecting my thoughts on this topic so bear with me, but is this a bad thing?
AI models will have a world view. I think I prefer them having a western world view, as that has built our modern society and has proven to be most successful in making the lives of people better.
At the very minimum I would want a model to document its world view, and be aligned to it so that it does not try to socially engineer me to surreptitiously change mine.
eric-burel · 1d ago
Yeah I mean you'd want to take a look at the plan to get a bigger picture, it reflects a specific set of values which are not universally shared. This should lead to the development of European models, but it feels inefficient to duplicate the work in each country/region just because open source models are planned to be used as trojan horses for values.
exe34 · 1d ago
> I think I prefer them having a western world view,
What worries me is that the current "western world view" of America is not the same as the western world view we've shared with them since the cold war. The trend is towards the same kind of values and behaviour we see in the Islamic Republic and the Russian Federation. If that sort of "western world view" gets baked into the intelligent infrastructure, it may be very hard to change course in the future. For example dissidence and wrongthink is going to get harder and harder.
AesopAerial · 1d ago
> I think I prefer them having a western world view, as that has built our modern society and has proven to be most successful in making the lives of people better.
Highly debatable, and most people anywhere would probably say the same thing about whatever world view they hold.
petesergeant · 1d ago
> but is this a bad thing?
I think the worry is that there’s no fixed definitions here, so the executive can use this to exert partisan or ideological pressure on model providers.
Every four years the models get RLHF’d to switch between thinking guns are amazing vs thinking guns are terrible.
geertj · 14h ago
> Every four years the models get RLHF’d to switch between thinking guns are amazing vs thinking guns are terrible.
I may be naive, but in this specific case, I am hoping that an AI could lead us to a somewhat objective truth. There seem to be enough data points to draw some conclusions here. For example, most/all countries in Europe have less gun violence than the US, but there are at least two EU countries with high gun ownership (Finland and Austria) that also have low gun violence. The gun ownership issue is so polarized these days, I don't think we can trust most people to make reason-based arguments about it. Maybe an AI could help us synthesize and interpret the data dispassionately.
ben_w · 1d ago
"Western" != "American": I grew up in a country where even the police are not, and do not wish to be, routinely armed.
Even then, there is an important difference between de-facto and de-jure rules. Fun fact: even North Korea has a constitutional guarantee of freedom of speech and the right to vote*. They don't do these things as we would understand any of those words, but they have those things right there in the constitution.
So: does the USA, as it exists today, represent the values you want? Can you honestly say, hand on heart, that Alligator Alcatraz should be a thing your AI has been trained to support? Or that it's fine for Qatar to donate a 747 that becomes part of the library of the current president, not the office of the president, when his term in office comes to an end?
I won't list everything, this isn't the place for that, but even if we wind the clock back a few years, do you (/we) want an AI aligned with a political circus of kayfabe that distracts us from the real political machinations?
Of course, this is still USA-focused.
I'd say that what really made a difference to our quality of life wasn't even the American political system: there were massive improvements to human existence starting with the first industrial revolution in the UK in the 1760s, but the social and political nature of the world back then was so bleak that communism got invented a century later and introduced what were at the time controversial ideas like "women are not property" and "universal free education is good", and the USA's systems changed substantially several times since then (at a minimum the Civil War, the New Deal, and the Civil Rights movement).
The "meta system" that allows change can be considered good, but not uniquely so if you compare this to the Russian Revolution getting rid of the Tsars and 40 years later they were in orbit (and this despite the Holodomor and WW2), and then they threw off these shackles with Glasnost and the fall of the USSR (and note there that in Russia specifically, not all the former soviet countries but specifically Russia, the freedom gained failed to bring material improvements, and the lives of those living through it were, in aggregate, made worse despite that freedom), and similar stories with the Chinese, starting with dangerous incompetence (Four Pests campaign) and now in a position where "which is more powerful, them or the USA?" is a matter of which measure you use rather than it being obvious.
You know what's actually hard to find in all this? The actual dimensions of the arrays in the model GPT-OSS-120B. At least with statically typed languages, you know how big your arrays are at a glance. I'm trying to find it in the GitHub repo[1], and I'm not seeing it.
I'm just trying to figure out how wide the datastream through this is, in particular, the actual data (not the weights) that flow through all of it. The width of the output stream. Just how big is a token at the output, prior to reducing it with "temperature" to a few bytes?
Assume infinitely fast compute in a magic black box, but you have to send the output through gigabit ethernet... what's the maximum number of tokens per second?
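[1] https://github.com/openai/gpt-oss/tree/main/gpt_oss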
What’s the application where you want to stream out the logits for each consecutive token while still sampling each token according to the usual rule? Keep in mind that, if you are doing the usual clever tricks like restricting the next token sampled to something that satisfies a grammar, you need to process the logits and sample them and return a token before running the next round of inference.
mikewarot · 1d ago
I know the actual output of the model is wider than a token.... but I can't find it (the actual width, or number of bytes) in the source. Perhaps it's my very casual familiarity with Python that's limiting me, but I don't see any actual declarations of array sizes anywhere in the code.
I'm just trying to calculate the actual bandwidth required for the full output of the model, not just a token to be handed off to the user.
I need this so I can compute just what bandwidth a fully FPGA (later ASIC) based implementation of the model would result in.
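Edit/Append: I asked GPT-5, and it estimated:
Which sounds about right to me. This yields a maximum of about 500 logits/second on Gigabit ethernet. The actual compute of the model is peanuts compared to just shuffling the data around.
That's 2880 values (so multiply by dtype)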
GPT-OSS will run even faster on Blackwell chips because of its hardware support for fp4.
If anyone is working on training or inference in Rust, I'm currently working on adding fp8 and fp4 support to cudarc[0] and candle[1]. This is being done so I can support these models in our inference engine for Mixlayer[2].
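[0] https://github.com/coreylowman/cudarc/pull/449
[1] https://github.com/huggingface/candle/pull/2989
[2] https://mixlayer.com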
Ah, interesting. As someone with an RTX Pro 6000, is it ready today to be able to run gpt-oss-120b inference, or are there still missing pieces? Both linked PRs seem merged already, so I'm unsure if it's ready to be played around with or not.
magicalhippo · 1d ago
Maybe I'm especially daft this morning but I don't get the point of the speculative decoding.
How does the target model validate the draft tokens without running the inference as normal?
Because if it is doing just that, I don't get the point as you can't trust the draft tokens before they are validated, so you're still stuck waiting for the target model.
cristoperb · 1d ago
My simplified understanding: The target model can validate the draft tokens all at once, in a single forward pass. The output of that forward pass is a list of probabilities for each draft token which are compared to the probabilities produced by the draft model. If the target model's probabilities are the same or greater than the draft model, the tokens are accepted. Worst case none of the draft tokens are accepted and instead the target model selects the single next token as usual.
furyofantares · 1d ago
Not an expert, but here's how I understand it. You know how input tokens are cheaper than output tokens? It's related to that.
Say the model so far has "The capital of France". The small model generates "is Paris.", which let's say is 5 tokens.
You feed the large model "The capital of France is Paris." to validate all 5 of those tokens in a single forward pass.
isoprophlex · 1d ago
but... do you get any validation during the forward pass? the small model could just as well have generated "is Berlin." or whatever. do these models somehow give you a likelihood for the next token when you're prefilling, that you can compare against? if so why not just... use that always?
or is this a scenario where computation is expensive but validation is cheap?
EDIT: thanks, people, for educating me! very insightful :)
sanxiyn · 1d ago
Yes, models give likelihoods you can compare against. No, you can't do that without drafting, because the likelihood of token N+2 depends on token N+1. That is, you get P(is | The capital of France) and P(Berlin | The capital of France is), but for the latter you need to give "is" as input; you can't do P(Berlin | The capital of France _).
Yes, the forward pass does a next token prediction on all input tokens (so we know exactly how many tokens from the small model matched). The expensive thing is not the computation, but the memory bandwidth, as each pass needs to load the model from memory.
If the small model predicts some tokens correctly, you save some passes, at the expense of doing some extra computations when the tokens were not correct.
In any case, each forward pass will give at least one new token.
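A toy sketch of that accept/verify loop, with hypothetical next_token/predict_all helpers and greedy matching (real implementations compare probabilities and resample rather than requiring exact equality):

    def speculative_step(draft_model, target_model, prompt, k=5):
        # 1. The cheap model proposes k tokens autoregressively.
        ctx, draft = list(prompt), []
        for _ in range(k):
            t = draft_model.next_token(ctx)   # hypothetical helper
            draft.append(t)
            ctx.append(t)

        # 2. The expensive model scores prompt + draft in ONE forward pass,
        #    yielding its own next-token prediction at every position.
        preds = target_model.predict_all(prompt + draft)   # hypothetical helper

        # 3. Keep draft tokens until the first disagreement, then take the
        #    target model's token there, so every pass yields at least one token.
        accepted = []
        for i, t in enumerate(draft):
            if t == preds[len(prompt) + i - 1]:
                accepted.append(t)
            else:
                break
        accepted.append(preds[len(prompt) + len(accepted) - 1])
        return accepted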
ahmedfromtunis · 1d ago
But what would happen if the small model's prediction was "is Rome."? Wouldn't that result in costlier inference if the small model is "wrong" more often than it is correct?
Also, if the small model were sufficiently more often "correct" than "wrong", wouldn't it be more efficient to get rid of the large model at this point?
acters · 1d ago
I believe that is exactly the downside of using speculative decoding, which is why it is very important to have the models properly sized relative to each other: the small one must be big enough to be mostly correct while also being substantially faster than the larger one. However, the larger one has to be fast enough that catching flaws won't introduce too many random delays. Also, if the small one is incorrect, then the larger one correcting the mistake is miles better than leaving in incorrect output.
It is about keeping the larger model's quality while getting faster speed most of the time. The tradeoff is that you consume more memory from having two models loaded vs one of them exclusively.
If you just focus on one then it would make sense to reduce memory usage by just running the smaller model.
acters · 1d ago
Another caveat with this method is that both the larger and smaller models need to behave very similarly, because a lot of the savings come from generating the necessary fluff around each detail, such as grammar, formatting and words/letters that transition between each other.
Unsurprisingly, gpt-oss has both larger and smaller models that work very similarly! The two sizes are so similar that even getting a few tokens wrong would not slow performance down to the speed of the larger model alone (which is the worst case with this setup). We want the speed of the smaller model as much as possible. That is all
cwyers · 1d ago
So, the way speculative decoding works, the model begins predicting at the first wrong token, so you still get 'is' for free.
imtringued · 1d ago
You're forgetting that some sequences are more predictable than others, hence the name "speculative" decoding. Let's say your token encoding has 128k tokens. That means the model has to pick the right token out of 128k. Some of those tokens are incredibly rare, while others are super common. The big model has seen the rare tokens many more times than the small model. This means that the small model will be able to do things like produce grammatically correct English, but not know anything about a specific JS framework.
The post-training fine-tuning costs (low thousands of dollars) are the main reason why speculative decoding is relatively unpopular. The most effective speculative decoding strategy requires you to train multiple prediction heads ala Medusa (or whatever succeeded it). If you don't do any fine tuning, then the probability of the small model being useful is slim. Using a random model as your draft model will probably give you very disappointing results.
bhaney · 1d ago
> How does the target model validate the draft tokens without running the inference as normal?
It does run the inference as normal, just in parallel with the other inferences
> if it is doing just that, I don't get the point
Running inferences in parallel allows you to read the model weights out of memory only once for N parallel inferences, as opposed to reading them out of memory N times for N serial inferences. Inference is massively bottlenecked by memory bandwidth to the tune of one or two orders of magnitude compared to compute, so this helps a lot.
littlestymaar · 1d ago
> Inference is massively bottlenecked by memory bandwidth to the tune of one or two orders of magnitude compared to compute, so this helps a lot.
Nitpick: it's only bottlenecked by memory bandwidth if the batch size is too low (that is: if you don't have many users calling the same model in parallel).
Speculative decoding is just a way of running a single query as if it was parallel queries.
joliu · 1d ago
It does run inference, but on the batch of tokens that were drafted, akin to the prefill phase.
So your draft model can decode N new tokens, then the real model does one inference pass to score the N new drafted tokens.
Prefill is computation bound whereas decode is bandwidth bound, so in practice doing one prefill over N tokens is cheaper than doing N decode passes.
jlebar · 1d ago
Just want to suggest: Ask an LLM about it! If you have access to a reasoning model like o3, I've found it to be very helpful.
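I think this answer is as good as any of the human-generated ones in the thread so far, but the real power is that you can ask it follow-up questions. https://chatgpt.com/share/6894504f-4458-8008-a8c9-f371588259...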
Let's say I want to run f2(f1(x)) where f1 and f2 are both a single pass through GPT4.
This takes 2 seconds time, assuming 1 second for every pass.
What I instead do is kick off f1(x) in another thread, and then run f2(g1(x)) where g1 is one pass through GPT-nano.
This takes 1 + 0.1 seconds, assuming gpt nano takes 0.1s for every pass. In this 1.1 seconds, the f1(x) that we kicked off in the 2nd thread would have finished (it takes 1 second).
So in 1.1 seconds we have available to us f1(x), f2(g1(x)), and we store the intermediate g1(x) as well
We compare g1(x) and f1(x)
If they were equal, i.e g1(x) = f1(x), then we have our answer = f2(g1(x)) in just 1.1s.
If they were not, we compute f2(output of f1(x) from 2nd thread) which takes 1 further second, bringing our total to 2.1s.
If the small model is equalling the big model in say 2/3 of cases, you will spend 2/3 * 1.1 + 1/3 * 2.1 = 1.433s on average for this computation. Without speculative decoding, it is always 2s.
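The same expected-latency arithmetic as a tiny script (the 1s/0.1s timings are the illustrative ones above):

    t_big, t_small, p_match = 1.0, 0.1, 2 / 3
    expected = p_match * (t_small + t_big) + (1 - p_match) * (t_small + 2 * t_big)
    print(expected)   # ~1.43s on average, vs a flat 2.0s without speculation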
magicalhippo · 1d ago
Thanks, very nice explanation, that makes perfect sense. I guess their graphics confused me for some reason and had me thinking all wrong.
Now I see they tried to point out the obvious thing which is to predict multiple tokens ahead, not just two as in your example.
arkmm · 1d ago
This is a really great explanation.
robrenaud · 1d ago
I think your core misunderstanding is that you are assuming K calls to generate 1 token each are as expensive as 1 call to generate K tokens. It is actually much more expensive to generate serially than even in small batches.
radarsat1 · 1d ago
Would love to try fully local agentic coding. Is it feasible yet? I have a laptop with a 3050 but that's not nearly enough VRAM, I guess. Still, would be interested to know what's possible today on reasonable consumer hardware.
Davidzheng · 20h ago
if I have a Mac with 128GB of integrated RAM and I want to try this model, should I be using llama.cpp, MLX, or vLLM, or something else? Sorry but I literally don't understand how I'm supposed to decide. Is it just comparing inference speeds?
nektro · 1d ago
> we were the clear leader running on NVIDIA GPUs for both latency and throughput per public data from real-world use on OpenRouter.
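Baseten: 592.6 tps, Groq: 784.6 tps, Cerebras: 4,245 tps
still impressive work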
Yeah the custom hardware providers are super good at TPS. Kudos to their teams for sure, and the demos of instant reasoning are incredibly impressive.
That said, we are serving the model at its full 131K context window, and they are serving 33K max, which could matter for some edge case prompts.
Additionally, NVIDIA hardware is much more widely available if you are scaling a high-traffic application.
adsharma · 1d ago
What's the best number on vLLM and SGlang so far on H100?
It's sad that MLPerf takes a long time to catch up to SOTA models.
smcleod · 1d ago
TensorRT-LLM is a right nightmare to set up and maintain. Good on them for getting it to work for them - but it's not for everyone.
philipkiely · 1d ago
We have built a ton of tooling on top of TRT-LLM and use it not just for LLMs but also for TTS models (Orpheus), STT models (Whisper), and embedding models.
modeless · 1d ago
What's the best speed people have gotten on 4090s?
asabla · 1d ago
I'm on a 5090 so it's not an apples-to-apples comparison. But I'm getting ~150t/s for the 20B version using ~16000 context size.
steinvakt2 · 1d ago
And flash attention doesn't work on 5090 yet, right? So currently 4090 is probably faster, or?
diggan · 1d ago
> And flash attention doesn't work on 5090 yet, right?
Flash attention works with GPT-OSS + llama.cpp (tested on 1d72c8418) and other Blackwell card (RTX Pro 6000) so I think it should work on 5090 as well, it's the same architecture after all.
PeterStuer · 1d ago
I don't think the 4090 has native 4bit support, which will probably have a significant impact.
modeless · 1d ago
Cool, what software?
asabla · 1d ago
Initial testing has only been done with ollama. Plan on testing out llama.cpp and vllm when there is enough time
ActorNightly · 1d ago
You can't fit the model into a 4090 without quantization, it's like 64 gigs.
For home use, Gemma 27B QAT is king. It's almost as good as Deepseek R1
SirMaster · 1d ago
You don't really need it to fit all in VRAM, thanks to the efficient MoE architecture, at least with llama.cpp.
The 120B is running at 20 tokens/sec on my 5060Ti 16GB with 64GB of system ram. Now personally I find 20 tokens/sec quite usable, but for some maybe it's not enough.
dexterlagan · 13h ago
I have a similar setup but with 32 GB of RAM. Do you partly offload the model to RAM? Do you use LM Studio or something else to achieve this? Thanks!
modeless · 1d ago
The 20B one fits.
steinvakt2 · 1d ago
Does it fit on a 5080 (16gb)?
jwitthuhn · 1d ago
Haven't tried myself but it looks like it probably does. The weight files total 13.8 GB which gives you a little left over to hold your context.
northern-lights · 1d ago
It fits on a 5070TI, so should fit on a 5080 as well.
OldfieldFund · 1d ago
laughs in Cerebras
hsaliak · 1d ago
TLDR: tensorrt
littlestymaar · 1d ago
Very fast “Sorry I can't help with that” generator.
jeffhuys · 1d ago
Just "liberate" it
philipkiely · 1d ago
Went to bed with 2 votes, woke up to this. Thank you so much HN!