Open models by OpenAI

564 lackoftactics 200 8/5/2025, 5:02:02 PM openai.com ↗

Comments (200)

foundry27 · 36m ago
Model cards, for the people interested in the guts: https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7...

In my mind, I’m comparing the model architecture they describe to what the leading open-weights models (Deepseek, Qwen, GLM, Kimi) have been doing. Honestly, it just seems “ok” at a technical level:

- both models use standard Grouped-Query Attention (64 query heads, 8 KV heads). The card talks about how they've reused an older optimization from GPT-3: alternating between banded-window (sparse, 128-token) and fully dense attention patterns. It uses RoPE extended with YaRN (for a 131K context window). So they haven't been taking advantage of the special-sauce Multi-head Latent Attention from Deepseek, or any of the other similar improvements over GQA.
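For the curious, here's a minimal PyTorch-style sketch of what that attention setup looks like. This is not OpenAI's code: the head dimension and the even/odd layer alternation are my assumptions; only the 64/8/128 numbers come from the card.

    import torch
    import torch.nn.functional as F

    N_Q_HEADS, N_KV_HEADS, HEAD_DIM, WINDOW = 64, 8, 64, 128  # head dim assumed

    def gqa_attention(q, k, v, layer_idx):
        # q: [batch, 64, seq, head_dim]; k, v: [batch, 8, seq, head_dim]
        seq = q.shape[2]
        # GQA: each KV head is shared by 64 / 8 = 8 query heads
        k = k.repeat_interleave(N_Q_HEADS // N_KV_HEADS, dim=1)
        v = v.repeat_interleave(N_Q_HEADS // N_KV_HEADS, dim=1)
        i = torch.arange(seq).unsqueeze(1)
        j = torch.arange(seq).unsqueeze(0)
        mask = j <= i                          # causal mask
        if layer_idx % 2 == 1:                 # assumed: every other layer is banded
            mask &= (i - j) < WINDOW           # 128-token sliding window
        return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

RoPE (extended with YaRN for the 131K window) would be applied to q and k before this call.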

- both models are standard MoE transformers. The 120B model (116.8B total, 5.1B active) uses 128 experts with Top-4 routing. They're using some kind of Gated SwiGLU activation, which the card describes as "unconventional" because of its clamping and the residual connections that implies. Again, not using any of Deepseek's "shared experts" (for general patterns) + "routed experts" (for specialization) architectural improvements, Qwen's load-balancing strategies, etc.
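As a rough illustration of that top-4 routing (a sketch only, not their implementation; the topk-then-softmax ordering, the expert internals, and the dense Python loop are simplifications/assumptions):

    import torch

    N_EXPERTS, TOP_K = 128, 4

    def moe_layer(x, router, experts):
        # x: [tokens, d_model]; router: nn.Linear(d_model, N_EXPERTS)
        # experts: list of 128 FFN modules (gated SwiGLU in the real model)
        logits = router(x)                                # [tokens, 128]
        w, idx = torch.topk(logits, TOP_K, dim=-1)        # top-4 experts per token
        w = torch.softmax(w, dim=-1)                      # renormalize over the 4 picked
        out = torch.zeros_like(x)
        for slot in range(TOP_K):
            for e in range(N_EXPERTS):
                sel = idx[:, slot] == e
                if sel.any():
                    out[sel] += w[sel, slot, None] * experts[e](x[sel])
        return out

No shared expert here, per the card; a DeepSeek-style design would add one that every token passes through.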

- the most interesting thing IMO is probably their quantization solution. They quantized >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model fit on a single 80GB GPU, which is pretty cool. But we've also got Unsloth with their famous 1.58-bit quants :)
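Back-of-envelope on why that works, assuming (for illustration) exactly 90% of the weights in MXFP4 and the remainder in 16-bit; the true split isn't spelled out beyond ">90%", so this is an upper-bound estimate:

    total_params = 116.8e9
    mxfp4_bytes = 0.90 * total_params * 4.25 / 8   # ~55.8 GB for the MoE weights
    bf16_bytes  = 0.10 * total_params * 2          # ~23.4 GB for everything else
    print(f"~{(mxfp4_bytes + bf16_bytes) / 1e9:.0f} GB")   # ~79 GB, just under 80 GB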

All this to say, it seems like even though the training they did for their agentic behavior and reasoning is undoubtedly very good, they’re keeping their actual technical advancements “in their pocket”.

rfoo · 26m ago
Or, you can say, OpenAI has some real technical advancements on stuff besides attn architecture. GQA8 and alternating SWA 128 / full attn all seem conventional. Basically they are showing us that "there's no secret sauce in model arch, you guys just suck at mid/post-training", or they want us to believe this.

The model is pretty sparse tho, 32:1.

liuliu · 17m ago
The Kimi K2 paper said that model sparsity scales up with parameters pretty well (MoE sparsity scaling law, as they call it, basically calling Llama 4's MoE "done wrong"). Hence K2 has 128:1 sparsity.
logicchains · 25m ago
>They did something to quantize >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model to fit on a single 80GB GPU, which is pretty cool

They said it was native FP4, suggesting that they actually trained it like that; it's not post-training quantisation.

ClassAndBurn · 34m ago
Open models are going to win long-term. Anthropic's own research has to use OSS models [0]. China is demonstrating how quickly companies can iterate on open models, allowing smaller teams access to and augmentation of a model's abilities without paying the training cost.

My personal prediction is that the US foundational model makers will OSS something close to N-1 for the next 1-3 iterations. The CAPEX for foundational model creation is too high to justify OSS for the current generation. Unless the US Gov steps up and starts subsidizing power, or Stargate does 10x what is planned right now.

N-1 model value depreciates insanely fast. Making an OSS release of them and allowing specialized use cases and novel developments allows potential value to be captured and integrated into future model designs. It's medium risk, as you may lose market share. But also high potential value, as the shared discoveries could substantially increase the velocity of next-gen development.

There will be a plethora of small OSS models. Iteration on the OSS releases is going to be biased towards local development, creating more capable and specialized models that work on smaller and smaller devices. In an agentic future, every different agent in a domain may have its own model. Distilled and customized for its use case without significant cost.

Everyone is racing to AGI/SGI. The models along the way are to capture market share and use data for training and evaluations. Once someone hits AGI/SGI, the consumer market is nice to have, but the real value is in novel developments in science, engineering, and every other aspect of the world.

[0] https://www.anthropic.com/research/persona-vectors > We demonstrate these applications on two open-source models, Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct.

lechatonnoir · 21m ago
I'm pretty sure there's no reason that Anthropic has to do research on open models, it's just that they produced their result on open models so that you can reproduce their result on open models without having access to theirs.
x187463 · 1h ago
Running a model comparable to o3 on a 24GB Mac Mini is absolutely wild. Seems like yesterday the idea of running frontier (at the time) models locally or on a mobile device was 5+ years out. At this rate, we'll be running such models in the next phone cycle.
tedivm · 1h ago
It only seems like that if you haven't been following other open source efforts. Models like Qwen perform ridiculously well and do so on very restricted hardware. I'm looking forward to seeing benchmarks to see how these new open source models compare.
Rhubarrbb · 55m ago
Agreed, these models seem relatively mediocre compared to Qwen3 / GLM 4.5
modeless · 49m ago
Nah, these are much smaller models than Qwen3 and GLM 4.5 with similar performance. Fewer parameters and fewer bits per parameter. They are much more impressive and will run on garden variety gaming PCs at more than usable speed. I can't wait to try on my 4090 at home.

There's basically no reason to run other open source models now that these are available, at least for non-multimodal tasks.

tedivm · 37m ago
Qwen3 has multiple variants ranging from larger (235B) than these models to significantly smaller (0.6B), with a huge number of options in between. For each of those models they also release quantized versions (your "fewer bits per parameter").

I'm still withholding judgement until I see benchmarks, but every point you tried to make regarding model size and parameter size is wrong. Qwen has more variety on every level, and performs extremely well. That's before getting into the MoE variants of the models.

modeless · 29m ago
The benchmarks of the OpenAI models are comparable to the largest variants of other open models. The smaller variants of other open models are much worse.
mrbungie · 20m ago
I would wait for neutral benchmarks before making any conclusions.
moralestapia · 40m ago
You can always get your $0 back.
Imustaskforhelp · 35m ago
I have never agreed with a comment so much but we are all addicted to open source models now.
satvikpendem · 29m ago
Depends on how much you paid for the hardware to run em on
echelon · 25m ago
This might mean there's no moat for anything.

Kind of a P=NP, but for software deliverability.

CamperBob2 · 6m ago
On the subject of who has a moat and who doesn't, it's interesting to look at the role of patents in the early development of wireless technology. There was WWI, and there was WWII, but the players in the nascent radio industry had serious beef with each other.

I imagine the same conflicts will ramp up over the next few years, especially once the silly money starts to dry up.

a_wild_dandan · 1h ago
Right? I still remember the safety outrage of releasing Llama. Now? My 96 GB of (V)RAM MacBook will be running a 120B parameter frontier lab model. So excited to get my hands on the MLX quants and see how it feels compared to GLM-4.5-air.
4b6442477b1280b · 58m ago
in that era, OpenAI and Anthropic were still deluding themselves into thinking they would be the "stewards" of generative AI, and the last US administration was very keen on regoolating everything under the sun, so "safety" was just an angle for regulatory capture.

God bless China.

a_wild_dandan · 26m ago
Oh absolutely, AI labs certainly talk their books, including any safety angles. The controversy/outrage extended far beyond those incentivized companies too. Many people had good faith worries about Llama. Open-weight models are now vastly more powerful than Llama-1, yet the sky hasn't fallen. It's just fascinating to me how apocalyptic people are.

I just feel lucky to be around in what's likely the most important decade in human history. Shit odds on that, so I'm basically a lotto winner. Wild times.

narrator · 34m ago
Yeah, China is e/acc. Nice cheap solar panels too. Thanks China. The problem is their ominous policies, like not allowing almost any immigration, and their domestic Han Supremacist propaganda, and all that makes it look a bit like this might be Han Supremacy e/acc. Is it better than western/decel? Hard to say, but at least the western/decel people are now starting to talk about building power plants, at least for datacenters, and things like that, instead of demanding whole branches of computer science be classified, as they were threatening to Marc Andreessen when he visited the Biden admin last year.
Imustaskforhelp · 36m ago
Okay, I will be honest, I was so hyped up about this model, but then I went to localllama and saw that the:

120B model is worse at coding compared to qwen 3 coder and glm45 air and even grok 3... (https://www.reddit.com/r/LocalLLaMA/comments/1mig58x/gptoss1...)

ascorbic · 7m ago
That's SVGBench, which is a useful benchmark but isn't much of a test of general coding
logicchains · 23m ago
It's only got around 5 billion active parameters; it'd be a miracle if it was competitive at coding with SOTA models that have significantly more.
bogtog · 59m ago
When people talk about running a (quantized) medium-sized model on a Mac Mini, what types of latency and throughput times are they talking about? Do they mean like 5 tokens per second or at an actually usable speed?
n42 · 45m ago
here's a quick recording from the 20b model on my 128GB M4 Max MBP: https://asciinema.org/a/AiLDq7qPvgdAR1JuQhvZScMNr

and the 120b: https://asciinema.org/a/B0q8tBl7IcgUorZsphQbbZsMM

I am, um, floored

phonon · 36m ago
Here's a 4bit 70B parameter model, https://www.youtube.com/watch?v=5ktS0aG3SMc (deepseek-r1:70b Q4_K_M) on a M4 Max 128 GB. Usable, but not very performant.
a_wild_dandan · 23m ago
GLM-4.5-air produces tokens far faster than I can read on my MacBook. That's plenty fast enough for me, but YMMV.
tyho · 47m ago
What's the easiest way to get these local models browsing the web right now?
dizhn · 46m ago
aider uses Playwright. I don't know what everybody is using but that's a good starting point.
deviation · 1h ago
So this confirms a best-in-class model release within the next few days?

From a strategic perspective, I can't think of any reason they'd release this unless they were about to announce something which totally eclipses it?

ticulatedspline · 1h ago
Even without an imminent release it's a good strategy. They're getting pressure from Qwen and other high-performing open-weight models. Without a horse in the race they could fall behind in an entire segment.

There's future opportunity in licensing, tech support, agents, or even simply to dominate and eliminate. Not to mention brand awareness: if you like these, you might be more likely to approach their brand for larger models.

FergusArgyll · 44m ago
winterrx · 1h ago
GPT-5 coming Thursday.
bredren · 1h ago
Undoubtedly. It would otherwise reduce the perceived value of their current product offering.

The question is how much better the new model(s) will need to be on the metrics given here to feel comfortable making these available.

Despite the loss of face for lack of open model releases, I do not think that was a big enough problem to undercut commercial offerings.

og_kalu · 1h ago
Even before today, the last week or so, it's been clear for a couple of reasons that GPT-5's release was imminent.
logicchains · 1h ago
> I can't think of any reason they'd release this unless they were about to announce something which totally eclipses it

Given it's only around 5 billion active params it shouldn't be a competitor to o3 or any of the other SOTA models, given the top Deepseek and Qwen models have around 30 billion active params. Unless OpenAI somehow found a way to make a model with 5 billion active params perform as well as one with 4-8 times more.

siliconc0w · 1m ago
It seems like OSS will win, I can't see people willing to pay like 10x the price for what seems like 10% more performance. Especially once we get better at routing the hardest questions to the better models and then using that response to augment/fine-tune the OSS ones.
timmg · 55m ago
Orthogonal, but I just wanted to say how awesome Ollama is. It took 2 seconds to find the model and a minute to download and now I'm using it.

Kudos to that team.

artembugara · 1h ago
Disclaimer: probably dumb questions

so, the 20b model.

Can someone explain to me what I would need to do in terms of resources (GPU, I assume) if I want to run 20 concurrent processes, assuming I need 1k tokens/second throughput (on each, so 20 x 1k)

Also, is this model better/comparable for information extraction compared to gpt-4.1-nano, and would it be cheaper to host the 20B myself?

mlyle · 1h ago
An A100 is probably 2-4k tokens/second on a 20B model with batched inference.

Multiply the number of A100's you need as necessary.

Here, you don't really need the ram. If you could accept fewer tokens/second, you could do it much cheaper with consumer graphics cards.

Even with A100, the sweet-spot in batching is not going to give you 1k/process/second. Of course, you could go up to H100...

PeterStuer · 24m ago
(answer for 1 inference) All depends on the context length you want to support, as the activation memory will dominate the requirements. For 4096 tokens you will get away with 24GB (or even 16GB), but if you want to go for the full 131072 tokens you are not going to get there with a 32GB consumer GPU like the 5090. You'll need to spring for at minimum an A6000 (48GB) or preferably an RTX 6000 Pro (96GB).

Also keep in mind this model does use 4-bit layers for the MoE parts. Unfortunately native accelerated 4-bit support only started with Blackwell on NVIDIA. So your 3090/4090/A6000/A100's are not going to be fast. An RTX 5090 will be your best starting point in the traditional card space. Maybe the unified memory minipc's like the Spark systems or the Mac mini could be an alternative, but I do not know them enough.

mythz · 1h ago
gpt-oss:20b is ~14GB on disk [1] so fits nicely within a 16GB VRAM card.

[1] https://ollama.com/library/gpt-oss

dragonwriter · 54m ago
You also need space in VRAM for what is required to support the context window; you might be able to do a model that is 14GB in parameters with a small (~8k maybe?) context window on a 16GB card.
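A rough KV-cache estimate for the 20B model (the layer count, KV heads, and head dim below are my assumptions, not published figures, and this ignores the savings from the banded-attention layers):

    def kv_cache_gb(tokens, layers=24, kv_heads=8, head_dim=64, bytes_per=2):
        per_token = 2 * layers * kv_heads * head_dim * bytes_per   # K and V, 16-bit
        return tokens * per_token / 1e9

    for ctx in (8_192, 32_768, 131_072):
        print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx):.1f} GB")
    # ~0.4 GB at 8K, ~1.6 GB at 32K, ~6.4 GB at 131K

On those numbers a 14GB model plus an ~8K context squeezes into 16GB, as suggested above, with little headroom left for activations.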
artembugara · 1h ago
thanks, this part is clear to me.

but I need to understand 20 x 1k token throughput

I assume it just might be too early to know the answer

Tostino · 1h ago
I legitimately cannot think of any hardware that will get you to that throughput over that many streams (I don't work in the server space, so there may be some new stuff I'm unaware of).
artembugara · 54m ago
oh, I totally understand that I'd need multiple GPUs. I'd just want to know what GPU specifically and how many
Tostino · 45m ago
I don't think you can get 1k tokens/sec on a single stream using any consumer grade GPUs with a 20b model. Maybe you could with H100 or better, but I somewhat doubt that.

My 2x 3090 setup will get me ~6-10 streams of ~20-40 tokens/sec (generation) ~700-1000 tokens/sec (input) with a 32b dense model.

petuman · 1h ago
> assuming I need 1k tokens/second throughput (on each, so 20 x 1k)

3.6B activated at Q8 x 1000 t/s = 3.6TB/s just for activated model weights (there's also context). So pretty much straight to B200 and alike. 1000 t/s per user/agent is way too fast, make it 300 t/s and you could get away with 5090/RTX PRO 6000.

spott · 49m ago
Groq is offering 1k tokens per second for the 20B model.

You are unlikely to match groq on off the shelf hardware as far as I'm aware.

henriquegodoy · 17m ago
Seeing a 20B model competing with o3's performance is mind-blowing. Just a year ago, most of us would've called this impossible - not just the intelligence leap, but getting this level of capability in such a compact size.

I think that the point that makes me more excited is that we can train trillion-parameter giants and distill them down to just billions without losing the magic. Imagine coding with Claude 4 Opus-level intelligence packed into a 10B model running locally at 2000 tokens/sec - like instant AI collaboration. That would fundamentally change how we develop software.

seydor · 28s ago
This is good for China
sadiq · 1h ago
Looks like Groq (at 1k+ tokens/second) and Fireworks are already live on openrouter: https://openrouter.ai/openai/gpt-oss-120b

$0.15/M tokens in / $0.60-0.75/M tokens out

podnami · 1h ago
Wow, this was actually blazing fast. I prompted "how can the 45th and 47th presidents of America share the same parents?"

On ChatGPT.com o3 thought for 13 seconds; on OpenRouter GPT OSS 120B thought for 0.7 seconds - and they both had the correct answer.

golergka · 1m ago
When I pay attention to o3 CoT, I notice it spends a few passes thinking about my system prompt. Hard to imagine this question is hard enough to spend 13 seconds on.
swores · 58m ago
I'm not sure that's a particularly good question for concluding something positive about the "thought for 0.7 seconds" - it's such a simple answer, ChatGPT 4o (with no thinking time) immediately answered correctly. The only surprising thing in your test is that o3 wasted 13 seconds thinking about it.
Workaccount2 · 52m ago
A current major outstanding problem with thinking models is how to get them to think an appropriate amount.
Imustaskforhelp · 1h ago
Not gonna lie but I got sorta goosebumps

I am not kidding but such progress from a technological point of view is just fascinating!

nisegami · 55m ago
Interesting choice of prompt. None of the local models I have in ollama (consumer mid range gpu) were able to get it right.
tekacs · 40m ago
I apologize for linking to Twitter, but I can't post a video here, so:

https://x.com/tekacs/status/1952788922666205615

Asking it about a marginally more complex tech topic and getting an excellent answer in ~4 seconds, reasoning for 1.1 seconds...

I am _very_ curious to see what GPT-5 turns out to be, because unless they're running on custom silicon / accelerators, even if it's very smart, it seems hard to justify not using these open models on Groq/Cerebras for a _huge_ fraction of use-cases.

tekacs · 39m ago
Cleanshot link for those who don't want to go to X: https://share.cleanshot.com/bkHqvXvT
tekacs · 36m ago
A few days ago I posted a slowed-down version of the video demo on someone's repo because it was unreadably fast due to being sped up.

https://news.ycombinator.com/item?id=44738004

... today, this is a real-time video of the OSS thinking models by OpenAI on Groq and I'd have to slow it down to be able to read it. Wild.

sigmar · 1h ago
Non-rhetorically, why would someone pay for o3 api now that I can get this open model from openai served for cheaper? Interesting dynamic... will they drop o3 pricing next week (which is 10-20x the cost[1])?

[1] currently $3/M tokens in / $8/M tokens out https://platform.openai.com/docs/pricing

gnulinux · 1h ago
Not even that: even if o3 being marginally better is important for your task (let's say), why would anyone use o4-mini? It seems almost 10x the price for the same performance (maybe even less): https://openrouter.ai/openai/o4-mini
gnulinux · 1h ago
Wow, that's significantly cheaper than o4-mini, which seems to be on par with gpt-oss-120b ($1.10/M input tokens, $4.40/M output tokens). Almost 10x the price.

LLMs are getting cheaper much faster than I anticipated. I'm curious if it's still the hype cycle and Groq/Fireworks/Cerebras are taking a loss here, or whether things are actually getting cheaper. At this rate we'll be able to run Qwen3-32B-level models on phones/embedded devices soon.

tempaccount420 · 51m ago
It's funny because I was thinking the opposite, the pricing seems way too high for a 5B parameter activation model.
gnulinux · 48m ago
Sure, you're right, but if I can squeeze o4-mini-level utility out of it at less than a quarter of the price, does it really matter?
mikepurvis · 59m ago
Are the prices staying aligned to the fundamentals (hardware, energy), or is this a VC-funded land grab pushing prices to the bottom?
spott · 47m ago
It is interesting that openai isn't offering any inference for these models.
bangaladore · 39m ago
Makes sense to me. Inference on these models will be a race to the bottom. Hosting inference themselves will be a waste of compute / dollar for them.
IceHegel · 1h ago
Listed performance of ~5 points less than o3 on benchmarks is pretty impressive.

Wonder if they feel the bar will be raised soon (GPT-5) and feel more comfortable releasing something this strong.

HanClinto · 1h ago
Holy smokes, there's already llama.cpp support:

https://github.com/ggml-org/llama.cpp/pull/15091

carbocation · 1h ago
And it's already on ollama, it appears: https://ollama.com/library/gpt-oss
incomingpain · 15m ago
lm studio immediately released the new appimage with support.
Leary · 1h ago
GPQA Diamond: gpt-oss-120b: 80.1%, Qwen3-235B-A22B-Thinking-2507: 81.1%

Humanity’s Last Exam: gpt-oss-120b (tools): 19.0%, gpt-oss-120b (no tools): 14.9%, Qwen3-235B-A22B-Thinking-2507: 18.2%

jasonjmcghee · 1h ago
Wow - I will give it a try then. I'm cynical about OpenAI minmaxing benchmarks, but still trying to be optimistic as this in 8bit is such a nice fit for apple silicon
modeless · 52m ago
Even better, it's 4 bit
amarcheschi · 1h ago
Glm 4.5 seems on par as well
thegeomaster · 1h ago
GLM-4.5 seems to outperform it on TauBench, too. And it's suspicious OAI is not sharing numbers for quite a few useful benchmarks (nothing related to coding, for example).

One positive thing I see is the number of parameters and size --- it will provide more economical inference than current open source SOTA.

lcnPylGDnU4H9OF · 54m ago
Was the Qwen model using tools for Humanity's Last Exam?
thimabi · 1h ago
Open weight models from OpenAI with performance comparable to that of o3 and o4-mini in benchmarks… well, I certainly wasn’t expecting that.

What’s the catch?

coreyh14444 · 1h ago
Because GPT-5 comes out later this week?
thimabi · 1h ago
It could be, but there’s so much hype surrounding the GPT-5 release that I’m not sure whether their internal models will live up to it.

For GPT-5 to dwarf these just-released models in importance, it would have to be a huge step forward, and I'm still doubtful about OpenAI's capability and infrastructure to handle demand at the moment.

rrrrrrrrrrrryan · 24m ago
It seems like a big part of GPT-5 will be that it will be able to intelligently route your request to the appropriate model variant.
jona777than · 55m ago
As a sidebar, I’m still not sure if GPT-5 will be transformative due to its capabilities as much as its accessibility. All it really needs to do to be highly impactful is lower the barrier of entry for the more powerful models. I could see that contributing to it being worth the hype. Surely it will be better, but if more people are capable of leveraging it, that’s just as revolutionary, if not more.
sebzim4500 · 1h ago
Surely OpenAI would not be releasing this now unless GPT-5 was much better than it.
NitpickLawyer · 48m ago
> What’s the catch?

Probably GPT5 will be way way better. If alpha/beta horizon are early previews of GPT5 family models, then coding should be > opus4 for modern frontend stuff.


logicchains · 57m ago
The catch is that it only has ~5 billion active params so should perform worse than the top Deepseek and Qwen models, which have around 20-30 billion, unless OpenAI pulled off a miracle.
rmonvfer · 1h ago
What a day! Models aside, the Harmony Response Format[1] also seems pretty interesting and I wonder how much of an impact it might have in performance of these models.

[1] https://github.com/openai/harmony
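If I'm reading the repo right, conversations get rendered into role- and channel-tagged segments, roughly along these lines (illustrative only; check the linked docs for the exact special tokens and roles):

    <|start|>system<|message|>You are a helpful assistant.<|end|>
    <|start|>user<|message|>What is 2 + 2?<|end|>
    <|start|>assistant<|channel|>analysis<|message|>Trivial arithmetic.<|end|>
    <|start|>assistant<|channel|>final<|message|>4<|end|>

The separate channels (analysis / commentary / final) seem to be what lets the model keep chain-of-thought and tool calls apart from the user-facing answer.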

jakozaur · 1h ago
The coding seems to be one of the strongest use cases for LLMs. Though currently they are eating too many tokens to be profitable. So perhaps these local models could offload some tasks to local computers.

E.g. Hybrid architecture. Local model gathers more data, runs tests, does simple fixes, but frequently asks the stronger model to do the real job.

Local model gathers data using tools and sends more data to the stronger model.

It

Imustaskforhelp · 59m ago
I have always thought that if we can somehow get an AI which is insanely good at coding, so much so that It can improve itself, then through continuous improvements, they will get better models of everything else idk

Maybe you guys call it AGI, so anytime I see progress in coding, I think it goes just a tiny bit towards the right direction

Plus it also helps me as a coder to actually do some stuff just for the fun. Maybe coding is the only truly viable use of AI and all others are negligible increases.

There is so much polarization around the use of AI in coding, but I just want to say this: it would be pretty ironic if an industry which automates other people's jobs were this time the first to get its own job automated.

But I don't see that as happening, far from it. But still, each day something new, something better happens back to back. So yeah.

NitpickLawyer · 46m ago
Not to open that can of worms, but in most definitions self-improvement is not an AGI requirement. That's already ASI territory (Super Intelligence). That's the proverbial skynet (pessimists) or singularity (optimists).
Imustaskforhelp · 30m ago
Hmm, my bad. Yeah, I always thought that it was the endgame of humanity, but isn't AGI supposed to be that (the endgame)?

What would AGI mean? Solving some problem that it hasn't seen? Or what exactly? I mean, I think AGI is solved, no?

If not, I see people mentioning that Horizon Alpha is actually a GPT-5 model and it's predicted to release on Thursday on some betting market, so maybe that fits the AGI definition?

hooverd · 56m ago
Optimistically, there's always more crap to get done.
jona777than · 50m ago
I agree. It’s not improbable for there to be _more_ needs to meet in the future, in my opinion.
user_7832 · 35m ago
Newbie question: I remember folks talking about how Kimi K2's launch might have pushed OpenAI to launch their model later. Now that we (shortly will) know how this model performs, how do they stack up? Did OpenAI actually hold off on releasing weights because of Kimi, in retrospect?
ahmetcadirci25 · 8m ago
I started downloading, I'm eager to test it. I will share my personal experiences. https://ahmetcadirci.com/2025/gpt-oss/
nirav72 · 11m ago
I don't exactly have the ideal hardware to run it locally, but I just ran the 20B in LM Studio with a 3080 Ti (12GB VRAM) with some offloading to CPU. Ran a couple of quick code-generation tests. On average about 20 t/sec, but response quality was very similar or on par with ChatGPT o3 for the same code it outputted. So it's not bad.
mythz · 29m ago
Getting great performance running gpt-oss on 3x A4000's:

    gpt-oss:20b = ~46 tok/s
More than 2x faster than my previous leading OSS models:

    mistral-small3.2:24b = ~22 tok/s 
    gemma3:27b           = ~19.5 tok/s
Strangely getting nearly the opposite performance running on 1x 5070 Ti:

    mistral-small3.2:24b = ~39 tok/s 
    gpt-oss:20b          = ~21 tok/s
Where gpt-oss is nearly 2x slower than mistral-small 3.2.
dsco · 1h ago
Does anyone get the demos at https://www.gpt-oss.com to work, or are the servers down immediately after launch? I'm only getting the spinner after prompting.
lukasgross · 1h ago
(I helped build the microsite)

Our backend is falling over from the load, spinning up more resources!

eliseumds · 1h ago
Getting lots of 502s from `https://api.gpt-oss.com/chatkit` at the moment.
lukasgross · 34m ago
Update: try now!
modeless · 1h ago
Can't wait to see third party benchmarks. The ones in the blog post are quite sparse and it doesn't seem possible to fully compare to other open models yet. But the few numbers available seem to suggest that this release will make all other non-multimodal open models obsolete.
Rhubarrbb · 23m ago
What's the best agent to run this on? Is it compatible with Codex? For OSS agents, I've been using Qwen Code (clunky fork of Gemini), and Goose.
Workaccount2 · 1h ago
Wow, today is a crazy AI release day:

- OAI open source

- Opus 4.1

- Genie 3

- ElevenLabs Music

orphea · 1h ago

  OAI open source
Yeah. This certainly was not on my bingo card.
Disposal8433 · 1h ago
Please don't use the open-source term unless you ship the TBs of data downloaded from Anna's Archive that are required to build it yourself. And don't forget all the system prompts to censor the multiple topics that they don't want you to see.
Quarrel · 1h ago
Is your point really that "I need to see all the data downloaded to make this model before I can know it is open"? Do you have $XXB worth of GPU time to ingest that data with a state-of-the-art framework to make a model? I don't. Even if I did, I'm not sure FB or Google are in any better position to claim this model is or isn't open beyond the fact that the weights are there.

They're giving you a free model. You can evaluate it. You can sue them. But the weights are there. If you dislike the way they license the weights, because the license isn't open enough, then sure, speak up, but because you can't see all the training data??! Wtf.

layer8 · 1h ago
The parent’s point is that open weight is not the same as open source.

Rough analogy:

SaaS = AI as a service

Locally executable closed-source software = open-weight model

Open-source software = open-source model (whatever allows to reproduce the model from training data)

ticulatedspline · 1h ago
To many people there's an important distinction between "open source" and "open weights". I agree with the distinction; open source has a particular meaning which doesn't really apply here, and misuse is worth calling out in order to prevent erosion of the terminology.

Historically this would be like calling a free but closed-source application "open source" simply because the application is free.

someperson · 1h ago
Keep fighting the "open weights" terminology fight: a blob of neural network weights is not open source (even if the inference code is open source), and the term open source shouldn't be diluted to cover it.
mhh__ · 1h ago
The system prompt is an inference parameter, no?
rvnx · 1h ago
I don’t know why you got so much downvoted, these models are not open-source/open-recipes. They are censored open weights models. Better than nothing, but far from being Open
outlore · 1h ago
by your definition most of the current open weight models would not qualify
robotmaxtron · 1h ago
Correct. I agree with them, most of the open weight models are not open source.
layer8 · 1h ago
That’s why they are called open weight and not open source.
NitpickLawyer · 44m ago
It's Apache 2.0, so by definition it's open source. Stop pushing for training data; it'll never happen, and there's literally 0 reason for it to happen (both theoretical and practical). Apache 2.0 IS open source.
_flux · 32m ago
No, it's open weight. You wouldn't call applications with only Apache 2.0-licensed binaries "open source". The weights are not the "source code" of the model, they are the "compiled" binary, therefore they are not open source.

However, for the sake of argument let's say this release should be called open source.

Then what do you call a model that also comes with its training material and tools to reproduce the model? Is it also called open source, and there is no material difference between those two releases? Or perhaps those two different terms should be used for those two different kind of releases?

If you say that actually open source releases are impossible now (for mostly copyright reasons I imagine), it doesn't mean that they will be perpetually so. For that glorious future, we can leave them space in the terminology by using the term open weight. It is also the term that should not be misleading to anyone.

organsnyder · 34m ago
What is the source that's open? Aren't the models themselves more akin to compiled code than to source code?
NitpickLawyer · 28m ago
No, not compiled code. Weights are hardcoded values. Code is the combination of model architecture + config + inferencing engine. You run inference based on the architecture (what and when to compute), using some hardcoded values (weights).
WhyNotHugo · 29m ago
It’s open source, but it’s a binary-only release.

It’s like getting a compiled software with an Apache license. Technically open source, but you can’t modify and recompile since you don’t have the source to recompile. You can still tinker with the binary tho.

NitpickLawyer · 22m ago
Weights are not a binary. I have no idea why this is so often repeated; it's simply not true. You can't do anything with the weights themselves, you can't "run" the weights.

You run inference (via a library) on a model using its architecture (config file) and tokenizer (what and when to compute), based on the weights (hardcoded values). That's it.

> but you can’t modify

Yes, you can. It's called finetuning. And, most importantly, that's exactly how the model creators themselves are "modifying" the weights! No sane lab is "recompiling" a model every time they change something. They perform a pre-training stage (feed everything and the kitchen sink), they get the hardcoded values (weights), and then they post-train using "the same" (well, maybe their techniques are better, but still the same concept) as you or I would. Just with more compute. That's it. You can do the exact same modifications, using basically the same concepts.

> don’t have the source to recompile

In pure practical terms, neither do the labs. Everyone that has trained a big model can tell you that the process is so finicky that they'd eat a hat if a big training session could somehow be made reproducible to the bit. Between nodes failing, datapoints ballooning your loss and having to go back, and the myriad of other problems, what you get out of a big training run is not guaranteed to be the same even with 100-1000 more attempts, in practice. It's simply the nature of training large models.
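For concreteness, the kind of "modification" being described is just standard finetuning, e.g. a LoRA pass with transformers + peft. A sketch only, assuming a Hugging Face hub id and projection-module names that may not match the actual checkpoint:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    model_id = "openai/gpt-oss-20b"              # assumed hub id
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    lora = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                      target_modules=["q_proj", "v_proj"])   # module names assumed
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # only the small adapter matrices get trained
    # ...then run any standard SFT loop (e.g. trl's SFTTrainer) over your own data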

koolala · 4m ago
You can do a lot with a binary also. That's what game mods are all about.
koolala · 7m ago
Calls them open-weight. Names them 'oss'. What does oss stand for?
ukprogrammer · 11m ago
> we also introduced an additional layer of evaluation by testing an adversarially fine-tuned version of gpt-oss-120b

What could go wrong?

ahmedhawas123 · 1h ago
Exciting as this is to toy around with...

Perhaps I missed it somewhere, but I find it frustrating that, unlike most other open-weight models and despite this being an open release, OpenAI has chosen to provide pretty minimal transparency regarding model architecture and training. It's become the norm for Llama, Deepseek, Qwen, Mistral and others to provide a pretty detailed write-up on the model, which allows researchers to advance and compare notes.

gundawar · 59m ago
Their model card [0] has some information. It is quite a standard architecture though; it's always been that their alpha is in their internal training stack.

[0] https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7...

sebzim4500 · 1h ago
The model files contain an exact description of the architecture of the network, there isn't anything novel.

Given these new models are closer to the SOTA than they are to competing open models, this suggests that the 'secret sauce' at OpenAI is primarily about training rather than model architecture.

Hence why they won't talk about the training.

PeterStuer · 53m ago
I love how they frame High-end desktops and laptops as having "a single H100 GPU".
organsnyder · 35m ago
I read that as it runs in data centers (H100 GPUs) or high-end desktops/laptops (Strix Halo?).
bobsmooth · 2m ago
Hopefully the dolphin team will work their magic and uncensor this model
jstummbillig · 57m ago
Shoutout to the hn consensus regarding an OpenAI open model release from 4 days ago: https://news.ycombinator.com/item?id=44758511
jp1016 · 1h ago
I wish these models had minimum RAM, CPU, and GPU requirements listed on the site instead of "high-end" and "medium-end" PC.
pamelafox · 58m ago
Anyone tried running on a Mac M1 with 16GB RAM yet? I've never run higher than an 8GB model, but apparently this one is specifically designed to work well with 16 GB of RAM.
pamelafox · 43m ago
Update: I tried it out. It took about 8 seconds per token, and didn't seem to be using much of my GPU (MPU), but was using a lot of RAM. Not a model that I could use practically on my machine.
thimabi · 52m ago
It works fine, although with a bit more latency than non-local models. However, swap usage goes way beyond what I’m comfortable with, so I’ll continue to use smaller models for the foreseeable future.

Hopefully other quantizations of these OpenAI models will be available soon.

ArtTimeInvestor · 1h ago
Why do companies release open source LLMs?

I would understand it, if there was some technology lock-in. But with LLMs, there is no such thing. One can switch out LLMs without any friction.

koolala · 3m ago
I don't, because it would kill their data-scraping business's competitive advantage.
TrackerFF · 28m ago
LLMs are terrible, purely speaking from the business economic side of things.

Frontier / SOTA models are barely profitable. Previous-gen models lose 90% of their value. Two gens back and they're worthless.

And given that their product life cycle is something like 6-12 months, you might as well open source them as part of sundowning them.

gnulinux · 58m ago
Name recognition? Advertisement? Federal grant to beat Chinese competition?

There could be many legitimate reasons, but yeah, I'm very surprised by this too. Some companies take it a bit too seriously and go above and beyond too. At this point, unless you need the absolute SOTA models because you're throwing an LLM at an extremely hard problem, there is very little utility in using the larger providers. On OpenRouter, or by renting your own GPU, you can run on-par models for much cheaper.

mclau157 · 19m ago
Partially because using their own GPUs is expensive, so maybe offloading some GPU usage
johntiger1 · 1h ago
Wow, this will eat Meta's lunch
asdev · 1h ago
Meta is so cooked, I think most enterprises will opt for OpenAI or Anthropic and others will host OSS models themselves or on AWS/infra providers.
a_wild_dandan · 55m ago
I'll accept Meta's frontier AI demise if they're in their current position a year from now. People killed Google prematurely too (remember Bard?), because we severely underestimate the catch-up power bought with ungodly piles of cash.
atonse · 28m ago
And boy, with the $250m offers to people, Meta is definitely throwing ungodly piles of cash at the problem.

But Apple is waking up too. So is Google. It's absolutely insane, the amount of money being thrown around.

asdev · 34m ago
catching up gets exponentially harder as time passes. way harder to catch up to current models than it was to the first iteration of gpt-4
seydor · 1h ago
I believe their competition has been the Chinese companies for some time now
BoorishBears · 1h ago
Maverick and Scout were not great, even with post-training in my experience, and then several Chinese models at multiple sizes made them kind of irrelevant (dots, Qwen, MiniMax)

If anything this helps Meta: another model to inspect/learn from/tweak etc. generally helps anyone making models

mhh__ · 1h ago
They will clone it
jcmontx · 23m ago
I'm out of the loop for local models. For my M3 24gb ram macbook, what token throughput can I expect?
chown · 1h ago
Shameless plug: if someone wants to try it in a nice ui, you could give Msty[1] a try. It's private and local.

[1]: https://msty.ai

Imustaskforhelp · 1h ago
Is this the same model (Horizon Beta) on openrouter or not? Because I still see Horizon beta available with its codename on openrouter
Robdel12 · 26m ago
I’m on my phone and haven’t been able to break away to check, but anyone plug these into Codex yet?
incomingpain · 5m ago
First coding test: just copying and pasting out of chat. It aced my first coding test in 5 seconds... this is amazing. It's really good at coding.

Trying to use it for agentic coding...

lots of fail. This harmony formatting? Anyone have a working agentic tool?

openhands and void ide are failing due to the new tags.

nodesocket · 11m ago
Anybody got this working in Ollama? I'm running latest version 0.11.0 with WebUI v0.6.18 but getting:

> List the US presidents in order starting with George Washington and their time in office and year taken office.

>> 00: template: :3: function "currentDate" not defined

genpfault · 2m ago
shpongled · 58m ago
I looked through their torch implementation and noticed that they are applying RoPE to both query and key matrices in every layer of the transformer - is this standard? I thought positional encodings were usually just added once at the first layer
m_ke · 57m ago
No they’re usually done at each attention layer.
shpongled · 42m ago
Do you know when this was introduced (or which paper)? AFAIK it's not that way in the original transformer paper, or BERT/GPT-2
Scene_Cast2 · 27m ago
Should be in the RoPE paper. The OG transformer used additive sinusoidal embeddings, while RoPE does a pairwise rotation.

There's also NoPE, I think SmolLM3 "uses NoPE" (aka doesn't use any positional stuff) every fourth layer.
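A tiny sketch of what that pairwise rotation looks like, applied to q and k inside every attention layer (standard RoPE with the usual base; not the gpt-oss code, which also adds YaRN scaling):

    import torch

    def rope(x, base=10000.0):
        # x: [batch, heads, seq, head_dim]; rotate consecutive channel pairs
        seq, d = x.shape[-2], x.shape[-1]
        pos = torch.arange(seq, dtype=torch.float32)
        inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
        ang = pos[:, None] * inv_freq[None, :]        # [seq, d/2] rotation angles
        cos, sin = ang.cos(), ang.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]
        return torch.stack((x1 * cos - x2 * sin,
                            x1 * sin + x2 * cos), dim=-1).flatten(-2)

Because the position information is injected at every layer (and only into q and k, not v), the attention scores end up depending on relative position.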

Nimitz14 · 13m ago
This is normal. Rope was introduced after bert/gpt2
abidlabs · 1h ago
pu_pe · 43m ago
Very sparse benchmarking results released so far. I'd bet the Chinese open source models beat them on quite a few of them.
n42 · 59m ago
my very early first impression of the 20b model on ollama is that it is quite good, at least for the code I am working on; arguably good enough to drop a subscription or two
isoprophlex · 44m ago
Can these do image inputs as well? I can't find anything about that on the linked page, so I guess not..?
Nimitz14 · 11m ago
I'm surprised at the model dim being 2.8k with an output size of 200k. My gut feeling had told me you don't want too large of a gap between the two, seems I was wrong.
anonymoushn · 27m ago
guys, what does OSS stand for?
k2xl · 1h ago
Is there any details about hardware requirements for a sensible tokens per second for each size of these models?
emehex · 1h ago
So 120B was Horizon Alpha and 20B was Horizon Beta?
ImprobableTruth · 50m ago
Unfortunately not, this model is noticeably worse. I imagine Horizon is either GPT-5 nano or mini.
jedisct1 · 55m ago
For some reason I'm less excited about this than I was with the Qwen models.
minimaxir · 1h ago
I'm disappointed that the smallest model size is 21B parameters, which strongly restricts how it can be run on personal hardware. Most competitors have released a 3B/7B model for that purpose.

For self-hosting, it's smart that they targeted a 16GB VRAM config for it since that's the size of the most cost-effective server GPUs, but I suspect "native MXFP4 quantization" has quality caveats.

hnuser123456 · 13m ago
Native FP4 quantization means it requires half as many bytes as parameters, and will have next to zero quality loss (on the order of 0.1%) compared to using twice the VRAM and exponentially more expensive hardware. FP3 and below gets messier.
strangecasts · 44m ago
A small part of me is considering going from a 4070 to a 16GB 5060 Ti just to avoid having to futz with offloading

I'd go for an ..80 card but I can't find any that fit in a mini-ITX case :(

4b6442477b1280b · 1h ago
with quantization, 20B fits effortlessly in 24GB

with quantization + CPU offloading, non-thinking models run kind of fine (at about 2-5 tokens per second) even with 8 GB of VRAM

sure, it would be great if we could have models in all sizes imaginable (7/13/24/32/70/100+/1000+), but 20B and 120B are great.

moffkalast · 1h ago
Eh, 20B is pretty manageable; 32GB of regular RAM and some VRAM will run you a 30B with partial offloading. After that it gets tricky.
Tostino · 57m ago
I am not at all disappointed. I'm glad they decided to go for somewhat large but reasonable to run models on everything but phones.

Quite excited to give this a try

incomingpain · 1h ago
I dont see the unsloth files yet but they'll be here: https://huggingface.co/unsloth/gpt-oss-20b-GGUF

Super excited to test these out.

The benchmarks from 20B are blowing away major >500b models. Insane.

On my hardware.

43 tokens/sec.

I got an error with flash attention turned on. Can't run it with flash attention?

31,000 context is the max it will allow or the model won't load.

No KV or V quantization.

mikert89 · 1h ago
ACCELERATE
kingkulk · 56m ago
Welcome to the future!
hubraumhugo · 1h ago
Meta's goal with Llama was to target OpenAI with a "scorched earth" approach by releasing powerful open models to disrupt the competitive landscape. Looks like OpenAI is now using the same playbook.
tempay · 1h ago
It seems like the various Chinese companies are far outplaying Meta at that game. It remains to be seen if they’re able to throw money at the problem to turn things around.
kgwgk · 52m ago
It may be useless for many use cases given that its policy prevents it, for example, from providing "advice or instructions about how to buy something."

(I included details about its refusal to answer even after using tools for web searching but hopefully shorter comment means fewer downvotes.)

DSingularity · 1h ago
Ha. Secure funding and proceed to immediately make a decision that would likely conflict viscerally with investors.
4b6442477b1280b · 1h ago
their promise to release an open weights model predates this round of funding by, iirc, over half a year.
DSingularity · 1h ago
Yeah but they never released until now.
hnuser123456 · 1h ago
Maybe someone got tired of waiting and paid them to release something actually open
hnuser123456 · 1h ago
Text only, when local multimodal became table stakes last year.
ebiester · 1h ago
Honestly, it's a tradeoff. If you can reduce the size and make a higher quality in specific tasks, that's better than a generalist that can't run on a laptop or can't compete at any one task.

We will know soon the actual quality as we go.

greenavocado · 41m ago
That's what I thought too until Qwen-Image was released
BoorishBears · 1h ago
The community can always figure out hooking it up to other modalities.

Native might be better, but no native multimodal model is very competitive yet, so better to take a competitive model and latch on vision/audio

MutedEstate45 · 1h ago
The repeated safety testing delays might not be purely about technical risks like misuse or jailbreaks. Releasing open weights means relinquishing the control OpenAI has had since GPT-3. No rate limits, no enforceable RLHF guardrails, no audit trail. Unlike API access, open models can't be monitored or revoked. So safety may partly reflect OpenAI's internal reckoning with that irreversible shift in power, not just model alignment per se. What do you guys think?
BoorishBears · 1h ago
I think it's pointless: if you SFT even their closed source models on a specific enough task, the guardrails disappear.

AI "safety" is about making it so that a journalist can't get out a recipe for Tabun just by asking.

MutedEstate45 · 56m ago
True, but there's still a meaningful difference in friction and scale. With closed APIs, OpenAI can monitor for misuse, throttle abuse and deploy countermeasures in real-time. With open weights, a single prompt jailbreak or exploit spreads instantly. No need for ML expertise, just a Reddit post.

The risk isn’t that bad actors suddenly become smarter. It’s that anyone can now run unmoderated inference and OpenAI loses all visibility into how the model’s being used or misused. I think that’s the control they’re grappling with under the label of safety.

BoorishBears · 18m ago
OpenAI and Azure both have zero retention options, and the NYT saga has given pretty strong confirmation they meant it when they said zero.