In my mind, I’m comparing the model architecture they describe to what the leading open-weights models (Deepseek, Qwen, GLM, Kimi) have been doing. Honestly, it just seems “ok” at a technical level:
- both models use standard Grouped-Query Attention (64 query heads, 8 KV heads). The card talks about how they've used an older optimization from GPT-3: alternating between banded-window (sparse, 128-token) and fully dense attention patterns. It uses RoPE extended with YaRN (for a 131K context window). So they haven't been taking advantage of the special-sauce Multi-head Latent Attention from Deepseek, or any of the other similar improvements over GQA.
- both models are standard MoE transformers. The 120B model (116.8B total, 5.1B active) uses 128 experts with Top-4 routing (a generic sketch of this kind of routing follows at the end of this comment). They're using some kind of Gated SwiGLU activation, which the card describes as "unconventional" because of its clamping and whatever residual connections that implies. Again, not using any of Deepseek's "shared experts" (for general patterns) + "routed experts" (for specialization) architectural improvements, Qwen's load-balancing strategies, etc.
- the most interesting thing IMO is probably their quantization solution. They did something to quantize >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model fit on a single 80GB GPU, which is pretty cool. But we've also got Unsloth with their famous 1.58bit quants :)
All this to say, it seems like even though the training they did for their agentic behavior and reasoning is undoubtedly very good, they’re keeping their actual technical advancements “in their pocket”.
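For context, here's roughly what Top-4 expert routing boils down to in PyTorch. This is a generic illustration of the pattern, not OpenAI's actual code; details like where the softmax sits and how load balancing is handled differ between labs:

```
import torch
import torch.nn.functional as F

def moe_layer(x, router, experts, top_k=4):
    """x: (tokens, d_model); router: nn.Linear(d_model, n_experts);
    experts: list of small feed-forward blocks (e.g. gated SwiGLU MLPs)."""
    logits = router(x)                                # (tokens, n_experts)
    weights, chosen = torch.topk(logits, top_k, -1)   # pick 4 experts per token
    weights = F.softmax(weights, dim=-1)              # renormalize over the chosen 4
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e, expert in enumerate(experts):
            mask = chosen[:, slot] == e               # tokens routed to expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out
```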
tgtweak · 3m ago
I think their MXFP4 release is a bit of a gift since they obviously used and tuned this extensively as a result of cost-optimization at scale - something the open source model providers aren't doing too much, and also somewhat of a competitive advantage.
Unsloth's special quants are amazing but I've found there to be lots of trade-offs vs full quantization, particularly when striving for best first-shot attempts - which is by far the bulk of LLM use cases. Running a better (larger, newer) model at lower quantization to fit in memory, or with reduced accuracy/detail to speed it up, both have value, but in the pursuit of first-shot accuracy there don't seem to be many companies running their frontier models at reduced quantization. If OpenAI is doing this in production, that is interesting.
rfoo · 1h ago
Or, you can say, OpenAI has some real technical advancements on stuff besides attn architecture. GQA8, alternating SWA 128 / full attn do all seem conventional. Basically they are showing us that "no secret sauce in model arch, you guys just suck at mid/post-training", or they want us to believe this.
The model is pretty sparse tho, 32:1.
liuliu · 1h ago
Kimi K2 paper said that the model sparsity scales up with parameters pretty well (MoE sparsity scaling law, as they call it, basically calling Llama 4 MoE "done wrong"). Hence K2 has 128:1 sparsity.
Also: attention sinks (although implemented as extra trained logits used in attention softmax rather than attending to e.g. a prepended special token).
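For anyone curious, here's a minimal sketch of that "extra trained logit" flavor of attention sink as I understand it (illustrative PyTorch only, not the released implementation); the sink absorbs softmax probability mass without contributing anything to the output values:

```
import torch

def attention_probs_with_sink(scores, sink_logit):
    """scores: (..., q_len, kv_len) raw attention logits.
    sink_logit: learned tensor of shape (1,) (per head in practice)."""
    sink = sink_logit.expand(*scores.shape[:-1], 1)            # (..., q_len, 1)
    probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    return probs[..., :-1]   # drop the sink column; each row now sums to <= 1
```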
logicchains · 1h ago
>They did something to quantize >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model to fit on a single 80GB GPU, which is pretty cool
They said it was native FP4, suggesting that they actually trained it like that; it's not post-training quantisation.
rushingcreek · 38m ago
The native FP4 is one of the most interesting architectural aspects here IMO, as going below FP8 is known to come with accuracy tradeoffs. I'm curious how they navigated this and how FP8 weights (if they exist) would have performed.
x187463 · 2h ago
Running a model comparable to o3 on a 24GB Mac Mini is absolutely wild. Seems like yesterday the idea of running frontier (at the time) models locally or on a mobile device was 5+ years out. At this rate, we'll be running such models in the next phone cycle.
tedivm · 2h ago
It only seems like that if you haven't been following other open source efforts. Models like Qwen perform ridiculously well and do so on very restricted hardware. I'm looking forward to seeing benchmarks to see how these new open source models compare.
Rhubarrbb · 2h ago
Agreed, these models seem relatively mediocre compared to Qwen3 / GLM 4.5
modeless · 2h ago
Nah, these are much smaller models than Qwen3 and GLM 4.5 with similar performance. Fewer parameters and fewer bits per parameter. They are much more impressive and will run on garden variety gaming PCs at more than usable speed. I can't wait to try on my 4090 at home.
There's basically no reason to run other open source models now that these are available, at least for non-multimodal tasks.
tedivm · 1h ago
Qwen3 has multiple variants ranging from larger (230B) than these models to significantly smaller (0.6b), with a huge number of options in between. For each of those models they also release quantized versions (your "fewer bits per parameter").
I'm still withholding judgement until I see benchmarks, but every point you tried to make regarding model size and parameter size is wrong. Qwen has more variety on every level, and performs extremely well. That's before getting into the MoE variants of the models.
modeless · 1h ago
The benchmarks of the OpenAI models are comparable to the largest variants of other open models. The smaller variants of other open models are much worse.
mrbungie · 1h ago
I would wait for neutral benchmarks before making any conclusions.
bigyabai · 24m ago
With all due respect, you need to actually test out Qwen3 2507 or GLM 4.5 before making these sorts of claims. Both of them are comparable to OpenAI's largest models and even bench favorably to Deepseek and Opus: https://cdn-uploads.huggingface.co/production/uploads/62430a...
It's cool to see OpenAI throw their hat in the ring, but you're smoking straight hopium if you think there's "no reason to run other open source models now" in earnest. If OpenAI never released these models, the state-of-the-art would not look significantly different for local LLMs. This is almost a nothingburger if not for the simple novelty of OpenAI releasing an Open AI for once in their life.
modeless · 2m ago
> Both of them are comparable to OpenAI's largest models and even bench favorably to Deepseek and Opus
So are/do the new OpenAI models, except they're much smaller and faster.
sourcecodeplz · 26m ago
From my initial web developer test on https://www.gpt-oss.com/ the 120b is kind of meh. Even qwen3-coder 30b-a3b is better. Have to test more.
moralestapia · 1h ago
You can always get your $0 back.
Imustaskforhelp · 1h ago
I have never agreed with a comment so much but we are all addicted to open source models now.
satvikpendem · 1h ago
Depends on how much you paid for the hardware to run em on
echelon · 1h ago
This might mean there's no moat for anything.
Kind of a P=NP, but for software deliverability.
CamperBob2 · 1h ago
On the subject of who has a moat and who doesn't, it's interesting to look at the role of patents in the early development of wireless technology. There was WWI, and there was WWII, but the players in the nascent radio industry had serious beef with each other.
I imagine the same conflicts will ramp up over the next few years, especially once the silly money starts to dry up.
larodi · 2m ago
We be running them in PIs off spare juice in no time, and they be billions given how chips and embedded spreads…
a_wild_dandan · 2h ago
Right? I still remember the safety outrage of releasing Llama. Now? My 96 GB of (V)RAM MacBook will be running a 120B parameter frontier lab model. So excited to get my hands on the MLX quants and see how it feels compared to GLM-4.5-air.
4b6442477b1280b · 2h ago
in that era, OpenAI and Anthropic were still deluding themselves into thinking they would be the "stewards" of generative AI, and the last US administration was very keen on regoolating everything under the sun, so "safety" was just an angle for regulatory capture.
God bless China.
a_wild_dandan · 1h ago
Oh absolutely, AI labs certainly talk their books, including any safety angles. The controversy/outrage extended far beyond those incentivized companies too. Many people had good faith worries about Llama. Open-weight models are now vastly more powerful than Llama-1, yet the sky hasn't fallen. It's just fascinating to me how apocalyptic people are.
I just feel lucky to be around in what's likely the most important decade in human history. Shit odds on that, so I'm basically a lotto winner. Wild times.
vlmutolo · 19m ago
About 7% of people who have ever lived are alive today. Still pretty lucky, but not quite winning the lottery.
ipaddr · 53m ago
"the most important decade in human history."
Lol. To be young and foolish again. This covid laced decade is more of a placeholder. The current decade is always the most meaningful until the next one. The personal computer era, the first cars or planes, ending slavery needs to take a backseat to the best search engine ever. We are at the point where everyone is planning on what they are going to do with their hoverboards.
Slavery is still legal and widespread in most of the US, including California.
There was a ballot measure to actually abolish slavery a year or so back. It failed miserably.
BizarroLand · 4m ago
The slavery of free humans is illegal in America, so now the big issue is figuring out how to convince voters that imprisoned criminals deserve rights.
Even in liberal states, the dehumanization of criminals is an endemic behavior, and we are reaching the point in our society where ironically having the leeway to discuss the humane treatment of even our worst criminals is becoming an issue that affects how we see ourselves as a society before we even have a framework to deal with the issue itself.
What one side wants is for prisons to be for rehabilitation and societal reintegration, for prisoners to have the right to decline to work and to be paid fair wages from their labor. They further want to remove for-profit prisons from the equation completely.
What the other side wants is the acknowledgement that prisons are not free, they are for punishment, and that prisoners have lost some of their rights for the duration of their incarceration and that they should be required to provide labor to offset the tax burden of their incarceration on the innocent people that have to pay for it. They also would like it if all prisons were for-profit as that would remove the burden from the tax payers and place all of the costs of incarceration onto the shoulders of the incarcerated.
Both sides have valid and reasonable wants from their vantage point while overlooking the valid and reasonable wants from the other side.
dingnuts · 29m ago
you can say the same shit about machine learning but ChatGPT was still the Juneteenth of AI
4b6442477b1280b · 1h ago
>Many people had good faith worries about Llama.
ah, but that begs the question: did those people develop their worries organically, or did they simply consume the narrative heavily pushed by virtually every mainstream publication?
the journos are heavily incentivized to spread FUD about it. they saw the writing on the wall that the days of making a living by producing clickbait slop were coming to an end and deluded themselves into thinking that if they kvetch enough, the genie will crawl back into the bottle. scaremongering about sci-fi skynet bullshit didn't work, so now they kvetch about joules and milliliters consumed by chatbots, as if data centers did not exist until two years ago.
likewise, the bulk of other "concerned citizens" are creatives who use their influence to sway their followers, still hoping against hope to kvetch this technology out of existence.
honest-to-God yuddites are as few and as retarded as honest-to-God flat earthers.
narrator · 1h ago
Yeah, China is e/acc. Nice cheap solar panels too. Thanks China. The problem is their ominous policies like not allowing almost any immigration, and their domestic Han Supremacist propaganda, and all that make it look a bit like this might be Han Supremacy e/acc. Is it better than western/decel? Hard to say, but at least the western/decel people are now starting to talk about building power plants, at least for datacenters, and things like that instead of demanding whole branches of computer science be classified, as they were threatening to Marc Andreessen when he visited the Biden admin last year.
01HNNWZ0MV43FF · 58m ago
I wish we had voter support for a hydrocarbon tax, though. It would level out the prices and then the AI companies can decide whether they want to pay double to burn pollutants or invest in solar and wind and batteries
Imustaskforhelp · 1h ago
Okay, I will be honest, I was so hyped up about this model, but then I went to localllama and saw that:
Qwen3 Coder is 4x its size! Grok 3 is over 22x its size!
What does the resource usage look like for GLM 4.5 Air? Is that benchmark in FP16? GPT-OSS-120B will be using between 1/4 and 1/2 the VRAM that GLM-4.5 Air does, right?
It seems like a good showing to me, even though Qwen3 Coder and GLM 4.5 Air might be preferable for some use cases.
ascorbic · 1h ago
That's SVGBench, which is a useful benchmark but isn't much of a test of general coding
Imustaskforhelp · 47m ago
Hm, alright, I will see how this model actually plays out instead of forming quick opinions.
Thanks.
logicchains · 1h ago
It's only got around 5 billion active parameters; it'd be a miracle if it was competitive at coding with SOTA models that have significantly more.
jph00 · 39m ago
On this bench it underperforms vs glm-4.5-air, which is an MoE with fewer total params but more active params.
bogtog · 2h ago
When people talk about running a (quantized) medium-sized model on a Mac Mini, what types of latency and throughput times are they talking about? Do they mean like 5 tokens per second or at an actually usable speed?
davio · 1h ago
On a M1 MacBook Air with 8GB, I got this running Gemma 3n:
12.63 tok/sec • 860 tokens • 1.52s to first token
I'm amazed it works at all with such limited RAM
v5v3 · 54m ago
I have started a crowdfunding to get you a MacBook air with 16gb. You poor thing.
bookofjoe · 31m ago
Up the ante with an M4 chip
backscratches · 18m ago
not meaningfully different, m1 virtually as fast as m4
Generation is usually fast, but prompt processing is the main limitation with local agents. I also have a 128 GB M4 Max. How is the prompt processing on long prompts? processing the system prompt for Goose always takes quite a while for me. I haven't been able to download the 120B yet, but I'm looking to switch to either that or the GLM-4.5-Air for my main driver.
ghc · 43m ago
Here's a sample of running the 120b model on Ollama with my MBP:
```
total duration: 1m14.16469975s
load duration: 56.678959ms
prompt eval count: 3921 token(s)
prompt eval duration: 10.791402416s
prompt eval rate: 363.34 tokens/s
eval count: 2479 token(s)
eval duration: 1m3.284597459s
eval rate: 39.17 tokens/s
```
anonymoushn · 59m ago
it's odd that the result of this processing cannot be cached.
lostmsu · 49m ago
It can be and it is by most good processing frameworks.
Davidzheng · 1h ago
the active param count is low so it should be fast.
a_wild_dandan · 1h ago
GLM-4.5-air produces tokens far faster than I can read on my MacBook. That's plenty fast enough for me, but YMMV.
tyho · 2h ago
What's the easiest way to get these local models browsing the web right now?
dizhn · 2h ago
aider uses Playwright. I don't know what everybody is using but that's a good starting point.
ClassAndBurn · 1h ago
Open models are going to win long-term. Anthropic's own research has to use OSS models [0]. China is demonstrating how quickly companies can iterate on open models, allowing smaller teams access and augmentation to the abilities of a model without paying the training cost.
My personal prediction is that the US foundational model makers will OSS something close to N-1 for the next 1-3 iterations. The CAPEX for the foundational model creation is too high to justify OSS for the current generation. Unless the US Gov steps up and starts subsidizing power, or Stargate does 10x what is planned right now.
N-1 model value depreciates insanely fast. Making an OSS release of them and allowing specialized use cases and novel developments allows potential value to be captured and integrated into future model designs. It's medium risk, as you may lose market share. But also high potential value, as the shared discoveries could substantially increase the velocity of next-gen development.
There will be a plethora of small OSS models. Iteration on the OSS releases is going to be biased towards local development, creating more capable and specialized models that work on smaller and smaller devices. In an agentic future, every different agent in a domain may have its own model. Distilled and customized for its use case without significant cost.
Everyone is racing to AGI/SGI. The models along the way are to capture market share and use data for training and evaluations. Once someone hits AGI/SGI, the consumer market is nice to have, but the real value is in novel developments in science, engineering, and every other aspect of the world.
I'm pretty sure there's no reason that Anthropic has to do research on open models, it's just that they produced their result on open models so that you can reproduce their result on open models without having access to theirs.
Adrig · 1h ago
I'm a layman but it seemed to me that the industry is going towards robust foundational models on which we plug tools, databases, and processes to expand their capabilities.
In this setup OSS models could be more than enough and capture the market, but I don't see where the value would be in a multitude of specialized models we have to train.
renmillar · 44m ago
There's no reason that models too large for consumer hardware wouldn't keep a huge edge, is there?
lukax · 1h ago
Inference in Python uses harmony [1] (for request and response format) which is written in Rust with Python bindings. Another OpenAI's Rust library is tiktoken [2], used for all tokenization and detokenization. OpenAI Codex [3] is also written in Rust. It looks like OpenAI is increasingly adopting Rust (at least for inference).
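tiktoken in particular is easy to poke at from Python. A quick sketch (I'm assuming an o200k-family encoding here for illustration; check the harmony repo for the exact vocabulary gpt-oss ships with):

```
import tiktoken  # Rust core, Python bindings

enc = tiktoken.get_encoding("o200k_base")  # assumption: gpt-oss uses an o200k-family vocab
tokens = enc.encode("Hello, gpt-oss!")
print(tokens)
print(enc.decode(tokens))
```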
So this confirms a best-in-class model release within the next few days?
From a strategic perspective, I can't think of any reason they'd release this unless they were about to announce something which totally eclipses it?
ticulatedspline · 2h ago
Even without an imminent release it's a good strategy. They're getting pressure from Qwen and other high-performing open-weight models. Without a horse in the race they could fall behind in an entire segment.
There's future opportunity in licensing, tech support, agents, or even simply to dominate and eliminate. Not to mention brand awareness: if you like these, you might be more likely to approach their brand for larger models.
How much hype do we anticipate with the release of GPT-5, or whatever it ends up being called? And how many new features?
selectodude · 51m ago
Excited to have to send them a copy of my drivers license to try and use it. That’ll take the hype down a notch.
bredren · 2h ago
Undoubtedly. It would otherwise reduce the perceived value of their current product offering.
The question is how much better the new model(s) will need to be on the metrics given here to feel comfortable making these available.
Despite the loss of face for lack of open model releases, I do not think that was a big enough problem to undercut commercial offerings.
og_kalu · 2h ago
Even before today, the last week or so, it's been clear for a couple reasons, that GPT-5's release was imminent.
logicchains · 2h ago
> I can't think of any reason they'd release this unless they were about to announce something which totally eclipses it
Given it's only around 5 billion active params it shouldn't be a competitor to o3 or any of the other SOTA models, given the top Deepseek and Qwen models have around 30 billion active params. Unless OpenAI somehow found a way to make a model with 5 billion active params perform as well as one with 4-8 times more.
timmg · 2h ago
Orthogonal, but I just wanted to say how awesome Ollama is. It took 2 seconds to find the model and a minute to download and now I'm using it.
Kudos to that team.
henriquegodoy · 1h ago
Seeing a 20B model competing with o3's performance is mind-blowing. Just a year ago, most of us would've called this impossible - not just the intelligence leap, but getting this level of capability in such a compact size.
I think that the point that makes me more excited is that we can train trillion-parameter giants and distill them down to just billions without losing the magic. Imagine coding with Claude 4 Opus-level intelligence packed into a 10B model running locally at 2000 tokens/sec - like instant AI collaboration. That would fundamentally change how we develop software.
coolspot · 45m ago
10B * 2000 t/s = 20,000 GB/s memory bandwidth.
Apple hardware can do 1k GB/s.
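(Back-of-envelope behind that number, assuming a dense 10B model at roughly one byte per parameter, with every weight streamed once per generated token:)

```
# back-of-envelope only: dense 10B params, ~1 byte/param (FP8-ish),
# every weight read once per generated token
params = 10e9
bytes_per_param = 1
tokens_per_s = 2000
print(params * bytes_per_param * tokens_per_s / 1e12, "TB/s")  # 20.0 TB/s
```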
artembugara · 2h ago
Disclaimer: probably dumb questions
so, the 20b model.
Can someone explain to me what I would need to do in terms of resources (GPU, I assume) if I want to run 20 concurrent processes, assuming I need 1k tokens/second throughput (on each, so 20 x 1k)
Also, is this model better than or comparable to gpt-4.1-nano for information extraction, and would it be cheaper to host the 20b myself?
mlyle · 2h ago
An A100 is probably 2-4k tokens/second on a 20B model with batched inference.
Multiply the number of A100's you need as necessary.
Here, you don't really need the ram. If you could accept fewer tokens/second, you could do it much cheaper with consumer graphics cards.
Even with A100, the sweet-spot in batching is not going to give you 1k/process/second. Of course, you could go up to H100...
PeterStuer · 1h ago
(answer for 1 inference)
All depends on the context length you want to support as the activation memory will dominate the requirements. For 4096 tokens you will get away with 24GB (or even 16GB), but if you want to go for the full 131072 tokens you are not going to get there with a 32GB consumer GPU like the 5090. You'll need to spring for at minimum an A6000 (48GB) or preferably an RTX 6000 Pro (96GB).
Also keep in mind this model does use 4-bit layers for the MoE parts. Unfortunately native accelerated 4-bit support only started with Blackwell on NVIDIA. So your 3090/4090/A6000/A100's are not going to be fast. An RTX 5090 will be your best starting point in the traditional card space. Maybe the unified-memory mini PCs like the Spark systems or the Mac mini could be an alternative, but I do not know them well enough.
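To put rough numbers on why context dominates, here's an illustrative KV-cache estimate. The layer/head/head-dim values below are assumptions for the sake of the arithmetic, not the official config, and the alternating 128-token sliding-window layers would shrink the real figure:

```
# illustrative KV-cache sizing, NOT the official gpt-oss config values
n_layers, n_kv_heads, head_dim, bytes_per_elem = 36, 8, 64, 2  # fp16 cache

def kv_cache_gb(context_tokens):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return context_tokens * per_token / 1e9

print(f"{kv_cache_gb(4_096):.2f} GB")    # ~0.30 GB at 4k context
print(f"{kv_cache_gb(131_072):.2f} GB")  # ~9.66 GB at 131k context, on top of the weights
```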
vl · 1h ago
How Macs compare to RTXs for this? I.e. what numbers can be expected from Mac mini/Mac Studio with 64/128/256/512GB of unified memory?
mythz · 2h ago
gpt-oss:20b is ~14GB on disk [1] so fits nicely within a 16GB VRAM card.
You also need space in VRAM for what is required to support the context window; you might be able to do a model that is 14GB in parameters with a small (~8k maybe?) context window on a 16GB card.
artembugara · 2h ago
thanks, this part is clear to me.
but I need to understand 20 x 1k token throughput
I assume it just might be too early to know the answer
Tostino · 2h ago
I legitimately cannot think of any hardware that will get you to that throughput over that many streams with any of the hardware I know of (I don't work in the server space so there may be some new stuff I am unaware of).
artembugara · 2h ago
oh, I totally understand that I'd need multiple GPUs. I'd just want to know what GPU specifically and how many
Tostino · 2h ago
I don't think you can get 1k tokens/sec on a single stream using any consumer grade GPUs with a 20b model. Maybe you could with H100 or better, but I somewhat doubt that.
My 2x 3090 setup will get me ~6-10 streams of ~20-40 tokens/sec (generation) ~700-1000 tokens/sec (input) with a 32b dense model.
> assuming I need 1k tokens/second throughput (on each, so 20 x 1k)
3.6B activated at Q8 x 1000 t/s = 3.6TB/s just for activated model weights (there's also context). So pretty much straight to B200 and alike. 1000 t/s per user/agent is way too fast, make it 300 t/s and you could get away with 5090/RTX PRO 6000.
spott · 2h ago
Groq is offering 1k tokens per second for the 20B model.
You are unlikely to match groq on off the shelf hardware as far as I'm aware.
edit: Now Cerebras too at 3,815 tps for $0.25/M in / $0.69/M out.
podnami · 2h ago
Wow this was actually blazing fast. I prompted "how can the 45th and 47th presidents of america share the same parents?"
On ChatGPT.com o3 thought for 13 seconds, on OpenRouter GPT OSS 120B thought for 0.7 seconds - and they both had the correct answer.
swores · 2h ago
I'm not sure that's a particularly good question for concluding something positive about the "thought for 0.7 seconds" - it's such a simple answer, ChatGPT 4o (with no thinking time) immediately answered correctly. The only surprising thing in your test is that o3 wasted 13 seconds thinking about it.
Workaccount2 · 2h ago
A current major outstanding problem with thinking models is how to get them to think an appropriate amount.
dingnuts · 18m ago
The providers disagree. You pay per token. Verbose models are the most profitable. Have fun!
xpe · 8m ago
How many people are discussing this after one person did 1 prompt with 1 data point for each model and wrote a comment?
What is being measured here? For end-to-end time, one model is:
I am not kidding but such progress from a technological point of view is just fascinating!
nisegami · 2h ago
Interesting choice of prompt. None of the local models I have in ollama (consumer mid range gpu) were able to get it right.
golergka · 1h ago
When I pay attention to o3 CoT, I notice it spends a few passes thinking about my system prompt. Hard to imagine this question is hard enough to spend 13 seconds on.
modeless · 55m ago
I really want to try coding with this at 2600 tokens/s (from Cerebras). Imagine generating thousands of lines of code as fast as you can prompt. If it doesn't work who cares, generate another thousand and try again! And at $.69/M tokens it would only cost $6.50 an hour.
sigmar · 2h ago
Non-rhetorically, why would someone pay for o3 api now that I can get this open model from openai served for cheaper? Interesting dynamic... will they drop o3 pricing next week (which is 10-20x the cost[1])?
Not even that, even if o3 being marginally better is important for your task (let's say) why would anyone use o4-mini? It seems almost 10x the price and same performance (maybe even less): https://openrouter.ai/openai/o4-mini
tekacs · 1h ago
I apologize for linking to Twitter, but I can't post a video here, so:
Asking it about a marginally more complex tech topic and getting an excellent answer in ~4 seconds, reasoning for 1.1 seconds...
I am _very_ curious to see what GPT-5 turns out to be, because unless they're running on custom silicon / accelerators, even if it's very smart, it seems hard to justify not using these open models on Groq/Cerebras for a _huge_ fraction of use-cases.
... today, this is a real-time video of the OSS thinking models by OpenAI on Groq and I'd have to slow it down to be able to read it. Wild.
spott · 2h ago
It is interesting that openai isn't offering any inference for these models.
bangaladore · 1h ago
Makes sense to me. Inference on these models will be a race to the bottom. Hosting inference themselves will be a waste of compute / dollar for them.
gnulinux · 2h ago
Wow, that's significantly cheaper than o4-mini, which seems to be on par with gpt-oss-120b. ($1.10/M input tokens, $4.40/M output tokens) Almost 10x the price.
LLMs are getting cheaper much faster than I anticipated. I'm curious if it's still the hype cycle and Groq/Fireworks/Cerebras are taking a loss here, or whether things are actually getting cheaper. At this rate we'll be able to run Qwen3-32B-level models on phones/embedded soon.
tempaccount420 · 2h ago
It's funny because I was thinking the opposite, the pricing seems way too high for a 5B parameter activation model.
gnulinux · 2h ago
Sure, you're right, but if I can squeeze o4-mini-level utility out of it at less than a quarter of the price, does it really matter?
mikepurvis · 2h ago
Are the prices staying aligned to the fundamentals (hardware, energy), or is this a VC-funded land grab pushing prices to the bottom?
IceHegel · 2h ago
Listed performance of ~5 points less than o3 on benchmarks is pretty impressive.
Wonder if they feel the bar will be raised soon (GPT-5) and feel more comfortable releasing something this strong.
jakozaur · 2h ago
Coding seems to be one of the strongest use cases for LLMs, though currently they are eating too many tokens to be profitable. So perhaps these local models could offload some tasks to local computers.
E.g. Hybrid architecture. Local model gathers more data, runs tests, does simple fixes, but frequently asks the stronger model to do the real job.
Local model gathers data using tools and sends more data to the stronger model.
It
Imustaskforhelp · 2h ago
I have always thought that if we can somehow get an AI which is insanely good at coding, so much so that it can improve itself, then through continuous improvements, they will get better models of everything else idk
Maybe you guys call it AGI, so anytime I see progress in coding, I think it goes just a tiny bit towards the right direction
Plus it also helps me as a coder to actually do some stuff just for the fun. Maybe coding is the only truly viable use of AI and all others are negligible increases.
There is so much polarization around the use of AI in coding, but I just want to say this: it would be pretty ironic if an industry which automates others' jobs is this time the first to get its own job automated.
But I don't see that as happening, far from it. But still each day something new, something better happens back to back. So yeah.
NitpickLawyer · 2h ago
Not to open that can of worms, but in most definitions self-improvement is not an AGI requirement. That's already ASI territory (Super Intelligence). That's the proverbial skynet (pessimists) or singularity (optimists).
Imustaskforhelp · 1h ago
Hmm, my bad. Yeah, I always thought that it was the endgame of humanity, but isn't AGI supposed to be that (the endgame)?
What would AGI mean, solving some problem that it hasn't seen? or what exactly? I mean I think AGI is solved, no?
If not, I see people mentioning that horizon alpha is actually a gpt 5 model and its predicted to release on thursday on some betting market, so maybe that fits AGI definition?
hooverd · 2h ago
Optimistically, there's always more crap to get done.
jona777than · 2h ago
I agree. It’s not improbable for there to be _more_ needs to meet in the future, in my opinion.
Humanity’s Last Exam: gpt-oss-120b (tools): 19.0%, gpt-oss-120b (no tools): 14.9%, Qwen3-235B-A22B-Thinking-2507: 18.2%
jasonjmcghee · 2h ago
Wow - I will give it a try then. I'm cynical about OpenAI minmaxing benchmarks, but still trying to be optimistic as this in 8bit is such a nice fit for apple silicon
modeless · 2h ago
Even better, it's 4 bit
amarcheschi · 2h ago
Glm 4.5 seems on par as well
thegeomaster · 2h ago
GLM-4.5 seems to outperform it on TauBench, too. And it's suspicious OAI is not sharing numbers for quite a few useful benchmarks (nothing related to coding, for example).
One positive thing I see is the number of parameters and size --- it will provide more economical inference than current open source SOTA.
lcnPylGDnU4H9OF · 2h ago
Was the Qwen model using tools for Humanity's Last Exam?
alphazard · 14m ago
I wonder if this is a PR thing, to save face after flipping the non-profit. "Look it's more open now". Or if it's more of a recruiting pipeline thing, like Google allowing k8s and bazel to be open sourced so everyone in the industry has an idea of how they work.
sabakhoj · 58m ago
Super excited to see these released!
Major points of interest for me:
- In the "Main capabilities evaluations" section, the 120b outperform o3-mini and approaches o4 on most evals. 20b model is also decent, passing o3-mini on one of the tasks.
- AIME 2025 is nearly saturated with large CoT
- CBRN threat levels kind of on par with other SOTA open source models. Plus, demonstrated good refusals even after adversarial fine tuning.
- Interesting to me how a lot of the safety benchmarking runs on trust, since methodology can't be published too openly due to counterparty risk.
Open weight models from OpenAI with performance comparable to that of o3 and o4-mini in benchmarks… well, I certainly wasn’t expecting that.
What’s the catch?
coreyh14444 · 2h ago
Because GPT-5 comes out later this week?
thimabi · 2h ago
It could be, but there’s so much hype surrounding the GPT-5 release that I’m not sure whether their internal models will live up to it.
For GPT-5 to dwarf these just-released models in importance, it would have to be a huge step forward, and I’m still doubting about OpenAI’s capabilities and infrastructure to handle demand at the moment.
rrrrrrrrrrrryan · 1h ago
It seems like a big part of GPT-5 will be that it will be able to intelligently route your request to the appropriate model variant.
Shank · 19m ago
That doesn’t sound good. It sounds like OpenAI will route my request to the cheapest model to them and the most expensive for me, with the minimum viable results.
jona777than · 2h ago
As a sidebar, I’m still not sure if GPT-5 will be transformative due to its capabilities as much as its accessibility. All it really needs to do to be highly impactful is lower the barrier of entry for the more powerful models. I could see that contributing to it being worth the hype. Surely it will be better, but if more people are capable of leveraging it, that’s just as revolutionary, if not more.
sebzim4500 · 2h ago
Surely OpenAI would not be releasing this now unless GPT-5 was much better than it.
NitpickLawyer · 2h ago
> What’s the catch?
Probably GPT5 will be way way better. If alpha/beta horizon are early previews of GPT5 family models, then coding should be > opus4 for modern frontend stuff.
logicchains · 2h ago
The catch is that it only has ~5 billion active params so should perform worse than the top Deepseek and Qwen models, which have around 20-30 billion, unless OpenAI pulled off a miracle.
mythz · 1h ago
Getting great performance running gpt-oss on 3x A4000's:
gpt-oss:20b = ~46 tok/s
More than 2x faster than my previous leading OSS models:
What's the best agent to run this on? Is it compatible with Codex? For OSS agents, I've been using Qwen Code (clunky fork of Gemini), and Goose.
siliconc0w · 1h ago
It seems like OSS will win, I can't see people willing to pay like 10x the price for what seems like 10% more performance. Especially once we get better at routing the hardest questions to the better models and then using that response to augment/fine-tune the OSS ones.
n42 · 1h ago
to me it seems like the market is breaking into an 80/20 of B2C/B2B; the B2C use case becoming OSS models (the market shifts to devices that can support them), and the B2B market being priced appropriately for businesses that require that last 20% of absolute cutting edge performance as the cloud offering
matznerd · 27m ago
Thanks OpenAI for being open ;) Surprised there are no official MLX versions and only one mention of MLX in this thread. MLX basically converts the models to take advantage of Mac unified memory for a 2-5x increase in performance, enabling Macs to run what would otherwise take expensive GPUs (within limits).
So FYI to anyone on a Mac, the easiest way to run these models right now is using LM Studio (https://lmstudio.ai/), it's free. You just search for the model; usually third-party groups mlx-community or lmstudio-community have MLX versions within a day or two of releases. I go for the 8-bit quantizations (4-bit is faster, but quality drops). You can also convert to MLX yourself...
Once you have it running on LM Studio, you can chat there in their chat interface, or you can run it through an API that defaults to http://127.0.0.1:1234
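That local endpoint speaks the OpenAI-compatible API, so once a model is loaded you can hit it like this (sketch; the model name is a placeholder, copy the exact identifier from LM Studio's model list):

```
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:1234/v1", api_key="lm-studio")  # key is ignored locally
resp = client.chat.completions.create(
    model="gpt-oss-20b",  # hypothetical name; use whatever LM Studio shows for your download
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp.choices[0].message.content)
```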
You can run multiple models that hot swap and load instantly and switch between them etc.
It's surprisingly easy, and fun. There are actually a lot of cool niche models coming out, like this tiny high-quality search model released today as well (and which has an official MLX version) https://huggingface.co/Intelligent-Internet/II-Search-4B
Other fun ones are Gemma 3n, which is multi-modal; a larger one that is actually a solid model but takes more memory is the new Qwen3 30B A3B (Coder and Instruct); Pixtral (Mixtral vision with full-resolution images); etc. Looking forward to playing with this model and seeing how it compares.
jcmontx · 1h ago
I'm out of the loop for local models. For my M3 24gb ram macbook, what token throughput can I expect?
Edit: I tried it out, I have no idea in terms of of tokens but it was fluid enough for me. A bit slower than using o3 in the browser but definitely tolerable. I think I will set it up in my GF's machine so she can stop paying for the full subscription (she's a non-tech professional)
dantetheinferno · 19m ago
Apple M4 Pro w/ 48GB running the smaller version. I'm getting 43.7t/s
steinvakt2 · 44m ago
Wondering about the same for my M4 max 128 gb
jcmontx · 28m ago
It should fly on your machine
coolspot · 38m ago
40 t/s
rmonvfer · 2h ago
What a day! Models aside, the Harmony Response Format[1] also seems pretty interesting and I wonder how much of an impact it might have in performance of these models.
Can't wait to see third party benchmarks. The ones in the blog post are quite sparse and it doesn't seem possible to fully compare to other open models yet. But the few numbers available seem to suggest that this release will make all other non-multimodal open models obsolete.
Newbie question: I remember folks talking about how Kimi K2's launch might have pushed OpenAI to launch their model later. Now that we (shortly will) know how this model performs, how do they stack up? Did OpenAI likely actually hold off releasing weights because of Kimi, in retrospect?
PeterStuer · 2h ago
I love how they frame High-end desktops and laptops as having "a single H100 GPU".
organsnyder · 1h ago
I read that as it runs in data centers (H100 GPUs) or high-end desktops/laptops (Strix Halo?).
dsco · 2h ago
Does anyone get the demos at https://www.gpt-oss.com to work, or are the servers down immediately after launch? I'm only getting the spinner after prompting.
lukasgross · 2h ago
(I helped build the microsite)
Our backend is falling over from the load, spinning up more resources!
Please don't use the open-source term unless you ship the TBs of data downloaded from Anna's Archive that are required to build it yourself. And don't forget all the system prompts to censor the multiple topics that they don't want you to see.
someperson · 2h ago
Keep fighting the "open weights" terminology fight, because diluting the term open-source for a blob of neural network weights (even inference code is open-source) is not open-source.
Quarrel · 2h ago
Is your point really that "I need to see all data downloaded to make this model, before I can know it is open"? Do you have $XXB worth of GPU time to ingest that data with a state-of-the-art framework to make a model? I don't. Even if I did, I'm not sure FB or Google are in any better position to claim this model is or isn't open beyond the fact that the weights are there.
They're giving you a free model. You can evaluate it. You can sue them. But the weights are there. If you dislike the way they license the weights, because the license isn't open enough, then sure, speak up, but because you can't see all the training data??! Wtf.
ticulatedspline · 2h ago
To many people there's an important distinction between "open source" and "open weights". I agree with the distinction, open source has a particular meaning which is not really here and misuse is worth calling out in order to prevent erosion of the terminology.
Historically this would be like calling a free but closed-source application "open source" simply because the application is free.
layer8 · 2h ago
The parent’s point is that open weight is not the same as open source.
Rough analogy:
SaaS = AI as a service
Locally executable closed-source software = open-weight model
Open-source software = open-source model (whatever allows one to reproduce the model from training data)
NicuCalcea · 40m ago
I don't have the $XXbn to train a model, but I certainly would like to know what the training data consists of.
mhh__ · 2h ago
The system prompt is an inference parameter, no?
rvnx · 2h ago
I don’t know why you got so much downvoted, these models are not open-source/open-recipes. They are censored open weights models. Better than nothing, but far from being Open
a_vanderbilt · 8m ago
Most people don't really care all that much about the distinction. It comes across to them as linguistic pedantry and they downvote it to show they don't want to hear/read it.
outlore · 2h ago
by your definition most of the current open weight models would not qualify
robotmaxtron · 2h ago
Correct. I agree with them, most of the open weight models are not open source.
layer8 · 2h ago
That’s why they are called open weight and not open source.
NitpickLawyer · 2h ago
It's apache2.0, so by definition it's open source. Stop pushing for training data, it'll never happen, and there's literally 0 reason for it to happen (both theoretical and practical). Apache2.0 IS opensource.
jlokier · 4m ago
> It's apache2.0, so by definition it's open source.
That's not true by any of the open source definitions in common use.
Source code (and, optionally, derived binaries) under the Apache 2.0 license are open source.
But compiled binaries (without access to source) under the Apache 2.0 license are not open source, even though the license does give you some rights over what you can do with the binaries.
Normally the question doesn't come up, because it's so unusual and strange to ship closed-source binaries with an open source license. Descriptions of which licenses are open source licenses have the unstated assumption that you have access to the source - the "is it an open source license" question is about what you're allowed to do with the source.
The distinction is more obvious if you ask the same question about other open source licenses such as GPL or MPL. A compiled binary (without access to source) shipped with a GPL license is not by any stretch open source. Not only is it not in the "preferred form for editing" as the license requires, it's not even possible for someone who receives the file to give it to someone else and comply with the license. If someone who receives the file can't give it to anyone else (legally), then it's obviously not open source.
_flux · 1h ago
No, it's open weight. You wouldn't call applications with only Apache 2.0-licensed binaries "open source". The weights are not the "source code" of the model, they are the "compiled" binary, therefore they are not open source.
However, for the sake of argument let's say this release should be called open source.
Then what do you call a model that also comes with its training material and tools to reproduce the model? Is it also called open source, and is there no material difference between those two releases? Or perhaps those two different terms should be used for those two different kinds of releases?
If you say that actually open source releases are impossible now (for mostly copyright reasons I imagine), it doesn't mean that they will be perpetually so. For that glorious future, we can leave them space in the terminology by using the term open weight. It is also the term that should not be misleading to anyone.
organsnyder · 1h ago
What is the source that's open? Aren't the models themselves more akin to compiled code than to source code?
NitpickLawyer · 1h ago
No, not compiled code. Weights are hardcoded values. Code is the combination of model architecture + config + inferencing engine. You run inference based on the architecture (what and when to compute), using some hardcoded values (weights).
WhyNotHugo · 1h ago
It’s open source, but it’s a binary-only release.
It’s like getting a compiled software with an Apache license. Technically open source, but you can’t modify and recompile since you don’t have the source to recompile. You can still tinker with the binary tho.
NitpickLawyer · 1h ago
Weights are not binary. I have no idea why this is so often spread, it's simply not true. You can't do anything with the weights themselves, you can't "run" the weights.
You run inference (via a library) on a model using its architecture (config file), tokenizer (what and when to compute), based on weights (hardcoded values). That's it.
> but you can’t modify
Yes, you can. It's called finetuning. And, most importantly, that's exactly how the model creators themselves are "modifying" the weights! No sane lab is "recompiling" a model every time they change something. They perform a pre-training stage (feed everything and the kitchen sink), they get the hardcoded values (weights), and then they post-train using "the same" (well, maybe their techniques are better, but still the same concept) as you or I would. Just with more compute. That's it. You can do the exact same modifications, using basically the same concepts.
> don’t have the source to recompile
In pure practical ways, neither do the labs. Everyone that has trained a big model can tell you that the process is so finicky that they'd eat a hat if a big train session can be somehow made reproducible to the bit. Between nodes failing, datapoints ballooning your loss and having to go back, and the myriad of other problems, what you get out of a big training run is not guaranteed to be the same even with 100 - 1000 more attempts, in practice. It's simply the nature of training large models.
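Concretely, "running the weights" is just an inference library building the architecture from the config and loading the hardcoded values into it. A minimal sketch with transformers, assuming the weights live under a Hugging Face repo id like openai/gpt-oss-20b:

```
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "openai/gpt-oss-20b"  # assumed repo id, for illustration
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")  # config + weights

inputs = tok("Hello there", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```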
koolala · 1h ago
You can do a lot with a binary also. That's what game mods are all about.
davidw · 46m ago
Big picture, what's the balance going to look like, going forward between what normal people can run on a fancy computer at home vs heavy duty systems hosted in big data centers that are the exclusive domain of Big Companies?
This is something about AI that worries me, as a 'child' of the open source coming-of-age era in the '90s. I don't want to be forced to rely on those big companies to do my job in an efficient way, if AI becomes part of the day-to-day workflow.
sipjca · 35m ago
Isn’t it that hardware catches up and becomes cheaper? The margin on these chips right now is outrageous, but what happens as there is more competition? What happens when there is more supply? Are we overbuilding? Apple M series chips already perform phenomenally for this class of models and you bet both AMD and NVIDIA are playing with unified memory architectures too for the memory bandwidth. It seems like today’s really expensive stuff may become the norm rather than the exception. Assuming architectures lately stay similar and require large amounts of fast memory.
chromaton · 1h ago
This has been available (20b version, I'm guessing) for the past couple of days as "Horizon Alpha" on OpenRouter. My benchmarking runs with TianshuBench for coding and fluid intelligence were rate limited, but the initial results show worse results than DeepSeek R1 and Kimi K2.
ArtTimeInvestor · 2h ago
Why do companies release open source LLMs?
I would understand it, if there was some technology lock-in. But with LLMs, there is no such thing. One can switch out LLMs without any friction.
a_vanderbilt · 5m ago
At least in OpenAI's case, it raises the bar for potential competition while also implying that what they have behind the scenes is far better.
gnulinux · 2h ago
Name recognition? Advertisement? Federal grant to beat Chinese competition?
There could be many legitimate reasons, but yeah I'm very surprised by this too. Some companies take it a bit too seriously and go above and beyond too. At this point, unless you need the absolute SOTA models because you're throwing an LLM at an extremely hard problem, there is very little utility in using the larger providers. On OpenRouter, or by renting your own GPU, you can run on-par models for much cheaper.
TrackerFF · 1h ago
LLMs are terrible, purely speaking from the business economic side of things.
Frontier / SOTA models are barely profitable. Previous-gen models lose 90% of their value. Two gens back and they're worthless.
And given that their product life cycle is something like 6-12 months, you might as well open source them as part of sundowning them.
spongebobstoes · 52m ago
inference runs at a 30-40% profit
koolala · 1h ago
I don't because it would kill their data scraping business's competitive advantage.
mclau157 · 1h ago
Partially because using their own GPUs is expensive, so maybe offloading some GPU usage
The short version is that if you give a product to open source, they can and will donate time and money to improving your product, and the ecosystem around it, for free, and you get to reap those benefits. Llama has already basically won that space (the standard way of running open models is llama.cpp), so OpenAI have finally realized they're playing catch-up (and last quarter's SOTA isn't worth much revenue to them when there's a new SOTA, so they may as well give it away while it can still crack into the market)
chown · 2h ago
Shameless plug: if someone wants to try it in a nice ui, you could give Msty[1] a try. It's private and local.
Perhaps I missed it somewhere, but I find it frustrating that, unlike most other open weight models and despite this being an open release, OpenAI has chosen to provide pretty minimal transparency regarding model architecture and training. It's become the norm for Llama, Deepseek, Qwen, Mistral and others to provide a pretty detailed write-up on the model, which allows researchers to advance and compare notes.
gundawar · 2h ago
Their model card [0] has some information. It is quite a standard architecture though; it's always been that their alpha is in their internal training stack.
The model files contain an exact description of the architecture of the network, there isn't anything novel.
Given these new models are closer to the SOTA than they are to competing open models, this suggests that the 'secret sauce' at OpenAI is primarily about training rather than model architecture.
Hence why they won't talk about the training.
johntiger1 · 2h ago
Wow, this will eat Meta's lunch
asdev · 2h ago
Meta is so cooked, I think most enterprises will opt for OpenAI or Anthropic and others will host OSS models themselves or on AWS/infra providers.
a_wild_dandan · 2h ago
I'll accept Meta's frontier AI demise if they're in their current position a year from now. People killed Google prematurely too (remember Bard?), because we severely underestimate the catch-up power bought with ungodly piles of cash.
atonse · 1h ago
And boy, with the $250m offers to people, Meta is definitely throwing ungodly piles of cash at the problem.
But Apple is waking up too. So is Google. It's absolutely insane, the amount of money being thrown around.
a_vanderbilt · 50s ago
It's insane numbers like that that give me some concern for a bubble. Not because AI hits some dead end, but due to a plateau that shifts from aggressive investment to passive-but-steady improvement.
asdev · 1h ago
catching up gets exponentially harder as time passes. way harder to catch up to current models than it was to the first iteration of gpt-4
seydor · 2h ago
I believe their competition is from chinese companies , for some time now
BoorishBears · 2h ago
Maverick and Scout were not great, even with post-training in my experience, and then several Chinese models at multiple sizes made them kind of irrelevant (dots, Qwen, MiniMax)
If anything this helps Meta: another model to inspect/learn from/tweak etc. generally helps anyone making models
redox99 · 1h ago
There's nothing new here in terms of architecture. Whatever secret sauce is in the training.
mhh__ · 2h ago
They will clone it
nirav72 · 1h ago
I don't exactly have the ideal hardware to run locally - but just ran the 20b in LM Studio with a 3080 Ti (12GB VRAM) with some offloading to CPU. Ran a couple of quick code-generation tests. On average about 20 t/sec. But response quality was very similar or on par with ChatGPT o3 for the same code it outputted. So it's not bad.
jp1016 · 2h ago
I wish these models had minimum RAM, CPU, and GPU sizes listed on the site instead of "high-end" and "medium-end" PC.
pamelafox · 2h ago
Anyone tried running on a Mac M1 with 16GB RAM yet? I've never run higher than an 8GB model, but apparently this one is specifically designed to work well with 16 GB of RAM.
thimabi · 2h ago
It works fine, although with a bit more latency than non-local models. However, swap usage goes way beyond what I’m comfortable with, so I’ll continue to use smaller models for the foreseeable future.
Hopefully other quantizations of these OpenAI models will be available soon.
pamelafox · 1h ago
Update: I tried it out. It took about 8 seconds per token, and didn't seem to be using much of my GPU (MPU), but was using a lot of RAM. Not a model that I could use practically on my machine.
steinvakt2 · 38m ago
Did you run it the best way possible? I'm no expert, but I understand it can affect inference time greatly (which format/engine is used)
my very early first impression of the 20b model on ollama is that it is quite good, at least for the code I am working on; arguably good enough to drop a subscription or two
koolala · 1h ago
Calls them open-weight. Names them 'oss'. What does oss stand for?
Imustaskforhelp · 2h ago
Is this the same model (Horizon Beta) on openrouter or not?
Because I still see Horizon beta available with its codename on openrouter
ramoz · 35m ago
This is a solid enterprise strategy.
Frontier labs are incentivized to start breaching these distribution paths. This will evolve into large scale "intelligent infra" plays.
shpongled · 2h ago
I looked through their torch implementation and noticed that they are applying RoPE to both query and key matrices in every layer of the transformer - is this standard? I thought positional encodings were usually just added once at the first layer
m_ke · 2h ago
No they’re usually done at each attention layer.
shpongled · 1h ago
Do you know when this was introduced (or which paper)? AFAIK it's not that way in the original transformer paper, or BERT/GPT-2
spott · 1h ago
All the Llamas have done it (well, 2 and 3, and I believe 1, I don't know about 4). I think they have a citation for it, though it might just be the RoPE paper (https://arxiv.org/abs/2104.09864).
I'm not actually aware of any model that doesn't do positional embeddings on a per-layer basis (excepting BERT and the original transformer paper, and I haven't read the GPT2 paper in a while, so I'm not sure about that one either).
Scene_Cast2 · 1h ago
Should be in the RoPE paper. The OG transformers used additive sinusoidal embeddings, while RoPE does a pairwise (multiplicative) rotation.
There's also NoPE, I think SmolLM3 "uses NoPE" (aka doesn't use any positional stuff) every fourth layer.
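For reference, the pairwise rotation itself is tiny. A minimal sketch of the interleaved-pair variant (real implementations precompute the cos/sin tables and apply this to q and k inside every attention layer):

```
import torch

def apply_rope(x, positions, base=10000.0):
    """x: (..., seq, head_dim) queries or keys; positions: (seq,) token indices."""
    d = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # (d/2,)
    angles = positions.float()[:, None] * inv_freq[None, :]                # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                                    # pair up dims
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)                                             # back to (..., seq, d)
```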
Nimitz14 · 1h ago
This is normal. Rope was introduced after bert/gpt2
ukprogrammer · 1h ago
> we also introduced an additional layer of evaluation by testing an adversarially fine-tuned version of gpt-oss-120b
Ran gpt-oss:20b on an RTX 3090 (24 GB VRAM) through Ollama, here's my experience:
Basic Ollama calling through a POST endpoint works fine. However, structured output doesn't work. The model is insanely fast and good at reasoning.
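For reference, the plain POST call I mean looks roughly like this (sketch against a local Ollama server on its default port; "format" is Ollama's structured-output option, which is the part that misbehaved for me):

```
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",     # Ollama's default local endpoint
    json={
        "model": "gpt-oss:20b",
        "messages": [{"role": "user", "content": "Reply with a JSON object {\"ok\": true}"}],
        "stream": False,
        "format": "json",                  # Ollama's structured-output option
    },
    timeout=300,
)
print(resp.json()["message"]["content"])
```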
In combination with Cline it appears to be worthless. Tool calling doesn't work (they say it does) and above 18k context it runs partially on CPU (weird), since they claim it should work comfortably on a 16 GB VRAM RTX.
> Unexpected API Response: The language model did not provide any assistant messages. This may indicate an issue with the API or the model's output.
Edit: Also doesn't work with the openai compatible provider in cline. There it doesn't detect the prompt.
pu_pe · 1h ago
Very sparse benchmarking results released so far. I'd bet the Chinese open source models beat them on quite a few of them.
Robdel12 · 1h ago
I’m on my phone and haven’t been able to break away to check, but anyone plug these into Codex yet?
isoprophlex · 2h ago
Can these do image inputs as well? I can't find anything about that on the linked page, so I guess not..?
Mhh, I wonder if these are distilled from GPT4-Turbo.
I asked it some questions and it seems to think it is based on GPT4-Turbo:
> Thus we need to answer "I (ChatGPT) am based on GPT-4 Turbo; number of parameters not disclosed; GPT-4's number of parameters is also not publicly disclosed, but speculation suggests maybe around 1 trillion? Actually GPT-4 is likely larger than 175B; maybe 500B. In any case, we can note it's unknown.
As well as:
> GPT‑4 Turbo (the model you’re talking to)
fnands · 1h ago
Also:
> The user appears to think the model is "gpt-oss-120b", a new open source release by OpenAI. The user likely is misunderstanding: I'm ChatGPT, powered possibly by GPT-4 or GPT-4 Turbo as per OpenAI. In reality, there is no "gpt-oss-120b" open source release by OpenAI
k2xl · 2h ago
Are there any details about hardware requirements for sensible tokens-per-second for each size of these models?
nodesocket · 1h ago
Anybody got this working in Ollama? I'm running latest version 0.11.0 with WebUI v0.6.18 but getting:
> List the US presidents in order starting with George Washington and their time in office and year taken office.
>> 00: template: :3: function "currentDate" not defined
jmorgan · 24m ago
Sorry about this. Re-downloading Ollama should fix the error
The benchmarks from 20B are blowing away major >500b models. Insane.
On my hardware.
43 tokens/sec.
I got an error with flash attention turned on. Can't run it with flash attention?
31,000 context is the max it will allow or the model won't load.
No KV or V quantization.
pbkompasz · 41m ago
where gpt-5
bobsmooth · 1h ago
Hopefully the dolphin team will work their magic and uncensor this model
minimaxir · 2h ago
I'm disappointed that the smallest model size is 21B parameters, which strongly restricts how it can be run on personal hardware. Most competitors have released a 3B/7B model for that purpose.
For self-hosting, it's smart that they targeted a 16GB VRAM config for it since that's the size of the most cost-effective server GPUs, but I suspect "native MXFP4 quantization" has quality caveats.
hnuser123456 · 1h ago
Native FP4 quantization means it requires half as many bytes as parameters, and will have next to zero quality loss (on the order of 0.1%) compared to using twice the VRAM and exponentially more expensive hardware. FP3 and below gets messier.
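As a rough back-of-the-envelope (my numbers, not OpenAI's), treating all parameters as MXFP4 at 4.25 bits each gives a lower bound on weight memory; the minority of params kept in higher precision add a few GB on top, and KV cache/activations are extra:

```python
# Back-of-the-envelope weight sizes, assuming every parameter is MXFP4 (4.25 bits).
# The higher-precision remainder (attention, embeddings) adds a few GB on top,
# and KV cache / activations are not counted.
def mxfp4_weight_gb(params_billion: float) -> float:
    return params_billion * 1e9 * 4.25 / 8 / 1e9  # bits -> bytes -> GB

print(f"120B (116.8B params): ~{mxfp4_weight_gb(116.8):.0f} GB")  # ~62 GB -> fits one 80 GB GPU
print(f"20B  (~21B params):   ~{mxfp4_weight_gb(21):.0f} GB")     # ~11 GB -> fits 16 GB VRAM
```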
strangecasts · 2h ago
A small part of me is considering going from a 4070 to a 16GB 5060 Ti just to avoid having to futz with offloading
I'd go for an ..80 card but I can't find any that fit in a mini-ITX case :(
4b6442477b1280b · 2h ago
with quantization, 20B fits effortlessly in 24GB
with quantization + CPU offloading, non-thinking models run kind of fine (at about 2-5 tokens per second) even with 8 GB of VRAM
sure, it would be great if we could have models in all sizes imaginable (7/13/24/32/70/100+/1000+), but 20B and 120B are great.
Tostino · 2h ago
I am not at all disappointed. I'm glad they decided to go for somewhat large but reasonable to run models on everything but phones.
Quite excited to give this a try
moffkalast · 2h ago
Eh, 20B is pretty manageable; 32GB of regular RAM and some VRAM will run you a 30B with partial offloading. After that it gets tricky.
Nimitz14 · 1h ago
I'm surprised at the model dim being 2.8k with an output (vocab) size of 200k. My gut feeling had told me you don't want too large a gap between the two; seems I was wrong.
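If it helps put the gap in perspective, here is the quick arithmetic, using the comment's rough numbers (treat the exact dims as approximate, not official): even with a ~200k vocab, the unembedding matrix is a small fraction of a 100B+ parameter model.

```python
# Quick sanity check on the unembedding cost, using the rough numbers above
# (d_model ~2,880, vocab ~201k) -- approximate, not official figures.
d_model, vocab, total_params = 2880, 201_000, 116.8e9
unembed = d_model * vocab  # projection from hidden state to logits
print(f"unembedding params: {unembed/1e9:.2f}B "
      f"({100 * unembed / total_params:.1f}% of total)")
# ~0.58B params, i.e. roughly half a percent of the 120B model
```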
incomingpain · 1h ago
First coding test:
Just going to copy and paste out of chat. It aced my first coding test in 5 seconds... this is amazing. It's really good at coding.
Trying to use it for agentic coding...
lots of fail. This harmony formatting? Anyone have a working agentic tool?
openhands and void ide are failing due to the new tags.
Aider worked, but the file it was supposed to edit was untouched and it created
Create new file? (Y)es/(N)o [Yes]:
Applied edit to <|end|><|start|>assistant<|channel|>final<|message|>main.py
so the file name is '<|end|><|start|>assistant<|channel|>final<|message|>main.py' lol. quick rename and it was fantastic.
I think Qwen Code is the best choice so far, but it's unreliable. The new tags are still coming through, yet it works properly; sometimes.
Only one of my tests so far got 20B to fail on the first iteration, and with a small follow-up it fixed everything right away.
Very impressive model for 20B.
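Until the agent tools catch up with the Harmony format, one crude workaround (a sketch based on the leaked filename above, not part of any official Harmony tooling) is to strip the channel/control markers before handing text to the tool:

```python
import re

# Drop Harmony control tokens like <|end|>, <|start|>, <|channel|>, <|message|>,
# plus the role/channel words glued onto them (e.g. "assistant", "final").
# Token and channel names here are taken from the output above; adjust as needed.
TAG = re.compile(r"<\|[a-z_]+\|>(?:assistant|final|analysis|commentary)?", re.IGNORECASE)

def strip_harmony(text: str) -> str:
    return TAG.sub("", text)

print(strip_harmony("<|end|><|start|>assistant<|channel|>final<|message|>main.py"))
# -> "main.py"
```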
kingkulk · 2h ago
Welcome to the future!
hubraumhugo · 2h ago
Meta's goal with Llama was to target OpenAI with a "scorched earth" approach by releasing powerful open models to disrupt the competitive landscape. Looks like OpenAI is now using the same playbook.
tempay · 2h ago
It seems like the various Chinese companies are far outplaying Meta at that game. It remains to be seen if they’re able to throw money at the problem to turn things around.
mikert89 · 2h ago
ACCELERATE
kgwgk · 2h ago
It may be useless for many use cases, given that its policy prevents it, for example, from providing "advice or instructions about how to buy something."
(I included details about its refusal to answer even after using tools for web searching but hopefully shorter comment means fewer downvotes.)
DSingularity · 2h ago
Ha. Secure funding and proceed to immediately make a decision that would likely conflict viscerally with investors.
4b6442477b1280b · 2h ago
their promise to release an open weights model predates this round of funding by, iirc, over half a year.
DSingularity · 2h ago
Yeah but they never released until now.
hnuser123456 · 2h ago
Maybe someone got tired of waiting and paid them to release something actually open
hnuser123456 · 2h ago
Text only, when local multimodal became table stakes last year.
ebiester · 2h ago
Honestly, it's a tradeoff. If you can reduce the size and get higher quality on specific tasks, that's better than a generalist that can't run on a laptop or can't compete at any one task.
We'll find out the actual quality soon enough as we go.
greenavocado · 1h ago
That's what I thought too until Qwen-Image was released
BoorishBears · 2h ago
The community can always figure out hooking it up to other modalities.
Native might be better, but no native multimodal model is very competitive yet, so better to take a competitive model and latch on vision/audio
MutedEstate45 · 2h ago
The repeated safety testing delays might not be purely about technical risks like misuse or jailbreaks. Releasing open weights means relinquishing the control OpenAI has had since GPT-3. No rate limits, no enforceable RLHF guardrails, no audit trail. Unlike API access, open models can't be monitored or revoked. So safety may partly reflect OpenAI's internal reckoning with that irreversible shift in power, not just model alignment per se. What do you guys think?
BoorishBears · 2h ago
I think it's pointless: if you SFT even their closed source models on a specific enough task, the guardrails disappear.
AI "safety" is about making it so that a journalist can't get out a recipe for Tabun just by asking.
MutedEstate45 · 2h ago
True, but there's still a meaningful difference in friction and scale. With closed APIs, OpenAI can monitor for misuse, throttle abuse and deploy countermeasures in real-time. With open weights, a single prompt jailbreak or exploit spreads instantly. No need for ML expertise, just a Reddit post.
The risk isn’t that bad actors suddenly become smarter. It’s that anyone can now run unmoderated inference and OpenAI loses all visibility into how the model’s being used or misused. I think that’s the control they’re grappling with under the label of safety.
BoorishBears · 1h ago
OpenAI and Azure both have zero retention options, and the NYT saga has given pretty strong confirmation they meant it when they said zero.
MutedEstate45 · 28m ago
I think you're conflating real-time monitoring with data retention. Zero retention means OpenAI doesn't store user data, but they can absolutely still filter content, rate limit and block harmful prompts in real-time without retaining anything. That's processing requests as they come in, not storing them. The NYT case was about data storage for training/analysis not about real-time safety measures.
There's basically no reason to run other open source models now that these are available, at least for non-multimodal tasks.
I'm still withholding judgement until I see benchmarks, but every point you tried to make regarding model size and parameter size is wrong. Qwen has more variety on every level, and performs extremely well. That's before getting into the MoE variants of the models.
It's cool to see OpenAI throw their hat in the ring, but you're smoking straight hopium if you think there's "no reason to run other open source models now" in earnest. If OpenAI never released these models, the state-of-the-art would not look significantly different for local LLMs. This is almost a nothingburger if not for the simple novelty of OpenAI releasing an Open AI for once in their life.
So are/do the new OpenAI models, except they're much smaller and faster.
Kind of a P=NP, but for software deliverability.
I imagine the same conflicts will ramp up over the next few years, especially once the silly money starts to dry up.
God bless China.
I just feel lucky to be around in what's likely the most important decade in human history. Shit odds on that, so I'm basically a lotto winner. Wild times.
Lol. To be young and foolish again. This COVID-laced decade is more of a placeholder. The current decade is always the most meaningful until the next one. The personal computer era, the first cars or planes, ending slavery: all of it apparently needs to take a backseat to the best search engine ever. We are at the point where everyone is planning what they are going to do with their hoverboards.
That happened over many centuries, not in a given decade. Abolished and reintroduced in many places: https://en.wikipedia.org/wiki/Timeline_of_abolition_of_slave...
There was a ballot measure to actually abolish slavery a year or so back. It failed miserably.
Even in liberal states, the dehumanization of criminals is an endemic behavior, and we are reaching the point in our society where ironically having the leeway to discuss the humane treatment of even our worst criminals is becoming an issue that affects how we see ourselves as a society before we even have a framework to deal with the issue itself.
What one side wants is for prisons to be for rehabilitation and societal reintegration, for prisoners to have the right to decline to work and to be paid fair wages from their labor. They further want to remove for-profit prisons from the equation completely.
What the other side wants is the acknowledgement that prisons are not free, they are for punishment, and that prisoners have lost some of their rights for the duration of their incarceration and that they should be required to provide labor to offset the tax burden of their incarceration on the innocent people that have to pay for it. They also would like it if all prisons were for-profit as that would remove the burden from the tax payers and place all of the costs of incarceration onto the shoulders of the incarcerated.
Both sides have valid and reasonable wants from their vantage point while overlooking the valid and reasonable wants from the other side.
ah, but that begs the question: did those people develop their worries organically, or did they simply consume the narrative heavily pushed by virtually every mainstream publication?
the journos are heavily incentivized to spread FUD about it. they saw the writing on the wall that the days of making a living by producing clickbait slop were coming to an end and deluded themselves into thinking that if they kvetch enough, the genie will crawl back into the bottle. scaremongering about sci-fi skynet bullshit didn't work, so now they kvetch about joules and milliliters consumed by chatbots, as if data centers did not exist until two years ago.
likewise, the bulk of other "concerned citizens" are creatives who use their influence to sway their followers, still hoping against hope to kvetch this technology out of existence.
honest-to-God yuddites are as few and as retarded as honest-to-God flat earthers.
The 120B model is worse at coding than Qwen3 Coder, GLM-4.5 Air, and even Grok 3... (https://www.reddit.com/r/LocalLLaMA/comments/1mig58x/gptoss1...)
What does the resource usage look like for GLM 4.5 Air? Is that benchmark in FP16? GPT-OSS-120B will be using between 1/4 and 1/2 the VRAM that GLM-4.5 Air does, right?
It seems like a good showing to me, even though Qwen3 Coder and GLM 4.5 Air might be preferable for some use cases.
Thanks.
12.63 tok/sec • 860 tokens • 1.52s to first token
I'm amazed it works at all with such limited RAM
and the 120b: https://asciinema.org/a/B0q8tBl7IcgUorZsphQbbZsMM
I am, um, floored
```
total duration: 1m14.16469975s
load duration: 56.678959ms
prompt eval count: 3921 token(s)
prompt eval duration: 10.791402416s
prompt eval rate: 363.34 tokens/s
eval count: 2479 token(s)
eval duration: 1m3.284597459s
eval rate: 39.17 tokens/s
```
My personal prediction is that the US foundational model makers will OSS something close to N-1 for the next 1-3 iterations. The CAPEX for the foundational model creation is too high to justify OSS for the current generation. Unless the US Gov steps up and starts subsidizing power, or Stargate does 10x what it is planned right now.
N-1 model value depreciates insanely fast. Making an OSS release of them and allowing specialized use cases and novel developments allows potential value to be captured and integrated into future model designs. It's medium risk, as you may lose market share. But also high potential value, as the shared discoveries could substantially increase the velocity of next-gen development.
There will be a plethora of small OSS models. Iteration on the OSS releases is going to be biased towards local development, creating more capable and specialized models that work on smaller and smaller devices. In an agentic future, every different agent in a domain may have its own model. Distilled and customized for its use case without significant cost.
Everyone is racing to AGI/SGI. The models along the way are to capture market share and use data for training and evaluations. Once someone hits AGI/SGI, the consumer market is nice to have, but the real value is in novel developments in science, engineering, and every other aspect of the world.
[0] https://www.anthropic.com/research/persona-vectors > We demonstrate these applications on two open-source models, Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct.
In this setup, OSS models could be more than enough and capture the market, but I don't see where the value would be in a multitude of specialized models we have to train.
[1] https://github.com/openai/harmony
[2] https://github.com/openai/tiktoken
[3] https://github.com/openai/codex
From a strategic perspective, I can't think of any reason they'd release this unless they were about to announce something which totally eclipses it?
There's future opportunity in licensing, tech support, agents, or even simply to dominate and eliminate. Not to mention brand awareness: if you like these, you might be more likely to approach their brand for larger models.
https://manifold.markets/Bayesian/on-what-day-will-gpt5-be-r...
The question is how much better the new model(s) will need to be on the metrics given here for them to feel comfortable making these available.
Despite the loss of face from the lack of open model releases, I do not think that was a big enough problem to undercut commercial offerings.
Given it's only around 5 billion active params, it shouldn't be a competitor to o3 or any of the other SOTA models, since the top Deepseek and Qwen models have around 30 billion active params. Unless OpenAI somehow found a way to make a model with 5 billion active params perform as well as one with 4-8 times more.
Kudos to that team.
I think that the point that makes me more excited is that we can train trillion-parameter giants and distill them down to just billions without losing the magic. Imagine coding with Claude 4 Opus-level intelligence packed into a 10B model running locally at 2000 tokens/sec - like instant AI collaboration. That would fundamentally change how we develop software.
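For what it's worth, the distillation recipe itself is conceptually simple; below is a minimal sketch of the usual logit-matching loss (the temperature, weighting, and the teacher/student setup are illustrative assumptions, not anything OpenAI has published):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target KL against the teacher plus ordinary cross-entropy on the data."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Usage sketch: run the big "teacher" in inference mode, train the small "student".
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits
# loss = distill_loss(student(input_ids).logits, teacher_logits, labels)
```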
so, the 20b model.
Can someone explain to me what I would need to do in terms of resources (GPU, I assume) if I want to run 20 concurrent processes, assuming I need 1k tokens/second throughput (on each, so 20 x 1k)
Also, is this model better/comparable for information extraction compared to gpt-4.1-nano, and would it be cheaper to host myself 20b?
Multiply the number of A100's you need as necessary.
Here, you don't really need the RAM. If you could accept fewer tokens/second, you could do it much more cheaply with consumer graphics cards.
Even with A100, the sweet-spot in batching is not going to give you 1k/process/second. Of course, you could go up to H100...
Also keep in mind this model does use 4-bit layers for the MoE parts. Unfortunately, native accelerated 4-bit support only started with Blackwell on NVIDIA, so your 3090/4090/A6000/A100s are not going to be fast. An RTX 5090 will be your best starting point in the traditional card space. Maybe unified-memory mini PCs like the Spark systems or the Mac mini could be an alternative, but I don't know them well enough.
[1] https://ollama.com/library/gpt-oss
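A quick way to check where a given NVIDIA card falls relative to that Blackwell cutoff is its compute capability; a small sketch, assuming PyTorch is installed (the capability thresholds are my reading of NVIDIA's published numbers, so treat them as approximate):

```python
import torch

# Blackwell-class parts report compute capability 10.x (B100/B200) or 12.x
# (RTX 50-series), which is roughly where native FP4 acceleration begins.
# Earlier cards (Ada 8.9, Hopper 9.0) still run MXFP4 weights, just without
# the hardware fast path.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    print(f"{name}: sm_{major}{minor} -> likely native FP4: {major >= 10}")
else:
    print("No CUDA device visible.")
```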
but I need to understand 20 x 1k token throughput
I assume it just might be too early to know the answer
My 2x 3090 setup will get me ~6-10 streams of ~20-40 tokens/sec (generation) ~700-1000 tokens/sec (input) with a 32b dense model.
3.6B activated at Q8 x 1000 t/s = 3.6 TB/s just for the activated model weights (there's also context). So pretty much straight to a B200 and the like. 1000 t/s per user/agent is way too fast; make it 300 t/s and you could get away with a 5090/RTX PRO 6000.
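Generalizing that arithmetic (the numbers below follow the thread's figures and are assumptions, not measurements): per-stream decode speed is roughly usable memory bandwidth divided by the bytes touched per token, so you can invert it to see what a 20 x 1k tokens/s target implies.

```python
# Rough decode-throughput model: tokens/s/stream ~= usable_bandwidth / bytes_per_token.
# Uses the thread's 3.6B active params for the 20B model; batching efficiency,
# KV-cache reads, and kernel overhead are all hand-waved away.
def required_bandwidth_gbps(active_params_b: float, bits_per_param: float,
                            tokens_per_s: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bits_per_param / 8
    return bytes_per_token * tokens_per_s / 1e9

# One stream at 1,000 t/s with 3.6B active params:
print(required_bandwidth_gbps(3.6, 4.25, 1000))  # ~1,912 GB/s at MXFP4; ~3,600 GB/s at 8-bit
# 20 fully independent (unbatched) streams would scale that by 20x; in practice
# batching amortizes the weight reads, which is why serving stacks batch aggressively.
```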
You are unlikely to match groq on off the shelf hardware as far as I'm aware.
$0.15/M in / $0.60-0.75/M out
edit: Now Cerebras too, at 3,815 tps for $0.25/M in / $0.69/M out.
On ChatGPT.com o3 thought for 13 seconds; on OpenRouter GPT OSS 120B thought for 0.7 seconds - and they both had the correct answer.
What is being measured here? A rough model of end-to-end time is:
t_total = t_network + t_queue + t_batch_wait + t_inference + t_service_overhead
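If you want to separate those terms yourself, the easiest observable split is time-to-first-token versus total time over a streaming response. A small sketch against any OpenAI-compatible endpoint (the base URL and model name below are placeholders, not a specific provider's values):

```python
import time
from openai import OpenAI

# Placeholder endpoint/model; works against any OpenAI-compatible server
# (OpenRouter, a local llama.cpp / LM Studio server, etc.).
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

start = time.perf_counter()
first_token_at = None
stream = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[{"role": "user", "content": "Say hi."}],
    stream=True,
)
for chunk in stream:
    if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
        first_token_at = time.perf_counter()
total = time.perf_counter() - start
if first_token_at is not None:
    print(f"time to first token: {first_token_at - start:.2f}s, total: {total:.2f}s")
```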
I am not kidding but such progress from a technological point of view is just fascinating!
[1] currently $3/M in / $8/M out: https://platform.openai.com/docs/pricing
https://x.com/tekacs/status/1952788922666205615
Asking it about a marginally more complex tech topic and getting an excellent answer in ~4 seconds, reasoning for 1.1 seconds...
I am _very_ curious to see what GPT-5 turns out to be, because unless they're running on custom silicon / accelerators, even if it's very smart, it seems hard to justify not using these open models on Groq/Cerebras for a _huge_ fraction of use-cases.
https://news.ycombinator.com/item?id=44738004
... today, this is a real-time video of the OSS thinking models by OpenAI on Groq and I'd have to slow it down to be able to read it. Wild.
LLMs are getting cheaper much faster than I anticipated. I'm curious if it's still the hype cycle and Groq/Fireworks/Cerebras are taking a loss here, or whether things are actually getting cheaper. At this rate we'll be able to run Qwen3-32B-level models on phones/embedded soon.
Wonder if they feel the bar will be raised soon (GPT-5) and feel more comfortable releasing something this strong.
E.g. Hybrid architecture. Local model gathers more data, runs tests, does simple fixes, but frequently asks the stronger model to do the real job.
Local model gathers data using tools and sends more data to the stronger model.
It
Maybe you guys call it AGI, so anytime I see progress in coding, I think it moves just a tiny bit in the right direction.
Plus it also helps me as a coder to actually do some stuff just for the fun of it. Maybe coding is the only truly viable use of AI and all the others are negligible improvements.
There is so much polarization around the use of AI in coding, but I just want to say this: it would be pretty ironic if an industry that automates other people's jobs were this time the first to get its own job automated.
But I don't see that as happening, far from it. Still, each day something new, something better happens back to back. So yeah.
What would AGI mean, solving some problem that it hasn't seen? or what exactly? I mean I think AGI is solved, no?
If not, I see people mentioning that Horizon Alpha is actually a GPT-5 model and it's predicted to release on Thursday on some betting market, so maybe that fits the AGI definition?
https://github.com/ggml-org/llama.cpp/pull/15091
Humanity’s Last Exam: gpt-oss-120b (tools): 19.0%, gpt-oss-120b (no tools): 14.9%, Qwen3-235B-A22B-Thinking-2507: 18.2%
One positive thing I see is the number of parameters and size --- it will provide more economical inference than current open source SOTA.
Major points of interest for me:
- In the "Main capabilities evaluations" section, the 120b outperform o3-mini and approaches o4 on most evals. 20b model is also decent, passing o3-mini on one of the tasks.
- AIME 2025 is nearly saturated with large CoT
- CBRN threat levels kind of on par with other SOTA open source models. Plus, demonstrated good refusals even after adversarial fine tuning.
- Interesting to me how a lot of the safety benchmarking runs on trust, since methodology can't be published too openly due to counterparty risk.
Model cards with some of my annotations: https://openpaper.ai/paper/share/7137e6a8-b6ff-4293-a3ce-68b...
What’s the catch?
For GPT-5 to dwarf these just-released models in importance, it would have to be a huge step forward, and I still have doubts about OpenAI's capacity and infrastructure to handle demand at the moment.
GPT-5 will probably be way, way better. If Horizon Alpha/Beta are early previews of GPT-5 family models, then coding should be > Opus 4 for modern frontend stuff.
Is it even valid to have additional restrictions on top of Apache 2.0?
[0]: https://openai.com/index/gpt-oss-model-card/
So FYI to anyone on a Mac, the easiest way to run these models right now is using LM Studio (https://lmstudio.ai/); it's free. You just search for the model; usually the third-party groups mlx-community or lmstudio-community have MLX versions within a day or two of releases. I go for the 8-bit quantizations (4-bit is faster, but quality drops). You can also convert to MLX yourself...
Once you have it running in LM Studio, you can chat there in their chat interface, or you can hit it through an API that defaults to http://127.0.0.1:1234
You can run multiple models that hot swap and load instantly and switch between them etc.
It's surprisingly easy, and fun. There are actually a lot of cool niche models coming out, like this tiny high-quality search model released today (which also has an official MLX version): https://huggingface.co/Intelligent-Internet/II-Search-4B
Other fun ones are Gemma 3n, which is multi-modal; a larger one that is actually a solid model but takes more memory is the new Qwen3 30B A3B (Coder and Instruct); Pixtral (Mixtral vision with full-resolution images); etc. Looking forward to playing with this model and seeing how it compares.
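Once the model is loaded, LM Studio's local server speaks the standard OpenAI-style chat completions API, so anything that can POST JSON works. A minimal sketch (the model identifier below is a placeholder; use whatever id LM Studio shows for your download):

```python
import requests

# LM Studio's local server (default http://127.0.0.1:1234) exposes an
# OpenAI-compatible /v1/chat/completions route; no API key is required.
resp = requests.post(
    "http://127.0.0.1:1234/v1/chat/completions",
    json={
        "model": "gpt-oss-20b",  # placeholder; use the id shown in LM Studio
        "messages": [{"role": "user", "content": "Summarize MXFP4 in one sentence."}],
        "temperature": 0.7,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```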
Edit: I tried it out. I have no idea in terms of tokens/sec, but it was fluid enough for me. A bit slower than using o3 in the browser, but definitely tolerable. I think I will set it up on my GF's machine so she can stop paying for the full subscription (she's a non-tech professional).
[1] https://github.com/openai/harmony
Our backend is falling over from the load, spinning up more resources!
- OAI open source
- Opus 4.1
- Genie 3
- ElevenLabs Music
They're giving you a free model. You can evaluate it. You can sue them. But the weights are there. If you dislike the way they license the weights, because the license isn't open enough, then sure, speak up, but because you can't see all the training data??! Wtf.
Historically this would be like calling a free but closed-source application "open source" simply because the application is free.
Rough analogy:
SaaS = AI as a service
Locally executable closed-source software = open-weight model
Open-source software = open-source model (whatever allows one to reproduce the model from training data)
That's not true by any of the open source definitions in common use.
Source code (and, optionally, derived binaries) under the Apache 2.0 license are open source.
But compiled binaries (without access to source) under the Apache 2.0 license are not open source, even though the license does give you some rights over what you can do with the binaries.
Normally the question doesn't come up, because it's so unusual and strange to ship closed-source binaries with an open source license. Descriptions of which licenses are open source licenses carry the unstated assumption that you have access to the source - the "is it an open source license" question is about what you're allowed to do with the source.
The distinction is more obvious if you ask the same question about other open source licenses such as GPL or MPL. A compiled binary (without access to source) shipped with a GPL license is not by any stretch open source. Not only is it not in the "preferred form for editing" as the license requires, it's not even possible for someone who receives the file to give it to someone else and comply with the license. If someone who receives the file can't give it to anyone else (legally), then it's obviously not open source.
However, for the sake of argument let's say this release should be called open source.
Then what do you call a model that also comes with its training material and tools to reproduce the model? Is it also called open source, and there is no material difference between those two releases? Or perhaps those two different terms should be used for those two different kind of releases?
If you say that actually open source releases are impossible now (for mostly copyright reasons, I imagine), it doesn't mean they will be perpetually so. For that glorious future, we can leave them space in the terminology by using the term open weight. It is also a term that shouldn't mislead anyone.
It's like getting compiled software with an Apache license. Technically open source, but you can't modify and recompile, since you don't have the source to recompile. You can still tinker with the binary, though.
You run inference (via a library) on a model using its architecture (config file) and tokenizer (what and when to compute), based on weights (hardcoded values). That's it.
> but you can’t modify
Yes, you can. It's called finetuning. And, most importantly, that's exactly how the model creators themselves are "modifying" the weights! No sane lab is "recompiling" a model every time they change something. They perform a pre-training stage (feed in everything and the kitchen sink), they get the hardcoded values (weights), and then they post-train using "the same" techniques (well, maybe their techniques are better, but still the same concept) as you or I would. Just with more compute. That's it. You can do the exact same modifications, using basically the same concepts.
> don’t have the source to recompile
In pure practical terms, neither do the labs. Everyone who has trained a big model can tell you that the process is so finicky that they'd eat a hat if a big training session could somehow be made reproducible to the bit. Between nodes failing, datapoints ballooning your loss and forcing you to go back, and the myriad of other problems, what you get out of a big training run is not guaranteed to be the same even with 100-1000 more attempts, in practice. It's simply the nature of training large models.
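To make the "modifying the weights" point concrete: a bare-bones LoRA fine-tuning sketch with Hugging Face transformers + peft (the model id, target modules, and hyperparameters are illustrative assumptions; a real run needs a dataset and a Trainer/optimizer loop):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Illustrative only: load open weights and attach small trainable adapters.
model_id = "openai/gpt-oss-20b"  # placeholder repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumption; depends on the architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of weights get gradients
# ...then train with your favorite Trainer/optimizer on task-specific data.
```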
This is something about AI that worries me, as a 'child' of the open source coming-of-age era in the '90s. I don't want to be forced to rely on those big companies to do my job in an efficient way, if AI becomes part of the day-to-day workflow.
I would understand it, if there was some technology lock-in. But with LLMs, there is no such thing. One can switch out LLMs without any friction.
There could be many legitimate reasons, but yeah, I'm very surprised by this too. Some companies take it a bit too seriously and go above and beyond. At this point, unless you need the absolute SOTA models because you're throwing an LLM at an extremely hard problem, there is very little utility in using the larger providers. On OpenRouter, or by renting your own GPU, you can run on-par models for much cheaper.
Frontier / SOTA models are barely profitable. Previous-gen models lose 90% of their value. Two gens back and they're worthless.
And given that their product life cycle is something like 6-12 months, you might as well open source them as part of sundowning them.
https://www.dwarkesh.com/p/mark-zuckerberg#:~:text=As%20long...
The short version is that if you give a product to the open source community, they can and will donate time and money to improving your product, and the ecosystem around it, for free, and you get to reap those benefits. Llama has already basically won that space (the standard way of running open models is llama.cpp), so OpenAI have finally realized they're playing catch-up (and last quarter's SOTA isn't worth much revenue to them when there's a new SOTA, so they may as well give it away while it can still crack into the market).
[1]: https://msty.ai
Perhaps I missed it somewhere, but I find it frustrating that, unlike most other open-weight models and despite this being an open release, OpenAI has chosen to provide pretty minimal transparency regarding model architecture and training. It's become the norm for Llama, Deepseek, Qwen, Mistral and others to provide a pretty detailed write-up on the model, which allows researchers to advance and compare notes.
[0] https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7...
Given these new models are closer to the SOTA than they are to competing open models, this suggests that the 'secret sauce' at OpenAI is primarily about training rather than model architecture.
Hence why they won't talk about the training.
But Apple is waking up too. So is Google. It's absolutely insane, the amount of money being thrown around.
If anything this helps Meta: another model to inspect/learn from/tweak etc. generally helps anyone making models
Hopefully other quantizations of these OpenAI models will be available soon.
I'm still wondering why my MPU usage was so low.. maybe Ollama isn't optimized for running it yet?
Screenshot here with Ollama running and asitop in other terminal:
https://bsky.app/profile/pamelafox.bsky.social/post/3lvobol3...
What could go wrong?
Super excited to test these out.