GPT-OSS vs. Qwen3 and a detailed look at how things have evolved since GPT-2

71 points | ModelForge | 10 comments | 8/10/2025, 3:06:07 PM | magazine.sebastianraschka.com

Comments (10)

7moritz7 · 1h ago
Qwen3 is substantially better in my local testing. As in, it adheres to the prompt better (pretty much exactly for the 32B-parameter variant, very impressive) and sounds more organic.

In SimpleBench, gpt-oss (120B) flopped hard, so it doesn't appear particularly good at logic puzzles either.

So presumably, this comes down to...

- training technique or data

- model dimensions

- fewer, larger experts vs. more, smaller experts (toy calculation below)
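
To make that last bullet concrete, here's a toy calculation (the numbers are made up for illustration, not the actual gpt-oss or Qwen3 configs). The same total expert budget can be spent on a few large experts or many small ones, and the split changes how many parameters are active per token:

    # Toy MoE sizing comparison -- hypothetical numbers, not real model configs.
    def moe_params(n_experts: int, expert_size: float, top_k: int):
        """Return (total expert params, params active per token)."""
        return n_experts * expert_size, top_k * expert_size

    # Few large experts, routing each token to 2 of them:
    total, active = moe_params(n_experts=8, expert_size=1.5e9, top_k=2)
    print(f"few/large:  total={total / 1e9:.0f}B, active={active / 1e9:.2f}B")  # 12B total, 3.00B active

    # Many small experts, routing to 8 -- same 12B total, less active compute:
    total, active = moe_params(n_experts=64, expert_size=0.1875e9, top_k=8)
    print(f"many/small: total={total / 1e9:.0f}B, active={active / 1e9:.2f}B")  # 12B total, 1.50B active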

jszymborski · 1h ago
If I had to make a guess, I'd say this has much, much less to do with the architecture and far more to do with the data and training pipeline. Many have speculated that gpt-oss has adopted a Phi-like synthetic-only dataset and focused mostly on gaming metrics, and I've found the evidence so far to be sufficiently compelling.
7moritz7 · 1h ago
That would be interesting. I've been a bit sceptical of the entire strategy from the beginning. If gpt-oss were actually as good as o3-mini, and in some cases o4-mini, outside of benchmarks, it would undermine OpenAI's API offering for GPT-5 nano, and maybe mini too.

Edit: found this analysis, it's on the HN frontpage right now

> this thing is clearly trained via RL to think and solve tasks for specific reasoning benchmarks. nothing else.

https://x.com/jxmnop/status/1953899426075816164

CuriouslyC · 1h ago
The strategy of Phi isn't bad, it's just not general. It's really a model that's meant to be fine-tuned, but unfortunately fine-tuning tends to shit on RL'd behavior, so it ended up not being that useful. If someone made a Phi-style model with an architecture designed to take knowledge adapters/experts (i.e., a small MoE model designed to have separately trained networks plugged into it, with routing updates via a special LoRA), it'd actually be super useful.
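
A minimal sketch of what that could look like, assuming PyTorch (the class names, the dense mixing, and the LoRA rank are all my own illustration, not an existing library): separately trained, frozen experts get plugged into an MoE layer, and only a low-rank correction to the router is trained.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LoRARouter(nn.Module):
        """Frozen base router plus a trainable low-rank (LoRA-style) correction,
        so routing can adapt cheaply when new experts are plugged in."""
        def __init__(self, d_model: int, n_experts: int, rank: int = 8):
            super().__init__()
            self.base = nn.Linear(d_model, n_experts, bias=False)
            self.base.weight.requires_grad_(False)             # frozen base routing
            self.lora_a = nn.Parameter(torch.randn(rank, d_model) * 0.01)
            self.lora_b = nn.Parameter(torch.zeros(n_experts, rank))  # no-op at init

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.base(x) + x @ self.lora_a.T @ self.lora_b.T

    class PluggableMoE(nn.Module):
        """MoE layer whose experts are separately trained, frozen modules."""
        def __init__(self, d_model: int, experts: list):
            super().__init__()
            self.experts = nn.ModuleList(experts)
            for e in self.experts:                 # expert knowledge stays fixed;
                e.requires_grad_(False)            # only the routing update trains
            self.router = LoRARouter(d_model, len(self.experts))

        def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, d_model)
            weights = F.softmax(self.router(x), dim=-1)                # (batch, n)
            outs = torch.stack([e(x) for e in self.experts], dim=-1)   # (batch, d, n)
            return torch.einsum("bdn,bn->bd", outs, weights)

    # Usage: plug in two independently trained "knowledge" experts.
    experts = [nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
               for _ in range(2)]
    layer = PluggableMoE(d_model=64, experts=experts)
    print(layer(torch.randn(4, 64)).shape)  # torch.Size([4, 64])

(Mixing is dense over all experts to keep the sketch short; a real version would route top-k.)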
unstatusthequo · 40m ago
Yes. I tried asking gpt-oss to ask me a riddle. The response was absurd: it came up with a nonsensical question, then told me the answer. The answer was a four-letter “word” that wasn’t actually a real word.

“What is the word that starts with S, ends with E, and contains A? → SAEA”

Then, when I pointed out that it isn't a real word and that it had already given away the answer (no fun), it said

“I do not have access to confirm that word.”

roscas · 16m ago
"The word is SAME." for qwen3-coder and it took like 10 seconds to write all the thinking because it writes it down, but processing was 0 seconds.

"I need to find a word that starts with S, ends with E, and contains A.

Let me think through this step by step:

- Starts with S
- Ends with E
- Contains A somewhere in the middle
One word that fits this pattern is "SAGE" - but that only has 4 letters and would be S-A-G-E, which doesn't work because it doesn't contain A in the middle.

Wait, let me reconsider. The word "SAGE" actually works if I read it correctly:

- Starts with S
- Ends with E
- Contains A
Actually, "SAGE" is S-A-G-E, so it starts with S and ends with E, but it contains A in the second position.

Let me think of other possibilities:

    "SQUARE" - starts with S, ends with E, contains A (in the middle)
    "SALE" - starts with S, ends with E, contains A
    "SAME" - starts with S, ends with E, contains A
     
Wait, "SAME" is S-A-M-E - that works! It starts with S, ends with E, and contains A.

The word is SAME. "

verisimi · 24m ago
lol. The answer it gave doesn't even end in an 'E'.
roscas · 19m ago
From my experience, qwen3-coder is way better. I only have gpt-oss:20b installed to run a few more tests, but I gave each one a program and asked for a summary of what it does: qwen3 just worked in a few seconds, while gpt-oss was cancelled after 5 minutes... doing nothing.

So I just use qwen3. Fast and great output. If for some reason I don't get what I need, I might use search engines or Perplexity.

I have a 10 GB 3080 and a Ryzen 3600X with 32 GB of RAM.

Qwen3-coder is amazing. Best I've used so far.

smokel · 15m ago
The 20B version doesn't fit in 10 GB of VRAM. That might explain some of the issues?
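
As a back-of-the-envelope check (rough assumptions: counting weights only, ignoring KV cache, activations, and runtime overhead):

    # Rough VRAM estimate for model weights alone -- a sketch, not exact.
    def weight_gb(n_params: float, bits_per_param: float) -> float:
        return n_params * bits_per_param / 8 / 1e9

    print(weight_gb(20e9, 4))  # 10.0 -- even 4-bit weights fill a 10 GB card
    print(weight_gb(20e9, 8))  # 20.0 -- 8-bit is 2x over budget

Anything that doesn't fit spills to system RAM, which is much slower and would explain gpt-oss:20b stalling.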
homarp · 1h ago
"From GPT-2 to gpt-oss: Analyzing the Architectural Advances And How They Stack Up Against Qwen3"