Qwen3-Next

30 points by tosh · 5 comments · 9/12/2025, 6:32:04 AM · qwen.ai ↗

Comments (5)

jychang · 4m ago
Coolest part of Qwen3-Next, in my opinion, is that they do MTP without adding another un-embedding matrix.

Deepseek R1 also has a MTP layer (layer 61) https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/mod...

But Deepseek R1 adds embed_tokens and shared_head.head tensors, each of shape [129280, 7168], together about 2GB at FP8.

Qwen3-Next doesn't have that, so it saves a few GB in active parameters for MTP, which is a Big Deal.
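The ~2GB figure above is easy to sanity-check; a quick sketch (tensor shapes taken from the comment, FP8 assumed to be 1 byte per parameter):

```python
# Back-of-the-envelope size of DeepSeek R1's two extra MTP tensors.
vocab, hidden = 129280, 7168          # shape of each [vocab, hidden] tensor
params_per_tensor = vocab * hidden    # ~0.93B parameters each
fp8_bytes_per_param = 1               # FP8 stores one byte per parameter

# embed_tokens + shared_head.head = two such tensors
total_gb = 2 * params_per_tensor * fp8_bytes_per_param / 1e9
print(f"{total_gb:.2f} GB")           # ~1.85 GB, i.e. "about 2GB"
```

Skipping those two tensors is exactly the saving in active parameters the comment describes.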

croemer · 9m ago
Jgoauh · 17m ago
Seems impressive. I believe better architectures are really the path forward; I don't think you need more than 100B params, given this model and what GPT OSS 120B can achieve.
NitpickLawyer · 1m ago
New arch seems cool, and it's amazing that we have these published in the open.

That being said, qwen models are extremely overfit. They can do some things well, but they are very limited in generalisation compared to closed models. I don't know if it's simply scale, or training recipes, or regimes. But if you test them out-of-distribution (OOD), the models utterly fail to deliver, where the closed models still provide value.

croemer · 10m ago
ERR_NAME_NOT_RESOLVED