Deepseek R1 also has a MTP layer (layer 61) https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/mod...
But Deepseek R1 adds embed_tokens and shared_head.head tensors, which are [129280, 7168] or about 2GB in size at FP8.
Qwen3-Next doesn't have that: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct/blob...
So it saves a few GB in active parameters for MTP, which is a Big Deal. This is one of the changes that helps significantly speed up inference.
puilp0502 · 4h ago
What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?
jychang · 3h ago
Speculative decoding! It makes inference a LOT faster.
Instead of generating tokens one at a time, you generate the second one as well, and then use speculative decoding to verify that second token (instead of having it produced by a draft model like Qwen 0.6b). If the token passes the check, the 2nd token comes out MUCH faster.
If it's wrong, you have to generate it again the normal way (a lot slower than just checking it). Usually, it's correct, so inference is a lot faster.
moffkalast · 2h ago
Hmm but isn't the checking only required because the draft model is not the same model and can only speculate what the main one is thinking, hence the name? If the main model generates two tokens itself, then how can it be wrong about its own predictions?
jychang · 2h ago
Because if you generate token n+1 with all 48 layers of Qwen3-Next and 80 billion params, and also generate token n+2 with the 1 MTP layer at 2bil params... that n+2 token can be much lower quality than the n+1 token but mostly correct.
Let's say you have a model that generates the string "The 44th president of the United States is ___ ___". Your model will generate "Barack" as the n+1 token, and the MTP layer probably does a good enough job to generate "Obama" as the n+2 token (even though that MTP layer is a mere <2bil parameters in size). Then you just check if "Obama" is correct via the same speculative decoding process, which is a lot faster than if you had to start over from layer 1-48 and generate "Obama" the regular way.
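To make that concrete, here's a rough sketch of the flow in Python-style pseudocode (the `model.forward` / `mtp_head` calls are made-up stand-ins for illustration, not a real inference API):

```python
import torch

def step_with_mtp(model, mtp_head, ids):
    # Full pass through all layers: predicts token n+1 and yields hidden states.
    hidden, logits = model.forward(ids)                  # expensive (all 48 layers)
    tok_next = logits[-1].argmax()                       # e.g. "Barack"

    # The tiny MTP head drafts token n+2 from the same final hidden state.
    draft = mtp_head(hidden[-1], tok_next).argmax()      # cheap (~1 extra layer), e.g. "Obama"

    # One full pass over [..., n+1, draft n+2]. A full pass was needed to emit n+2
    # anyway; this one also verifies the draft and already gives logits for n+3.
    _, v_logits = model.forward(torch.cat([ids, tok_next.view(1), draft.view(1)]))
    real = v_logits[-2].argmax()                         # what the big model says n+2 is

    accepted = draft if real == draft else real          # keep the draft only if it matches
    return tok_next, accepted
```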
littlestymaar · 41m ago
> Then you just check if "Obama" is correct via the same speculative decoding process, which is a lot faster than if you had to start over from layer 1-48 and generate "Obama" the regular way.
That doesn't match my understanding of what speculative decoding does: AFAIK with regular speculative decoding you ask a smaller llm to infer the next few tokens (let's say 5 tokens) and then you can have the big model infer tokens 1, 2, 3, 4, 5 and 6 in parallel (each time starting from the sentence partially completed by the smaller model). Because llms are bandwidth bound, doing the same work six times in parallel isn't slower than doing it only once (what's costly is moving the massive model weights between VRAM and the GPU cores).
If tokens 1, 2 and 3 match what the small model inferred, then you keep them. As soon as you have a mismatched token (say token 4) it means that you have to discard the next inferred tokens (here tokens 5 and 6) because they were calculated under a wrong assumption for token 4.
So if the MTP layer merely replaces the smaller llm in the previous scheme, with everything else working the same way, you wouldn't save anything when inferring “Obama” (you'd still need to “generate it the regular way”, as there isn't really another way), but you could also start working on the word immediately after “Obama” by assuming “Obama” was already chosen. And if the model actually outputted “Hussein” instead of “Obama”, then the token calculated to happen after “Obama” would have to be discarded.
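In code, the accept/discard rule I mean is basically this (greedy version, names invented for the example):

```python
def accept_draft(drafted, verified):
    """Keep drafted tokens up to the first mismatch with what the big model
    produced in its single parallel verification pass; everything after a
    mismatch was computed under a wrong assumption and is thrown away."""
    kept = []
    for d, v in zip(drafted, verified):
        if d == v:
            kept.append(d)          # prefix still valid, keep going
        else:
            kept.append(v)          # take the big model's token and stop
            break
    return kept

# drafted  = ["Barack", "Hussein", "II"]
# verified = ["Barack", "Obama", "was"]
# -> keeps ["Barack", "Obama"]; drafting restarts after "Obama"
```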
Or maybe my understanding of speculative decoding is completely off…
SonOfLilit · 2h ago
If you ask me to guess an answer, I'll _usually_ produce the same answer as if I had time to think about it deeply, but not always...
EMM_386 · 24m ago
I believe it's something along these lines. The MTP head runs simultaneously and generates a probability list based on what it thinks the results will be, learned during training.
If n+1 = "Barack" then n+2 = "Obama" (confidence: 0.90)
If n+1 = "The" then n+2 = "quick" (confidence: 0.45)
If n+1 = "President" then n+2 = "Biden" (confidence: 0.75)
A threshold is set (say, at 90%) so that if the n+2 prediction is at or above it (as in the first example), the model uses that token without having to determine it with the main model. It's confident "enough".
rfoo · 3h ago
It could be a better draft model than separately trained EAGLE etc for speculative decoding.
Razengan · 33m ago
Could someone kindly point to a convenient all-in-one ELI5 of all these words? :')
wickedsight · 18m ago
For me, ChatGPT or any of the other current thinking models are very useful for this type of stuff. I just ask it to explain things at my level and then I can ask questions for clarification.
Alifatisk · 1h ago
Alibaba keeps releasing gold content
I just tried Qwen3-Next-80B-A3B on Qwen chat, and it's fast! The quality seems to match Qwen3-235B-A22B. Quite impressive how they achieved this. Can't wait for the benchmarks at Artificial Analysis.
According to Qwen Chat, Qwen3-Next has the following limits:
Maximum context length: 262,144 tokens
Max summary generation length: 32,768 tokens
This is 2x higher on context length and 4x higher on summary generation compared to Qwen3-235B-A22B, damn
> Qwen3-Next [...] excels in ultra-long-context understanding and complex tasks
Even though their new hybrid architecture is fascinating, I think I'll continue to stick with Qwen2.5-Turbo because it's one of the few models that supports 1M tokens of context length. My use case is uploading large PDFs and asking questions across chapters.
pilotneko · 23m ago
If you read the model card, Qwen3-Next can be extended to 1M context length with YaRN.
> Qwen3-Next natively supports context lengths of up to 262,144 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 1 million tokens using the YaRN method.
Source: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct#proc...
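With transformers, that usually amounts to a rope_scaling override along these lines (a sketch only; take the exact field names and values from the model card rather than from here):

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct"

# Sketch: stretch the native 262,144-token window ~4x with YaRN.
# Exact values should come from the model card; a recent transformers
# build is needed for this architecture.
config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262144,
}

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```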
> If you read the model card, Qwen3-Next can be extended to 1M context length with YaRN.
I read the article, but as I said Qwen chat only provides up to 262k tokens in context length, so I'll stick with Qwen2.5 Turbo which supports 1M tokens.
I am not in a position where I can self-host yet
gizmodo59 · 1h ago
My take on long context for many frontier models: it's not about support, it's that accuracy drops drastically as you increase the context. Even if a model claims to support 10M context, the reality is it doesn't perform well when you saturate it. Curious to hear others' perspectives on this.
kridsdale3 · 29m ago
This is my experience with Gemini. Yes, I really can put an entire codebase and all the docs and pre-dev discussions and all the inter-engineer chat logs in there.
I still see the model becoming more intoxicated as turn count gets high.
cpursley · 53m ago
How are you prepping the PDF data before shoving it into Qwen?
navbaker · 26m ago
Not OP, but we use the docling library to extract text and put it in markdown before storing for use with an LLM.
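The basic flow is roughly this (a sketch of docling's DocumentConverter usage; the file names are just examples):

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("chapter1.pdf")            # example path
markdown = result.document.export_to_markdown()       # text + structure as markdown

with open("chapter1.md", "w", encoding="utf-8") as f:
    f.write(markdown)
```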
Alifatisk · 39m ago
I just compress the file size as low as possible without losing quality, didn't even know there were more ways to prep it.
I do sometimes chop up the PDF into smaller pdfs with their own individual chapters
amelius · 30m ago
On Linux you can also use pdftotext if you are only concerned with the text.
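For example (pdftotext ships with poppler-utils; wrapped in Python here just to keep the examples in one language):

```python
import subprocess

# -layout tries to preserve the original column layout of the PDF.
subprocess.run(["pdftotext", "-layout", "book.pdf", "book.txt"], check=True)
```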
mmmllm · 1h ago
The same week Oracle is forecasting huge data center demand and the stock is rallying. If these 10x gains in efficiency hold true then this could lead to a lot less demand for Nvidia, Oracle, Coreweave etc
mdp2021 · 47s ago
The real demand quality is not there, so more processing is very probably needed, and efficiency gains may allow that extra processing.
(A striking example, read today, of real demand quality: the administration of Albania wants some sort of automated Cabinet Minister. Not just an impartial and incorruptible algorithm (what we normally try to do with deterministic computation): a "minister". Good luck with that.)
Sure but where is the demand going to come from? LLMs are already in every google search, in Whatsapp/Messenger, throughout Google workspace, Notion, Slack, etc. ChatGPT already has a billion users.
Plus penetration is already very high in the areas where they are objectively useful: programming, customer care etc. I just don't see where the 100-1000x demand comes from to offset this. Would be happy to hear other views.
idopmstuff · 4m ago
> Plus penetration is already very high in the areas where they are objectively useful: programming, customer care etc.
Is that true? BLS estimates of customer service reps in the US is 2.8M (https://www.bls.gov/oes/2023/may/oes434051.htm), and while I'll grant that's from 2023, I would wager a lot that the number is still above 2M. Similarly, the overwhelming majority of software developers haven't lost their jobs to AI.
A sufficiently advanced LLM will be able to replace most, if not all of those people. Penetration into those areas is very low right now relative to where it could be.
philipp-gayret · 8m ago
If LLMs were next to free and faster I would personally increase my consumption 100x or more, and I'm only in the "programming" category.
mirekrusin · 25m ago
We've seen several orders of magnitude of improvement in CPUs over the years, yet you try to do anything now and the interaction is often slower than it was on a ZX Spectrum. We can easily absorb another order-of-magnitude improvement, and that's only going to create more demand. We can/will have models thinking for us all the time, in parallel, and bothering us only with findings/final solutions. There is no limit here really.
amelius · 27m ago
If you can make an LLM solve a problem but from 100 different angles at the same time, that's worth something.
mmmllm · 14m ago
Isn't that essentially how the MoE models already work? Besides, if that were infinitely scalable, wouldn't we have a subset of super-smart models already at very high cost?
Besides, this would only apply for very few use cases. For a lot of basic customer care work, programming, quick research, I would say LLMs are already quite good without running it 100X.
mirekrusin · 5m ago
MoE is something different - it's a technique to activate just a small subset of parameters during inference.
Whatever is good enough now, can be much better for the same cost (time, computation, actual cost). People will always choose better over worse.
sauwan · 28m ago
Long running agents?
ls65536 · 1h ago
I'm not going to speculate about what might be ahead in regards to Oracle's forecasting of data center demand, but regarding the idea of efficiency gains leading to lower demand, don't you think something like Jevons paradox might apply here?
thinkingemote · 47m ago
If someone had to bet on an AI crash, which I imagine would lead to unused datacentres and cheap GPUs, how would they invest their winnings to exploit these resources?
CuriouslyC · 24m ago
If the price of inference drops through the floor all the AI wrapper companies become instantly more valuable. Cursor is living on borrowed time because their agents suck and they're coasting on first mover advantage with weak products in general, but their position would get much better with cheap inference.
kridsdale3 · 31m ago
Assuming your question isn't rhetorical, massive Oracle Crypto Farm.
jstummbillig · 42m ago
For the last 2 years, despite all efficiency gains, I am literally watching characters appear on my screen, as if this was a hacker movie. Lately, I am also waiting for at least 60s for anything to appear at all.
If that happened at 10x the speed, it would still be slow in computer terms, and that increasingly matters, because I will not be the one reading the stuff – it will be other computers. I think looking back a few years from now, every single piece of silicon that is planned right will look like a laudable but laughable drop in the ocean.
syntaxing · 3h ago
The craziest part is how far MoE has come thanks to Qwen. This beats all those 72B dense models we've had before and runs faster than a 14B model, depending on how you offload across VRAM and CPU. That's insane.
moffkalast · 2h ago
In retrospect it's actually funny that last year Meta spent so many resources training a dense 405B model that both underperforms compared to models a tenth its size and is impossible to run at a reasonable speed on any hardware in existence.
jychang · 1h ago
Strong disagree.
Llama 4's release in 2025 is (deservedly) panned, but Llama 3.1 405b does not deserve that slander.
https://artificialanalysis.ai/#frontier-language-model-intel...
Do not compare 2024 models to the current cutting edge. At the time, Llama 3.1 405b was the very first open source (open weights) model to come close to the closed source cutting edge. It was very very close in performance to GPT-4o and Claude 3.5 Sonnet.
In essence, it was Deepseek R1 before Deepseek R1.
seunosewa · 51m ago
He is definitely talking about Llama4.
NitpickLawyer · 1h ago
It's not that clear. Yes, it underperforms in recent benchmarks and usecases (i.e. agentic stuff), but it is still one of the strongest open models in terms of "knowledge". Dense does have that advantage over MoE, even if it's extremely expensive to run inference on.
Meta: I generated a few dozen spongebobs last night on the same model and NONE were as good as this. Most started well but collapsed into decoherence at the end - missing the legs off. Then this morning the very same prompt to the same model API produced a perfect bob on the first attempt. Can utilization affect response quality, if all else remains constant? Or was it just random luck?
Edit: Ok, the very next attempt, a few minutes later, failed, so I guess it is just random, and you have about a 1 in 10 chance of getting a perfect spongebob from qwen3-coder, and ~0 chance with qwen3-next.
Naturally. That's how LLMs work. During training you measure the loss, the difference between the model output and the ground-truth and try to minimize it.
We prize models for their ability to learn. Here we can see that the large model does a great job at learning to draw bob, while the small model performs poorly.
ACCount37 · 47m ago
We don't value LLMs for rote memorization though. Perfect memorization is a long solved task. We value LLMs for their generalization capabilities.
A scuffed but fully original ASCII SpongeBob is usually more valuable than a perfect recall of an existing one.
One major issue with highly sparse MoE is that it appears to advance memorization more than it advances generalization. Which might be what we're seeing here.
endymion-light · 3h ago
I'd argue that actually, the smaller model is doing a better job at "learning" - in that, while the result is poor, it's capturing the key characteristics in an ASCII image.
The larger model already has it in the training corpus so it's not particularly a good measure though. I'd much rather see the capabilities of a model in trying to represent in ASCII something that it's unlikely to have in its training.
Maybe a pelican riding a bike as ascii for both?
ricardobeat · 2h ago
For the model to have memorized the entire sequence of characters precisely, this must appear hundreds of times in the training data?
ginko · 3h ago
Conveniently removed the artist's signature though.
irthomasthomas · 3h ago
Yes - they all do that. Actually, most attempts start well but unravel toward the end.
Certainly not defending LLMs here, don't take it that way.
Humans do it too. I have given up on my country's non-local information sources, because I could recognize original sources that were being deliberately omitted. There's a satiric webpage that is basically a reddit scrape. Most users don't notice, and those who do don't seem to care.
yorwba · 3h ago
Yes, the most likely reason the model omitted the signature is that humans reposted more copies of this image omitting the signature than ones that preserve it.
matchcc · 1h ago
I think there is some distillation relationship between Kimi K2 and Qwen Coder or other related models, or the same training data. I tried most of the LLMs; only Kimi K2 gave the exact same ASCII.
kimi K2:
Here’s a classic ASCII art of SpongeBob SquarePants for you:
For ASCII art to look right, not messed up, the generator has to know the width of the div in characters, e.g. 80, 240, etc, so it can make sure the lines don't wrap. So how does an LLM know anything about the UI it's serving? Is it just luck? What if you ask it to draw something that's like 16:9 in aspect ratio... would it know to scale it down so lines won't wrap? How about loss of detail if it does? Also, is it as good with Unicode art? So many questions.
Leynos · 17m ago
They don't see runs of spaces very well, so most of them are terrible at ASCII art. (They'll often regurgitate something from their training data rather than try themselves.)
And unless their terminal details are included in the context, they'll just have to guess.
Seems impressive. I believe better architectures are really the path forward; I don't think you need more than 100B params, judging by this model and what GPT-OSS 120B can achieve.
CuriouslyC · 8m ago
We definitely need more parameters, low param models are hallucination machines, though low actives is probably fine assuming the routing is good.
NitpickLawyer · 4h ago
New arch seems cool, and it's amazing that we have these published in the open.
That being said, qwen models are extremely overfit. They can do some things well, but they are very limited in generalisation, compared to closed models. I don't know if it's simply scale, or training recipes, or regimes. But if you test them OOD (out of distribution), the models utterly fail to deliver, where the closed models still provide value.
vintermann · 4h ago
Could you give some practical examples? I don't know what Qwen's 36T-token training set is like, so I don't know what it's overfitting to...
NitpickLawyer · 4h ago
Take math and coding for example:
- in math, if they can solve a problem, or a class of problems, they'll solve it. If you use a "thinking" model + maj@x, you'll get strong results. But if you try for example to have the model consider a particular way or method of exploring a problem, it'll default to "solving" mode. It's near impossible to have it do something else with a math problem, other than solving it. Say "explore this part, in this way, using this method". Can't do it. It'll maybe play a bit, but then enter "solving" mode and continue to solve it as it was trained.
In practice, this means that "massive parallel" test time compute becomes harder to do with these models, because you can't "guide" them towards certain aspects of a problem. They are extremely "stubborn".
- in coding it's even more obvious. Ask them to produce any 0-shot, often-tested and often-shown thing (SPA, game, visualisation, etc) - and they do it. Convincingly.
But ask them to look at a piece of code and extract meaning, and they fail. Or ask them to reverse an implementation. Figure out what a function does and reverse its use, or make it do something else, and they fail.
CuriouslyC · 6m ago
That's the thing people miss that's so good about GPT5. It's incredibly steerable in a way a lot of models aren't.
vintermann · 3h ago
Oof, that sounds frustrating. Yeah, I can relate to this failure mode, it's basically "did you mean (more likely query)" up to 11.
It does sound like an artifact of the dialog/thinking tuning though.
Hmm. 80B. These days I am on the lookout for new models in the 32B range, since that is what fits and runs comfortably on my MacBook Pro (M4, 64GB).
I use ollama every day for spam filtering: gemma3:27b works great, but I use gpt-oss:20b on a daily basis because it's so much faster and comparable in performance.
electroglyph · 1h ago
It'll run great, it's an MoE.
slimebot80 · 4h ago
Complete newbie here - some questions, if I may!
This stuff can run on a local machine without internet access, correct?
And it can pretty much match Nano Banana? https://github.com/PicoTrex/Awesome-Nano-Banana-images/blob/...
Also -- what are the specs for a machine to run it (even if slowly!)
NitpickLawyer · 4h ago
This model can be run completely offline, yes. You'll need anywhere from 60-200 gb of RAM (either VRAM for high speeds, or a combination of VRAM and RAM, or just CPU+RAM). The active params are really low (3B) so it'll likely run fine even on CPU. Should get 10-15+t/s even on old DDR4 systems. Offload some experts to a GPU (can be as low as 8-16gb) and you'll see greater speeds.
This has nothing to do with nano banana, or image generation. For that you want the qwen image edit[1] models.
1 - https://huggingface.co/Qwen/Qwen-Image-Edit
What you mean is Qwen Image and Qwen Image Edit; you can run those on a local machine, using the Draw Things application for example.
The model discussed here is a text model, so similar to ChatGPT. You can also run it on your local machine, but not yet, as apps need to be updated with Qwen3-Next support (llama.cpp, Ollama, etc).
dragonwriter · 4h ago
> This stuff can run on a local machine without internet access, correct?
Yes.
> And it can pretty much match Nano Banana?
No, Qwen3-Next is not a multimodal model, it has no image generation function.
Davidzheng · 4h ago
Isn't this one a text model
slimebot80 · 4h ago
Ah, maybe! I am lost reading this page with all the terminology
arcanemachiner · 3h ago
You'll get used to it.
Make sure to lurk on r/LocalLlama.
diggan · 1h ago
> Make sure to lurk on r/LocalLlama.
Please do take everything you read there with a bit of salt though, as the "hive-mind" effect is huge there, even when compared to other subreddits.
I'm guessing the huge influx of money + reputations on the line + a high traffic community is ripe for both hive-minding + influence campaigns.
kristopolous · 1h ago
I was getting a bunch of strange hallucinations and weird dialog. It sounds like some exasperated person on the verge of a mental breakdown
techsystems · 3h ago
How does the context length scaling at 256K tokens compare to Llama's 1M in terms of performance? How are the contexts treated differently?
Western0 · 41m ago
where is gguf?
pveierland · 3h ago
> "The content loading failed."
It's amazing how far and how short we've come with software architectures.
yekanchi · 3h ago
How much VRAM does it require?
NitpickLawyer · 3h ago
A good rule of thumb is to think that one param is one unit of storage. The "default" unit of storage these days is bf16 (i.e. 16 bits for 1 weight). So for an 80B model that'll be ~160GB of weights. Then you have quantisation, usually in 8bit and 4bit. That means each weight is "stored" in 8 bits or 4 bits. So for an 80B model that'll be ~80GB in fp8 and ~40GB in fp4/int4.
But in practice you need a bit more than that. You also need some space for context, and then for kv cache, potentially a model graph, etc.
So you'll see in practice that you need 20-50% more RAM than this rule of thumb.
For this model, you'll need anywhere from 50GB (tight) to 200GB (full) RAM. But it also depends how you run it. With MoE models, you can selectively load some experts (parts of the model) in VRAM, while offloading some in RAM. Or you could run it fully on CPU+RAM, since the active parameters are low - 3B. This should work pretty well even on older systems (DDR4).
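Back-of-the-envelope, the rule of thumb above looks like this (the 1.2x overhead factor is just the rough 20% figure, not a precise number):

```python
def estimate_gb(params_billion, bits_per_weight, overhead=1.2):
    """Weights + ~20% headroom for context, KV cache, graph, etc."""
    weights_gb = params_billion * bits_per_weight / 8   # 1B params at 8 bits ~= 1 GB
    return weights_gb * overhead

for label, bits in [("bf16", 16), ("fp8", 8), ("int4", 4)]:
    print(f"80B @ {label}: ~{estimate_gb(80, bits):.0f} GB")
# 80B @ bf16: ~192 GB
# 80B @ fp8: ~96 GB
# 80B @ int4: ~48 GB
```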
theanonymousone · 2h ago
But the RAM+VRAM can never be less than the size of the total (not active) model, right?
NitpickLawyer · 2h ago
Correct. You want everything loaded, but for each forward pass just some experts get activated so the computation is less than in a dense model.
That being said, there are libraries that can load a model layer by layer (say from an ssd) and technically perform inference with ~8gb of RAM, but it'd be really really slow.
theanonymousone · 2h ago
Can you give me a name please? Is that distributed llama or something else?
That's not a meaningful question. Models can be quantized to fit into much smaller memory requirements, and not all MoE layers (in MoE models) have to be offloaded to VRAM to maintain performance.
yekanchi · 3h ago
I mean 4-bit quantized. I can roughly calculate VRAM for dense models by model size, but I don't know how to do it for MoE models.
EnPissant · 3h ago
MoE models need just as much VRAM as dense models because every token may use a different set of experts. They just run faster.
regularfry · 3h ago
This isn't quite right: it'll run with the full model loaded to RAM, swapping in the experts as it needs. It has turned out in the past that experts can be stable across more than one token so you're not swapping as much as you'd think. I don't know if that's been confirmed to still be true on recent MoEs, but I wouldn't be surprised.
EnPissant · 2h ago
What you are describing would be uselessly slow and nobody does that.
DiabloD3 · 1h ago
I don't load all the MoE layers onto my GPU, and I have only about a 15% reduction in token generation speed while maintaining a model 2-3 times larger than VRAM alone.
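For reference, a typical llama.cpp setup for that kind of partial offload looks roughly like the sketch below (wrapped in Python only to keep the examples in one language; the --override-tensor regex and model path are placeholders that depend on the GGUF's tensor names, and llama.cpp doesn't support Qwen3-Next itself yet):

```python
import subprocess

# Keep attention/shared weights on the GPU (-ngl 99) and push the MoE expert
# FFN tensors to CPU RAM via --override-tensor. Adjust regex/paths per model.
subprocess.run([
    "llama-server",
    "-m", "some-moe-model-Q4_K_M.gguf",
    "-ngl", "99",
    "--override-tensor", ".ffn_.*_exps.=CPU",
    "-c", "32768",
])
```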
furyofantares · 1h ago
I do it with gpt-oss-120B on 24 GB VRAM.
littlestymaar · 1h ago
AFAIK many people on /r/localLlama do pretty much that.
keyle · 3h ago
For a model that can run offline, they've nailed how the website can too.