Ask HN: What is the best LLM for consumer grade hardware?
209 VladVladikoff 165 5/30/2025, 11:02:19 AM
I have a 5060ti with 16GB VRAM. I’m looking for a model that can hold basic conversations, no physics or advanced math required. Ideally something that can run reasonably fast, near real time.
In general there's no "best" LLM model, all of them will have some strengths and weaknesses. There are a bunch of good picks; for example:
> DeepSeek-R1-0528-Qwen3-8B - https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
Released today; probably the best reasoning model in 8B size.
> Qwen3 - https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2...
Recently released. Hybrid thinking/non-thinking models with really great performance and plethora of sizes for every hardware. The Qwen3-30B-A3B can even run on CPU with acceptable speeds. Even the tiny 0.6B one is somewhat coherent, which is crazy.
Normally with llama.cpp you specify how many (full) layers you want to put on the GPU (-ngl). But offloading to the CPU the specific tensors that don't require heavy computation saves GPU space without affecting speed that much [1]; see the sketch after the links below.
I've also read a paper on loading only "hot" neurons onto the GPU [2]. The future of home AI looks so cool!
[1] https://www.reddit.com/r/LocalLLaMA/comments/1ki7tg7/dont_of...
[2] https://arxiv.org/abs/2312.12456
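A minimal sketch of what that looks like on the command line, assuming a MoE GGUF whose expert FFN tensors you want to keep in system RAM (model path and context size are illustrative):

    # Offload all layers to the GPU, but override the placement of the
    # MoE expert tensors so they stay in CPU RAM.
    llama-server \
      -m ./Qwen3-30B-A3B-Q4_K_M.gguf \
      --n-gpu-layers 99 \
      -ot ".ffn_.*_exps.=CPU" \
      --ctx-size 16384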
For folks new to reddit, it's worth noting that LocalLlama, just like the rest of the internet but especially reddit, is filled with misinformed people spreading incorrect "facts" as truth, and you really can't use the upvote/downvote count as an indicator of quality or how truthful something is there.
Something that is more accurate but put in a boring way will often be downvoted, while straight up incorrect but funny/emotional/"fitting the group think" comments usually get upvoted.
For those of us who've spent a lot of time on the web, this sort of bullshit detector is basically built in at this point, but if you're new to places where the groupthink is as heavy as it is on reddit, it's worth being careful about taking anything at face value.
LocalLlama is good for:
- Learning basic terms and concepts.
- Learning how to run local inference.
- Inference-level considerations (e.g., sampling).
- Pointers to where to get other information.
- Getting the vibe of where things are.
- Healthy skepticism about benchmarks.
- Some new research; there have been a number of significant discoveries that either originated in LocalLlama or got popularized there.
LocalLlama is bad because:
- Confusing information about finetuning; there's a lot of myths from early experiments that get repeated uncritically.
- Lots of newbie questions get repeated.
- Endless complaints that it's been too long since a new model was released.
- Most new research; sometimes a paper gets posted but most of the audience doesn't have enough background to evaluate the implications of things even if they're highly relevant. I've seen a lot of cutting edge stuff get overlooked because there weren't enough upvoters who understood what they were looking at.
Is there a good place for this? Currently I just regularly sift through all of the garbage myself on arxiv to find the good stuff, but it is somewhat of a pain to do.
It also helps that the target audience has been filtered with that moderation, so over time this site (on average) skews more technical and informed.
One way to gauge this property of a community is whether people who are known experts in a respective field participate in it, and unfortunately there are very few of them on HackerNews (this was not always the case). I've had some opportunities to meet with people who are experts, usually at conferences/industry events, and while many of them tend to be active on Twitter... they all say the same things about this site, namely that it's simply full of bad information and the amount of effort needed to dispel that information is significantly higher than the amount of effort needed to spread it.
Next time someone posts an article about a topic you are intimately familiar with, like top-1% subject-matter-expert familiar, review the comment section and you'll find heaps of misconceptions, superficial knowledge, and, my favorite, the contrarians who hold very strong opinions about a subject they have only passing knowledge of, yet state them with an extremely high degree of confidence.
One issue is you may not actually be a subject matter expert on a topic that comes up a lot on HackerNews, so you won't recognize that this happens... but while people here are a lot more polite and the moderation policies do encourage good behavior... moderation policies don't do a lot to stop the spread of bad information from poorly informed people.
I consider myself an expert in one tiny niche field (computer-generated code), and when that field has come up (on HN and elsewhere) over the last 30 years, the general mood (from people who don't do it) has been that it produces poor-quality code.
Pre-AI this was demonstrably untrue, but meh, I don't need to convince you, so I accept your point of view and continue doing my thing. Our company revenue is important to me, not the opinion of some guy on the internet.
(AI has freshened the conversation, and it is currently giving mixed results, which is to be expected since it is non-deterministic. But I've been doing deterministic generation for 35 years.)
So yeah. Lots of comments from people who don't do something, and I'm really not interested in taking the time to "prove" them wrong.
But equally I think the general level of discussion in areas where I'm not an expert (but experienced) is high. And around a lot of topics experience can be highly different.
For example, companies, employees, and employers come in all sorts of flavors. Some folks have been burned and see (all) management in a certain light, whereas of course some are good and some are bad.
Yes, most people still use voting as a measure of "I agree with this", rather than the quality of the discussion, but that's just people, and I'm not gonna die on that hill.
And yeah, I'm not above joining in on a topic I don't technically use or know much about. I'll happily say that the main use for crypto (as a currency) is for illegal activity. Or that crypto in general is a ponzi scheme. Maybe I'm wrong, maybe it really is the future. But for now, it walks like a duck.
So I both agree, and disagree, with you. But I'm still happy to hang out here and get into (hopefully) illuminating discussions.
Perhaps we are defining experts differently?
The other aspect is that people on here think that if they are an expert in one thing, they instantly become an expert in another thing.
There's also no actual constructive discussion when it comes to future-looking tech. The Cybertruck, Vision Pro, and LLMs are some of the most recent items that the most popular comments called completely wrong, and the reasoning behind those predictions had no actual substance.
E.g. just look at thunderf00t's videos from 2012 onwards.
Yet people were literally banned here just for pointing out that he hasn't actually delivered on anything in the capacity he promised until he did the salute.
It's pointless to list other examples, as this page is, as dingnuts pointed out, exactly the same, and most people aren't actually willing to change their opinion based on arguments. They're set in their opinions and think everyone else is dumb.
I'd be shocked if they (you?) were banned just for critiquing Musk. So please link the post. I'm prepared to be shocked.
I'm also pretty sure that I could make a throwaway account that only posted critiques of Musk (or about any single subject for that matter) and manage to keep it alive by making the critiques timely, on-topic and thoughtful or get it banned by being repetitive and unconstructive. So would you say I was banned for talking about <topic>? Or would you say I was banned for my behavior while talking about <topic>?
Today he’s pitching moonshot projects as core to Tesla.
10 years ago he was saying self-driving was easy, but he was also selling by far the best electric vehicle on the market. So lying about self driving and Tesla semis mattered less.
Fwiw I've been subbed to tf00t since his 50-part creationist videos in the early 2010s.
> They're set in their opinions and think everyone else is dumb.
Well, anyway, I read and post comments here because commenters here think critically about discussion topics. It’s not a perfect community with perfect moderation but the discussions are of a quality that’s hard to find elsewhere, let alone reddit.
Scroll to the bottom of comment sections on HN, you’ll find the kind of low-effort drive-by comments that are usually at the top of Reddit comment sections.
In other words, it helps to have real moderators.
HN has an active grifter culture reinforced by the VC funding cycles. Reddit can only dream about lying as well as HN does.
HN tends to push up grifter hype slop, and there are a lot of those people around cause VC, but you can still see comments pushing back.
Reading reddit reminds me of high school forum arguments I had 20 years ago, but lower quality because of population selection. It's just too mainstream at this point and shows you what the middle of the bell curve looks like.
Personally I also think the submissions that make it to the front page(s) are much better than any subreddit.
> Friend, this website is EXACTLY the same
And it knows it: https://news.ycombinator.com/item?id=4881042
For example, I find all the comments about model X being more "friendly" or "chatty" and model Y being more "unhinged" or whatever to be mostly BS. There are a gazillion ways a conversation can go, and I don't find model X or Y to be consistently chatty or unhinged or creative or whatever every time.
Like, we’re fucking two years in and only now do we have a thread about something like this? The whole crowd here needs to speed up to catch up.
There are others wondering if this is another hype juggernaut like CORBA, J2EE, WSDL, XML, no-SQL, or who-knows-what. A way to do things that some people treated as the new One True Way, but others could completely bypass for their entire, successful career and look at it now in hindsight with a chuckle.
I think LLMs will find their uses, it just takes time to distil what they are really useful for vs what the AI companies are generating hype for.
For example, I think they can be used to create better auto-complete by giving them the context information (matching functions, etc.) and letting them generate the completion text from that.
No. I was not trolling. If you explain why you think I'm trolling I could provide a better response to your generic reply.
Are you trolling?
I wouldn't count Qwen as that much of a conversationalist though. Mistral Nemo and Small are pretty decent. All of Llama 3.X are still very good models even by today's standards. Gemma 3s are great but a bit unhinged. And of course QwQ when you need GPT4 at home. And probably lots of others I'm forgetting.
Sometimes it’s hard to find models that can effectively use tools
https://leaderboard.techfren.net/
Thanks for the recommendation, will try it out
Thank you for thinking of the vibe coders.
> all of them will have some strengths and weaknesses
Sometimes a higher-parameter model with less quantization and a small context will be the best, sometimes a lower-parameter model with some quantization and a huge context will be the best, and sometimes a high parameter count + lots of quantization + a medium context will be the best.
It's really hard to say one model is better than another in a general way, since it depends on so many things like your use case, the prompts, the settings, quantization, quantization method and so on.
If you're building/trying to build stuff depending on LLMs in any capacity, the first step is coming up with your own custom benchmark/evaluation that you can run with your specific use cases being put under test. Don't share this publicly (so it doesn't end up in the training data) and run it in order to figure out what model is best for that specific problem.
Anything below 4-bits is usually not worth it unless you want to experiment with running a 70B+ model -- though I don't have any experience of doing that, so I don't know how well the increased parameter size balances the quantization.
See https://github.com/ggml-org/llama.cpp/pull/1684 and https://gist.github.com/Artefact2/b5f810600771265fc1e3944228... for comparisons between quantization levels.
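If you want to produce and compare quantization levels yourself, llama.cpp ships the tools for it; a rough sketch (binary names assume a recent llama.cpp build, file names are illustrative):

    # Make two quants from an F16 GGUF, then measure perplexity on your own
    # text to judge how much quality each level loses.
    ./llama-quantize model-F16.gguf model-Q4_K_M.gguf Q4_K_M
    ./llama-quantize model-F16.gguf model-Q5_K_M.gguf Q5_K_M
    ./llama-perplexity -m model-Q4_K_M.gguf -f your-eval-text.txt
    ./llama-perplexity -m model-Q5_K_M.gguf -f your-eval-text.txt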
Note that that's a skill issue of whoever quantized the model. In general, quantization even as low as 3-bit can be almost lossless when you do quantization-aware finetuning[1] (and apparently you don't even need that many training tokens), but even if you don't want to do any extra training you can be smart about which parts of the model you quantize and by how much, to minimize the damage (e.g. in the worst case, over-quantizing even a single weight can have disastrous consequences[2]).
Some time ago I ran an experiment where I finetuned a small model while quantizing parts of it to 2-bits to see which parts are most sensitive (the numbers are the final loss; lower is better):
So as you can see, quantizing some parts of the model affects it more strongly than others. The downprojection in the MLP layers is the most sensitive part of the model (which also matches what [2] found), so it makes sense to quantize this part less and instead quantize other parts more strongly. But if you just do the naive "quantize everything to 4-bit" then sure, you might get broken models.

[1] https://arxiv.org/pdf/2502.02631

[2] https://arxiv.org/pdf/2411.07191
And it's not a skill issue... it's the default behaviour/logic when using k-quants to quantize a model with llama.cpp.
I’m a bit surprised that 8GB is useful as a context window if that is the case—it just seems like you could fit a ton of research papers, emails, and textbooks in 2GB, for example.
But, I’m commenting from a place of ignorance and curiosity. Do models blow up the info in the context window, maybe do some processing to pre-digest it?
You absolutely cannot fit even a single research paper in 2 GB, much less an entire book.
Actually DeepSeek-R1-0528-Qwen3-8B was uploaded Thursday (yesterday) at 11 AM UTC / 7 PM CST. I had to check if a new version came out since! I am waiting for the other sizes! ;D
https://www.adweek.com/media/a-federal-judge-ordered-openai-...
Can it be solved locally with locally running MCPs? Or maybe it's a system API - like reading your calendar or checking your email. Otherwise it identifies the best cloud model and sends the prompt there.
Basically Siri if it was good
That idea makes so much sense on paper, but it's not until you start implementing it that you realize why no one does it (including Siri). "Some tasks are complex and better suited to a complex giant model, but small models are perfectly capable of handling simple, limited tasks" makes a ton of sense, but the component best equipped to make that call is the smarter component of your system. At which point, you might as well have had it run the task.
It's like assigning the intern to triage your work items.
When actually implementing the application with that approach, every time you encounter an "AI miss" you (understandably) blame the small model, and eventually you give up and delegate yet another scenario to the cloud model.
Eventually you feel like you're artificially handcuffing yourself compared to literally everybody else by trying to ship something built on a 1B model. You end up with the worst of all options: a crappy model with lots of hiccups that is still (by far) the most resource-intensive part of your application, making the whole thing super heavy, while you delegate more and more to the cloud model anyway.
The local-LLM scenario is going to be driven entirely by privacy concerns (for which there is no alternative; it's not like an E2EE LLM API could exist) or by cost concerns, if you believe you can run it cheaper.
Of course, it still isn't at the same level as Codex itself, the model Codex is using is just way better so of course it'll get better results. But Devstral (as I currently use it) is able to make smaller changes and refactors, and I think if I evolve the software a bit more, can start making larger changes too.
And why not just use OpenHands, which it was designed around, and which I presume can also do all those things?
It's an AI-driven chat system designed to support students in the Introduction to Computing course (ECE 120) at UIUC, offering assistance with course content, homework, or troubleshooting common problems.
It serves as an educational aid integrated into the course’s learning environment using UIUC Illinois Chat system [2].
Personally I've found it really useful that it provides the relevant portions of the course study materials (for example, the slides) directly related to the discussion, so students can check the sources and verify the answers provided by the LLM.
It seems to me that RAG is the killer feature for local LLMs [3]. It directly addresses the main pain point of LLM hallucinations and helps LLMs stick to the facts.
[1] Introduction to Computing course (ECE 120) Chatbot:
https://www.uiuc.chat/ece120/chat
[2] UIUC Illinois Chat:
https://uiuc.chat/
[3] Retrieval-augmented generation [RAG]:
https://en.wikipedia.org/wiki/Retrieval-augmented_generation
Running it locally helps me understand how these things work under the hood, which raises my value on the job market. I also play with various ideas which have LLM on the backend (think LLM-powered Web search, agents, things of that nature), I don't have to pay cloud providers, and I already had a gaming rig when LLaMa was released.
- Experiments with inference-level control; can't do the Outlines / Instructor stuff with most API services, can't do the advanced sampling strategies, etc. (They're catching up but they're 12 months behind what you can do locally.)
- Small, fast, finetuned models; _if you know what your domain is sufficiently to train a model you can outperform everything else_. General models usually win, if only due to ease of prompt engineering, but not always.
- Control over which model is being run. Some drift is inevitable as your real-world data changes, but when your model is also changing underneath you it can be harder to build something sustainable.
- More control over costs; this is the classic on-prem versus cloud decision. Most cases you just want to pay for the cloud...but we're not in ZIRP anymore and having a predictable power bill can trump sudden unpredictable API bills.
In general, the move to cloud services was originally a cynical OpenAI move to keep GPT-3 locked away. They've built up a bunch of reasons to prefer the in-cloud models (heavily subsidized fast inference, the biggest and most cutting edge models, etc.) so if you need the latest and greatest right now and are willing to pay, it's probably the right business move for most businesses.
This is likely to change as we get models that can reasonably run on edge devices; right now it's hard to build an app or a video game that incidentally uses LLM tech because user revenue is unlikely to exceed inference costs without a lot of careful planning or a subscription. Not impossible, but definitely adds business challenges. Small models running on end-user devices opens up an entirely new level of applications in terms of cost-effectiveness.
If you need the right answer, sometimes only the biggest cloud API model is acceptable. If you've got some wiggle room on accuracy and can live with sometimes getting a substandard response, then you've got a lot more options. The trick is that the things an LLM is best at are always going to be things where less than five nines of reliability is acceptable, so even though the biggest models are more reliable, on average there are many tasks where you might be just fine with a small, fast model that you have more control over.
Mostly I use it for testing tools and integrations via API, so as not to spend money on subscriptions. When I see something working, I switch to a proprietary model to get the best results.
The stuff you can run on reasonable home hardware (e.g. a single GPU) isn't going to blow your mind. You can get pretty close to GPT3.5, but it'll feel dated and clunky compared to what you're used to.
Unless you have already spent big $$ on a GPU for gaming, I really don't think buying GPUs for home makes sense, considering the hardware and running costs, when you can go to a site like vast.ai and borrow one for an insanely cheap amount to try it out. You'll probably get bored and be glad you didn't spend your kids' college fund on a rack of H100s.
The average person in r/locallama has a machine that would make r/pcmasterrace users blush.
A brand new Mac Mini M4 is only $499.
https://www.apple.com/shop/buy-mac/mac-mini/m4
https://www.microcenter.com/product/688173/Mac_mini_MU9D3LL-...
The only thing itching me to get a new machine is that it needs a 19V power supply. Luckily it's a pretty common barrel size, and I already had several power cables lying around that work just fine. I'd prefer to have all my portable devices run off USB-C though.
https://www.microcenter.com/product/676305/acer-aspire-3-a31...
> (MoE) divides an AI model into separate sub-networks (or "experts"), each specializing in a subset of the input data, to jointly perform a task.
What you typically end up with in memory constrained environments is that the core shared layers are in fast memory (VRAM, ideally) and the rest are in slower memory (system RAM or even a fast SSD).
MoE models are typically shallow-but-wide in comparison with dense models of similar total size, and only a small fraction of the parameters (the selected experts) are active for each token, so they end up being much faster than an equivalent dense model.
There's no one "best" model, you just try a few and play with parameters and see which one fits your needs the best.
Since you're on HN, I'd recommend skipping Ollama and LMStudio. They might restrict access to the latest models and you typically only choose from the ones they tested with. And besides what kind of fun is this when you don't get to peek under the hood?
llama.cpp can do a lot itself, and you can run most recently released models (when changes are needed, they adjust literally within a few days). You can get models from Hugging Face, obviously. I prefer the GGUF format; it saves me some memory (you can use lower quantization; I find most 6-bit quants somewhat satisfactory).
I find that the size of the model's GGUF file roughly tells me whether it'll fit in my VRAM. For example, a 24GB GGUF model will NOT fit in 16GB, whereas a 12GB one likely will. However, the more context you add, the more memory will be needed.
Keep in mind that models are trained with a certain context window. If a model has an 8K-token context (like most older models do) and you load it with a 32K context, it won't be much help.
You can run llama.cpp on Linux, Windows, or macOS; you can get the binaries or compile locally. It can split the model between VRAM and RAM (if the model doesn't fit in your 16GB). It even has a simple React front-end (llama-server). The same binary provides a REST service with a protocol similar to (but simpler than) OpenAI's and all the other "big" guys'.
Since it implements the OpenAI REST API, it also works with a lot of front-end tools if you want more functionality (e.g. oobabooga, a.k.a. text-generation-webui).
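To illustrate, a minimal request against a locally running llama-server looks roughly like this (default port 8080; prompt and sampling values are just examples):

    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "temperature": 0.7
      }'

The server answers with the usual OpenAI-style JSON, so most existing client libraries work by just pointing their base URL at it.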
Koboldcpp is another backend you can try if you find llama.cpp too raw (I believe it's still llama.cpp under the hood).
`ollama run hf.co/unsloth/DeepSeek-R1-0528-GGUF:Q8_0`
I disagree. With Ollama I can set up my desktop as an LLM server, interact with it over WiFi from any other device, and let Ollama switch seamlessly between models as I want to swap. Unless something has changed recently, with llama.cpp's CLI you still have to shut it down and restart it with a different command line flag in order to switch models even when run in server mode.
That kind of overhead gets in the way of experimentation and can also limit applications: there are some little apps I've built that rely on being able to quickly swap between a 1B and an 8B or 30B model by just changing the model parameter in the web request.
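For what it's worth, the swap is literally just the "model" field in the request; a sketch against Ollama's OpenAI-compatible endpoint (model tags are examples and assume you've already pulled them):

    # Small model for the cheap/fast path...
    curl http://localhost:11434/v1/chat/completions \
      -d '{"model": "llama3.2:1b", "messages": [{"role": "user", "content": "Is this text spam? Answer yes or no."}]}'

    # ...bigger model for the heavier request; Ollama loads/unloads as needed.
    curl http://localhost:11434/v1/chat/completions \
      -d '{"model": "qwen3:30b", "messages": [{"role": "user", "content": "Walk through the tradeoffs in detail."}]}'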
When you get Ollama to "switch seamlessly" between models, it is still simply reloading a different model with llama.cpp, which is what it's based on.
I prefer llamacpp because doing things "seamlessly" obscures the way things work behind the scenes, which is what I want to learn and play with.
Also, and I'm not sure if it's the case anymore but it used to be, when llamacpp gets adjusted to work with the latest model, sometimes it takes them a bit to update the Python API which is what Ollama is using. It was the case with one of the LlaMas, forget which one, where people said "oh yeah don't try this model with Ollama, they're waiting on llamacpp folks to update llama-cpp-python to bring the latest changes from llamacpp, and once they do, Ollama will bring the latest into their app and we'll be up and running. Be patient."
[1] https://www.localscore.ai/
You can even keep track of the quality of the answers over time to help guide your choice.
https://openwebui.com/
Once I figured out my local ROCm setup Ollama was able to run with GPU acceleration no problem. Connecting an OpenWebUI docker instance to my local Ollama server is as easy as a docker run command where you specify the OLLAMA_BASE_URL env var value. This isn't a production setup, but it works nicely for local usages like what the immediate parent is describing.
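The docker run in question looks roughly like this (based on the Open WebUI docs; host.docker.internal assumes Docker Desktop, so adjust the URL if Ollama runs elsewhere):

    docker run -d -p 3000:8080 \
      -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
      -v open-webui:/app/backend/data \
      --name open-webui \
      ghcr.io/open-webui/open-webui:main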
https://docs.openwebui.com/license/
However it's heavily censored on political topics because of its Chinese origin. For world knowledge, I'd recommend Gemma3.
This post will be outdated in a month. Check https://livebench.ai and https://aider.chat/docs/leaderboards/ for up to date benchmarks
The pace of change is mind boggling. Not only for the models but even the tools to put them to use. Routers, tools, MCP, streaming libraries, SDKs...
Do you have any advice for someone who is interested, developing alone and not surrounded by coworkers or meetups who wants to be able to do discovery and stay up to date?
It holds its value, so you won't lose much, if anything, when you resell it.
But otherwise, as said, install Ollama and/or Llama.cpp and run the model using the --verbose flag.
This will print out the tokens-per-second result after each prompt is returned (example below).
Then find the best model that gives you a token per second speed you are happy with.
And as also said, 'abliterated' models are less censored versions of normal ones.
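The --verbose run mentioned above looks like this (model tag is just an example):

    # Prints timing stats, including eval rate in tokens/second,
    # after each response.
    ollama run llama3.1:8b --verbose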
Going below Q4 isn't worth it IMO. If you want significantly more context, probably drop down to a Q4 quant of Qwen3-8B rather than continuing to lobotomize the 14B.
Some folks have been recommending Qwen3-30B-A3B, but I think 16GB of VRAM is probably not quite enough for that: at Q4 you'd be looking at 15GB for the weights alone. Qwen3-14B should be pretty similar in practice though despite being lower in param count, since it's a dense model rather than a sparse one: dense models are generally smarter-per-param than sparse models, but somewhat slower. Your 5060 should be plenty fast enough for the 14B as long as you keep everything on-GPU and stay away from CPU offloading.
Since you're on a Blackwell-generation Nvidia chip, using LLMs quantized to NVFP4 specifically will provide some speed improvements at some quality cost compared to FP8 (and will be faster than Q4 GGUF, although ~equally dumb). Ollama doesn't support NVFP4 yet, so you'd need to use vLLM (which isn't too hard, and will give better token throughput anyway). Finding pre-quantized models at NVFP4 will be more difficult since there's less-broad support, but you can use llmcompressor [1] to statically compress any FP16 LLM to NVFP4 locally — you'll probably need to use accelerate to offload params to CPU during the one-time compression process, which llmcompressor has documentation for.
I wouldn't reach for this particular power tool until you've decided on an LLM already, and just want faster perf, since it's a bit more involved than just using ollama and the initial quantization process will be slow due to CPU offload during compression (albeit it's only a one-time cost). But if you land on a Q4 model, it's not a bad choice once you have a favorite.
1: https://github.com/vllm-project/llm-compressor
I was trying Patricide unslop mell and some of the Qwen ones recently. Up to a point more params is better than worrying about quantization. But eventually you'll hit a compute wall with high params.
KV cache quantization is awesome (I use q4 for a 32k context with a 1080ti!) and context shifting is also awesome for long conversations/stories/games. I was using ooba but found recently that KoboldCPP not only runs faster for the same model/settings but also Kobold's context shifting works much more consistently than Ooba's "streaming_llm" option, which almost always re-evaluates the prompt when hooked up to something like ST.
Rough memory needed for the weights of an 8B-parameter model:
- FP16: 2 bytes x 8B = 16GB
- Q8: 1 byte x 8B = 8GB
- Q4: 0.5 bytes x 8B = 4GB
It doesn't map 100% neatly like this, but it gives you a rough measure. On top of this you need some more memory depending on the context length and some other stuff.
Rationale for the calculation above: a model is basically billions of variables, each holding a floating-point value, so the size of a model roughly maps to the number of variables (weights) x the precision of each variable (4, 8, 16 bits...).
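A back-of-the-envelope version of that calculation, ignoring context/KV-cache overhead:

    # Approximate weight memory in GB = params (in billions) * bits per weight / 8
    approx_gb() { echo "$1 * $2 / 8" | bc -l; }
    approx_gb 8 16   # 8B model at FP16 -> ~16 GB
    approx_gb 8 8    # 8B model at Q8   -> ~8 GB
    approx_gb 8 4    # 8B model at Q4   -> ~4 GB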
You don't have to quantize all layers to the same precision; this is why you sometimes see fractional quantizations like 1.58 bits.
For that level you can pack 4 weights into a byte using 2 bits per weight. However, there is one bit configuration per weight that goes unused.
More complex packing arrangements group weights together (e.g. in groups of 3) and assign a bit configuration to each combination of values via a lookup table. This allows greater compression, closer to the 1.58-bit value.
I'd like to know how many tokens you can get out of the larger models especially (using Ollama + Open WebUI on Docker Desktop, or LM Studio whatever). I'm probably not upgrading GPU this year, but I'd appreciate an anecdotal benchmark.
That said, Unsloth's version of Qwen3 30B, running via llama.cpp (don't waste your time with any other inference engine), with the following arguments (documented in Unsloth's docs, but sometimes hard to find): `--threads (number of threads your CPU has) --ctx-size 16384 --n-gpu-layers 99 -ot ".ffn_.*_exps.=CPU" --seed 3407 --prio 3 --temp 0.6 --min-p 0.0 --top-p 0.95 --top-k 20` along with the other arguments you need (a full invocation is sketched after this comment).
Qwen3 30B: https://huggingface.co/unsloth/Qwen3-30B-A3B-128K-GGUF (since you have 16GB, grab Q3_K_XL, since it fits in vram and leaves about 3-4GB left for the other apps on your desktop and other allocations llama.cpp needs to make).
Also, why 30B and not the full fat 235B? You don't have 120-240GB of VRAM. The 14B and less ones are also not what you want: more parameters are better, parameter precision is vastly less important (which is why Unsloth has their specially crafted <=2bit versions that are 85%+ as good, yet are ridiculously tiny in comparison to their originals).
Full Qwen3 writeup here: https://unsloth.ai/blog/qwen3
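Putting the pieces together, the full invocation sketched above would look roughly like this (the GGUF filename and thread count are illustrative; substitute whatever Q3_K_XL file you actually downloaded and your CPU's thread count):

    llama-server \
      -m Qwen3-30B-A3B-128K-UD-Q3_K_XL.gguf \
      --threads 16 --ctx-size 16384 --n-gpu-layers 99 \
      -ot ".ffn_.*_exps.=CPU" \
      --seed 3407 --prio 3 \
      --temp 0.6 --min-p 0.0 --top-p 0.95 --top-k 20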
???
Just run a q4 quant of the same model and it will fit no problem.
A back of the envelope estimate of specifically unsloth/Qwen3-30B-A3B-128K-GGUF is 18.6GB for Q4_K_M.
It’s like asking what the best pair of shoes is.
Go on Ollama and look at the most popular models. You can decide for yourself what you value.
And start small, these things are GBs in size so you don’t want to wait an hour for a download only to find out a model runs at 1 token / second.
And the part I like the most is that there is almost no censorship, at least not for the models I tried. For me, having an uncensored model is one of the most compelling reasons for running an LLM locally. Jailbreaks are a PITA, and abliteration and other uncensoring finetunes tend to make models that have already been made dumb by censorship even dumber.
I've found that Qwen3 is generally really good at following instructions and you can also very easily turn on or off the reasoning by adding "/no_think" in the prompt to turn it off.
The reason Qwen3:30B works so well is because it's a MoE. I have tested the 14B model and it's noticeably slower because it's a dense model.
I realize they aren’t going to be as good… but the whole search during reasoning is pretty great to have.
Ollama is the easiest way to get started trying things out IMO: https://ollama.com/
Ollama's default context length is frustratingly short in the era of 100k+ context windows.
My solution so far has been to boot up LM Studio to check if a model will work well on my machine, manually download the model myself through huggingface, run llama.cpp, and hook it up to open-webui. Which is less than ideal, and LM Studio's proprietary code has access to my machine specs.
Nobody uses Ollama as is. It's a model server. In clients you can specify the proper context lengths. This has never been a problem.
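For example, Ollama's native API accepts a per-request num_ctx (model tag and value here are illustrative):

    curl http://localhost:11434/api/chat -d '{
      "model": "qwen3:14b",
      "messages": [{"role": "user", "content": "Summarize this long document: ..."}],
      "options": {"num_ctx": 32768}
    }'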
Qwen_Qwen3-14B-IQ4_XS.gguf https://huggingface.co/bartowski/Qwen_Qwen3-14B-GGUF
Gemma3 is a good conversationalist but tends to hallucinate. Qwen3 is very smart but also very stubborn (not very steerable).
It's slow-ish but still useful, getting 5-10 tokens per second.
For 16GB and speed you could try Qwen3-30B-A3B with some offload to system RAM, or use a dense model, probably a 14B quant.
I'll give Qwen2.5 a try on the Apple Silicon, thanks.
I asked it a question about militias. It thought for a few pages about the answer and whether to tell me, then came back with "I cannot comply".
Nidum is the name of uncensored Gemma, it does a good job most of the time.
Qwen3 family from Alibaba seem to be the best reasoning models that fit on local hardware right now. Reasoning models on local hardware are annoying in contexts where you just want an immediate response, but vastly outperform non-reasoning models on things where you want the model to be less naive/foolish.
Gemma3 from google is really good at intuition-oriented stuff, but with an obnoxious HR Boy Scout personality where you basically have to add "please don't add any disclaimers" to the system prompt for it to function. Like, just tell me how long you think this sprain will take to heal, I already know you are not a medical professional, jfc.
Devstral from Mistral performs the best on my command line utility where I describe the command I want and it executes that for me (e.g. give me a 1-liner to list the dotfiles in this folder and all subfolders that were created in the last month).
Nemo from Mistral, I have heard (but not tested), is really good for routing-type jobs, where you need something to make a simple multiple-choice decision competently with low latency, and it is easy to fine-tune if you want to get that sophisticated.
[0] https://ollama.com/search
It's pretty magical - it often feels like I'm talking to GPT-4o or o1, until it makes a silly mistake once in a while. It supports reasoning out of the box, which improves results considerably.
With the settings above, I get 60 tokens per second on an RTX 5090, because it fits entirely in GPU memory. It feels faster than GPT-4o. A 32k context with 2 parallel generations* consumes 28 GB of VRAM (with llama.cpp), so you still have 4 GB left for something else.
* I use 2 parallel generations because there's a few of us sharing the same GPU. If you use only 1 parallel generation, you can increase the context to 64k
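One way to reproduce that kind of setup with llama-server (model filename is illustrative; note that --ctx-size is the total context, which gets split across the parallel slots):

    # Two parallel slots sharing a 65536-token context (32k each),
    # with all layers on the GPU.
    llama-server -m Qwen3-32B-Q4_K_M.gguf \
      --n-gpu-layers 99 \
      --ctx-size 65536 --parallel 2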
Speaking of, would a Ryzen 9 12 core be nice for a 5090 setup?
Or should one really go dual 5090?
SmolVLM is pretty useful. https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct