Ask HN: What is the best LLM for consumer grade hardware?
209 VladVladikoff 165 5/30/2025, 11:02:19 AM
I have a 5060ti with 16GB VRAM. I’m looking for a model that can hold basic conversations, no physics or advanced math required. Ideally something that can run reasonably fast, near real time.
In general there's no "best" LLM model, all of them will have some strengths and weaknesses. There are a bunch of good picks; for example:
> DeepSeek-R1-0528-Qwen3-8B - https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
Released today; probably the best reasoning model in 8B size.
> Qwen3 - https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2...
Recently released. Hybrid thinking/non-thinking models with really great performance and plethora of sizes for every hardware. The Qwen3-30B-A3B can even run on CPU with acceptable speeds. Even the tiny 0.6B one is somewhat coherent, which is crazy.
Normally with llama.cpp you specify how many (full) layers you want to put on the GPU (-ngl). But offloading to the CPU the specific tensors that don't require heavy computation saves GPU space without affecting speed that much [1]; see the sketch after the links below.
I've also read a paper on loading only "hot" neurons onto the GPU [2]. The future of home AI looks so cool!
[1] https://www.reddit.com/r/LocalLLaMA/comments/1ki7tg7/dont_of...
[2] https://arxiv.org/abs/2312.12456
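A minimal sketch of what that looks like on the command line, assuming a MoE GGUF whose expert FFN tensors you want to keep in system RAM (model path and context size are illustrative):

    # Offload all layers to the GPU, but override the placement of the
    # MoE expert tensors so they stay in CPU RAM.
    llama-server \
      -m ./Qwen3-30B-A3B-Q4_K_M.gguf \
      --n-gpu-layers 99 \
      -ot ".ffn_.*_exps.=CPU" \
      --ctx-size 16384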
For folks new to reddit, it's worth noting that LocalLlama, just like the rest of the internet but especially reddit, is filled with misinformed people spreading incorrect "facts" as truth, and you really can't use the upvote/downvote count as an indicator of quality or how truthful something is there.
Something that is more accurate but put in a boring way will often be downvoted, while straight up incorrect but funny/emotional/"fitting the group think" comments usually get upvoted.
For those of us who've spent a lot of time on the web, this sort of bullshit detector is basically built in at this point, but if you're new to places where the groupthink is as heavy as it is on reddit, it's worth being careful about taking anything at face value.
LocalLlama is good for:
- Learning basic terms and concepts.
- Learning how to run local inference.
- Inference-level considerations (e.g., sampling).
- Pointers to where to get other information.
- Getting the vibe of where things are.
- Healthy skepticism about benchmarks.
- Some new research; there have been a number of significant discoveries that either originated in LocalLlama or got popularized there.
LocalLlama is bad because:
- Confusing information about finetuning; there's a lot of myths from early experiments that get repeated uncritically.
- Lots of newbie questions get repeated.
- Endless complaints that it's been too long since a new model was released.
- Most new research; sometimes a paper gets posted but most of the audience doesn't have enough background to evaluate the implications of things even if they're highly relevant. I've seen a lot of cutting edge stuff get overlooked because there weren't enough upvoters who understood what they were looking at.
Is there a good place for this? Currently I just regularly sift through all of the garbage myself on arxiv to find the good stuff, but it is somewhat of a pain to do.
It also helps that the target audience has been filtered with that moderation, so over time this site (on average) skews more technical and informed.
One way to gauge this property of a community is whether people who are known experts in a respective field participate in it, and unfortunately there are very few of them on HackerNews (this was not always the case). I've had some opportunities to meet with people who are experts, usually at conferences/industry events, and while many of them tend to be active on Twitter... they all say the same things about this site, namely that it's simply full of bad information and the amount of effort needed to dispel that information is significantly higher than the amount of effort needed to spread it.
Next time someone posts an article about a topic you are intimately familiar with, like top-1% subject-matter-expert familiar, review the comment section and you'll find heaps of misconceptions, superficial knowledge, and, my favorite, the contrarians who hold very strong opinions about a subject they have only passing knowledge of, yet state them with an extremely high degree of confidence.
One issue is you may not actually be a subject matter expert on a topic that comes up a lot on HackerNews, so you won't recognize that this happens... but while people here are a lot more polite and the moderation policies do encourage good behavior... moderation policies don't do a lot to stop the spread of bad information from poorly informed people.
I consider myself an expert in one tiny niche field (computer-generated code), and when that field has come up (on HN and elsewhere) over the last 30 years, the general mood (from people who don't do it) has been that it produces poor-quality code.
Pre-AI this was demonstrably untrue, but meh, I don't need to convince you, so I accept your point of view and continue doing my thing. Our company revenue is important to me, not the opinion of some guy on the internet.
(AI has freshened the conversation, and it is currently giving mixed results, which is to be expected since it is non-deterministic. But I've been doing deterministic generation for 35 years.)
So yeah. Lots of comments from people who don't do something, and I'm really not interested in taking the time to "prove" them wrong.
But equally I think the general level of discussion in areas where I'm not an expert (but experienced) is high. And around a lot of topics experience can be highly different.
For example, companies, employees, and employers come in all sorts of flavors. Some folks have been burned and see (all) management in a certain light, whereas of course some are good and some are bad.
Yes, most people still use voting as a measure of "I agree with this", rather than the quality of the discussion, but that's just people, and I'm not gonna die on that hill.
And yeah, I'm not above joining in on a topic I don't technically use or know much about. I'll happily say that the main use for crypto (as a currency) is for illegal activity. Or that crypto in general is a ponzi scheme. Maybe I'm wrong, maybe it really is the future. But for now, it walks like a duck.
So I both agree, and disagree, with you. But I'm still happy to hang out here and get into (hopefully) illuminating discussions.
Perhaps we are defining experts differently?
The other aspect is that people on here think that if they are an expert in one thing, they instantly become an expert in another thing.
There's also no actual constructive discussion when it comes to future-looking tech. The Cybertruck, Vision Pro, and LLMs are some of the most recent items that the most popular comments called completely wrong, and the reasoning behind those predictions had no actual substance.
E.g. just look at thunderf00t's videos from 2012 onwards.
Yet people were literally banned here just for pointing out that he hasn't actually delivered on anything in the capacity he promised until he did the salute.
It's pointless to list other examples, as this page is, as dingnuts pointed out, exactly the same, and most people aren't actually willing to change their opinion based on arguments. They're set in their opinions and think everyone else is dumb.
I'd be shocked if they (you?) were banned just for critiquing Musk. So please link the post. I'm prepared to be shocked.
I'm also pretty sure that I could make a throwaway account that only posted critiques of Musk (or about any single subject for that matter) and manage to keep it alive by making the critiques timely, on-topic and thoughtful or get it banned by being repetitive and unconstructive. So would you say I was banned for talking about <topic>? Or would you say I was banned for my behavior while talking about <topic>?
Today he’s pitching moonshot projects as core to Tesla.
10 years ago he was saying self-driving was easy, but he was also selling by far the best electric vehicle on the market. So lying about self driving and Tesla semis mattered less.
Fwiw I've been subbed to tf00t since his 50-part creationist videos in the early 2010s.
> They're set in their opinions and think everyone else is dumb.
Well, anyway, I read and post comments here because commenters here think critically about discussion topics. It’s not a perfect community with perfect moderation but the discussions are of a quality that’s hard to find elsewhere, let alone reddit.
Scroll to the bottom of comment sections on HN, you’ll find the kind of low-effort drive-by comments that are usually at the top of Reddit comment sections.
In other words, it helps to have real moderators.
HN has an active grifter culture reinforced by the VC funding cycles. Reddit can only dream about lying as well as HN does.
HN tends to push up grifter hype slop, and there are a lot of those people around cause VC, but you can still see comments pushing back.
Reading reddit reminds me of high school forum arguments I had 20 years ago, but lower quality because of population selection. It's just too mainstream at this point and shows you what the middle of the bell curve looks like.
Personally I also think the submissions that make it to the front page(s) are much better than any subreddit.
> Friend, this website is EXACTLY the same
And it knows it: https://news.ycombinator.com/item?id=4881042
For example, I find all the comments about model X being more "friendly" or "chatty" and model Y being more "unhinged" or whatever to be mostly BS. There are a gazillion ways a conversation can go, and I don't find model X or Y to be consistently chatty or unhinged or creative or whatever every time.
Like, we’re fucking two years in and only now do we have a thread about something like this? The whole crowd here needs to speed up to catch up.
There are others wondering if this is another hype juggernaut like CORBA, J2EE, WSDL, XML, no-SQL, or who-knows-what. A way to do things that some people treated as the new One True Way, but others could completely bypass for their entire, successful career and look at it now in hindsight with a chuckle.
I think LLMs will find their uses, it just takes time to distil what they are really useful for vs what the AI companies are generating hype for.
For example, I think they can be used to create better auto-complete by giving them the context information (matching functions, etc.) and letting them generate the completion text from that.
No. I was not trolling. If you explain why you think I'm trolling I could provide a better response to your generic reply.
Are you trolling?
I wouldn't count Qwen as that much of a conversationalist though. Mistral Nemo and Small are pretty decent. All of Llama 3.X are still very good models even by today's standards. Gemma 3s are great but a bit unhinged. And of course QwQ when you need GPT4 at home. And probably lots of others I'm forgetting.
Sometimes it’s hard to find models that can effectively use tools
https://leaderboard.techfren.net/
Thanks for the recommendation, will try it out
Thank you for thinking of the vibe coders.
> all of them will have some strengths and weaknesses
Sometimes a higher-parameter model with less quantization and a small context will be the best, sometimes a lower-parameter model with some quantization and a huge context will be the best, and sometimes a high parameter count + lots of quantization + a medium context will be the best.
It's really hard to say one model is better than another in a general way, since it depends on so many things like your use case, the prompts, the settings, quantization, quantization method and so on.
If you're building/trying to build stuff depending on LLMs in any capacity, the first step is coming up with your own custom benchmark/evaluation that you can run with your specific use cases being put under test. Don't share this publicly (so it doesn't end up in the training data) and run it in order to figure out what model is best for that specific problem.
Anything below 4-bits is usually not worth it unless you want to experiment with running a 70B+ model -- though I don't have any experience of doing that, so I don't know how well the increased parameter size balances the quantization.
See https://github.com/ggml-org/llama.cpp/pull/1684 and https://gist.github.com/Artefact2/b5f810600771265fc1e3944228... for comparisons between quantization levels.
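If you want to produce and compare quantization levels yourself, llama.cpp ships the tools for it; a rough sketch (binary names assume a recent llama.cpp build, file names are illustrative):

    # Make two quants from an F16 GGUF, then measure perplexity on your own
    # text to judge how much quality each level loses.
    ./llama-quantize model-F16.gguf model-Q4_K_M.gguf Q4_K_M
    ./llama-quantize model-F16.gguf model-Q5_K_M.gguf Q5_K_M
    ./llama-perplexity -m model-Q4_K_M.gguf -f your-eval-text.txt
    ./llama-perplexity -m model-Q5_K_M.gguf -f your-eval-text.txt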
Note that that's a skill issue of whoever quantized the model. In general, quantization even as low as 3-bit can be almost lossless when you do quantization-aware finetuning[1] (and apparently you don't even need that many training tokens), but even if you don't want to do any extra training you can be smart about which parts of the model you quantize and by how much, to minimize the damage (e.g. in the worst case, over-quantizing even a single weight can have disastrous consequences[2]).
Some time ago I ran an experiment where I finetuned a small model while quantizing parts of it to 2-bits to see which parts are most sensitive (the numbers are the final loss; lower is better):
So as you can see, quantizing some parts of the model affects it more strongly than others. The downprojection in the MLP layers is the most sensitive part of the model (which also matches what [2] found), so it makes sense to quantize this part less and instead quantize other parts more strongly. But if you just do the naive "quantize everything to 4-bit" then sure, you might get broken models.

[1] https://arxiv.org/pdf/2502.02631

[2] https://arxiv.org/pdf/2411.07191
And it's not a skill issue... it's the default behaviour/logic when using k-quants to quantize a model with llama.cpp.
I’m a bit surprised that 8GB is useful as a context window if that is the case—it just seems like you could fit a ton of research papers, emails, and textbooks in 2GB, for example.
But, I’m commenting from a place of ignorance and curiosity. Do models blow up the info in the context window, maybe do some processing to pre-digest it?
You absolutely cannot fit even a single research paper in 2 GB, much less an entire book.
Actually DeepSeek-R1-0528-Qwen3-8B was uploaded Thursday (yesterday) at 11 AM UTC / 7 PM CST. I had to check if a new version came out since! I am waiting for the other sizes! ;D
https://www.adweek.com/media/a-federal-judge-ordered-openai-...
Can it be solved locally with locally running MCPs? Or maybe it's a system API - like reading your calendar or checking your email. Otherwise it identifies the best cloud model and sends the prompt there.
Basically Siri if it was good
That idea makes so much sense on paper, but it's not until you start implementing it that you realize why no one does it (including Siri). "Some tasks are complex and better suited to a complex giant model, but small models are perfectly capable of handling simple, limited tasks" makes a ton of sense, but the component best equipped to make that call is the smarter component of your system. At which point, you might as well have had it run the task.
It's like assigning the intern to triage your work items.
When actually implementing the application with that approach, every time you encounter an "AI miss" you (understandably) blame the small model, and eventually you give up and delegate yet another scenario to the cloud model.
Eventually you feel like you're artificially handcuffing yourself compared to literally everybody else by trying to ship something built on a 1B model. You end up with the worst of all options: a crappy model with lots of hiccups that is still (by far) the most resource-intensive part of your application, making the whole thing super heavy, while you delegate more and more to the cloud model anyway.
The local-LLM scenario is going to be driven entirely by privacy concerns (for which there is no alternative; it's not like an E2EE LLM API could exist) or by cost concerns, if you believe you can run it cheaper.
Of course, it still isn't at the same level as Codex itself, the model Codex is using is just way better so of course it'll get better results. But Devstral (as I currently use it) is able to make smaller changes and refactors, and I think if I evolve the software a bit more, can start making larger changes too.
And why not just use OpenHands, which it was designed around, and which I presume can also do all those things?
It's an AI-driven chat system designed to support students in the Introduction to Computing course (ECE 120) at UIUC, offering assistance with course content, homework, or troubleshooting common problems.
It serves as an educational aid integrated into the course’s learning environment using UIUC Illinois Chat system [2].
Personally I've found it really useful that it provides the relevant portions of the course study materials (for example, the slides) directly related to the discussion, so students can check the sources and verify the answers provided by the LLM.
It seems to me that RAG is the killer feature for local LLMs [3]. It directly addresses the main pain point of LLM hallucinations and helps LLMs stick to the facts.
[1] Introduction to Computing course (ECE 120) Chatbot:
https://www.uiuc.chat/ece120/chat
[2] UIUC Illinois Chat:
https://uiuc.chat/
[3] Retrieval-augmented generation [RAG]:
https://en.wikipedia.org/wiki/Retrieval-augmented_generation
Running it locally helps me understand how these things work under the hood, which raises my value on the job market. I also play with various ideas which have LLM on the backend (think LLM-powered Web search, agents, things of that nature), I don't have to pay cloud providers, and I already had a gaming rig when LLaMa was released.
- Experiments with inference-level control; can't do the Outlines / Instructor stuff with most API services, can't do the advanced sampling strategies, etc. (They're catching up but they're 12 months behind what you can do locally.)
- Small, fast, finetuned models; _if you know what your domain is sufficiently to train a model you can outperform everything else_. General models usually win, if only due to ease of prompt engineering, but not always.
- Control over which model is being run. Some drift is inevitable as your real-world data changes, but when your model is also changing underneath you it can be harder to build something sustainable.
- More control over costs; this is the classic on-prem versus cloud decision. Most cases you just want to pay for the cloud...but we're not in ZIRP anymore and having a predictable power bill can trump sudden unpredictable API bills.
In general, the move to cloud services was originally a cynical OpenAI move to keep GPT-3 locked away. They've built up a bunch of reasons to prefer the in-cloud models (heavily subsidized fast inference, the biggest and most cutting edge models, etc.) so if you need the latest and greatest right now and are willing to pay, it's probably the right business move for most businesses.
This is likely to change as we get models that can reasonably run on edge devices; right now it's hard to build an app or a video game that incidentally uses LLM tech because user revenue is unlikely to exceed inference costs without a lot of careful planning or a subscription. Not impossible, but definitely adds business challenges. Small models running on end-user devices opens up an entirely new level of applications in terms of cost-effectiveness.
If you need the right answer, sometimes only the biggest cloud API model is acceptable. If you've got some wiggle room on accuracy and can live with sometimes getting a substandard response, then you've got a lot more options. The trick is that the things an LLM is best at are always going to be things where less than five nines of reliability is acceptable, so even though the biggest models are more reliable, on average there are many tasks where you might be just fine with a small, fast model that you have more control over.
Mostly I use it for testing tools and integrations via API, so as not to spend money on subscriptions. When I see something working, I switch to a proprietary model to get the best results.
The stuff you can run on reasonable home hardware (e.g. a single GPU) isn't going to blow your mind. You can get pretty close to GPT3.5, but it'll feel dated and clunky compared to what you're used to.
Unless you have already spent big $$ on a GPU for gaming, I really don't think buying GPUs for home makes sense, considering the hardware and running costs, when you can go to a site like vast.ai and borrow one for an insanely cheap amount to try it out. You'll probably get bored and be glad you didn't spend your kids' college fund on a rack of H100s.
The average person in r/locallama has a machine that would make r/pcmasterrace users blush.
A brand new Mac Mini M4 is only $499.
https://www.apple.com/shop/buy-mac/mac-mini/m4
https://www.microcenter.com/product/688173/Mac_mini_MU9D3LL-...
The only thing itching me to get a new machine is that it needs a 19V power supply. Luckily it's a pretty common barrel size, and I already had several power cables lying around that work just fine. I'd prefer to have all my portable devices run off USB-C though.
https://www.microcenter.com/product/676305/acer-aspire-3-a31...
> (MoE) divides an AI model into separate sub-networks (or "experts"), each specializing in a subset of the input data, to jointly perform a task.
What you typically end up with in memory constrained environments is that the core shared layers are in fast memory (VRAM, ideally) and the rest are in slower memory (system RAM or even a fast SSD).
MoE models are typically shallow-but-wide in comparison with dense models of similar total size, and only a small fraction of the parameters (the selected experts) are active for each token, so they end up being much faster than an equivalent dense model.
There's no one "best" model, you just try a few and play with parameters and see which one fits your needs the best.
Since you're on HN, I'd recommend skipping Ollama and LMStudio. They might restrict access to the latest models and you typically only choose from the ones they tested with. And besides what kind of fun is this when you don't get to peek under the hood?
llama.cpp can do a lot itself, and you can run most recently released models (when changes are needed, they adjust literally within a few days). You can get models from Hugging Face, obviously. I prefer the GGUF format; it saves me some memory (you can use lower quantization; I find most 6-bit quants somewhat satisfactory).
I find that the size of the model's GGUF file roughly tells me whether it'll fit in my VRAM. For example, a 24GB GGUF model will NOT fit in 16GB, whereas a 12GB one likely will. However, the more context you add, the more memory will be needed.
Keep in mind that models are trained with a certain context window. If a model has an 8K-token context (like most older models do) and you load it with a 32K context, it won't be much help.
You can run llama.cpp on Linux, Windows, or macOS; you can get the binaries or compile locally. It can split the model between VRAM and RAM (if the model doesn't fit in your 16GB). It even has a simple React front-end (llama-server). The same binary provides a REST service with a protocol similar to (but simpler than) OpenAI's and all the other "big" guys'.
Since it implements the OpenAI REST API, it also works with a lot of front-end tools if you want more functionality (e.g. oobabooga, a.k.a. text-generation-webui).
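To illustrate, a minimal request against a locally running llama-server looks roughly like this (default port 8080; prompt and sampling values are just examples):

    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "temperature": 0.7
      }'

The server answers with the usual OpenAI-style JSON, so most existing client libraries work by just pointing their base URL at it.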
Koboldcpp is another backend you can try if you find llama.cpp too raw (I believe it's still llama.cpp under the hood).
`ollama run hf.co/unsloth/DeepSeek-R1-0528-GGUF:Q8_0`
I disagree. With Ollama I can set up my desktop as an LLM server, interact with it over WiFi from any other device, and let Ollama switch seamlessly between models as I want to swap. Unless something has changed recently, with llama.cpp's CLI you still have to shut it down and restart it with a different command line flag in order to switch models even when run in server mode.
That kind of overhead gets in the way of experimentation and can also limit applications: there are some little apps I've built that rely on being able to quickly swap between a 1B and an 8B or 30B model by just changing the model parameter in the web request.
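For what it's worth, the swap is literally just the "model" field in the request; a sketch against Ollama's OpenAI-compatible endpoint (model tags are examples and assume you've already pulled them):

    # Small model for the cheap/fast path...
    curl http://localhost:11434/v1/chat/completions \
      -d '{"model": "llama3.2:1b", "messages": [{"role": "user", "content": "Is this text spam? Answer yes or no."}]}'

    # ...bigger model for the heavier request; Ollama loads/unloads as needed.
    curl http://localhost:11434/v1/chat/completions \
      -d '{"model": "qwen3:30b", "messages": [{"role": "user", "content": "Walk through the tradeoffs in detail."}]}'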
When you get Ollama to "switch seamlessly" between models, it is still simply reloading a different model with llama.cpp, which is what it's based on.
I prefer llamacpp because doing things "seamlessly" obscures the way things work behind the scenes, which is what I want to learn and play with.
Also, and I'm not sure if it's the case anymore but it used to be, when llamacpp gets adjusted to work with the latest model, sometimes it takes them a bit to update the Python API which is what Ollama is using. It was the case with one of the LlaMas, forget which one, where people said "oh yeah don't try this model with Ollama, they're waiting on llamacpp folks to update llama-cpp-python to bring the latest changes from llamacpp, and once they do, Ollama will bring the latest into their app and we'll be up and running. Be patient."
[1] https://www.localscore.ai/
You can even keep track of the quality of the answers over time to help guide your choice.
https://openwebui.com/
Once I figured out my local ROCm setup Ollama was able to run with GPU acceleration no problem. Connecting an OpenWebUI docker instance to my local Ollama server is as easy as a docker run command where you specify the OLLAMA_BASE_URL env var value. This isn't a production setup, but it works nicely for local usages like what the immediate parent is describing.
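The docker run in question looks roughly like this (based on the Open WebUI docs; host.docker.internal assumes Docker Desktop, so adjust the URL if Ollama runs elsewhere):

    docker run -d -p 3000:8080 \
      -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
      -v open-webui:/app/backend/data \
      --name open-webui \
      ghcr.io/open-webui/open-webui:main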
https://docs.openwebui.com/license/
However it's heavily censored on political topics because of its Chinese origin. For world knowledge, I'd recommend Gemma3.
This post will be outdated in a month. Check https://livebench.ai and https://aider.chat/docs/leaderboards/ for up to date benchmarks
The pace of change is mind boggling. Not only for the models but even the tools to put them to use. Routers, tools, MCP, streaming libraries, SDKs...
Do you have any advice for someone who is interested, developing alone and not surrounded by coworkers or meetups who wants to be able to do discovery and stay up to date?
It holds its value, so you won't lose much, if anything, when you resell it.
But otherwise, as said, install Ollama and/or Llama.cpp and run the model using the --verbose flag.
This will print out the tokens-per-second result after each prompt is returned (example below).
Then find the best model that gives you a token per second speed you are happy with.
And as also said, 'abliterated' models are less censored versions of normal ones.
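The --verbose run mentioned above looks like this (model tag is just an example):

    # Prints timing stats, including eval rate in tokens/second,
    # after each response.
    ollama run llama3.1:8b --verbose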
Going below Q4 isn't worth it IMO. If you want significantly more context, probably drop down to a Q4 quant of Qwen3-8B rather than continuing to lobotomize the 14B.
Some folks have been recommending Qwen3-30B-A3B, but I think 16GB of VRAM is probably not quite enough for that: at Q4 you'd be looking at 15GB for the weights alone. Qwen3-14B should be pretty similar in practice though despite being lower in param count, since it's a dense model rather than a sparse one: dense models are generally smarter-per-param than sparse models, but somewhat slower. Your 5060 should be plenty fast enough for the 14B as long as you keep everything on-GPU and stay away from CPU offloading.
Since you're on a Blackwell-generation Nvidia chip, using LLMs quantized to NVFP4 specifically will provide some speed improvements at some quality cost compared to FP8 (and will be faster than Q4 GGUF, although ~equally dumb). Ollama doesn't support NVFP4 yet, so you'd need to use vLLM (which isn't too hard, and will give better token throughput anyway). Finding pre-quantized models at NVFP4 will be more difficult since there's less-broad support, but you can use llmcompressor [1] to statically compress any FP16 LLM to NVFP4 locally — you'll probably need to use accelerate to offload params to CPU during the one-time compression process, which llmcompressor has documentation for.
I wouldn't reach for this particular power tool until you've decided on an LLM already, and just want faster perf, since it's a bit more involved than just using ollama and the initial quantization process will be slow due to CPU offload during compression (albeit it's only a one-time cost). But if you land on a Q4 model, it's not a bad choice once you have a favorite.
1: https://github.com/vllm-project/llm-compressor
I was trying Patricide unslop mell and some of the Qwen ones recently. Up to a point more params is better than worrying about quantization. But eventually you'll hit a compute wall with high params.
KV cache quantization is awesome (I use q4 for a 32k context with a 1080ti!) and context shifting is also awesome for long conversations/stories/games. I was using ooba but found recently that KoboldCPP not only runs faster for the same model/settings but also Kobold's context shifting works much more consistently than Ooba's "streaming_llm" option, which almost always re-evaluates the prompt when hooked up to something like ST.
Rough memory needed for the weights of an 8B-parameter model:
- FP16: 2 bytes x 8B = 16GB
- Q8: 1 byte x 8B = 8GB
- Q4: 0.5 bytes x 8B = 4GB
It doesn't map 100% neatly like this, but it gives you a rough measure. On top of this you need some more memory depending on the context length and some other stuff.
Rationale for the calculation above: a model is basically billions of variables, each holding a floating-point value, so the size of a model roughly maps to the number of variables (weights) x the precision of each variable (4, 8, 16 bits...).
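A back-of-the-envelope version of that calculation, ignoring context/KV-cache overhead:

    # Approximate weight memory in GB = params (in billions) * bits per weight / 8
    approx_gb() { echo "$1 * $2 / 8" | bc -l; }
    approx_gb 8 16   # 8B model at FP16 -> ~16 GB
    approx_gb 8 8    # 8B model at Q8   -> ~8 GB
    approx_gb 8 4    # 8B model at Q4   -> ~4 GB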
You don't have to quantize all layers to the same precision; this is why you sometimes see fractional quantizations like 1.58 bits.
For that level you can pack 4 weights into a byte using 2 bits per weight. However, there is one bit configuration per weight that goes unused.
More complex packing arrangements group weights together (e.g. in groups of 3) and assign a bit configuration to each combination of values via a lookup table. This allows greater compression, closer to the 1.58-bit value.
I'd like to know how many tokens you can get out of the larger models especially (using Ollama + Open WebUI on Docker Desktop, or LM Studio whatever). I'm probably not upgrading GPU this year, but I'd appreciate an anecdotal benchmark.
That said, Unsloth's version of Qwen3 30B, running via llama.cpp (don't waste your time with any other inference engine), with the following arguments (documented in Unsloth's docs, but sometimes hard to find): `--threads (number of threads your CPU has) --ctx-size 16384 --n-gpu-layers 99 -ot ".ffn_.*_exps.=CPU" --seed 3407 --prio 3 --temp 0.6 --min-p 0.0 --top-p 0.95 --top-k 20` along with the other arguments you need (a full invocation is sketched after this comment).
Qwen3 30B: https://huggingface.co/unsloth/Qwen3-30B-A3B-128K-GGUF (since you have 16GB, grab Q3_K_XL, since it fits in vram and leaves about 3-4GB left for the other apps on your desktop and other allocations llama.cpp needs to make).
Also, why 30B and not the full fat 235B? You don't have 120-240GB of VRAM. The 14B and less ones are also not what you want: more parameters are better, parameter precision is vastly less important (which is why Unsloth has their specially crafted <=2bit versions that are 85%+ as good, yet are ridiculously tiny in comparison to their originals).
Full Qwen3 writeup here: https://unsloth.ai/blog/qwen3
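Putting the pieces together, the full invocation sketched above would look roughly like this (the GGUF filename and thread count are illustrative; substitute whatever Q3_K_XL file you actually downloaded and your CPU's thread count):

    llama-server \
      -m Qwen3-30B-A3B-128K-UD-Q3_K_XL.gguf \
      --threads 16 --ctx-size 16384 --n-gpu-layers 99 \
      -ot ".ffn_.*_exps.=CPU" \
      --seed 3407 --prio 3 \
      --temp 0.6 --min-p 0.0 --top-p 0.95 --top-k 20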
???
Just run a q4 quant of the same model and it will fit no problem.
A back of the envelope estimate of specifically unsloth/Qwen3-30B-A3B-128K-GGUF is 18.6GB for Q4_K_M.
It’s like asking what the best pair of shoes is.
Go on Ollama and look at the most popular models. You can decide for yourself what you value.
And start small, these things are GBs in size so you don’t want to wait an hour for a download only to find out a model runs at 1 token / second.
And the part I like the most is that there is almost no censorship, at least not for the models I tried. For me, having an uncensored model is one of the most compelling reasons for running an LLM locally. Jailbreaks are a PITA, and abliteration and other uncensoring finetunes tend to make models that have already been made dumb by censorship even dumber.
I've found that Qwen3 is generally really good at following instructions and you can also very easily turn on or off the reasoning by adding "/no_think" in the prompt to turn it off.
The reason Qwen3:30B works so well is because it's a MoE. I have tested the 14B model and it's noticeably slower because it's a dense model.
I realize they aren’t going to be as good… but the whole search during reasoning is pretty great to have.
Ollama is the easiest way to get started trying things out IMO: https://ollama.com/
Ollama's default context length is frustratingly short in the era of 100k+ context windows.
My solution so far has been to boot up LM Studio to check if a model will work well on my machine, manually download the model myself through huggingface, run llama.cpp, and hook it up to open-webui. Which is less than ideal, and LM Studio's proprietary code has access to my machine specs.
Nobody uses Ollama as is. It's a model server. In clients you can specify the proper context lengths. This has never been a problem.
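For example, Ollama's native API accepts a per-request num_ctx (model tag and value here are illustrative):

    curl http://localhost:11434/api/chat -d '{
      "model": "qwen3:14b",
      "messages": [{"role": "user", "content": "Summarize this long document: ..."}],
      "options": {"num_ctx": 32768}
    }'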
Qwen_Qwen3-14B-IQ4_XS.gguf https://huggingface.co/bartowski/Qwen_Qwen3-14B-GGUF
Gemma3 is a good conversationalist but tends to hallucinate. Qwen3 is very smart but also very stubborn (not very steerable).
It's slow-ish but still useful, getting 5-10 tokens per second.
For 16GB and speed you could try Qwen3-30B-A3B with some offload to system RAM, or use a dense model, probably a 14B quant.
I'll give Qwen2.5 a try on the Apple Silicon, thanks.
I asked it a question about militias. It thought for a few pages about the answer and whether to tell me, then came back with "I cannot comply".
Nidum is the name of uncensored Gemma, it does a good job most of the time.
Qwen3 family from Alibaba seem to be the best reasoning models that fit on local hardware right now. Reasoning models on local hardware are annoying in contexts where you just want an immediate response, but vastly outperform non-reasoning models on things where you want the model to be less naive/foolish.
Gemma3 from google is really good at intuition-oriented stuff, but with an obnoxious HR Boy Scout personality where you basically have to add "please don't add any disclaimers" to the system prompt for it to function. Like, just tell me how long you think this sprain will take to heal, I already know you are not a medical professional, jfc.
Devstral from Mistral performs the best on my command line utility where I describe the command I want and it executes that for me (e.g. give me a 1-liner to list the dotfiles in this folder and all subfolders that were created in the last month).
Nemo from Mistral, I have heard (but not tested), is really good for routing-type jobs, where you need something to make a simple multiple-choice decision competently with low latency, and it is easy to fine-tune if you want to get that sophisticated.
[0] https://ollama.com/search
It's pretty magical - it often feels like I'm talking to GPT-4o or o1, until it makes a silly mistake once in a while. It supports reasoning out of the box, which improves results considerably.
With the settings above, I get 60 tokens per second on an RTX 5090, because it fits entirely in GPU memory. It feels faster than GPT-4o. A 32k context with 2 parallel generations* consumes 28 GB of VRAM (with llama.cpp), so you still have 4 GB left for something else.
* I use 2 parallel generations because there's a few of us sharing the same GPU. If you use only 1 parallel generation, you can increase the context to 64k
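One way to reproduce that kind of setup with llama-server (model filename is illustrative; note that --ctx-size is the total context, which gets split across the parallel slots):

    # Two parallel slots sharing a 65536-token context (32k each),
    # with all layers on the GPU.
    llama-server -m Qwen3-32B-Q4_K_M.gguf \
      --n-gpu-layers 99 \
      --ctx-size 65536 --parallel 2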
Speaking of, would a Ryzen 9 12 core be nice for a 5090 setup?
Or should one really go dual 5090?
SmolVLM is pretty useful. https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct