Ask HN: Are you running local LLMs? What are your key use cases?

5 points by briansun | 8/8/2025, 2:07:05 PM | 1 comment
2025 feels like a breakout year for local models. Open‑weight releases are getting genuinely useful: from Google’s Gemma to recent *gpt‑oss* drops, the gap with frontier commercial models keeps narrowing for many day‑to‑day tasks.

Yet outside of this community, local LLMs still don’t seem mainstream. My hunch: *great UX and durable apps are still thin on the ground.*

If you are using local models, I’d love to learn from your setup and workflows. Please be specific so others can calibrate:

Model(s) & size: exact name/version, and quantization (e.g., Q4_K_M).

Runtime/tooling: e.g., Ollama, LM Studio, etc.

Hardware: CPU/GPU details (VRAM/RAM), OS. If it’s a laptop, edge device, or home server, mention that.

Workflows where local wins: privacy/offline, data security, coding, bulk data extraction, RAG over your files, agents/tools, screen-capture processing—what’s actually sticking for you?

Pain points: quality on complex reasoning, context management, tool reliability, long‑form coherence, energy/thermals, memory, Windows/Mac/Linux quirks.

Favorite app today: the one you actually open daily (and why).

Wishlist: the app you wish existed.

Gotchas/tips: config flags, quant choices, prompt patterns, or evaluation snippets that made a real difference.

If you’re not using local models yet, what’s the blocker—setup friction, quality, missing integrations, battery/thermals, or just “cloud is easier”? Links are welcome, but what helps most is concrete numbers and anecdotes from real use.

A simple reply template (optional):

```
Model(s):
Runtime/tooling:
Hardware:
Use cases that stick:
Pain points:
Favorite app:
Wishlist:
```

Also curious how people think about privacy and security in practice. Thanks!

Comments (1)

incomingpain · 1h ago
Python coding is practically the only use case for local models for me.

Cloud LLMs can run 1-trillion-parameter models and have all of Python knowledge behind transparent RAG over 100 Gbit or faster links. Of course they'll be the bestest on the block.

But the new GPT-OSS coding benchmarks are only barely behind Grok 4 or GPT-5 with high reasoning.

>Model(s) & size: exact name/version, and quantization (e.g., Q4_K_M).

My most reliable setup is Devstral + OpenHands: Unsloth Q6_K_XL, 85,000 context, flash attention, K-cache and V-cache quantized at Q8 (see the sketch below).

Second most reliable: GPT-OSS-20B + opencode. Default MXFP4. I can only load 31,000 context or it fails (still plenty, but hoping this bug gets fixed), and you can't use flash attention or K/V cache quantization or it becomes dumb as rocks. This Harmony stuff is annoying.

Still preliminary (just got it working today), but testing is really good: Qwen3-30b-a3b-thinking-2507 + Roo Code or Qwen Code, 80,000 context, Unsloth Q4_K_XL, flash attention, K-cache and V-cache quantized at Q8.
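
Those cache and attention settings translate fairly directly if you script against llama.cpp instead of a GUI. A minimal sketch with llama-cpp-python, assuming a local GGUF path; the parameter and constant names below follow llama.cpp's conventions, so treat them as assumptions and check your installed version:

```
# Rough mapping of the Devstral settings above onto llama-cpp-python.
# Model path, parameter names, and the Q8_0 constant are assumptions --
# verify against your installed llama-cpp-python before relying on them.
from llama_cpp import Llama, GGML_TYPE_Q8_0

llm = Llama(
    model_path="Devstral-Small-UD-Q6_K_XL.gguf",  # hypothetical local GGUF path
    n_ctx=85000,            # 85,000-token context window
    n_gpu_layers=-1,        # offload every layer that fits into VRAM
    flash_attn=True,        # flash attention
    type_k=GGML_TYPE_Q8_0,  # K-cache quantized at Q8
    type_v=GGML_TYPE_Q8_0,  # V-cache quantized at Q8
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that dedupes a list while keeping order."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```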

>Runtime/tooling: e.g., Ollama, LM studio, etc.

LM Studio. I need Vulkan for my setup; ROCm is just a pain in the ass, and they need to support way more Linux distros.

24 GB VRAM.
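
If you'd rather drive the same local model from a script than a chat UI, LM Studio can run an OpenAI-compatible server (default port 1234), so the standard openai Python client works unchanged. A minimal sketch; the model identifier below is a placeholder, use whatever id your local server lists:

```
# Minimal sketch: call a local model through LM Studio's OpenAI-compatible
# server. localhost:1234 is LM Studio's default port; the model name is a
# placeholder -- match it to whatever your local server reports.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # key is ignored locally

resp = client.chat.completions.create(
    model="devstral-small",  # placeholder id
    messages=[
        {"role": "system", "content": "You are a concise Python coding assistant."},
        {"role": "user", "content": "Turn this nested loop into a list comprehension: ..."},
    ],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```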