I agree that it's kind of magical that you can download a ~10GB file and suddenly your laptop is running something that can summarize text, answer questions and even reason a bit.
The trick is balancing model size vs RAM: 12B–20B is about the upper limit for a 16GB machine without it choking.
What I find interesting is that these models don't actually hit Apple's Neural Engine; they run on the GPU via Metal. Core ML isn't great for custom runtimes, and Apple hasn't given low-level developer access to the ANE afaik. And then there are memory bandwidth and dedicated SRAM issues. Hopefully Apple optimizes Core ML to map transformer workloads to the ANE.
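For what it's worth, here's a minimal sketch of that GPU-via-Metal path, assuming llama-cpp-python built with Metal support and a GGUF file you've already downloaded (the path and model name are placeholders, not anything specific):

```python
# Minimal sketch: running a quantized GGUF model on Apple Silicon with
# llama-cpp-python, which uses Metal (GPU) rather than the ANE.
from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical local path
    n_gpu_layers=-1,   # offload all layers to the GPU via Metal
    n_ctx=4096,        # context window; larger values cost more RAM
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize: local models run on Metal, not the ANE."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```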
giancarlostoro · 32m ago
I feel like Apple needs a new CEO; I've felt this way for a long time. If I had been in charge of Apple, I would have embraced local LLMs and built an inference engine that optimizes models designed for Nvidia. I also would probably have toyed with the idea of selling server-grade Apple Silicon processors and opening up the GPU spec so people can build against it. Seems like Apple tries to play it too safe. While Tim Cook is good as a COO, he's still running Apple as a COO. They need a man of vision, not a COO at the helm.
GeekyBear · 2h ago
> Hopefully Apple optimizes Core ML to map transformer workloads to the ANE.
If you want to convert models to run on the ANE there are tools provided:
> Convert models from TensorFlow, PyTorch, and other libraries to Core ML.
https://apple.github.io/coremltools/docs-guides/index.html
It is less about conversion and more about extending ANE support for transformer-style models or giving developers more control.
The issue is in targeting specific hardware blocks. When you convert with coremltools, Core ML takes over and doesn't give you fine-grained control over where things run - GPU, CPU, or ANE. Also, the ANE isn't really designed with transformers in mind, so most LLM inference defaults to the GPU.
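To make the "coarse control" point concrete, here's a small sketch of what the coremltools conversion knob looks like, assuming coremltools and torch are installed; the tiny module is just a stand-in for a real traced model, and compute_units is a preference, not a guarantee that ops land on the ANE:

```python
# Sketch of the coarse-grained control coremltools exposes: you can request
# that Core ML prefer certain compute units, but you cannot pin individual
# ops to the ANE.
import torch
import coremltools as ct

class Tiny(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x @ x.T)

example = torch.rand(4, 4)
traced = torch.jit.trace(Tiny().eval(), example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=example.shape)],
    convert_to="mlprogram",
    # The only knob: ALL, CPU_ONLY, CPU_AND_GPU, or CPU_AND_NE.
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
mlmodel.save("tiny.mlpackage")
```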
I too found it interesting that Apple's Neural Engine doesn't work with local LLMs. Seems like Apple, AMD, and Intel are missing the AI boat by not properly supporting their NPUs in llama.cpp. Any thoughts on why this is?
“This technical post details how to optimize and deploy an LLM to Apple silicon, achieving the performance required for real time use cases. In this example we use Llama-3.1-8B-Instruct, a popular mid-size LLM, and we show how using Apple’s Core ML framework and the optimizations described here, this model can be run locally on a Mac with M1 Max with about ~33 tokens/s decoding speed. While this post focuses on a particular Llama model, the principles outlined here apply generally to other transformer-based LLMs of different sizes.”
numpad0 · 2h ago
Perhaps due to size? AI/NN models before LLMs were orders of magnitude smaller, as evident in effectively all LLMs carrying "Large" in their names regardless of relative size differences.
Llama.cpp would have to target every hardware vendor's NPU individually and those NPUs tend to have breaking changes when newer generations of hardware are released.
Even Nvidia GPUs often have breaking changes moving from one generation to the next.
montebicyclelo · 2h ago
I think OP is suggesting that Apple / AMD / Intel do the work of integrating their NPUs into popular libraries like `llama.cpp`. Which might make sense. My impression is that by the time the vendors support a certain model with their NPUs the model is too old and nobody cares anyway. Whereas llama.cpp keeps up with the latest and greatest.
bigyabai · 2h ago
NPUs are almost universally too weak for serious LLM inference. Most of the time you get better performance-per-watt out of GPU compute shaders; the majority of NPUs are dark silicon.
Keep in mind - Nvidia has no NPU hardware because that functionality is baked into their GPU architecture. AMD, Apple and Intel are all in this awkward NPU boat because they wanted to avoid competition with Nvidia and continue shipping simple raster designs.
ai-christianson · 1h ago
I can run GLM 4.5 Air and gpt-oss-120b both very reasonably. GPT OSS has particularly good latency.
I'm on a 128GB M4 macbook. This is "powerful" today, but it will be old news in a few years.
These models are just about getting as good as the frontier models.
ru552 · 1h ago
You're better served using Apple's MLX if you want to run models locally.
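For example, a minimal mlx-lm sketch looks roughly like this; the model name is just an example of a community MLX conversion on Hugging Face, so swap in whatever quantized model fits your RAM:

```python
# Minimal sketch using Apple's MLX stack via the mlx-lm package.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
text = generate(
    model,
    tokenizer,
    prompt="Explain in one paragraph why unified memory helps local LLM inference.",
    max_tokens=200,
)
print(text)
```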
I find it surprising that you can also do that from the browser (e.g. WebLLM). I imagine that in the near future we will run these engines locally for many use cases, instead of via APIs.
punitvthakkar · 3h ago
So far I've not run into the kind of use cases that local LLMs can convincingly provide without making me feel like I'm using the first ever ChatGPT from 2022, in that they are limited and quite limiting. I am curious about what use cases the community has found that work for them. The example that one user has given in this thread about their local LLM inventing a Sun Tzu interview is exactly the kind of limitation I'm talking about. How does one use a local LLM to do something actually useful?
jondwillis · 59m ago
I use, or at least try to use, local models while prototyping/developing apps.
First, they control costs during development, which, depending on what you're doing, can get quite expensive for low- or no-budget projects.
Second, they force me to have more constraints and more carefully compose things. If a local model (albeit something somewhat capable like gpt-oss or qwen3) can start to piece together this agentic workflow I am trying to model, chances are, it'll start working quite well and quite quickly if I switch to even a budget cloud model (something like gpt-5-mini.)
However, dealing with these constraints might not be worth the time if you can stuff all of the documents in your context window for the cloud models and get good results, but it will probably be cheaper and faster on an ongoing basis to have split the task up.
narrator · 48m ago
I have tried a lot of different LLMs and Gemma3:27b on a 48GB+ MacBook is probably the best for analyzing diaries and personal stuff you don't want to share with the cloud. The China models are comically bad with life advice. For example, I asked Deepseek to read my diaries and talk to me about my life goals and it told me in a very Confucian manner what the proper relationships in my life were for my stage of life and station in society. Gemma is much more western.
dxetech · 2h ago
There are situations where internet access is limited, or where there are frequent outages. An outdated LLM might be more useful than none at all.
For example: my internet is out due to a severe storm, what safety precautions do I need to take?
crazygringo · 2h ago
I see local LLMs being used mainly for automation as opposed to factual knowledge -- for classification, summarization, search, and things like grammar checking.
So they need to be smart about your desired language(s) and all the everyday concepts we use in it (so they can understand the content of documents and messages), but they don't need any of the detailed factual knowledge around human history, programming languages and libraries, health, and everything else.
The idea is that you don't prompt the LLM directly, but your OS tools make use of it, and applications prompt it as frequently as they fetch URLs.
dragonwriter · 2h ago
Well, a lot of what is possible with local models depends on what your local hardware is, but docling is a pretty good example of a library that can use local models (VLMs instead of regular LLMs) “under the hood” for productive tasks.
bityard · 58m ago
I use a local LLM for lots of little things that I used to use search engines for. Defining words, looking up unicode symbols for copy/paste, reminders on how to do X in bash or Python. Sometimes I use it as a starting point for high-level questions and curiosity and then move to actual human content or larger online models for more details and/or fact-checking if needed.
If your computer is somewhat modern and has a decent amount of RAM to spare, it can probably run one of the smaller-but-still-useful models just fine, even without a GPU.
My reasons:
1) Search engines are actively incentivized to not show useful results. SEO-optimized clickbait articles contain long fluffy, contentless prose intermixed with ads. The longer they can keep you "searching" for the information instead of "finding" it, the better is for their bottom line. Because if you actually manage to find the information you're looking for, you close the tab and stop looking at ads. If you don't find what you need, you keep scrolling and generate more ad revenue for the advertisers and search engines. It's exactly the same reasons online dating sites are futile for most people: every successful match made results in two lost customers which is bad for revenue.
LLMs (even local ones in some cases) are quite good at giving you direct answers to direct questions which is 90% of my use for search engines to begin with. Yes, sometimes they hallucinate. No, it's not usually a big deal if you apply some common sense.
2) Most datacenter-hosted LLMs don't have ads built into them now, but they will. As soon as we get used to "trusting" hosted models due to how good they have become, the model developers and operators will figure out how to turn the model into a sneaky salesman. You'll ask it for the specs on a certain model of Dell laptop and it will pretend it didn't hear you and reply, "You should try HP's latest lineup of business-class notebooks, they're fast, affordable, and come in 5 fabulous colors to suit your unique personal style!" I want to make sure I'm emphasizing that it's not IF this happens, it's WHEN.
Local LLMs COULD have advertising at some point, but it will probably be rare and/or weird as these smaller models are meant mainly for development and further experimentation. I have faith that some open-weight models will always exist in some form, even if they never rival commercially-hosted models in overall quality.
3) I've made peace with the fact that data privacy in the age of Big Tech is a myth, but that doesn't mean I can't minimize my exposure by keeping some of my random musings and queries to myself. Self-hosted AI models will never be as "good" as the ones hosted in datacenters, but they are still plenty useful.
4) I'm still in the early stages of this, but I can develop my own tools around small local models without paying a hosted model provider and/or becoming their product.
5) I was a huge skeptic about the overall value of AI during all of the initial hype. Then I realized that this stuff isn't some fad that will disappear tomorrow. It will get better. The experience will get more refined. It will get more accurate. It will consume less energy. It will be totally ubiquitous. If you fail to come to speed on some important new technology or trend, you will be left in the dust by those who do. I understand the skepticism and pushback, but the future moves forward regardless.
mentalgear · 45m ago
Something like Rewind or OpenRecall can use local LLMs for on-device semantic search.
luckydata · 3h ago
Local models can do embedding very well, which is useful for things like building a screenshot manager for example.
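As a rough sketch of what that looks like with a local embedding model (assuming the sentence-transformers package and OCR'd screenshot text; the model name is just a common small choice):

```python
# Sketch of local embedding search, e.g. over OCR'd screenshot text.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small model, runs fine on CPU

screenshots = {
    "shot_001.png": "Invoice from ACME Corp, total $412.50, due March 3",
    "shot_002.png": "Terminal output: pytest 4 passed, 1 failed",
    "shot_003.png": "Flight confirmation SFO -> JFK, departs 9:15am",
}

corpus = list(screenshots.values())
corpus_emb = model.encode(corpus, convert_to_tensor=True)

query_emb = model.encode("which screenshot has my flight details?", convert_to_tensor=True)
scores = util.cos_sim(query_emb, corpus_emb)[0]
best = int(scores.argmax())
print(list(screenshots.keys())[best], float(scores[best]))
```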
ivape · 2h ago
I use Claude Code in the terminal only, mostly to figure out what to commit along with what to write for the commit message. I believe a solid 7-8B model can do this locally.
So, that’s at least one small highly useful workflow robot I have a use for (and very easy to cook up on your own).
I also have a use for terminal command autocompletion, which again, a small model can be great for.
Something felt kind of wrong about sending entire folder contents over to Claude online, so I am absolutely looking to create the toolkit locally.
The universe of offline is just getting started, and these big companies are literally telling you “watch out, we save this stuff”.
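A minimal sketch of that commit-message workflow, assuming a small local model behind an OpenAI-compatible endpoint (the base_url is Ollama's default and the model tag is just an example):

```python
# Feed the staged diff to a small local model and ask for a commit message.
import subprocess
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

diff = subprocess.run(
    ["git", "diff", "--staged"], capture_output=True, text=True, check=True
).stdout

resp = client.chat.completions.create(
    model="qwen2.5-coder:7b",  # any ~7-8B local model tag you have pulled
    messages=[
        {"role": "system", "content": "Write a concise, conventional git commit message for this diff."},
        {"role": "user", "content": diff[:12000]},  # truncate very large diffs
    ],
)
print(resp.choices[0].message.content)
```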
bigyabai · 2h ago
Qwen3 A3B (in my experience) writes code as good as ChatGPT 4o and much better than GPT-OSS.
hu3 · 2h ago
I just tested Qwen3 A3B vs ChatGPT with a random prompt from my head:
> Please write a C# middleware to block requests from browser agents that contain any word in a specified list of words: openai, grok, gemini, claude.
I used ChatGPT 4o from GitHub Copilot inside VSCode. And Qwen3 A3B from here: https://deepinfra.com/Qwen/Qwen3-30B-A3B
ChatGPT 4o was considerably better. Less verbose and less unnecessary abstractions.
bigyabai · 2h ago
You want the 2507 update of the model, I think the one you used is ~8-10 months out-of-date.
segmondy · 3h ago
The same way you use a cloud LLM.
oblio · 2h ago
I think the point was that, for programming for example, people perceive state-of-the-art LLMs as net positive contributors, at least for mainstream programming languages and tasks, and I guess local LLMs aren't net positive contributors (i.e. an experienced programmer can build the same thing at least as fast without using one).
daoboy · 3h ago
I'm running Hermes Mistral and the very first thing it did was start hallucinating.
I recently started an audio dream journal and want to keep it private. Set up whisper to transcribe the .wav file and dump it in an Obsidian folder.
The plan was to put a local llm step in to clean up the punctuation and paragraphs.
I entered instructions to clean the transcript without changing or adding anything else.
Hermes responded by inventing an interview with Sun Tzu about why he wrote the Art of War. When I stopped the process it apologized and advised it misunderstood when I talked about Sun Tzu. I never mentioned Sun Tzu or even provided a transcript. Just instructions.
We went around with this for a while before I could even get it to admit the mistake, and it refused to identify why it occurred in the first place.
Having to meticulously check for weird hallucinations will be far more time consuming than just doing the editing myself. This same logic applies to a lot of the areas I'd like to have a local llm for. Hopefully they'll get there soon.
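Roughly, the pipeline described above looks like this; it assumes openai-whisper plus a local OpenAI-compatible endpoint (base_url and model tag are assumptions), and as the comment shows, even a strict prompt and zero temperature won't fully stop a small model from inventing things, so keep the raw transcript around:

```python
# Transcribe a .wav with openai-whisper, then ask a local model to fix
# only punctuation/paragraphs before dropping it into an Obsidian folder.
import whisper
from openai import OpenAI

transcript = whisper.load_model("base").transcribe("dream_2024-05-01.wav")["text"]

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="mistral:7b",  # assumed local model tag
    temperature=0,
    messages=[
        {"role": "system", "content": "Add punctuation and paragraph breaks to the user's transcript. Do not add, remove, or reword anything."},
        {"role": "user", "content": transcript},
    ],
)

cleaned = resp.choices[0].message.content
with open("ObsidianVault/DreamJournal/2024-05-01.md", "w") as f:
    f.write(cleaned)
```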
simonh · 3h ago
It’s often been assumed that accuracy and ‘correctness’ would be easy to implement on computers because they operate on logic, in some sense. It’s originality and creativity that would be hard, or impossible because it’s not logical. Science Fiction has been full of such assumptions. Yet here we are, the actual problem is inventing new heavy enough training sticks to beat our AIs out of constantly making stuff up and lying about it.
I suppose we shouldn’t be surprised in hindsight. We trained them on human communicative behaviour after all. Maybe using Reddit as a source wasn’t the smartest move. Reddit in, Reddit out.
root_axis · 1h ago
> It’s often been assumed that accuracy and ‘correctness’ would be easy to implement on computers because they operate on logic, in some sense. It’s originality and creativity that would be hard
More fundamental than the training data is the fact that the generative outputs are statistical, not logical. This is why they can produce a sequence of logical steps but still come to incorrect or contradictory conclusions. This is also why they tackle creativity more easily since the acceptable boundaries of creative output is less rigid. A photorealistic video of someone sawing a cloud in half can still be entertaining art despite the logical inconsistencies in the idea.
smallmancontrov · 2h ago
Pre-training gets you GPT-3, not InstructGPT/ChatGPT. During fine-tuning OpenAI (and everyone else) specifically chose to "beat in" a heavy bias-to-action because a model that just answers everything with "it depends" and "needs more info" is even more useless than a model that turns every prompt into a creative writing exercise. Striking a balance is simply a hard problem -- and one that many humans have not mastered for themselves.
HankStallone · 2h ago
The worst news I've seen about AI was a study that said the major ones get 40% of their references from Reddit (I don't know how they determined that). That explains the cloying way it tries to be friendly and supportive, too.
sandbags · 2h ago
I saw someone reference this today and the question I had was whether this counted the trillions of words accrued from books and other sources. i.e. is it 40% of everything? Or 40% of what they can find a direct attribution link for?
dragonwriter · 2h ago
> It’s often been assumed that accuracy and ‘correctness’ would be easy to implement on computers because they operate on logic, in some sense. It’s originality and creativity that would be hard, or impossible because it’s not logical.
It is easy, comparatively. Accuracy and correctness is what computers have been doing for decades, except when people have deliberately compromised that for performance or other priorities (or used underlying tools where someone else had done that, perhaps unwittingly.)
> Yet here we are, the actual problem is inventing new heavy enough training sticks to beat our AIs out of constantly making stuff up and lying about it.
LLMs and related AI technologies are very much an instance of extreme deliberate compromise of accuracy, correctness, and controllability to get some useful performance in areas where we have no idea how to analytically model the expected behavior but have lots of more or less accurate examples.
JumpCrisscross · 2h ago
I don't think we're anywhere close to running cutting-edge LLMs on our phones or laptops.
What may be around the corner is running great models on a box at home. The AI lives at home. Your thin client talks to it, maybe runs a smaller AI on device to balance latency and quality. (This would be a natural extension for Apple to go into with its Mac Pro line. $10 to 20k for a home LLM device isn't ridiculous.)
Not sure about the Mac Pro, since you pay a lot for the big fancy case. The Studio seems more sensible.
You can also string two 512GB Mac Studios together using MLX to load even larger models - here's 671B 8-bit DeepSeek R1 doing that: https://twitter.com/alexocheema/status/1899735281781411907
And of course Nvidia and AMD are coming out with options for massive amounts of high bandwidth GPU memory in desktop form factors.
I like the idea of having basically a local LLM server that your laptop or other devices can connect to. Then your laptop doesn’t have to burn its battery on LLM work and it’s still local.
JumpCrisscross · 2h ago
> Not sure about the Mac Pro, since you pay a lot for the big fancy case. The Studio seems more sensible
Oh wow, a maxed out Studio could run a 600B parameter model entirely in memory. Not bad for $12k.
There may be a business in creating the software that links that box to an app on your phone.
brokencode · 57m ago
Perhaps said software could even form an end to end encrypted tunnel from your phone to your local LLM server anywhere over the internet via a simple server intermediary.
The amount of data transferred is tiny and the latency costs are typically going to be dominated by the LLM inference anyway. Not much advantage to doing LAN only except that you don’t need a server.
Though the amount of people who care enough to buy a $3k - $10k server and set this up compared to just using ChatGPT is probably very small.
JumpCrisscross · 54m ago
> amount of people who care enough to buy a $3k - $10k server and set this up compared to just using ChatGPT is probably very small
So I maxed that out, and that's with Apple's margins. I suspect you could do it for $5k.
I’d also note that for heavy users of ChatGPT, the difference between the energy costs of a home setup and the price of ChatGPT tokens may make this financially compelling.
dghlsakjg · 31m ago
That software is an HTTP request, no?
Any number of AI apps allow you to specify a custom endpoint. As long as your AI server accepts connections to the internet, you're gravy.
data-ottawa · 51m ago
This is what I’m doing with my AMD 395+.
I’m running docker containers with different apps and it works well enough for a lot of my use cases.
I mostly use Qwen Code and GPT OSS 120b right now.
When the next generation of this tech comes through I will probably upgrade despite the price, the value is worth it to me.
bigyabai · 2h ago
> $10 to 20k for a home LLM device isn't ridiculous.
At that point you are almost paying more than the datacenter does for inference hardware.
JumpCrisscross · 2h ago
> At that point you are almost paying more than the datacenter does for inference hardware
Of course. You and I don't have their economies of scale.
bigyabai · 2h ago
Then please excuse me for calling your one-man $10,000 inference device ridiculous.
JumpCrisscross · 2h ago
> please excuse me for calling your one-man $10,000 inference device ridiculous
It’s about the real price of early microcomputers.
Until the frontier stabilizes, this will be the cost of competitive local inference. I'm not pretending that what we can run on a laptop will compete with a data centre.
simonw · 2h ago
Plenty of hobbies are significantly more expensive than that.
floweronthehill · 1h ago
I believe local LLMs are the future. They will only get better. Once we get to the level of even last year's state of the art, I don't see any reason to use chatgpt/anthropic/other.
We don't even need one big model good at everything. Imagine loading a small model from a collection of dozens of models depending on the tasks you have in mind. There is no moat.
root_axis · 48m ago
It's true that local LLMs are only going to get better, but it's not clear they will become generally practical for the foreseeable future. There have been huge improvements to the reasoning and coding capabilities of local models, but most of that comes from refinements to training data and training techniques (e.g. RLHF, DPO, CoT etc), while the most important factor by far remains the capability to reduce hallucinations to comfortable margins using the raw statistical power you get with massive full-precision parameter counts. The hardware gap between today's SOTA models and what's available to the consumer is so massive that it'll likely be at least a decade before they become practical.
tolerance · 51m ago
DEVONThink 4’s support for local models is great and could possibly contribute to the software’s enduring success for the next 10 years. I’ve found it helpful for summarizing documents and selections of text, but it can do a lot more than that apparently.
https://www.devontechnologies.com/blog/20250513-local-ai-in-...
It's a crazy upside-down world where the Mac Studio M3 Ultra 512GB is the reasonable option among the alternatives if you intend to run larger models at usable(ish) speeds.
KolmogorovComp · 43m ago
The really tough spot is finding a good model for your use case. I’ve a 16GB MacBook and have been paralyzed by the many options. I’ve settled on a quantised 14B Qwen for now, but no idea if this is a good choice.
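For a rough sense of what fits in 16GB, a back-of-envelope estimate (ballpark numbers only; KV cache, the runtime, and the OS also need a few GB of headroom):

```python
# Rough rule of thumb: quantized weights take roughly params * bytes-per-weight.
def approx_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * (bits_per_weight / 8) / 1e9

for params, bits in [(14, 4), (14, 8), (20, 4), (32, 4)]:
    print(f"{params}B @ {bits}-bit ~ {approx_weight_gb(params, bits):.1f} GB of weights")

# 14B @ 4-bit ~ 7 GB of weights -> workable on a 16GB machine with care;
# 32B @ 4-bit ~ 16 GB -> weights alone already exceed what a 16GB Mac can spare.
```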
Damogran6 · 30m ago
Oddly, my 2013 Mac Pro (Trashcan) runs LLMs pretty well, mostly because 64GB of old-school RAM is, like, $25.
mg · 4h ago
Is anyone working on software that lets you run local LLMs in the browser?
In theory, it should be possible, shouldn't it?
The page could hold only the software in JavaScript that uses WebGL to run the neural net. And offer an "upload" button that the user can click to select a model from their file system. The button would not upload the model to a server - it would just let the JS code access it to convert it into WebGL and move it into the GPU.
This way, one could download models from HuggingFace, store them locally and use them as needed. Nicely sandboxed and independent of the operating system.
https://huggingface.co/spaces/webml-community/llama-3.2-webg... loads a 1.24GB Llama 3.2 q4f16 ONNX build
https://huggingface.co/spaces/webml-community/janus-pro-webg... loads a 2.24 GB DeepSeek Janus Pro model which is multi-modal for output - it can respond with generated images in addition to text.
https://huggingface.co/blog/embeddinggemma#transformersjs loads 400MB for an EmbeddingGemma demo (embeddings, not LLMs)
I've collected a few more of these demos here: https://simonwillison.net/tags/transformers-js/
You can also get this working with web-llm - https://github.com/mlc-ai/web-llm - here's my write-up of a demo that uses that: https://simonwillison.net/2024/Nov/29/structured-generation-...
This might be a misunderstanding. Did you see the "button that the user can click to select a model from their file system" part of my comment?
I tried some of the demos of transformers.js but they all seem to load the model from a server. Which is super slow. I would like to have a page that lets me use any model I have on my disk.
paulirish · 1h ago
Beyond all the wasm/webgpu approaches other folks have linked (mostly in the transformers.js ecosystem), there's been a standardized API brewing since 2019: https://webmachinelearning.github.io/webnn-intro/
Demos here: https://webmachinelearning.github.io/webnn-samples/ I'm not sure any of them allow you to select a model file from disk, but that should be entirely straightforward.
https://github.com/mlc-ai/web-llm-chat
https://github.com/mlc-ai/mlc-llm
https://github.com/mlc-ai/web-llm
Yeah, something like that, but without the WebGPU requirement.
Neither FireFox nor Chromium support WebGPU on Linux. Maybe behind flags. But before using a technology, I would wait until it is available in the default config.
Let's see when browsers will bring WebGPU to Linux.
SparkyMcUnicorn · 3h ago
This should be what you're looking for. It doesn't utilize the GPU, but WebGL support is in the TODOs.
https://github.com/ngxson/wllama
https://huggingface.co/spaces/ngxson/wllama
This one is pretty cool. Compile the gguf of an OSS LLM directly into an executable. Will open an interface in the browser to chat. Can also launch an OpenAI API style interface hosted locally.
Doesn't work quite as well on Windows due to the executable file size limit but seems great for Mac/Linux flavors.
https://github.com/Mozilla-Ocho/llamafile
And related is the whisper implementation: https://ggml.ai/whisper.cpp/
You don’t need a browser to sandbox something. Easier and more performant to do GPU passthrough to a container or VM.
01HNNWZ0MV43FF · 3h ago
Container or VM is a bigger commitment. VMs need root and containers need Docker group and something like docker-compose or a shell script or something.
idk it's just like, do I want to run to the store and buy a 24-pack of water bottles, and stash them somewhere, or do I want to open the tap and have clean drinking water
samsolomon · 3h ago
Is Open WebUI something like you are looking for? The design has some awkwardness, but overall it's incorporated a ton of great features.
https://openwebui.com/
No, I'm looking for an html page with a button "Select LLM". After pressing that button and selecting a local LLM from disk, it would show an input field where you can type your question and then it would use the given LLM to create the answer.
I'm not sure what OpenWebUI is, but if it was what I mean, they would surely have the page live and not ask users to install Docker etc.
It's both what you want and not; the chat/question interface is as you describe, lack-of-installation is not. The LLM work is offloaded to other software, not the browser.
I would like to skip maintaining all this crap, though: I like your approach
Jemaclus · 3h ago
You should install it, because it's exactly what you just described.
Edit: From a UI perspective, it's exactly what you described. There's a dropdown where you select the LLM, and there's a ChatGPT-style chatbox. You just docker-up and go to town.
Maybe I don't understand the rest of the request, but I can't imagine software where a webpage exists and it just magically has LLMs available in the browser with no installation?
craftkiller · 3h ago
It doesn't seem exactly like what they are describing. The end-user interface is what they are describing but it sounds like they want the actual LLM to run in the browser (perhaps via webgpu compute shaders). Open WebUI seems to rely on some external executor like ollama/llama.cpp, which naturally can still be self-hosted but they are not executing INSIDE the browser.
Jemaclus · 2h ago
Does that even exist? It's basically what they described but with some additional installation? Once you install it, you can select the LLM on disk and run it? That's what they asked for.
Maybe I'm misunderstanding something.
craftkiller · 2h ago
Apparently it does, though I'm learning about it for the first time in this thread also. Personally, I just run llama.cpp locally in docker-compose with anythingllm for the UI but I can see the appeal of having it all just run in the browser.
> You should install it, because it's exactly what you just described.
Not OP, but it really isn't what they're looking for.
Needing to install stuff vs. simply going to a web page are two very different things.
mudkipdev · 3h ago
It was done with gemma-3-270m, I hope someone will post a link to it below
https://huggingface.co/docs/transformers.js/en/guides/webgpu
eta: its predecessor was using webGL
vavikk · 3h ago
Not browser but Electron. For the browser you would have to run a local Node.js server and point the browser app to use the local API. I use Electron with Node.js and React for the UI. Yes, I can switch models.
WebGPU is not yet available in the default config of Linux browsers, so WebGL would have been perfect :)
Olshansky · 1h ago
+1 to LM Studio. Helped build a lot of intuition.
Seeing and navigating all the configs helped me build intuition around what my macbook can or cannot do, how things are configured, how they work, etc...
Great way to spend an hour or two.
deepsquirrelnet · 28m ago
I also like that it ships with some CLI tools, including an OpenAI-compatible server. It’s great to be able to take a model that’s loaded and open up an endpoint to it for running local scripts.
You can get a quick feel for how it works via the chat interface and then extend it programmatically.
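For example, roughly like this, assuming the server is running on LM Studio's usual default port (1234) and you've already loaded a model in the app:

```python
# Talk to a loaded LM Studio model via its OpenAI-compatible local server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="local-model",  # LM Studio routes this to whatever model is loaded
    messages=[{"role": "user", "content": "Give me a one-line summary of what an OpenAI-compatible server is."}],
)
print(resp.choices[0].message.content)
```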
noja · 3h ago
I really like On-Device AI on iPhone (also runs on Mac): https://ondevice-ai.app in addition to LM Studio. It has a nice interface, with multiple prompt integration, and a good selection of models. Also the developer is responsive.
LeoPanthera · 3h ago
But it has a paid recurring subscription, which is hard to justify for something that runs entirely locally.
noja · 2h ago
I am using it without one so far. But if they continue to develop it I will upgrade.
jftuga · 1h ago
I have a MacBook Air M4 with 32 GB. What LM Studio models would you recommend for:
* General Q&A
* Specific to programming - mostly Python and Go.
I forgot the command now, but I did run a command that allowed macOS to allocate and use maybe 28 GB of RAM to the GPU for use with LLMs.
jerryliu12 · 1h ago
My main concern with running LLMs locally so far is that it absolutely kills your battery if you're constantly inferencing.
OvidStavrica · 3h ago
By far, the easiest (open source/Mac) is with Pico AI Server with Witsy for a front end:
https://picogpt.app/
https://apps.apple.com/us/app/pico-ai-server-llm-vlm-mlx/id6...
Witsy:
https://github.com/nbonamy/witsy
...and you really want at least 48GB RAM to run >24B models.
a-dub · 2h ago
Ollama is another good choice for this purpose. It's essentially a wrapper around llama.cpp that adds easy downloading and management of running instances. It's great! Also works on Linux!