I've run that using both Ollama (easiest) and MLX. Here are the Ollama models: https://ollama.com/library/mistral-small3.1/tags - the 15GB one works fine.

For MLX https://huggingface.co/mlx-community/Mistral-Small-3.1-24B-I... and https://huggingface.co/mlx-community/Mistral-Small-3.1-24B-I... should work, I use the 8bit one like this:

  llm install llm-mlx
  llm mlx download-model mlx-community/Mistral-Small-3.1-Text-24B-Instruct-2503-8bit -a mistral-small-3.1
  llm chat -m mistral-small-3.1

The Ollama one supports image inputs too:

  llm install llm-ollama
  ollama pull mistral-small3.1
  llm -m mistral-small3.1 'describe this image' \
    -a https://static.simonwillison.net/static/2025/Mpaboundrycdfw-1.png

Output here: https://gist.github.com/simonw/89005e8aa2daef82c53c2c2c62207...

indigodaddy · 129d ago

Simon, can you recommend some small models that would be usable for coding on a standard M4 Mac Mini (only 16G ram) ?

simonw · 129d ago

That's pretty tough - the problem is that you need to have RAM left over to run actual applications!

Qwen 3 8B on MLX runs in just 5GB of RAM and can write basic code but I don't know if it would be good enough for anything interesting: https://simonwillison.net/2025/May/2/qwen3-8b/

Honestly though with that little memory I'd stick to running against hosted LLMs - Claude 3.7 Sonnet, Gemini 2.5 Pro, o4-mini are all cheap enough that it's hard to spend much money with them for most coding workflows.

codetrotter · 129d ago

How about on a MacBook Pro M2 Max with 64GB RAM? Any recommendations for local models for coding on that?

I tried to run some of the differently sized DeepSeek R1 locally when those had recently come out, but couldn’t manage at the time to run any of them. And I had to download a lot of data to try those. So if you know a specific size of DeepSeek R1 that will work on 64GB RAM on MacBook Pro M2 Max, or another great local LLM for coding on that, that would be super appreciated

freeqaz · 129d ago

I imagine that this in quantized form would fit pretty well and be decent. (DeepSeek R1 Distill Qwen 32b[1] or Qwen 3 32b[2])

Specifically the `Q6_K` quant looks solid at ~27gb. That leaves enough headroom on your 64gb Macbook that you can actually load a decent amount of context. (It takes extra VRAM for every token of context you need)

Rough math, based on this[0] calculator is that it's around ~10gb per 32k tokens of context. And that doesn't seem to change based on using a different quant size -- you just have to have enough headroom.

So with 64gb:

- ~25gb for Q6 quant

- 10-20gb for context of 32-64k

That leaves you around 20gb for application memory and _probably_ enough context to actually be useful for larger coding tasks! (It just might be slow, but you can use a smaller quant to get more speed.)
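
A minimal sketch of that arithmetic, assuming context memory scales linearly from the ~10gb per 32k tokens figure (a simplification) and using the rough sizes quoted above rather than measurements. The function names are made up for illustration:

```python
# Rough memory budget for a local model on a unified-memory Mac.
# All figures are the approximate numbers quoted above, not measurements.

GB_PER_32K_CONTEXT = 10.0  # ~10 GB per 32k tokens, per the calculator estimate


def context_gb(context_tokens: int) -> float:
    """Estimate context memory, assuming it scales linearly with token count."""
    return GB_PER_32K_CONTEXT * context_tokens / 32_768


def headroom_gb(total_gb: float, weights_gb: float, context_tokens: int) -> float:
    """Memory left for the OS and applications after weights and context."""
    return total_gb - weights_gb - context_gb(context_tokens)


if __name__ == "__main__":
    for tokens in (32_768, 65_536):
        left = headroom_gb(total_gb=64, weights_gb=25, context_tokens=tokens)
        print(f"{tokens} tokens: ~{context_gb(tokens):.0f} GB context, ~{left:.0f} GB left")
```

With a 25gb quant on 64gb, this leaves about 29gb at 32k tokens and about 19gb at 64k, which is where the "around 20gb" figure comes from.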

I hope that helps!

0: https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calcul...

1: https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-32...

2: https://huggingface.co/Qwen/Qwen3-32B-GGUF

simonw · 128d ago

I really like Mistral Small 3.1 (I have a 64GB M2 as well). Qwen 3 is worth trying in different sizes too.

I don't know if they'll be good enough for general coding tasks though - I've been spoiled by API access to Claude 3.7 Sonnet and o4-mini and Gemini 2.5 Pro.

aukejw · 128d ago

How do you determine peak memory usage? Just look at activity monitor?

I've yet to find a good overview of how much memory each model needs for different context lengths (other than back of the envelope #weights * bits). LM Studio warns you if a model will likely not fit, but it's not very exact.
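
For reference, the back-of-the-envelope estimate looks roughly like this. It's a sketch that counts only the weights; the function name is made up for illustration, and it ignores the KV cache, quantization scales, and runtime overhead, so real usage will be higher:

```python
def weights_gib(num_params: float, bits_per_weight: float) -> float:
    """Size of the weights alone in GiB: params * bits / 8 bytes."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Nominal parameter counts; actual models differ slightly from their names.
    for name, params, bits in (("8B @ 4-bit", 8e9, 4), ("32B @ 4-bit", 32e9, 4)):
        print(f"{name}: ~{weights_gib(params, bits):.1f} GiB for weights only")
```

That is a lower bound, which is why it's not much help for picking a context length.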

simonw · 128d ago

MLX reports peak memory usage at the end of the response. Otherwise I'll use Activity Monitor.

aukejw · 128d ago

I'm also trusting `get_peak_memory` + some small buffer for now.

Still, it reports accurate peak memory usage for tensors living on GPU, but seems to miss some of the non-Metal overhead, however small (https://github.com/aukejw/mlx_transformers_benchmark/issues/...).

aukejw · 128d ago

There are plenty of smaller (quantized) models that fit well on your machine! On an M4 with 24GB it’s already possible to comfortably run 8B quantized models.

I'm benchmarking runtime and memory usage for a few of them: https://aukejw.github.io/mlx_transformers_benchmark/

jychang · 129d ago

16GB on a mac with unified memory is too small for good coding models. Anything on that machine is severely compromised. Maybe in ~1 year we will see better models that fit in ~8gb vram, but not yet.

Right now, for a coding LLM on a Mac, the standard is Qwen 3 32b, which runs great on any M1 mac with 32gb memory or better. Qwen 3 235b is better, but fewer people have 128gb memory.

Anything smaller than 32b, you start seeing a big drop off in quality. Qwen 3 14b Q4_K_M is probably your best option at 16gb memory, but it's significantly worse in quality than 32b.

chedabob · 128d ago

What do you use to interface with Qwen?

I have LMStudio installed, and use Continue in VSCode, but it doesn't feel nearly as feature rich compared to using something like Cursor's IDE, or the GitHub Copilot plugin.

Lalabadie · 128d ago

Continue can be your autocomplete provider – and use a smaller and faster model. Something like Cline (or Roo or Kilocode or another fork) would be the more Cursor-like assistant there.

reichardt · 129d ago

With around 4.6 GiB model size the new Qwen3-8B quantized to 4-bit should fit comfortably in 16 GiB of memory: https://huggingface.co/mlx-community/Qwen3-8B-4bit

martin_a · 128d ago

Strange idea, but if I'd like to set up a solid LLM for use in my home network, how much processing power would I need for a multi-purpose model?

A Raspberry Pi? An old ThinkPad? A fully specced-out latest gen MacBook?

edit: One of those old Mac Pros?

wsintra2022 · 128d ago

That’s what I tried initially, an old black tin can Mac Pro, but it couldn’t do it. Next I splashed out on an M2 Ultra 64gb Mac Pro, which runs ollama with qwen3 32b - reverse shell into the localhost with open web-ui and automatic1111 and voila, AI on my home network

martin_a · 128d ago

Hm, that seems like a lot of power use. I thought I could get away with somewhat less.

the_other_mac · 128d ago

Run Mistral 7b in under 4gb ram:

https://github.com/garagesteve1155/Overload

(As announced this morning in the FB group "Dull Men's Club"!)

kergonath · 129d ago

> I think this is a game changer, because data privacy is a legitimate concern for many enterprise users.

Indeed. At work, we are experimenting with this. Using a cloud platform is a non-starter for data confidentiality reasons. On-premise is the way to go. Also, they’re not American, which helps.

> Btw, you can also run Mistral locally within the Docker model runner on a Mac.

True, but you can do that only with their open-weight models, right? They are very useful and work well, but their commercial models are bigger and hopefully better (I use some of their free models every day, but none of their commercial ones).

distances · 129d ago

I also kind of don't understand how it seems everyone is using AI for coding. I haven't had a client yet which would have approved any external AI usage. So I basically use them as search engines on steroids, but code can't go directly in or out.

fhd2 · 129d ago

You might be able to get your clients to sign something to allow usage, but if you don't, as you say, it doesn't seem wise to vibe code for them. For two reasons:

1. A typical contract transfers the rights to the work. The ownership of AI generated code is legally a wee bit disputed. If you modify and refactor generated code heavily it's probably fine, but if you just accept AI generated code en masse, making your client think that you wrote it and it is therefore their copyright, that seems dangerous.

2. A typical contract or NDA also contains non disclosure, i.e. you can't share confidential information, e.g. code (including code you _just_ wrote, due to #1) with external parties or the general public willy nilly. Whether any terms of service assurances from OpenAI or Anthropic that your model inputs and outputs will probably not be used for training are legally sufficient, I have doubts.

IANAL, and _perhaps_ I'm wrong about one or both of these, in one or more countries, but by and large I'd say the risk is not worth the benefit.

I mostly use third party LLMs like I would StackOverflow: Don't post company code there verbatim, make an isolated example. And also don't paste from SO verbatim. I tried other ways of using LLMs for programming a few times in personal projects and can't say I worry about lower productivity with these limitations. YMMV.

(All this also generally goes for employees with typical employment contracts: It's probably a contract violation.)

jstummbillig · 129d ago

Nobody is seriously disputing the ownership of AI generated code. A serious dispute would be a considerable, concerted effort to stop AI code generation in any jurisdiction, that provides a contrast to the enormous, ongoing efforts by multiple large players with eye-watering investments to make code generation bigger and better.

Note, that this is not a statement about the fairness or morality of LLM building, but to think that the legality of AI code generation is something to reasonably worry about, is betting against multiple large players and their hundreds of billions of dollars in investment right now, and that likely puts you in a bad spot in reality.

reverius42 · 129d ago

> Nobody is seriously disputing the ownership of AI generated code

From what I've been following it seems very likely that, at least in the US, AI-generated anything can't actually be copyrighted and thus can't have ownership at all! The legal implications of this are yet to percolate through the system though.

staunton · 129d ago

Only if that interpretation lasts despite likely intense lobbying to the contrary.

cess11 · 128d ago

Other forms of LLM output are being seriously challenged, however.

https://llmlitigation.com/case-updates.html

Personally I have roughly zero trust in US courts on this type of issue but we'll see how it goes. Arguably there are cases to be made where LLMs cough up code cribbed from repos with certain licenses without crediting authors and so on. It's probably a matter of time until some aggressively litigious actors make serious, systematic attempts at getting money out of this, producing case law as a by-product.

Edit: Oh right, Butterick et al went after Copilot and image generation too.

https://githubcopilotlitigation.com/case-updates.html

https://imagegeneratorlitigation.com/case-updates.html

mistrial9 · 129d ago

this is "Kool-aid" from the supply side of LLMs for coding IMO. Plenty of people are plenty upset about the capture of code at Github corral, fed into BigCorp$ training systems.

parent statement reminds me of smug French in a castle north of London circa 1200, with furious locals standing outside the gates, dressed in rags with farm tools as weapons. One well-equipped tower guard says to another "no one is seriously disputing the administration of these lands"

_joel · 128d ago

Your mother was a hamster and your father smelt of elderberries?

jstummbillig · 128d ago

I think the comparison falls flat, but it's actually really funny. I'll keep it in mind.

distances · 129d ago

Yes these are indeed the points. I don't really care too much, it would make me a bit more efficient but I'm billing by the hour anyway so I'm completely fine playing by the book.

fhd2 · 129d ago

Not sure I can agree with the "I'm billing by the hour" part.

I mean sure, but I think of my little agency providing value, for a price. Clients have budgets, they have limited benefits from any software they build, and in order to be competitive against other agencies or their internal teams, overall, I feel we need to provide a good bang for buck.

But since it's not all that much about typing in code, and since even that activity isn't all that sped up by LLMs, not if quality and stability matters, I would still agree that it's completely fine.

distances · 129d ago

Yes, it's important of course that I'm efficient, and I am. But my coding speed isn't the main differentiating factor why clients like me.

I meant that I don't care enough to spearhead and drive this effort within the client orgs. They have their own processes, and internal employees would surely also like to use AI, so maybe they'll get there eventually. And meanwhile I'll just use it in the approved ways.

_bin_ · 129d ago

This comes down to a question of what one can prove. NNs are necessarily not explainable and none of this would have much evidence to show in court.

fhd2 · 128d ago

Sure there's evidence: Your statements about this when challenged. And perhaps to a degree the commit log, at least that can arouse suspicion.

Sure, you can say "I'd just lie about it". But I don't know how many people would just casually lie in court. I sure wouldn't. Ethics is one thing, it takes a lot of guts, considering the possible repercussions.

_bin_ · 128d ago

"I do not recall"

fhd2 · 128d ago

Yup, Gates style would work. But billionaires have a tendency to not get into serious trouble for lying to the public, a court, congress and what not. Commoners very much do.

genghisjahn · 129d ago

What about 10 years ago when we all copied code from SO? Did we worry about copyright then? Maybe we did and I don’t recall.

layer8 · 129d ago

“We” took care to not copy it verbatim (it’s the concrete code form that is copyrighted, not the algorithm), and depending on jurisdiction there is the concept of https://en.wikipedia.org/wiki/Threshold_of_originality in copyright law, which short code snippets on Stack Overflow typically don’t meet.

fhd2 · 129d ago

It's roughly the same, legally, and I was well aware of that.

Legally speaking, you also want to be careful about your dependencies and their licenses, a company that's afraid to get sued usually goes to quite some lengths to ensure they play this stuff safe. A lot of smaller companies and startups don't know or don't care.

From a professional ethics perspective, personally, I don't want to put my clients in that position unless they consciously decide they want that. They hire professionals not just to get work done they fully understand, but to a large part to have someone who tells them what they don't know.

genghisjahn · 129d ago

You raise a good point. It was kinda gray in the SO days. You almost always had to change something to get your code to work. But a lot of LLMs can spit out code that you can just paste in. And, I guess maybe the tests all pass, but if it goes wrong, you, the coder, probably don't know where it went wrong. But if you'd written it all yourself, you could probably guess.

I'm still sorting all this stuff out personally. I like LLMs when I work in an area I know well. But vibing in areas of technology that I don't know well just feels weird.

pfannkuchen · 129d ago

SO seems different because the author of the post is republishing it. If they are republishing copyrighted material without notice, it seems like the SO author is the one in violation of copyright.

In the LLM case, I think it’s more of an open question whether the LLM output is republishing the copyrighted content without notice, or simply providing access to copyrighted content. I think the former would put the LLM provider in hot water, while the latter would put the user in hot water.

shmel · 129d ago

How is it different from the cloud? Plenty startups store their code on github, run prod on aws, and keep all communications on gmail anyway. What's so different about LLMs?

simion314 · 129d ago

>How is it different from the cloud? Plenty startups store their code on github, run prod on aws, and keep all communications on gmail anyway. What's so different about LLMs?

Those plenty startups will also use Google, OpenAi or the built in Microsoft AI.

This is clearly for companies that need to keep the sensitive data under their control. I think they also get support with adding more training to the model to be personalized for your needs.

layer8 · 129d ago

It’s not different. If you have confidentiality requirements like that, you also don’t store your code off-premises. At least not without enforceable contracts about confidentiality with the service provider, approved by the client.

jamessinghal · 129d ago

I think it's a combination of a fundamental distrust of the model makers and a history of them training on user data with and without consent.

The main players all allow some form of zero data retention but I'm sure the more cautious CISO/CIOs flat out don't trust it.

tcoff91 · 129d ago

I think that using something like Claude on Amazon Bedrock makes more sense than directly using Anthropic. Maybe I'm naive but I trust AWS more than Anthropic, OpenAI, or Google to not misuse data.

mark_l_watson · 129d ago

I have good results running Ollama locally with open models like Gemma 3, Qwen 3, etc. The major drawback is slower inference speed. Commercial APIs like Google Gemini are so much faster.

Still, I find local models very much worth using after taking the time to set them up with Emacs, open-codex, etc.

trollbridge · 129d ago

Most of my clients have the same requirement. Given the code bases I see my competition generating, I suspect other vendors are simply violating this rule.

abujazar · 129d ago

You can set up your IDE to use local LLMs through e.g. Ollama if your computer is powerful enough to run a decent model.
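
As a rough sketch of what an editor plugin does under the hood, here is a call to Ollama's local HTTP API (`/api/generate` on the default port 11434) using only the standard library. The helper names and the example model tag are assumptions for illustration, and `generate` needs a running Ollama server with that model already pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example (assumes `ollama pull qwen3:8b` has been run and the server is up):
# print(generate("qwen3:8b", "Write a Python function that reverses a string."))
```

Nothing leaves the machine here, since the request only goes to localhost.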

crimsoneer · 128d ago

Are your clients not on AWS/Azure/GCP? They all offer private LLMs out of the box now.

ATechGuy · 128d ago

That was my question too.

blitzar · 128d ago

I also kind of don't understand how it seems everyone is using AI for doing their homework. I haven't had a teacher yet which would have approved any AI usage.

Same process, fewer people being called out for "cheating" in a professional setting.

Pamar · 128d ago

Personally I am trying to see if we can leverage AI to help write design documents instead of code, based on a fairly large library of human (poorly) written design documents and bug reports.

betterThanTexas · 129d ago

I would take any such claim with a heavy rock of salt because the usefulness of AI is going to vary drastically with the sort of work you're tasked with producing.

demarq · 128d ago

Also it’s like saying you can host a database on your Mac.

Unless you have experience hosting and maintaining models at scale and with an enterprise feature set, then I believe what they are offering is beyond (for now) what you’d be able to put up on your own.

Tepix · 128d ago

premises, not premise.

https://www.grammar-monster.com/easily_confused/premise_prem...

ATechGuy · 129d ago

Have you tried using private inference that uses GPU confidential computing from Nvidia?

lolinder · 129d ago

Game changer feels a bit strong. This is a new entry in a field that's already pretty crowded with open source tooling that's already available to anyone with the time and desire to wire it all up. It's likely that they execute this better than the community-run projects have so far and make it more approachable and Enterprise friendly, but just for reference I have most of the features that they've listed here already set up on my desktop at home with Ollama, Open WebUI, and a collection of small hand-rolled apps that plug into them. I can't run very big models on mine, obviously, but if I were an Enterprise I would.

The key thing they'd need to nail to make this better than what's already out there is the integrations. If they can make it seamless to integrate with all the key third-party enterprise systems then they'll have something strong here, otherwise it's not obvious how much they're adding over Open WebUI, LibreChat, and the other self-hosted AI agent tooling that's already available.

troyvit · 128d ago

> crowded with open source tooling that's already available to anyone with the time and desire to wire it all up.

Those who don't have the time and desire to wire it all up probably make up a larger part of the market than those who do. It's a long-tail proposition, and that might be a problem.

> I have most of the features that they've listed here already set up on my desktop at home

I think your boss and your boss' boss are the audience they are going for. In my org there's concern over the democratization of locally run LLMs and the loss of data control that comes with it.

Mistral's product would allow IT or Ops or whatever department to set guardrails for the organization. The selling point that it's turn-key means that a small organization doesn't have to invest a ton of time into all the tooling needed to run it and maintain it.

Edit: I just re-read your comment and I do have to agree though. "game-changer" is a bit strong of a word.

abujazar · 129d ago

Actually you shouldn't be running LLMs in Docker on Mac because it doesn't have GPU support. So the larger models will be extremely slow if they'll even produce a single token.

burnte · 129d ago

I have an M4 Mac Mini with 24GB of RAM. I loaded LM Studio on it 2 days ago and had Mistral NeMo running in ten minutes. It's a great model, I need to figure out how to add my own writing to it, I want it to generate some starter letters for me. Impressive model.

raxxorraxor · 128d ago

I think the standard setup for vscode continue for ollama is already 99% of the AI coding support I need. I think it is even better than commercial offerings like cursor, at least in the projects and languages I use and have tested it with.

We had a Mac Studio here nobody was using, and we now use it as a tiny AI station. If we like, we could even embed our codebases, but it wasn't necessary yet. Otherwise it should be easy to just buy a decent consumer PC with a stronger GPU, but performance isn't too bad even for autocomplete.

thepill · 128d ago

Which models are you using?

Palmik · 128d ago

I really don't see the big deal. Gemini also allows on-prem in similar fashion: https://cloud.google.com/blog/products/ai-machine-learning/r...

nicce · 129d ago

> Btw, you can also run Mistral locally within the Docker model runner on a Mac.

Efficiently? I thought macOS does not have an API that would let Docker use the GPU.

jt_b · 129d ago

I haven't/wouldn't use it because I have a decent K8S ollama/open-webui setup, but docker announced this a month ago: https://www.docker.com/blog/introducing-docker-model-runner

nicce · 129d ago

Hmm, I guess that is not actually running inside a container, so there is no isolation. Some kind of new approach that mixes llama.cpp, the OCI format and the docker CLI.

v3ss0n · 129d ago

What's the point when we can run much more powerful models now? Qwen3, Deepseek

_bin_ · 129d ago

It would be short-termist for Americans or euros to use chinese-made models. Increasing their popularity has an indirect but significant cost in the long term. china "winning AI" should be an unacceptable outcome for America or europe by any means necessary.

atwrk · 128d ago

Why would that be? I can see why Americans wouldn't want to do that, but Europeans? In the current political climate, where the US openly claims their desire to annex European territory and so on? I'd rather see them prefer a locally hostable open source solution like DeepSeek.

tigroferoce · 128d ago

My two cents, as a European, is that since we are more and more asking LLMs for information, it wouldn't be wise to let a foreign country, not even truly democratic, choose the information we get.

jamesblonde · 128d ago

The Chinese don't get any information if we use self-hosted DeepSeek or Qwen. They are open-source. You can run them in an air-gapped environment that can't phone home.

fennecbutt · 126d ago

But their models are gimped by bad censoring. At least I can still ask chatgpt how many innocent civilians America has bombed.

ulnarkressty · 129d ago

I think many in this thread are underestimating the desire of VPs and CTOs to just offload the risk somewhere else. Quite a lot of companies handling sensitive data are already using various services in the cloud and it hasn't been a problem before - even in Europe with its GDPR laws. Just sign an NDA or whatever with OpenAI/Google/etc. and if any data gets leaked they are on the hook.

boringg · 129d ago

Good luck ever winning that one. How are you going to prove out a data leak with an AI model without deploying excessive amounts of legal spend?

You might be talking about small tech companies that have no other options.

dzhiurgis · 128d ago

How many is many? Literally all of them use cloud services.

ATechGuy · 129d ago

Why not use confidential computing based offerings like Azure's private inference for privacy concerns?

beernet · 128d ago

Mistral really became what all the other over-hyped EU AI start-ups / collectives (Stability, Eleuther, Aleph Alpha, Nyonic, possibly Black Forest Labs, government-funded collaborations, ...) failed to achieve, although many of them existed way before Mistral. Congrats to them, great work.

Palmik · 128d ago

It feels to me they turned into a generic AI consulting & solutions company. That does not mean it's a bad business, especially since they might benefit from the "built in EU" spin (whether through government contracts, regulation, or otherwise).

One can deploy similar solution (on-prem) using better and more cost efficient open-source models and infrastructure already.

What Mistral offers here is managing that deployment for you, but there's nothing stopping other companies doing the same with fully open stack. And those will have the benefit of not wasting money on R&D.

jamesblonde · 128d ago

That's what we do with Hopsworks - EU built platform for developing and operating AI systems. We have customers running DeepSeek-v3 and Llama models. I never thought about slapping a Chat UI on it and selling the Chat app as a ready made product for the sovereign AI market. But why not.

stogot · 128d ago

I’m wondering why. More funding, better talent, strategy, or something else?

agumonkey · 128d ago

i'm an outsider but none of the startups mentioned above ever came to my ears. Mistral suddenly popped after openai/anthropic exploded, and they were rapidly described as the 3rd contender, with emphasis on technical merit. Maybe i was fooled though.

danielbln · 128d ago

Black Forest Labs are the makers of FLUX, which for a while was the best open image model available (and generally a pretty strong image model). That said, now with a wave of Chinese models and the advent of autoregressive image models, I'm not sure how much that will stay true.

bobxmax · 128d ago

is Mistral really doing anything here? Llama models are open source, Cohere runs on prem etc

retinaros · 128d ago

what did they achieve exactly?

beernet · 124d ago

Signs of market traction and executing on product development. All other mentioned companies never made it there.

85392_school · 129d ago

This announcement accompanies the new and proprietary Mistral Medium 3, being discussed at https://news.ycombinator.com/item?id=43915995

Havoc · 129d ago

Not quite following. It seems to talk about features commonly associated with local servers but then ends with available on gcp

Is this an API point? A model enterprises deploy locally? A piece of software plus a local model?

There is so much corporate synergy speak there I can’t tell what they’re selling

frabcus · 128d ago

They mention Google Cloud Marketplace (not Google Cloud Platform), this seems to be their listing there:

https://console.cloud.google.com/marketplace/product/mistral...

Which says:

"Managed Services are fully hosted, managed and supported by the service providers. Although you register with the service provider to use the service, Google handles all billing."

My assumption is that they're using Google Marketplace for discovery and billing, and they offer a hosted option or an on-prem option.

But agreed, it isn't clear!

tecleandor · 128d ago

Lots of tools offer to bill you via Google Marketplace or the AWS equivalent as:

- it joins billing with other stuff

- I guess it's easier to get approval

- and more important (at least in our case), it allows you to reach your Google Cloud (or AWS) contract commitments of expense, and keep your discounts :)

_pdp_ · 129d ago

While I am rooting for Mistral, having access to a diverse set of models is the killer app IMHO. Sometimes you want to code. Sometimes you want to write. Not all models are made equal.

the_clarence · 128d ago

Tbh I think the one general model approach is winning. People don't want to figure out which model is better at what unless it's for a very specific task.

_pdp_ · 128d ago

IMHO people want to interact with agents that do things, not with models that chat. And agents by definition are specialised, which means a specific model. Mistral might not be good for all types of tasks, just like the top-of-the-line models are not always the best for everything.

the_clarence · 128d ago

Agents are specialized via prompts and MCP now, and more and more rarely by model

sschueller · 128d ago

Couldn't you place a very lightweight model in front to figure out which model to use?

sReinwald · 128d ago

That’s a perfectly valid idea in theory, but in practice you’ll run into a few painful trade-offs, especially in multi-user environments. Trust me, I'm currently doing exactly that in our fairly limited exploration of how we can leverage local LLMs at work (SME).

Unless you have sufficient VRAM to keep all potential specialized models loaded simultaneously (which negates some of the "lightweight" benefit for the overall system), you'll be forced into model swapping. Constantly loading and unloading models to and from VRAM is a notoriously slow process.

If you have concurrent users with diverse needs (e.g., a developer requiring code generation and a marketing team member needing creative text), the system would have to swap models in and out if they can't co-exist in VRAM. This drastically increases latency before the selected model even begins processing the actual request.

The latency from model swapping directly translates to a poor user experience. Users, especially in an enterprise context, are unlikely to tolerate waiting for a minute or more just for the system to decide which model to use and then load it. This can quickly lead to dissatisfaction and abandonment.

This external routing mechanism is, in essence, an attempt to implement a sort of Mixture-of-Experts (MoE) architecture manually and at a much coarser grain. True MoE models (like the recently released Qwen3-30B-A3B, for instance) are designed from the ground up to handle this routing internally, often with shared parameter components and highly optimized switching mechanisms that minimize latency and resource contention.

To mitigate the latency from swapping, you'd be pressured to provision significantly more GPU resources (more cards, more VRAM) to keep a larger pool of specialized models active. This increases costs and complexity, potentially outweighing the benefits of specialization if a sufficiently capable generalist model (or a true MoE) could handle the workload with fewer resources. And a lot of those additional resources would likely sit idle for most of the time, too.
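
To make the swapping problem concrete, here is a toy sketch. It assumes a naive keyword router (a stand-in for the small classifier LLM) and an LRU pool that only counts model loads; the model names and keywords are hypothetical, and no real models are loaded:

```python
from collections import OrderedDict

# Hypothetical model names, for illustration only.
CODE_MODEL = "coder-model"
CREATIVE_MODEL = "writer-model"
DEFAULT_MODEL = "general-model"


def route(prompt: str) -> str:
    """Pick a model with a naive keyword check (stand-in for a small router LLM)."""
    text = prompt.lower()
    if any(word in text for word in ("function", "bug", "python", "code")):
        return CODE_MODEL
    if any(word in text for word in ("slogan", "story", "blog post")):
        return CREATIVE_MODEL
    return DEFAULT_MODEL


class ModelPool:
    """Tracks which models are resident; evicts the least recently used when full."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.loaded = OrderedDict()  # model name -> None, ordered by recency
        self.loads = 0  # each load stands for a slow transfer into VRAM

    def acquire(self, model: str) -> None:
        if model in self.loaded:
            self.loaded.move_to_end(model)  # already resident: no load cost
            return
        if len(self.loaded) >= self.capacity:
            self.loaded.popitem(last=False)  # evict the least recently used model
        self.loaded[model] = None
        self.loads += 1


if __name__ == "__main__":
    prompts = ["Fix this bug in my function", "Write a slogan for our launch"] * 2
    for capacity in (1, 2):
        pool = ModelPool(capacity)
        for prompt in prompts:
            pool.acquire(route(prompt))
        print(f"capacity={capacity}: {pool.loads} loads for {len(prompts)} requests")
```

With room for only one model, two alternating users force a load on every single request; with room for two, each model is loaded once. Multiply each load by tens of seconds and the trade-off is obvious.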

gustofied · 128d ago

Have you looked into semantic router? It will be a faster way to look up the right model for the right task. I agree that using a llm for routing is not good, takes money, takes time, and can often take the wrong route.

sReinwald · 128d ago

Semantic router is on my radar, but I haven't had a good look at it yet. The primary bottleneck in our current setup isn't really the routing decision time. The lightweight LLM I chose (Gemma3 4B) handles the task identification fairly well in terms of both speed and accuracy from what I've found.

For some context: this is a fairly limited exploratory deployment which runs alongside other priority projects for me, so I'm not too obsessed with optimizing the decision-making time. Those three seconds are relatively minor when compared with the 20–60 seconds it takes to unload the old and load a new model.

I can see semantic router being really useful in scenarios built around commercial, API-accessed models, though. There, it could yield significant cost savings by, for example, intelligently directing simpler queries to a less capable but cheaper model instead of the latest and greatest (and likely significantly more expensive) model users might feel drawn to. You're basically burning money if you let your employees use Claude 3.7 to format a .csv file.
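As a toy illustration of that cost-routing idea, the sketch below picks a model tier by comparing the query against a few example utterances per route. It is an assumption-laden stand-in and not the semantic router library's API. Real routers use embedding models, whereas this uses plain bag-of-words cosine similarity so it runs with the standard library only. The route names and example utterances are made up.

```python
import math
import re
from collections import Counter

# Hypothetical routes: a cheap tier for chores, a strong tier for hard problems.
ROUTES = {
    "cheap-model": [
        "format this csv file",
        "rename these columns",
        "convert a date string",
    ],
    "strong-model": [
        "design a distributed system architecture",
        "debug a race condition in concurrent code",
        "prove this algorithm is correct",
    ],
}


def vectorize(text):
    """Lowercase bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def route(query, default="strong-model"):
    """Return the route whose closest example best matches the query."""
    q = vectorize(query)
    best_route, best_score = default, 0.0
    for name, examples in ROUTES.items():
        score = max(cosine(q, vectorize(e)) for e in examples)
        if score > best_score:
            best_route, best_score = name, score
    return best_route


print(route("please format my csv file"))
print(route("debug a race condition"))
```

Queries that match nothing fall back to the stronger model here. That is a deliberate (and debatable) choice: it trades cost for a lower chance of sending a hard question to a model that can't handle it.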

F-Lexx · 128d ago

Good idea. Then you could place another lighter-weight model in front of THAT, to figure out which model to use in order to find out which model to use.

It's LLMs, all the way down.

the_clarence · 128d ago

My guess is that this is basically what AI providers are slowly moving to. And this is what models seem to be doing underneath the surface as well now with Mixture of Experts (MoE).

promiseofbeans · 128d ago

I mean, the general-purpose models already do this in a way, routing to a selected expert. It's a pretty fundamental concept for ensemble learning, which is what MoE experts are, effectively.

I don't see any reason you couldn't stack more layers of routing in front, to select the model. However, this starts to seem inefficient.

I think the optimal solution will eventually be companies training and publishing hyper-focused expert models that are designed to be used with other models and a router. Then interface vendors can purchase different experts and assemble the models themselves, like how a phone manufacturer purchases parts from many suppliers, even their competitors, in order to create the best final product. The bigger players (e.g. Apple for this analogy) might make more parts in house, but even the latest iPhone still has Samsung chips in it in teardowns.

downsplat · 128d ago

Same here. Since I started using LLMs a bit more, the killer step for me was to set up API access to a variety of providers (Mistral, Anthropic, Gemini, OpenAI), and use a unified client to access them. I'm usually coding at the CLI, so I installed 'aichat' from GitHub and it does an amazing job. Switch models on the fly, switch between one-shot and session mode, log everything locally for later access, and ask casual questions with a single quick command.

I think all providers guarantee that they will not use your API inputs for training, it's meant as the pro version after all.

Plus it's dirt cheap, I query them several times per day, with access to high end thinking models, and pay just a few € per month.

Deathmax · 128d ago

Gemini's free tier will absolutely use your inputs for training [1], same with Mistral's free tier [2]. Anthropic and OpenAI let you opt into data collection for discounted prices or free tokens.

[1]: https://ai.google.dev/gemini-api/terms#data-use-unpaid

[2]: https://mistral.ai/terms#privacy-policy

downsplat · 128d ago

Yeah, I mean paid API access. You put a credit card in, and it's peanuts at the end of the month. Sorry I didn't specify. Good reminder that with free services you are the product!

binsquare · 129d ago

Well that sounds right up the alley of what I built here: www.labophase.com

I_am_tiberius · 129d ago

I really love using Le Chat. I feel much safer giving information to them than to OpenAI.

victorbjorklund · 129d ago

Why use this instead of an open source model?

_mlbt · 129d ago

> our world-class AI engineering team offers support all the way through to value delivery.

victorbjorklund · 129d ago

Guess that makes sense. But I'm sure they charge good money for it and then you could just use that money for someone helping you with an open source model.

disgruntledphd2 · 128d ago

Presumably one throat to choke logic applies here, particularly in Europe.

starik36 · 129d ago

I don't see any mention of hardware requirements for on prem. What GPUs? How many? Disk space?

tootie · 129d ago

I'm guessing it's flexible. Mistral makes small models capable of running on consumer hardware, so they can probably scale up and down based on needs and on what is available from hosts.

rowanajmarshall · 128d ago

I run a Mistral model on my phone!

dr_kretyn · 128d ago

Explain more please? Is that a big phone/tiny laptop with long GPU connector? Is that a tiny model?

adamsiem · 124d ago

Parsing email...

The intro video highlights searching email alongside other tools.

What email clients will this support? Are there related tools that will do this?

guerrilla · 129d ago

Interesting. Europe is really putting up a fight for once. I'm into it.

fortifAI · 128d ago

Mistral isn't really Europe, it's France. Europe has some plans but as far as I can tell their goal isn't to make something that can really compete. The goal is to make EU data stay in the EU for businesses, meanwhile every user that is not forced by their company sends their data to the US or China.

blitzar · 128d ago

Last I checked France is in Europe. It would be like saying Google or Apple are not American because they are in California.

klabb3 · 128d ago

The big-picture incongruence in the thread is the use of terms like “patriotic” and analogies to US states, which imo is just a US-centric projection. Even proponents don’t think of the EU as a supreme government with federated states, and they certainly don’t think of “Europeans” as a unified demographic. At best, the EU protects against stupid shit from other EU countries (tariffs, freedom of movement) and stupid shit from the outside, such as bullying by superpowers like Russia, China and recently also the US, or by extremely large corporations who can take on smaller nation states.

The EU is more similar to NAFTA or five eyes, and culturally the loyalty is more similar to the US vs the anglosphere, like how Americans think of Australia, UK and Canada. Well, again, until recently. Things are changing fast.

r0p3 · 128d ago

The NAFTA Parliament

klabb3 · 128d ago

Yes, analogies do have differences. And parliament is better than handshakes behind closed doors, even if it’s not perfect.

Comparing to states is much more far fetched. The UK left the union without any serious retaliation, let alone military conflict. What would happen if Texas or California tried to secede seriously?

resource_waste · 129d ago

Expected this comment.

Mistral has been consistently last place, or at least last place among ChatGPT, Claude, Llama, and Gemini/Gemma.

I know this because I had to use a permissive license for a side project and I was tortured by how miserably bad Mistral was, and how much better every other LLM was.

Need the best? ChatGPT

Need local stuff? Llama(maybe Gemma)

Need to do barely legal things that break most companies' TOS? Mistral... although DeepSeek probably beats it in 2025.

For people outside Europe, we don't have patriotism for our LLMs, we just use the best. Mistral has barely any use case.

omneity · 129d ago

> Need local stuff? Llama(maybe Gemma)

You probably want to replace Llama with Qwen in there. And Gemma is not even close.

> Mistral has been consistently last place, or at least last place among ChatGPT, Claude, Llama, and Gemini/Gemma.

Mistral held for a long time the position of "workhorse open-weights base model" and nothing precludes them from taking it again with some smart positioning.

They might not currently be leading a category, but as an outside observer I could see them (like Cohere) actively trying to find innovative business models to survive, reach PMF and keep the dream going, and I find that very laudable. I expect them to experiment a lot during this phase, and that probably means not doubling down on any particular niche until they find a strong signal.

drilbo · 129d ago

>You probably want to replace Llama with Qwen in there. And Gemma is not even close.

Have you tried the latest, gemma3? I've been pretty impressed with it. Altho I do agree that qwen3 quickly overshadowed it, it seems too soon to dismiss it altogether. EG, the 3~4b and smaller versions of gemma seem to freak out way less frequently than similar param size qwen versions, tho I haven't been able to rule out quant and other factors in this just yet.

It's very difficult to fault anyone for not keeping up with the latest SOTA in this space. The fact we have several options that anyone can serviceably run, even on mobile, is just incredible.

Anyway, i agree that Mistral is worth keeping an eye on. They played a huge part in pushing the other players toward open weights and proving smaller models can have a place at the table. While I personally can't get that excited about a closed model, it's definitely nice to see they haven't tapped out.

omneity · 129d ago

It's probably subjective to your own use, but for me Gemma3 is not particularly usable (i.e. not competitive or delivering a particular value for me to make use of it).

Qwen 2.5 14B blows Gemma 27B out of the water for my use. Qwen 2.5 3B is also very competitive. The 3 series is even more interesting with the 0.6B model actually useful for basic tasks and not just a curiosity.

Where I find Qwen relatively lackluster is its complete lack of personality.

amelius · 129d ago

I certainly had some opposite experiences lately, where Mistral was outperforming ChatGPT for some hard questions.

tacker2000 · 129d ago

What's your point here? There is a place for a European LLM, be it “patriotism” or data safety. And don't tell me the Chinese are not “patriotic” about their stuff. Everyone has a different approach. If Mistral fits the market, they will be successful.

byefruit · 129d ago

You are probably getting downvoted because you don't give any model generations or versions ('ChatGPT') which makes this not very credible.

qwertox · 128d ago

This is so fast it took me by surprise. I'm used to waiting for ages until the response is finished on Gemini and ChatGPT, but this is instantaneous.

amelius · 128d ago

I'm curious about the ways in which they could protect their IP in this setup.

badmonster · 128d ago

interesting take. i wonder if other LLM competitors would do the same.

mxmilkiib · 128d ago

the site doesn't work with dark mode, the text is dark also

phupt26 · 129d ago

Mistral's other new model (Medium 3) is great too. Link: https://newscvg.com/r/yGbLTWqQ

m-hodges · 129d ago

I love that "le chat" translates from French to English as "the cat".

Jordan-117 · 129d ago

Also, "ChatGPT" sounds like chat, j’ai pété ("cat, I farted")

layer8 · 129d ago

Mistral should highlight more in their marketing that it doesn’t make you fart.

foobahhhhh · 128d ago

Instead it disobeys commands, uses up your resources then you find it never belonged to you in the first place.

cryptonector · 128d ago

I came in to say this, and I was sure I'd be the first. This is so appropriate considering how ChatGPT, like all LLMs, hallucinates.

debugnik · 129d ago

Their M logo is a pixelated cat face as well.

AceJohnny2 · 129d ago

I wonder if they mean to reference the Belgian comic Le Chat by Philippe Geluck.

https://en.wikipedia.org/wiki/Le_Chat

caseyy · 129d ago

This will make for some very good memes. And other good things, but memes included.

iamnotagenius · 129d ago

Mistral models though are not interesting as models. Context handling is weak, language is dry, coding mediocre; not sure why anyone would choose it over Chinese (Qwen, GLM, Deepseek) or American models (Gemma, Command A, Llama).

tensor · 129d ago

Command A is Canadian. Also, Mistral models are indeed interesting. They have a pretty unique vision model for OCR. They have interesting edge models. They have interesting rare language models.

And also another reason people might use a non-American model is that dependency on the US is a serious business risk these days. Not relevant if you are in the US but hugely relevant for the rest of us.

tootie · 129d ago

I flip back and forth with Claude and Le Chat and find them comparable. Le Chat always feels very quick and concise. That's just vibes not benchmarks.

crowcroft · 128d ago

I haven't used it much, but I did find Le Chat to be FAST in a way that I don't always get with ChatGPT.

amai · 129d ago

Data privacy is a thing - in Europe.

FuriouslyAdrift · 129d ago

GPT4All has been running locally for quite a while...

curiousgal · 129d ago

Too little, too late. I work in a large European investment bank and we're already using Anthropic's Claude via GitLab Duo.

croes · 129d ago

Is there a replacement for the Safe Harbor replacement?

Otherwise it could be illegal to transfer EU data to US companies.

_bin_ · 129d ago

The law means don’t do what a slow moving regulator can and will prove in court. In this case, the law has no moral valence so I doubt anyone there would feel guilty breaking it. He may mean individuals are using ChatGPT unofficially even if prohibited nominally by management. Such is the case almost everywhere.

sofixa · 128d ago

> In this case, the law has no moral valence

That's not how laws work.

croes · 128d ago

There is a difference between uploading your own data and uploading your customers' data.

There are countries in the EU where you get sued for less.

jagermo · 128d ago

AI data residency is an issue for several of our customers, so I think there is still a big enough market for this.

alwayseasy · 128d ago

Your bank sticks with any tech that comes out first? How is this a cogent argument?

Mistral ships Le Chat – enterprise AI assistant that can run on prem

Comments (158)