Devstral

336 points by mfiguiere on 5/21/2025, 2:21:10 PM (mistral.ai)

Comments (71)

simonw · 6h ago
The first number I look at these days is the file size via Ollama, which for this model is 14GB https://ollama.com/library/devstral/tags

I find that on my M2 Mac that number is a rough approximation to how much memory the model needs (usually plus about 10%) - which matters because I want to know how much RAM I will have left for running other applications.

Anything below 20GB tends not to interfere with the other stuff I'm running too much. This model looks promising!
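
As a rough worked example of that rule of thumb, assuming the ~10% overhead holds for this model: 14GB on disk × 1.1 ≈ 15.4GB resident, which would leave around 16GB for everything else on a 32GB machine.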

nico · 2h ago
Any agentic dev software you could recommend that runs well with local models?

I’ve been using Cursor and I’m kind of disappointed. I get better results just going back and forth between the editor and ChatGPT

I tried localforge and aider, but they are kinda slow with local models

jabroni_salad · 2h ago
Do you have any other interface for the model? What kind of tokens/sec are you getting?

Try hooking aider up to gemini and see how the speed is. I have noticed that people in the localllama scene do not like to talk about their TPS.

nico · 1h ago
The models feel pretty snappy when interacting with them directly via ollama, not sure about the TPS

However I've also run into 2 things: 1) most models don't support tools, and sometimes it's hard to find a version of the model that correctly uses tools; 2) even with good TPS, since the agents are usually doing chain-of-thought and running multiple chained prompts, the experience feels slow - this is true even with Cursor using their models/APIs

lis · 5h ago
Yes, I agree. I've just run the model locally and it's making a good impression. I've tested it with some ruby/rspec gotchas, which it handled nicely.

I'll give it a try with aider to test the large context as well.

ericb · 4h ago
In ollama, how do you set up the larger context, and figure out what settings to use? I've yet to find a good guide. I'm also not quite sure how I should figure out what those settings should be for each model.

There's context length, but then, how does that relate to input length and output length? Should I just make the numbers match? 32k is 32k? Any pointers?

lis · 3h ago
For aider and ollama, see: https://aider.chat/docs/llms/ollama.html

Just for ollama, see: https://github.com/ollama/ollama/blob/main/docs/faq.md#how-c...

I’m using llama.cpp though, so I can’t confirm these methods.
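
To make the above concrete, here is a minimal sketch of the Ollama side, assuming the `ollama` Python package and a local "devstral" pull (the linked aider docs set the same knob through a model settings file). On context vs. input/output length: num_ctx is a single token budget that the prompt and the generated output share, so there is nothing separate to match up.

    import ollama

    response = ollama.chat(
        model="devstral",
        messages=[{"role": "user", "content": "Explain what num_ctx controls."}],
        options={"num_ctx": 32768},  # ask Ollama for a 32k context window
    )
    print(response["message"]["content"])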

nico · 2h ago
Are you using it with aider? If so, how has your experience been?
johnQdeveloper · 12m ago
*For people without a 24GB video card: I've got an 8GB one, and this model performs OK on ollama for simple tasks, but you'd probably want to pay for an API for anything that uses a large context window and is time sensitive:*

    total duration:       35.016288581s
    load duration:        21.790458ms
    prompt eval count:    1244 token(s)
    prompt eval duration: 1.042544115s
    prompt eval rate:     1193.23 tokens/s
    eval count:           213 token(s)
    eval duration:        33.94778571s
    eval rate:            6.27 tokens/s

    total duration:       4m44.951335984s
    load duration:        20.528603ms
    prompt eval count:    1502 token(s)
    prompt eval duration: 773.712908ms
    prompt eval rate:     1941.29 tokens/s
    eval count:           1644 token(s)
    eval duration:        4m44.137923862s
    eval rate:            5.79 tokens/s

Compared to an API call that finishes in about 20% of the time, it feels a bit slow without the recommended graphics card and whatnot, is all I'm saying.

In terms of benchmarks, it seems unusually well tuned for the model size, but I suspect it's just a case of gaming the measurement by testing against it as part of the model's development. That's not bad in and of itself, since I suspect every LLM in this space marketed to IT folks does the same thing, so it's objective enough as a rough gauge of "is this usable?" without a heavy time expense in testing.

oofbaroomf · 5h ago
The SWE-Bench scores are very, very high for an open source model of this size. 46.8% is better than o3-mini (with Agentless-lite) and Claude 3.6 (with AutoCodeRover), but it is a little lower than Claude 3.6 with Anthropic's proprietary scaffold. And considering you can run this for almost free, this is an extraordinary model.
AstroBen · 2h ago
extraordinary.. or suspicious that the benchmarks aren't doing their job
falcor84 · 3h ago
Just to confirm, are you referring to Claude 3.7?
oofbaroomf · 3h ago
No. I am referring to Claude 3.5 Sonnet New, released October 22, 2024, with model ID claude-3-5-sonnet-20241022, colloquially referred to as Claude 3.6 Sonnet because of Anthropic's confusing naming.
ttoinou · 1h ago
And it is a very good LLM. Some people complain they don't see an improvement with Sonnet 3.7
Deathmax · 2h ago
Also known as Claude 3.5 Sonnet V2 on AWS Bedrock and GCP Vertex AI
SkyPuncher · 3h ago
> colloquially referred to as Claude 3.6

Interesting. I've never heard this.

simonw · 53m ago
It's the reason Anthropic called their next release 3.7 Sonnet - the 3.6 version number was already being used by some in the community to refer to their 3.5v2.
CSMastermind · 4h ago
I don't believe the benchmarks they're presenting.

I haven't tried it out yet but every model I've tested from Mistral has been towards the bottom of my benchmarks in a similar place to Llama.

Would be very surprised if the real life performance is anything like they're claiming.

idonotknowwhy · 5m ago
I don't believe them either. We really have to test these ourselves imo.

Qwen3 is a step backwards for me, for example. And GLM4 is my current go-to despite everyone saying it's "only good at html"

The 70b cogito model is also really good for me but doesn't get any attention.

I think it depends on our projects / languages we're using.

Still looking forward to trying this one though :)

Ancapistani · 2h ago
I've worked with other models from All Hands recently, and I believe they were based on Mistral.

My general impression so far is that they aren't quite up to Claude 3.7 Sonnet, but they're quite good. More than adequate for an "AI pair coding assistant", and suitable for larger architectural work as long as you break things into steps for it.

solomatov · 6h ago
It's very nice that it has the Apache 2.0 license, i.e. a well-understood license, instead of some "open weight" license with a lot of conditions.
resource_waste · 6h ago
This is basically the Mistral niche. If you are doing something generally perceived as ethical, you would use Gemma 3 IMO. When you aren't... well there are Apache licensed LLMs for you.
solomatov · 5h ago
IMO, it's not about ethics, it's about legal risks. What if you want to fine-tune a model on output related to your usage? Then my understanding is that all these derivatives need to be under the same license. What if G changes their prohibited use policy (the first line there is that they can update it from time to time)? There's really crazy stuff in the terms of use of some services; what if G adds something in the same vein there that basically makes your application impossible?

P.S. I am not a lawyer.

Havoc · 4h ago
They're all quite easy to strip of protections and I don't think anyone doing unethical stuff is big on following licenses anyway
orbisvicis · 5h ago
I'm not sure what you're trying to imply... only rogue software developers use devstral?
simonw · 4h ago
What's different between the ethics of Mistral and Gemma?
Philpax · 4h ago
I think their point was more that Gemma open models have restrictive licences, while some Mistral open models do not.
dismalaf · 5h ago
It's not about ethical or not, it's about risk to your startup. Ethics are super subjective (and often change based on politics). Apache means you own your own model, period.
portaouflop · 3h ago
TIL Open Source is only used for unethical purposes
qwertox · 4h ago
Maybe the EU should cover the cost of creating this agent/model, assuming it really delivers what it promises. It would allow Mistral to keep focusing on what they do and for us it would mean that the EU spent money wisely.
Havoc · 4h ago
>Maybe the EU should cover the cost of creating this model

Wouldn't mind some of my taxpayer money flowing towards apache/mit licensed models.

Even if just to maintain a baseline alternative & keep everyone honest. Seems important that we don't have some large megacorps run away with this.

dismalaf · 1h ago
Pretty sure the EU paid for some supercomputers that AI startups can use, and Mistral is a partner in that program.
ddtaylor · 8h ago
Wow. I was just grabbing some models and happened to see this one while I was messing with tool support in LlamaIndex. I have an agentic coding thing I threw together and have been trying different models on it, and I was looking to throw ReAct at it to bring in some models that don't have tool support - and this just pops into existence!

I'm not able to get my agentic system to use this model though, as it just says "I don't have the tools to do this". I tried modifying various agent prompts to explicitly say "Use foo tool to do bar" without any luck yet. All of the ToolSpecs I use are annotated Pydantic objects and so on, and every other model has figured out how to use these tools.
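
For reference, here's a minimal sketch of that kind of wiring, assuming the classic LlamaIndex ReActAgent API (llama-index-core plus llama-index-llms-ollama) and a made-up read_file tool standing in for the real ToolSpecs; ReAct drives tools through prompted Thought/Action text, so it doesn't rely on native tool-calling in the model:

    from llama_index.core.agent import ReActAgent
    from llama_index.core.tools import FunctionTool
    from llama_index.llms.ollama import Ollama

    def read_file(path: str) -> str:
        """Return the contents of a text file at `path`."""
        with open(path, "r", encoding="utf-8") as f:
            return f.read()

    llm = Ollama(model="devstral", request_timeout=120.0)
    tools = [FunctionTool.from_defaults(fn=read_file)]

    # verbose=True prints the Thought/Action/Observation loop, which makes it
    # easier to see where a model refuses to pick a tool.
    agent = ReActAgent.from_tools(tools, llm=llm, verbose=True)
    print(agent.chat("Read README.md and summarize the build steps."))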

tough · 4h ago
You can use constrained outputs to enforce tool schemas; any model can get it with a lil help.
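
A minimal sketch of that idea, assuming Ollama's structured-output support and the `ollama` Python package (the ToolCall schema is a made-up example, not part of any API): the model is constrained to emit JSON matching the schema, so even a model without native tool support can produce parseable tool calls.

    from ollama import chat
    from pydantic import BaseModel

    class ToolCall(BaseModel):
        tool: str        # name of the tool to call, e.g. "read_file"
        arguments: dict  # keyword arguments for that tool

    response = chat(
        model="devstral",
        messages=[{"role": "user", "content": "Read pyproject.toml and list the dependencies."}],
        format=ToolCall.model_json_schema(),  # constrain the output to this schema
    )
    call = ToolCall.model_validate_json(response["message"]["content"])
    print(call.tool, call.arguments)
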
dismalaf · 5h ago
It's nice that Mistral is back to releasing actual open source models. Europe needs a competitive AI company.

Also, Mistral has been killing it with their most recent models. I pay for Le Chat Pro, it's really good. Mistral Small is really good. Also building a startup with Mistral integration.

ics · 6h ago
Maybe someone here can suggest tools or at least where to look; what are the state-of-the-art models to run locally on relatively low power machines like a MacBook Air? Is there anyone tracking what is feasible given a machine spec?

"Apple Intelligence" isn't it, but it would be nice to know, without churning through tests, whether I should bother keeping around 2-3 models for specific tasks in ollama, or whether their performance is marginal and there's a more stable all-rounder model.

Miraste · 2h ago
The best general model you can run locally is probably some version of Gemma 3 or the latest Mistral Small. On a Windows machine, this is limited by VRAM, since system RAM is too low-bandwidth to run models at usable speeds. On an M-series Mac, the system memory is on-die and fast enough to use. What you can run will be the total RAM, minus whatever MacOS uses and the space you want for other programs.

To determine how much space a model needs, you look at the size of the quantized (lower precision) model on HuggingFace or wherever it's hosted. Q4_K_M is a good default. As a rough rule of thumb, this will be a little over half the parameter count, read in gigabytes. For Devstral, that's 14.3GB. You will also need 1-8GB more than that, to store the context.

For example: A 32GB Macbook Air could use Devstral at 14.3+4GB, leaving ~14GB for the system and applications. A 16GB Macbook Air could use Gemma 3 12B at 7.3+2GB, leaving ~7GB for everything else. An 8GB Macbook could use Gemma 3 4B at 2.5GB+1GB, but this is probably not worth doing.
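
A back-of-the-envelope version of that rule, as a rough sketch rather than an exact formula (Q4_K_M averages roughly 4.8 bits per weight, and the context overhead is the hand-wavy 1-8GB mentioned above):

    def estimate_ram_gb(params_billion: float,
                        bits_per_weight: float = 4.8,    # ~Q4_K_M average
                        context_overhead_gb: float = 2.0) -> float:
        # 1B parameters at 8 bits per weight is about 1GB, so scale by bits/8.
        weights_gb = params_billion * bits_per_weight / 8
        return weights_gb + context_overhead_gb

    print(estimate_ram_gb(24))  # Devstral 24B -> ~16.4 GB
    print(estimate_ram_gb(12))  # Gemma 3 12B  -> ~9.2 GB
    print(estimate_ram_gb(4))   # Gemma 3 4B   -> ~4.4 GB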

thatcherc · 5h ago
I would recommend just trying it out! (as long as you have the disk space for a few models). llama.cpp[0] is pretty easy to download and build and has good support for M-series Macbook Airs. I usually just use LMStudio[1] though - it's got a nice and easy-to-use interface that looks like the ChatGPT or Claude webpage, and you can search for and download models from within the program. LMStudio would be the easiest way to get started and probably all you need. I use it a lot on my M2 Macbook Air and it's really handy.

[0] - https://github.com/ggml-org/llama.cpp

[1] - https://lmstudio.ai/

Etheryte · 3h ago
This doesn't do anything to answer the main question of what models they can actually run.
tuesdaynight · 43m ago
LM Studio will tell you if a specific model is small enough for your available RAM/VRAM.
TZubiri · 1h ago
I feel this is part of a larger and very old business trend.

But do we need 20 companies copying each other and doing the same thing?

Like, is that really competition? I'd say competition is when you do something slightly different, but I guess it's subjective based on your interpretation of what is a commodity and what is proprietary.

To my view, everyone is outright copying and creating commodity markets:

OpenAI: The OG, the Coke of Modern AI

Claude: The first copycat, The Pepsi of Modern AI

Mistral: Euro OpenAI

DeepSeek: Chinese OpenAI

Grok/xAI: Republican OpenAI

Google/MSFT: OpenAI clone as a SaaS or Office package.

Meta's Llama: Open Source OpenAI

etc...

nylonstrung · 47m ago
Deepseek and Mistral are both more open source than Llama
amarcheschi · 1h ago
I think llama is less open source than this mistral release
bravura · 6h ago
And how do the results compare to hosted LLMs like Claude 3.7?
resource_waste · 6h ago
Eh, different usecase entirely. I don't really compare these.
bufferoverflow · 3h ago
Different class. Same exact use case.
ttoinou · 2h ago
For which kind of coding would you use a subpar LLM?
troyvit · 1h ago
I'd use a "subpar" LLM for any coding practice where I want to do the bulk of the thinking and where I care about how much coal I'm burning.

It's kind of like asking: for which kind of road trip would you use a Corolla hatchback instead of a Jeep Grand Wagoneer? For me the answer would be "almost all of them", but for others that might not be the case.

ttoinou · 33m ago
In that case, examples of which trips you'd take would be interesting, so we can take inspiration from you.
gyudin · 6h ago
Super weird benchmarks
avereveard · 6h ago
From what I gather it's fine-tuned to use OpenHands specifically, so it shows value on those benchmarks that target a whole system as a black box (i.e. agent + LLM) rather than directly targeting the LLM's inputs/outputs.
abrowne2 · 7h ago
Curious to check this out, since they say it can run on a 4090 / Mac with >32 GB of RAM.
yencabulator · 1h ago
"Can run" is pretty easy, it's pretty small and quantized. It runs at 3.7 tokens/second on pure CPU with AMD 8945HS.
ddtaylor · 7h ago
I can run it without issue on a 6800 XT with 64GB of RAM.
jadbox · 4h ago
But how does it compare to deepcoder?
AnhTho_FR · 8h ago
Impressive performance!
YetAnotherNick · 5h ago
The SWE-Bench score is super impressive for a model of any size. However, providing just one benchmark result and having to do a partnership with OpenHands makes it seem like they focused too much on optimizing that number.
ManlyBread · 5h ago
>Devstral is light enough to run on a single RTX 4090 or a Mac with 32GB RAM, making it an ideal choice for local deployment and on-device use

This is still too much, a single 4090 costs $3k

Uehreka · 5h ago
> a single 4090 costs $3k

What a ripoff, considering that a 5090 with 32GB of VRAM also currently costs $3k ;)

(Source: I just received the one I ordered from Newegg a week ago for $2919. I used hotstocks.io to alert me that it was available, but I wasn’t super fast at clicking and still managed to get it. Things have cooled down a lot from the craziness of early February.)

knicholes · 45m ago
When I needed 21 3090s and none were available but for ridiculously high prices, I bought Dell Alienware comps, stripped them out, and sold the rest. Definitely made my money back mining for crypto with those cards. Dell surprisingly has a lot of computers with great RTX cards in stock.
IshKebab · 5h ago
That's probably because the 5000 series seems to be a big let-down. It's pretty much identical to the 4000 series in efficiency; they've only increased performance by massively increasing power usage.
ttoinou · 1h ago
I can get the 5090 for 1700 euros on Amazon Spain. But there is 95% chance it is a scammy seller :P
hiatus · 5h ago
I receive NXDOMAIN for that hostname.
jsheard · 5h ago
It's hotstock.io, no plural.
oezi · 5h ago
If it runs on 4090, it also runs on 3090 which are available used for 600 EUR.
threeducks · 5h ago
More like 700 € if you are lucky. Prices are still not back down from the start of the AI boom.

I am hopeful that the prices will drop a bit more with Intel's recently announced Arc Pro B60 with 24GB VRAM, which unfortunately has only half the memory bandwidth of the RTX 3090.

Not sure why other hardware makers are so slow to catch up. Apple really was years ahead of the competition with the M1 Ultra with 800 GB/s memory bandwidth.

fkyoureadthedoc · 5h ago
> a single 4090 costs $3k

I hope not. Mine was $1700 almost 2 years ago, and the 5090 is out now...

hnuser123456 · 5h ago
The 4090 went up in price for a while as the 5000 marketing percolated and people wanted an upgrade they could actually buy.
orbisvicis · 5h ago
Is there an equivalence between gpu vram and mac ram?
viraptor · 2h ago
For loading models, it's exactly the same. Mac ram is fully (more or less) shared between CPU/GPU.