Magistral — the first reasoning model by Mistral AI

659 meetpateltech 304 6/10/2025, 2:08:01 PM mistral.ai ↗

Comments (304)

danielhanchen · 14h ago

I made some GGUFs for those interested in running them at https://huggingface.co/unsloth/Magistral-Small-2506-GGUF

ollama run hf.co/unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL

./llama.cpp/llama-cli -hf unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL --jinja --temp 0.7 --top-k -1 --top-p 0.95 -ngl 99

Please use --jinja for llama.cpp and use temperature = 0.7, top-p 0.95!

Also best to increase Ollama's context length to say 8K at least: OLLAMA_CONTEXT_LENGTH=8192 ollama serve &. Some other details in https://docs.unsloth.ai/basics/magistral

ozgune · 10h ago

Their benchmarks are interesting. They are comparing to DeepSeek-V3's (non-reasoning) December and DeepSeek-R1's January releases. I feel that comparing to DeepSeek-R1-0528 would be more fair.

For example, R1 scores 79.8 on AIME 2024, R1-0528 performs 91.4.

R1 scores 70 on AIME 2025, R1-0528 scores 87.5. R1-0528 does similarly better for GPQA Diamond, LiveCodeBench, and Aider (about 10-15 points higher).

https://huggingface.co/deepseek-ai/DeepSeek-R1-0528

semi-extrinsic · 5h ago

Would also be interesting to compare with R1-0528-Qwen3-8B (chain-of-thought distilled from Deepseek-R1-0528 and post-trained into Qwen3-8B). It scores 86 and 76 on AIME 2024 and 2025 respectively.

Currently running the 6-bit XL quant on a single old RTX 2080 Ti and I'm quite impressed TBH. Simply wild for a sub-8GB download.

danielhanchen · 13h ago

Their paper https://mistral.ai/static/research/magistral.pdf is also cool! They edited GRPO via:

1. Removed KL Divergence

2. Normalize by total length (Dr. GRPO style)

3. Minibatch normalization for advantages

4. Relaxing trust region

gyrovagueGeist · 10h ago

Does anyone know why they added minibatch advantage normalization (or when it can be useful)?

The paper they cite "What matters in on-policy RL" claims it does not lead to much difference on their suite of test problems, and (mean-of-minibatch)-normalization doesn't seem theoretically motivated for convergence to the optimal policy?

danielhanchen · 5h ago

Tbh I'm unsure as well I took a skim of the paper so if I find anything I'll post it here!

Onavo · 13h ago

> Removed KL Divergence

Wait, how are they computing the loss?

danielhanchen · 12h ago

Oh it's the KL term sorry - beta * KL ie they set beta to 0.

The goal of it was to "force" the model not to stray to far away from the original checkpoint, but it can hinder the model from learning new things

trc001 · 5h ago

It's become trendy to delete it. I say trendy because many papers delete it without offering any proof that it is meaningless

mjburgess · 12h ago

It's just a penalty term that they delete

monkmartinez · 11h ago

At the risk of dating myself; Unsloth is the Bomb-dot-com!!! I use your models all the time and they just work. Thank you!!! What does llama.cpp normally use if not "jinja" for their templates?

danielhanchen · 5h ago

Oh thanks! Yes I was gonna bring it up to them! Imo if there is a chat template, by default it should be --jinja

gavi · 8h ago

too much thinking

https://gist.github.com/gavi/b9985f730f5deefe49b6a28e5569d46...

fzzzy · 8h ago

My impression from running the first R1 release locally was that it also does too much thinking.

cluckindan · 7h ago

It does not do any thinking. It is a statistical model, just like the rest of them.

robmccoll · 6h ago

What are we doing when we think?

LordDragonfang · 6h ago

"Thinking" is a term of art referring to the hidden/internal output of "reasoning" models where they output "chain of thought" before giving an answer[1]. This technique and name stem from the early observation that LLMs do better when explicitly told to "think step by step"[2]. Hope that helps clarify things for you for future constructive discussion.

[1] https://arxiv.org/html/2410.10630v1

[2] https://arxiv.org/pdf/2205.11916

bobsomers · 5h ago

We are aware of the term of art.

The point that was trying to be made, which I agree with, is that anthropomorphizing a statistical model isn’t actually helpful. It only serves to confuse laypersons into assuming these models are capable of a lot more than they really are.

That’s perfect if you’re a salesperson trying to dump your bad AI startup onto the public with an IPO, but unhelpful for pretty much any other reason, especially true understanding of what’s going on.

LordDragonfang · 5h ago

If that was their point, it would have been more constructive to actually make it.

To your point, it's only anthropomorphization if you make the anthrocentric assumption that "thinking" refers to something that only humans can do.[1]

And I don't think it confuses laypeople, when literally telling it to "think" achieves the very similar results as in humans - it produces output that someone provided it out-of-context would easily identify as "thinking out loud", and improves the accuracy of results like how... thinking does.

The best mental model of RLHF'd LLMs that I've seen is that they are statistical models "simulating"[1] how a human-like character would respond to a given natural-language input. To calculate the statistically "most likely" answer that an intelligent creature would give to a non-trivial question, with any sort of accuracy, you need emergent effects which look an awful like like a (low fidelity) simulation of intelligence. This includes simulating "thought". (And the distinction between "simulating thinking" and "thinking" is a distinction without a difference given enough accuracy)

I'm curious as to what "capabilities" you think the layperson is misled about, because if anything they tend to exceed layperson understanding IME. And I'm curious what mental model you have of LLMs that provides more "true understanding" of how a statistical model can generate answers that appear nowhere in its training.

[1] It also begs the question of whether there exists a clear and narrow definition of what "thinking" is that everyone can agree on. I suspect if you ask five philosophers you'll get six different answers, as the saying goes.

[2] https://www.astralcodexten.com/p/janus-simulators

zer00eyz · 1h ago

> It also begs the question of whether there exists a clear and narrow definition of what "thinking" is that everyone can agree on. I suspect if you ask five philosophers you'll get six different answers, as the saying goes.

And yet we added a hand wavy 7th to humanize a peice of technology.

trebligdivad · 2h ago

Nice! I'm running on CPU only, so it's interesting to compare - the Magistral-Small-2506_Q8_0.gguf runs at under 2 tokens/s on my 16 core, but your UD-IQ2_XXS gets about 5.5 tokens/s which is fast enough to be useful - but it does hallucinate a bit more and loop a little; but still actually pretty good for something so small.

lxe · 13h ago

Thanks for all you do!

danielhanchen · 13h ago

Thanks!

cpldcpu · 13h ago

But this is just the SFT - "distilled" model, not the one optimized with RL, right?

danielhanchen · 13h ago

Oh I think it's SFT + RL as mentioned in the paper - they said combining both is actually more performant than just RL

pu_pe · 13h ago

Benchmarks suggest this model loses to Deepseek-R1 in every one-shot comparison. Considering they were likely not even pitting it against the newer R1 version (no mention of that in the article) and at more than double the cost, this looks like the best AI company in the EU is struggling to keep up with the state-of-the-art.

hmottestad · 9h ago

With how amazing the first R1 model was and how little compute they needed to create it, I'm really wondering how the new R1 model isn't beating o3 and 2.5 Pro on every single benchmark.

Magistral Small is only 24B and scores 70.7% on AIME2024 while the 32B distill of R1 scores 72.6%. And with majority voting @64 the Magistral Small manages 83.3%, which is better than the full R1. Since I can run a 24B model on a regular gaming GPU it's a lot more accessible than the full blown R1.

https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-...

reissbaker · 2h ago

It's not better than full R1; Mistral is using misleading benchmarks. The latest version of R1, R1-0528, is much better: 91.4% on AIME2024 pass@1. Mistral uses the original R1 release from January in their comparisons, presumably because it makes their numbers look more competitive.

That being said, it's still very impressive for a 24B.

I'm really wondering how the new R1 model isn't beating o3 and 2.5 Pro on every single benchmark.

Sidenote, but I'm pretty sure DeepSeek is focused on V4, and after that will train an R2 on top. The V3-0324 and R1-0528 releases weren't retrained from scratch, they just continued training from the previous V3/R1 checkpoints. They're nice bumps, but V4/R2 will be more significant.

Of course, OpenAI, Google, and Anthropic will have released new models by then too...

adventured · 8h ago

It's because DeepSeek was a fast copy. That was the easy part and it's why they didn't have to use so much compute to get near the top. Going well beyond o3 or 2.5 Pro is drastically more expensive than fast copy. China's cultural approach to building substantial things produces this sort of outcome regularly, you see the same approach in automobiles, planes, Internet services, industrial machinery, military, et al. Innovation is very expensive and time consuming, fast copy is more often very inexpensive and rapid. 85% good enough is often good enough, that additional 10-15% is comically expensive and difficult as you climb.

natrys · 7h ago

Not disagreeing with the overarching point but:

> That was the easy part

Is a bit hand-wavy in that it doesn't explain why it's only DeepSeek who can do this "easy" thing, but still not Meta, Mistral or anyone else really. There are many other players who have way more compute than DeepSeek (even inside China, not even considering rest of the world), and I can assure you more or less everyone trains on synthetic data/distillation from whatever bigger model they can access.

refulgentis · 4h ago

They all have. I don't hope to convince you of that, everyones use case differs. Generally, AIME / prose / code benchmarks that don't involve successive tool calls are used to hide some very dark realities.

IMHO tool calling is by far the most clearly economically valuable function for an LLM, and r1 self-admittedly just...couldn't do it.

There's a lot of puff out there that's just completely misaligned with reality, ex. Gemini 2.5 Pro is by far the worst tool caller, Gemini 2.5 Flash thinking is better, 2.5 Flash is even better. And either Llama 4 beats all Gemini 2.5s except 2.5 Flash not thinking.

I'm all for "these differences will net out in the long run", Google's at least figured out how to micro optimize for Aider edit formatting without tools. Over the last 3 months, they're up 10% on edit performance. But it's horrible UX to have these specially formatted code blocks in the middle of prose. They desperately need to clean up their absurd tool-calling system. But I've been saying that for a year now. And they don't take it seriously, at all. One of their most visible leads tweeted "hey what are the best edit formats?" and a day later is tweeting the official guide for doing edits. I'm a Xoogler and that absolutely reeks of BigCo dysfunction - someone realized a problem 2 months after release and now we have "fixed" it without training, and now that's the right way to do things. Because if it isn't, well, what would we do? Shrugs

I'm also unsure how much longer it's worth giving a pass on this stuff. Everyone is competing on agentic stuff because that's the golden goose, real automation, and that needs tools. It would be utterly unsurprising to me for Google to keep missing a pain signal on this, vis a vis Anthropic, which doubled down on it mid-2024.

As long as I'm dumping info, BFCL is not a good proxy for this quality. Think "converts prose to JSON" not "file reading and editing"

natrys · 3h ago

I don't mind the info dump, but I am struggling to connect the relevance of this to topic at hand. I mean, focusing on a single specific capability and generalising it to mean "they all have" caught up with DeepSeek all across the board (which was the original topic) is a reductive and wild take. Especially when it seems to me that this seems more because of misaligned incentive than because it's truly a hard problem.

I am not really invested in this niche topic but I will observe that, yes I agree Llama 4 is really good here. And yet it's a far worse coder, far less intelligent than DeepSeek and that's not even arguable. So no it didn't "catch up" any more than what you could say by pointing out Llama is multimodal but DeepSeek isn't. That's just talking about a different things entirely.

Regardless, I do agree BFCL is not the best measure either, the Tau-bench is more real world relevant. But end of the day, most frontier labs are not incentive aligned to care about this. Meta cares because this is something Zuck personally cares about, Llama models are actually for small businesses solving grunt automation, not for random people coding at home. People like Salesforce care (xLAM), even China had GLM before DeepSeek was a thing. DeepSeek might care so long as it looks good for coding benchmarks, but that's pretty much the extent of it.

And I suspect Google doesn't truly care because in the long run they want to build everything themselves. They already have a CodeAssist product around coding which likely uses fine-tune of their mainline Gemini models to do something even more specific to their plugin.

There is a possibility that at the frontier, models are struggling to be better in a specific and constrained way, without getting worse at other things. It's either this, or even Anthropic has gone rogue because their Aider scores are way down now from before. How does that make sense if they are supposed to be all around better at agentic stuff in tool agnostic way? Then you realise they now have Claude Coder and it just makes way more economic sense to tie yourself to that, be context inefficient to your heart's content so that you can burn tokens instead of being, you know, just generally better.

refulgentis · 3h ago

> I am struggling to connect the relevance of this

> focusing on a single specific capability and

> I am not really invested in this niche topic

Right: I definitely ceded a "but it doesn't matter to me!" argument in my comment.

I sense a little "doth protest too much", in the multiple paragraphs devoted to taking that and extending it to the underpinning of automation is "irrelevant" "single" "specific", "niche".

This would also be news to DeepSeek, who put a lot of work to launch it in the r1 update a couple weeks back.

Separately, I assure you, it would be news to anyone on the Gemini team that they don't care because they want to own everything. I passed this along via DM and got "I wish :)" in return - there's been a fire drill trying to improve it via AIDER in the short term, is my understanding.

If we ignore that, and posit there is an upper management conspiracy to suppress performance, its just getting public cover by a lower upper management rush to improve scores...I guess that's possible.

Finally, one of my favorite quotes is "when faced with a contradiction, first check your premises" - to your Q about why no one can compete with DeepSeek R1 25-01, I'd humbly suggest you may be undergeneralizing, given even tool calls are "irrelevant" and "niche" to you.

Vetch · 2h ago

I think the point remains that few have been able to catch up to OpenAI. For a while it was just Anthropic. Then Google after failing a bunch of times. So, if we relax this to LLMs not by OpenAI, Anthropic or Google, then Deepseek is really the only one that's managed to reach their quality tier (even though many others have thrown their hat into the ring). We can also get approximate glimpses into which models people use by looking at OpenRouter, sorted by Top Weekly.

In the top 10, are models by OpenAI (gpt4omini), Google (gemini flashes and pros), Anthropic (Sonnets) and Deepseeks'. Even though the company list grows shorter if we instead look at top model usage grouped by order of magnitude, it retains the same companies.

Personally, the models meeting my quality bar are: gpt 4.1, o4-mini, o3, gpt2.5pro, gemini2.5flash (not 2.0), claude sonnet, deepseek and deepseek r1 (both versions). Claude Sonnet 3.5 was the first time I found LLMs to be useful for programming work. This is not to say there are no good models by others (such as Alibaba, Meta, Mistral, Cohere, THUDM, LG, perhaps Microsoft), particularly in compute constrained scenarios, just that only Deepseek reaches the Quality tier of the big 3.

natrys · 2h ago

Interesting presumption about R1 25-01 being what's talked about, you knowledge cut-off does appear to know R1 update two weeks back was a thing, and that it even improved on function calling.

Of course you have to pretend I meant the former, otherwise "they all have" doesn't entirely make sense. Not that it made total sense before either, but if I say your definition of "they" is laughably narrow, I suspect you will go back to your google contact and confirm that nothing else really exists outside it.

Oh and do a ctrl-f on "irrelevant" please, perhaps some fact grounding is in order. There was an interesting conversation to be had about underpinning of automation somehow without intelligence (Llama 4) but who has time for that if we can have hallucination go hand in hand with forced agendas (free disclaimer to boot) and projection ("doth protest too much")? Truly unforeseeable.

refulgentis · 2m ago

I don't know what you're talking about, partially because of poor grammar ("you knowledge cut-off does appear") and "presumption" (this was front and center on their API page at r1 release, and its in the r1 update notes). I sort of stopped reading after there because I realized you might be referring to me having a "knowledge cut-off", which is bizarre and also hard to understand, and it's unlikely to be particularly interesting conversation given that and the last volley relied on lots of stuff about tool calling being, inter alia, niche.

orbital-decay · 4h ago

This terrible and vague stereotyping about "China" while having no clue about the subject should have no place on HN but somehow always creeps in and is upvoted by someone. DeepSeek is not "China", they had nobody to copy from, they released their first 7B reasoning model back in April 2024, it was ahead of then-SotA models in math and validated their approach. They did a ton of new things besides training a reasoning model, and likely have more to come, as they have a completely different background than most AI companies. It's more of a cross-pollination of different areas of expertise.

MaxPock · 7h ago

I understand that the French are very innovative so why isn't their model SOTA ?

melicerte · 13h ago

If you look at Mistral investors[0], you will quickly understand that Mistral is far from being European. My understanding is it is mainly owned by US companies with a few other companies from EU and other places in the world.

[0] https://tracxn.com/d/companies/mistral-ai/__SLZq7rzxLYqqA97j... (edited for typo)

pdabbadabba · 12h ago

For the purposes of GP's comment, I think the nationalities of the people actually running the company and doing the work are more relevant than who has invested.

derektank · 12h ago

And, perhaps most relevantly, the regulatory environment the people are working in. French people working in America are probably more productive than French people working in France (if for no other reason because they probably work more hours in America than France).

8n4vidtmkvmk · 12h ago

Are we sure more time butt in office equates to more productivity?

FabHK · 32m ago

> Are we sure more time butt in office equates to more productivity?

Typically more output, but less productivity (= output/time).

meta_ai_x · 11h ago

Yes, especially in cutting edge research areas where other high functioning people with high energy isarelso there.

You can write your in-house CRUD app in your basement or your office and it doesn't matter.

The vast majority of HN crowd and general social/mainstream media don't make the difference between these two scenarios

1propionyl · 11h ago

Yes, specifically when it comes to open-ended research or development, collocation is non-negotiable. There are greater than linear benefits in creativity of approach, agility in adapting to new intermediate discoveries, etc that you get by putting a number of talented people who get along in the same space who form a community of practice.

Remote work and flattening communication down to what digital media (Slack, Zoom, etc) afford strangle the beneficial network effects.

throwaway0123_5 · 10h ago

I think they were talking about total time spent working rather than remote vs. in-person. I've seen more than a few studies over the years showing that going from 40 to 35 or 30 hours/wk has minimal or positive impacts on productivity. Idk if that would apply to all work environments though, and I don't recall any of the studies being about research productivity specifically.

hdjrudni · 2h ago

> I think they were talking about total time spent working rather than remote vs. in-person.

I was, yes. I should have omitted the "in office" part but I was referencing the "work more hours in America than France"

distortionfield · 9h ago

You’re being downvoted but you’re right. The number of people who act like a web cam reproduces the in person experience perfectly, for good and bad, is hilarious to me.

alienbaby · 7h ago

I think the mistake people make is believing that one approach is best for all. Diffferent people work most effectively in different ways.

adventured · 8h ago

$89,000 GDP per capita vs $46,000 rather proves the point about productivity per butt. US office workers are extraordinarily productive in terms of what their work generates (thanks to numerous well understood things like the outsized US scaling abilities). Measuring beyond that is very difficult due to the variance of every business.

ath92 · 1h ago

Weird take. Norway has about the same gdp per capita as the USA with stricter regulations than France. Ireland’s GDP per capita is higher than that of the USA, with less bureaucracy than France but more than the US. Not to mention that all of these are before adjusting for PPP. Almost as if GDP per capita is not a good measurement of productivity.

FabHK · 27m ago

Many wrinkles here.

First, one should probably look at GNP (or even GNI) rather than GDP to reduce the distortionary impact of foreign direct investment, company headquarters for tax reasons, etc.

Next, need to distinguish between market rate and PPP, as you highlight.

Lastly, these are all measures of output (per capita), while productivity is output per input, in this context output per hour worked. There the differences are less pronounced.

cataphract · 6h ago

A part of that figure is an artifact of how strong the dollar is though.

palata · 5h ago

> $89,000 GDP per capita vs $46,000 rather proves the point about productivity per butt.

So if I work 24h/day in a farm in Afghanistan, I should earn more than software developers in the Silicon Valley (because I'm pretty sure that they sleep)? Is that how you say GDP works?

numpad0 · 9h ago

I think maybe we should completely switch to admitting this. Every extra second you sit in the (home)office adds to productivity, just not necessarily converting into market values, that can be inflated with hype. Also longer hours is not necessarily safe or sustainable.

We only wish more time != more productivity because it's inconvenient in multiple ways if it were. We imagine a multiplier in there to balance the equation, such factor that can completely negate production, using mere anecdotal experiences as proofs.

Maybe that's not scientific, maybe time spent very closely match productivity, and maybe production as well as productivity need external, artificial regulations.

mschild · 7h ago

> Every extra second you sit in the (home)office adds to productivity

I'm not sure I believe that. I think at some point the additional hours worked will ultimately decrease the output/unit of time and at some point that you'll reach a peak whereafter every hour worked extra will lead to an overall productivity loss.

Its also something that I think is extremely hard to consistently measure, especially for your typical office worker.

chairmansteve · 10h ago

Spoken like a guy who's never been to France.

Classic drive by internet trope.

Maybe try a little harder, have an informed opinion about something.

whiplash451 · 11h ago

> they probably work more hours in America than France

Not sure that's even true. Mistral is known to be a really hard-working place

gwervc · 11h ago

I'm pretty sure there is way less regulations in the US in respect to France where going over the legal 35h/week requires additional capital and legal paperwork.

Saline9515 · 7h ago

In France most white collar jobs are categorized as "management" ("cadre"), and they have no time limit. It is very common for workers to clock 12h days in consultancies (10am-10pm) and in state administrations, for instance.

algoghostf · 9h ago

This is not true. Government workers or factory workers can limit to 35h (with some salary loss or days off loss), but else than that (especially in tech) it is very competitive and working 50 hours+/week is not exceptionl.

kgwgk · 9h ago

> 50 hours+/week is not exceptionl.

https://www.legifrance.gouv.fr/codes/article_lc/LEGIARTI0000...

Au cours d'une même semaine, la durée maximale hebdomadaire de travail est de quarante-huit heures.

https://www.legifrance.gouv.fr/codes/article_lc/LEGIARTI0000...

La durée hebdomadaire de travail calculée sur une période quelconque de douze semaines consécutives ne peut dépasser quarante-quatre heures, sauf dans les cas prévus aux articles L. 3121-23 à L. 3121-25.

Saline9515 · 7h ago

Everyone is "forfait cadre", which allow them to work with no practical time limit since they don't log their time spent at work. https://www.service-public.fr/particuliers/vosdroits/F19261

kgwgk · 5h ago

It seems that 20% of employees in the private sector are "cadres" and half of them are on "forfait jours". That makes around 10% of the private sector employees working 218 days per year without the 48/44 weekly hour limits. It's more than I thought but I doubt that many of them work more than 10 hours per day. Whether that's "exceptional" or not is a matter of definition, of course.

greenavocado · 8h ago

In the USA most software engineers are FLSA-exempt ("computer employee" exemption).

No overtime pay regardless of hours worked.

No legal maximum hours per day/week.

No mandatory rest periods/breaks (federally).

The US approach places the burden on the individual employee to negotiate protections or prove misclassification, while French law places the burden on the employer to comply with strict, state-enforced standards.

The French Labor Code (Code du travail) applies to virtually all employees in France, regardless of sector (private tech company, government agency, non-profit, etc.), unless explicitly exempted. Software engineering is not an exempted profession. Maximum hour limits are absolute. The caps of 44 hours per week, 48 hours average over 12 weeks, and 10/12 hours per day are legal maximums for almost all employees. Tech companies cannot simply ignore them. The requirements for employee consent, strict annual limits (usually max 220 hours/year), premium pay (+25%/+50%), and compensatory rest apply to software engineers just like any other employee.

"Cadre" Status is not an exemption. Many software engineers are classified as Cadres (managers/professionals) but this status does not automatically exempt them from working time rules.

Cadre au forfait jours (Days-Based Framework): This is common for senior engineers/managers. They are exempt from tracking daily/weekly hours but must still have a maximum of 218 work days per year (including weekends, holidays, and RTT days). Their annual workload must not endanger their health. 80-hour weeks would obliterate this rest requirement and pose severe health risks, making it illegal. Employers must monitor their workload and health.

Cadre au forfait heures (Hours-Based Framework) or Non-Cadre: These employees are fully subject to the standard daily/weekly/hourly limits and overtime rules. 80+ hours/week is blatantly illegal.

The tech industry, especially gaming/startups, sometimes tries to import unsustainable "crunch" cultures. This is illegal in France.

EDIT: Fixed work days

Saline9515 · 7h ago

Some State services, such as the "Trésor", which oversees French economic policies, do not respect this at all, and require 12h work days most of the year. The churn is enormous, workers staying there less than a year on average.

kgwgk · 8h ago

> 218 rest days per year (including weekends, holidays, and RTT days)

Wouldn’t that be nice, 218 rest days? It’s 218 working days.

retinaros · 10h ago

No one works 35hours in software jobs in france except maybe government. Overtime is also not compensated (they give some days off that is it.)

psalaun · 8h ago

Even in government; I've worked 50+ hours weeks working for the healthcare branch of the providence state, with a classic 39h/w contract. No compensation of any sort, despite having timesheets.

There are a lot of myths about French worker. Our lifelong worked hours is not exceptional; our productivity is also not exceptional.

greenavocado · 8h ago

Pointless suffering. Report violations to the CSE, Médecin du Travail, and Inspection du Travail.

psalaun · 8h ago

It was a choice, I loved my job there. I had more exciting projects than most of my friends in the private sector!

Saline9515 · 7h ago

Excellent way to get blacklisted and never work for the State again if you're a contractor, or end up in a low impact, boring job if you're a career worker.

epolanski · 9h ago

This is beyond ignorant and completely clueless.

People in startups and hard research work extremely hard everywhere, and Mistral is even so more notorious for being a tough place to survive.

You think that European founders and researchers are like "nah, you know what, we're European, we're not ambitious, we don't want to make money, to hell with equity"?

Also, just to point out, I've worked in research, and I can tell you 100% that I've never ever seen anybody more dedicated and hardworking than people from China/South Korea and Japan. I'm talking sleeping bags in the office kind of people.

And yet, that just does not translate in better results. More results, which is important too sometimes, yes, better, more relevant, higher quality? No no and no.

vasco · 12h ago

Most measures of productivity have "hours worked" in the denominator so that can't be right.

underdeserver · 12h ago

If I work 1000 hours and you work 2000 hours in the same timeframe, but you outcompeted me and created 3x value, you are 1.5 times more productive.

There's a numerator too.

vasco · 9h ago

How does the same exact person get more productive? You forgot the example I replied to? The only thing that changed were hours worked. In your example you change it to less hours worked with more output. You made it circular.

underdeserver · 7h ago

You can be more productive just because you're faster.

Magistral is amazingly impressive compared to ChatGPT 3.5. If it had come out two years ago we'd be saying Mistral is the clear leader. But it came out now.

Not saying they worked fewer hours, just that speed matters, and in some cases, up to a limit, working more hours gets your work done faster.

retinaros · 10h ago

Most of french people in engineering jobs in France are working late even tho overtime is never paid.

Disposal8433 · 10h ago

In the USA they have the famous 9 to 5. Most developers' jobs in France are "9 to 6 with 2 hours to eat in the middle and unpaid overtime," so I would say both countries are equivalent.

psalaun · 8h ago

In parisian startups it's more 9 to 7 with 30 min lunch breaks.

kergonath · 5h ago

It’s a French company, subject to French laws and European regulations. That’s what matters, from a user point of view.

epolanski · 9h ago

Jm2c but I feel conflicted about this arms race.

You can be 6/12 months later, and have not burned tens of billions compared to the best in class, I see it an engineering win.

I absolutely understand those that say "yeah, but customers will only use the best", I see it, but is market share of forever money losing businesses that valuable?

louiskottmann · 8h ago

Indeed, and with the technology plateau-ing, being 6-12 months late with less debt is just long term thinking.

Also, Europe being in the race is a big deal for consumers.

ACCount36 · 8h ago

>with the technology plateau-ing

People were claiming that since year 2022. Where's the plateau?

asadotzler · 5h ago

The pre-training plateau is real. Nearly all the improvements since then have been around fine tuning and reinforcement learning, which can only get you so far. Without continued scaling in the base models, the hope of AGI is dead. You cannot reach AGI without making the pre-training model itself a whole lot better, with more or better data, both of which are in short supply.

ethbr1 · 4h ago

> with more or better data, both of which are in short supply

Hmmm. It's almost as if a company without a user data stream like OpenAI would be driven to release an end-user device for the sole purpose of capturing more training data...

sisve · 7h ago

Being the best European AI company is also a multi billion business. Its not like China or the US respects GDPR. A lot of companies will choose the best European company.

adventured · 8h ago

Why would the debt matter when you have $60 billion in ad revenue and are generating $20 billion in op income? That's OpenAI 5-7 years from now, if they're able to maintain their position with consumers. Once they attach an ad product their margins will rapidly soar due to the comparatively low cost of the ad segment.

The technology is closer to a decade from seeing a plateau for the large general models. GPT o3 is significantly beyond o1 (much less 3.5 which was just Nov 2022). Claude 4 is significantly beyond 3.5. They're not subtle improvements. And most likely there will be a splintering of specialization that will see huge leaps outside the large general models. The radical leap in coding capabilities over the past 12-18 months is just an early example of how that will work, and it will affect every segment of human endeavour.

aDyslecticCrow · 8h ago

> Once they attach an ad product their margins will rapidly soar due to the comparatively low cost of the ad segment.

They're burning through computers and capital. No amount of advertising could cover the cost of training or even running these models. The massive subscription costs we've started seeing are just a small glimpse into the money they are burning through.

They will NOT make a profit using the current methods unless the models become at least 10 times more efficient than they are now. At which point can Europe adapt to the innovation without much cost.

It's an arms race to see who can burn the most money the fastest, while selling the result for as little as possible. When they need to start making money, it will all come crashing down.

adventured · 8h ago

A similar sentiment existed for a long time about Uber and now they're very profitable and own their market. It was worth the burn to capture the market. Who says OpenAI can't roll over to profitable at a stable scale? Conquer the market, hike the price to $29.95 (family account, no ads; $19.95 individual account with ads; etc etc). To say nothing of how they can branch out in terms of being the interaction point that replaces the search box. The advertising value of owning the land that OpenAI is taking is well over $100 billion in annual revenue. Amazon's retail business is terrible, their ad business is fantastic. As OpenAI bolts on an ad product their margin potential will skyrocket and the cost side will be modest in comparison.

Over the coming years it won't be possible to stay a mere 6-12 months behind as the costs to build and maintain the AI super-infrastructure keeps climbing. It'll become a guaranteed implosion scenario. Winning will provide the ongoing immense resources needed to keep pushing up the hill forever. Everybody else - except a few - will fall away. The same outcome took place in search. Anybody spot Lycos, Excite, Hotbot, AltaVista around? It costs an enormous amount of money to try to keep up with Google (Bing, Baidu, Yandex) in search and scale it. This will be an even more brutal example of that, as the costs are even higher to scale.

The only way Mistral survives is if they're heavily subsidized directly by European states.

aDyslecticCrow · 7h ago

> It was worth the burn to capture the market.

You cannot compare Uber to the AI market. They are too different. Uber captured the market because having three taxi services is annoying. But people are readily jumping between models using multi-model platforms. And nobody is significantly ahead of the pack. There is nothing that sets anyone apart aside from the rate at which they are burning capital. Any advantage is closed within a year.

If OpenAI wants to make a profit, it will raise prices and be dropped at a heartbeat for the next cheapest option. Most software stacks are designed to be model-agnostic, making integration or support a non-factor.

whiplash451 · 7h ago

Three cab apps are a lot less annoying than three LLM apps each having their piece of your chats history.

The winner-take-all effect is a lot stronger with chat apps.

snoman · 5h ago

That’s the exact opposite of the way it is right now (at least for me). I don’t like having multiple ride hailing apps but easily have ChatGPT, Claude, Gemini on my phone (and local LLM at home). There is zero effort cost to go from one to the other.

jasonthorsness · 13h ago

Even if it isn't as capable, having a model with control over training is probably strategically important for every major region of the world. But it could only fall so far behind before it effectively doesn't work in the eyes of the users.

funnym0nk3y · 13h ago

Thought so too. I don't know how it could be different though. They are competing against behemoths like OpenAI or Google, but have only 200 people. Even Anthropic has over 1000 people. DeepSeek has less than 200 people so the comparison seems fair.

rsanek · 13h ago

any claim from the deepseek folks should be considered with wide margins of error.

humpty-d · 11h ago

I know we distrust them on account of being nefarious Chinese, but has anything come to light with R1 or the people behind it specifically to justify this?

mwigdahl · 4h ago

"Deepseek only cost $6 million"?

wafngar · 6h ago

But they have built a fully “independent” pipeline. Deepseek and others probably trained in gpt4, o1 or whatever data.

segmondy · 12h ago

are you really going to compare a 24B model to a 700B+ model?

a2128 · 11h ago

24B is the size of the Small opensourced model. The Medium model is bigger (they don't seem to disclose its size) and still gets beaten by Deepseek R1

thot_experiment · 11h ago

Mistral Large is 123b so one can probably assume that medium is between 24b and 123b, also Mistral 3.1 is by a wide margin my go-to model in real life situations. Benchmarks absolutely don't tell the whole story, and different models have different use cases.

ohso4 · 6h ago

It's a 70b model, Medium 2 was 70b.

https://xcancel.com/arthurmensch/status/1920136871461433620#...

Ringz · 10h ago

Can you please explain what your „real life situations“ are?

thot_experiment · 8h ago

I use it as a personal assistant (so tool use integrated into calendar/todo/notes etc) often times using the multimodal aspect (taking a photo of a todo list, asking it to remind me to buy something from a picture). I also use it as a code completion tool in vscode, as well as a replacement for most basic google searches ("how does this syntax work", "what's the torch method for X")

I use it for almost every interaction I have with AI that isn't asking it to oneshot complex code. I fairly frequently run my prompts against Claude/ChatGPT and Mistral 3.1 and find that for most things they're not meaningfully different.

I also spend a lot of time playing around with it for storytelling/integration into narrative games.

mandelken · 7h ago

Cool. What framework or program do you use to orchestrate this?

thot_experiment · 5h ago

Me, Mistral and Claude writing modules on top of a homebrew assistant framework in node with a web frontend. I started out mostly handwriting the first couple modules and the framework for it. (todo and a time tracker) and now the AI is getting pretty good at replicating the patterns I like using, esp with some prompt engineering as long as I don't ask for entire architectures but just prod it along. It's just so easy to make the exact thing you want now. All the heavy lifting is done by ollama and the node/browser APIs.

The only dependency on the node side is 'mime' which is just a dict of mime types, data lives inside node's new `node:sqlite` everything on the front side that isn't just vanilla is alpine. It runs on my main desktop and has filesystem access (which doesn't yet do anything useful really) but the advantage here is that since I've written (well at least read) all of the code I can put a very high level of trust into my interactions.

Rastonbury · 3h ago

Did you hook up any search tools?

moffkalast · 10h ago

The most important company is to is to QwQ at 30B sjnce it's still the best local reasoning model for that size. A comparison that Mistral did not run for some reason, not even with Qwen3.

tootie · 13h ago

As an occasional user of Mistral, I find their model to give generally excellent results and pretty quickly. I think a lot of teams are now overly focused on winning the benchmarks while producing worse real results.

esafak · 12h ago

If so we need to fix the benchmarks.

paulddraper · 11h ago

https://en.wikipedia.org/wiki/Goodhart%27s_law

tootie · 8h ago

I think there's a fundamental limit to benchmarks when it comes to real-world utility. The best option would be more like a user survey.

esafak · 7h ago

That's Chatbot Arena: https://lmarena.ai/leaderboard

jug · 4h ago

And unfortunately revealed to be largely a vibe check these days with that whole Llama 4 debacle. But why should we be surprised, really, when users have an easier time feeling if the replies sound human and conversational and _appear_ knowledgeable than actually outsmarting them. This Arena worked well in the ChatGPT 3.0 days… But now?

riku_iki · 11h ago

those who try to fix them are fighting alone against huge corps which try to abuse them..

littlestymaar · 13h ago

> Benchmarks suggest this model loses to Deepseek-R1 in every one-shot comparison.

That's not particularly surprising though as the Medium variant is likely close to ten times smaller than DeepSeek-R1 (granted it's a dense model and not an MoE, but still).

mrtksn · 13h ago

Europe isn't going to catch up in tech as long as its market is open to US tech giants. Tech doesn't have marginal costs, so you want to have one of it in one place and sell it everywhere and when the infra and talent is already in US, EU tech is destined to do niche products.

UK has a bit of it, France has some and that's it. The only viable alternatives are countries who have issues with US and that is China and Russia. China have come up with strong competitors and it is on cutting edge.

Also, it doesn't have anything to do with regulations. 50 US States have the American regulations, its all happening in 1 and some other states happen to host some infrastructure but that's true for rest of of the world too.

If the EU/US relationship gets to Trump/Musk level, then EU can have the cutting edge stuff.

Most influential AI researchers are from Europe(inc. UK), Israel and Canada anyway. Ilya Sutskever just the other day gave speech at his alma matter @Canada for example. Andrej Karpathy is Slovakian. Lot's of Brits, French, Polish, Chinese, German etc. are among the pioneers. Significant portion of the talent is non-American already, they just need a reason to be somewhere else than US to have it outside the US. Chinese got their reason and with the state of the affairs in the world I wouldn't be surprised if Europeans gets theirs in less than 3 and a half years.

vikramkr · 13h ago

If you close off the market to US tech giants, maybe they'll have some amount of market dominance at home, but I would doubt that would mean they've "caught up" tech wise. There would be no incentive to compete. American EV manufacturing is pretty far behind Chinese EV manufacturing, protectionism didn't help make a competitive car, it just protected the home market while slowly ceding international market after international market

mrtksn · 13h ago

I agree, protectionism is bad most of the time but it has its place. It is bad when you are ahead, it is useful when you are behind(You want them to be exposed to the cutting edge market but before that you want them to be able to exist in first place even if they are not the best at this very moment).

China's EV dominance is a result of local governments investing and buying from local businesses.

It would be the same with Russia&China. They will receive money from the governments and will sell to local buyers and will aim to expand to foreign markets.

As I said, most AI talent is not American but it is concentrated there. Give them a reason to be somewhere else, some will be somewhere else.

foolswisdom · 10h ago

The solution to that would be to force companies within the EU market to compete with each other (fair competition laws), just that idea is less popular than the first winner in a market ensuring they stay dominant (because it serves the interest of those who just got power). Same reason why big tech rules EU in the first place.

chairmansteve · 11h ago

China is an example of protectionism working. The world is not governed by simple rules.

saubeidl · 13h ago

As a counterexample, China's tech industry has caught up and in some ways surpassed the US, partially due to being closed off.

mitthrowaway2 · 12h ago

I think there's a few more important reasons beyond being closed off:

- Regulatory friendliness (eg. DJI)

- Non-enforcement of foreign patents (eg. LiFePO4 batteries)

- Technology transfer through partnerships with domestic firms

- Government support for industries deemed to be in the national interest

hshdhdhj4444 · 12h ago

But also due to the U.S. driving away smart people from the U.S. to China.

csomar · 12h ago

> As a counterexample, China's tech industry has caught up and in some ways surpassed the US, partially due to being closed off.

How did you come up to that conclusion? We don't have access to an alternate universe where the Chinese tech market was open. There is a real possibility that it would have been far ahead had it been open.

yorwba · 10h ago

We do have access to records from the before times when the internet was wide open and Facebook, Google and Microsoft were big in China. Well, Microsoft is still big because they're not an internet company and unfazed by censorship, but the exit of Google and Facebook took a lot of pressure off Baidu and the entire Chinese social media ecosystem.

littlestymaar · 13h ago

> There would be no incentive to compete.

Why not ? First of all there would be plenty of incentives for EU companies to compete with one another (and plenty of capital flowing to them as the European market is big enough), then there would be competition with US actors in the rest of the world. That's exactly how the Asian economic model has been built: Japan, Taiwan, South Korea all have used protectionism + export-based subsidies to create market leaders in all kind of domains (from car manufacturing to electronics and shipbuilding).

ascorbic · 11h ago

It's mostly about money. DeepMind was founded in the UK, and is still based in London, but there was no way it could get the funding it needed without selling to Google or some other US company. China is one of the few other countries that can afford to fund that kind of thing.

Iulioh · 12h ago

The problem is, CONSUMER level tech

The EU is doing a lot of enterprise level shit and it's great

The biggest company in Europe sells B2B software (SAP)

mrtksn · 12h ago

One swallow does not make a summer, all the major platforms are American and that's where Europe lags. I agree that Europe does have some great tech but they are all niche. Europe also have some great consumer tech products but they are all dependent on American platforms. For example some of the best games are French, Polish, Bulgarian, Ukrainian etc. but they all depend on Steam or Apple App Store and have to go by their rules and pay them a significant commission.

PeterStuer · 7h ago

SAP sells B2B software, but most of their income is from consultancy and training.

csomar · 12h ago

That's a single company and I'd not call that great.

simianwords · 10h ago

How can you explain Israel?

iwontberude · 13h ago

Which Trump/Musk level? There have been so many.

fiatjaf · 13h ago

This reads like an AI-generated comment. What do you mean by "benchmarks suggest"? The benchmarks are very clear and presented right there in the page.

atemerev · 13h ago

"EU is leading in regulation", they say.

I don't know what they are thinking.

cpldcpu · 13h ago

Sorry, this is just getting old...

Its a trite talking point and not the reason why there are so few consumer-AI companies in Europe.

atemerev · 13h ago

And what would be the reason? I am genuinely interested. Also, are there viable not "consumer" AI companies here? Only Mistral seems to train foundation models, and good for them, however, as of now they are absolutely not SOTA.

baq · 13h ago

Money.

No, really - EU doesn't have the VCs and the megacorps. People laugh at EU sponsoring projects, but there is no private money to sponsor them. There are plenty of US companies with sites in the EU though, so you have people working the problems, but no branding.

SV_BubbleTime · 12h ago

Ok, just a quick question… why does Europe not have the money actual/people?

hshdhdhj4444 · 12h ago

Part of the answer is debt.

The U.S. has a debt of 35Tn. The entire EU around 16Tn.

If even 10% of the debt difference was invested in tech that would have meant about $2tn more in investment in EU tech.

bobxmax · 10h ago

Because Europeans don't take smart risks. Because they over regulate.

It's fascinating watching people circle back to this answer.

Regulation and taxation reduces incentives. Lower incentives, means lower risk-taking.

The fact this is still a lesson that needs to be debated is absurd.

baq · 10h ago

Europeans also mostly don’t suffer from school shootings and generally don’t go bankrupt when they get cancer or just take an ambulance ride to a non-network hospital. Regulation is not all bad, besides the US has more of it than anybody else.

TulliusCicero · 7h ago

The mental gymnastics here are incredible. Do you really think the regulations inhibiting tech startup creation are the same ones that protect people when they get cancer or whatever?

Yes, the US has a lot of school shootings, but does anyone think loose gun regulations are why the US is strong on tech?

bobxmax · 5h ago

Any time European economic failings are brought up it's always the same thing. "Well at least no school shootings!"

Great, Singapore has less school shootings and homeless people than anywhere in Europe by a country mile and has a soaring economy.

FabHK · 22m ago

Eh, Singapore's efforts to nurture a thriving startup scene are met with middling success at most.

camjw · 8h ago

I would love to know what you do for a living and whether you personally have taken any smart risks that have lead you to financial success, or whether you just like sniping on HN about school shootings and pretending to be superior.

bobxmax · 5h ago

Lol skipped right to the ad hominem this time huh?

Europeans defending their economy is like republicans defending gun laws... like watching a chicken run around in circles.

stefan_ · 7h ago

Thats hardly unique to Europeans. Look at UAV regulations in the US - regulated to death based on nothing, leading to a 5 to 10 year technology gap to China, while recreational pilots crash and burn every other week.

atemerev · 10h ago

The amount of debt you are allowed to take and the abundance of money to invest in new projects are in direct proportion to the competitiveness of the jurisdiction, i.e. business-friendly environment.

EU is not a business-friendly environment.

fmbb · 12h ago

Quick questions don’t always have quick answers.

Moneywise, the US does have the good old Exorbitant Privilege to lean on.

kilpikaarna · 11h ago

Most recently, due to ordoliberalism and coat-according-to-cloth morality guiding economic policy rather than money printer go brrr.

Longer term: cultural and language divisions despite attempts at creating a common market, not running the global reserve currency/military hegemony, social democracies encouraging work-life balance over cutthroat careerism, demographic issues, not getting a boost from being the only consumer economy not to be leveled in WW2, etc.

PeterStuer · 7h ago

Unlike the US, the EU does not have reserve currency privilige, so we can't print enless trillions of paper and force the rest of the world to give us their companies and goods in return for it.

baq · 12h ago

edit: the parent has since edited out the flamebait.

Maybe, or maybe when silicon valley was busy growing exponentially Europe was still picking itself up from the mess of ww2.

Trying to blame a single reason is futile, naive and childish.

oceanplexian · 11h ago

The US was out-innovating Europe a long time before WW2, we had faster, more extensive rail systems, superior high rise construction, earlier to electrification, invention of the telephone, modern manufacturing (Model T), invention of the airplane, the birth of Hollywood and modern motion pictures, the list goes on.

msgodel · 10h ago

I think it's funny how the US, Canada, and Scotland/the UK all simultaneously claim to be the home of the telephone.

bobxmax · 10h ago

And what's the excuse for Euro's GDP being equal to the US in 2007, and now being over $10T less?

baq · 10h ago

In general, the same. In particular, different.

whodidntante · 5h ago

I can only talk about my personal (US based) experience. This includes many US based startups,some VC startups, senior leadership in a large tech company, and a senior executive position in another large tech company. I have also worked with, and built, tech organizations in multiple EU countries, and have been involved in the technical due diligence and acquisition discussions with several EU companies. I admit that my experience is about 6 years old, as I am no longer in the tech industry, and I do not know what has changed during this time.

Money: There is more money for US startups. Investors (US and EU) want to invest in US based startups, not EU startups. US investors are willing to risk more money and take greater risk. EU startups that gain traction will attract US companies in that they provide a good way to extend their market to the EU, not as much for their innovations. Tech entrepreneurs (US or EU) want to work in the US if they can, because that is where the excitement and risk taking is and where the money can be made.

Teams: Building and managing EU tech teams is very different than US tech teams. EU teams need a lot more emotional hand holding, and EU engineers are far more salary oriented than equity oriented. It is far more difficult to motivate them to go above and beyond - the "we need to get this fix or feature in tonight so we can deploy n the morning" simply will not get done if it is already 5pm. Firing EU workers is much more difficult. There are a lot more regulations for EU teams, in order to "protect" them, and that results in the teams being more "lifestyle" teams rather than "innovation teams". EU teams get paid a lot less than their US counterparts.

Failure: Good failure is not a problem in the US, it can actually be a badge of honor. EU is very risk averse, and people avoid failure.

There are of course exceptions all around, but the weight of these observations and experiences are in favor of US teams.

This is in no way saying it is better to live in the US, there are a lot of things about the EU that are more attractive than the US, and I would probably have a better lifestyle living in Europe now that I am no longer working. But innovation and money is not one of them.

dmos62 · 13h ago

It is fairly common to struggle to understand why different cultures think the way they do.

moralestapia · 13h ago

Ugh.

Edit: Parent changed their comment significantly, from something quite unpleasant to what it is now. I'm not deleting my comment as I'm not that kind of person.

dmos62 · 12h ago

I did. I initially said that Europeans often struggle to understand other cultures too. Which was an immature way to point out that the cultural dissonance works both ways. I realized that I was obfuscating my point and rewrote my comment to be clearer, but now that you gave me a chance to think on it some more, I wish I would have said what I wanted to say more directly still.

What I wanted to say is: I like EU's regulation and I find it interesting how other people have different world views.

atemerev · 13h ago

I live in Europe.

mrtksn · 13h ago

Cool, which regulations exactly stopped you from doing cutting edge AI?

meta_ai_x · 11h ago

regulation-culture breed a certain type of risk-taking culture. So, you can't blame a specific regulation for lack of innovation culture

mrtksn · 10h ago

Im not sure about that, Europe has plenty of starups. Also, IIRC it has larger number of small businesses than US as in US huge companies employ huge numbers of people.

What Europe does not have is scale ups in tech. The tech consolidated in US. By tech I mean internet based companies. Remove those and EU has higher productivity.

kelseyfrog · 13h ago

Décret sur la Pause Goûter Universelle (PGU).

philjohn · 12h ago

Is that the regulation that says you need to allow someone to take a 20 minute break after 6 hours of work?

FabHK · 18m ago

Chinese Employers to Grant 15 Minute Maternity Break

https://theonion.com/chinese-employers-to-grant-15-minute-ma...

Mistletoe · 12h ago

This is why I want to move to the EU. I don’t care if companies aren’t coddled there. I want to live where people are the first priority.

atemerev · 10h ago

Well, are you ready to live on a low middle class salary of a European software engineer? It is really low middle class. The middle middle here would be a bank clerk, and upper middle — a lawyer or a surgeon.

This is not coincidental.

baq · 10h ago

Incidentally (also not) surgeons and lawyers are not poor in the states either… it’s just Silicon Valley was the perfect place with just the right people and it kept growing for 60 years straight. Surgery and law do not grow exponentially. (I’ll pretend the pages of regulation aren’t supposed to count.)

micromacrofoot · 13h ago

probably some silly thing like "people should have more rights and protections"

bobxmax · 10h ago

Rights and protections that have benefited heavily from an economy built on the alliance with the US.

If it weren't for American help and trade post-WW2, Europe would be a Belarusian backwater and is fast heading back in that direction.

Countries like Greece, Italy, Spain, Portugal, etc. show the future of Europe as it slowly stagnates and becomes a museum that can't feed it's people.

Even Germany that was once excelling is now collapsing economically.

The only bright spot on the continent right now is Poland who are, shocker, much less regulatorily strict and have lower corporate taxes.

debugnik · 7h ago

> Countries like Greece, Italy, Spain, Portugal

PIGS, really? Some of the top growing EU economies right now, which have turned their deficit around, show the future of a slowly stagnating Europe?

bobxmax · 5h ago

A 200B economy growing 2% is the future of the EU? Yes that is the point I am making.

__alexs · 13h ago

EU regulation is often "you can not have the cool thing" not "the cool thing must be operated equitably".

I think they are more interested in protecting old money than in protecting people.

andruby · 12h ago

EU never just states "you can not have the cool thing". Please provide an example if you disagree.

It is very hard to create policies and legislation that protects consumers, workers and privacy while also giving enough liberties for innovation. These are difficult but important trade-offs.

I'm glad there is diversity in cultures and values between the US, EU and Asia.

micromacrofoot · 13h ago

I think usb-c and third party app stores are pretty cool

umbra07 · 11h ago

I think the government shouldn't be legislating that companies must use a specific USB connector.

Realistically the legislation was only targeting Apple. If consumers want USB-C, then they can vote with their wallets and buy an Android, which is a reasonable alternative.

msgodel · 10h ago

They shouldn't be forcing people to use patented Qualcomm technology to access cellular networks either but here we are.

Realistically Apple's connector adds no value and if they want to sell into markets like the EU they need to cut that kind of thing out.

umbra07 · 9h ago

> Realistically Apple's connector adds no value

Like I said, usb-c is a regression from lightning in multiple ways.

* Lightning is easier to plug in.

* Lightning is a physically smaller connector.

* USB-C is a much more mechanically complex port. Instead of a boss in a slot, you have a boss with a slot plugging into a slot in a boss.

There was so much buzz around Apple no longer including a wall wort with its phones, which meant an added cost for the consumer, and potentially an increased environmental impact if enough people were going to say, order a wall wort online and shipped to them. The same logic applies to Apple forced to switch to USB, except that the costs are now multiplied.

fkyoureadthedoc · 8h ago

Having owned both lighting and USB-C iPhones/iPads, I prefer the USB-C experience, but neither were that bad.

My personal biggest gripe with lightning was that the spring contacts were in the port instead of the cable, and when they wore out you had to replace the phone instead of the cable. The lightning port was not replaceable. In practice I may end up breaking more USB-C ports, we'll see.

micromacrofoot · 8h ago

I've worked with thousands of both types of cable at this point

> Lightning is easier to plug in.

according to you? neither are at all difficult

> Lightning is a physically smaller connector.

I've had lightning cables physically disassemble in the port, the size also made them somewhat delicate

> USB-C is a much more mechanically complex port.

much is a bit well, much... they're both incredibly simple mechanically — the exposed contacts made lightning more prone to damage

I've had multiple Apple devices fail because of port wear on the device. Haven't encountered this yet with usb-c

> The same logic applies to Apple forced to switch to USB, except that the costs are now multiplied.

Apple would have updated inevitably, as they did in the past — now at least they're on a standard... the long-term waste reduction is very likely worth the switch (because again, without the standard they'd have likely switched to another proprietary implementation)

flmontpetit · 10h ago

It's hard to see the benefit in letting every hardware manufacturer attempt to carve out their own little artificial interconnect monopoly and flood the market with redundant, wasteful solutions.

micromacrofoot · 10h ago

We've had multiple USB standards for decades with no end in sight. Apple was targeted because they have the most high-profile proprietary connector and they were generally using it to screw consumers. Good riddance.

umbra07 · 10h ago

Like I said, if consumers don't want it, then they can buy Android phones instead.

> they were generally using it to screw consumers

You understand that there were lots of people happy with Lightning? USB-C is a regression in many ways.

boroboro4 · 10h ago

I want to have USB-C and I want to have iPhone.

I’m very happy EU regulators took this headache off my shoulders and I don’t need to keep multiple chargers at home, and can be almost certain I can find a charger in restaurant if I need it.

Based on the reaction of my friends 90% of people supported this change and were very enthusiastic about it.

I have zero interest in being part of vendor game to lock me in.

umbra07 · 9h ago

Products are supposed to come with different tradeoffs. I want to have an Android and I want to have my headphone jack back. That doesn't mean that the EU should make that a law.

> Based on the reaction of my friends 90% of people supported this change and were very enthusiastic about it.

That is an absolutely worthless metric, and you know it.

Aeolos · 9h ago

It's about as useful as your complaining.

Good riddance for Lightning.

micromacrofoot · 9h ago

Why bother arguing the point if you're not going to provide a single example.

saubeidl · 13h ago

Can you name specific examples? Otherwise, this just sounds like inflammatory polemic.

atemerev · 13h ago

I've yet to find any rights and protections in these cookie banners.

saubeidl · 13h ago

The cookie banners are corps trying to circumvent the rights and protections. If they actually went by the spirit of the protections, the cookie banners wouldn't be needed. Your ire is misdirected.

yeahforsureman · 12h ago

Are you sure?

The ePrivacy Directive requires a (GDPR-level) consent for just placing the cookie, unless it's strictly necessary for the provision of the “service”. The way EU regulators interpret this, even web analytics falls outside the necessity exception and therefore requires consent.

So as long as the user doesn't and/or is not able to automatically signal consent (or non-consent) eg via general browser-level settings, how can you obtain it without trying to get it from the user on a per-site basis somehow? (And no, DNT doesn't help since it's an opt-out, not an opt-in mechanism.)

exyi · 11h ago

Everyone I know of will try to click "reject all unnecessary cookies", and you don't need the dialog for the necessary ones. You can therefore simply remove the dialog and the tracking, simplifying your code and improving your users' experience. Can tracking the fraction which misclicks even give some useful data?

micromacrofoot · 11h ago

there are analytics providers that don't require third party cookies, it's not hard to switch

micromacrofoot · 13h ago

cookie banners are malicious compliance while we head towards the death of cross-site cookies, they are indeed a poor implementation but the legislation that lead to them did not come up with it

did you really prefer when companies were selling your data to third parties and didn't have to ask you?

sunaookami · 10h ago

Do you really think clicking "Reject non-essential cookies" does something?

micromacrofoot · 9h ago

show me a single example that doesn't

0xDEAFBEAD · 13h ago

Honestly the US approach to AI is incredibly irresponsible. As an American, I'm glad that someone somewhere is thinking about regulation. Not sure it will be enough though: https://xcancel.com/ESYudkowsky/status/1922710969785917691#m

msgodel · 10h ago

There's nothing the regulation could meaningfully hope to accomplish other than slow down people willing to play by the rules.

ambicapter · 7h ago

Wow, the "criminals don't follow laws therefore laws are worthless" argument, here? In my HN?

msgodel · 5h ago

Usually it's possible to actually detect crime (in fact it's usually hard to ignore.) That's not the case with AI.

MoonGhost · 12h ago

No, thanks, we don't want to be like EU. Everything regulated to death. They even thought to criminalize street photography because there could be copyrighted materials in the picture. Not sure, are they still taxing Eiffel tower images?

int_19h · 11h ago

EU is not a monolithic entity, and amount of regulation varies widely. Baltics are very business friendly, for example.

bobxmax · 10h ago

And Estonia has the most impressive tech ecosystem on the continent while being a soviet backwater 20 years ago. Shocking how that works.

johnisgood · 12h ago

I thought it is happening in the US, too. I mean, the Government is there to regulate the shit out of everything. Regardless of where you are.

simonw · 11h ago

Here are my notes on trying this out locally via Ollama and via their API (and the llm-mistral plugin) too: https://simonwillison.net/2025/Jun/10/magistral/

atxtechbro · 10h ago

Hi Simon,

What's the huge difference between the two pelicans riding bicycles? Was one running locally the small version vs the pretty good one running the bigger one thru the API?

Thanks, Morgan

diggan · 10h ago

Ollama doesn't like proper naming for some reason, so `ollama pull magistral:latest` lands you with the q4_K_M version (currently, subject to change).

Mistral's API defaults to `magistral-medium-2506` right now, which is running with full precision, no quantization.

samtheprogram · 6h ago

Not only the quantization, but what’s available via ollama is magistral-small (for local inference), not the -medium variant.

simonw · 8h ago

Yes, the bad one was Mistral Small running locally, the better one was Mistral Medium via their API.

dwedge · 8h ago

Their OCR model was really well hyped and coincidentally came out at the time I had a batch of 600 page pdfs to OCR. They were all monospace text just for some reason the OCR was missing.

I tried it, 80% of the "text" was recognised as images and output as whitespace so most of it was empty. It was much much worse than tesseract.

A month later I got the bill for that crap and deleted my account.

Maybe this is better but I'm over hype marketing from mistral

christianqchung · 13h ago

I don't understand why the benchmark selections are so scattered and limited. It only compares Magistral Medium with Deepseek V3, R1, and the other close weighted Mistral Medium 3. Why did they leave off Magistral Small entirely, alongside comparisons with Alibaba Qwen or the mini versions of o3 and o4?

CobrastanJorji · 6h ago

Etymological fun: both "mistral" and "magistral" mean "masterly."

Mistral comes from Occitan for masterly, although today as far as I know it's only used in English when talking about mediterranean winds.

Magistral is just the adjective form of "magister," so "like a master."

If you want to make a few bucks, maybe look up some more obscure synonyms for masterly and pick up the domain names.

arnaudsm · 13h ago

I wished the charts included Qwen3, the current SOTA in reasoning.

Qwen3-4B almost beats Magistral-22B on the 4 available benchmarks, and Qwen3-30B-A3B is miles ahead.

SparkyMcUnicorn · 11h ago

30-A3B is a really impressive model.

I throw tasks at it running locally to save on API costs, and it's possibly better than anything we had a year or so ago from closed source providers. For programming tasks, I'd rank it higher than gpt-4o

No comments yet

poorman · 10h ago

Is there a popular benchmark site people use? Becaues I had to test all these by hand and `Qwen3-30B-A3B` still seems like the best model I can run in that relative parameter space (/memory requirements).

arnaudsm · 6h ago

- https://livebench.ai/#/ + AIME + LiveCodeBench for reasoning

- MMLU-Pro for knowledge

- https://lmarena.ai/leaderboard for user preference

We only got Magistral's GPQA, AIME & livecodebench so far.

devmor · 10h ago

I would agree, Qwen3 is definitely the most impressive "reasoning" model I've evaluated so far.

resource_waste · 11h ago

No surprise on my end. Mistral has been basically useless due to other models always being better.

But its European, so its a point of pride.

Relevance or not, we will keep hearing the name as a result.

alister · 8h ago

As a quick test of logical reasoning and basic Wikipedia-level knowledge, I asked Mistral AI the following question:

A Brazilian citizen is flying from Sao Paulo to Paris, with a connection in Lisbon. Does he need to clear immigration in Lisbon or in Paris or in both cities or in neither city?

Mistral AI said that "immigration control will only be cleared in Paris," which I think is wrong.

After I pointed it to the Wikipedia article on this topic[1], it corrected itself to say that "immigration control will be cleared in Lisbon, the first point of entry into the Schengen Area."

I tried the same question with Meta AI (Llama 4) and it did much worse: It said that the traveler "wouldn't need to clear immigration in either Lisbon or Paris, given the flight connections are within the Schengen Area", which is completely incorrect.

I'd be interested to hear if other LLMs give a correct answer.

[1] https://en.wikipedia.org/wiki/Schengen_Area#Air_travel

mcintyre1994 · 6h ago

I think Gemini's answer (2.5 Flash) is impressive

----

Since both Portugal and France are part of the Schengen Area, and a Brazilian citizen generally does not need a visa for short stays (up to 90 days in any 180-day period) in the Schengen Area, here's how immigration will work:

Lisbon: The Brazilian citizen will need to clear immigration in Lisbon. This is because Lisbon is the first point of entry into the Schengen Area. At this point, their passport will be stamped, and they will be officially admitted into the Schengen Zone.

Paris: Once they have cleared immigration in Lisbon, their flight from Lisbon to Paris is considered a domestic flight within the Schengen Area. Therefore, they will not need to clear immigration again in Paris.

Important Note: While Brazilians currently enjoy visa-free travel, the European Travel Information and Authorization System (ETIAS) is expected to become mandatory by late 2026. Once implemented, Brazilian citizens will need to obtain this electronic authorization before their trip to Europe, even for visa-free stays. However, this is a pre-travel authorization, not a visa in the traditional sense, and the immigration clearance process at the first point of entry would remain the same.

schneehertz · 3h ago

This is deepseek-r1's answer:

```

Based on the Schengen Area rules for international travel:

Lisbon (Portugal) - Immigration clearance is required here. Since Lisbon is the first point of entry into the Schengen Area (and Brazil is outside Schengen), the passenger must clear immigration (passport control) in Lisbon. This is standard for all non-Schengen arrivals.

Paris (France) - No immigration clearance is needed. After clearing immigration in Lisbon, the onward flight to Paris is considered a domestic Schengen flight (no border checks within Schengen).

```

viraptor · 5h ago

It looks like you're not sure of the answer yourself. That's really not a good way to test the LLMs. You'll just prefer the one that convinces you the most rather than the correct one.

jaza · 4h ago

Sounds to me like "immigration in Lisbon or in Paris or in both cities or in neither city" is a trick question, because (on top of immigration in Lisbon as the Schengen entry point) surely the traveller also needs to clear immigration in Sao Paulo (as the Brazil / Mercosur exit point)?

marsa · 7h ago

doing some reason.. uhh intuitioning i imagine brazil and portugal might have some sort of a visa-free deal going on in which case llama 4 might actually be right here?

mcintyre1994 · 6h ago

AFAIK Schengen has a common visa policy, so there couldn't be such a deal between Brazil and Portugal. It'd also be extremely surprising if two countries not in a common travel area had a deal where you didn't have to clear customs at all, I suspect that doesn't exist anywhere in the world.

alister · 7h ago

Brazilians don't need a visa for Portugal, France, or any Schengen country. But everybody has to pass through immigration control (at least a passport check even if you don't need a visa) when entering the Schengen zone. My question was which country would that happen in.

FabHK · 12m ago

So, is it accessible in Le Chat?

bee_rider · 13h ago

How many other open-weights reasoning models are there?

Is it possible to run multiple reasoning models on one problem? (Why not? I guess).

Another funny thought is: they release their Small model, and kept their Medium as a premium service. I wonder if you could do chains with Medium run occasionally, linked together by local runs of Small?

simonw · 13h ago

Qwen 3 and DeepSeek R1 and Phi-4 Reasoning are the best open weights reasoning models I know of.

ls612 · 13h ago

Just Deepseek I think and there are distillations of that that can run on consumer hardware if you really want.

rafram · 13h ago

Is the number of em-dashes in this marketing copy indicative of the kind of output that the model produces? If so, might want to tone it down a bit.

sebmellen · 12h ago

> Our early tests indicated that Magistral is an excellent creative companion. We highly recommend it for creative writing and storytelling, with the model capable of producing coherent or — if needed — delightfully eccentric copy.

cAtte_ · 10h ago

49 em-dashes, 59 commas. that's a crazy ratio

saratogacx · 11h ago

That is just Mistral's market style. You see it on a lot of their pages. The model output doesn't share the same love for the long dash.

johnisgood · 12h ago

I do not know but sometimes when I type "-" and press space, LibreOffice converts it to an em-dash. I get rid of it so people won't confuse me with an LLM.

tiahura · 13h ago

Unless you're a lawyer. We love 'em.

NicuCalcea · 13h ago

As a journalist, same!

lee-rhapsody · 13h ago

Also a journalist. I use em-dashes all the time

Gregaros · 12h ago

Really anyone that writes for a living. I have a referee report on a paper asking me to correct something to be an em-dash.

drusepth · 5h ago

As an author... same!

ModernMech · 13h ago

But the em dashes — if appreciated — are delightfully eccentric and whimsical!

kobe_bryant · 11h ago

it's bizarre.

the first sentence is "Announcing Magistral — the first reasoning model by Mistral AI — excelling in domain-specific, transparent, and multilingual reasoning." and those should clearly be comma

and this sentence is just flat out wrong "Lack of specialized depth needed for domain-specific problems, limited transparency, and inconsistent reasoning in the desired language — are just some of the known limitations of early thinking models."

umbra07 · 11h ago

really? i would have written it the exact same way (with dashes instead of commas).

rafram · 10h ago

The second one is unambiguously wrong. The first just looks kind of weird.

nake13 · 13h ago

The Magistral Small can fit within a single RTX 4090 or a 32GB RAM MacBook once quantized.

lolive · 5h ago

Is it indeed the plan of Apple to eventually run such kind of models direcly inside a iPhone? Or are the specs of any stateOfTheArt smartphone well below the minimum requirements of such "lightweight" models?

the_sleaze_ · 10h ago

Excellent news for me.

How does one figure this out? As in I want to know the comparable Deepseek or Llama equivalent (size-wise) and don't want to figure it out by trial and error.

RandyOrion · 2h ago

Below are my comments on Magistral small (not medium).

24B size is good for local inference.

As a model outputting long "reasoning" traces (~10k tokens), 40k context length is a little concerning.

Where are the results of normal benchmarks, e.g., MMLU/pro, IFEval and such.

Still, thank you Mistral team for releasing this model with Apache 2.0.

diggan · 12h ago

The only mention of tools I could find is this:

> it significantly improves project planning, backend architecture, frontend design, and data engineering through sequenced, multi-step actions involving external tools or API.

I'm guessing this means it was trained with tool calling? And if so, does that mean it does tool calling within the thinking/reasoning, or within the main text? Seems unclear

simonw · 11h ago

Tool calling isn't enabled in the official Magistral Small GGUF (or the Ollama one) which is sad. Hope they (or someone else) fix that soon.

jmorgan · 40m ago

Working on adding tool calling support to Magistral in Ollama. It requires a tokenizer change and also uses a new tool calling format. Excited to see the results of combining thinking + tool calling!

NitpickLawyer · 11h ago

They have already released Devstral, which is a tool-specific finetune of the same base model. That works pretty well with cline (even though it was specifically tuned for open-hands).

This would likely be a good model for the "plan" mode in various agentic tools (cline, aider, cursor/windsurf/void, etc). So you'd have a chat in plan mode, then use devstral to actually implement that plan.

diggan · 10h ago

Devstral is targeting tool use+coding I think, so something like Magistral but also tool calling (during thinking) would be handy too, just for other use cases. But also beneficial in the context of creating plans for Devstral.

Oras · 14h ago

Would be interesting to see a comparison with Qwen 32B. I found it a fantastic local model (ollama).

DSingularity · 13h ago

I agree. Qwen models are great.

SV_BubbleTime · 12h ago

Last year, fit was important. This year, inference speed is key.

Proofreading an email at four tokens per second, great.

Spending a half hour to deep research some topic with artifacts and MCP tools and reasoning at four tokens per second… a bad time.

skeptrune · 9h ago

Fully open reasoning traces are useful. Happy there is a vendor out there shipping that feature.

hijkzzz · 1h ago

The RL algorithm used in Magistral is the same as the Reinforce++-baseline in OpenRLHF.

mark_l_watson · 5h ago

Nice, and I see that Ollama already has the smaller 24B version. I am traveling with just a mobile device so I have to wait to try it, but I have been using their new devstral coding model and it is very useful, given that it is also a locally run model so I looking forward to trying magistral.

epic9x · 13h ago

This thing is crazy fast.

smeeth · 13h ago

They have a deal with Cerebras for inference.

https://www.cerebras.ai/blog/mistral-le-chat

swah · 12h ago

For me this is more important than quality. I love fast responses, feels more futuristic.

awongh · 13h ago

Interesting that their niche seems to be small parameter models.

cchance · 14h ago

Good first shot i guess, but the small ones about as good as v3, and the mediums not quite as good as r1... i wonder if that r1 is the actual new one or the old one

hacklas · 13h ago

The Deepseek V3 is a model with 671 billion parameters, of which 37 billion are active.

Magistral Small is a 24 billion parameter model.

Pretty impressive in terms of efficiency for Mistral.

The size of the Magistral Medium is not publicly available, so it is difficult to compare efficiency there.

kouteiheika · 12h ago

> The size of the Magistral Medium is not publicly available, so it is difficult to compare efficiency there.

FWIW one of their 70B models has leaked in the past (search for "miqu") and rumors at the time were that it was their medium model.

5mv2 · 13h ago

The featured accuracy benchmarks exclude every model that matter except DeepSeek, which is quite telling about this new model's performance.

This makes it yet another example of European companies building great products but fumbling marketing.

Mistral's edge is speed. It's a real pleasure to use because it answers in ~1s what takes other models 5-8s, which makes for a much better experience. But instead of focusing on it, they bury it far down the post.

Try it and see if you like the speed! Note that the speed advantage only applies to queries that don't require web-search, as Mistral is significantly slower on this one, leading to a ~5 seconds advantage over 2 minutes of research for the queries I benchmarked with Grok.

dominicrose · 12h ago

How would you use a fast AI?

My current use of AI is to generate code - or translate some code from a programming language to another - which I can then improve (instead of writing it from stratch). Speed isn't necessary for this. It's a nice-to-have but only if it's not at the cost of quality.

Also, as unfair as it "might" be, we do expect a fast AI not to be as good, don't we? So I wouldn't focus on that in the marketing. I think speed would be easier to sell as something extra you would pay for, because then you'd expect the quality to remain the same or better.

redavni · 11h ago

analyzing and modifying a user interface in realtime?

funnym0nk3y · 13h ago

That is reasonable though. Comparing the product of a small company with little resources with giants like Google and OpenAI in a field where most advances are due to more and more expensive models is nonsense.

5mv2 · 13h ago

The point I was trying to express is that Mistral is arguably far superior to the giants if you care about speed! So I wished they communicated this more clearly.

desireco42 · 9h ago

One cool think about this model, that I installed locally is that supports well other languages as well as it should be pleasant conversation partner.

BTW I am personally fan of Mistral, because while it is not the top model, it produces good results and the most important thing is that it is super fast, just go to it's chat and be amazed. It really saves a lot of time to have quick response.

GuinansEyebrows · 10h ago

This doesn't really explain what "reasoning" means in the context of genAI, or how it's done by this product. Are there any good sources to learn more about what "reasoning model" means outside of marketing-speak?

pier25 · 9h ago

It's pure marketing. See the recent paper by Apple called "The Illusion of Thinking".

https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin...

kamranjon · 9h ago

I sort of agree with this, having read the recent Apple paper - but it does show a significant improvement at a certain level of complexity - it's just that it requires quite a few more tokens to achieve that. It could probably be described as a sort of "context" hack because it's basically having a conversation with itself to arrive at a better solution. You're trading performance/time for a bit better quality.

throwaway314155 · 7h ago

If you read that paper, you'll find a more nuanced take than simply "it's pure marketing"

atemerev · 13h ago

So, worse than R1, and only 24B version is open weights? NGMI. R1 is awesome, and full 630B version is open.

tonyhart7 · 13h ago

a bit too late aren't we??

ksec · 14h ago

A few days after Apple's "The illusion of Reasoning". I wonder if this is the same again. Anyone runs Tower of Hanoi?

barrkel · 13h ago

The Tower of Hanoi problem is limited by context length rather than model intelligence - see https://x.com/scaling01/status/1931783050511126954

NitpickLawyer · 13h ago

That paper was flawed in many ways, but it had a catchy name so lots of 'fluencers and media pounced on it and slopped some content based on the title alone. Chances are it will be relegated to the blooper section of LLM papers, just like that "training on LLM outputs leads to model collapse" paper was...

__loam · 12h ago

Sorry this has nothing to do with the point you're making but I've literally never seen anyone use the word 'fluencers in place of influencers lol.

olddustytrail · 8h ago

Me neither and it's not much shorter. I think fluzies could work better.

squidsoup · 6h ago

I propose effluencers.

syntex · 13h ago

The illussion of reasoning was terrible paper. 2^n-1 how it could fit in context size. I tried o3 and he gave me python script saying that inserting all moves is to much for context window. completely different results.

roboboffin · 12h ago

I think that their point was that the problem is easily solvable by humans without code, and shows the ability to chain steps together to achieve a goal.

jwitthuhn · 10h ago

Is it easily solvable by humans without code? I suspect if you asked a human to write down all the steps in order to solve a Tower of Hanoi with 12 disks they would also give up before completing it. Writing code that produces the correct output is the only realistic way to solve that purely due to the amount of output required.

roboboffin · 12h ago

Not sure why I am being downvoted. I am simply saying that we know there is a defined algorithm for solving Tower of Hanoi, and the source code for it is widely available. So, o3 producing the code as an answer, demonstrates even less intelligence, as it means it is either memorized or copied from the internet. I don't see how this point counters the paper at all.

I believe what they are trying to show in that paper, is that as the chain of operations approaches a large amount (their proxy for complexity), an LLM will inevitable fail. Humans don't have infinite context either, but they can still solve the Tower Of Hanoi without need to resort to either pen or paper, or coding.

syntex · 11h ago

I didn't downvote. T the problem with the paper is that it asks the model to output all moves for, say, 15 disks 2 ^ 15 - 1 = 32767

32767 moves in a single prompt. That's not testing reasoning. That’s testing whether the model can emit a huge structured output without error, under a context window limit.

The authors then treat failure to reproduce this entire sequence as evidence that the model can't reason. But that’s like saying a calculator is broken because its printer jammed halfway through printing all prime numbers under 10000.

For me o3 returning Python code isn’t a failure. It’s a smart shortcut. The failure is in the benchmark design. This benchmark just smells.

roboboffin · 10h ago

No worries, I wasn’t saying to you directly.

I agree 15 disks is very difficult for a human, probably on a sheer stamina level; but I managed to do 8 in about 15 minutes by playing around (I.e. no practice). They do state that there is a massive drop in performance at this point.

teach · 8h ago

Remember that with Towers of Hanoi every extra disk doubles the number of moves required. So 15 discs is 128x more moves. If you did eight in 15m then fifteen would take you 32 hours.

daveguy · 10h ago

> That’s testing whether the model can emit a huge structured output without error, under a context window limit.

Agreed. But to be fair, 1) a relatively simple algorithm can do it, and more importantly 2) a lot of people are trying to build products around doing exactly this (emit large structured output without error).

Launch HN: Vassar Robotics (YC X25) – $219 robot arm that learns new skills

We’re secretly winning the war on cancer (vox.com)

It's the end of observability as we know it (and I feel fine) (honeycomb.io)

Show HN: I made a 3D printed VTOL drone (tsungxu.com)

Magistral — the first reasoning model by Mistral AI (mistral.ai)

Low-background Steel: content without AI contamination (blog.jgc.org)

The librarian immediately attempts to sell you a vuvuzela (kaveland.no)

Xeneva Operating System (github.com)

Fine-Tuning LLMs Is a Waste of Time (codinginterviewsmadesimple.substack.com)

You Can Drive but Not Hide: Detection of Hidden Cellular GPS Vehicle Trackers (researchgate.net)

Malleable software: Restoring user agency in a world of locked-down apps (inkandswitch.com)

Show HN: Chili3d – A open-source, browser-based 3D CAD application

Show HN: A “Course” as an MCP Server (mastra.ai)

OpenAI dropped the price of o3 by 80% (twitter.com)

Modern Minimal Perfect Hashing: A Survey (arxiv.org)

Mikeal Rogers has died (b.h4x.zip)

Denuvo Analysis (connorjaydunn.github.io)

A Blacklisted American Magician Became a Hero in Brazil (wsj.com)

Launch HN: BitBoard (YC X25) – AI agents for healthcare back-offices

Another Crack in the Chain of Trust: Uncovering (Yet Another) Secure Boot Bypass (binarly.io)

The Gentle Singularity (blog.samaltman.com)

A Thousand Tiny Optimisations (leejo.github.io)

A Family of Non-Periodic Tilings, Describable Using Elementary Tools (arxiv.org)

Sophie Germain Prime Project (palaiologos.rocks)

Android 16 is here (blog.google)

Faster, easier 2D vector rendering [video] (youtube.com)

Chatbots are replacing Google's search, devastating traffic for some publishers (wsj.com)

Dubious Math in Infinite Jest (2009) (thehowlingfantods.com)

Show HN: High End Color Quantizer (github.com)

Show HN: MidWord – A Word-Guessing Game (midword.com)

Spoofing OpenPGP.js signature verification (codeanlabs.com)

Airlines Don't Want You to Know They Sold Your Flight Data to DHS (wired.com)

Animate a mesh across a sphere's surface (garden.bradwoods.io)

Show HN: PyDoll – Async Python scraping engine with native CAPTCHA bypass (github.com)

"Localhost tracking" explained. It could cost Meta €32B (zeropartydata.es)

Exploring our collection: the canary resuscitator (2018) (blog.scienceandindustrymuseum.org.uk)

Japan telecom giant NTT Docomo to end own emoji after 26 years (english.kyodonews.net)

Onlook (YC W25) Is Hiring an engineer in SF

A Primer on Molecular Dynamics (owlposting.com)

Why agents are bad pair programmers (justin.searls.co)

Ask HN: What cool skill or project interests you, but feels out of reach?

Teaching National Security Policy with AI (steveblank.com)

Show HN: Update to my meta glasses API "Hey Meta send a message to ChatGPT" (github.com)

Apple announces Foundation Models and Containerization frameworks, etc (apple.com)

Reinforcement Pre-Training (arxiv.org)

Containerization is a Swift package for running Linux containers on macOS (github.com)

R9: Plan 9 in Rust (github.com)

Apple introduces a universal design across platforms (apple.com)

Wharton Esherick and the Armstrong Linoleum Company (whartonesherickmuseum.org)

Launch HN: Chonkie (YC X25) – Open-Source Library for Advanced Chunking

Magistral — the first reasoning model by Mistral AI

Comments (304)