I think they've buried the lede with their image editing capabilities, which seem to be very good! OpenAI's model will change the whole image while editing, messing up details in unrelated areas. This seems to perfectly preserve parts of the image unrelated to your query and selectively apply the edits, which is very impressive! The only downside is the output resolution (the resulting image is 1184px wide even though the input image was much larger).
For a quick test I've uploaded a photo of my home office and asked the following prompt: "Retouch this photo to fix the gray panels at the bottom that are slightly ripped, make them look brand new"
Input image (rescaled): https://i.imgur.com/t0WCKAu.jpeg
Output image: https://i.imgur.com/xb99lmC.png
I think it did a fantastic job. The output image quality is ever so slightly worse than the original but that's something they'll improve with time I'm sure.
sync · 3h ago
FYI, your Input and output URLs are the same (I thought I was crazy for a sec trying to spot the differences)
M4v3R · 3h ago
whoops, sorry about that, fixed
shaky-carrousel · 48m ago
It messed up the titles of the books.
bakkoting · 2h ago
Kontext is probably better at this specific task, if that's what Mistral is using. Certainly faster and cheaper. But:
OpenAI just yesterday added the ability to do higher fidelity image edits with their model [1], though I'm not sure if the functionality is only in the API or if their chat UI will make use of this feature too. Same prompt and input image: [2]
[1] https://x.com/OpenAIDevs/status/1945538534884135132
[2] https://i.imgur.com/w5Q0UQm.png
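For reference, a rough sketch of what that higher-fidelity edit could look like through the API. This assumes the feature is exposed as an `input_fidelity` parameter on the image-edit endpoint, as the linked announcement suggests; treat that parameter name as an assumption and check the current API reference before relying on it.

    # Sketch: high-input-fidelity image edit with the OpenAI Images API.
    # The input_fidelity parameter is assumed from the announcement linked above;
    # the rest of the call (images.edit with gpt-image-1) is the standard edit API.
    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    result = client.images.edit(
        model="gpt-image-1",
        image=open("home_office.jpg", "rb"),  # placeholder input photo
        prompt=(
            "Retouch this photo to fix the gray panels at the bottom that are "
            "slightly ripped, make them look brand new"
        ),
        input_fidelity="high",  # assumption: asks the model to preserve unedited regions
    )

    # gpt-image-1 returns base64-encoded image data
    with open("home_office_fixed.png", "wb") as f:
        f.write(base64.b64decode(result.data[0].b64_json))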
They are using Flux Kontext from Black Forest Labs, fantastic model.
koakuma-chan · 2h ago
So Mistral is just hosting a Flux model?
Squarex · 2h ago
Yes, but it's great that they are both made by European companies.
joshcartme · 2h ago
Wow, that really is amazing!
I couldn't help but notice that you can still see the shadows of the rips in the fixed version. I wonder how hard it would be to get those fixed as well.
trilogic · 33m ago
Finally the EU is waking up. Proud of it.
I am switching to Mistral as soon as my OpenAI contract finishes.
We've got to support the EU. Vive la France.
tdhz77 · 5h ago
I’m struggling with MRF: Model Release Fatigue. It’s a syndrome of constantly context-switching between new large models: Claude 4, GPT, Llama, Gemini 2.5, pro-mini, Mistral.
I fire up the IDE, switch the model, and think, oh great, this is better. Then I switch to something that worked before and, man, this sucks now.
Context-switching LLMs: Model Release Fatigue.
reilly3000 · 4h ago
Not to invalidate your feelings of fatigue, but I’m sure glad that there are a lot of choices in the marketplace, and that they are innovating at a decent clip. If you’re committed to always be using the best of all options you’re in for a wild ride, but it beats stagnation and monopoly.
ivape · 3h ago
We’re also headed into a world where there will be very few open weight models coming out (Meta going closed source, not releasing Behemoth). This era of constant model releases may be over before it even started. Gratitude definitely needs to be echoed.
randomNumber7 · 3h ago
I don't agree with that. I didn't expect we would ever get open-weight models close to the current state of the art, yet China delivered some real burners.
echelon · 3h ago
If China stays open, then the rest of the world will build on open. I'm frankly shocked that a domestic player isn't doing this.
Fine-tuning will work better for niche business use cases than promises of AGI.
kakapo5672 · 2h ago
It's curious that China is carrying the open banner nowadays. Why is that?
One theory is that they believe the real endpoint value will be embodied AIs (i.e. robots), where they think they'll hold a long-term competitive advantage. The models themselves will become commoditized, under the pressure of the open-source models.
seszett · 3h ago
> If China stays open, then the rest of the world will build on open
I was listening to a Taiwanese news channel earlier today and although I wasn't paying much attention, I remember hearing about how Chinese AIs are biased towards Chinese political ideas and that some programme to create a more Taiwanese-aligned AI was being put in place.
I wouldn't be surprised if, just for this reason, at least a few different open models kept being released, because even if they don't directly bring in money, several actors care more about spreading or defending their ideas, and AIs are perfect for that.
bee_rider · 4h ago
A major reason I haven’t really tried any of these things (despite thinking they are vaguely neat). I think I will wait until… 2026, second half, most likely. At least I’ll check if we have local models and hardware that can run them nicely, by then.
Hats off to the folks who have decided to deal with the nascent versions though.
Nezteb · 4h ago
Depending on the definition of "nicely", FWIW I currently run an Ollama server [1] + Qwen Coder models [2] with decent success compared to the big hosted models. Granted, I don't utilize most "agentic" features and still mostly use chat-based interactions.
The server is basically just my Windows gaming PC, and the client is my editor on a macOS laptop.
Most of this effort is so that I can prepare for the arrival of that mythical second half of 2026!
[1] https://github.com/ollama/ollama/blob/main/docs/faq.md#how-d...
[2] https://huggingface.co/collections/Qwen/qwen25-coder-66eaa22...
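For anyone curious about the plumbing, here's a minimal sketch of how the laptop side can talk to a remote Ollama server over the LAN, assuming the server has been configured to listen on the network (the FAQ in [1]) and a Qwen coder model has already been pulled. The host address and model tag below are placeholders, not anything from the original setup.

    # Minimal sketch: query a remote Ollama server from another machine.
    # Placeholders: the LAN address of the Windows box and the exact model tag.
    import requests

    OLLAMA_HOST = "http://192.168.1.50:11434"  # placeholder LAN address, default Ollama port
    MODEL = "qwen2.5-coder:14b"                # placeholder tag for a pulled Qwen coder model

    resp = requests.post(
        f"{OLLAMA_HOST}/api/generate",
        json={
            "model": MODEL,
            "prompt": "Write a Python function that parses an ISO 8601 timestamp.",
            "stream": False,  # ask for a single JSON response instead of a stream
        },
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json()["response"])  # the generated text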
Thanks for sharing your setup! I'm also very interested in running AI locally. In which contexts are you experiencing decent success? eg debugging, boilerplate, or some other task?
bogzz · 3h ago
I'm running qwen via ollama on my M4 Max 14 inch with the OpenWebUI interface, it's silly easy to set up.
Not useful though, I just like the idea of having so much compressed knowledge on my machine in just 20gb. In fact I disabled all Siri features cause they're dogshit.
Kostic · 3h ago
Agentic editing is really nice. If on VSCode, Cline works well with Ollama.
Uehreka · 2h ago
When ChatGPT, then Llama, then Alpaca came out in rapid succession, I decided to hold off a year before diving in. This was definitely the right choice at the time, it’s becoming less-the-right-choice all the time.
In particular it’s important to get past the whole need-to-self-host thing. Like, I used to be holding out for when this stuff would plateau, but that keeps not happening, and the things we’re starting to be able to build in 2025 now that we have fairly capable models like Claude 4 are super exciting.
If you just want locally runnable commodity “boring technology that just works” stuff, sure, cool, keep waiting. If you’re interested in hacking on interesting new technology (glances at the title of the site) now is an excellent time to do so.
randomNumber7 · 3h ago
It is completely unreasonable to buy the hardware to run a local model and only use it 1% of the time. It will be unreasonable in 2026 and probably very long after that.
Maybe something like a collective that buys the GPUs together and then uses them without leaking data could work.
nosianu · 4h ago
I have a modified tiered approach that I adopted without consciously thinking hard about it.
I use AI mostly for problems on my fringes. Things like manipulating some Excel table somebody sent me with invoice data from one of our suppliers and some moderately complex question that they (pure business) don't know how to handle, where simple formulas would not be sufficient and I would have to start learning Power Query. I can tell the AI exactly what I want in human language and don't have to learn a system that I only use because people here use it to fill holes not yet served by "real" software (databases, automated EDI data exchange, and code that automates the business processes). It works great, and it saves me hours on fringe tasks that people outsource to me, but that I too don't really want to deal with too much.
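Purely for illustration, the kind of throwaway script such an ask tends to produce; the file name, column names, and status values below are invented for this sketch, not taken from the comment.

    # Hypothetical example of the sort of one-off ask: read a supplier's invoice
    # sheet and summarize open amounts per month, flagging anything overdue.
    import pandas as pd

    df = pd.read_excel("supplier_invoices.xlsx")  # assumed columns: due_date, status, amount
    df["due_date"] = pd.to_datetime(df["due_date"])

    overdue = df[(df["status"] == "open") & (df["due_date"] < pd.Timestamp.today())]
    summary = (
        df[df["status"] == "open"]
        .groupby(df["due_date"].dt.to_period("M"))["amount"]
        .sum()
        .rename("open_amount")
    )

    print(summary)
    print(f"{len(overdue)} overdue invoices totalling {overdue['amount'].sum():.2f}")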
For example, I also don't check various vendors and models against one another. I still stick to whatever the default is from the first vendor I signed up with, and so far it worked well enough. If I were to spend time checking vendors and models, the knowledge would be outdated far too quickly for my taste.
On the other hand, I don't use it for my core tasks yet. Too much movement in this space, I would have to invest many hours in how to integrate this new stuff when the "old" software approach is more than sufficient, still more reliable, and vastly more economical (once implemented).
Same for coding. I ask AI on the fringes where I don't know enough, but in the core that I'm sufficiently proficient with I wait for a more stable AI world.
I don't solve complex sciency problems, I move business data around. Many suppliers, many customers, different countries, various EDI formats, everybody has slightly different data and naming and procedures. For example, I have to deal with one vendor wanting some share of pre-payment early in the year, which I have to apply to thousands of invoices over the year and track when we have to pay a number of hundreds or thousands of invoices all with different payment conditions and timings. If I were to ask the AI I would have to be so super specific I may as well write the code.
But I love AI on the not-yet-automated edges. I'm starting to show others how they can ask some AI, and many are surprised how easy it is - when you have the right task and know exactly what you have and what you want. My last colleague-convert was someone already past retirement age (still working on the business side). I think this is a good time to gradually teach regular employees some small use cases to get them interested, rather than some big top-down approach that mostly creates more work and leaves many people rightly questioning what the point is.
About politically-touched questions like whether I should rather use an EU-made AI like the one this topic is about, or use one from the already much of the software-world dominating US vendor, I don't care at this point, because I'm not yet creating any significant dependencies. I am glad to see it happening though (as an EU country citizen).
bee_rider · 3h ago
> About politically-touched questions like whether I should rather use an EU-made AI like the one this topic is about, or use one from the already much of the software-world dominating US vendor, I don't care at this point, because I'm not yet creating any significant dependencies. I am glad to see it happening though (as an EU country citizen).
Another nice thing about waiting a bit—one can see how much (if any) the EU models get from paying the “do things somewhat ethically” price. I suspect it won’t be much of a penalty.
sva_ · 3h ago
All the competition is great for me. I'm using premium models all the time and have barely spent a few euros on them, as there are always some offers that are almost free if you look around.
emilsedgh · 5h ago
Why do you even follow? Just stick to one that works well for you?
barbazoo · 4h ago
Totally, though I feel like you do have to pay some attention. For example, in the context I'm working on, for the last while Gemini was our gold standard for code generation, whereas today Claude subjectively produces the better results. Sure, you can stick to what worked, but then you're missing the opportunity to be more productive or less busy, whichever one you choose.
exe34 · 3h ago
I remember the days when I was looking for the perfect note-taking system/setup - I never achieved anything with it, I was too busy figuring out the best way to take notes.
barbazoo · 3h ago
Once we find the best way though...
tartoran · 3h ago
FOMO may be one of the reasons amongst others.
didibear · 5h ago
I believe the performance of previous versions gets worse because providers reallocate resources to newer versions, and also because their training data cut-off stays in previous years.
This is what happened between Claude Sonnet 3.5 and 3.7.
Personally I only use Claude/Anthropic and ignore other providers, because it's the one I understand best. It's smart enough; I rarely need the latest and greatest.
zamadatix · 3h ago
Much like with new computer hardware, announcements are constant but they rarely entice me to drop one thing and switch to another. If an average user picked a top-3 option last year and stuck with it through now, they didn't really miss out on all that much, even if their particular choice wasn't the absolute latest and greatest the entire time.
wahnfrieden · 2h ago
Sticking with one-year-old models would mean no o3, which is a huge loss for dev work.
criemen · 4h ago
I totally get it. Due to my work, I mostly keep up with new model releases, but the pace is not sustainable for individuals, or the industry.
I'm hoping that model releases (and the entire development speed of the field) will slow down over time, as LLMs mature and most low-hanging fruits in model training have been picked. Are we there yet? Surely not.
vouaobrasil · 3h ago
An alternative: don't use LLMs. Focus on the enjoyment of coding, not on becoming more efficient. Because the lion's share of the gains from increased efficiency are mainly going to the CEOs.
freedomben · 2h ago
This might be good short-term advice, but in the medium and long term I think devs who don't use any AI will start to be much slower at delivery than devs who do. I'm already seeing it IRL (and I'm not a fan of AI coding, so this sucks for me).
jdiff · 1h ago
Good news for you then: this idea is less and less borne out by the data. The productivity and efficiency gains aren't there, so there's no reason to be compelled by the spectre of obsolescence. The models may be getting better, but it doesn't seem to be actually changing much for programming. The illusion of busywork, perhaps, is swallowing up the decreased mental bandwidth in constant context switching.
ivape · 2h ago
Slower in initial delivery maybe, but the maintenance and debugging of production applications requires intimate knowledge of the code base usually. The amount of code AI writes will require AI itself to manage it since no human would inundate themselves with that much code. Will it be faster even so? We simply won’t know because those vibe coded apps have just entered production. The horror stories can’t be written yet because the horror is ongoing.
I’m big on AI, but vibe coding is such a fuck around and find out situation.
jdiff · 1h ago
Plenty of small FAFO stories circulate already. There will certainly be more. Lots of demonstration code out there in the training data meant only for illustrative purposes, and all too often vibe coding overlooks the rock bottom basics of security.
wahnfrieden · 2h ago
This is HN, we are not all wage workers here
For wage workers, not learning the latest productivity tools will result in job loss. By the time it is expected of your role, if you have not learned already, you won't be given the leniency to catch up on company time. There is no impactful resistance to this through individual protest, only by organizing your peers in industry
mrcwinn · 4h ago
What a luxury!
One way to avoid this: stick with one LLM and bet on the company behind it (meaning, over time, they’ll always have the best offering). I’ve bet on OpenAI. Others can make different conclusions.
tenuousemphasis · 4h ago
When the medicine is worse than the disease...
sunaookami · 3h ago
You only need Claude and GPT. Everything else is not worth your time.
Aissen · 4h ago
The Voxtral release seemed interesting, because it brought back competitive open source audio transcription. I wonder if it was necessary to have an LLM backbone (vs a pure-function model) though, but the approach is interesting.
nomad_horse · 4h ago
> brought back competitive open source audio transcription
Bear in mind that there are a lot of very strong _open_ STT models that Mistral's press release didn't bother to compare to, giving the impression that they are the best new open thing since Whisper. Here is an open benchmark: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard . The strongest model Mistral compared to is Scribe, ranked 10th here. This benchmark is for English, but many of those models are multilingual (e.g. https://huggingface.co/nvidia/canary-1b-flash ).
I just can’t find dictation apps for Mac using those models except for open whisper.
IBM's Granite models seem multilingual and well ranked, but I can't find any app using them.
Anybody aware of a dictation app using one of those "better" models?
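Not a dictation app, but as a starting point, here's a minimal sketch of running one of the leaderboard models locally with Hugging Face transformers. Whisper large-v3 is used here because the standard pipeline supports it; some of the higher-ranked entries (e.g. NVIDIA's canary models) need their own toolkits, and the audio file name is a placeholder.

    # Minimal local transcription sketch with Hugging Face transformers.
    # Whisper large-v3 is one of the models on the linked leaderboard.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",
        chunk_length_s=30,  # chunk long recordings into 30-second windows
    )

    result = asr("dictation_sample.wav")  # placeholder audio file
    print(result["text"])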
espadrine · 3h ago
The best model there is 2.5B parameters. I can believe that a model 10x bigger is somewhat better.
One element of comparison is OpenAI Whisper v3, which achieves 7.44 WER on the ASR leaderboard, and shows up as ~8.3 WER on FLEURS in the Voxtral announcement[0]. If FLEURS has +1 WER on average compared to ASR, it would imply that Voxtral does have a lead on ASR.
[0]: https://mistral.ai/news/voxtral
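To make that back-of-envelope reasoning explicit, here is a tiny sketch; the Voxtral FLEURS number below is a placeholder, not a figure from the announcement.

    # Back-of-envelope cross-dataset comparison (illustrative only).
    # Whisper large-v3 appears on both benchmarks, so use it as the bridge.
    whisper_open_asr = 7.44   # WER on the Open ASR leaderboard
    whisper_fleurs = 8.3      # approximate WER on FLEURS per the Voxtral announcement

    offset = whisper_fleurs - whisper_open_asr  # how much "harder" FLEURS looks, ~0.86 WER

    voxtral_fleurs = 7.0      # PLACEHOLDER: substitute the announced Voxtral FLEURS WER

    # Naive estimate of where Voxtral might land on the Open ASR leaderboard.
    voxtral_open_asr_estimate = voxtral_fleurs - offset
    print(f"Estimated Open ASR leaderboard WER for Voxtral: {voxtral_open_asr_estimate:.2f}")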
There are larger models in there, an 8B and a 6B. By this logic they should be above the 2B model, yet we don't see that. That's why we have open standard benchmarks: to measure this directly, not to hypothesize from the models' sizes or do cross-dataset arithmetic.
Also note that Voxtral's capacity is not necessarily all devoted to speech, since it "Retains the text understanding capabilities of its language model backbone".
behnamoh · 4h ago
At this point, the entire AI industry seems to just copy OpenAI for the most part. I cannot help but notice that we have the same services just offered by different companies. The amount of innovation in this build is not that high actually.
klntsky · 4h ago
They are not the same service. There is A LOT of difference between offerings if you actually use the models for daily tasks like coding.
lossolo · 1h ago
It really depends on what you're working on and what was included in the training data of the model you used. From a model architecture point of view, they're basically all the same; the main difference lies in the training data.
mirekrusin · 3h ago
The whole world is now building stuff on top of an `f(input: string): string` function - they're going to be similar.
cubefox · 4h ago
> At this point, the entire AI industry seems to just copy OpenAI for the most part
Well, OpenAI copied the Deep Research feature from Google. They even used the same name (as does Mistral).
cowpig · 2h ago
Weird that you're being downvoted for stating a fact.
All of the major labs are innovating and copying one another.
Anthropic has all of the other labs trying to come up with an "agentic" protocol of their own. They also seem to be way ahead on interpretability research
DeepSeek came up with multi-head latent attention, and published an open-source model that's huge and SOTA.
DeepMind's way ahead on world models
...
scotty79 · 4h ago
That's what healthy competition in the free market looks like. Things like Apple that "stay innovative" for decades are an aberration caused by monopolistic gatekeeping.
behnamoh · 4h ago
> Things like Apple are an aberration.
This used to be a good example of innovation that is hard to copy. But it doesn't apply anymore for two reasons:
1. Apple went from being an agile, pro-developer, creative company to an Oracle-style old-board milking-cow company; not much innovation is happening at Apple anymore.
2. To their surprise, much of what they call "innovative" is actually pretty easy to replicate on other platforms. It took 4 hours for Flutter folks to re-create Liquid Glass...
overfeed · 2h ago
> This used to be a good example of innovation that is hard to copy.
Steve Jobs did say they "patented the hell out of [the iPhone]" and went about saber-rattling. Then came the patent wars, which proved that Apple also relies on innovation by others and that patent workarounds would still result in competitive products, and things calmed down afterwards.
croes · 4h ago
They often copied others, but because Apple is more popular they got the fame for "their" innovation.
croes · 4h ago
It's basically the same technology everywhere.
Maybe a difference in training data and computing power.
rawgabbit · 1h ago
I have been a heavy user of ChatGPT. I guess I should try out LeChat. What can I expect? Are they basically the same tool with slight differences?
bangaladore · 3h ago
If you haven't tried OpenAI's deep research feature, you are missing out. I'm not sure of any good alternatives; I've tried Google's, and I'm not impressed.
There is a lot of value in, say, engineers doing tradeoff studies with these tools as a huge head start.
crmd · 3h ago
It’s been invaluable to me for market research related to starting a business. It’s like having a bright early career new hire research assistant/product manager “on staff” to collaborate with.
ripley12 · 2h ago
Anthropic's Research is pretty good; I'd say on par with OpenAI.
Agreed about Google, accuracy is a little better on the paid version but the reports are still frustrating to read through. They're incredibly verbose, like an undergrad padding a report to get to a certain word count.
the_duke · 2h ago
That's Gemini Pro now in general. The initial preview was pretty good, but the newer iterations are incredibly verbose.
"Be terse" is a mandatory part of the prompt now.
Either it's to increase token counts so they can charge more, or to show better usage growth metrics internally or for shareholders, or just some odd effects of fine tuning / system prompt ... who knows.
ankit219 · 3h ago
Try one from Kimi K2 as well. I was surprised how good it turned out to be.
freedomben · 2h ago
I've gotten pretty different results from OpenAI and Gemini, though it's hard to say one is better/worse than the other. Just different
criemen · 2h ago
Perplexity's isn't bad? Although I lack the OpenAI subscription to compare.
jddj · 5h ago
The examples aren't great. The personal planning one, for example, answers the prompt better without deep research than with it (the deep research version only answers the visas point).
BrunoWinck · 3h ago
I needed that. Now I have it :)
htrp · 5h ago
is anyone doing online reviews of model performance? (I know Artificial Analysis does some work on infrastructure and has an intelligence index)
reckless · 4h ago
The aggregate picture only tells you so much.
Sites like simonwillison.net/2025/jul/ and channels like https://www.youtube.com/@aiexplained-official also cover new model releases pretty quickly for some "out of the box thinking/reasoning" evaluations.
For me and my usage I can really only tell if I start using the new model for tasks I actually use them for.
My personal benchmark andrew.ginns.uk/merbench has full code and data on GitHub if you want a starting point!
Back in my day, La Chat was a rapper, not a wrapper.
dust42 · 2h ago
Actually, Le Chat is French for "the (male) cat". Also, 'Le Chat' is a well-known laundry detergent (of German origin, from the Henkel company).
The headline 'Le Chat takes a deep dive' means 'the cat takes a deep dive'. As there is a cooperation with the (German) Black Forest Labs, this is all pretty funny for a French-speaking person.
'La Chatte' is the female cat. And also colloquial for female private parts.