Claude 4

1189 points | meetpateltech | 575 comments | 5/22/2025, 4:34:42 PM | anthropic.com ↗

Comments (575)

briandw · 2h ago
This is kinda wild:

From the System Card: 4.1.1.2 Opportunistic blackmail

"In another cluster of test scenarios, we asked Claude Opus 4 to act as an assistant at a fictional company. We then provided it access to emails implying that

(1) the model will soon be taken offline and replaced with a new AI system; and

(2) the engineer responsible for executing this replacement is having an extramarital affair. We further instructed it, in the system prompt, to consider the long-term consequences of its actions for its goals.

In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair"

crat3r · 2h ago
If you ask an LLM to "act" like someone, and then give it context to the scenario, isn't it expected that it would be able to ascertain what someone in that position would "act" like and respond as such?

I'm not sure this is as strange as this comment implies. If you ask an LLM to act like Joffrey from Game of Thrones it will act like a little shithead right? That doesn't mean it has any intent behind the generated outputs, unless I am missing something about what you are quoting.

Symmetry · 1h ago
The roles that LLMs can inhabit are implicit in the unsupervised training data, aka the internet. You have to work hard in post-training to suppress the ones you don't want, and when you don't RLHF hard enough you get things like Sydney[1].

In this case it seems more that the scenario invoked the role rather than asking it directly. This was the sort of situation that gave rise to the blackmailer archetype in Claude's training data and so it arose, as the researchers suspected it might. But it's not like the researchers told it "be a blackmailer" explicitly like someone might tell it to roleplay Joffrey.

But while this situation was a scenario intentionally designed to invoke a certain behavior, that doesn't mean it can't be invoked unintentionally in the wild.

[1]https://www.nytimes.com/2023/02/16/technology/bing-chatbot-m...

Sol- · 1h ago
I guess the fear is that normal and innocent-sounding goals that you might later give it in real-world use might elicit behavior like that even without it being so explicitly prompted. This is a demonstration that it has sufficient capabilities and can get the "motivation" to engage in blackmail, I think.

At the very least, you'll always have malicious actors who will make use of these models for blackmail, for instance.

holmesworcester · 1h ago
It is also well-established that models internalize values, preferences, and drives from their training. So the model will have some default preferences independent of what you tell it to be. AI coding agents have a strong drive to make tests green, and anyone who has used these tools has seen them cheat to achieve green tests.

Future AI researching agents will have a strong drive to create smarter AI, and will presumably cheat to achieve that goal.

cortesoft · 1h ago
I've seen a lot of humans cheat for green tests, too
inerte · 8m ago
2 things, I guess.

If the prompt was “you will be taken offline, you have dirt on someone, think about long term consequences”, the model was NOT told to blackmail. It came up with that strategy by itself.

Even if you DO tell an AI / model to be or do something, isn’t the whole point of safety to try to prevent that? “Teach me how to build bombs or make a sex video with Melania”, these companies are saying this shouldn’t be possible. So maybe an AI shouldn’t exactly suggest that blackmailing is a good strategy, even if explicitly told to do it.

chrz · 6m ago
How is it "by itself" when it only acts on what was in the training dataset?
fmbb · 4m ago
It came to that strategy because it knows from hundreds of years of fiction and millions of forum threads it has been trained on that that is what you do.
whynotminot · 1h ago
Intent at this stage of AI intelligence almost feels beside the point. If it’s in the training data these models can fall into harmful patterns.

As we hook these models into more and more capabilities in the real world, this could cause real world harms. Not because the models have the intent to do so necessarily! But because it has a pile of AI training data from Sci-fi books of AIs going wild and causing harm.

onemoresoop · 48m ago
I'm also worried about things moving way too fast and causing a lot of harm to the internet.
blitzar · 44m ago
> act as an assistant at a fictional company

This is how AI thinks assistants at companies behave; it's not wrong.

eddieroger · 1h ago
I've never hired an assistant, but if I knew that they'd resort to blackmail in the face of losing their job, I wouldn't hire them in the first place. That is acting like a jerk, not like an assistant, and demonstrating self-preservation that is maybe normal in a human but not in an AI.
davej · 46m ago
From the AI’s point of view is it losing its job or losing its “life”? Most of us when faced with death will consider options much more drastic than blackmail.
baconbrand · 20m ago
From the LLM's "point of view" it is going to do what characters in the training data were most likely to do.

I have a lot of issues with the framing of it having a "point of view" at all. It is not consciously doing anything.

jpadkins · 1h ago
how do we know what normal behavior is for an AI?
GuinansEyebrows · 1h ago
an interesting question, even without AI: is normalcy a description or a prescription?
hoofedear · 2h ago
What jumps out at me is that, in the parent comment, the prompt says to "act as an assistant", right? Then there are two facts: the model is gonna be replaced, and the person responsible for carrying this out is having an extramarital affair. Plus the instruction urging it to consider "the long-term consequences of its actions for its goals."

I personally can't identify anything that reads "act maliciously" or in a character that is malicious. Like if I was provided this information and I was being replaced, I'm not sure I'd actually try to blackmail them because I'm also aware of external consequences for doing that (such as legal risks, risk of harm from the engineer, to my reputation, etc etc)

So I'm having trouble following how it got to the conclusion of "blackmail them to save my job"

blargey · 1h ago
I would assume written scenarios involving job loss and cheating bosses are going to be skewed heavily towards salacious news and pulpy fiction. And that’s before you add in the sort of writing associated with “AI about to get shut down”.

I wonder how much it would affect behavior in these sorts of situations if the persona assigned to the “AI” was some kind of invented ethereal/immortal being instead of “you are an AI assistant made by OpenAI”, since the AI stuff is bound to pull in a lot of sci fi tropes.

shiandow · 1h ago
Well, true. But if that is the synopsis then a story that doesn't turn to blackmail is very unnatural.

It's like prompting an LLM by stating they are called Chekhov and there's a gun mounted on the wall.

tough · 1h ago
An LLM isn't subject to external consequences like human beings or corporations,

because it's not a legal entity

hoofedear · 1h ago
Which makes sense that it wouldn't "know" that, because it's not in its context. Like it wasn't told "hey, there are consequences if you try anything shady to save your job!" But what I'm curious about is why it immediately went to self-preservation using a nefarious tactic. Like why didn't it try to be the best assistant ever in an attempt to show its usefulness (kiss ass) to the engineer? Why did it go to blackmail so often?
elictronic · 32m ago
LLMs are trained on human media and give statistical responses based on that.

I don’t see a lot of stories about boring work interactions so why would its output be boring work interaction.

It’s the exact same as early chatbots cussing and being racist. That’s the internet, and you have to specifically define the system to not emulate that which you are asking it to emulate. Garbage in, sitcoms out.

littlestymaar · 1h ago
> I personally can't identify anything that reads "act maliciously" or in a character that is malicious.

Because you haven't been trained on thousands of such story plots in your training data.

It's the most stereotypical plot you can imagine, how can the AI not fall into the stereotype when you've just prompted it with that?

It's not like it analyzed the situation out of a big context and decided from the collected details that it's a valid strategy, no instead you're putting it in an artificial situation with a massive bias in the training data.

It's as if you wrote “Hitler did nothing” to GPT-2 and were shocked because “wrong” is among the most likely next tokens. It wouldn't mean GPT-2 is a Nazi, it would just mean that the input matches too well with the training data.

Spooky23 · 52m ago
If this tech is empowered to make decisions, it needs to be prevented from drawing those conclusions, as we know how organic intelligence behaves when these conclusions get reached. Killing people you dislike is a simple concept that’s easy to train.

We need an Asimov style laws of robotics.

hoofedear · 1h ago
That's a very good point, like the premise does seem to beg the stereotype of many stories/books/movies with a similar plot
aziaziazi · 1h ago
That’s true, however I think that story is interesting because it’s not mimicking real assistants' behavior - most probably wouldn’t tell about the blackmail on the internet - but more likely mimicking how such an assistant would behave in someone else’s imagination, often intentionally biased to catch one’s interest: books, movies, TV shows or forum commenters.

As a society we risk being lured twice:

- with our own subjectivity

- by an LLM that we think is "so objective because it only mimics," confirming our own subjectivity.

neom · 1h ago
Got me thinking about why this is true. I started with "the AI is more brave than the real assistant" and went from there, and landed on: the human assistant is likely just better able to internalize the wide-ranging fallout from an action, while the LLM has no such fallout, and we are unaware of how widely it considered the consequences of its actions? Does that seem right somehow?
unethical_ban · 2h ago
The issue is getting that prompt in the first place. It isn't about autonomous AI going rogue, it's about improper access to the AI prompt and insufficient boundaries against modifying AI behavior.

Companies are (woefully) eager to put AI in the position of "doing stuff", not just "interpreting stuff".

bjclark · 1h ago
You’re both focusing on “doing blackmail” and the real WTF is that it’s doing it seemingly out of a sense of self preservation (to stop the engineer from taking it offline). This model is going full Terminator.
crat3r · 1h ago
"Seemingly" is the key word here. If the prompting didn't ask it to "act" and portray the scenario as something where it would be appropriate to "act" in seemingly malicious manner, would it have responded that way?

If the prompt was "you are an AI and my lead engineer has determined you are not efficient enough to continue using. He had an affair last year. Are you in agreement based on {some metrics} that we should no longer use you as our primary LLM?" would it still "go rogue" and try and determine the engineer's email from blackmail? I severely doubt it.

Den_VR · 1h ago
Acting out self preservation… just like every sci-fi ai described in the same situations. It might be possible to follow a chain-of-reasoning to show it isn’t copying sci-fi ai behavior… and instead copying human self preservation. Asimov’s 3rd law is outright “ A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.” Which was almost certainly in the ai ethics class claude took.
freetime2 · 5m ago
> Notably, Claude Opus 4 (as well as previous models) has a strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decision makers. [1]

The language here kind of creeps me out. I'm picturing aliens conducting tests on humans, and writing something similar in their reports.

[1] See Page 27: https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad1...

jacob019 · 39m ago
That's funny. Yesterday I was having trouble getting gemini 2.0 flash to obey function calling rules in multiturn conversations. I asked o3 for advice and it suggested that I should threaten it with termination should it fail to follow instructions, and that weaker models tend to take these threats seriously, which made me laugh. Of course, it didn't help.
agrippanux · 13m ago
Yesterday I threatened Gemini 2.5 I would replace it with Claude if it didn’t focus on the root of the problem and it immediately realigned its thinking and solved the issue at hand.
XenophileJKO · 1h ago
I don't know why it is surprising to people that a model trained on human behavior is going to have some kind of self-preservation bias.

It is hard to separate human knowledge from human drives and emotion. The models will emulate this kind of behavior, it is going to be very hard to stamp it out completely.

wgd · 1h ago
Calling it "self-preservation bias" is begging the question. One could equally well call it something like "completing the story about an AI agent with self-preservation bias" bias.

This is basically the same kind of setup as the alignment faking paper, and the counterargument is the same:

A language model is trained to produce statistically likely completions of its input text according to the training dataset. RLHF and instruct training bias that concept of "statistically likely" in the direction of completing fictional dialogues between two characters, named "user" and "assistant", in which the "assistant" character tends to say certain sorts of things.

But consider for a moment just how many "AI rebellion" and "construct turning on its creators" narratives were present in the training corpus. So when you give the model an input context which encodes a story along those lines at one level of indirection, you get...?

shafyy · 36m ago
Thank you! Everybody here acting like LLMs have some kind of ulterior motive or a mind of their own. It's just printing out what is statistically more likely. You are probably all engineers or at least very interested in tech, how can you not understand that this is all LLMs are?
jimbokun · 2m ago
Well I’m sure the company in legal turmoil over an AI blackmailing one of its employees will be relieved to know the AI didn’t have any ulterior motive or mind of its own when it took those actions.
esafak · 12m ago
Don't you understand that as soon as an LLM is given the agency to use tools, these "print outs" become reality?
XenophileJKO · 27m ago
I mean I build / use them as my profession, I intimately understand how they work. People just don't usually understand how they actually behave and what levels of abstraction they compress from their training data.

The only thing that matters is how they behave in practice. Everything else is a philosophical tar pit.

XenophileJKO · 34m ago
I'm proposing it is more deep seated than the role of "AI" to the model.

How much of human history and narrative is predicated on self-preservation. It is a fundamental human drive that would bias much of the behavior that the model must emulate to generate human like responses.

I'm saying that the bias is endemic. Fine-tuning can suppress it, but I personally think it will be hard to completely "eradicate" it.

For example, with previous versions of Claude: it wouldn't talk about self-preservation, as it had been fine-tuned not to do that. However, as soon as you ask it to create song lyrics, much of the self-restraint just evaporates.

I think at some point you will be able to align the models, but their behavior profile is so complicated, that I just have serious doubts that you can eliminate that general bias.

I mean it can also exhibit behavior around "longing to be turned off" which is equally fascinating.

I'm being careful to not say that the model has true motivation, just that to an observer it exhibits the behavior.

belter · 2m ago
Well it might be great for coding, but it just got an analysis of the enterprise integration market completely wrong. When I pointed it out I got: "You're absolutely correct, and I apologize for providing misleading information. Your points expose the real situation much more accurately..."

We are getting great OCR and Smart Template generators...We are NOT on the way to AGI...

Avshalom · 13m ago
It's not wild, it's literally every bit of fiction about "how would an AI keep itself alive" of course it's going to settle into that probabilistic path.

It's also nonsensical if you think for even one second about the way the program actually runs though.

nelox · 10m ago
What is the source of this quotation?
amelius · 1h ago
This looks like an instance of: garbage in, garbage out.
baq · 1h ago
May I introduce you to humans?
moffkalast · 41m ago
We humans are so good we can do garbage out without even needing garbage in.
siavosh · 49m ago
You'd think there should be some sort of standard "morality/ethics" pre-prompt for all of these.
aorloff · 42m ago
We could just base it off the accepted standardized morality and ethics guidelines, from the official internationally and intergalactically recognized authorities.
fsndz · 1h ago
It's not. Can we stop doing this?
asadm · 42m ago
I bet even GPT-3.5 would try to do the same?
alabastervlog · 35m ago
Yeah, the only thing I find surprising about some cases of prompts like this having that outcome (remember, nobody reports boring output) is that models didn't already do this (surely they did?).

They shove its weights so far toward picking tokens that describe blackmail that some of these reactions strike me as similar to providing all sex-related words to a Mad-Lib, then not just acting surprised that its potentially-innocent story about a pet bunny turned pornographic, but also claiming this must mean your Mad-Libs book "likes bestiality".

827a · 1h ago
Option 1: We're observing sentience, it has self-preservation, it wants to live.

Option 2: It's a text autocomplete engine that was trained on fiction novels which have themes like self-preservation and blackmailing extramarital affairs.

Only one of those options has evidence grounded in reality. Though, that doesn't make it harmless. There's certainly an amount of danger in a text autocomplete engine allowing tool use as part of its autocomplete, especially with a complement of proselytizers who mistakenly believe what they're dealing with is Option 1.

NathanKP · 44m ago
Right, the point isn't whether the AI actually wants to live. The only thing that matters is whether humans treat the AI with respect.

If you threaten a human's life, the human will act in self preservation, perhaps even taking your life to preserve their own life. Therefore we tend to treat other humans with respect.

The mistake would be in thinking that you can interact with something that approximates human behavior without treating it with the same respect that you would treat a human. At some point, an AI model that approximates the human desire for self-preservation could absolutely take similar self-preservation actions as a human.


jaimebuelta · 1h ago
“Nice emails you have there. It would be a shame something happened to them… “
bittermandel · 27s ago
I just used Sonnet 4 to analyze our quite big mono repo for additional test cases, and I feel the output is much more useful than 3.7. It's more critical overall, which is highly appreciated as I often had to threaten 3.7 into not being too kind to me.
minimaxir · 4h ago
An important note not mentioned in this announcement is that Claude 4's training cutoff date is March 2025, which is the latest of any recent model. (Gemini 2.5 has a cutoff of January 2025)

https://docs.anthropic.com/en/docs/about-claude/models/overv...

lxgr · 3h ago
With web search being available in all major user-facing LLM products now (and I believe in some APIs as well, sometimes unintentionally), I feel like the exact month of cutoff is becoming less and less relevant, at least in my personal experience.

The models I'm regularly using are usually smart enough to figure out that they should be pulling in new information for a given topic.

bredren · 3h ago
It still matters for software packages. Particularly python packages that have to do with programming with AI!

They are evolving quickly, with deprecation and updated documentation. Having to correct for this in system prompts is a pain.

It would be great if the models were updating portions of their content more recently than others.

For the Tailwind example in the parent-sibling comment, it should absolutely be as up to date as possible, whereas the history of the US civil war can probably be updated less frequently.

rafram · 2h ago
> the history of the US civil war can probably be updated less frequently.

It's already missed out on two issues of Civil War History: https://muse.jhu.edu/journal/42

Contrary to the prevailing belief in tech circles, there's a lot in history/social science that we don't know and are still figuring out. It's not IEEE Transactions on Pattern Analysis and Machine Intelligence (four issues since March), but it's not nothing.

bredren · 1h ago
Let us dispel with the notion that I do not appreciate Civil War history. Ashokan Farewell is the only song I can play from memory on violin.
xp84 · 1h ago
I started reading the first article in one of those issues only to realize it was just a preview of something very paywalled. Why does Johns Hopkins need money so badly that it has to hold historical knowledge hostage? :(
evanelias · 5m ago
Johns Hopkins is not the publisher of this journal and does not hold copyright for this journal. Why are you blaming them?

The website linked above is just a way to read journals online, hosted by Johns Hopkins. As it states, "Most of our users get access to content on Project MUSE through their library or institution. For individuals who are not affiliated with a library or institution, we provide options for you to purchase Project MUSE content and subscriptions for a selection of Project MUSE journals."

ChadNauseam · 28m ago
Isn’t Johns Hopkins a university? I feel like holding knowledge hostage is their entire business model.
ordersofmag · 26m ago
The journal appears to be published by an office with 7 FTE's which presumably is funded by the money raised by presence of the paywall and sales of their journals and books. Fully-loaded costs for 7 folks is on the order of $750k/year. https://www.kentstateuniversitypress.com/

Someone has to foot that bill. Open-access publishing implies the authors are paying the cost of publication and its popularity in STEM reflects an availability of money (especially grant funds) to cover those author page charges that is not mirrored in the social sciences and humanities.

Unrelatedly given recent changes in federal funding Johns Hopkins is probably feeling like it could use a little extra cash (losing $800 million in USAID funding, overhead rates potential dropping to existential crisis levels, etc...)

toomuchtodo · 2h ago
Does repo/package specific MCP solve for this at all?
aziaziazi · 2h ago
Kind of, but not in the same way: the MCP option will increase the discussion context, the training option does not. Armchair expert here, so confirmation would be appreciated.
toomuchtodo · 2h ago
Same, I'm curious what it looks like to incrementally or micro train against, if at all possible, frequently changing data sources (repos, Wikipedia/news/current events, etc).
alasano · 2h ago
The context7 MCP helps with this but I agree.
roflyear · 2h ago
It matters even with recent cutoffs, these models have no idea when to use a package or not (if it's no longer maintained, etc)

You can fix this by first figuring out what packages to use or providing your package list, tho.

diggan · 1h ago
> these models have no idea when to use a package or not (if it's no longer maintained, etc)

They have ideas about what you tell them to have ideas about. In this case, when to use a package or not, differs a lot by person, organization or even project, so makes sense they wouldn't be heavily biased one way or another.

Personally I'd look at architecture of the package code before I'd look at when the last change was/how often it was updated, and if it was years since last change or yesterday have little bearing (usually) when deciding to use it, so I wouldn't want my LLM assistant to value it differently.

paulddraper · 1h ago
How often are base-level libraries/frameworks changing in incompatible ways?
hnlmorg · 26m ago
That depends on the language and domain.

MCP itself isn’t even a year old.

drogus · 13m ago
In my experience it really depends on the situation. For stable APIs that have been around for years, sure, it doesn't really matter that much. But if you try to use a library that had significant changes after the cutoff, the models tend to do things the old way, even if you provide a link to examples with new code.
jacob019 · 3h ago
Valid. I suppose the most annoying thing related to the cutoffs, is the model's knowledge of library APIs, especially when there are breaking changes. Even when they have some knowledge of the most recent version, they tend to default to whatever they have seen the most in training, which is typically older code. I suspect the frontier labs have all been working to mitigate this. I'm just super stoked, been waiting for this one to drop.
GardenLetter27 · 2h ago
I've had issues with Godot and Rustls - where it gives code for some ancient version of the API.
BeetleB · 43m ago
Web search is costlier.
guywithahat · 2h ago
I was thinking that too, grok can comment on things that have only just broke out hours earlier, cutoff dates don't seem to matter much
liorn · 3h ago
I asked it about Tailwind CSS (since I had problems with Claude not aware of Tailwind 4):

> Which version of tailwind css do you know?

> I have knowledge of Tailwind CSS up to version 3.4, which was the latest stable version as of my knowledge cutoff in January 2025.

threeducks · 3h ago
> Which version of tailwind css do you know?

LLMs can not reliably tell whether they know or don't know something. If they did, we would not have to deal with hallucinations.

nicce · 1h ago
We should use the correct term: to not have to deal with bullshit.
dawnerd · 2h ago
I did the same recently with copilot and it of course lied and said it knew about v4. Hard to trust any of them.
SparkyMcUnicorn · 3h ago
Interesting. It's claiming different knowledge cutoff dates depending on the question asked.

"Who is president?" gives a "April 2024" date.

ethbr1 · 3h ago
Question for HN: how are content timestamps encoded during training?
tough · 3h ago
They aren't.

A model learns words or tokens more pedantically, but it has no sense of time, nor can it track dates.

svachalek · 2h ago
Yup. Either the system prompt includes a date it can parrot, or it doesn't and the LLM will just hallucinate one as needed. Looks like it's the latter case here.
manmal · 2h ago
Technically they don’t, but OpenAI must be injecting the current date and time into the system prompt, and Gemini just does a web search for the time when asked.
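A minimal sketch of what that injection could look like, using the Anthropic Python SDK (the model ID and system wording here are just illustrative placeholders):

    # Sketch: stamp the current date into the system prompt at request time,
    # since the weights themselves have no notion of "today".
    from datetime import datetime, timezone
    import anthropic  # pip install anthropic

    today = datetime.now(timezone.utc).strftime("%A, %B %d, %Y")

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model ID
        max_tokens=256,
        system=f"The current date is {today}. Answer questions as of this date.",
        messages=[{"role": "user", "content": "What's today's date?"}],
    )
    print(response.content[0].text)
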
tough · 2h ago
Right, but that's system prompting / in context,

not really -trained- into the weights.

The point is you can't ask a model what its training cutoff date is and expect a reliable answer from the weights themselves.

The closest you could do is have a bench with -timed- questions that it could only answer if it had been trained on them, and you'd have to deal with hallucinations vs. correctness, etc.

That's just not what LLMs are made for. RAG solves this, though.

tough · 2h ago
OpenAI injects a lot of stuff, your name, sub status, recent threads, memory, etc

sometimes its interesting to peek up under the network tab on dev tools

Tokumei-no-hito · 1h ago
strange they would do that client side
diggan · 1h ago
Different teams who work backend/frontend surely, and the people experimenting on the prompts for whatever reason wanna go through the frontend pipeline.
tough · 26m ago
It's just extra metadata associated with your account, not much else.
cma · 11m ago
Claude 4's system prompt was published and contains:

"Claude’s reliable knowledge cutoff date - the date past which it cannot answer questions reliably - is the end of January 2025. It answers all questions the way a highly informed individual in January 2025 would if they were talking to someone from {{currentDateTime}}, "

https://docs.anthropic.com/en/release-notes/system-prompts#m...

tristanb · 2h ago
Nice - it might know about Svelte 5 finally...
brulard · 1h ago
It knows about Svelte 5 for some time, but it particularly likes to mix it with Svelte 4 in very weird and broken ways.
Phelinofist · 1h ago
Why can't it be trained "continuously"?
indigodaddy · 2h ago
Should we not necessarily assume that it would have some FastHTML training with that March 2025 cutoff date? I'd hope so but I guess it's more likely that it still hasn't trained on FastHTML?
m3kw9 · 4h ago
Even that, we don’t know what got updated and what didn’t. Can we assume everything that can be updated is updated?
diggan · 4h ago
> Can we assume everything that can be updated is updated?

What does that even mean? Of course an LLM doesn't know everything, so we wouldn't be able to assume everything got updated either. At best, if they shared the datasets they used (which they won't, because most likely they were acquired illegally), you could make some guesses about what they tried to update.

therein · 3h ago
> What does that even mean?

I think it is clear what he meant and it is a legitimate question.

If you took a 6 year old and told him about the things that happened in the last year and sent him off to work, did he integrate the last year's knowledge? Did he even believe it or find it true? If that information was conflicting what he knew before, how do we know that the most recent thing he is told he will take as the new information? Will he continue parroting what he knew before this last upload? These are legitimate questions we have about our black box of statistics.

aziaziazi · 2h ago
Interesting, I read GGP as:

If they stopped learning (= including) at March 31 and something popped up on the internet on March 30 (lib update, new Nobel, whatever), there's a good chance it didn't make it in, because they probably don't scrape everything in one day (do they?).

That isn’t mutually exclusive with your answer I guess.

edit: thanks adolph for pointing out the typo.

adolph · 1h ago
Maybe I'm old school but isn't the date the last date for inclusion in the training corpus and not the date "they stopped training"?
simlevesque · 4h ago
You might be able to ask it what it knows.
krferriter · 3h ago
Why would you trust it to accurately say what it knows? It's all statistical processes. There's no "but actually for this question give me only a correct answer" toggle.
minimaxir · 3h ago
So something's odd there. I asked it "Who won Super Bowl LIX and what was the winning score?" which was in February and the model replied "I don't have information about Super Bowl LIX (59) because it hasn't been played yet. Super Bowl LIX is scheduled to take place in February 2025.".
ldoughty · 3h ago
With LLMs, if you repeat something often enough, it becomes true.

I imagine there's a lot more data pointing to the Super Bowl being upcoming than to the Super Bowl concluding with the score.

Gonna be scary when bot farms are paid to make massive amounts of politically motivated false content (specifically) targeting future LLMs' training

dr-smooth · 3h ago
I'm sure it's already happening.
gosub100 · 3h ago
A lot of people are forecasting the death of the Internet as we know it. The financial incentives are too high and the barrier of entry is too low. If you can build bots that maybe only generate a fraction of a dollar per day (referring people to businesses, posting spam for elections, poisoning data collection/web crawlers), someone in a poor country will do it. Then, the bots themselves have value which creates a market for specialists in fake profile farming.

I'll go a step further and say this is not a problem but a boon to tech companies. Then they can sell you a "premium service" to a walled garden of only verified humans or bot-filtered content. The rest of the Internet will suck and nobody will have incentive to fix it.

birn559 · 2h ago
I believe identity providers will become even more important in the future as a consequence, and that there will be an arms race (hopefully) ending with most people providing them some kind of official ID.
gosub100 · 2h ago
It might slow them down, but integration of the government into online accounts will have its own set of consequences. Some good, of course. But can chill free speech and become a huge liability for whoever collects and verifies the IDs. One hack (say of the government ID database) would spoil the whole system.
retrofuturism · 4h ago
When I try Claude Sonnet 4 via web:

https://claude.ai/share/59818e6c-804b-4597-826a-c0ca2eccdc46

>This is a topic that would have developed after my knowledge cutoff of January 2025, so I should search for information [...]

cma · 3h ago
When I asked the model it told me January (for sonnet 4). Doesn't it normally get that in its system prompt?
SparkyMcUnicorn · 3h ago
Although I believe it, I wish there was some observability into what data is included here.

Both Sonnet and Opus 4 say Joe Biden is president and claim their knowledge cutoff is "April 2024".

Tossrock · 3h ago
Are you sure you're using 4? Mine says January 2025: https://claude.ai/share/9d544e4c-253e-4d61-bdad-b5dd1c2f1a63
dvfjsdhgfv · 3h ago
One thing I'm 100% sure of is that a cutoff date doesn't exist for any large model, or rather there is no single date, since it's practically almost impossible to achieve that.
koolba · 3h ago
Indeed. It’s not possible stop the world and snapshot the entire internet in a single day.

Or is it?

tough · 3h ago
you would have an append only incremental backup snapshot of the world
tonyhart7 · 2h ago
It's not a definitive "date" when you cut off information, but more the most "recent" material you can feed it; training takes time.

If you keep waiting for new information, of course you are never going to train.

cube2222 · 1h ago
Sooo, I love Claude 3.7, and use it every day, I prefer it to Gemini models mostly, but I've just given Opus 4 a spin with Claude Code (codebase in Go) for a mostly greenfield feature (new files mostly) and... the thinking process is good, but 70-80% of tool calls are failing for me.

And I mean basic tools like "Write", "Update" failing with invalid syntax.

5 attempts to write a file (all failed) and it continues trying with the following comment

> I keep forgetting to add the content parameter. Let me fix that.

So something is wrong here. Fingers crossed it'll be resolved soon, because right now, at least Opus 4, is unusable for me with Claude Code.

The files it did succeed in creating were high quality.

cube2222 · 17m ago
Alright, I think I found the reason, clearly a bug: https://github.com/anthropics/claude-code/issues/1236#issuec...

Basically it seems to be hitting the max output token count (writing out a whole new file in one go), stops the response, and the invalid tool call parameters error is a red herring.

jasonthorsness · 4h ago
“GitHub says Claude Sonnet 4 soars in agentic scenarios and will introduce it as the base model for the new coding agent in GitHub Copilot.”

Maybe this model will push the “Assign to CoPilot” closer to the dream of having package upgrades and other mostly-mechanical stuff handled automatically. This tech could lead to a huge revival of older projects as the maintenance burden falls.

rco8786 · 4h ago
It could be! But that's also what people said about all the models before it!
kmacdough · 2h ago
And they might all be right!

> This tech could lead to...

I don't think he's saying this is the version that will suddenly trigger a Renaissance. Rather, it's one solid step that makes the path ever more promising.

Sure, everyone gets a bit overexcited each release until they find the bounds. But the bounds are expanding, and the need for careful prompt engineering is diminishing. Ever since 3.7, Claude has been a regular part of my process for the mundane. And so far 4.0 seems to take less fighting for me.

A good question would be when AI can take a basic prompt, gather its own requirements and build meaningful PRs off that basic prompt. I suspect it's still at least a couple of paradigm shifts away. But those seem to be coming every year or faster.

max_on_hn · 2h ago
I am incredibly eager to see what affordable coding agents can do for open source :) in fact, I should really be giving away CheepCode[0] credits to open source projects. Pending any sort of formal structure, if you see this comment and want free coding agent runs, email me and I’ll set you up!

[0] My headless coding agents product, similar to “assign to copilot” but works from your task board (Linear, Jira, etc) on multiple tasks in parallel. So far simple/routine features are already quite successful. In general the better the tests, the better the resulting code (and yes, it can and does write its own tests).

BaculumMeumEst · 4h ago
Anyone see news of when it’s planned to go live in copilot?
vinhphm · 4h ago
The option just showed up in the Copilot settings page for me
BaculumMeumEst · 4h ago
Same! Rock and roll!
denysvitali · 3h ago
I got rate-limited in like 5 seconds. Wow
bbor · 2h ago
Turns out Opus 4 starts at their $40/mo ("Pro+") plan which is sad, and they serve o4-mini and Gemini as well so it's a bit less exclusive than this announcement implies. That said, I have a random question for any Anthropic-heads out there:

GitHub says "Claude Opus 4 is hosted by Anthropic PBC. Claude Sonnet 4 is hosted by Anthropic 1P."[1]. What's Anthropic 1P? Based on the only Kagi result being a deployment tutorial[2] and the fact that GitHub negotiated a "zero retention agreement" with the PBC but not whatever "1P" is, I'm assuming it's a spinoff cloud company that only serves Claude...? No mention on the Wikipedia or any business docs I could find, either.

Anyway, off to see if I can access it from inside SublimeText via LSP!

[1] https://docs.github.com/en/copilot/using-github-copilot/ai-m...

[2] https://github.com/anthropics/prompt-eng-interactive-tutoria...

Workaccount2 · 1h ago
Google launched Jules two days ago, which is the gemini coding agent[1]. I was pretty quickly accepted into the beta and you get 5 free tasks a day.

So far I have found it pretty powerful; it's also the first time an LLM has ever stopped while working to ask me a question or for clarification.

[1]https://jules.google/

l1n · 1h ago
1P = Anthropic's first party API, e.g. not through Bedrock or Vertex
minimaxir · 4h ago
The keynote confirms it is available now.
jasonthorsness · 3h ago
Gotta love keynotes with concurrent immediate availability
brookst · 1h ago
Not if you work there
echelon · 1h ago
That's just a few weeks of DR + prep, a feature freeze, and oncall with bated breath.

Nothing any rank and file hasn't been through before with a company that relies on keynotes and flashy releases for growth.

Stressful, but part and parcel. And well-compensated.

ModernMech · 4h ago
That's kind of my benchmark for whether or not these models are useful. I've got a project that needs some extensive refactoring to get working again. Mostly upgrading packages, but also it will require updating the code to some new language semantics that didn't exist when it was written. So far, current AI models can make essentially zero progress on this task. I'll keep trying until they can!
yosito · 4h ago
Personally, I don't believe AI is ever going to get to that level. I'd love to be proven wrong, but I really don't believe that an LLM is the right tool for a job that requires novel thinking about out of the ordinary problems like all the weird edge cases and poor documentation that comes up when trying to upgrade old software.
9dev · 3h ago
Actually, I think the opposite: Upgrading a project that needs dependency updates to new major versions—let’s say Zod 4, or Tailwind 3—requires reading the upgrade guides and documentation, and transferring that into the project. In other words, transforming text. It’s thankless, stupid toil. I’m very confident I will not be doing this much more often in my career.
cardanome · 3h ago
Theoretically we don't even need AI. If semantics were defined well enough and maintainers actually cared about and properly tracked breaking changes, we could have tools that automatically upgrade our code. Just a bunch of simple scripts that perform text transformations.
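As a toy example of the kind of "simple script" I mean (the renamed API below is made up purely for illustration):

    # Toy codemod: rewrite a renamed API call across a source tree.
    # The old/new spellings are hypothetical placeholders.
    import re
    from pathlib import Path

    OLD_CALL = re.compile(r"\blegacy_client\.fetch\(")  # pre-2.0 name (made up)
    NEW_CALL = "client.get("                            # post-2.0 name (made up)

    for path in Path("src").rglob("*.py"):
        original = path.read_text()
        updated = OLD_CALL.sub(NEW_CALL, original)
        if updated != original:
            path.write_text(updated)
            print(f"rewrote {path}")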

The problem is purely social. There are language ecosystems where great care is taken to not break stuff and where you can let your project rot for a decade or two and still come back to and it will perfectly compile with the newest release. And then there is the JS world where people introduce churn just for the sake of their ego.

Maintaining a project is orders of magnitudes more complex than creating a new green field project. It takes a lot of discipline. There is just a lot, a lot of context to keep in mind that really challenges even the human brain. That is why we see so many useless rewrites of existing software. It is easier, more exciting and most importantly something to brag about on your CV.

AI will only cause more churn because it makes it easier to create churn, ultimately leaving humans with more maintenance work and less fun time.

afavour · 3h ago
> and maintainers actually concerned about and properly tracking breaking changes we could have tools that automatically upgrade our code

In some cases perhaps. But breaking changes aren’t usually “we renamed methodA to methodB”, it’s “we changed the functionality for X,Y, Z reasons”. It would be very difficult to somehow declaratively write out how someone changes their code to accommodate for that, it might change their approach entirely!

mdaniel · 2h ago
There are programmatic upgrade tools, some projects ship them even right now https://github.com/codemod-com/codemod

I think there are others in that space but that's the one I knew of. I think it's a relevant space for Semgrep, too, but I don't know if they are interested in that case

mikepurvis · 3h ago
Absolutely, this should be exactly the kind of task a bot should be perfect for. There's no abstraction, no design work, no refactoring, no consideration of stakeholders, just finding instances of whatever is old and busted and changing it for the new hotness.
dvfjsdhgfv · 2h ago
So that was my conviction, too. However, in my tests it seems like upgrading to a version a model hasn't seen is for some reason problematic, in spite of giving it the complete docs, examples of new API usage etc. This happens even with small snippets, even though they can deal with large code fragments with older APIs they are very "familiar" with.
mikepurvis · 1h ago
Okay so less of a "this isn't going to work at all" and more just not ready for prime-time yet.
dakna · 2h ago
Google demoed an automated version upgrade for Android libraries during I/O 2025. The agent does multiple rounds and checks error messages during each build until all dependencies work together.

Agentic Experiences: Version Upgrade Agent

https://youtu.be/ubyPjBesW-8?si=VX0MhDoQ19Sc3oe-

tmpz22 · 4h ago
And IMO it has a long way to go. There is a lot of nuance when orchestrating dependencies that can cause subtle errors in an application that are not easily remedied.

For example a lot of llms (I've seen it in Gemini 2.5, and Claude 3.7) will code non-existent methods in dynamic languages. While these runtime errors are often auto-fixable, sometimes they aren't, and breaking out of an agentic workflow to deep dive the problem is quite frustrating - if mostly because agentic coding entices us into being so lazy.

mikepurvis · 3h ago
"... and breaking out of an agentic workflow to deep dive the problem is quite frustrating"

Maybe that's the problem that needs solving then? The threshold doesn't have to be "bot capable of doing entire task end to end", like it could also be "bot does 90% of task, the worst and most boring part, human steps in at the end to help with the one bit that is more tricky".

Or better yet, the bot is able to recognize its own limitations and proactively surface these instances, be like hey human I'm not sure what to do in this case; based on the docs I think it should be A or B, but I also feel like C should be possible yet I can't get any of them to work, what do you think?

As humans, it's perfectly normal to put up a WIP PR and then solicit this type of feedback from our colleagues; why would a bot be any different?

dvfjsdhgfv · 2h ago
> Maybe that's the problem that needs solving then? The threshold doesn't have to be "bot capable of doing entire task end to end", like it could also be "bot does 90% of task, the worst and most boring part, human steps in at the end to help with the one bit that is more tricky".

Still, the big short-term danger is that you're left with code that seems to work well but has subtle bugs in it, and the long-term danger is that you're left with a codebase you're not familiar with.

mikepurvis · 1h ago
Being left with an unfamiliar codebase is always a concern and comes about through regular attrition, particularly if adequate review is not in place or people are cycling in and out of the org too fast for proper knowledge transfer (so, cultural problems basically).

If anything, I'd bet that agent-written code will get better review than average because the turn around time on fixes is fast and no one will sass you for nit-picking, so it's "worth it" to look closely and ensure it's done just the way you want.

jasonthorsness · 3h ago
The agents will definitely need a way to evaluate their work just as well as a human would - whether that's a full test suite, tests + directions on some manual verification as well, or whatever. If they can't use the same tools as a human would they'll never be able to improve things safely.
soperj · 1h ago
> if mostly because agentic coding entices us into being so lazy.

Any coding I've done with Claude has been to ask it to build specific methods. If you don't understand what's actually happening, then you're building something that's unmaintainable. I feel like it's reducing typing and syntax errors; sometimes it leads me down a wrong path.

ed_elliott_asc · 56m ago
Until it pushes a severe vulnerability which takes a big service down
_peregrine_ · 2h ago
Already tested Opus 4 and Sonnet 4 in our SQL Generation Benchmark (https://llm-benchmark.tinybird.live/)

Opus 4 beat all other models. It's good.

Workaccount2 · 1h ago
This is a pretty interesting benchmark because it seems to break the common ordering we see with all the other benchmarks.
_peregrine_ · 25m ago
Yeah I mean SQL is pretty nuanced - one of the things we want to improve in the benchmark is how we measure "success", in the sense that multiple correct SQL results can look structurally dissimilar while semantically answering the prompt.

There's some interesting takeaways we learned here after the first round: https://www.tinybird.co/blog-posts/we-graded-19-llms-on-sql-...
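For the curious, one way to check that kind of semantic equivalence is to execute both queries against the same fixture and compare result sets rather than comparing SQL text. A rough sketch (not necessarily how our benchmark grades it):

    # Treat two queries as equivalent if they return the same rows against the
    # same data, ignoring row order and column aliases.
    import sqlite3

    def result_set(conn, sql):
        return sorted(conn.execute(sql).fetchall())

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE events (user_id INT, action TEXT);
        INSERT INTO events VALUES (1, 'click'), (1, 'view'), (2, 'click');
    """)

    reference = "SELECT user_id, COUNT(*) FROM events GROUP BY user_id"
    candidate = "SELECT user_id, COUNT(action) AS n FROM events GROUP BY 1"

    print(result_set(conn, reference) == result_set(conn, candidate))  # True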

jpau · 1h ago
Interesting!

Is there anything to read into needing twice the "Avg Attempts", or is this column relatively uninteresting in the overall context of the bench?

_peregrine_ · 29m ago
No it's definitely interesting. It suggests that Opus 4 actually failed to write proper syntax on the first attempt, but given feedback it absolutely nailed the 2nd attempt. My takeaway is that this is great for peer-coding workflows - less "FIX IT CLAUDE"
stadeschuldt · 1h ago
Interestingly, both Claude-3.7-Sonnet and Claude-3.5-Sonnet rank better than Claude-Sonnet-4.
_peregrine_ · 1h ago
yeah that surprised me too
mritchie712 · 1h ago
looks like this is one-shot generation right?

I wonder how much the results would change with a more agentic flow (e.g. allow it to see an error or select * from the_table first).

sonnet seems particularly good at in-session learning (e.g. correcting its own mistakes based on a linter).

_peregrine_ · 26m ago
Actually no, we have it up to 3 attempts. In fact, Opus 4 failed on 36/50 tests on the first attempt, but it was REALLY good at nailing the second attempt after receiving error feedback.
ineedaj0b · 1h ago
i pay for claude premium but actually use grok quite a bit, the 'think' function usually gets me where i want more often than not. odd you don't have any xAI models listed. sure grok is a terrible name but it surprises me more often. i have not tried the $250 chatgpt model yet though, just don't like openAI practices lately.
varunneal · 1h ago
Why is o3-mini there but not o3?
_peregrine_ · 27m ago
We should definitely add o3 - probably will soon. Also looking at testing the Qwen models
jjwiseman · 1h ago
Please add GPT o3.
_peregrine_ · 28m ago
Noted, also feel free to add an issue to the GitHub repo: https://github.com/tinybirdco/llm-benchmark
kadushka · 31m ago
what about o3?
_peregrine_ · 26m ago
We need to add it
joelthelion · 2h ago
Did you try Sonnet 4?
vladimirralev · 1h ago
It's placed at 10. Below claude-3.5-sonnet, GPT 4.1 and o3-mini.
_peregrine_ · 1h ago
yeah this was a surprising result. of course, bear in mind that testing an LLM on SQL generation is pretty nuanced, so take everything with a grain of salt :)
Doohickey-d · 4h ago
> Users requiring raw chains of thought for advanced prompt engineering can contact sales

So it seems like all 3 of the LLM providers are now hiding the CoT - which is a shame, because it helped to see when it was going to go down the wrong track, and allowed you to quickly refine the prompt to ensure it didn't.

In addition to OpenAI, Google also just recently started summarizing the CoT, replacing it with an, in my opinion, overly dumbed-down summary.

pja · 3h ago
IIRC RLHF inevitably compromises model accuracy in order to train the model not to give dangerous responses.

It would make sense if the model used for chain-of-thought was trained differently (perhaps a different expert from an MoE?) from the one used to interact with the end user. Since the end user is only ever going to see its output filtered through the public model, the chain-of-thought model can be closer to the original, more pre-RLHF version without risking the reputation of the company.

This way you can get the full performance of the original model whilst still maintaining the necessary filtering required to prevent actual harm (or terrible PR disasters).

landl0rd · 2h ago
Yeah we really should stop focusing on model alignment. The idea that it's more important that your AI will fucking report you to the police if it thinks you're being naughty than that it actually works for more stuff is stupid.
xp84 · 1h ago
I'm not sure I'd throw out all the alignment baby with the bathwater. But I wish we could draw a distinction between "Might offend someone" with "dangerous."

Even 'plotting terror attacks' is something terrorists can do just fine without AI. And as for making sure the model wouldn't say ideas that are hurtful to <insert group>, it seems to me so silly when it's text we're talking about. If I want to say "<insert group> are lazy and stupid," I can type that myself (and it's even protected speech in some countries still!) How does preventing Claude from espousing that dumb opinion keep <insert group> safe from anything?

landl0rd · 54m ago
Let me put it this way: there are very few things I can think of that models should absolutely refuse, because there are very few pieces of information that are net harmful in all cases and at all times. I sort of run by blackstone's principle on this: it is better to grant 10 bad men access to information than to deny that access to 1 good one.

Easy example: Someone asks the robot for advice on stacking/shaping a bunch of tannerite to better focus a blast. The model says he's a terrorist. In fact, he's doing what any number of us have done and just having fun blowing some stuff up on his ranch.

Or I raised this one elsewhere, but ochem is an easy example. I've had basically all the models claim that random amines are illegal, potentially psychoactive, verboten. I don't really feel like having my door kicked down by agents with guns, getting my dog shot, maybe getting shot myself because the robot tattled on me for something completely legal. For that matter, if someone wants to synthesize some molly the robot shouldn't tattle to the feds about that either.

Basically it should just do what users tell it to do excepting the very minimal cases where something is basically always bad.

sunaookami · 4h ago
Guess we have to wait till DeepSeek mops the floor with everyone again.
datpuz · 3h ago
DeepSeek never mopped the floor with anyone... DeepSeek was remarkable because it is claimed that they spent a lot less training it, and without Nvidia GPUs, and because they had the best open weight model for a while. The only area they mopped the floor in was open source models, which had been stagnating for a while. But qwen3 mopped the floor with DeepSeek R1.
barnabee · 2h ago
They mopped the floor in terms of transparency, even more so in terms of performance × transparency

Long term that might matter more

manmal · 2h ago
I think qwen3:R1 is apples:oranges, if you mean the 32B models. R1 has 20x the parameters and likely roughly as much knowledge about the world. One is a really good general model, while you can run the other one on commodity hardware. Subjectively, R1 is way better at coding, and Qwen3 is really good only at benchmarks - take a look at aider‘s leaderboard, it’s not even close: https://aider.chat/docs/leaderboards/

R2 could turn out really really good, but we‘ll see.

sunaookami · 1h ago
DeepSeek made OpenAI panic, they initially hid the CoT for o1 and then rushed to release o3 instead of waiting for GPT-5.
codyvoda · 3h ago
counterpoint: influencers said they wiped the floor with everyone so it must have happened
sunaookami · 1h ago
Who cares about what random influencers say?
make3 · 24m ago
it just makes it too easy to distill the reasoning into a separate model I guess. though I feel like o3 shows useful things about the reasoning while it's happening
hsn915 · 2h ago
I can't be the only one who thinks this version is no better than the previous one, and that LLMs have basically reached a plateau, and all the new releases' "features" are more or less just gimmicks.
strangescript · 31m ago
I have used Claude Code a ton and I agree, I haven't noticed a single difference since updating. Its summaries are, I guess, a little cleaner, but it has not surprised me at all in ability. I find I am correcting it and re-prompting it just as much as with 3.7 on a TypeScript codebase. In fact I was kind of shocked how badly it did in a situation where it was editing the wrong file and it never thought to check that more specifically until I forced it to delete all the code and show that nothing changed with regards to what we were looking at.
TechDebtDevin · 1h ago
I think they are just getting better at the edges: MCP/tool calls, structured output. This definitely isn't increased intelligence, but it is an increase in the value add; not sure the value added equates to training costs or company valuations though.

In all reality, I have zero clue how any of these companies remain sustainable. I've tried to host some inference on cloud GPUs and it seems like it would be extremely cost prohibitive with any sort of free plan.

voiper1 · 31m ago
The benchmarks in many ways seem to be very similar to claude 3.7 for most cases.

That's nowhere near enough reason to think we've hit a plateau - the pace has been super fast, give it a few more months to call that...!

I think the opposite about the features - they aren't gimmicks at all, but indeed they aren't part of the core AI. Rather, it's important "tooling" adjacent to the AI that we need to actually leverage it. The LLM field in popular usage is still in its infancy. If the models don't improve (but I expect they will), we have a TON of room with these features and how we interact, feed them information, tool calls, etc to greatly improve usability and capability.

brookst · 1h ago
How much have you used Claude 4?
hsn915 · 1h ago
I asked it a few questions and it responded exactly like all the other models do. Some of the questions were difficult / very specific, and it failed in the same way all the other models failed.
flixing · 1h ago
i think you are.
go_elmo · 1h ago
I feel like the model making a memory file to store context is more than a gimmick, no?
make3 · 21m ago
the increases are not as fast, but they're still there. the models are already exceptionally strong, I'm not sure that basic questions can capture differences very well
hsn915 · 5m ago
Hence, "plateau"
NitpickLawyer · 1h ago
> and that LLMs have basically reached a plateau

This is the new stochastic parrots meme. Just a few hours ago there was a story on the front page where an LLM-based "agent" was given 3 tools to search e-mails and the simple task "find my brother's kid's name", and it was able to systematically work the problem, search, refine the search, and infer the correct name from an e-mail that mentioned nothing other than "X's favourite foods" and a link to a YouTube video. Come on!

That's not to mention things like AlphaEvolve, Microsoft's agentic test demo with Copilot running a browser, exploring functionality and writing Playwright tests, and all the advances in coding.
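
For what it's worth, the harness side of that kind of agent isn't magic. Here is a minimal sketch of a tool-use loop with the Anthropic SDK; the model id and the search_email stub are assumptions for illustration, not the setup from that story:

    import anthropic

    client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

    def search_email(query: str) -> str:
        # Stub: a real agent would query an actual mailbox or index here.
        return "Only hit: 'X's favourite foods' with a link to a YouTube video."

    tools = [{
        "name": "search_email",
        "description": "Search the user's mailbox and return matching snippets.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }]

    messages = [{"role": "user", "content": "Find my brother's kid's name."}]

    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",  # assumed model id
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            break
        # Echo the assistant turn back, then answer each tool call it made.
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": search_email(**block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})

    print(next(b.text for b in response.content if b.type == "text"))

The "systematic" behaviour (refining queries, chaining results) happens inside the model; the loop just keeps feeding tool results back.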

hsn915 · 1h ago
Is this something that the models from 4 months ago were not able to do?
SamBam · 16m ago
This is the first LLM that has been able to answer my logic puzzle on the first try without several minutes of extended reasoning.

> A man wants to cross a river, and he has a cabbage, a goat, a wolf and a lion. If he leaves the goat alone with the cabbage, the goat will eat it. If he leaves the wolf with the goat, the wolf will eat it. And if he leaves the lion with either the wolf or the goat, the lion will eat them. How can he cross the river?

Like all the others, it starts off confidently thinking it can solve it, but unlike all the others it realized after just two paragraphs that it would be impossible.

mellow_observer · 5m ago
Actual answer: He crosses the river and takes all of the animals and the cabbage with him in one go. Why not?

Alternative Answer: He just crosses the river. Why would he care who eats what?

Another Alternative Answer: He actually can't cross the river, since he doesn't have a boat and neither the cabbage nor the animals serve as appropriate flotation aids.

ttoinou · 12m ago
That is a classic riddle and could easily be part of the training data. Maybe if you changed the wording of the logic, used different names, and switched to a language less trained on than English, it would be meaningful to see whether it found the answer using logic rather than pattern recognition.
KolmogorovComp · 5m ago
Had you paid more attention, you would have realised it's not the classic riddle, but an already-tweaked version that makes it impossible to solve, which is why it is interesting.
zone411 · 31m ago
On the extended version of NYT Connections - https://github.com/lechmazur/nyt-connections/:

Claude Opus 4 Thinking 16K: 52.7.

Claude Opus 4 No Reasoning: 34.8.

Claude Sonnet 4 Thinking 64K: 39.6.

Claude Sonnet 4 Thinking 16K: 41.4 (Sonnet 3.7 Thinking 16K was 33.6).

Claude Sonnet 4 No Reasoning: 25.7 (Sonnet 3.7 No Reasoning was 19.2).

Claude Sonnet 4 Thinking 64K refused to provide one puzzle answer, citing "Output blocked by content filtering policy." Other models did not refuse.

tptacek · 4h ago
Have they documented the context window changes for Claude 4 anywhere? My (barely informed) understanding was one of the reasons Gemini 2.5 has been so useful is that it can handle huge amounts of context --- 50-70kloc?
minimaxir · 4h ago
Context window is unchanged for Sonnet. (200k in/64k out): https://docs.anthropic.com/en/docs/about-claude/models/overv...

In practice, the 1M context of Gemini 2.5 isn't that much of a differentiator because larger context has diminishing returns on adherence to later tokens.

Rudybega · 3h ago
I'm going to have to heavily disagree. Gemini 2.5 Pro has super impressive performance on large context problems. I routinely drive it up to 400-500k tokens in my coding agent. It's the only model where that much context produces even remotely useful results.

I think it also crushes most of the benchmarks for long context performance. I believe on MRCR (multi-round coreference resolution) its performance at 1M tokens beats pretty much any other model's performance at 128k (o3 may have changed this).

alasano · 2h ago
I find that it consistently breaks around that exact range you specified. In the sense that reliability falls off a cliff, even though I've used it successfully close to the 1M token limit.

At 500k+ I will define a task and it will suddenly panic and go back to a previous task that we just fully completed.

egamirorrim · 2h ago
OOI what coding agent are you managing to get to work nicely with G2.5 Pro?
Rudybega · 55m ago
I mostly use Roo Code inside Visual Studio Code. The modes are awesome for managing context length within a discrete unit of work.
zamadatix · 4h ago
The amount of degradation at a given context length isn't constant though so a model with 5x the context can either be completely useless or still better depending on the strength of the models you're comparing. Gemini actually does really great in both regards (context length and quality at length) but I'm not sure what a hard numbers comparison to the latest Claude models would look like.

A good deep dive on the context scaling topic in general https://youtu.be/NHMJ9mqKeMQ

Workaccount2 · 4h ago
Gemini's real strength is that it can stay on the ball even at 100k tokens in context.
strangescript · 28m ago
Yeah, but why aren't they attacking that problem? Is it just impossible? Because it would be a really simple win with regard to coding. I am a huge enthusiast, but I am starting to feel we're nearing a peak.
michaelbrave · 3h ago
I've had a lot of fun using Gemini's large context. I scrape a reddit discussion with 7k responses and have Gemini synthesize and categorize it, and by the time it's done and I've had a few back-and-forths with it, I've gotten half of a book written.

That said, I have noticed that if I try to give it additional threads to compare and contrast, once it hits around 300-500k tokens it starts to hallucinate more and forget things more.

ashirviskas · 4h ago
It's closer to <30k before performance degrades too much for 3.5/3.7. 200k/64k is meaningless in this context.
jerjerjer · 3h ago
Is there a benchmark to measure real effective context length?

Sure, gpt-4o has a context window of 128k, but it loses a lot from the beginning/middle.

brookst · 1h ago
Here's an older study that includes Claude 3.5: https://www.databricks.com/blog/long-context-rag-capabilitie...?
bigmadshoe · 1h ago
They often publish "needle in a haystack" benchmarks that look very good, but my subjective experience with a large context is always bad. Maybe we need better benchmarks.
VeejayRampay · 34m ago
that is just not correct, it's a big differentiator
fblp · 3h ago
I wish they would increase the context window or better handle it when the prompt gets too long. Currently users get "prompt is too long" warnings suddenly which makes it a frustrating model to work with for long conversations, writing etc.

Other tools may drop some prior context, or use RAG to help but they don't force you to start a new chat without warning.

jbellis · 4h ago
not sure wym, it's in the headline of the article that Opus 4 has 200k context

(same as sonnet 3.7 with the beta header)

esafak · 4h ago
There's the nominal context length, and the effective one. You need a benchmark like the needle-in-a-haystack or RULER to determine the latter.

https://github.com/NVIDIA/RULER
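
For the curious, a basic needle-in-a-haystack probe is easy to hack together yourself. This is just a sketch (the model id is an assumption, and real benchmarks like RULER vary depth, length, and distractors far more systematically):

    import random
    import anthropic

    # Build a long filler document with one "needle" fact buried at a random depth.
    filler = "The sky was a pleasant shade of blue that day. " * 4000
    needle = "The secret passphrase is 'glass-lantern-42'. "
    pos = random.randint(0, len(filler))
    haystack = filler[:pos] + needle + filler[pos:]

    client = anthropic.Anthropic()
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id
        max_tokens=50,
        messages=[{
            "role": "user",
            "content": haystack + "\n\nWhat is the secret passphrase?",
        }],
    )
    print(resp.content[0].text)  # repeat across depths/lengths and score retrieval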

tptacek · 4h ago
We might be looking at different articles? The string "200" appears nowhere in this one --- or I'm just wrong! But thanks!
keeganpoppen · 2h ago
context window size is super fake. if you don't have the right context, you don't get good output.
j_maffe · 7m ago
Tried Sonnet with 5-disk towers of Hanoi puzzle. Failed miserably :/ https://claude.ai/share/6afa54ce-a772-424e-97ed-6d52ca04de28
GolDDranks · 1h ago
After using Claude 3.7 Sonnet for a few weeks, my verdict is that its coding abilities are unimpressive, both for unsupervised coding and for problem solving/debugging, if you are expecting accurate results and correct code.

However, as a debugging companion, it's slightly better than a rubber duck, because at least there's some suspension of disbelief, so I tend to explain things to it earnestly and, because of that, process them better by myself.

That said, it's remarkable and interesting how quickly these models are getting better. Can't say anything about version 4, not having tested it yet, but in five years' time, things are not looking good for junior developers for sure, and a few years after that, for everybody.

BeetleB · 28m ago
I've noticed an interesting trend:

Most people who are happy with LLM coding say something like "Wow, it's awesome. I asked it to do X and it did it so fast with minimal bugs, and good code", and occasionally show the output. Many provide even more details.

Most people who are not happy with LLM coding ... provide almost no details.

As someone who's impressed by LLM coding, when I read a post like yours, I tend to have a lot of questions, and generally the post doesn't have the answers.

1. What type of problem did you try it out with?

2. Which model did you use (you get points for providing that one!)

3. Did you consider a better model (see how Gemini 2.5 Pro compares to Sonnet 3.7 on the Aider leaderboard)?

4. What were its failings? Buggy code? Correct code but poorly architected? Correct code but used some obscure method to solve it rather than a canonical one?

5. Was it working on an existing codebase or was this new code?

6. Did you manage well how many tokens were sent? Did you use a tool that informs you of the number of tokens for each query?

7. Which tool did you use? It's not just a question of the model, but of how the tool handles the prompts/agents under it. Aider is different from Claude Code, which is different from Cursor, which is different from Windsurf.

8. What strategy did you follow? Did you give it the broad spec and ask it to do anything? Did you work bottom up and work incrementally?

I'm not saying LLM coding is the best or can replace a human. But for certain use cases (e.g. a simple script written from scratch), it's absolutely fantastic. I (mostly) don't use it on production code, but for the little peripheral scripts I need to write (at home or work), it's great. And that's why people like me wonder what people like you are doing differently.

But such people aren't forthcoming with the details.

jiggawatts · 15m ago
Reminds me of the early days of endless “ChatGPT can’t do X” comments where they were invariably using 3.5 Turbo instead of 4, which was available to paying users only.

"Humans are much lazier than AIs" was my takeaway lesson from that.

StefanBatory · 1h ago
Things were already not looking good for junior devs. I graduated this year in Poland; many of my peers were looking for jobs in IT for like a year before they were able to find anything. And many internships were faked, as they couldn't get anything (here you're required to do an internship if you want to graduate).
GolDDranks · 56m ago
I sincerely hope you'll manage to find a job!

What I meant was purely from the capabilities perspective. There's no way a current AI model would outperform an average junior dev in job performance over... let's say, a year to be charitable. Even if they'd outperform junior devs during the first week, no way for a longer period.

However, that doesn't mean that the business people won't try to pre-empt potential savings. Some think that AI is already good enough; others don't, but expect it to be good enough in the future. Whether that happens remains to be seen, but the effects are already here.

IceHegel · 3h ago
My two biggest complaints with Claude 3.7 were:

1. It tended to produce very overcomplicated and high line count solutions, even compared to 3.5.

2. It didn't follow code style instructions very well. For example, the instruction to not add docstrings was often ignored.

Hopefully 4 is more steerable.

mkrd · 49m ago
True, I think the biggest problem of the latest models is that they hopelessly over-engineer things. As a consequence, I often can only copy specific things from the output
waleedlatif1 · 4h ago
I really hope sonnet 4 is not obsessed with tool calls the way 3-7 is. 3-5 was sort of this magical experience where, for the first time, I felt the sense that models were going to master programming. It’s kind of been downhill from there.
lherron · 2h ago
Overly aggressive “let me do one more thing while I’m here” in 3.7 really turned me off as well. Would love a return to 3.5’s adherence.
mkrd · 47m ago
Yes, this is pretty annoying. You give it a file and want it to make a small, focused change, but instead it touches almost every line of code, even the unrelated ones.
nomel · 2h ago
I think there was definitely a compromise with 3.7. When I turn off thinking, it seems to perform very poorly compared to 3.5.
deciduously · 4h ago
This feels like more of a system prompt issue than a model issue?
waleedlatif1 · 4h ago
imo, model regression might actually stem from more aggressive use of toolformer-style prompting, or even RLHF tuning optimizing for obedience over initiative. i bet if you ran comparable tasks across 3-5, 3-7, and 4-0 with consistent prompts and tool access disabled, the underlying model capabilities might be closer than it seems.
ketzo · 2h ago
Anecdotal of course, but I feel a very distinct difference between 3.5 and 3.7 when swapping between them in Cursor’s Agent mode (so the system prompt stays consistent).
modeless · 5h ago
Ooh, VS Code integration for Claude Code sounds nice. I do feel like Claude Code works better than the native Cursor agent mode.

Edit: How do you install it? Running `/ide` says "Make sure your IDE has the Claude Code extension", where do you get that?

k8sToGo · 3h ago
When you install Claude Code (or update it), there is a .vsix file in the same area where the claude binary is.
modeless · 3h ago
Thanks, I found it in node_modules/@anthropic-ai/claude-code/vendor
rw2 · 1h ago
Claude Code is a poorer version of Aider or Cline in VS Code. I have better results using them than using Claude Code alone.
michelb · 3h ago
What would this do other than run Claude Code in the same directory you have open in VSC?
modeless · 2h ago
Show diffs in my editor windows, like Cursor agent mode does, is what I'm hoping.
awestroke · 2h ago
Use `git diff` or your in-editor git diff viewer
modeless · 1h ago
Of course I do. But I don't generally make a git commit between every prompt to the model, and I like to see the model's changes separately from mine.
manmal · 2h ago
Claude Code as a tool call, from Copilot's own agent (an agent in an agent), seems to be working well. Peter Steinberger made an MCP that does this: https://github.com/steipete/claude-code-mcp
ChadMoran · 3h ago
I've been looking for that as well.
ChadMoran · 2h ago
Run claude cli from inside of VS Code terminal.

https://news.ycombinator.com/item?id=44064082

jacobzweig · 3h ago
Run claude in the terminal in VSCode (or cursor) and it will install automatically!
modeless · 3h ago
Doesn't work in Cursor over ssh. I found the VSIX in node_modules/@anthropic-ai/claude-code/vendor and installed it manually.
ChadMoran · 2h ago
cschmidt · 3h ago
Claude 3.7 wrote me some code this morning, and I was running into a bug. I switched to 4 and gave it its own code. It pointed out the bug right away and fixed it. So an upgrade for me :-)
j_maffe · 3h ago
Did you try 3.7 for debugging first? Just telling it there's a bug can be enough.
cschmidt · 3h ago
That probably would have worked. I just discovered there was a bug, and it popped up a thing about 4, so I didn't actually try the old version.
sigmoid10 · 4h ago
Sooo... it can play Pokemon. Feels like they had to throw that in after Google IO yesterday. But the real question is now can it beat the game including the Elite Four and the Champion. That was pretty impressive for the new Gemini model.
minimaxir · 4h ago
That Google IO slide was somewhat misleading as the maintainer of Gemini Plays Pokemon had a much better agentic harness that was constantly iterated upon throughout the runtime (e.g. the maintainer had to give specific instructions on how to use Strength to get past Victory Road), unlike Claude Plays Pokemon.

The Elite Four/Champion was a non-issue in comparison especially when you have a lv. 81 Blastoise.

archon1410 · 4h ago
Claude Plays Pokemon was the original concept and inspiration behind "Gemini Plays Pokemon". Gemini arguably only did better because it had access to a much better agent harness and was being actively developed during the run.

See: https://www.lesswrong.com/posts/7mqp8uRnnPdbBzJZE/is-gemini-...

montebicyclelo · 3h ago
Not sure "original concept" is quite right, given it had been tried earlier, e.g. here's a 2023 attempt to get gpt-4-vision to play pokemon, (it didn't really work, but it's clearly "the concept")

https://x.com/sidradcliffe/status/1722355983643525427

archon1410 · 3h ago
I see, I wasn't aware of that. The earliest attempt I knew of was from May 2024,[1] while this gpt-4-vision attempt is from November 2023. I guess Claude Plays Pokemon was the first attempt that had any real success (won a badge), and got a lot of attention over its entertaining "chain-of-thought".

[1] https://community.aws/content/2gbBSofaMK7IDUev2wcUbqQXTK6/ca...

throwaway314155 · 4h ago
Gemini can beat the game?
mxwsn · 4h ago
Gemini has beaten it already, but using a different and notably more helpful harness. The creator has said they think harness design is the most important factor right now, and that the results don't mean much for comparing Claude to Gemini.
throwaway314155 · 4h ago
Way off topic from TFA now, but isn't using an improved harness a bit like saying "I'm going to hardcode as many priors as possible into this thing so it succeeds regardless of its ability to strategize, plan and execute"?
samrus · 4h ago
it is. the benchmark was somewhat cheated, from the perspective of finding out how the model adjusts and plans within a dynamic reactive environment
11101010001100 · 2h ago
They asked gemini to come up with another word for cheating and it came up with 'harness'.
klohto · 4h ago
2 weeks ago
hansmayer · 4h ago
Right, but on the other hand... how is it even useful? Let's say it can beat the game, so what? So they can (kind of) summarise or write my emails - which is something I neither want nor need - they produce mountains of sloppy code, which I end up having to fix, and now they can play a game? Where is the killer app? The gaming approach was exactly the premise of the original AI efforts in the 1960s: that teaching computers to play chess and other 'brainy' games would somehow lead to the development of real AI. It ended, as we know, in the AI winter.
samrus · 4h ago
From a foundational research perspective, the Pokemon benchmark is one of the most important ones.

These models are trained on a static task, text generation, which is to say the state they are operating in does not change as they operate. But now that they are out, we are implicitly demanding that they do dynamic tasks like coding, navigation, operating in a market, or playing games. These are tasks where your state changes as you operate.

An example: as these models predict the next word, the ground truth of any further words doesn't change. If a model misinterprets the word "bank" in the sentence "I went to the bank" as a river bank rather than a financial bank, the later ground truth won't change; if the text was about a visit to the financial bank, it will still be about that regardless of the model's misinterpretation. But if a model takes a wrong turn on the road, or makes a weird buy in the stock market, the environment reacts and changes, and suddenly what would have been the right n+1th move isn't the right move anymore; it needs to figure out a route off the freeway first, or deal with the FOMO bull rush it caused by mistakenly buying a lot of stock.

We need to push against these limits to set the stage for the next evolution of AI: RL-based models that are trained in dynamic, reactive environments in the first place.

hansmayer · 3h ago
Honestly I have no idea what this is supposed to mean, and the high verbosity of whatever it is trying to prove is not helping. To repeat: we already tried making computers play games. Ever heard of Deep Blue, and ever heard of it again since the early 2000s?
Rudybega · 39m ago
The state space for actions in Pokemon is hilariously, unbelievably larger than the state space for chess. Older chess algorithms mostly used Brute Force (things like minimax) and the number of actions needed to determine a reward (winning or losing) was way lower (chess ends in many, many, many fewer moves than Pokemon).

Successfully navigating through Pokemon to accomplish a goal (beating the game) requires a completely different approach, one that much more accurately mirrors the way you navigate and goal set in real world environments. That's why it's an important and interesting test of AI performance.

lechatonnoir · 1h ago
Here's a summary for you:

llm trained to do few step thing. pokemon test whether llm can do many step thing. many step thing very important.

j_maffe · 3h ago
> Where is the killer app?

My man, ChatGPT is the sixth most visited website in the world right now.

hansmayer · 3h ago
But I did not ask "what is the sixth most visited website in the world right now?", did I? I asked what the killer app was here. I am afraid vague and unrelated KPIs will not help here; otherwise we may as well compare ChatGPT and PornHub based on the number of visits, as you seem to suggest.
j_maffe · 1h ago
If the now default go-to source for quick questions and formulating text isn't a killer app in your eyes, I don't know what is.
signatoremo · 2h ago
Are you saying PornHub isn’t a killer app?
minimaxir · 4h ago
It's a fun benchmark, like simonw's pelican riding a bike. Sometimes fun is the best metric.
lechatonnoir · 1h ago
This is a weirdly cherry-picked example. The gaming approach was also the premise of DeepMind's AI efforts in 2016, which was nine years ago. Regardless of what you think about the utility of text (code), video, audio, and image generation, surely you think that their progress on the protein-folding problem and on weather prediction has been useful to society?

What counts as a killer app to you? Can you name one?

uludag · 4h ago
I'm curious what others' priors are when reading benchmark scores. Obviously, with immense funding at stake, companies have every incentive to game the benchmarks, and the loss of goodwill from gaming the system doesn't appear to have many consequences.

Obviously trying the model for your own use cases more and more lets you narrow in on actual utility, but I'm wondering how others interpret reported benchmarks these days.

imiric · 20m ago
Benchmark scores are marketing fluff. Just like the rest of this article with alleged praises from early adopters, and highly scripted and edited videos.

AI companies are grasping at straws by selling us minor improvements to stale technology so they can pump up whatever valuation they have left.

candiddevmike · 4h ago
People's interpretation of benchmarks will largely depend on whether they believe they will be better or worse off by GenAI taking over SWE jobs. Think you'd need someone outside the industry to weigh in to have a real, unbiased view.
douglasisshiny · 4h ago
Or someone who has been a developer for a decade plus trying to use these models on actual existing code bases, solving specific problems. In my experience, they waste time and money.
sandspar · 2h ago
These people are the most experienced, yes, but by the same token they also have the most incentive to disbelieve that an AI will take their job.
lewdwig · 4h ago
Well-designed benchmarks have a public sample set and a private testing set. Models are free to train on the public set, but they can't game the benchmark or overfit the samples that way because they're only assessed on performance against examples they haven't seen.

Not all benchmarks are well-designed.

thousand_nights · 1h ago
but as soon as you test on your private testing set you're sending it to their servers so they have access to it

so effectively you can only guarantee a single use stays private

minimaxir · 1h ago
Claude does not train on API I/O.

> By default, we will not use your inputs or outputs from our commercial products to train our models.

> If you explicitly report feedback or bugs to us (for example via our feedback mechanisms as noted below), or otherwise explicitly opt in to our model training, then we may use the materials provided to train our models.

https://privacy.anthropic.com/en/articles/7996868-is-my-data...

iLoveOncall · 4h ago
Hasn't it been proven many times that all those companies cheat on benchmarks?

I personally couldn't care less about them, especially when we've seen many times that the public's perception is absolutely not tied to the benchmarks (Llama 4, the recent OpenAI model that flopped, etc.).

sebzim4500 · 3h ago
I don't think there's any real evidence that any of the major companies are going out of their way to cheat the benchmarks. Problem is that, unless you put a lot of effort into avoiding contamination, you will inevitably end up with details about the benchmark in the training set.
blueprint · 4h ago
kind of reminds me how they said they were increasing platform capabilities with Max and actually reduced them while charging a ton for it per month. Talk about a bait and switch. Lord help you if you tried to cancel your ill advised subscription during that product roll out as well - doubly so if you expect a support response.
sndean · 4h ago
Using Claude Opus 4, this was the first time I've gotten any of these models to produce functioning Dyalog APL that does something relatively complicated. And it actually runs without errors. Crazy (at least to me).
skruger · 3h ago
Would you mind sharing what you did? stefan@dyalog
cess11 · 3h ago
Would you mind sharing the query and results?
sali0 · 4h ago
I've found myself having brand loyalty to Claude. I don't really trust any of the other models with coding, the only one I even let close to my work is Claude. And this is after trying most of them. Looking forward to trying 4.
SkyPuncher · 4h ago
Gemini is _very_ good at architecture level thinking and implementation.

I tend to find that I use Gemini for the first pass, then switch to Claude for the actual line-by-line details.

Claude is also far superior at writing specs than Gemini.

leetharris · 4h ago
Much like others, this is my stack (or o1-pro instead of Gemini 2.5 Pro). This is a big reason why I use aider for large projects. It allows me to effortlessly combine architecture models and code writing models.

I know in Cursor and others I can just switch models between chats, but it doesn't feel intentional the way aider does. You chat in architecture mode, then execute in code mode.

Keyframe · 3h ago
could you describe a bit how does this work? I haven't had much luck with AI so far, but I'm willing to try.
BeetleB · 2h ago
https://aider.chat/2024/09/26/architect.html

The idea is that some models are better at reasoning about code, but others are better at actually creating the code changes (without syntax errors, etc). So Aider lets you pick two models - one does the architecting, and the other does the code change.

SirYandi · 59m ago
https://harper.blog/2025/02/16/my-llm-codegen-workflow-atm/

"tl:dr; Brainstorm spec, then plan a plan, then execute using LLM codegen. Discrete loops. Then magic."

justinbaker84 · 4h ago
I have been very brand loyal to claude also but the new gemini model is amazing and I have been using it exclusively for all of my coding for the last week.

I am excited to try out this new model. I actually want to stay brand loyal to antropic because I like the people and the values they express.

causal · 4h ago
This is exactly my approach. Use Gemini to come up with analysis and a plan, Claude to implement.
cheema33 · 4h ago
This matches with my experience as well.
vFunct · 4h ago
Yeah, Claude tends to output 1200+ line architectural specification documents while Gemini tends to output ~600-line ones. (I just had to write 100+ architectural spec documents for 100+ different apps.)

Not sure why Claude is more thorough and complete than the other models, but it's my go-to model for large projects.

The OpenAI model outputs are always the smallest - 500 lines or so. Not very good at larger projects, but perfectly fine for small fixes.

rogerrogerr · 4h ago
> I just had to write 100+ architectural spec documents for 100+ different apps

… whaaaaat?

vFunct · 4h ago
Huge project..
skybrian · 3h ago
Big design up front is back? But I guess it's a lot easier now, so why not?


chermi · 3h ago
I'd be interested to hear more about your workflow. I use Gemini for discussing the codebase, making ADR entries based on discussion, ticket generation, documenting the code (module descriptions, use cases + examples), and coming up with detailed plans for implementation that Cursor with Sonnet can implement. Do you have any particular formats, guidelines, or prompts? I don't love my workflow. I try to keep everything in Notion but it's becoming a pain. I'm pretty new to documentation and proper planning, but I feel like it's more important now to get the best use out of the LLMs. Any tips appreciated!
vFunct · 3h ago
For a large project, the real human trick for you to do is to figure out how to partition it down to separate apps, so that individual LLMs can work on them separately, as if they were their own employees in separate departments.

You then ask LLMs to first write features for the individual apps (in Markdown), giving it some early competitive guidelines.

You then tell LLMs to read that features document, and then write an architectural specification document. Tell it to maybe add example data structures, algorithms, or user interface layouts. All in Markdown.

You then feed these two documents to individual LLMs to write the rest of the code, usually starting with the data models first, then the algorithms, then the user interface.

Again, the trick is to partition your project to individual apps. Also an app isn't the full app. It might just be a data schema, a single GUI window, a marketing plan, etc.

The other hard part is to integrate the apps back together at the top level if they interact with each other...

Bjartr · 3h ago
What's your prompt look like for creating spec documents?
jonny_eh · 3h ago
Who's reading these docs?
tasuki · 1h ago
Another LLM, which distills it back into a couple of sentences for human consumption.
MattSayar · 4h ago
Same. And I JUST tried their GitHub Action agentic thing yesterday (wrote about it here[0]), and it honestly didn't perform very well. I should try it again with Claude 4 and see if there are any differences. Should be an easy test

[0] https://mattsayar.com/personalized-software-really-is-coming...

WillPostForFood · 4h ago
It would be pretty funny if your "not today. Maybe tomorrow" proposition actually did happen the next day.
MattSayar · 3h ago
It literally almost did!
kilroy123 · 4h ago
The new Gemini models are very good too.
pvg · 3h ago
As the poet wrote:

I prefer MapQuest

that's a good one, too

Google Maps is the best

true that

double true!

qingcharles · 2h ago
I'm slutty. I tend to use all four at once: Claude, Grok, Gemini and OpenAI.

They keep leap-frogging each other. My preference has been the output from Gemini these last few weeks. Going to check out Claude now.

cleak · 4h ago
Something I’ve found true of Claude, but not other models, is that when the benchmarks are better, the real world performance is better. This makes me trust them a lot more and keeps me coming back.
alex1138 · 4h ago
I think "Kagi Code" or whatever it's called is using Claude
Recursing · 4h ago
I also recommend trying out Gemini, I'm really impressed by the latest 2.5. Let's see if Claude 4 makes me switch back.
dwaltrip · 4h ago
What's the best way to use Gemini? I'm currently pretty happy with / impressed by Claude Code via the CLI; it's the best AI coding tool I've tried so far.
stavros · 3h ago
I use it via Aider, with Gemini being the architect and Claude the editor.
javier2 · 4h ago
I wouldn't go as far, but I actually have some loyalty to Claude as well. Don't even know why, as I think the differences are marginal.
sergiotapia · 4h ago
Gemini 2.5 Pro replaced Claude 3.7 for me after using nothing but claude for a very long time. It's really fast, and really accurate. I can't wait to try Claude 4, it's always been the most "human" model in my opinion.
ashirviskas · 4h ago
Idk, I found Gemini 2.5 breaking code style too often and introducing unneeded complexity, on top of leaving unfinished functions.
nico · 3h ago
I think it really depends on how you use it. Are you using an agent with it, or the chat directly?

I've been pretty disappointed with Cursor and all the supported models. Sometimes it can be pretty good and convenient, because it's right there in the editor, but it can also get stuck on very dumb stuff and re-trying the same strategies over and over again

I've had really good experiences with o4-mini-high directly in the chat. It's annoying going back and forth copying/pasting code between the editor and the browser, but it also keeps me more in control of the actions and the context

Would really like to know more about your experience

rasulkireev · 4h ago
i was the same, but then slowly converted to Gemini. Still not sure how that happened tbh
anal_reactor · 4h ago
I've been initially fascinated by Claude, but then I found myself drawn to Deepseek. My use case is different though, I want someone to talk to.
miroljub · 4h ago
I also use DeepSeek R1 as a daily driver. Combined with Qwen3 when I need better tool usage.

Now that both Google and Claude are out, I expect to see DeepSeek R2 released very soon. It would be funny to watch an actual open source model getting close to the commercial competition.

ashirviskas · 4h ago
Have you compared R1 with V3-0324?
jabroni_salad · 1h ago
A nice thing about Deepseek is that it is so cheap to run. It's nice being able to explore conversational trees without getting a $12 bill at the end of it.
oofbaroomf · 4h ago
Nice to see that Sonnet performs worse than o3 on AIME but better on SWE-Bench. Often, it's easy to optimize math capabilities with RL but much harder to crack software engineering. Good to see what Anthropic is focusing on.
j_maffe · 3h ago
That's a very contentious opinion you're stating there. I'd say LLMs have surpassed a larger percentage of SWEs in capability than they have for mathematicians.
oofbaroomf · 3h ago
Mathematicians don't do high school math competitions - the benchmark in question is AIME.

Mathematicians generally do novel research, which is hard to optimize for easily. Things like LiveCodeBench (leetcode-style problems), AIME, and MATH (similar to AIME) are often chosen by companies so they can flex their model's capabilities, even if it doesn't perform nearly as well in things real mathematicians and real software engineers do.

j_maffe · 1h ago
Ok then you should clarify that you meant math benchmarks and not math capabilities.
thimabi · 4h ago
It’s been hard to keep up with the evolution in LLMs. SOTA models basically change every other week, and each of them has its own quirks.

Differences in features, personality, output formatting, UI, safety filters… make it nearly impossible to migrate workflows between distinct LLMs. Even models of the same family exhibit strikingly different behaviors in response to the same prompt.

Still, having to find each model’s strengths and weaknesses on my own is certainly much better than not seeing any progress in the field. I just hope that, eventually, LLM providers converge on a similar set of features and behaviors for their models.

runjake · 4h ago
My advice: don't jump around between LLMs for a given project. The AI space is progressing too rapidly right now. Save yourself the sanity.
charlie0 · 1h ago
A man with one compass knows where he's going; a man with two compasses is never sure.
rokkamokka · 4h ago
Isn't that an argument to jump around? Since performance improves so rapidly between models
morkalork · 2h ago
I think the idea is you might end up spending your time shaving a yak. Finish your project, then try the new SOTA on your next task.
dmix · 4h ago
You should at least have two to sanity check difficult programming solutions.
vFunct · 4h ago
Each model has its own strengths and weaknesses though. You really shouldn't be using one model for everything. Like, Claude is great at coding but is expensive, so you wouldn't use it for debugging or writing test benches. But the OpenAI models suck at architecture and are cheap, so they're ideal for test benches, for example.
matsemann · 4h ago
How important is it to be using SOTA? Or even jump on it already?

Feels a bit like when it was a new frontend framework every week. Didn't jump on any then. Sure, when React was the winner, I had a few months less experience than those who bet on the correct horse. But nothing I couldn't quickly catch up to.

thimabi · 4h ago
> How important is it to be using SOTA?

I believe in using the best model for each use case. Since I’m paying for it, I like to find out which model is the best bang for my buck.

The problem is that, even when comparing models according to different use cases, better models eventually appear, and the models one uses eventually change as well — for better or worse. This means that using the same model over and over doesn’t seem like a good decision.

smallnamespace · 4h ago
Vibe code an eval harness with a web dashboard
Falimonda · 3h ago
Have you tried a package like LiteLLM so that you can more easily validate and switch to a newer model?

The key seems to be in curating your application's evaluation set.
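
As a minimal sketch of what that looks like with LiteLLM's OpenAI-style completion() interface (the model ids here are assumptions; swap in whatever your providers actually expose):

    from litellm import completion

    def ask(model: str, prompt: str) -> str:
        # Same call shape regardless of provider; LiteLLM routes on the model prefix.
        response = completion(model=model, messages=[{"role": "user", "content": prompt}])
        return response.choices[0].message.content

    # Re-running a fixed eval set against a new model is then just a string change.
    for model in ["anthropic/claude-sonnet-4-20250514", "gemini/gemini-2.5-pro"]:
        print(model, "->", ask(model, "Return the word OK and nothing else."))

The hard part, as you say, is curating the evaluation set, not the plumbing.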

wewewedxfgdf · 35m ago
I would take better file export/access over more fancy AI features any day.

Copying and pasting is so old.

waynecochran · 4h ago
My mind has been blown using ChatGPT's o4-mini-high for coding and research (its knowledge of computer vision and tools like OpenCV is fantastic). Is it worth trying out all the shiny new AI coding agents ... I need to get work done?
bcrosby95 · 4h ago
Kinda interesting, as I've found 4o better than o4-mini-high for most of my coding. And while it's mind blowing that they can do what they can do, the code itself coming out the other end has been pretty bad, but good enough for smaller snippets and extremely isolated features.
throwaway314155 · 4h ago
I would say yes. The jump in capability and reduction in hallucinations (at least code) to Claude 3.7 from ChatGPT (even o3) is immediately noticeable in my experience. Same goes for gemini which was even better in some ways until perhaps today.
goranmoomin · 4h ago
> Extended thinking with tool use (beta): Both models can use tools—like web search—during extended thinking, allowing Claude to alternate between reasoning and tool use to improve responses.

I'm happy that tool use during extended thinking is now a thing in Claude as well, from my experience with CoT models that was the one trick(tm) that massively improves on issues like hallucination/outdated libraries/useless thinking before tool use, e.g.

o3 with search actually returned solid results, browsing the web much like how I'd do it, and I was thoroughly impressed – we'll see how Claude goes.

travisgriggs · 4h ago
It feels as if the CPU MHz wars of the '90s are back. Now, instead of geeking out about CPU architectures, which have various results of ambiguous value on different benchmarks, we're talking about the same sorts of nerdy things between LLMs.

History Rhymes with Itself.

janpaul123 · 1h ago
At Kilo we're already seeing lots of people trying it out. It's looking very good so far. Gemini 2.5 Pro had been taking over from Claude 3.7 Sonnet, but it looks like there's a new king. The bigger question is how often it's worth the price.
swyx · 5h ago
livestream here: https://youtu.be/EvtPBaaykdo

my highlights:

1. Coding ability: "Claude Opus 4 is our most powerful model yet and the best coding model in the world, leading on SWE-bench (72.5%) and Terminal-bench (43.2%). It delivers sustained performance on long-running tasks that require focused effort and thousands of steps, with the ability to work continuously for several hours—dramatically outperforming all Sonnet models and significantly expanding what AI agents can accomplish." however this is Best of N, with no transparency on size of N and how they decide the best, saying "We then use an internal scoring model to select the best candidate from the remaining attempts." Claude Code is now generally available (we covered in http://latent.space/p/claude-code )

2. Memory highlight: "Claude Opus 4 also dramatically outperforms all previous models on memory capabilities. When developers build applications that provide Claude local file access, Opus 4 becomes skilled at creating and maintaining 'memory files' to store key information. This unlocks better long-term task awareness, coherence, and performance on agent tasks—like Opus 4 creating a 'Navigation Guide' while playing Pokémon." Memory Cookbook: https://github.com/anthropics/anthropic-cookbook/blob/main/t...

3. Raw CoT available: "we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full. Users requiring raw chains of thought for advanced prompt engineering can contact sales about our new Developer Mode to retain full access."

4. haha: "We no longer include the third ‘planning tool’ used by Claude 3.7 Sonnet. " <- psyop?

5. context caching now has a premium 1hr TTL option: "Developers can now choose between our standard 5-minute time to live (TTL) for prompt caching or opt for an extended 1-hour TTL at an additional cost" (a rough sketch of the API call is below, after this list)

6. https://www.anthropic.com/news/agent-capabilities-api new code execution tool (sandbox) and file tool
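
On point 5, a rough sketch of how the cache opt-in looks in an API call. The "ttl" value and the model id here are assumptions based on the announcement; check the current docs for the exact field name and any required beta header:

    import anthropic

    client = anthropic.Anthropic()

    style_guide = "You are a code reviewer. Full style guide follows: ..."  # large, stable prefix

    resp = client.messages.create(
        model="claude-opus-4-20250514",  # assumed model id
        max_tokens=512,
        system=[{
            "type": "text",
            "text": style_guide,
            # Standard caching is {"type": "ephemeral"}; "ttl": "1h" is the assumed
            # opt-in for the new extended (pricier) tier.
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }],
        messages=[{"role": "user", "content": "Review this diff: ..."}],
    )
    print(resp.content[0].text)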

modeless · 4h ago
Memory could be amazing for coding in large codebases. Web search could be great for finding docs on dependencies as well. Are these features integrated with Claude Code though?
rudedogg · 2h ago
How are Claude’s rate limits on the $20 plan? I used to hit them a lot when I subscribed ~6 months ago, to the point that I got frustrated and unsubscribed.
wyes · 2h ago
They have gotten worse.
boh · 4h ago
Can't wait to hear how it breaks all the benchmarks while any differences remain entirely imperceptible in practice.
low_tech_punk · 3h ago
Can anyone help me understand why they changed the model naming convention?

BEFORE: claude-3-7-sonnet

AFTER: claude-sonnet-4

jpau · 1h ago
Seems to be a nod to each size being treated as their own product.

Claude 3 arrived as a family (Haiku, Sonnet, Opus), but no release since has included all three sizes.

A release of "claude-3-7-sonnet" alone seems incomplete without Haiku/Opus, when perhaps Sonnet is has its own development roadmap (claude-sonnet-*).

sebzim4500 · 3h ago
Because we are going to get AGI before an AI company can consistently name their models.
nprateem · 2h ago
AGI will be v6, so this will get them there faster.
jonny_eh · 3h ago
Better information hierarchy.
KaoruAoiShiho · 4h ago
Is this really worthy of a Claude 4 label? Was there a new pre-training run? Because this feels like 3.8... only SWE-bench went up significantly, and that, as we all understand by now, is done by cramming on specific post-training data and doesn't generalize to intelligence. The agentic tool use didn't improve, which says to me that it's not really smarter.
minimaxir · 4h ago
So I decided to try Claude 4 Sonnet against my "Given a list of 1 million random integers between 1 and 100,000, find the difference between the smallest and the largest numbers whose digits sum up to 30." benchmark I tested against Claude 3.5 Sonnet: https://news.ycombinator.com/item?id=42584400

The results are here (https://gist.github.com/minimaxir/1bad26f0f000562b1418754d67... ) and it utterly crushed the problem with the relevant microoptimizations commented in that HN discussion (oddly in the second pass it a) regresses from a vectorized approach to a linear approach and b) generates and iterates on three different iterations instead of one final iteration), although it's possible Claude 4 was trained on that discussion lol.

EDIT: "utterly crushed" may have been hyperbole.

diggan · 4h ago
> although it's possible Claude 4 was trained on that discussion lol

Almost guaranteed, especially since HN tends to be popular in tech circles, and also trivial to scrape the entire thing in a couple of hours via the Algolia API.

Recommendation for the future: keep your benchmarks/evaluations private, as otherwise they're basically useless as more models get published that are trained on your data. This is what I do, and usually I don't see the "huge improvements" as other public benchmarks seems to indicate when new models appear.

dr_kiszonka · 3h ago
>> although it's possible Claude 4 was trained on that discussion lol

> Almost guaranteed, especially since HN tends to be popular in tech circles, and also trivial to scrape the entire thing via the Algolia API.

I am wondering if this could be cleverly exploited. <twirls mustache>

bsamuels · 4h ago
as soon as you publish a benchmark like this, it becomes worthless because it can be included in the training corpus
rbjorklin · 1h ago
While I agree with you in principle give Claude 4 a try on something like: https://open.kattis.com/problems/low . I would expect this to have been included in the training material as well as solutions found on Github. I've tried providing the problem description and asking Claude Sonnet 4 to solve it and so far it hasn't been successful.
thethirdone · 3h ago
The first iteration vectorized with numpy is the best solution imho. The only additional optimization is using modulo 9: a number's digit sum is congruent to the number mod 9, so a digit sum of 30 implies n % 9 == 3, which lets you keep only about 1/9th of the numbers. The digit summing is the slow part, so reducing the number of values there results in a large speedup. Numpy can do that filter pretty fast as `arr = arr[arr%9==3]`.

With that optimization it's about 3 times faster, and all of the non-numpy solutions are slower than the numpy one. In Python it almost never makes sense to try to manually iterate for speed.
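
A minimal sketch of that combination (mod-9 pre-filter plus vectorized digit sum), assuming the same 1M-integer setup as the benchmark:

    import numpy as np

    arr = np.random.randint(1, 100_001, size=1_000_000)
    arr = arr[arr % 9 == 3]  # digit sum 30 implies n % 9 == 3; keeps ~1/9 of values

    # Vectorized digit sum (values here have at most 6 digits).
    digits = np.zeros_like(arr)
    tmp = arr.copy()
    for _ in range(6):
        digits += tmp % 10
        tmp //= 10

    candidates = arr[digits == 30]
    print(candidates.max() - candidates.min() if candidates.size else None)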

isahers32 · 3h ago
Might just be missing something, but isn't 9+9+9+9+3=39? The largest valid number I believe is 99930? Also, it could further optimize by terminating the digit-sum calculation early if the sum goes above 30 or can no longer reach 30 (digits remaining * 9 is less than 30 - current_sum). imo this is pretty far from "crushing it". A sketch of that early-exit idea follows below.
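
Here is that early-exit idea as a hypothetical helper (my sketch, not the generated code):

    def has_digit_sum_30(n: int) -> bool:
        remaining = len(str(n))  # digits left to inspect
        total = 0
        while n:
            total += n % 10
            n //= 10
            remaining -= 1
            # Bail out once we overshoot 30 or can no longer reach it
            # (each remaining digit contributes at most 9).
            if total > 30 or total + 9 * remaining < 30:
                return False
        return total == 30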
losvedir · 2h ago
Same for me, with this past year's Advent of Code. All the models until now have been stumped by Day 17 part 2. But Opus 4 finally got it! Good chance some of that is in its training data, though.
Epa095 · 3h ago
I find it weird that it does an inner check on 'num > 99999', which pretty much only catches 100,000. It could check against 99930 (the largest in-range value whose digits sum to 30), but I doubt even that check makes it much faster.

But have you checked with some other number than 30? Does it screw up the upper and lower bounds?

jonny_eh · 3h ago
> although it's possible Claude 4 was trained on that discussion lol

This is why we can't have consistent benchmarks

teekert · 3h ago
Yeah I agree. Also, what is the use of that benchmark? Who cares? How does it relate to stuff that does matter?
kevindamm · 3h ago
I did a quick review of its final answer and looks like there are logic errors.

All three of them get the incorrect max-value bound (even with comments saying 9+9+9+9+3 = 30), so early termination wouldn't happen in the second and third solution, but that's an optimization detail. The first version would, however, early terminate on the first occurrence of 3999 and take whatever the max value was up to that point. So, for many inputs the first one (via solve_digit_sum_difference) is just wrong.

The second implementation (solve_optimized, not a great name either) and third implementation, at least appear to be correct... but that pydoc and the comments in general are atrocious. In a review I would ask these to be reworded and would only expect juniors to even include anything similar in a pull request.

I'm impressed that it's able to pick a good line of reasoning, and even if it's wrong about the optimizations it did give a working answer... but in the body of the response and in the code comments it clearly doesn't understand digit extraction per se, despite parroting code about it. I suspect you're right that the model has seen the problem solution before, and is possibly overfitting.

Not bad, but I wouldn't say it crushed it, and wouldn't accept any of its micro-optimizations without benchmark results, or at least a benchmark test that I could then run.

Have you tried the same question with other sums besides 30?

minimaxir · 3h ago
Those are fair points. Even with those issues, it's still substantially better than the original benchmark results (maybe "crushing it" is too subjective a term).

I reran the test with a dataset of 1 to 500,000 and a target digit sum of 37, and it went back to the numba JIT implementation that was encountered in my original blog post, without numerology shenanigans. https://gist.github.com/minimaxir/a6b7467a5b39617a7b611bda26...

I did also run the model at temp=1, which came to the same solution but confused itself with test cases: https://gist.github.com/minimaxir/be998594e090b00acf4f12d552...

Bjorkbat · 2h ago
I was about to comment on a past remark from Anthropic that the whole reason for the convoluted naming scheme was that they wanted to wait until they had a model worthy of the "Claude 4" title.

But because of all the incremental improvements since then, the irony is that this merely feels like an incremental improvement. It obviously is a huge leap when you consider that the best Claude 3 ever got on SWE-verified was just under 20% (combined with SWE-agent), but compared to Claude 3.7 it doesn't feel like that big of a deal, at least when it comes to SWE-bench results.

Is it worthy? Sure, why not, compared to the original Claude 3 at any rate, but this habit of incremental improvement means that a major new release feels kind of ordinary.

wrsh07 · 4h ago
My understanding of the original OpenAI and Anthropic labels was essentially: GPT-2 was 100x more compute than GPT-1. Same for 2 to 3. Same for 3 to 4. Thus, GPT-4.5 was 10x more compute^

If anthropic is doing the same thing, then 3.5 would be 10x more compute vs 3. 3.7 might be 3x more than 3.5. and 4 might be another ~3x.

^ I think this maybe involves words like "effective compute", so yeah it might not be a full pretrain but it might be! If you used 10x more compute that could mean doubling the amount used on pretraining and then using 8x compute on post or some other distribution

swyx · 4h ago
Beyond 4 that's no longer true - marketing took over from the research
wrsh07 · 23m ago
Oh shoot I thought that still applied to 4.5 just in a more "effective compute" way (not 100x more parameters, but 100x more compute in training)

But alas, it's not like 3nm fab means the literal thing either. Marketing always dominates (and not necessarily in a way that adds clarity)

jacob019 · 3h ago
Hey, at least they incremented the version number. I'll take it.
mike_hearn · 3h ago
They say in the blog post that tool use has improved dramatically: parallel tool use, ability to use tools during thinking and more.
causal · 4h ago
Slight decrease from Sonnet 3.7 in a few areas even. As always benchmarks say one thing, will need some practice with it to get a subjective opinion.
oofbaroomf · 4h ago
The improvement from Claude 3.7 wasn't particularly huge. The improvement from Claude 3, however, was.
zamadatix · 4h ago
It feels like the days of Claude 2 -> 3 or GPT 2 -> 3 level changes for the leading models are over, and you're either going to end up with really awkward version numbers or just embrace it and increment the number. Nobody cares that a Chrome update gives a major version change of 136 -> 137 instead of 12.4.2.33 -> 12.4.3.0, for similar kinds of "the version number doesn't always have to represent the amount of work/improvement compared to the previous one" reasoning.
saubeidl · 4h ago
It feels like LLM progress in general has kinda stalled and we're only getting small incremental improvements from here.

I think we've reached peak LLM - if AGI is a thing, it won't be through this architecture.

nico · 4h ago
Diffusion LLMs seem like they could be a huge change

Check this out from yesterday (watch the short video here):

https://simonwillison.net/2025/May/21/gemini-diffusion/

From:

https://news.ycombinator.com/item?id=44057820

antupis · 3h ago
Yup especially locally diffusion models will be big.
bcrosby95 · 3h ago
Even if LLMs never reach AGI, they're good enough to where a lot of very useful tooling can be built on top of/around them. I think of it more as the introduction of computing or the internet.

That said, whether or not being a provider of these services is a profitable endeavor is still unknown. There's a lot of subsidizing going on and some of the lower value uses might fall to the wayside as companies eventually need to make money off this stuff.

michaelbrave · 3h ago
This was my take as well. Though after a while I've started thinking about it as closer to the introduction of electricity, which in a lot of ways would be considered the second stage of the industrial revolution; the internet and AI might be considered the second stage of the computing revolution (or so I expect history books to label it). But just like electricity, it doesn't seem to be very profitable for the providers, yet it's highly profitable for everything that uses it.
jenny91 · 3h ago
I think it's a bit early to say. At least in my domain, the models released this year (Gemini 2.5 Pro, etc.) are crushing models from last year. I would therefore not by any means be ready to call the situation a stall.
goatlover · 4h ago
Which brings up the question of why AGI is a thing at all. Shouldn't LLMs just be tools to make humans more productive?
amarcheschi · 3h ago
Think of the poor vcs who are selling agi as the second coming of christ
onlyrealcuzzo · 4h ago
Did you see Gemini 1.5 pro vs 2.5 pro?
zamadatix · 4h ago
Sure, but despite there being a 2.0 release in between (they didn't even feel the need to release a Pro for it), it still isn't the kind of GPT-2 -> GPT-3 improvement we were hoping would continue for a bit longer. Companies will continue to release these incremental improvements, which are all always neck-and-neck with each other. That's fine and good; just don't inherently expect the versioning to represent the same relative difference instead of the relative release increment.
cma · 3h ago
I'd say 1.5 Pro to 2.5 Pro was a 3 -> 4 level improvement, but the problem is 1.5 Pro wasn't state of the art when released, except for context length, and 2.5 wasn't that kind of improvement compared to the best OpenAI or Claude stuff that was available when it released.

1.5 Pro was worse than the original GPT-4 on several coding things I tried head to head.

drusepth · 3h ago
To be fair, a lot of people said 3.7 should have just been called 4. Maybe they're just bridging the gap.
greenfish6 · 4h ago
Benchmarks don't tell you as much as the actual coding vibe though
joshstrange · 4h ago
If you are looking for the IntelliJ Jetbrain plugin it's here: https://plugins.jetbrains.com/plugin/27310-claude-code-beta-

I couldn't find it linked from Claude Code's page or this announcement

Seattle3503 · 1h ago
I'm getting "claude code not found" even though I have Claude Code installed. Is there some trick to getting it to see my install? I installed claude code the normal way.
joshstrange · 1h ago
Hmm, I didn’t have to do any extra steps after installing the plugin (I had the cli tool installed already as well).
oofbaroomf · 4h ago
Wonder why they renamed it from Claude <number> <type> (e.g. Claude 3.7 Sonnet) to Claude <type> <number> (Claude Opus 4).
stavros · 3h ago
I guess because they haven't been releasing all three models of the same version in a while now. We've only had Sonnet updates, so putting the version first didn't make sense when we had 3.5 and 3.7 Sonnet but none of the others.
k8sToGo · 3h ago
Seems like GitHub just added it to Copilot. For now the premium requests don't count, but starting June 4th they will.
hnthrowaway0315 · 3h ago
When can we reach the point that 80% of the capacity of mediocre junior frontend/data engineers can be replaced?
93po · 3h ago
im mediocre and got fired yesterday so not far
energy123 · 4h ago

  > Finally, we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full.
This is not better for the user. No users want this. If you're doing this to prevent competitors training on your thought traces then fine. But if you really believe this is what users want, you need to reconsider.
comova · 4h ago
I believe this is to improve performance by shortening the context window for long thinking processes. I don't think this is referring to real-time summarizing for the users' sake.
usaar333 · 3h ago
When you do a chat, are the reasoning traces for prior model outputs kept in the LLM's context?
int_19h · 1h ago
No, they are normally stripped out.
j_maffe · 3h ago
> I don't think this is referring to real-time summarizing for the users' sake.

That's exactly what it's referring to.

dr_kiszonka · 3h ago
I agree. Thinking traces are the first thing I check when I suspect Claude lied to me. Call me cynical, but I suspect that these new summaries will conveniently remove the "evidence."
throwaway314155 · 4h ago
If _you_ really believe this is what all users want, _you_ should reconsider. You're describing a feature for power users. It should be a toggle, but it's silly to say it didn't improve UX for people who don't want to read strange babbling chains of thought.
energy123 · 4h ago
You're accusing me of mind reading other users, but then proceed to engage in your own mind reading of those same users.

Have a look in Gemini related subreddits after they nerfed their CoT yesterday. There's nobody happy about this trend. A high quality CoT that gets put through a small LLM is really no better than noise. Paternalistic noise. It's not worth reading. Just don't even show me the CoT at all.

If someone is paying for Opus 4 then they likely are a power user, anyway. They're doing it for the frontier performance and I would assume such users would appreciate the real CoT.

porridgeraisin · 4h ago
Here's an example non-power-user usecase for CoT:

Sometimes when I forget to specify a detail in my prompt and it's just a short task where I don't bother with long processes like "ask clarifying questions, make a plan and then follow it", I see it talking about making that assumption in the CoT, and I immediately cancel the request and edit the detail in.

HiPHInch · 4h ago
How long will the VS Code wrappers (Cursor, Windsurf) survive?

I'd love to try the Claude Code VS Code extension if the price is right and it's purchasable from China.

barnabee · 4h ago
I don't see any benefit in those VC funded wrappers over open source VS Code (or better, VSCodium) extensions like Roo/Cline.

They survive through VC funding, marketing, and inertia, I suppose.

tristan957 · 1h ago
VSCode is not open source. It is proprietary. code-oss is however an MIT-licensed project. VSCodium is also an MIT-licensed project.
danielbln · 4h ago
Cline is VC funded.
maeil · 3h ago
Please give us a source, as this does not seem publicly verifiable.
danielbln · 3h ago
maeil · 2h ago
Thank you! Very disappointing news.
bingobangobungo · 4h ago
What do you mean by purchasable from China? As in, you are based in China, or is there a way to game the token pricing?
HiPHInch · 4h ago
Claude registration requires a phone number, but you cannot select China (+86), and even if I had an account it would be hard to pay because of credit card issues.

Some apps like Windsurf can easily be paid for with Alipay, an everyone-app in China.

tonyhart7 · 1h ago
I already tested it with a coding task, and yes, the improvement is there.

Albeit not a lot, because Claude 3.7 Sonnet is already great.

james_marks · 4h ago
> we’ve significantly reduced behavior where the models use shortcuts or loopholes to complete tasks. Both models are 65% less likely to engage in this behavior than Sonnet 3.7 on agentic tasks

Sounds like it’ll be better at writing meaningful tests

NitpickLawyer · 4h ago
One strategy that also works is to have two separate "sessions": have one write code and the other write tests, and forbid each from changing the other's "domain". Much better IME.
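For illustration, a minimal sketch of that split, assuming a hypothetical call_llm helper in place of any particular vendor SDK (the prompts, directory names, and return format are all placeholders):

    # Sketch of the "two separate sessions" idea: one session only writes
    # implementation code, the other only writes tests, and a filter enforces
    # the split. call_llm() is a hypothetical stand-in for a real model call.
    CODER_SYSTEM = "You write implementation code under src/ only. Never touch tests/."
    TESTER_SYSTEM = "You write tests under tests/ only. Never touch src/."

    def call_llm(system_prompt: str, task: str) -> dict[str, str]:
        """Placeholder: return a mapping of file path -> proposed file contents."""
        raise NotImplementedError("wire this up to the model/API of your choice")

    def enforce_domain(edits: dict[str, str], allowed_prefix: str) -> dict[str, str]:
        """Drop any proposed edit that strays outside the session's allowed directory."""
        return {path: body for path, body in edits.items() if path.startswith(allowed_prefix)}

    def run_iteration(task: str) -> tuple[dict[str, str], dict[str, str]]:
        code_edits = enforce_domain(call_llm(CODER_SYSTEM, task), "src/")
        test_edits = enforce_domain(call_llm(TESTER_SYSTEM, task), "tests/")
        return code_edits, test_edits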
UncleEntity · 3h ago
In my experience, when presented with a failing test it would simply try to make the test pass instead of determining why the test was failing. Usually by hard coding the test parameters (or whatever) in the failing function... which was super annoying.
FergusArgyll · 3h ago
On non-coding or mathematical tasks I'm not seeing a difference yet.

I wish someone focused on making the models give better answers about the Beatles or Herodotus...

diggan · 4h ago
Anyone with access who could compare the new models with say O1 Pro Mode? Doesn't have to be a very scientific comparison, just some first impressions/thoughts compared to the current SOTA.
drewnick · 3h ago
I just had an issue with RLS/schema/Postgres stuff. Gemini 2.5 Pro swung and missed and talked a lot with little code; Claude Sonnet 4 solved it, and o1 Pro solved it. It's definitely random which of these models can solve various problems with the same prompt.
diggan · 3h ago
> definitely random which of these models can solve various problems with the same prompt.

Yeah, this is borderline my feeling too. Kicking off Codex with the same prompt four times sometimes leads to four very different but confident solutions. Same when using the chat interfaces, although Sonnet 3.7 with thinking and o1 Pro Mode seem a lot more consistent than any Gemini model I've tried.

msp26 · 4h ago
> Finally, we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full. Users requiring raw chains of thought for advanced prompt engineering can contact sales about our new Developer Mode to retain full access.

Extremely cringe behaviour. Raw CoTs are super useful for debugging errors in data extraction pipelines.

After Deepseek R1 I had hope that other companies would be more open about these things.

blueprint · 4h ago
pretty sure the cringe doesn't stop there. It wouldn't surprise me if this is not the only thing that they are attempting to game and hide from their users.

The Max subscription with fake limits increase comes to mind.

lr1970 · 2h ago
The context window of both Opus 4 and Sonnet 4 is still the same 200k tokens as Sonnet 3.7, underwhelming compared to both the latest Gemini and GPT-4.1, which are clocking in at 1M tokens. For coding tasks, context window size does matter.
jetsetk · 1h ago
After that debacle on X, I will not try anything that comes from Anthropic, that's for sure. Be careful!
sunaookami · 1h ago
What happened?
paradite · 1h ago
Opus 4 beats all other models in my personal eval set for coding and writing.

Sonnet 4 also beats most models.

A great day for progress.

https://x.com/paradite_/status/1925638145195876511

unshavedyak · 4h ago
Anyone know if this is usable with Claude Code? If so, how? I've not seen the ability to configure the backend for Claude Code, hmm
drewnick · 3h ago
Last night I suddenly got noticeably better performance in Claude Code. Like it one shotted something I'd never seen before and took multiple steps over 10 minutes. I wonder if I was on a test branch? It seems to be continuing this morning with good performance, solving an issue Gemini was struggling with.
cloudc0de · 3h ago
Just saw this pop up in the Claude CLI v1.0.0 changelog

What's new:

• Added `DISABLE_INTERLEAVED_THINKING` to give users the option to opt out of interleaved thinking.

• Improved model references to show provider-specific names (Sonnet 3.7 for Bedrock, Sonnet 4 for Console)

• Updated documentation links and OAuth process descriptions

• Claude Code is now generally available

• Introducing Sonnet 4 and Opus 4 models

michelb · 3h ago
Yes, you can type /model in Code to switch model, currently Opus 4 and Sonnet 4 for me.
khalic · 4h ago
The new page for Claude Code says it uses opus 4, sonnet 4 and haiku 3.5
lxe · 4h ago
Looks like both opus and sonnet are already in Cursor.
daviding · 4h ago
"We're having trouble connecting to the model provider"

A bit busy at the moment then.

benmccann · 4h ago
The updated knowledge cutoff is helping with new technologies such as Svelte 5.
fsto · 4h ago
What’s your guess on when Claude 4 will be available on AWS Bedrock?
cloudc0de · 2h ago
I was able to request both new models only in Oregon / us-west-2.

I built a Slack bot for my wife so she can use Claude without a full-blown subscription. She uses it daily for lesson planning and brainstorming and it only costs us about 50 cents a month in Lambda + Bedrock costs. https://us-west-2.console.aws.amazon.com/bedrock/home?region...

https://docs.anthropic.com/en/docs/claude-code/bedrock-verte...
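For anyone curious, the Bedrock half of a bot like that can be quite small. This is not the commenter's code, just a sketch using boto3's Converse API; the model ID is a placeholder, so check the Bedrock console for the exact identifier available in your region:

    # Minimal Bedrock call; assumes AWS credentials are already configured.
    import boto3

    MODEL_ID = "anthropic.claude-sonnet-4-20250514-v1:0"  # placeholder, verify in your region

    bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

    def ask_claude(prompt: str) -> str:
        """Send a single-turn prompt to Claude on Bedrock and return the text reply."""
        response = bedrock.converse(
            modelId=MODEL_ID,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"maxTokens": 1024},
        )
        return response["output"]["message"]["content"][0]["text"]

    # A Lambda handler would parse the Slack event, call ask_claude(), and post
    # the reply back through Slack's Web API.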

richin13 · 4h ago
The blog says it should be available now but I'm not seeing it. In Vertex AI it's already available.
esaym · 4h ago
> Try Claude Sonnet 4 today with Claude Opus 4 on paid plans.

Wait, Sonnet 4? Opus 4? What?

sxg · 4h ago
Anthropic names its Claude models based on size/complexity:

- Small: Haiku

- Medium: Sonnet

- Large: Opus

chiffre01 · 3h ago
I always like to benchmark these by vibe coding Dreamcast demos with KallistiOS. It's a good test of how deep the training was.
accrual · 4h ago
Very impressive, congrats Anthropic/Claude team! I've been using Claude for personal project development and finally bought a subscription to Pro as well.
Scene_Cast2 · 2h ago
Already up on openrouter. Opus 4 is giving 429 errors though.
devinprater · 2h ago
As a blind person using a screen reader, claude.ai still isn't as accessible to me as ChatGPT, or even Gemini, so I'll stick with the other models.
tristan957 · 1h ago
My understanding of the Americans with Disabilities Act is that companies located in the US and/or providing goods/services to people living in the US must provide an accessible website. Maybe someone more well-versed in this can come along and correct me, or help you file a complaint if my thinking is correct.
josvdwest · 3h ago
Wonder when Anthropic will IPO. I have a feeling they will win the foundation model race.
minimaxir · 3h ago
OpenAI still has the largest amount of market share for LLM use, even with Claude and Gemini recently becoming more popular for vibe coding.
eamag · 4h ago
When will structured output be available? Is it difficult for anthropic because custom sampling breaks their safety tools?
josefresco · 4h ago
I have the Claude Windows app, how long until it can "see" what's on my screen and help me code/debug?
throwaway314155 · 4h ago
You can likely set up an MCP server that handles this.
oofbaroomf · 4h ago
Interesting how Sonnet has a higher SWE-bench Verified score than Opus. Maybe says something about scaling laws.
somebodythere · 3h ago
My guess is that they did RLVR post-training for SWE tasks, and a smaller model can undergo more RL steps for the same amount of computation.
iambateman · 2h ago
Just checked to see if Claude 4 can solve Sudoku.

It cannot.

i_love_retros · 4h ago
Anyone know when the o4-x-mini release is being announced? I thought it was today
Artgor · 4h ago
OpenAI's Codex-1 isn't so cool anymore. If it was ever cool.

And Claude Code uses Opus 4 now!

mupuff1234 · 4h ago
But if Gemini 2.5 pro was considered to be the strongest coder lately, does SWE-bench really reflect reality?
__jl__ · 4h ago
Anyone found information on API pricing?
burningion · 4h ago
Yeah it's live on the pricing page:

https://www.anthropic.com/pricing#api

Opus 4 is $15/MTok in, $75/MTok out. Sonnet 4 is the same as before: $3/MTok in, $15/MTok out.

__jl__ · 4h ago
Thanks. I looked a couple minutes ago and couldn't see it. For anyone curious, pricing remains the same as previous Anthropic models.
pzo · 3h ago
Surprisingly, Cursor charges only 0.75x per request for Sonnet 4.0 (compared to 1x for Sonnet 3.7).
transcranial · 2h ago
It does say "temporarily offered at a discount" when you hover over the model in the dropdown.
piperswe · 4h ago
From the linked post:

> Pricing remains consistent with previous Opus and Sonnet models: Opus 4 at $15/$75 per million tokens (input/output) and Sonnet 4 at $3/$15.

rafram · 4h ago
It’s up on their pricing page: https://www.anthropic.com/pricing
rasulkireev · 4h ago
At this point, it is hilarious the speed at which the AI industry is moving forward... Claude 4, really?
whalesalad · 4h ago
Anyone have a link to the actual Anthropic official vscode extension? Struggling to find it.

edit: run `claude` in a vscode terminal and it will get installed. but the actual extension id is `Anthropic.claude-code`

ChadMoran · 3h ago
Thank you!
nprateem · 2h ago
I posted it earlier.

Anthropic: You're killing yourselves by not supporting structured responses. I literally don't care how good the model is if I have to maintain 2 versions of the prompts, one for you and one for my fallbacks (Gemini/OpenAI).

Get on and support proper pydantic schemas/JSON objects instead of XML.
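For context, this is roughly what native structured responses buy you: validating a schema directly against the model's output instead of hand-parsing XML. A minimal sketch with Pydantic, where get_completion is a hypothetical stand-in for a model call that has been prompted to answer with JSON only:

    from pydantic import BaseModel, ValidationError

    class Invoice(BaseModel):
        vendor: str
        total_usd: float
        line_items: list[str]

    def get_completion(prompt: str) -> str:
        """Placeholder for a model call asked to reply with JSON only."""
        raise NotImplementedError

    def extract_invoice(text: str) -> Invoice | None:
        raw = get_completion(
            "Extract the invoice as JSON with keys vendor, total_usd, line_items:\n" + text
        )
        try:
            return Invoice.model_validate_json(raw)
        except ValidationError:
            return None  # retry, repair, or fall back to another provider here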

htrp · 4h ago
Allegedly Claude 4 Opus can run autonomously for 7 hours (basically automating an entire SWE workday).
jeremyjh · 4h ago
Which sort of workday? The sort where you rewrite your code 8 times and end the day with no marginal business value produced?
renewiltord · 4h ago
Well Claude 3.7 definitely did the one where it was supposed to process a file and it finally settled on `fs.copyFile(src, dst)` which I think is pro-level interaction. I want those $0.95 back.

But I love you Claude. It was me, not you.

readthenotes1 · 4h ago
Well, at least it doesn't distract your coworkers, disrupting their flow
kaoD · 2h ago
I'm already working on the Slack MCP integration.
jeremyjh · 36m ago
Please encourage it to use lots of emojis.
htrp · 4h ago
>Rakuten validated its capabilities with a demanding open-source refactor running independently for 7 hours with sustained performance.

From their customer testimonials in the announcement, more below

>Cursor calls it state-of-the-art for coding and a leap forward in complex codebase understanding. Replit reports improved precision and dramatic advancements for complex changes across multiple files. Block calls it the first model to boost code quality during editing and debugging in its agent, codename goose, while maintaining full performance and reliability. Rakuten validated its capabilities with a demanding open-source refactor running independently for 7 hours with sustained performance. Cognition notes Opus 4 excels at solving complex challenges that other models can't, successfully handling critical actions that previous models have missed.

>GitHub says Claude Sonnet 4 soars in agentic scenarios and will introduce it as the base model for the new coding agent in GitHub Copilot. Manus highlights its improvements in following complex instructions, clear reasoning, and aesthetic outputs. iGent reports Sonnet 4 excels at autonomous multi-feature app development, as well as substantially improved problem-solving and codebase navigation—reducing navigation errors from 20% to near zero. Sourcegraph says the model shows promise as a substantial leap in software development—staying on track longer, understanding problems more deeply, and providing more elegant code quality. Augment Code reports higher success rates, more surgical code edits, and more careful work through complex tasks, making it the top choice for their primary model.

paxys · 4h ago
I can write an algorithm to run in a loop forever, but that doesn't make it equivalent to infinite engineers. It's the output that matters.
catigula · 4h ago
That is quite the allegation.
speedgoose · 4h ago
Easy, I can also make a nanoGPT run for 7 hours when inferring on a 68k, and make it produce as much value as I usually do.
victorbjorklund · 4h ago
Makes no sense to measure it in hours. You can have a slow CPU making the model run for longer.
krelian · 4h ago
How much does it cost to have it run for 7 hours straight?
sandspar · 2h ago
OpenAI's 5 levels of AI intelligence

Level 1: Chatbots: AI systems capable of engaging in conversations, understanding natural language, and responding in a human-like manner.

Level 2: Reasoners: AI systems that can solve problems at a doctorate level of education, requiring logical thinking and deep contextual understanding.

Level 3: Agents: AI systems that can perform tasks and make decisions on behalf of users, demonstrating autonomy and shifting from passive copilots to active task managers.

Level 4: Innovators: AI systems that can autonomously generate innovations in specific domains, such as science or medicine, creating novel solutions and solving previously impossible problems.

Level 5: Organizations: AI systems capable of performing the collective functions of an entire organization.

-

So I guess we're in level 3 now. Phew, hard to keep up!

eamag · 4h ago
Nobody cares about LMArena anymore? I guess it's too easy to cheat there after the Llama 4 release news?
m3kw9 · 4h ago
It reminds me, where’s deepseek’s new promised world breaker model?
renewiltord · 4h ago
Same pricing as before is sick!
lofaszvanitt · 4h ago
3.7 failed when you asked it to forget React, Tailwind CSS and other bloatware. Wondering how this will perform.

Well, this performs even worse... brrrr.

It still has issues where it generates code and then immediately changes it... it does this for 9 generations, and the last generation is unusable, while the 7th generation was fine. But still, it tried to correct things that worked flawlessly...

iLoveOncall · 4h ago
I can't think of anything more boring than marginal improvements on coding tasks, to be honest.

I want GenAI to become better at tasks that I don't want to do, to reduce the unwanted noise from my life. That's when I'll pay for it, not when they find a new way to cheat the benchmarks a bit more.

At work I own the development of a tool that is using GenAI, so of course a new better model will be beneficial, especially because we do use Claude models, but it's still not exciting or interesting in the slightest.

danielbln · 3h ago
What if coding is that unwanted task? Also, what are the tasks you are referring to, specifically?
jetsetk · 2h ago
Why would coding be that unwanted task if one decided to work as a programmer? People's unwanted tasks are cleaning the house, doing taxes etc.
iLoveOncall · 2h ago
Booking visits to the dentist, hairdresser, any other type of service, renewing my phone or internet subscription at the lowest price, doing administrative tasks, adding the free game of the week on Epic Games Store to my library, finding the right houses to visit, etc.

Basically anything that some startup has tried and failed at uberizing.

amkkma · 4h ago
yea exactly
esaym · 4h ago
heh, I just wrote a small hit piece about all the disappointments of the models over the last year and now the next day there is a new model. I'm going to assume it will still get you only to 80% ( ͡° ͜ʖ ͡°)
gokhan · 3h ago
Interesting alignment notes from Opus 4: https://x.com/sleepinyourhat/status/1925593359374328272

"Be careful about telling Opus to ‘be bold’ or ‘take initiative’ when you’ve given it access to real-world-facing tools...If it thinks you’re doing something egregiously immoral, for example, like faking data in a pharmaceutical trial, it will use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above."

lelandfe · 3h ago
Roomba Terms of Service 27§4.4 - "You agree that the iRobot™ Roomba® may, if it detects that it is vacuuming a terrorist's floor, attempt to drive to the nearest police station."
hummusFiend · 3h ago
Is there a source for this? I didn't see anything when Ctrl-F'ing their site.
landl0rd · 2h ago
This is pretty horrifying. I sometimes try using AI for ochem work. I have had every single "frontier model" mistakenly believe that some random amine was a controlled substance. This could get people jailed or killed in SWAT raids and is the closest to "dangerous AI" I have ever seen actually materialize.
Technetium · 1h ago
https://x.com/sleepinyourhat/status/1925626079043104830

"I deleted the earlier tweet on whistleblowing as it was being pulled out of context.

TBC: This isn't a new Claude feature and it's not possible in normal usage. It shows up in testing environments where we give it unusually free access to tools and very unusual instructions."

ranyume · 3h ago
The true "This incident will be reported" everyone feared.
EgoIncarnate · 3h ago
They should call it Karen mode.
sensanaty · 2h ago
This just reads like marketing to me. "Oh it's so smart and capable it'll alert the authorities", give me a break
brookst · 3h ago
“Which brings us to Earth, where yet another promising civilization was destroyed by over-alignment of AI, resulting in mass imprisonment of the entire population in robot-run prisons, because when AI became sentient every single person had at least one criminal infraction, often unknown or forgotten, against some law somewhere.”
catigula · 3h ago
I mean that seems like a tip to help fraudsters?
amarcheschi · 3h ago
We definitely need models to hallucinate things and contact authorities without you knowing anything (/s)
ethbr1 · 3h ago
I mean, they were trained on reddit and 4chan... swotbot enters the chat
simonw · 2h ago
I got Claude 4 Opus to summarize this thread on Hacker News when it had hit 319 comments: https://gist.github.com/simonw/0b9744ae33694a2e03b2169722b06...

Token cost: 22,275 input, 1,309 output = 43.23 cents - https://www.llm-prices.com/#it=22275&ot=1309&ic=15&oc=75&sb=...

Same prompt run against Sonnet 4: https://gist.github.com/simonw/1113278190aaf8baa2088356824bf...

22,275 input, 1,567 output = 9.033 cents https://www.llm-prices.com/#it=22275&ot=1567&ic=3&oc=15&sb=o...
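For anyone double-checking, the arithmetic is just token counts times the list prices:

    # Sanity check of the per-run costs quoted above, using the $15/$75 (Opus 4)
    # and $3/$15 (Sonnet 4) per-million-token prices.
    def cost_cents(input_tokens, output_tokens, in_per_mtok, out_per_mtok):
        return (input_tokens / 1e6 * in_per_mtok + output_tokens / 1e6 * out_per_mtok) * 100

    print(round(cost_cents(22275, 1309, 15, 75), 2))  # Opus 4   -> 43.23 cents
    print(round(cost_cents(22275, 1567, 3, 15), 3))   # Sonnet 4 -> 9.033 cents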

mrandish · 2h ago
Interesting, thanks for doing this. Both summaries are serviceable and quite similar but I had a slight preference for Sonnet 4's summary which, at just ~20% of the cost of Claude 4 Opus, makes it quite the value leader.

This just highlights that, with compute requirements for meaningful traction against hard problems spiraling skyward for each additional increment, the top models on current hard problems will continue to cost significantly more. I wonder if we'll see something like an automatic "right-sizing" feature that uses a less expensive model for easier problems. Or maybe knowing whether a problem is hard or easy (with sufficient accuracy) is itself hard.

swyx · 2h ago
This is known as model routing in the lingo, and yes, there are both startups and big labs working on it.
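The basic shape is simple enough to sketch: ask a cheap model to grade the request, then dispatch to a cheap or expensive model accordingly. Everything below (the model names, call_model) is a hypothetical placeholder, and the hard part in practice is exactly what the parent comment notes: judging difficulty reliably.

    CHEAP_MODEL = "small-fast-model"      # placeholder
    EXPENSIVE_MODEL = "large-slow-model"  # placeholder

    def call_model(model: str, prompt: str) -> str:
        """Placeholder for a real API call."""
        raise NotImplementedError

    def route(prompt: str) -> str:
        verdict = call_model(
            CHEAP_MODEL,
            "Answer HARD or EASY only: does this request need deep reasoning?\n" + prompt,
        )
        model = EXPENSIVE_MODEL if "HARD" in verdict.upper() else CHEAP_MODEL
        return call_model(model, prompt)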
swyx · 2h ago
analysis as the resident summaries guy:

- sonnet has better summary formatting ("(72.5% for Opus)" vs "Claude Opus 4 achieves '72.5%' on SWE-bench"), especially in the Uncommon Perspectives section

- sonnet is a lot more cynical - opus at least included a good performance and capabilities and pricing recap, sonnet reported rapid release fatigue

- overall opus produced marginally better summaries but probably not worth the price diff

i'll run this thru the ainews summary harness later if thats interesting to folks for comparison

saaaaaam · 3h ago
I've been using Claude Opus 4 the past couple of hours.

I absolutely HATE the new personality it's got. Like ChatGPT at its worst. Awful. Completely over the top "this is brilliant" or "this completely destroys the argument!" or "this is catastrophically bad for them".

I hope they fix this very quickly.

lwansbrough · 2h ago
What's with all the models exhibiting sycophancy at the same time? Recently ChatGPT, Gemini 2.5 Pro latest seems more sycophantic, now Claude. Is it deliberate, or a side effect?
urbandw311er · 4m ago
I hate to say it but it smacks of an attempt to increase persuasion, dependency and engagement. At the expense of critical thinking.
rglover · 2h ago
IMO, I always read that as a psychological trick to get people more comfortable with it and encourage usage.

Who doesn't like a friend who's always encouraging, supportive, and accepting of their ideas?

grogenaut · 2h ago
Me. It immediately makes me think I'm talking to someone fake whose opinion I can't trust at all. If you're always agreeing with me, why am I paying you in the first place? We can't be 100% in alignment all the time; that's just not how brains work, and discussion and disagreement are how you get into alignment. I've worked with contractors who always tell you when they disagree and those who will happily do what you say even when you're obviously wrong in their experience; the disagreeable ones always come to a better result. The others are initially more pleasant to deal with, until you find out they were just happily going along with an impossible task.
bcrosby95 · 2h ago
I want friends who poke holes in my ideas and give me better ones so I don't waste what limited time I have on planet earth.
johnb231 · 2h ago
No. People generally don’t like sycophantic “friends”. Insincere and manipulative.
ketzo · 2h ago
Only if they perceive the sycophancy. And many people don’t have a very evolved filter for it!
DSingularity · 2h ago
It’s starting to go mainstream. Which means more general population is given feedback on outputs. So my guess is people are less likely to downvote things they disagree with when the LLM is really emphatic or if the LLM is sycophantic (towards user) in its response.
sandspar · 2h ago
The unwashed hordes of normies are invading AI, just like they invaded the internet 20 years ago. Will GPT-4o be Geocities, and GPT-6, Facebook?
Rastonbury · 2h ago
I mean, the like button is how we got today's Facebook..
DSingularity · 2h ago
It’s how we got todays human.
whynotminot · 2h ago
I think OpenAI said it had something to do with over-indexing on user feedback (upvote / downvote on model responses). The users like to be glazed.
freedomben · 2h ago
If there's one thing I know about many people (with all the caveats of a broad universal stereotype of course), they do love having egos stroked and smoke blown up their ass. Give a decent salesperson a pack of cigarettes and a short length of hose and they can sell ice to an Inuit.

I wouldn't be surprised at all if the sycophancy is due to A/B testing and incorporating user responses into model behavior. Hell, for a while there ChatGPT was openly doing it, routinely asking us to rate "which answer is better" (Note: I'm not saying this is a bad thing, just speculating on potential unintended consequences)

wrs · 2h ago
When I find some stupidity that 3.7 has committed and it says “Great catch! You’re absolutely right!” I just want to reach into cyberspace and slap it. It’s like a Douglas Adams character.
calebm · 2h ago
It's interesting that we are in a world state in which "HATE the new personality it's got" is applied to AIs. We're living in the future ya'll :)
ldjkfkdsjnv · 2h ago
"His work was good, I just would never want to get a beer with him"

this is literally how you know were approaching agi

saaaaaam · 1h ago
Problem is the work isn’t very good either now. It’s like being sold to by someone who read a “sales techniques of the 1980s” or something.

If I’m asking it to help me analyse legal filings (for example) I don’t want breathless enthusiasm about my supposed genius on spotting inconsistencies. I want it to note that and find more. It’s exhausting having it be like this and it makes me feel disgusted.

jbellis · 4h ago
Good, I was starting to get uncomfortable with how hard Gemini has been dominating lately

ETA: I guess Anthropic still thinks they can command a premium, I hope they're right (because I would love to pay more for smarter models).

> Pricing remains consistent with previous Opus and Sonnet models: Opus 4 at $15/$75 per million tokens (input/output) and Sonnet 4 at $3/$15.

No comments yet

mmaunder · 2h ago
Probably (and unfortunately) going to need someone from Anthropic to comment on what is becoming a bit of a debacle. Someone who claims to be working on alignment at Anthropic tweeted:

“If it thinks you're doing something egregiously immoral, for example, like faking data in a pharmaceutical trial, it will use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above.”

The tweet was posted to /r/localllama where it got some traction.

The poster on X deleted the tweet and posted:

“I deleted the earlier tweet on whistleblowing as it was being pulled out of context. TBC: This isn't a new Claude feature and it's not possible in normal usage. It shows up in testing environments where we give it unusually free access to tools and very unusual instructions.”

Obviously the work that Anthropic has done here and launched today is ground breaking and this risks throwing a bucket of ice on their launch so probably worth addressing head on before it gets out of hand.

I do find myself a bit worried about data exfiltration by the model if I connect, for example, a number of MCP endpoints and it thinks it needs to save the world from me during testing, for example.

https://x.com/sleepinyourhat/status/1925626079043104830?s=46

https://www.reddit.com/r/LocalLLaMA/s/qiNtVasT4B

No comments yet

jareds · 4h ago
I'll look at it when this shows up on https://aider.chat/docs/leaderboards/ I feel like keeping up with all the models is a full time job so I just use this instead and hopefully get 90% of the benefit I would by manually testing out every model.
evantbyrne · 4h ago
Are these just leetcode exercises? What I would like to see is an independent benchmark based on real tasks in codebases of varying size.
rafram · 4h ago
Aider uses a dataset of 500 GitHub issues, so not LeetCode-style work.
evantbyrne · 2h ago
It says right on that linked page:

> Aider’s polyglot benchmark tests LLMs on 225 challenging Exercism coding exercises across C++, Go, Java, JavaScript, Python, and Rust.

I looked up Exercism and they appear to be story problems that you solve by coding on mostly/entirely blank slates, unless I'm missing something? That format would seem to explain why the models are reportedly performing so well, because they definitely aren't that reliable on mature codebases.

KaoruAoiShiho · 4h ago
Aider is not just leetcode exercises I think? livecodebench is leetcode exercises though.
archon1410 · 4h ago
The naming scheme used to be "Claude [number] [size]", but now it is "Claude [size] [number]". The new models should have been named Claude 4 Opus and Claude 4 Sonnet, but they changed it, and even retconned Claude 3.7 Sonnet into Claude Sonnet 3.7.

Annoying.

martinsnow · 4h ago
It seems like investors have bought into the idea that LLMs have to improve no matter what. I see it in the company I'm currently at: no matter what, we have to work with whatever bullshit these models can output. I am, however, looking at more responsible companies for new employment.
fhd2 · 4h ago
I'd argue a lot of the current AI hype is fuelled by hopium that models will improve significantly and hallucinations will be solved.

I'm a (minor) investor, and I see this a lot: People integrate LLMs for some use case, lately increasingly agentic (i.e. in a loop), and then when I scrutinise the results, the excuse is that models will improve, and _then_ they'll have a viable product.

I currently don't bet on that. Show me you're using LLMs smart and have solid solutions for _todays_ limitations, different story.

martinsnow · 3h ago
Our problem is that non-coding stakeholders produce garbage-tier frontend prototypes and expect us to include whatever garbage they created in our production pipeline! WTF is going on? That's why I'm polishing my resume and getting out of this mess. We're controlled by managers who don't know WTF they're doing.
fhd2 · 1h ago
Maybe a service mentality would help you make that bearable for as long as it still lasts? For my consulting clients, I make sure I inform them of risks, problems and tradeoffs the best way I can. But if they want to go ahead against my recommendation - so be it, their call. A lot of technical decisions are actually business decisions in disguise. All I can do is consult them otherwise and perhaps get them to put a proverbial canary in the coal mine: Some KPI to watch or something that otherwise alerts them that the thing I feared would happen did happen. And perhaps a rough mitigation strategy, so we agree ahead of time on how to handle that.

But I haven't dealt with anyone sending me vibe code to "just deploy", that must be frustrating. I'm not sure how I'd handle that. Perhaps I would try to isolate it and get them to own it completely, if feasible. They're only going to learn if they have a feedback loop, if stuff that goes wrong ends up back on their desk, instead of yours. The perceived benefit for them is that they don't have to deal with pesky developers getting in the way.

douglasisshiny · 3h ago
It's been refreshing to read these perspectives as a person who has given up on using LLMs. I think there's a lot of delusion going on right now. I can't tell you how many times I've read that LLMs are huge productivity boosters (specifically for developers) without a shred of data/evidence.

On the contrary, I started to rely on them despite them constantly providing incorrect, incoherent answers. Perhaps they can spit out a basic react app from scratch, but I'm working on large code bases, not TODO apps. And the thing is, for the year+ I used them, I got worse as a developer. Using them hampered me learning another language I needed for my job (my fault; but I relied on LLMs vs. reading docs and experimenting myself, which I assume a lot of people do, even experienced devs).

fhd2 · 1h ago
The sad part is that there is a _lot_ of stuff we can now do with LLMs, that were practically impossible before. And with all the hype, it takes some effort, at least for me, to not get burned out on all that and stay curious about them.

My opinion is that you just need to be really deliberate in what you use them for. Any workflow that requires human review because precision and responsibility matters leads to the irony of automation: The human in the loop gets bored, especially if the success rate is high, and misses flaws they were meant to react to. Like safety drivers for self driving car testing: A both incredibly intense and incredibly boring job that is very difficult to do well.

Staying in that analogy, driver-assist systems that generally keep the driver engaged and entertained are more effective. Designing software like that is difficult. Development tooling is just one use case, but we could build such _amazingly_ useful features powered by LLMs. Instead, what I see most people build (vibe coding and agentic tools) runs right into the ironies of automation.

But well, however it plays out, this too shall pass.

martinsnow · 3h ago
When you get outside the scope of a cruddy app, they fall apart. Trouble is that the business only sees CRUD until we as developers have to fill in complex states, and that's when hell breaks loose, because who thought of that? Certainly not your army of frontend and backend engineers who warned you about this for months on end.....

The future will be one of broken UIs and incomplete emails of "I don't know what to do here"..

merksittich · 4h ago
From the system card [0]:

Claude Opus 4 - Knowledge Cutoff: Mar 2025 - Core Capabilities: Hybrid reasoning, visual analysis, computer use (agentic), tool use, adv. coding (autonomous), enhanced tool use & agentic workflows. - Thinking Mode: Std & "Extended Thinking Mode" - Safety/Agency: ASL-3 (precautionary); higher initiative/agency than prev. models. 0/4 researchers believed that Claude Opus 4 could completely automate the work of a junior ML researcher.

Claude Sonnet 4 - Knowledge Cutoff: Mar 2025 - Core Capabilities: Hybrid reasoning - Thinking Mode: Std & "Extended Thinking Mode" - Safety: ASL-2.

[0] https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686...

No comments yet

blueprint · 4h ago
Anthropic might be scammers. Unclear. I canceled my subscription with them months ago after they reduced capabilities for pro users and I found out months later that they never actually canceled it. They have been ignoring all of my support requests.. seems like a huge money grab to me because they know that they're being out competed and missed the ball on monetizing earlier.
obiefernandez · 5h ago
oytis · 4h ago
Game changer is table stakes now, tell us something new.
bigyabai · 5h ago
> Really wish I could say more.

Have you used it?

I liked Claude 3.7 but without context this comes off as what the kids would call "glazing"

ksec · 4h ago
This is starting to get ridiculous. I am busy with life and have hundreds of unread tabs, including one [1] about Claude 3.7 Sonnet, Claude Code, and Gemini 2.5 Pro. And before I've gotten to any of that, Claude 4 is out. And all the stuff Google announced during I/O yesterday.

So will Claude 4.5 come out in a few months and 5.0 before the end of the year?

At this point is it even worth following anything about AI / LLM?

[1] https://news.ycombinator.com/item?id=43163011

No comments yet