Gemini 2.5 Pro Preview

350 meetpateltech 325 5/6/2025, 3:10:00 PM developers.googleblog.com ↗

Comments (325)

segphault · 4h ago
My frustration with using these models for programming in the past has largely been around their tendency to hallucinate APIs that simply don't exist. The Gemini 2.5 models, both pro and flash, seem significantly less susceptible to this than any other model I've tried.

There are still significant limitations: no amount of prompting will get current models to approach abstraction and architecture the way a person does. But I'm finding that these Gemini models are finally able to replace searches and Stack Overflow for a lot of my day-to-day programming.

jstummbillig · 38m ago
> no amount of prompting will get current models to approach abstraction and architecture the way a person does

I find this sentiment increasingly worrisome. It's entirely clear that every last human will be beaten on code design in the upcoming years (I am not going to argue if it's 1 or 5 years away, who cares?)

I wish people would just stop holding on to what amounts to nothing, and think and talk more about what can be done in a new world. We need good ideas and I think this could be a place to advance them.

jjice · 26m ago
I'm confused by your comment. It seems like you didn't really provide a retort to the parent's comment about bad architecture and abstraction from LLMs.

FWIW, I think you're probably right that we need to adapt, but there was no explanation as to _why_ you believe that that's the case.

saurik · 7m ago
I mean, didn't you just admit you are wrong? If we are talking 1-5 years out, that's not "current models".
bboygravity · 6m ago
This is hilarious to read if you have actually seen the average (embedded systems) production code written by humans.

Either you have no idea how terrible real world commercial software (architecture) is or you're vastly underestimating newer LLMs or both.

Jordan-117 · 2h ago
I recently needed to recommend some IAM permissions for an assistant on a hobby project; not complete access but just enough to do what was required. Was rusty with the console and didn't have direct access to it at the time, but figured it was a solid use case for LLMs since AWS is so ubiquitous and well-documented. I actually queried 4o, 3.7 Sonnet, and Gemini 2.5 for recommendations, stripped the list of duplicates, then passed the result to Gemini to vet and format as JSON. The result was perfectly formatted... and still contained a bunch of non-existent permissions. My first time being burned by a hallucination IRL, but just goes to show that even the latest models working in concert on a very well-defined problem space can screw up.
darepublic · 1h ago
Listen, I don't blame any mortal being for not grokking the AWS and Google docs. They are a twisting labyrinth of pointers to pointers, some of them deprecated though recommended by Google itself.
dotancohen · 1h ago
AWS docs have (had) an embedded AI model that would do this perfectly. I suppose it had better training data, and the actual spec as a RAG.
siscia · 2h ago
This problem has been solved by LSP (Language Server Protocol). All we need is a small server behind MCP that can communicate LSP information back to the LLM, and get the LLM to use it by adding to the prompt something like: "check your API usage with the LSP".

The unfortunate state of open source funding makes building such a simple tool a losing venture, unfortunately.
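
A minimal sketch of the idea (not the MCP plumbing itself): run a static checker over LLM-generated code and feed the diagnostics back to the model. Here pyright and its JSON output stand in for a real LSP server, and `ask_llm()` is a hypothetical wrapper around whatever model API you use.

```python
# Sketch: check LLM-generated code with a static analyzer and re-prompt on errors.
import json
import subprocess
import tempfile

def diagnostics(code: str) -> list[str]:
    # Write the generated code to a temp file and collect checker diagnostics.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    out = subprocess.run(["pyright", "--outputjson", path],
                         capture_output=True, text=True)
    report = json.loads(out.stdout or "{}")
    return [d["message"] for d in report.get("generalDiagnostics", [])]

def generate_checked(prompt: str, ask_llm, max_rounds: int = 3) -> str:
    # ask_llm is a placeholder for your model call.
    code = ask_llm(prompt)
    for _ in range(max_rounds):
        errors = diagnostics(code)
        if not errors:
            break
        code = ask_llm(prompt + "\n\nFix these diagnostics:\n" + "\n".join(errors))
    return code
```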

satvikpendem · 1h ago
This already happens in agent modes in IDEs like Cursor or VSCode with Copilot, it can check for errors with the LSP.
doug_durham · 2h ago
If they never get good at abstraction or architecture they will still provide a tremendous amount of value. I have them do the parts of my job that I don't like. I like doing abstraction and architecture.
mynameisvlad · 2h ago
Sure, but that's not the problem people have with them nor the general criticism. It's that people without the knowledge to do abstraction and architecture don't realize the importance of these things and pretend that "vibe coding" is a reasonable alternative to a well-thought-out project.
Karrot_Kream · 26m ago
We can rewind the clock 10 years and I can substitute "vibe coding" for VBA/Excel macros and we'd get a common type of post from back then.

There's always been a demand for programming by non technical stakeholders that they try and solve without bringing on real programmers. No matter the tool, I think the problem is evergreen.

sanderjd · 1h ago
The way I see this is that it's just another skill differentiator that you can take advantage of if you can get it right.

That is, if it's true that abstraction and architecture are useful for a given product, then people who know how to do those things will succeed in creating that product, and those who don't will fail. I think this is true for essentially all production software, but a lot of software never reaches production.

Transitioning or entirely recreating "vibecoded" proofs of concept to production software is another skill that will be valuable.

Having a good sense for when to do that transition, or when to start building production software from the start, and especially the ability to influence decision makers to agree with you, is another valuable skill.

I do worry about what the careers of entry level people will look like. It isn't obvious to me how they'll naturally develop any of these skills.

mynameisvlad · 36m ago
> "vibecoded" proofs of concept

The fact that you called it out as a PoC is already many bars above what most vibe coders are doing. Which is considering a barely functioning web app as proof that vibe coding is a viable solution for coding in general.

> I do worry about what the careers of entry level people will look like. It isn't obvious to me how they'll naturally develop any of these skills.

Exactly. There isn't really a path forward from vibe coding to anything productizable without actual, deep CS knowledge. And LLMs are not providing that.

codebolt · 4h ago
I've found they do a decent job searching for bugs now as well. Just yesterday I had a bug report on a component/page I wasn't familiar with in our Angular app. I simply described the issue as well as I could to Claude and asked politely for help figuring out the cause. It found the exact issue correctly on the first try and came up with a few different suggestions for how to fix it. The solutions weren't quite what I needed but it still saved me a bunch of time just figuring out the error.
M4v3R · 2h ago
That’s my experience as well. Many bugs involve typos, syntax issues or other small errors that LLMs are very good at catching.
yousif_123123 · 1h ago
The opposite problem is also true. I was using it to edit code I had that was calling the new openai image API, which is slightly different from the dalle API. But Gemini was consistently "fixing" the OpenAI call even when I explained clearly not to do that since I'm using a new API design etc. Claude wasn't having that issue.

The models are very impressive. But issues like these still make me feel they are more pattern matching (although there's also some magic, don't get me wrong) and not fully reasoning over everything correctly like you'd expect of a typical human reasoner.

toomuchtodo · 1h ago
It seems like the fix is straightforward (check the output against a machine readable spec before providing it to the user), but perhaps I am a rube. This is no different than me clicking through a search result to the underlying page to verify the veracity of the search result surfaced.
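
For the IAM example upthread, that post-check could be as small as the sketch below. The known_actions.json file and its flat-list format are assumptions for illustration, not an AWS API; you'd build it from whatever machine-readable spec your provider publishes.

```python
# Sketch: filter out hallucinated IAM actions by checking the model's output
# against a machine-readable list of valid action names. known_actions.json
# (a flat list like ["s3:GetObject", "s3:PutObject", ...]) is hypothetical.
import json

def split_actions(proposed: list[str], spec_path: str = "known_actions.json"):
    with open(spec_path) as f:
        known = set(json.load(f))
    valid = [a for a in proposed if a in known]
    hallucinated = [a for a in proposed if a not in known]
    return valid, hallucinated

valid, bogus = split_actions(["s3:GetObject", "s3:FrobnicateBucket"])
if bogus:
    print("Model invented these actions:", bogus)
```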
disgruntledphd2 · 1h ago
Why coding agents et al don't make use of the AST through LSP is a question I've been asking myself since the first release of GitHub copilot.

I assume that it's trickier than it seems as it hasn't happened yet.

celeritascelery · 36m ago
What good do you think that would do?
disgruntledphd2 · 1h ago
They are definitely pattern matching. Like, that's how we train them, and no matter how many layers of post training you add, you won't get too far from next token prediction.

And that's fine and useful.

abletonlive · 10m ago
I feel like there are two realities right now where half the people say LLMs don't do anything well and the other half is just using LLMs to the max. Can everybody preface what stack they are using or what exactly they are doing so we can better determine why it's not working for you? Maybe even include what your expectations are? Maybe even tell us what models you're using? How are you prompting the models exactly?

I find for 90% of the things I'm doing, LLMs remove 90% of the starting friction and let me get to the part that I'm actually interested in. Of course I also develop professionally in a Python stack and LLMs are 1-shotting a ton of stuff. My work is standard data pipelines and web apps. I'm a tech lead at a FAANG-adjacent company, and the systems I work with are responsible for about half a billion dollars a year in transactions directly, and growing. Everybody that I work with is getting valuable output from LLMs. We are using all the latest OpenAI models and have a business relationship with OpenAI. I don't think I'm even that good at prompting and mostly rely on "vibes". Half of the time I'm pointing the model to an example and telling it "in the style of X do X for me".

I feel like comments like these almost seem gaslight-y or maybe there's just a major expectation mismatch between people. Are you expecting LLMs to just do exactly what you say and your entire job is to sit back prompt the LLM?

jppittma · 18m ago
I've had great success by asking it to do project design first, compose the design into an artifact, and then asking it to consult the design artifact as it writes code.
epaga · 5m ago
This is a great idea - do you have a more detailed overview of this approach and/or an example? What types of things do you tell it to put into the "artefact"?
jug · 2h ago
I’ve seen benchs on hallucinations and OpenAI has typically performed worse than Google and Anthropic models. Sometimes significantly so. But it doesn’t seem like they have cared much. I’ve suspected that LLM performance is correlated to risking hallucinations? That is, if they’re bolder, this can be beneficial? Which helps in other performance benchmarks. But of course at the risk of hallucinating more…
mountainriver · 2h ago
The hallucinations are a result of RLVR. We reward the model for an answer and then force it to reason about how to get there when the base model may not have that information.
mdp2021 · 9m ago
> The hallucinations are a result of RLVR

Well, let us reward them for producing output that is consistent with selected documentation accessed from a database, then, and massacre them for output they cannot justify - like we do with humans.

froh · 32m ago
searching and ranking existing fragments and recombining them within well known paths is one thing; exploratively combining existing fragments into completely novel solutions quickly runs into combinatorial explosion.

so it's a great tool in the hands of a creative architect, but it is not one in and by itself and I don't see yet how it can be.

my pet theory is that the human brain can't understand and formalize its creativity because you need a higher order logic to fully capture some other logic. I've had it contested that the second Gödel incompleteness theorem "can't be applied like this to the brain" but I stubbornly insist yes, the brain implements _some_ formal system and it can't understand how that system works. tongue in cheek, somewhat, maybe.

but back to earth I agree llms are a great tool for a creative human mind.

mbesto · 49m ago
To date, LLMs can't replace the human element of:

- Determining what features to make for users

- Forecasting out a roadmap that are aligned to business goals

- Translating and prioritizing all of these to a developer (regardless of whether these developers are agentic or human)

Coincidentally these are the areas that frequently are the largest contributors to software businesses' successes... not whether you use Next.js with a Go and Elixir backend against a multi-geo redundant, multi-sharded CockroachDB database, or whether your code is clean/elegant.

nearbuy · 34m ago
What does it say when you ask it to?
redox99 · 3h ago
Making LLMs know what they don't know is a hard problem. Many attempts at making them refuse to answer what they don't know caused them to refuse to answer things they did in fact know.
Volundr · 3h ago
> Many attempts at making them refuse to answer what they don't know caused them to refuse to answer things they did in fact know.

Are we sure they know these things as opposed to being able to consistently guess correctly? With LLMs I'm not sure we even have a clear definition of what it means for it to "know" something.

redox99 · 2h ago
Yes. You could ask for factual information like "Tallest building in X place" and first it would answer it did not know. After pressuring it, it would answer with the correct building and height.

But also things where guessing was desirable. For example with a riddle it would tell you it did not know or there wasn't enough information. After pressuring it to answer anyway it would correctly solve the riddle.

The official llama 2 finetune was pretty bad with this stuff.

ajross · 2h ago
> Are we sure they know these things as opposed to being able to consistently guess correctly?

What is the practical difference you're imagining between "consistently correct guess" and "knowledge"?

LLMs aren't databases. We have databases. LLMs are probabilistic inference engines. All they do is guess, essentially. The discussion here is about how to get the guess to "check itself" with a firmer idea of "truth". And it turns out that's hard because it requires that the guessing engine know that something needs to be checked in the first place.

mynameisvlad · 2h ago
Simple, and even simpler from your own example.

Knowledge has an objective correctness. We know that there is a "right" and "wrong" answer and we know what a "right" answer is. "Consistently correct guesses", based on the name itself, is not reliable enough to actually be trusted. There's absolutely no guarantee that the next "consistently correct guess" is knowledge or a hallucination.

fwip · 1h ago
So, if that were so, then an LLM possesses no knowledge whatsoever, and cannot ever be trusted. Is that the line of thought you are drawing?
ajross · 2h ago
This is a circular semantic argument. You're saying knowledge is knowledge because it's correct, where guessing is guessing because it's a guess. But "is it correct?" is precisely the question you're asking the poor LLM to answer in the first place. It's not helpful to just demand a computation device work the way you want, you need to actually make it work.

Also, too, there are whole subfields of philosophy that make your statement here kinda laughably naive. Suffice it to say that, no, knowledge as rigorously understood does not have "an objective correctness".

mynameisvlad · 1h ago
I mean, it clearly does based on your comments showing a need for a correctness check to disambiguate between made up "hallucinations" and actual "knowledge" (together, a "consistently correct guess").

The fact that you are humanizing an LLM is honestly just plain weird. It does not have feelings. It doesn't care that it has to answer "is it correct?" and saying poor LLM is just trying to tug on heartstrings to make your point.

ajross · 1h ago
FWIW "asking the poor <system> to do <requirement>" is an extremely common idiom. It's used as a metaphor for an inappropriate or unachievable design requirement. Nothing to do with LLMs. I work on microcontrollers for a living.
rdtsc · 1h ago
> Making LLMs know what they don't know is a hard problem. Many attempts at making them refuse to answer what they don't know caused them to refuse to answer things they did in fact know.

They are the perfect "fake it till you make it" example cranked up to 11. They'll bullshit you, but will do it confidently and with proper grammar.

> Many attempts at making them refuse to answer what they don't know caused them to refuse to answer things they did in fact know.

I can see in some contexts that being desirable if it can be a parameter that can be tweaked. I guess it's not that easy, or we'd already have it.

bezier-curve · 1h ago
The best way around this is to dump documentation of the APIs you need them privy to into their context window.
mountainriver · 2h ago
https://github.com/IINemo/lm-polygraph is the best work in this space
tough · 2h ago
You should give it docs for each of your base dependencies in an MCP tool or whatever so it can just consult them.

Internet access also helps.

Also, having markdown files with the stack etc. and any -rules- helps.

ChocolateGod · 4h ago
I asked today both Claude and ChatGPT to fix a Grafana Loki query I was trying to build, both hallucinated functions that didn't exist, even when telling to use existing functions.

To my surprise, Gemini got it spot on first time.

fwip · 1h ago
Could be a bit of a "it's always in the last place you look" kind of thing - if Claude or CGPT had gotten it right, you wouldn't have tried Gemini.
pdntspa · 16m ago
I don't know about that, my own adventures with Gemini Pro 2.5 in Roo Code has it outputting code in a style that is very close to my own

While far from perfect for large projects, controlling the scope of individual requests (with orchestrator/boomerang mode, for example) seems to do wonders

Given the sheer, uh, variety of code I see day to day in an enterprise setting, maybe the problem isn't with Gemini?

0x457 · 2h ago
I've noticed that models that can search the internet do it a lot less, because I guess they can look up documentation? My annoyance now is that they don't take the version into consideration.
satvikpendem · 1h ago
If you use Cursor, you can use @Docs to let it index the documentation for the libraries and languages you use, so no hallucination happens.
impulser_ · 2h ago
Use few-shot learning. Build a simple prompt with basic examples of how to use the API and it will do significantly better.

LLMs just guess, so you have to give them a cheatsheet to help them guess closer to what you want.
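
A few-shot prompt for this can be as small as the sketch below. The `client.images.generate` calls are made-up stand-ins, not a real SDK; in practice you'd paste known-good snippets straight from the docs of the API you're targeting.

```python
# Sketch: prepend a couple of correct API calls (the "cheatsheet") before the task.
EXAMPLES = '''
# Example 1: generate a single image
result = client.images.generate(model="img-1", prompt="a red bicycle", size="1024x1024")

# Example 2: generate an image and save it to disk
result = client.images.generate(model="img-1", prompt="a lighthouse at dusk")
open("out.png", "wb").write(result.data[0].content)
'''

def build_prompt(task: str) -> str:
    return (
        "You are writing code against the following API. "
        "Use only the calls shown in these examples:\n"
        f"{EXAMPLES}\n"
        f"Task: {task}\n"
    )

print(build_prompt("Generate three icons and save them as icon_0.png, icon_1.png, icon_2.png"))
```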

M4v3R · 2h ago
At this point the time it takes to teach the model might be more than you save from using it for interacting with that API.
rcpt · 2h ago
I'm using repomix for this
johnisgood · 2h ago
> hallucinate APIs

Tell me about it. Thankfully I have not experienced it as much with Claude as I did with GPT. It can get quite annoying. GPT kept telling me to use this and that and none of them were real projects.

pzo · 3h ago
I feel your pain. Cursor has a docs feature, but many times when I pointed it to @docs and selected a recently indexed one, it still didn't get it. I still have to try the context7 MCP, which looks promising:

https://github.com/upstash/context7

thr0waway39290 · 1h ago
Replacing stackoverflow is definitely helpful, but the best use case for me is how much it helps in high-level architecture and planning before starting a project.
ksec · 4h ago
I have been asking whether AI without hallucination, coding or not, is possible, but so far there's no real concrete answer.
mattlondon · 4h ago
It's already much improved on the early days.

But I wonder when we'll be happy? Do we expect colleagues, friends, and family to be 100% laser-accurate 100% of the time? I'd wager we don't. Should we expect that from an artificial intelligence too?

mdp2021 · 4m ago
Yes we want people "in the game" to be of sound mind. (The matter there is not about being accurate, but of being trustworthy - substance, not appearance.)

And tools in the game, even more so (there's no excuse for the engineered).

ksec · 17m ago
I don't expect it to be 100% accurate. Software isn't bug free, humans aren't perfect. But maybe 99.99%? At least, given enough time and resources, humans could fact-check it ourselves. And precisely because we know we are not perfect, in accounting and court cases we have due diligence.

And it is also not just about the %. It is also about the type of error. Will we reach a point where we change our perception and say these are expected, non-human errors?

Or could we have a specific LLM that only checks for these types of error?

kweingar · 4h ago
I expect my calculator to be 100% accurate 100% of the time. I have slightly more tolerance for other software having defects, but not much more.
mattlondon · 56m ago
AIs aren't intended to be used as calculators though?

You could say that when I use my spanner/wrench to tighten a nut it works 100% of the time, but as soon as I try to use a screwdriver it's terrible and full of problems and it can't even reliably do something as trivially easy as tighten a nut, even though a screwdriver works the same way by using torque to tighten a fastener.

Well that's because one tool is designed for one thing, and one is designed for another.

gilbetron · 1h ago
It's your option not to use it. However, this is a competitive environment and so we will see who pulls out ahead, those that use AI as a productivity multiplier versus those that do not. Maybe that multiplier is less than 1, time will tell.
kweingar · 1h ago
Agreed. The nice thing is that I am told by HN and Twitter that agentic workflows makes code tasks very easy, so if it turns out that using these tools multiplies productivity, then I can just start using them and it will be easy. Then I am caught up with the early adopters and don't need to worry about being out-competed by them.
asadotzler · 3h ago
And a $2.99 drugstore slim wallet calculator with solar power gets it right 100% of the time while billion dollar LLMs can still get arithmetic wrong on occasion.
pb7 · 2h ago
My hammer can't do any arithmetic at all, why does anyone even use them?
izacus · 1h ago
What you're being asked is to stop trying to hammer every single thing that comes into your vicinity. Smashing your computer with a hammer won't create code.
namaria · 2h ago
Does it sometimes instead of driving a nail hit random things in the house?
hn_go_brrrrr · 1h ago
Yes, like my thumb.
Vvector · 2h ago
Try "1/3". The calculator answer is not "100% accurate"
bb88 · 1h ago
I had a casio calculator back in the 1980's that did fractions.

So when I punched in 1/3 it was exactly 1/3.

pizza · 2h ago
Are you sure about that? Try these..

- (1e(1e10) + 1) - 1e(1e10)

- sqrt(sqrt(2)) * sqrt(sqrt(2)) * sqrt(sqrt(2)) * sqrt(sqrt(2))

ctxc · 2h ago
Three decades and I haven't had to do anything remotely resembling this on a calculator, much less find the calculator wrong. Same for the majority of general population I assume.
tasuki · 2h ago
The person you're replying to pointed out that you shouldn't expect a calculator to be 100% accurate 100% of the time. Especially not when faced with adversarial prompts.
jjmarr · 2h ago
(1/3)*3
Analemma_ · 3h ago
I don't think that's the relevant comparison though. Do you expect StackOverflow or product documentation to be 100% accurate 100% of the time? I definitely don't.
kweingar · 1h ago
I actually agree with this. I use LLMs often, and I don't compare them to a calculator.

Mainly I meant to push back against the reflexive comparison to a friend or family member or colleague. AI is a multi-purpose tool that is used for many different kinds of tasks. Some of these tasks are analogues to human tasks, where we should anticipate human error. Others are not, and yet we often ask an LLM to do them anyway.

ctxc · 2h ago
Also, documentation and SO are incorrect in a predictable way. We don't expect them to state, in a matter-of-fact way, things that just don't exist.
ctxc · 2h ago
The error introduced by the data is expected and internalized; it's the error of LLMs on _top_ of that that's hard to internalize.
ziml77 · 4h ago
Yes we should expect better from an AI that has a knowledge base much larger than any individual and which can very quickly find and consume documentation. I also expect them to not get stuck trying the same thing they've already been told doesn't work, same as I would expect from a person.
cinntaile · 4h ago
It's a tool, not a human, so I don't know if the comparison even makes sense?
kortilla · 3h ago
If colleagues lie with the certainty that LLMs do, they would get fired for incompetence.
ChromaticPanic · 1h ago
Have you worked in an actual workplace? Confidence is king.
dmd · 2h ago
Or elected to high office.
scarab92 · 2h ago
I wish that were true, but I’ve found that certain types of employees do confidently lie as much as llms, especially when answering “do you understand” type questions
izacus · 1h ago
And we try to PIP and fire those as well, not turn everyone else into them.
pohuing · 4h ago
It's a tool, not an intelligence, a tool that costs money on every erroneous token. I expect my computer to be more reliable at remembering things than myself, that's one of the primary use cases even. Especially if using it costs money. Of course errors are possible, but rarely do they happen as frequently in any other program I use.
Foreignborn · 4h ago
Try dropping the entire api docs in the context. If it’s verbose, i usually pull only a subset of pages.

Usually I’m using a minimum of 200k tokens to start with gemini 2.5.

nolist_policy · 1h ago
That's more than 222 novel pages:

200k tokens ≈ (1/3) · 200k ≈ 67k words ≈ (1/300) · 67k ≈ 222 pages (assuming roughly 3 tokens per word and 300 words per page)

pizza · 3h ago
"if it were a fact, it wouldn't be called intelligence" - donald rumsfeld
thefourthchime · 4h ago
Ask the models that can search to double check their API usage. This can just be part of a pre-prompt.
Tainnor · 2h ago
I definitely get more use out of Gemini Pro than other models I've tried, but it's still very prone to bullshitting.

I asked it a complicated question about the Scala ZIO framework that involved subtyping, type inference, etc. - something that would definitely be hard to figure out just from reading the docs. The first answer it gave me was very detailed, very convincing and very wrong. Thankfully I noticed it myself and was able to re-prompt it and I got an answer that is probably right. So it was useful in the end, but only because I realised that the first answer was nonsense.

mannycalavera42 · 57m ago
same, I asked a simple question about the JavaScript fetch API and it started talking about the workspace API. When I asked about that workspace API, it replied it was the Google Workspace API ¯\_(ツ)_/¯
gxs · 2h ago
Huh? Have you ever just told it, that API doesn’t exist, find another solution?

Never seen it fumble that around

Swear people act like humans themselves don’t ever need to be asked for clarification

paulirish · 2h ago
> Gemini 2.5 Pro now ranks #1 on the WebDev Arena leaderboard

It'd make sense to rename WebDev Arena to React/Tailwind Arena. Its system prompt requires [1] those technologies and the entire tool breaks when requesting vanilla JS or other frameworks. The second-order implications of models competing on this narrow definition of webdev are rather troublesome.

[1] https://blog.lmarena.ai/blog/2025/webdev-arena/#:~:text=PROM...

martinsnow · 1h ago
Bwoah, it's almost as if React and Tailwind are the bee's knees in frontend atm


ranyume · 4h ago
I don't know if I'm doing something wrong, but every time I ask gemini 2.5 for code it outputs SO MANY comments. An exaggerated amount of comments. Sections comments, step comments, block comments, inline comments, all the gang.
lukeschlather · 3h ago
I usually remove the comments by hand. It's actually pretty helpful, it ensures I've reviewed every piece of code carefully, especially since most of the comments are literally just restating the next line, and "does this comment add any information?" is a really helpful question to make sure I understand the code.
tasuki · 2h ago
Same! It eases my code review. In the rare occasions I don't want to do that, I ask the LLM to provide the code without comments.
Benjammer · 4h ago
I've found that heavily commented code can be better for the LLM to read later, so it pulls in explanatory comments into context at the same time as reading code, similar to pulling in @docs, so maybe it's doing that on purpose?
koakuma-chan · 4h ago
No, it's just bad. I've been writing a lot of Python code past two days with Gemini 2.5 Pro Preview, and all of its code was like:

```python
def whatever():
    # --- SECTION ONE OF THE CODE ---
    ...
    # --- SECTION TWO OF THE CODE ---
    try:
        [some "dangerous" code]
    except Exception as e:
        logging.error(f"Failed to save files to {output_path}: {e}")
        # Decide whether to raise the error or just warn
        # raise IOError(f"Failed to save files to {output_path}: {e}")
```

(it adds commented out code like that all the time, "just in case")

It's terrible.

I'm back to Claude Code.

NeutralForest · 3h ago
I'm seeing it trying to catch blind exceptions in Python all the time. I see it in my colleagues' code all the time; it's driving me nuts.
JoshuaDavid · 2h ago
The training loop asked the model to one-shot working code for the given problems without being able to iterate. If you had to write code that had to work on the first try, and where a partially correct answer was better than complete failure, I bet your code would look like that too.

In any case, it knows what good code looks like. You can say "take this code and remove spurious comments and prefer narrow exception handling over catch-all", and it'll do just fine (in a way it wouldn't do just fine if your prompt told it to write it that way the first time, writing new code and editing existing code are different tasks).

jerkstate · 3h ago
There are a bunch of stupid behaviors of LLM coding that will be fixed by more awareness pretty soon. Imagine putting the docs and code for all of your libraries into the context window so it can understand what exceptions might be thrown!
maccard · 2h ago
Copilot and the likes have been around for 4 years, and we’ve been hearing this all along. I’m bullish on LLM assistants (not vibe coding) but I’d love to see some of these things actually start to happen.
kenjackson · 2h ago
I feel like it has gotten better over time, but I don't have any metrics to confirm this. And it may also depend on what type of you language/libraries that you use.
tclancy · 2h ago
Well, at least now we know who to blame for the training data :)
brandall10 · 4h ago
It's certainly annoying, but you can try following up with "can you please remove superfluous comments? In particular, if a comment doesn't add anything to the understanding of the code, it doesn't deserve to be there".
diggan · 3h ago
I'm having the same issue, and no matter what I prompt (even stuff like "Don't add any comments at all to anything, at any time") it still tries to add these typical junior-dev comments where it's just re-iterating what the code on the next line does.
tough · 2h ago
you can have a script that drops them all
shawabawa3 · 3h ago
You don't need a follow up

Just end your prompt with "no code comments"

breppp · 2h ago
I always thought these were there to ground the LLM on the task and produce better code, an artifact of the fact that this will autocomplete better based on past tokens. Similarly always thought this is why ChatGPT always starts every reply with repeating exactly what you asked again
rst · 1h ago
Comments describing the organization and intent, perhaps. Comments just saying what a "require ..." line requires, not so much. (I find it will frequently put notes on the change it is making in comments, contrasting it with the previous state of the code; these aren't helpful at all to anyone doing further work on the result, and I wound up trimming a lot of them off by hand.)
puika · 4h ago
I have the same issue plus unnecessary refactorings (that break functionality). it doesn't matter if I write a whole paragraph in the chat or the prompt explaining I don't want it to change anything else apart from what is required to fulfill my very specific request. It will just go rogue and massacre the entirety of the file.
mgw · 4h ago
This has also been my biggest gripe with Gemini 2.5 Pro. While it is fantastic at one-shotting major new features, when wanting to make smaller iterative changes, it always does big refactors at the same time. I haven't found a way to change that behavior through changes in my prompts.

Claude 3.7 Sonnet is much more restrained and does smaller changes.

cryptoz · 3h ago
This exact problem is something I’m hoping to fix with a tool that parses the source to AST and then has the LLM write code to modify the AST (which you then run to get your changes) rather than output code directly.

I’ve started in a narrow niche of python/flask webapps and constrained to that stack for now, but if you’re interested I’ve just opened it for signups: https://codeplusequalsai.com

Would love feedback! Especially if you see promising results in not getting huge refactors out of small change requests!

(Edit: I also blogged about how the AST idea works in case you're just that curious: https://codeplusequalsai.com/static/blog/prompting_llms_to_m...)
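
For the curious, the core mechanic is easy to sketch with Python's standard `ast` module: instead of emitting a text diff, the model emits a NodeTransformer that you run over the parsed source and unparse. The transformer below is hand-written for illustration, not output from the tool.

```python
# Sketch: edit code structurally via the AST instead of via a text diff.
# Requires Python 3.9+ for ast.unparse.
import ast

source = """
def greet(name):
    print("hello", name)
"""

class RenameGreet(ast.NodeTransformer):
    # The kind of transform an LLM might emit: rename one function, keep its body.
    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        if node.name == "greet":
            node.name = "greet_user"
        return self.generic_visit(node)

tree = ast.parse(source)
new_tree = ast.fix_missing_locations(RenameGreet().visit(tree))
print(ast.unparse(new_tree))
```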

HenriNext · 2h ago
Interesting idea. But LLMs are trained on vast amounts of "code as text" and a tiny fraction of "code as AST"; wouldn't that significantly hurt the result quality?
cryptoz · 2h ago
Thanks and yeah that is a concern; however I have been getting quite good results from this AST approach, at least for building medium-complexity webapps. On the other hand though, this wasn't always true...the only OpenAI model that really works well is o3 series. Older models do write AST code but fail to do a good job because of the exact issue you mention, I suspect!
tough · 2h ago
Interesting, i started playing with ts-morph and neo4j to parse TypeScript codebases.

simonw has symbex which could be useful for you for python

jtwaleson · 2h ago
Having the LLM modify the AST seems like a great idea. Constraining an LLM to only generate valid code would be super interesting too. Hope this works out!
nolist_policy · 3h ago
Can't you just commit the relevant parts? The git index is made for this sort of thing.
tasuki · 2h ago
It's not always trivial to find the relevant 5 line change in a diff of 200 lines...
fwip · 1h ago
Really? I haven't tried Gemini 2.5 yet, but my main complaint with Claude 3.7 is this exact behavior - creating 200+ line diffs when I asked it to fix one function.
fkyoureadthedoc · 4h ago
Where/how do you use it? I've only tried this model through GitHub Copilot in VS Code and I haven't experienced much changing of random things.
diggan · 3h ago
I've used it via Google's own AI Studio, via my own library/program using the API, and finally via Aider. All of them lead to the same outcome: large chunks of changes to a lot of unrelated things ("helpful" refactors that I didn't ask for) and tons of unnecessary comments everywhere (like those comments you ask junior devs to stop making). No amount of prompting seems to address either problem.
dherikb · 4h ago
I have the exactly same issue using it with Aider.
bugglebeetle · 3h ago
This is generally controllable with prompting. I usually include something like, “be excessively cautious and conservative in refactoring, only implementing the desired changes” to avoid this.
Workaccount2 · 2h ago
I have a strong sense that the comments are for the model more than the user. It's effectively more thinking in context.
HenriNext · 2h ago
Same experience. Especially the "step" comments about the performed changes are super annoying. Here is my prompt-rule to prevent them:

"5. You must never output any comments about the progress or type of changes of your refactoring or generation. Example: you must NOT add comments like: 'Added dependency' or 'Changed to new style' or worst of all 'Keeping existing implementation'."

Maxatar · 4h ago
Tell it not to write so many comments then. You have a great deal of flexibility in dictating the coding style and can even include that style in your system prompt or upload a coding style document and have Gemini use it.
Trasmatta · 4h ago
Every time I ask an LLM to not write comments, it still litters it with comments. Is Gemini better about that?
nearbuy · 25m ago
Sample size of one, but I just tried it and it worked for me on 2.5 pro. I just ended my prompt with "Do not include any comments whatsoever."
grw_ · 3h ago
No, you can tell it not to write these comments in every prompt and it'll still do it
sitkack · 4h ago
LLMs are extremely poor at following negative instructions, tell them what to do, not what not to do.
diggan · 3h ago
Ok, so saying "Implement feature X" leads to a ton of comments. How do you rewrite that prompt so it doesn't include "don't write comments" while making the output not contain comments? "Write only source code, no plain text with special characters in the beginning of the line", or what are you suggesting here in practical terms?
sroussey · 3h ago
“Constrain all comments to a single block at the top of the file. Be concise.”

Or something similar that does not rely on negation.

diggan · 1h ago
But I want no comments whatsoever, not one huge block of comments at the top of the file. How'd I get that without negation?

Besides, other models seems to handle negation correctly, not sure why it's so difficult for the Gemini family of models to understand.

sitkack · 3h ago
I also include something about "Target the comments towards a staff engineer that favors concise comments that focus on the why, and only for code that might cause confusion."

I also try and get it to channel that energy into the doc strings, so it isn't buried in the source.

staticman2 · 3h ago
This is sort of LLM specific. For some tasks you might include the word "comment" but give the instruction at both the beginning and end of the prompt. This is very model dependent. Like:

Refactor this. Do not write any comments.

<code to refactor>

As a reminder, your task is to refactor the above code and do not write any comments.

diggan · 1h ago
> Do not write any comments. [...] do not write any comments.

Literally both of those are negations.

FireBeyond · 3h ago
"Implement feature X, and as you do, insert only minimal and absolutely necessary comments that explain why something is being done, not what is being done."
sitkack · 3h ago
You would say "omit the how". That word has negation built in.
dheera · 4h ago
I usually ask ChatGPT to "comment the shit out of this" for everything it writes. I find it vastly helps future LLM conversations pick up all of the context and why various pieces of code are there.

If it is ingesting data, there should also be a sample of the data in a comment.

Semaphor · 2h ago
2.5 was the most impressive model I use, but I agree about the comments. And when refactoring some code it wrote before, it just adds more comments, it becomes like archaeological history (disclaimer: I don’t use it for work, but to see what it can do, so I try to intervene as little as possible, and get it to refactor what it thinks it should)
Scene_Cast2 · 4h ago
It also does super defensive coding. Not that it's a bad thing in general, but I write a lot of prototype code.
prpl · 4h ago
Production quality code is defensive. Probably trained on a lot of google code.
Tainnor · 2h ago
Depends on what you mean by "defensive". Anticipating error and non-happy-path cases and handling them is definitely good. Also fault tolerance, i.e. allowing parts of the application to fail without bringing down everything.

But I've heard "defensive code" used for the kind of code where almost every method validates its input parameters, wraps everything in a try-catch, returns nonsensical default values in failure scenarios, etc. This is a complete waste because the caller won't know what to do with the failed validations or thrown errors, and it's just unnecessary bloat that obfuscates the business logic. Validation, error handling and so on should be done in specific parts of the codebase (bonus points if you can encode the successful validation or the presence/absence of errors in the type system).
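
A small Python sketch of that distinction, with illustrative names: validate once at the boundary and hand the rest of the code a value that is valid by construction, instead of re-checking and try/excepting in every function.

```python
# Sketch: encode "already validated" in a type so inner code needs no
# defensive checks. Port and parse_port are illustrative names.
from dataclasses import dataclass

@dataclass(frozen=True)
class Port:
    value: int

def parse_port(raw: str) -> Port:
    # The one place that validates; it raises a meaningful error for the caller.
    n = int(raw)
    if not 1 <= n <= 65535:
        raise ValueError(f"port out of range: {n}")
    return Port(n)

def connect(port: Port) -> None:
    # No re-validation, no blanket try/except: a Port is valid by construction.
    print(f"connecting on port {port.value}")

connect(parse_port("8080"))
```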

neilellis · 2h ago
this!

lots of hasattr("") rubbish, I've increased the amount of prompting but it still does this - basically it defers its lack of compile-time knowledge to runtime: 'let's hope for the best, and see what happens!'

Trying to teach it FAIL FAST is an uphill struggle.

Oh and yes, returning mock objects if something goes wrong is a favourite.

It truly is an Idiot Savant - but still amazingly productive.

montebicyclelo · 3h ago
Does the code consist of many large try/except blocks that catch "Exception"? Gemini seems to like doing that (I thought it was bad practice to catch the generic Exception in Python).
hnuser123456 · 3h ago
Catching the generic exception is a nice middle ground between not catching exceptions at all (and letting your script crash), and catching every conceivable exception individually and deciding exactly how to handle each one. Depends on how reliable you need your code to be.
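
A tiny illustration of that trade-off, with a hypothetical `load_config` helper: catch only the failures you expect and can recover from, versus a catch-all that never crashes but also swallows real bugs.

```python
# Sketch: narrow exception handling vs. a blanket catch-all.
import json

def load_config(path: str) -> dict:
    # Narrow: handle only the failures we anticipate and can act on.
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}  # missing config is fine, fall back to defaults
    except json.JSONDecodeError as e:
        raise SystemExit(f"config {path} is not valid JSON: {e}")

def load_config_blanket(path: str) -> dict:
    # Broad: never raises, but silently hides permission errors, typos,
    # and genuine bugs behind the same empty default.
    try:
        with open(path) as f:
            return json.load(f)
    except Exception:
        return {}
```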
taf2 · 4h ago
I really liked the Gemini 2.5 Pro model when it was first released - the upload code folder was very nice (but they removed it). The annoying thing I find with the model is that it does a really bad job of formatting the code it generates... I know I can use a code formatting tool, and I do when I use Gemini output, but otherwise I find Grok much easier to work with and it yields better results.
throwup238 · 1h ago
> I really liked the Gemini 2.5 pro model when it was first released - the upload code folder was very nice (but they removed it).

Removed from where? I use the attach code folder feature every day from the Gemini web app (with a script that clones a local repo that deletes .git and anything matching a gitignore pattern).

sureIy · 3h ago
My custom default Claude prompt asks it to never explain code unless specifically asked to. Also to produce modern and compact code. It's a beauty to see. You ask for code and you get code, nothing else.
freddydumont · 1h ago
That’s been my experience as well. It’s especially jarring when asking for a refactor as it will leave a bunch of WIP-style comments highlighting the difference with the previous approach.
Hikikomori · 1h ago
So many comments, more verbose code and will refactor stuff on its own. Still better than chatgpt, but I just want a small amount of code that does what I asked for so I can read through it quickly.
energy123 · 4h ago
It probably increases scores in the RL training since it's a kind of locally specific reasoning that would reduce bugs.

Which means if you try to force it to stop, the code quality will drop.

AuthConnectFail · 29m ago
you can ask it to remove them, it does a pretty good job at it
guestbest · 4h ago
What kind of problems are you putting in where that is the solution? Just curious.
benbristow · 3h ago
You can ask it to remove the comments afterwards, and it'll do a decent job of it, but yeah, it's a pain.
asadm · 3h ago
you need to do a 2nd step as a post-process to erase the comments.

Models use comments to think, asking to remove will affect code quality.

merksittich · 3h ago
My favourites are comments such as: from openai import DefaultHttpxClient # Import the httpx client
kurtis_reed · 2h ago
> all the gang

What does that mean?

bugglebeetle · 3h ago
It’s annoying, but I’ve done extensive work with this model and leaving the comments in for the first few iterations produced better outcomes. I expect this is baked into the RL they’re doing, but because of the context size, it’s not really an issue. You can just ask it to strip out in the final pass.
dyauspitr · 4h ago
Just ask it for fewer comments, it’s not rocket science.
GaggiX · 4h ago
You can ask to not use comments or use less comments, you can put this in the system prompt too.
ChadMoran · 4h ago
I've tried this, aggressively and it still does it for me. I gave up.
koakuma-chan · 3h ago
Have you tried threats?
throwup238 · 1h ago
It strips the comments from the code or else it gets the hose again.
ziml77 · 4h ago
I tried this as well. I'm interfacing with Gemini 2.5 using Cursor and I have rules to to limit the comments. It still ends up over-commenting.
shawabawa3 · 3h ago
I have a feeling this may be a Cursor issue, perhaps Cursor's system prompt asks for comments? Asking for code in the AI Studio UI and ending the prompt with "no code comments" has always worked for me
blensor · 4h ago
Maybe too many comments could be a good metric to check if someone just yolo accepted the result or if they actually checked if it's correct.

I don't have problems with getting lots of comments in the output, I just delete them while reading what it did

tough · 2h ago
Another great tell of code reviewers yolo'ing it is that LLMs usually put the full filename path in the output, so if you see a file with the filename/path on the first line, that's probably LLM output
tucnak · 4h ago
Ask it to do less of it, problem solved, no? With tools like Cursor it's become really easy to fit the models to the shoe, or the shoe to the foot.
mrinterweb · 3h ago
If you don't want so many comments, have you tried asking the AI for fewer comments? Seems like something a little prompt engineering could solve.
cchance · 4h ago
And comments are bad? I mean, you could tell it to not comment the code or to self-document with naming instead of inline comments. It's an LLM, it does what you tell it to


andy12_ · 4h ago
Interestingly, when comparing benchmarks of Experimental 03-25 [1] and Experimental 05-06 [2], it seems the new version scores slightly lower in everything except on LiveCodeBench.

[1] https://storage.googleapis.com/model-cards/documents/gemini-... [2] https://deepmind.google/technologies/gemini/

merksittich · 3h ago
According to the article, "[t]he previous iteration (03-25) now points to the most recent version (05-06)." I assume this applies to both the free tier gemini-2.5-pro-exp-03-25 in the API (which will be used for training) and the paid tier gemini-2.5-pro-preview-03-25.

Fair enough, one could say, as these were all labeled as preview or experimental. Still, considering that the new model is slightly worse across the board in benchmarks (except for LiveCodeBench), it would have been nice to have the option to stick with the older version. Not everyone is using these models for coding.

zurfer · 1h ago
Just switching a pinned version (even alpha, beta, experimental, preview) to another model doesn't feel right.

I get it, chips are scarce and they want their capacity back, but it breaks trust with developers to just downgrade your model.

Call it gemini-latest and I understand that things will change. Call it *-03-25 and I want the same model that I got on 25th March.

arnaudsm · 3h ago
This should be the top comment. Cherry-picking is hurting this industry.

I bet they kept training on coding tasks, made everything worse on the way, and tried to sweep it under the rug because of the sunk costs.

luckydata · 3h ago
Or because they realized that coding is what most of those LLMs are used for anyways?
arnaudsm · 3h ago
They should have shown the benchmarks. Or marketed it as a coding model, like Qwen & Mistral.
jjani · 3h ago
That's clearly not a PR angle they could possibly take when it's replacing the overall SotA model. This is a business decision, potentially inference cost related.
arnaudsm · 2h ago
From a business pov it's a great move, for the customers it's evil to hide evidence that your product became worse.
nopinsight · 2h ago
Livebench.ai actually suggests the new version is better on most things.

https://livebench.ai/#/

jjani · 3h ago
Sounds like they were losing so much money on 2.5-Pro they came up with a forced update that made it cheaper to run. They can't come out with "we've made it worse across the board", nor do they want to be the first to actually raise prices, so instead they made a bit of a distill that's slightly better at coding so they can still spin it positively.
sauwan · 3h ago
I'd be surprised if this was a new base model. It sounds like they just did some post-training RL tuning to make this version specifically stronger for coding, at the expense of other priorities.
jjani · 2h ago
Every frontier model now is a distill of a larger unpublished model. This could be a slightly smaller distill, with potentially the extra tuning you're mentioning.
tangjurine · 1h ago
Any info on this?
cubefox · 2h ago
That's an unsubstantiated claim. I doubt this is true, since people are disproportionately more willing to pay for the best of the best, rather than for something worse.
Workaccount2 · 2h ago
Google doesn't pay the nvidia tax. Their TPUs are designed for Gemini and Gemini designed for their TPUs. Google is no doubt paying far less per token than every other AI house.
laborcontract · 4h ago
My guess is that they've done a lot of tuning to improve diff based code editing. Gemini 2.5 is fantastic at agentic work, but it still is pretty rough around the edges in terms of generating perfectly matching diffs to edit code. It's probably one of the very few issues with the model. Luckily, aider tracks this.

They measure the old gemini 2.5 generating proper diffs 92% of the time. I bet this goes up to ~95-98% https://aider.chat/docs/leaderboards/

Question for the google peeps who monitor these threads: Is gemini-2.5-pro-exp (free tier) updated as well, or will it go away?

Also, in the blog post, it says:

  > The previous iteration (03-25) now points to the most recent version (05-06), so no action is required to use the improved model, and it continues to be available at the same price.
Does this mean gemini-2.5-pro-preview-03-25 now uses 05-06? Does the same apply to gemini-2.5-pro-exp-03-25?

update: I just tried updating the date in the exp model (gemini-2.5-pro-exp-05-06) and that doesn't work.

laborcontract · 1h ago
Update 2: I've been using this model in both Aider and Cline and I haven't gotten a diff matching error yet, even with some pretty difficult substitutions across different places in multiple files. The overall feel of this model is nice.

I don't have a formal benchmark but there's a notable improvement in code generation due to this alone.

I've had Gemini chug away on plans that have taken ~1 hour to implement (~80 million tokens spent). A good portion of that energy was spent fixing mistakes made by cline/aider/roo due to search/replace errors. If this model gets anywhere close to 100% on diffs then this is a BFD. I estimate this will translate to a 50-75% productivity boost on long context coding tasks. I hope the initial results I'm seeing hold up!

I'm surprised by the reaction in the rest of the thread. A lot of unproductive complaining, a lot of off-topic stuff, nothing talking about the model itself.

Any thoughts from anyone else using the updated model?

okdood64 · 2h ago
What do you mean by agentic work in this context?
laborcontract · 2h ago
Knowing when to call functions, generating the proper function calling text structure, properly executing functions in sequence, knowing when it's completed its objective, and doing that over an extended context window.
mohsen1 · 4h ago
I use Gemini for almost everything. But their model card[1] only compares to o3-mini! In known benchmarks o3 is still ahead:

        +------------------------------+---------+--------------+
        |         Benchmark            |   o3    | Gemini 2.5   |
        |                              |         |    Pro       |
        +------------------------------+---------+--------------+
        | ARC-AGI (High Compute)       |  87.5%  |     —        |
        | GPQA Diamond (Science)       |  87.7%  |   84.0%      |
        | AIME 2024 (Math)             |  96.7%  |   92.0%      |
        | SWE-bench Verified (Coding)  |  71.7%  |   63.8%      |
        | Codeforces Elo Rating        |  2727   |     —        |
        | MMMU (Visual Reasoning)      |  82.9%  |   81.7%      |
        | MathVista (Visual Math)      |  86.8%  |     —        |
        | Humanity’s Last Exam         |  26.6%  |   18.8%      |
        +------------------------------+---------+--------------+
[1] https://storage.googleapis.com/model-cards/documents/gemini-...
jsnell · 2h ago
The text in the model card says the results are from March (including the Gemini 2.5 Pro results), and o3 wasn't released yet.

Is this maybe not the updated card, even though the blog post claims there is one? Sure, the timestamp is in late April, but I seem to remember that the first model card for 2.5 Pro was only released in the last couple of weeks.

cbg0 · 54m ago
o3 is $40/M output tokens and 2.5 Pro is $10-15/M output tokens, so o3 being slightly ahead is not really worth paying 4 times more than Gemini.
jorl17 · 32m ago
Also, o3 is insanely slow compared to Gemini 2.5 Pro
franze · 14m ago
I like it. I threw some random concepts (Neon, LSD, Falling, Elite, Shooter, Escher + Mobile Game + SPA) at it and this is what it came up with after a few (5x) roundtrips.

https://show.franzai.com/a/star-zero-huge?nobuttons

herpdyderp · 4h ago
I agree it's very good but the UI is still usually an unusable, scroll-jacking disaster. I've found it's best to let a chat sit for a few minutes after it has finished printing the AI's output. Finding the `ms-code-block` element in dev tools and logging `$0.textContent` is reliable too.
OsrsNeedsf2P · 4h ago
Loading the UI on mobile while on low bandwidth is also a non-starter. It simply doesn't work.
uh_uh · 4h ago
Noticed this too. There's something funny about billion dollar models being handicapped by stuck buttons.
energy123 · 4h ago
The Gemini app has a number of severe bugs that impacts everyone who uses it, and those bugs have persisted for over 6 months.

There's something seriously dysfunctional and incompetent about the team that built that web app. What a way to waste the best LLM in the world.

kubb · 3h ago
It's the company. Letting incompetent people who are vocal rise to the top is a part of Google's culture, and the internal performance review process discourages excellence - doing the thousand small improvements that makes a product truly great is invisible to it, so nobody does it.

Software that people truly love is impossible to build in there.

arnaudsm · 3h ago
Be careful, this model is worse than 03-25 in 10 of the 12 benchmarks (!)

I bet they kept training on coding, made everything worse on the way, and tried to sweep it under the rug because of the sunk costs.

jstummbillig · 2h ago
It seems that trying to build llms is the definition of accepting sunk cost.
ionwake · 4h ago
Is it possible to use this with Cursor? If so, what is the name of the model? gemini-2.5-pro-preview ?

edit> It's gemini-2.5-pro-preview-05-06

edit> Cursor says it doesn't have "good support" yet, but I'm not sure if this is a default message when it doesn't recognise a model? Is this a big deal? Should I wait until it's officially supported by Cursor?

Just trying to save time here for everyone - anyone know the answer?

bn-l · 1h ago
The one with exp in the name is free (you may have to add it yourself) but they train on you. And after a certain limit it becomes paid.
androng · 2h ago
At the bottom of the article it says no action is required and the Gemini-2.5-pro-preview-03-25 now points to the new model
tough · 2h ago
The Cursor UI sucks: it tells me to use -auto mode- to be faster, but Gemini 2.5 is way faster than any of the other free models, so just selecting that one is faster even if the UI says otherwise
killerstorm · 4h ago
Why can't they just use version numbers instead of this "new preview" stuff?

E.g. call it Gemini Pro 2.5.1.

lukeschlather · 3h ago
I take preview to mean the model may be retired on an accelerated timescale and replaced with a "real" model so it's dangerous to put into prod unless you are paying attention.
lolinder · 2h ago
They could still use version numbers for that. 2.5.1-preview becomes 2.5.1 when stable.
danenania · 2h ago
Scheduled tasks in ChatGPT are useful for keeping track of these kinds of things. You can have it check daily whether there's a change in status, price, etc. for a particular model (or set of models).
cdolan · 1h ago
I appreciate that you are trying to help

But I do not want to have to build a network of bots with non-deterministic outputs to simply stay on top of versions

danenania · 1h ago
Neither do I, but it's the best solution I've found so far. It beats checking models/prices manually every day to see if anything has changed, and it works well enough in practice.

But yeah, some kind of deterministic way to get alerts would be better.

mhh__ · 3h ago
Are you saying you find model names like o4-mini-high-pro-experimental-version5 confusing and stupid?
siwakotisaurav · 4h ago
Usually don’t believe the benchmarks but first in web dev arena specifically is crazy. That one has been Claude for so long, which tracks in my experience
hersko · 4h ago
Give Gemini a shot. It is genuinely very good.
enraged_camel · 3h ago
I'm wondering when Claude 4 will drop. It's long overdue.
Etheryte · 4m ago
For me, Claude 3.7 was a noticeable step down across a wide range of tasks when compared to 3.5 with the same prompt. Benchmarks are one thing, but for real life use, I kept finding myself switching back to 3.5. Wouldn't be surprised if they were trying to figure out what happened there and how to prevent that in the next version.
danielbln · 2h ago
I was a little disappointed when the last thing coming out of Anthropic was their MAX pricing plan instead of a better model...
djrj477dhsnv · 4h ago
I don't understand what I'm doing wrong.. it seems like everyone is saying Gemini is better, but I've compared dozens of examples from my work, and Grok has always produced better results.
athoun · 3h ago
I agree, from my experience Grok gives superior coding results, especially when modifying large sections of the codebase at once such as in refactoring.

Although it’s not for coding, I have noticed Gemini 2.5 pro Deep Research has surpassed Grok’s DeepSearch in thoroughness and research quality however.

redox99 · 3h ago
I haven't tested this release yet, but I found Gemini to be overrated before.

My choice of LLMs was

Coding in cursor: Claude

General questions: Grok, if it fails then Gemini

Deep Research: Gemini (I don't have GPT plus, I heard it's better)

dyauspitr · 3h ago
Anecdotally grok has been the worst of the bunch for me.
mliker · 4h ago
The "video to learning app" feature is a cool concept (see it in AI Studio). I just passed in two separate Stanford lectures to see if it could come up with an interesting interactive app. The apps it generated weren't too useful, but I can see with more focus and development, it'd be a game changer for education.
SparkyMcUnicorn · 1h ago
Anyone know of any coding agents that support video inputs?

Web chat interfaces are great, but copy/paste gets old fast.

qwertox · 1h ago
I have my issues with the code Gemini Pro in AI Studio generates without customized "System Instructions".

It turns a readable 5-line code snippet into a 30-line snippet full of comments and mostly unnecessary error handling, code which becomes harder to reason about.
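A system instruction along these lines reins it in for me. This is just a rough sketch using the python-genai SDK; the API key, prompt, and instruction text are placeholders, not anything Google prescribes:

```python
# Rough sketch with the python-genai SDK; key, model name and instruction are placeholders.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.5-pro-preview-05-06",
    contents="Write a Python function that parses an ISO 8601 timestamp.",
    config=types.GenerateContentConfig(
        # Ask the model to skip boilerplate comments and defensive error handling.
        system_instruction=(
            "Keep code minimal: no explanatory comments and no error handling "
            "unless explicitly requested."
        ),
    ),
)
print(response.text)
```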

But for sysadmin tasks, like dealing with ZFS and LVM, it is absolutely incredible.

bn-l · 1h ago
I’ve found the same thing. I don’t use it for code any more because it produces highly verbose and inefficient code that may work but is ugly and subtly brittle.
m_kos · 2h ago
[Tangent] Anyone here using 2.5 Pro in Gemini Advanced? I have been experiencing a ton of bugs, e.g.,:

- [codes] showing up instead of references,

- raw search tool output sliding across the screen,

- Gemini continuously answering questions asked two or more messages before but ignoring the most recent one (you need to ask Gemini an unrelated question for it to snap out of this bug for a few minutes),

- weird messages including text irrelevant to any of my chats with Gemini, like baseball,

- confusing its own replies with mine,

- not being able to run its own Python code due to some unsolvable formatting issue,

- timeouts, and more.

Dardalus · 27m ago
The Gemini app is absolute dog doo... use it through AI studio. Google ought to shut down the entire Gemini app.
xnx · 4h ago
This is much bigger news than OpenAI's acquisition of WindSurf.
EliasWatson · 4h ago
I wonder how the latest version of Grok 3 would stack up to Gemini 2.5 Pro on the web dev arena leaderboard. They are still just showing the original early access model for some reason, despite there being API access to the latest model. I've been using Grok 3 with Aider Chat and have been very impressed with it. I get $150 of free API credits every month by allowing them to train on my data, which I'm fine with since I'm just working on personal side projects. Gemini 2.5 Pro and Claude 3.7 might be a little better than Grok 3, but I can't justify the cost when Grok doesn't cost me a penny to use.
ramoz · 3h ago
Never sleep on Google.
crat3r · 4h ago
So, are people using these tools without the org they work for knowing? The amount of hoops I would have to jump through to get either of the smaller companies I have worked for since the AI boom to let me use a tool like this would make it absolutely not worth the effort.

I'm assuming large companies are mandating it, but ultimately the work that these LLMs seem poised for would benefit smaller companies most and I don't think they can really afford using them? Are people here paying for a personal subscription and then linking it to their work machines?

tasuki · 2h ago
> The amount of hoops I would have to jump through to get either of the smaller companies I have worked for since the AI boom to let me use a tool like this would make it absolutely not worth the effort.

Define "smaller"? In small companies, say 10 people, there are no hoops. That is the whole point of small companies!

codebolt · 4h ago
If you can get them to approve GitHub Copilot Business then Gemini Pro 2.5 and many others are available there. They have guarantees that they don't share/store prompts or code and the parent company is Microsoft. If you can argue that they will save money (on saved developer time), what would be their argument against?
otabdeveloper4 · 37m ago
> They have guarantees that they don't share/store prompts or code

"They trust me. Dumb ..."

bongodongobob · 4h ago
I work for a large company and everything other than MS Copilot is blocked aggressively at the DNS/cert level. Tried Deepseek when it came out and they already had it blocked. All .ai TLDs are blocked as well. If you're not in tech, there is a lot of "security" fear around AI.
jeffbee · 4h ago
Not every coding task is something you want to check into your repo. I have mostly used Gemini to generate random crud. For example I had a huge JSON representation of a graph, and I wanted the graph modified in a given way, and I wanted it printed out on my terminal in color. None of which I was remotely interested in writing, so I let a robot do it and it was fine.
crat3r · 4h ago
Fair, but I am seeing so much talk about how it is completing actual SDE tickets. Maybe not this model specifically, but to be honest I don't care about generating dummy data, I care about the claims that these newer models are on par with junior engineers.

Junior engineers will complete a task to update an API, or fix a bug on the front-end, within a couple of days with, let's say, 80 percent certainty they hit the mark (maybe an inflated metric). How are people comparing the output of these models to that of a junior engineer if they generally just say "Here is some of my code, what's wrong with it?" That certainly isn't taking a real ticket and completing it in any capacity.

I am obviously very skeptical but mostly I want to try one of these models myself but in reality I think that my higher-ups would think that they introduce both risk AND the potential for major slacking off haha.

jpc0 · 1h ago
I don’t know about tickets but my org definitely happily pays for Gemini Advanced and encourages it’s use and would be considered a small org.

The latest SOTA models are definitely at the point where they can absolutely improve workflows and not get in your way too much.

I treat it a lot like an intern: "Here's an API doc and spec, write me the boilerplate and a general idea about implementation."

Then I go in, review, rip out crud and add what I need.

It almost always gets architecture wrong, don’t expect that from it. However small functions and such is great.

When it comes to refactoring, ask it for suggestions: eat the meat, leave the bones.

thevillagechief · 4h ago
I've been switching between this and GPT-4o at work, and Gemini is really verbose. But I've been primarily using it. I'm confused though, the model available in copilot says Gemini 2.5 Pro (Preview), and I've had it for a few weeks. This was just released today. Is this an updated preview? If so, the blog/naming is confusing.
childintime · 3h ago
How does it perform on anything but Python and JavaScript? In my experience my mileage varied a lot when using C#, for example, or Zig, so I've learnt to just let it select the language it wants.

Also, why doesn't Ctrl+C work??

scbenet · 2h ago
It's very good at Go, which makes sense because I'm assuming it's trained on a lot of Google's code
simianwords · 1h ago
How would they train it on google code without revealing internal IP?
gitroom · 4h ago
man that endless commenting seriously kills my flow - gotta say, even after all the prompts and hacks, still can't get these models to chill out. you think we'll ever get ai to stop overdoing it and actually fit real developer habits or is it always gonna be like this?
CSMastermind · 4h ago
Hasn't Gemini 2.5 Pro been out for a while?

At first I was very impressed with its coding abilities, switching off of Claude for it, but recently I've been using GPT o3, which I find is much more concise and generally better at problem solving when you hit an error.

spaceman_2020 · 4h ago
Think that was still the experimental model incorrectly labeled by many platforms as “Pro”
85392_school · 3h ago
That's inaccurate. First, there was the experimental 03-25 checkpoint. Then it was promoted to Preview without changing anything. And now we have a new 05-06 checkpoint, still called Gemini 2.5 Pro, and still in Preview.
oellegaard · 4h ago
Is there anything like Claude code for other models such as gemini?
vunderba · 2h ago
Haven't tried it yet, but I've heard good things about Plandex.

https://github.com/plandex-ai/plandex

mickeyp · 4h ago
I'm literally working on this particular problem. Locally-run server; browser-based interface instead of TUI/CLI; connects to all the major model APIs; many, many quality of life and feature improvements over other tools that hook into your browser.

Drop me a line (see profile) if you're interested in beta testing it when it's out.

oellegaard · 4h ago
I'm actually very happy with everything in Claude code, eg the CLI so im really just curious to try other models
Filligree · 3h ago
I find that 2.5 Pro has a higher ceiling of understanding, while Claude writes more maintainable code with better comments. If we want to combine them... well, it should be easier to fix 2.5 than Claude. That said, neither is there yet.

Currently Claude Code is a big value-add for Claude. Google has nothing equivalent; aider requires far more manual work.

revicon · 4h ago
Same! I prefer the CLI, way easier when I’m connected via ssh from another network somewhere.
mickeyp · 3h ago
The CLI definitely has its advantages!

But with my app: you can install the host anywhere and connect to it securely (via SSH forwarding or private VPN or what have you) so that workflow definitely still works!

elliot07 · 3h ago
OpenAI has an equivalent called Codex. It's lacking a few features like MCP right now and the TUI isn't there yet, but interestingly they are building a Rust version (it's all open source) that seems to include MCP support and looks significantly higher quality. I'd bet within the next few weeks there will be a high quality Claude Code alternative.
martythemaniak · 3h ago
Goose by Block (Square/CashApp) is like an open-source Claude Code that works with any remote or local LLM.

https://github.com/block/goose

alphabettsy · 3h ago
Aider
danielbln · 2h ago
Aider wasn't all that agentic last time I tried it, has that changed?
martinald · 4h ago
I'm totally lost again! If I use Gemini on the website (gemini.google.com), am I using 2.5 Pro IO edition, or am I using the old one?
koakuma-chan · 3h ago
martinald · 2h ago
I get this in AI studio, but does it apply to gemini.google.com?
disgruntledphd2 · 4h ago
Check the dropdown in the top left (on my screen,at least).
martinald · 2h ago
Are you referring to gemini.google.com or ai studio? I see 2.5 Pro but is this the right one? I saw a tweet from them saying you have to select Canvas first? I'm so so lost.
pzo · 3h ago
"The previous iteration (03-25) now points to the most recent version (05-06), so no action is required to use the improved model"
mvdtnz · 54m ago
I truly do not understand how people are getting worthwhile results from Gemini 2.5 Pro. I have used all of the major models for lots of different programming tasks and I have never once had Gemini produce something useful. It's not just wrong, it's laughably bad. And people are making claims that it's the best. I just... don't... get it.
nashashmi · 3h ago
I keep hearing good things about Gemini online and offline. I wrote them off as terrible when they first launched and have not looked back since.

How are they now? Sufficiently good? Competent? Competitive? Or limited? My needs are very consumer oriented, not programming/api stuff.

danielbln · 2h ago
Bard sucked, Gemini sucked, Gemini 2 was alright, 2.5 is awesome and my main driver for coding these days.
thevillagechief · 1h ago
The Gemini deep research is a revelation. I obsessively research most things I buy, from home appliances to gym equipment. It has literally saved untold hours of comparisons. You get detailed reports generated from every website, including YouTube reviews. I've bought a bunch of stuff on its recommendation.
hmate9 · 3h ago
Probably the best one right now, their deep research is also very good.
obsolete_wagie · 2h ago
o3 is so far ahead of Anthropic and Google, these models aren't even worth using
mattlondon · 45m ago
The benchmarks (1) seem to suggest that o3 is in 3rd place after Gemini 2.5 Pro Preview and Gemini 2.5 Pro Exp for text reasoning (o3 is 4th for webdev). o3 doesn't even appear on the OpenRouter leaderboards (2), suggesting it is hardly used (if at all) by anyone using LLMs to actually do anything (such as coding), which makes one question whether it is actually any good at all (otherwise, if it were so great, I'd expect to see heavy usage).

Not sure where your data is coming from but everything else is pointing to Google supremacy in AI right now. I look forward to some new models from Anthropic, xAi, Meta et al (remains to be seen if OpenAI has anything left apart from bluster). Exciting times.

1 - https://beta.lmarena.ai/leaderboard

2 - https://openrouter.ai/rankings

obsolete_wagie · 2m ago
you just aren't using the models to their full capacity if you think this; benchmarks have all been hacked
cellis · 29m ago
8x the cost for maybe 5% improvement?
Workaccount2 · 2h ago
o3 is expensive in the API and intentionally crippled in the web app.
Squarex · 55m ago
source?
obsolete_wagie · 8m ago
use the models daily, it's not even close
ionwake · 2h ago
Can someone tell me if windsurf is better than cursor? ( pref someone who has used both for a few days? )
kurtis_reed · 2h ago
Relevance?
ionwake · 1h ago
It's what literally every HN coder is using to program with these models, such as Gemini. Where have you been, brother?
brap · 4h ago
Gemini is now ranked #1 across every category in lmarena.
aoeusnth1 · 2h ago
LMArena is a joke, though
panarchy · 3h ago
Is it just me that finds that while Gemini 2.5 is able to generate a lot of code, the end results are usually lackluster compared to Claude and even ChatGPT? I also find it hard-headed; it frequently does things in ways I explicitly told it not to. The massive context window is pretty great though and enables me to do things I can't with the others, so it still gets used a lot.
scrlk · 3h ago
How are you using it?

I find that I get the best results from 2.5 Pro via Google AI Studio with a low temperature (0.2-0.3).
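In case it helps, setting the temperature through the API works the same way as the AI Studio slider. A minimal sketch assuming the python-genai SDK; the key, model name, and prompt are placeholders:

```python
# Minimal sketch with the python-genai SDK; API key, model and prompt are placeholders.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.5-pro-preview-05-06",
    contents="Refactor this function to remove the duplicated branches: ...",
    # A low temperature (0.2-0.3) keeps output more deterministic for coding tasks.
    config=types.GenerateContentConfig(temperature=0.2),
)
print(response.text)
```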

panarchy · 2h ago
AI Studio as well, but I haven't played around with the temperature too much and even then I only lowered it to like 0.8 a few times. So I'll have to try this out. Thanks.
llm_nerd · 4h ago
Their nomenclature is a bit confused. The Gemini web app has a 2.5 Pro (experimental), yet this apparently is referring to 2.5 Pro Preview 05-06.

Would be ideal if they incremented the version number or the like.

jeswin · 4h ago
Now if only there were a way to add prepaid credits and monitor usage in near real-time on a dashboard, like every other vendor. Hey Google, are you listening?
Hawkenfall · 4h ago
You can do this with https://openrouter.ai/
pzo · 3h ago
But if you want to use the Google SDK (python-genai, js-genai) rather than the OpenAI SDK (I found the Google API more feature-rich when using different modalities like audio/images/video), you cannot use OpenRouter. Also, if you are developing an app and need higher rate limits, what's the typical rate limit via OpenRouter?
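For example, attaching an image directly looks roughly like this with python-genai. A sketch only; the file name, key, and model are placeholders:

```python
# Rough sketch with the python-genai SDK; file name, key and model are placeholders.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("screenshot.png", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-pro-preview-05-06",
    contents=[
        # Pass raw image bytes alongside the text prompt.
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "Describe what is shown in this screenshot.",
    ],
)
print(response.text)
```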
pzo · 3h ago
Also, for some reason, when I tested a simple prompt (a few words, no system prompt) with one attached image, OpenRouter charged me ~1700 tokens, whereas going directly via python-genai it's more like ~400 tokens. Also keep in mind they charge a small markup fee when you top up your account.
simple10 · 3h ago
You can do this with LLM proxies like LiteLLM. e.g. Cursor -> LiteLLM -> LLM provider API.

I have a LiteLLM server running locally with Langfuse to view traces. You configure LiteLLM to connect directly to the providers' APIs. This has the added benefit of being able to create LiteLLM API keys per project that proxy to different sets of provider API keys, to monitor or cap billing usage.

I use https://github.com/LLemonStack/llemonstack/ to spin up local instances of LiteLLM and Langfuse.
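For anyone curious, clients talk to the proxy through its OpenAI-compatible endpoint. A sketch only, assuming LiteLLM's default port (4000) and a per-project virtual key you've generated yourself; the key and model alias are placeholders:

```python
# Sketch only: assumes a LiteLLM proxy on localhost:4000 (its default port)
# and a per-project virtual key generated via the proxy. Names are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",      # the local LiteLLM proxy, not the provider
    api_key="sk-my-project-virtual-key",   # hypothetical per-project key
)

resp = client.chat.completions.create(
    # The model name must match an alias defined in the proxy's config.
    model="gemini-2.5-pro-preview-05-06",
    messages=[{"role": "user", "content": "Summarize the latest deploy logs."}],
)
print(resp.choices[0].message.content)
```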

greenavocado · 4h ago
You can do that by using deepinfra to manage your billing. It's pay-as-you-go and they have a pass-through virtual target for Google Gemini.

Deepinfra token usage updates every time you switch to the tab, if it is open to the usage page, so it is possible to see updates as often as every second.

therealmarv · 4h ago
Is this on Google AI Studio or Google Vertex or both?
slig · 4h ago
In the meantime, I'm using OpenRouter.
tucnak · 4h ago
You need LLM Ops. YC happens to have invested in Langfuse; if you're serious about tracking metrics, you'll appreciate the rest, too.

And before you ask: yes, for cached content and batch completion discounts you can accommodate both—just needs a bit of logic in your completion-layer code.

cchance · 4h ago
OpenRouter. I don't think anyone should use Google directly till they fix their shit billing.
greenavocado · 4h ago
Even afterwards. Avoid paying directly if you can because they generally could not care less about individuals.

If you have less than $10 million in spend, you will be treated worse than cattle, because at least farmers feed their cattle before they are milked.

xbmcuser · 4h ago
As a non-programmer, I have been really loving Gemini 2.5 Pro for my Python scripting for manipulating text and Excel files and for web scraping. In the past I was able to use ChatGPT to code some of the things that I wanted, but with Gemini 2.5 Pro it has been just another level. If they improve it further, that would be amazing.
ramesh31 · 4h ago
>Best-in-class frontend web development

It really is wild to have seen this happen over the last year. The days of traditional "design-to-code" FE work are completely over. I haven't written a line of HTML/CSS in months. If you are still doing this stuff by hand, you need to adapt fast. In conjunction with an agentic coding IDE and a few MCP tools, weeks' worth of UI work are now done in hours to a higher level of quality and consistency with practically zero effort.

kweingar · 3h ago
If it's zero effort, then why do devs need to adapt fast? And wouldn't adapting be incredibly easy?

The only disadvantage to not using these tools would be that your current output is slower. As soon as your employer asks for more or you're looking for a new job, you can just turn on AI and be as fast as everyone who already uses it.

jaccola · 1h ago
Yup, I see comments like the parent all of the time and they are always a head scratcher. They would be far more rational (and a bit desperate) if they were trying to sell something, but they never appear to be.

Always "10x"/"100x" more productive with AI, "you will miss out if you don't adopt now"! Build a great company 100x faster and every rational actor in the market will notice, believe you and be begging to adopt your ways of working (and you will become filthy rich as a nice kicker).

The proof of the pudding is in the eating.

Workaccount2 · 2h ago
"Why are we paying you $150k/yr to middleman a chatbot?"
ramesh31 · 1h ago
>"Why are we paying you $150k/yr to middleman a chatbot?"

Because I don't get paid $150k/yr to write HTML and CSS. I get paid to provide technical solutions to business problems. And "chatbots" are a very useful new tool to aid in that.

kweingar · 27m ago
> I get paid to provide technical solutions to business problems.

That's true of all SWEs who write HTML and CSS, and it's the reason I don't think there's much downside for devs to not proactively start using these agentic tools.

If it truly turns weeks of work into hours as you say, then my managers will start asking me to use them, and I will use them. I won't be at a disadvantage compared to people who started using them a bit earlier than me.

If I am looking for a new job and find an employer that wants people to use agentic tools, then I will tell the hiring manager that I will use those tools. Again, no disadvantage.

Being outdated as a tech employee puts you at a disadvantage to the extent that there is a difficult-to-cross gap. If you are working in COBOL and the market demands Rust engineers, then you need a significant amount of learning/experience to catch up.

But a major pitch of AI tools is that it is not difficult to cross the gap. You draw on your domain experience to describe what you want, and it gives it to you. When it makes a mistake, you draw on your domain experience to tweak or fix things as needed.

Maybe someday there will be a gap. Maybe people will develop years of experience and intuition using particular AI tools that makes them much more attractive than somebody without this experience. But the tools are churning so quickly (Claude Code and Cursor are brand new, tools from 18 months ago are obsolete, newer and better tools are surely coming soon) that this seems far off.

amarcheschi · 4h ago
I'm surprised by "no line of CSS/HTML in months". Maybe it's an exaggeration, and that's okay.

However, just today I was building a website for fun with Gemini and had to manually fix some issues with CSS that it struggled with. As often happens, trying to let it repair the damage only made it go into a pit of despair (for me). I fixed the issues in about a glance and 5 minutes. This is not to say it's bad, but sometimes it still makes absurd mistakes and can't find a way to solve them.

ramesh31 · 3h ago
>"just today i was building a website for fun with gemini and had to manually fix some issues with css that he struggled with."

Tailwind (with utility classes) is the real key here. It provides a semantic layer over CSS that allows the LLM to reason about how things will actually look. Night and day difference from using stylesheets with custom classes.

PaulHoule · 4h ago
I have pretty good luck with AI assistants with CSS and with theming React components like MUI where you have to figure out what to put in an sx or a theme. Sure beats looking through 50 standards documents (fortunately not a lot of "document A invalidates document B" in that pile) or digging through wrong answers where ignoramuses hold court on StackOverflow.
dlojudice · 4h ago
> are now done in hours to a higher level of quality

However, I feel that there is a big difference between the models. In my tests, using Cursor, Claude 3.7 Sonnet has a much more refined "aesthetic sense" than other models. Many times I ask "make it more beautiful" and it manages to improve, where other models just can't understand it.

danielbln · 2h ago
I've noticed the same, but I wonder if this new Gemini checkpoint is better at it now.
preommr · 4h ago
Elaborate, because I have serious doubts about this.

If we're talking about just slapping on tailwind+component-library(e.g. shadcn-ui, material), then that's just one step-above using no-code solutions. Which, yes, that works well. But if someone didn't need customized logic, then it was always possible to just hop on fiverr or use some very simple template-based tools to accomplish this.

If we're talking more advanced logic, understanding aesthetics, etc., then I'd say it's much worse than other coding areas like backend, because frontend works on a visual and UX level beyond just code, which is just text manipulation (and what LLMs excel at). In other words, I think the results are still very shallow beyond first impressions.

shostack · 4h ago
What does your tool and model stack look like for this?
ramesh31 · 4h ago
Cline with Gemini 2.5 (https://cline.bot/)

Framelink MCP (https://github.com/GLips/Figma-Context-MCP)

Playwright MCP (https://github.com/microsoft/playwright-mcp)

Pull down designs via Framelink, optionally enrich with PNG exports of nodes added as image uploads to the prompt, write out the components, test/verify via Playwright MCP.

Gemini has a 1M context size now, so this applies to large mature codebases as well as greenfield. The key thing here is the coding agent being really clever about maintaining its context; you don't need to fit an entire codebase into a single prompt in the same way that you don't need to fit the entire codebase into your head to make a change, you just need enough context on the structure and form to maintain the correct patterns.

jjani · 3h ago
The designs itself are still done by humans, I presume?
ramesh31 · 21m ago
>The designs itself are still done by humans, I presume?

Indeed, in fact design has become the bottleneck now. Figma has really dropped the ball here WRT building out AI-assisted (not AI-driven) tooling for designers.

mediaman · 3h ago
I find they achieve acceptable, but not polished levels of work.

I'm not even a designer, but I care about the consistency of UI design and whether the overall experience is well-organized, aligned properly, things are placed in a logical flow for the user, and so on.

While I'm pro-AI tooling and use it heavily, and these models usually provide a good starting point, I can't imagine shipping the slop without writing/editing a line of HTML for anything that's interaction-heavy.

redox99 · 4h ago
What tools do you use?
white_beach · 4h ago
object?

(aider joke)

xyst · 3h ago
Proprietary junk beats DeepSeek by a mere 213 points?

Oof. G and others are way behind