Gemini 2.5 Pro Preview
462 points by meetpateltech | 444 comments | 5/6/2025, 3:10:00 PM | developers.googleblog.com ↗
There are still significant limitations, no amount of prompting will get current models to approach abstraction and architecture the way a person does. But I'm finding that these Gemini models are finally able to replace searches and stackoverflow for a lot of my day-to-day programming.
I find this sentiment increasingly worrisome. It's entirely clear that every last human will be beaten on code design in the upcoming years (I am not going to argue whether it's 1 or 5 years away, who cares?).
I wish people would just stop holding on to what amounts to nothing, and think and talk more about what can be done in a new world. We need good ideas and I think this could be a place to advance them.
Can you point to _any_ evidence to support that human software development abilities will be eclipsed by LLMs other than trying to predict which part of the S-curve we're on?
Citation needed. In fact, I think this pretty clearly hits the "extraordinary claims require extraordinary evidence" bar.
My friend, we are living in a world of exponential increase of AI capability, at least for the last few years - who knows what the future will bring!
Here someone just claimed that it is "entirely clear" LLMs will become super-human, without any evidence.
https://en.wikipedia.org/wiki/Extraordinary_claims_require_e...
The way you've framed it seems like the only evidence you will accept is after it's actually happened.
In my mind, at this point we either need (a) some previously "hidden" super-massive source of training data, or (b) another architectural breakthrough. Without either, this is a game of optimization, and the scaling curves are going to plateau really fast.
a) it hasn't even been a year since the last big breakthrough, the reasoning models like o3 only came out in September, and we don't know how far those will go yet. I'd wait a second before assuming the low-hanging fruit is done.
b) I think coding is a really good environment for agents / reinforcement learning. Rather than requiring a continual supply of new training data, we give the model coding tasks to execute (writing / maintaining / modifying) and then test its code for correctness. We could for example take the entire history of a code-base and just give the model its changing unit + integration tests to implement. My hunch (with no extraordinary evidence) is that this is how coding agents start to nail some of the higher-level abilities.
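For what it's worth, a rough sketch of the loop being described, purely illustrative; every helper name here (sample_commit, generate_patch, run_tests, update_policy) is hypothetical, not a real API:

```python
# Hypothetical RL-on-repo-history loop; all helpers are placeholders.
def reinforcement_round(model, repo_history, optimizer):
    commit = sample_commit(repo_history)            # pick a historical change
    tests = commit.tests                            # the unit/integration tests added at that commit
    patch = generate_patch(model, commit.parent_tree, tests)
    passed = run_tests(commit.parent_tree, patch, tests)
    reward = 1.0 if passed else 0.0                 # correctness is the only signal
    update_policy(model, optimizer, patch, reward)  # e.g. a policy-gradient step
    return reward
```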
I think everyone expected AlphaGo to be the research direction to pursue, which is why it was so surprising that LLMs turned out to work.
Isn’t software engineering a lot more than just writing code? And I mean like, A LOT more?
Informing product roadmaps, balancing tradeoffs, understanding relationships between teams, prioritizing between separate tasks, pushing back on tech debt, responding to incidents, it’s a feature and not a bug, …
I’m not saying LLMs will never be able to do this (who knows?), but I’m pretty sure SWEs won’t be the only role affected (or even the most affected) if it comes to this point.
Where am I wrong?
Power saws really reduced time, lathes even more so. Power drills changed drilling immensely, and even nail guns are used on roofing projects because manual is way too slow.
All the jobs still exist, but their tools are way more capable.
* "it's too hard!"
* "my coworkers will just ruin it"
* "startups need to pursue PMF, not architecture"
* "good design doesn't get you promoted"
And now we have "AI will do it better soon."
None of those are entirely wrong. They're not entirely correct, either.
This turns out to be a big issue. I read everything about software design I could get my hands on for years, but then at an actual large company it turned out not to help, because I'd never read anything about how to get others to follow the advice in my head from all that reading.
Ask one to make changes to such a code base, and you will get whatever branch the dice told the tree to go down that day.
To paraphrase, “LLMs are like a box of chocolates…”.
And if you have the patience to try and coax the AI back on track, you probably could have just done the work faster yourself.
Has anyone come close to solving this? I keep seeing all of this "cluster of agents" designs that promise to solve all of our problems but I can't help but wonder how it works out in the first place given they're not deterministic.
FWIW, I think you're probably right that we need to adapt, but there was no explanation as to _why_ you believe that that's the case.
That said, IMHO it is inevitable. My personal (dismal) view is that businesses see engineering as a huge cost center to be broken up, and it will play out just like manufacturing -- decimated without regard to the human cost. The profit motive and cost savings are just too great to not try. It is a very specific line item, so cost/savings attribution is visible and already tracked. Finally, a good % of the industry has been staffed up with under-trained workers (e.g., express bootcamp) who aren't working on abstraction, etc. -- they are doing basic CRUD work.
Most cost centers in the past were decimated in order to make progress: from horse-drawn carriages to cars and trucks, from mining pickaxes to mining machines, from laundry at the river to clothes washing machines, etc. Is engineering a particularly unique endeavor that needs to be saved from automation?
In what world is this statement remotely true?
If someone were to claim: no computer will ever be able to beat humans in code design, would you agree with that? If the answer is "no", then there's your proof.
I'm not convinced that they can reason effectively (see the ARC-AGI-2 benchmarks). Doesn't mean that they are not useful, but they have their limitations. I suspect we still need to discover tech distinct from LLMs to get closer to what a human brain does.
Our entire industry (after all these years) does not have even a remotely sane measure or definition of what good code design is. Hence, this statement is dead on arrival, as you are claiming something that cannot be either proven or disproven by anyone.
Which person is it? Because 90% of the people in our trade are bad, like, real bad.
I get that people on HN are in that elitist niche of those who care more, focus on career more, etc so they don't even realize the existence of armies of low quality body rental consultancies and small shops out there working on Magento or Liferay or even worse crap.
I think there is a total seismic change in software that is about to go down, similar to something like going from gas lamps to electric. Software doesn't need to be the way it is now anymore, since we have just about solved human language to computer interface translation. I don't want to fuss with formatting a Word document anymore; I would rather just tell an LLM and let it modify the program memory to implement what I want.
No-code and AI-assisted programming have been said to be just around the corner since 2000. We have just arrived at a point where models remix what others have typed on their keyboards, and yet somebody still argues that humans will be left in the dust in the near term.
No machine, humans included, can create something more complex than itself. This is the rule of abstraction. As you go higher level, you lose expressiveness. Yes, you express more with less, yet you can express less in total. You're reducing the set's symbol size (element count) as you go higher by clumping symbols together and assigning more complex meanings to them.
Yet, being able to describe a larger set with more elements while keeping all elements addressable with fewer possible symbols doesn't sound plausible to me.
So, as others said: citation needed. Extraordinary claims need extraordinary evidence. No, asking AI to create a premium mobile photo app and getting Halide's design as an output doesn't count. It's training data leakage.
It is kinda a meme at this point, that there is no more "publicly available"... cough... training data. And while there have been massive breakthroughs in architecture, a lot of the progress of the last couple years has been ever more training for ever larger models.
So, at this point we either need (a) some previously "hidden" super-massive source of training data, or (b) another architectural breakthrough. Without either, this is a game of optimization, and the scaling curves are going to plateau really fast.
Technically correct. And yet, you would probably at least be a little worried about that cliff and rather talk about that.
I just dropped version 0.1 of my Gemini book, and I have an example for making a Gem (really simple to do); read online link:
https://leanpub.com/solo-ai/read
The unfortunate state of open source funding makes building such a simple tool a losing venture, unfortunately.
There's always been a demand for programming by non technical stakeholders that they try and solve without bringing on real programmers. No matter the tool, I think the problem is evergreen.
That is, if it's true that abstraction and architecture are useful for a given product, then people who know how to do those things will succeed in creating that product, and those who don't will fail. I think this is true for essentially all production software, but a lot of software never reaches production.
Transitioning or entirely recreating "vibecoded" proofs of concept to production software is another skill that will be valuable.
Having a good sense for when to do that transition, or when to start building production software from the start, and especially the ability to influence decision makers to agree with you, is another valuable skill.
I do worry about what the careers of entry level people will look like. It isn't obvious to me how they'll naturally develop any of these skills.
The fact that you called it out as a PoC is already many bars above what most vibe coders are doing. Which is considering a barely functioning web app as proof that vibe coding is a viable solution for coding in general.
> I do worry about what the careers of entry level people will look like. It isn't obvious to me how they'll naturally develop any of these skills.
Exactly. There isn't really a path forward from vibe coding to anything productizable without actual, deep CS knowledge. And LLMs are not providing that.
I'm sure lots of people aren't seeing it this way, but the point I was trying to make about this being a skill differentiator is that I think understanding the advantages, limitations, and tradeoffs, and keeping that understanding up to date as capabilities expand, is already a valuable skillset, and will continue to be.
The models are very impressive. But issues like these still make me feel they are more pattern matching (although there's also some magic, don't get me wrong) than fully reasoning over everything correctly like you'd expect of a typical human reasoner.
And that's fine and useful.
And crippled, incomplete, and deceiving, dangerous.
No: in this context it's a plaster-cast saw that looks like an oscillating one but is instead a rotary saw for wood, and you will tend to believe it has safety features it was never engineered with.
For plaster casts you have to plan, design and engineer a properly apt saw - learn what you must from the experience of saws for wood, but it's a specific project.
I assume that it's trickier than it seems as it hasn't happened yet.
Are we sure they know these things as opposed to being able to consistently guess correctly? With LLMs I'm not sure we even have a clear definition of what it means for it to "know" something.
But also things where guessing was desirable. For example with a riddle it would tell you it did not know or there wasn't enough information. After pressuring it to answer anyway it would correctly solve the riddle.
The official llama 2 finetune was pretty bad with this stuff.
And if you bully it enough on something nonsensical it'll give you a wrong answer.
You press it, and it takes a guess even though you told it not to, and gets it right, then you go "see it knew!". There's no database hanging out in ChatGPT/Claude/Gemini's weights with a list of cities and the tallest buildings. There's a whole bunch of opaque stats derived from the content it's been trained on that means that most of the time it'll come up with the same guess. But there's no difference in process between that highly consistent response to you asking the tallest building in New York and the one where it hallucinates a Python method that doesn't exist, or suggests glue to keep the cheese on your pizza. It's all the same process to the LLM.
What is the practical difference you're imagining between "consistently correct guess" and "knowledge"?
LLMs aren't databases. We have databases. LLMs are probabilistic inference engines. All they do is guess, essentially. The discussion here is about how to get the guess to "check itself" with a firmer idea of "truth". And it turns out that's hard because it requires that the guessing engine know that something needs to be checked in the first place.
Knowing it's correct. You've just instructed it not to guess remember? With practice people can get really good at guessing all sorts of things.
I think people have a serious misunderstanding about how these things work. They don't have their training set sitting around for reference. They are usually guessing. Most of the time with enough consistency that it seems like they "know". Then when they get it wrong we call it "hallucinations". But instructing them not to guess means suddenly they can't answer much. There's no guessing vs not guessing with an LLM; it's all the same statistical process, the difference is just whether it gives the right answer or not.
Knowledge has an objective correctness. We know that there is a "right" and "wrong" answer and we know what a "right" answer is. "Consistently correct guesses", based on the name itself, is not reliable enough to actually be trusted. There's absolutely no guarantee that the next "consistently correct guess" is knowledge or a hallucination.
Also, there are whole subfields of philosophy that make your statement here kinda laughably naive. Suffice it to say that, no, knowledge as rigorously understood does not have "an objective correctness".
Knowledge is knowledge because the knower knows it to be correct. I know I'm typing this into my phone, because it's right here in my hand. I'm guessing you typed your reply into some electronic device. I'm guessing this is true for all your comments. Am I 100% accurate? You'll have to answer that for me. I don't know it to be true, it's a highly informed guess.
Being wrong sometimes is not what makes a guess a guess. It's the difference between pulling something from your memory banks, be they biological or mechanical, vs inferring it from some combination of your knowledge (what's in those memory banks), statistics, intuition, and whatever other fairy dust you sprinkle on.
The fact that you are humanizing an LLM is honestly just plain weird. It does not have feelings. It doesn't care that it has to answer "is it correct?" and saying poor LLM is just trying to tug on heartstrings to make your point.
They are the perfect "fake it till you make it" example cranked up to 11. They'll bullshit you, but will do it confidently and with proper grammar.
> Many attempts at making them refuse to answer what they don't know caused them to refuse to answer things they did in fact know.
I can see in some contexts that being desirable if it can be a parameter that can be tweaked. I guess it's not that easy, or we'd already have it.
Regexes are another area where I can't get much help from LLMs. If it's something common like a phone number, that's fine. But with anything novel it seems to have trouble. It will spit out junk very confidently.
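For what it's worth, one cheap guard is to make whatever regex the LLM hands back pass your own test cases before trusting it; a tiny sketch (the pattern and cases are just illustrative):

```python
import re

# Illustrative US-style phone pattern; the point is the test harness, not the regex.
pattern = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")

cases = {
    "555-867-5309": True,
    "(555) 867-5309": True,
    "5558675309": True,
    "867-5309": False,
}
for text, expected in cases.items():
    assert bool(pattern.match(text)) == expected, text
```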
Well, let us reward them for producing output that is consistent with documentation selected and accessed from a database, then, and massacre them for output they cannot justify - like we do with humans.
To my surprise, Gemini got it spot on first time.
- Determining what features to make for users
- Forecasting out a roadmap that are aligned to business goals
- Translating and prioritizing all of these to a developer (regardless of whether these developers are agentic or human)
Coincidentally these are the areas that frequently are the largest contributors to software businesses' successes... not whether you use NextJS with a Go and Elixir backend against a multi-geo redundant, multi-sharded CockroachDB database, or that your code is clean/elegant.
At half of the companies you can randomly pick those three things and probably improve the situation. Using an AI would be a massive improvement.
internet also helps.
Also having markdown files with the stack etc and any -rules-
What do you mean specifically? I found the "let's write a spec, let's make a plan, implement this step by step with testing" results in basically the same approach to design/architecture that I would take.
One area I've still noticed weakness is if you want to use a pretty popular library from one language in another language, it has a tendency to think the function signatures in the popular language match the other.
Naively, this seems like a hard problem to solve.
I.e. ask it how to use torchlib in Ruby instead of Python.
so it's a great tool in the hands of a creative architect, but it is not one in and by itself and I don't see yet how it can be.
my pet theory is that the human brain can't understand and formalize its creativity because you need a higher-order logic to fully capture some other logic. People have objected that the second Gödel incompleteness theorem "can't be applied like this to the brain", but I stubbornly insist that yes, the brain implements _some_ formal system and it can't understand how that system works. Tongue in cheek, somewhat, maybe.
but back to earth I agree llms are a great tool for a creative human mind.
I would argue that the second incompleteness theorem doesn't have much relevance to the human brain, because it is trying to prove a falsehood. The brain is blatantly not a consistent system. It is, however, paraconsistent: we are perfectly capable of managing a set of inconsistent premises and extracting useful insight from them. That's a good thing.
It's also true that we don't understand how our own brain works, of course.
> If you think his theorem limits human knowledge, think again
https://www.youtube.com/watch?v=OH-ybecvuEo
first, with Neil DeGrasse Tyson I feel in fairly ok company with my little pet peeve fallacy ;-)
yah as I said, I both get it and don't ;-)
And then the video loses me when it says that statements about the brain "being a formal method" can't be made "because" the finite brain can't hold infinity.
That's beyond me. Although obviously the brain can't enumerate infinite possibilities, we're still fairly capable of formal thinking, aren't we?
And many lovely formal systems nicely fit on fairly finite paper. And formal proofs can be run on finite computers.
So somehow the logic in the video is beyond me.
My humble point is this: if we build "intelligence" as a formal system, like some silicon running some fancy-pants LLM, and we want rigor in its construction, i.e. if we want to be able to tell "this is how it works", then we need to use a subset of our brain that's capable of formal and consistent thinking. And my claim is that _that subsystem_ can't capture "itself". So we have to use "more" of our brain than that subsystem. So either the "AI" that we understand is "less" than what we need and use to understand it, or we can't understand it.
I fully get our brain is capable of more, and this "more" is obviously capable of very inconsistent outputs, HAL 9000 had that problem, too ;-)
I'm an old woman. It's late at night.
When I sat through Gödel back in the early 1990s in CS and then in contrast listened to the enthusiastic AI lectures, it didn't sit right with me. Maybe one of the AI profs made the tactical mistake of calling our brain "wet biological hardware" in contrast to "dry silicon hardware", but I can't shake off that analogy ;-) I hope I'm wrong :-) "Real" AI that we can trust because we can reason about its inner workings will be fun :-)
I find that for 90% of the things I'm doing, an LLM removes 90% of the starting friction and lets me get to the part that I'm actually interested in. Of course, I also develop professionally in a Python stack and LLMs are one-shotting a ton of stuff. My work is standard data pipelines and web apps.
I'm a tech lead at a FAANG-adjacent company w/ 11 YOE, and the systems I work with are responsible for about half a billion dollars a year in transactions directly and growing. You could argue maybe my standards are lower than yours, but I think if I was making deadly mistakes the company would have been on my ass by now or my peers would have caught them.
Everybody that I work with is getting valuable output from LLMs. We are using all the latest openAI models and have a business relationship with openAI. I don't think I'm even that good at prompting and mostly rely on "vibes". Half of the time I'm pointing the model to an example and telling it "in the style of X do X for me".
I feel like comments like these almost seem gaslight-y, or maybe there's just a major expectation mismatch between people. Are you expecting LLMs to just do exactly what you say and your entire job is to sit back and prompt the LLM? Maybe I'm just used to shit code, but I've looked at many code bases and there is a huge variance in quality and the average is pretty poor. The average code that AI pumps out is much better.
Just right now, I've been feeding o4-mini with high effort a C++ file with a deadlock in it.
It has failed to fix the problem after 3 attempts, and it introduced a double-free bug in one of them. It did not see the double-free problem until I pointed it out.
I use it mostly for Golang and Rust, I work building cloud infrastructure automation tools.
I'll try to give some examples, they may seem overly specific but it's the first things that popped into my head when thinking about the subject.
Personally, I found that LLMs consistently struggle with dependency injection patterns. They'll generate tightly coupled services that directly instantiate dependencies rather than accepting interfaces, making testing nearly impossible.
If I ask them to generate code and also their respective unit tests, they'll often just create a bunch of mocks or start importing mock libraries to compensate for their faulty implementation, rather than fixing the underlying architectural issues.
They consistently fail to understand architecture patterns, generating code where infrastructure concerns bleed into domain logic. When corrected, they'll make surface level changes while missing the fundamental design principle of accepting interfaces rather than concrete implementations, even when explicitly instructed that it should move things like side-effects to the application edges.
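To make that concrete, roughly the difference being described, with made-up names (Mailer, SmtpMailer, SignupService are illustrative, not from any real codebase):

```python
from typing import Protocol

class Mailer(Protocol):
    def send(self, to: str, body: str) -> None: ...

class SmtpMailer:
    def send(self, to: str, body: str) -> None:
        print(f"sending to {to}: {body}")  # stand-in for real SMTP

# The coupled style LLMs tend to emit: the dependency is hard-wired,
# so tests end up reaching for mock libraries or monkeypatching.
class SignupServiceCoupled:
    def __init__(self) -> None:
        self.mailer = SmtpMailer()

# What the comment is asking for: accept the interface, inject the
# concrete implementation at the application edge.
class SignupService:
    def __init__(self, mailer: Mailer) -> None:
        self.mailer = mailer

    def register(self, email: str) -> None:
        self.mailer.send(email, "welcome")

class FakeMailer:
    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []

    def send(self, to: str, body: str) -> None:
        self.sent.append((to, body))

SignupService(FakeMailer()).register("a@example.com")  # testable without any mock library
```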
Despite tailoring prompts for different models based on guides and personal experience, I often spend 10+ minutes correcting the LLM's output when I could have written the functionality myself in half the time.
No, I'm not expecting LLMs to replace my job. I'm expecting them to produce code that follows fundamental design principles without requiring extensive rewriting. There's a vast middle ground between "LLMs do nothing well" and the productivity revolution being claimed.
That being said, I'm glad it's working out so well for you, I really wish I had the same experience.
I'm starting to suspect this is the issue. Neither of these languages is in the top 5, so there is probably less to train on. It'd be interesting to see if this improves over time or if the gap between the languages becomes even more pronounced as it becomes favorable to use a language simply because LLMs are so much better at it.
There are a lot of interesting discussions to be had here:
- if the efficiency gains are real and llms don't improve in lesser used languages, one should expect that we might observe that companies that chose to use obscure languages and tech stacks die out as they become a lot less competitive against stacks that are more compatible with llms
- if the efficiency gains are real this might disincentivize new language adoption and creation unless the folks training models somehow address this
- languages like python with higher output acceptance rates are probably going to become even more compatible with llms at a faster rate if we extrapolate that positive reinforcement is probably more valuable than negative reinforcement for llms
I do wonder if the gap will keep widening though. If newer tools/tech don’t have enough training data, LLMs may struggle much more with them early on. Although it's possible that RAG and other optimization techniques will evolve fast enough to narrow the gap and prevent diminishing returns on LLM driven productivity.
As someone who works primarily within the Laravel stack, in PHP, the LLMs are wildly effective. That's not to say there aren't warts - but my productivity has skyrocketed.
But it's become clear that when you venture into the weeds of things that aren't very mainstream you're going to get wildly more hallucinations and solutions that are puzzling.
Another observation is that when you start getting outside of your expertise, you're likely going to have a correlating amount of 'waste' time where the LLM is spitting out solutions that an expert in the domain would immediately recognize as problematic, but that the non-expert will look at and likely conclude seem reasonable - or, worse, not even look at and just try to use.
100% of the time that I've tried to get Claude/Gemini/ChatGPT to "one shot" a whole feature or refactor it's been a waste of time and tokens. But when I've spent even a minimal amount of energy to focus it in on the task, curate the context and then approach? Tremendously effective most times. But this also requires me to do enough mental work that I probably have an idea of how it should work out which primes my capability to parse the proposed solutions/code and pick up the pieces. Another good flow is to just prompt the LLM (in this case, Claude Code, or something with MCP/filesystem access) with the feature/refactor/request asking it to draw up the initial plan of implementation to feed to itself. Then iterate on that as needed before starting up a new session/context with that plan and hitting it one item at a time, while keeping a running {TASK_NAME}_WORKBOOK.md (that you task the llm to keep up to date with the relevant details) and starting a new session/context for each task/item on the plan, using the workbook to get the new sessions up to speed.
Also, this is just a hunch, but I'm generally a nocturnal creature and tend to be working from the evening into the early morning. Once 8am PST rolls around I really feel like Claude (in particular) just turns into mush. Responses get slower and it seems to lose context where it otherwise wouldn't: it starts getting off topic and has to re-read files it should already have in context. (Note: I'm pretty diligent about refreshing/working with the context, and something happens in the 'work' hours to make it terrible.)
I'd imagine we're going to end up with language specific llms (though I have no idea, just seems logical to me) that a 'main' model pushes tasks/tool usage to. We don't need our "coding" LLM's to also be proficient on oceanic tidal patterns and 1800's boxing history. Those are all parameters that could have been better spent on the code.
Python is generally fine, as you've experienced, as is JavaScript/TypeScript & React.
I've had mixed results with C# and PowerShell. With PowerShell, hallucinations are still a big problem. Not sure if it's the Noun-Verb naming scheme of cmdlets, but most models still make up cmdlets that don't exist on the fly (though they will correct themselves once you point out that the cmdlet doesn't exist - but at that point, why bother when I can just do it myself correctly the first time?).
With C#, even with my existing code as context, it can't adhere to a consistent style, and can't handle nullable reference types (albeit a relatively new feature in C#). It works, but I have to spend too much time correcting it.
Given my own experiences and the stacks I work with, I still won't trust an LLM in agent mode. I make heavy use of them as a better Google, especially since Google has gone to shit, and to bounce ideas off of, but I'll still write the code myself. I don't like reviewing code, and having LLMs write code for me just turns me into a full time code reviewer, not something I'm terribly interested in becoming.
I still get a lot of value out of the tools, but for me I'm still hesitant to unleash them on my code directly. I'll stick with the chat interface for now.
edit Golang is another language I've had problems relying on LLMs for. On the flip side, LLMs have been great for me with SQL and I'm grateful for that.
https://github.com/upstash/context7
LLMs just guess, so you have to give it a cheatsheet to help it guess closer to what you want.
Tell me about it. Thankfully I have not experienced it as much with Claude as I did with GPT. It can get quite annoying. GPT kept telling me to use this and that and none of them were real projects.
Either you have no idea how terrible real world commercial software (architecture) is or you're vastly underestimating newer LLMs or both.
But I wonder when we'll be happy? Do we expect colleagues, friends and family to be 100% laser-accurate 100% of the time? I'd wager we don't. Should we expect that from an artificial intelligence too?
You could say that when I use my spanner/wrench to tighten a nut it works 100% of the time, but as soon as I try to use a screwdriver it's terrible and full of problems and it can't even reliably do something as trivially easy as tighten a nut, even though a screwdriver works the same way by using torque to tighten a fastener.
Well that's because one tool is designed for one thing, and one is designed for another.
Then why are we using them to write code, which should produce reliable outputs for a given input... much like a calculator?
Obviously we want the code to produce correct results for whatever input we give, and as it stands now, I can't trust LLM output without reviewing first. Still a helpful tool, but ultimately my desire would be to have them be as accurate as a calculator so they can be trusted enough to not need the review step.
Using an LLM and being OK with untrustworthy results, it'd be like clicking the terminal icon on my dock and sometimes it opens terminal, sometimes it might open a browser, or just silently fail because there's no reproducible output for any given input to an LLM. To me that's a problem, output should be reproducible, especially if it's writing code.
"AI"s are designed to be reliable; "AGI"s are designed to be intelligent; "LLM"s seem to be designed to make some qualities emerge.
> one tool is designed for one thing, and one is designed for another
The design of LLMs seems to be "let us see where the promise leads us". That is not really "design", i.e. "from need to solution".
So when I punched in 1/3 it was exactly 1/3.
- (1e(1e10) + 1) - 1e(1e10)
- sqrt(sqrt(2)) * sqrt(sqrt(2)) * sqrt(sqrt(2)) * sqrt(sqrt(2))
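For anyone wondering why those are good calculator tests: naive binary floating point gets them subtly wrong, while an exact or symbolic engine doesn't. A quick Python illustration (using a smaller exponent, since 1e(1e10) overflows a double):

```python
from fractions import Fraction
import math

print(Fraction(1, 3))            # exactly 1/3, no decimal rounding
print((1e16 + 1) - 1e16)         # 0.0 in double precision, not the expected 1.0
x = math.sqrt(math.sqrt(2))
print(x * x * x * x)             # close to 2, but typically carries rounding error
```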
Your interaction with LLMs is categorically closer to interactions with people than with a calculator. Your inputs into it are language.
Of course the two are different. A calculator is a computer, an LLM is not. Comparing the two is making the same category error which would confuse Mr. Babbage, but in reverse.
(“On two occasions, I have been asked [by members of Parliament], 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able to rightly apprehend the kind of confusion of ideas that could provoke such a question.”)
Mainly I meant to push back against the reflexive comparison to a friend or family member or colleague. AI is a multi-purpose tool that is used for many different kinds of tasks. Some of these tasks are analogues to human tasks, where we should anticipate human error. Others are not, and yet we often ask an LLM to do them anyway.
And it is also not just about the %. It is also about the type of error. Will we reach a point where we change our perception and say these are expected non-human errors?
Or could we have a specific LLM that only checks for these types of error?
And tools in the game, even more so (there's no excuse for the engineered).
Usually I’m using a minimum of 200k tokens to start with gemini 2.5.
200k tokens ≈ 1/3 × 200k words ≈ 1/300 × 1/3 × 200k pages (roughly 220 pages)
That generally seems right to me, given how much we hold in our heads when we're discussing something with a coworker.
While far from perfect for large projects, controlling the scope of individual requests (with orchestrator/boomerang mode, for example) seems to do wonders
Given the sheer, uh, variety of code I see day to day in an enterprise setting, maybe the problem isn't with Gemini?
I asked it a complicated question about the Scala ZIO framework that involved subtyping, type inference, etc. - something that would definitely be hard to figure out just from reading the docs. The first answer it gave me was very detailed, very convincing and very wrong. Thankfully I noticed it myself and was able to re-prompt it and I got an answer that is probably right. So it was useful in the end, but only because I realised that the first answer was nonsense.
Never seen it fumble that around
Swear people act like humans themselves don’t ever need to be asked for clarification
It'd make sense to rename WebDev Arena to React/Tailwind Arena. Its system prompt requires [1] those technologies and the entire tool breaks when requesting vanilla JS or other frameworks. The second-order implications of models competing on this narrow definition of webdev are rather troublesome.
[1] https://blog.lmarena.ai/blog/2025/webdev-arena/#:~:text=PROM...
Who will write the useful training data without LLMs? I feel we are getting fewer and fewer new things. Changes will be smaller and incremental.
To me it seems so strange that a few good language designers and ML folks haven't grouped together to work on this.
It's clear that there is a space for some LLM meta language that could be designed to compile to bytecode, binary, JS, etc.
It also doesn't need to be textual like the code we write; it could be some form of AST that an LLM like Llama can manipulate with ease.
Plenty of training data to go on, I'd imagine.
Funnily, the training of these models feels like it got cut off in the middle of the Tailwind v3/v4 transition, and Gemini always tries to correct my mistakes (… use v3 instead of v4)
Instead of learnable, stable APIs for common components with well-established versioning and well-defined tokens, we've got people literally copying and pasting components and applying diffs so they can claim they "own them".
Except the vast majority of them don't ever change a line and just end up with a strictly worse version of a normal package (typically out of date or a hodgepodge of "versions" because they don't want to figure out diffs), and the few that do make changes don't have anywhere near the design sense to be using shadcn since there aren't enough tokens to keep the look and feel consistent across components.
The would-be 1% who would change it and have their own well-thought-out design systems don't get a lift from shadcn either vs just starting with Radix directly.
-
Amazing spin job though with the "registry" idea too: "it's actually very good for AI that we invented a parallel distribution system for ad-hoc components with no standard except a loose convention around sticking stuff in a folder called ui"
Don't get me started on how ugly the HTML becomes when most tags have 20 f*cking classes which could have been two.
In typical production environments tailwind is only around 10kb[1].
[1]: https://v3.tailwindcss.com/docs/optimizing-for-production
x = 1 // set X to 1
You get:
x = 1 // added this to set x to 1
And sometimes:
// x = 1 // removed this
These comments age really fast. They should be in a git commit not a comment.
As somebody who prefers code to self-describe what it is doing I find this behaviour a bit frustrating and I can't seem to context-prompt it away.
```python
def whatever():
```
(it adds commented out code like that all the time, "just in case")
It's terrible.
I'm back to Claude Code.
In any case, it knows what good code looks like. You can say "take this code and remove spurious comments and prefer narrow exception handling over catch-all", and it'll do just fine (in a way it wouldn't do just fine if your prompt told it to write it that way the first time, writing new code and editing existing code are different tasks).
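For readers who haven't seen the phrase, "narrow exception handling over catch-all" amounts to something like this (the file name is just an example):

```python
# Catch-all: swallows typos, permission errors, everything.
try:
    config = open("settings.json").read()
except Exception:
    config = "{}"

# Narrow: only handle the failure you actually expect; let the rest surface.
try:
    config = open("settings.json").read()
except FileNotFoundError:
    config = "{}"
```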
Just end your prompt with "no code comments"
Claude 3.7 Sonnet is much more restrained and does smaller changes.
I’ve started in a narrow niche of python/flask webapps and constrained to that stack for now, but if you’re interested I’ve just opened it for signups: https://codeplusequalsai.com
Would love feedback! Especially if you see promising results in not getting huge refactors out of small change requests!
(Edit: I also blogged about how the AST idea works in case you're just that curious: https://codeplusequalsai.com/static/blog/prompting_llms_to_m...)
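Not the implementation from the linked post, but for the curious, the general idea of having edits expressed against the AST rather than raw text looks roughly like this with Python's ast module:

```python
import ast

source = "def greet(name):\n    return 'hello ' + name\n"
tree = ast.parse(source)

# A hypothetical edit an LLM might be asked to express structurally
# instead of as a raw diff: rename one function.
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef) and node.name == "greet":
        node.name = "greet_user"

print(ast.unparse(tree))  # regenerate source from the modified tree (Python 3.9+)
```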
simonw has symbex which could be useful for you for python
Or something similar that does not rely on negation.
Besides, other models seem to handle negation correctly; not sure why it's so difficult for the Gemini family of models to understand.
I also try and get it to channel that energy into the doc strings, so it isn't buried in the source.
Refactor this. Do not write any comments.
<code to refactor>
As a reminder, your task is to refactor the above code and do not write any comments.
Literally both of those are negations.
If you think negations never work tell Gemini 2.5 to "write 10 sentences that do not include the word the" and see what happens.
If it is ingesting data, there should also be a sample of the data in a comment.
"5. You must never output any comments about the progress or type of changes of your refactoring or generation. Example: you must NOT add comments like: 'Added dependency' or 'Changed to new style' or worst of all 'Keeping existing implementation'."
But I've heard "defensive code" used for the kind of code where almost every method validates its input parameters, wraps everything in a try-catch, returns nonsensical default values in failure scenarios, etc. This is a complete waste because the caller won't know what to do with the failed validations or thrown errors, and it's just unnecessary bloat that obfuscates the business logic. Validation, error handling and so on should be done in specific parts of the codebase (bonus points if you can encode the successful validation or the presence/absence of errors in the type system).
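A small illustration of that contrast (function names made up):

```python
# The "defensive" style described above: re-validate everything, swallow errors,
# return a meaningless default the caller can't distinguish from a real result.
def total_price_defensive(items):
    try:
        if items is None or not isinstance(items, list):
            return 0
        return sum(i.get("price", 0) for i in items)
    except Exception:
        return 0

# Validation done once at the edge; the core logic stays readable and
# failures surface where they can actually be handled.
def total_price(items: list[dict]) -> float:
    return sum(item["price"] for item in items)
```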
lots of hasattr("") rubbish; I've increased the amount of prompting but it still does this - basically it defers its lack of compile-time knowledge to runtime: 'let's hope for the best, and see what happens!'
Trying to teach it FAIL FAST is an uphill struggle.
Oh and yes, returning mock objects if something goes wrong is a favourite.
It truly is an Idiot Savant - but still amazingly productive.
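For anyone who hasn't run into the hasattr pattern, it looks roughly like this (attribute name made up); the second version fails fast instead of papering over a typo at runtime:

```python
# Defers the problem to runtime and hides it.
def describe_deferred(user):
    if hasattr(user, "display_name"):
        return user.display_name
    return "unknown"

# Fails fast: a missing or misspelled attribute raises immediately.
def describe(user):
    return user.display_name
```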
Removed from where? I use the attach code folder feature every day from the Gemini web app (with a script that clones a local repo that deletes .git and anything matching a gitignore pattern).
Which means if you try to force it to stop, the code quality will drop.
Models use comments to think, asking to remove will affect code quality.
What does that mean?
"Sometimes the big fish isn't only the fish"
I don't have problems with getting lots of comments in the output; I just delete them while reading what it did.
They measure the old gemini 2.5 generating proper diffs 92% of the time. I bet this goes up to ~95-98% https://aider.chat/docs/leaderboards/
Question for the google peeps who monitor these threads: Is gemini-2.5-pro-exp (free tier) updated as well, or will it go away?
Also, in the blog post, it says:
Does this mean gemini-2.5-pro-preview-03-25 now uses 05-06? Does the same apply to gemini-2.5-pro-exp-03-25?
Update: I just tried updating the date in the exp model (gemini-2.5-pro-exp-05-06) and that doesn't work.
I don't have a formal benchmark but there's a notable improvement in code generation due to this alone.
I've had gemini chug away on plans that have taken ~1 hour to implement. (~80mln tokens spent) A good portion of that energy was spent fixing mistakes made by cline/aider/roo due to search/replace mistakes. If this model gets anywhere close to 100% on diffs then this is a BFD. I estimate this will translate to a 50-75% productivity boost on long context coding tasks. I hope the initial results i'm seeing hold up!
I'm surprised by the reaction in the rest of the thread. A lot of unproductive complaining, a lot of off-topic stuff, nothing talking about the model itself.
Any thoughts from anyone else using the updated model?
Does this 2.5 pro "Preview" feel like an improvement if you had used the others?
[1] https://storage.googleapis.com/model-cards/documents/gemini-... [2] https://deepmind.google/technologies/gemini/
Fair enough, one could say, as these were all labeled as preview or experimental. Still, considering that the new model is slightly worse across the board in benchmarks (except for LiveCodeBench), it would have been nice to have the option to stick with the older version. Not everyone is using these models for coding.
I get it, chips are scarce and they want their capacity back, but it breaks trust with developers to just downgrade your model.
Call it gemini-latest and I understand that things will change. Call it *-03-25 and I want the same model that I got on 25th March.
I bet they kept training on coding tasks, made everything worse on the way, and tried to hide it under the rug because of the sunk costs.
https://livebench.ai/#/
Is this maybe not the updated card, even though the blog post claims there is one? Sure, the timestamp is in late April, but I seem to remember that the first model card for 2.5 Pro was only released in the last couple of weeks.
If you're using these models to generate code daily, the costs add up.
Sure, I'll give a really tough problem to o3 (and probably over ChatGPT, not the API), but on general code tasks, there really isn't meaningful enough difference to justify 4x the cost.
There's something seriously dysfunctional and incompetent about the team that built that web app. What a way to waste the best LLM in the world.
Software that people truly love is impossible to build in there.
Just recently a lot of people (me included) got hit with a surprise bill, with some racking up $500 in cost for normal use
I certainly got burnt and removed my API key from my tools to not accidentally use it again
Example: https://x.com/pashmerepat/status/1918084120514900395?s=46
E.g. call it Gemini Pro 2.5.1.
But I do not want to have to build a network of bots with non-deterministic outputs to simply stay on top of versions
But yeah, some kind of deterministic way to get alerts would be better.
edit> It's gemini-2.5-pro-preview-05-06
edit> Cursor says it doesn't have "good support" yet, but I'm not sure if this is a default message when it doesn't recognise a model? Is this a big deal? Should I wait until it's officially supported by Cursor?
Just trying to save time here for everyone - anyone know the answer?
https://gist.github.com/simonw/7ef3d77c8aeeaf1bfe9cc6fd68760...
30,408 input, 8,535 output = 12.336 cents.
8,500 is a very long output! Finally a model that obeys my instructions to "go long" when summarizing Hacker News threads. Here's the script I used: https://gist.github.com/simonw/7ef3d77c8aeeaf1bfe9cc6fd68760...
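For anyone checking the arithmetic, those numbers line up with the preview's list pricing of $1.25 per million input tokens and $10 per million output tokens for prompts under 200k tokens (treat the rates as my assumption, not the commenter's statement):

```python
input_tokens, output_tokens = 30_408, 8_535
cost = input_tokens * 1.25 / 1e6 + output_tokens * 10 / 1e6
print(f"${cost:.5f}")  # $0.12336, i.e. about 12.3 cents
```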
Web chat interfaces are great, but copy/paste gets old fast.
Although it’s not for coding, I have noticed Gemini 2.5 pro Deep Research has surpassed Grok’s DeepSearch in thoroughness and research quality however.
My choice of LLMs was
Coding in cursor: Claude
General questions: Grok, if it fails then Gemini
Deep Research: Gemini (I don't have GPT plus, I heard it's better)
You must rename your files to .tsx.txt THEN IT ACCEPTS THEM and works perfectly fine writing TSX code.
This is absolutely bananas. How can such a powerful coding engine have this behavior?
https://show.franzai.com/a/star-zero-huge?nobuttons
- [codes] showing up instead of references,
- raw search tool output sliding across the screen,
- Gemini continuously answering questions asked two or more messages before but ignoring the most recent one (you need to ask Gemini an unrelated question for it to snap out of this bug for a few minutes),
- weird messages including text irrelevant to any of my chats with Gemini, like baseball,
- confusing its own replies with mine,
- not being able to run its own Python code due to some unsolvable formatting issue,
- timeouts, and more.
It turns a perfectly readable 5-line code snippet into a 30-line snippet full of comments and mostly unnecessary error handling, code which becomes harder to reason about.
But for sysadmin tasks, like dealing with ZFS and LVM, it is absolutely incredible.
I'm assuming large companies are mandating it, but ultimately the work that these LLMs seem poised for would benefit smaller companies most and I don't think they can really afford using them? Are people here paying for a personal subscription and then linking it to their work machines?
Define "smaller"? In small companies, say 10 people, there are no hoops. That is the whole point of small companies!
"They trust me. Dumb ..."
Junior engineers will complete a task to update an API, or fix a bug on the front-end, within a couple of days with, let's say, 80 percent certainty they hit the mark (maybe an inflated metric). How are people comparing the output of these models to that of a junior engineer if they generally just say "Here is some of my code, what's wrong with it?" That certainly isn't taking a real ticket and completing it in any capacity.
I am obviously very skeptical but mostly I want to try one of these models myself but in reality I think that my higher-ups would think that they introduce both risk AND the potential for major slacking off haha.
The latest SOTA models are definitely at the point where they can absolutely improve workflows and not get in your way too much.
I treat it a lot like an intern, “Here’s an api doc and spec, write me the boilerplate and a general idea about implementation”
Then I go in, review, rip out crud and add what I need.
It almost always gets architecture wrong, don't expect that from it. However, it's great at small functions and such.
When it comes to refactoring ask it for suggestions, eat the meat leave the bones.
Now that there's a big nugget to chew on (LLMs), you're seeing that latent capability come to life. This awakening feels more bottom-up driven than top-down. Google's a war machine chugging along nicely in peacetime, but now it's war again!
Hats off to the engineers working on the tech. Excited to try it out!
No the top talent worked on exciting things like Fuchsia. Ad tech is boring stuff written by people who aren't enough of a snob to refuse working on ad tech.
Isn’t that a flower?
(Hopefully you see my point)
At first I was very impressed with its coding abilities, switching off of Claude for it, but recently I've been using GPT o3, which I find is much more concise and generally better at problem solving when you hit an error.
Also, why doesn't Ctrl+C work??
Drop me a line (see profile) if you're interested in beta testing it when it's out.
Currently Claude Code is a big value-add for Claude. Google has nothing equivalent; aider requires far more manual work.
But with my app: you can install the host anywhere and connect to it securely (via SSH forwarding or private VPN or what have you) so that workflow definitely still works!
https://github.com/plandex-ai/plandex
https://github.com/block/goose
How are they now? Sufficiently good? Competent? Competitive? Or limited? My needs are very consumer oriented, not programming/api stuff.
It really is wild to have seen this happen over the last year. The days of traditional "design-to-code" FE work are completely over. I haven't written a line of HTML/CSS in months. If you are still doing this stuff by hand, you need to adapt fast. In conjunction with an agentic coding IDE and a few MCP tools, weeks worth of UI work are now done in hours to a higher level of quality and consistency with practically zero effort.
The only disadvantage to not using these tools would be that your current output is slower. As soon as your employer asks for more or you're looking for a new job, you can just turn on AI and be as fast as everyone who already uses it.
Always "10x"/"100x" more productive with AI, "you will miss out if you don't adopt now"! Build a great company 100x faster and every rational actor in the market will notice, believe you and be begging to adopt your ways of working (and you will become filthy rich as a nice kicker).
The proof of the pudding is in the eating.
Because I don't get paid $150k/yr to write HTML and CSS. I get paid to provide technical solutions to business problems. And "chatbots" are a very useful new tool to aid in that.
That's true of all SWEs who write HTML and CSS, and it's the reason I don't think there's much downside for devs to not proactively start using these agentic tools.
If it truly turns weeks of work into hours as you say, then my managers will start asking me to use them, and I will use them. I won't be at a disadvantage compared to people who started using them a bit earlier than me.
If I am looking for a new job and find an employer that wants people to use agentic tools, then I will tell the hiring manager that I will use those tools. Again, no disadvantage.
Being outdated as a tech employee puts you at a disadvantage to the extent that there is a difficult-to-cross gap. If you are working in COBOL and the market demands Rust engineers, then you need a significant amount of learning/experience to catch up.
But a major pitch of AI tools is that it is not difficult to cross the gap. You draw on your domain experience to describe what you want, and it gives it to you. When it makes a mistake, you draw on your domain experience to tweak or fix things as needed.
Maybe someday there will be a gap. Maybe people will develop years of experience and intuition using particular AI tools that makes them much more attractive than somebody without this experience. But the tools are churning so quickly (Claude Code and Cursor are brand new, tools from 18 months ago are obsolete, newer and better tools are surely coming soon) that this seems far off.
However, just today I was building a website for fun with Gemini and had to manually fix some issues with CSS that it struggled with. As it often happens, trying to let it repair the damage only made it go into a pit of despair (for me). I fixed the issues in about a glance and 5 minutes. This is not to say it's bad, but sometimes it still makes absurd mistakes and can't find a way to solve them.
Tailwind (with utility classes) is the real key here. It provides a semantic layer over CSS that allows the LLM to reason about how things will actually look. Night and day difference from using stylesheets with custom classes.
However, I feel that there is a big difference between the models. In my tests, using Cursor, Claude 3.7 Sonnet has a much more refined "aesthetic sense" than other models. Many times I ask "make it more beautiful" and it manages to improve, where other models just can't understand it.
If we're talking about just slapping on tailwind+component-library (e.g. shadcn-ui, material), then that's just one step above using no-code solutions. Which, yes, works well. But if someone didn't need customized logic, then it was always possible to just hop on fiverr or use some very simple template-based tools to accomplish this.
If we're talking more advanced logic, understanding aesthetics, etc. Then I'd say it's much worse than other coding areas like backend, because they work on a visual and ux level beyond just code which is just text manipulation (and what llms excel at). In other words, I think the results are still very shallow beyond first impressions.
Framelink MCP (https://github.com/GLips/Figma-Context-MCP)
Playwright MCP (https://github.com/microsoft/playwright-mcp)
Pull down designs via Framelink, optionally enrich with PNG exports of nodes added as image uploads to the prompt, write out the components, test/verify via Playwright MCP.
Gemini has a 1M context size now, so this applies to large mature codebases as well as greenfield. The key thing here is the coding agent being really clever about maintaining its context; you don't need to fit an entire codebase into a single prompt in the same way that you don't need to fit the entire codebase into your head to make a change, you just need enough context on the structure and form to maintain the correct patterns.
Indeed, in fact design has become the bottleneck now. Figma has really dropped the ball here WRT building out AI assisted (not driven) tooling for designers.
I'm not even a designer, but I care about the consistency of UI design and whether the overall experience is well-organized, aligned properly, things are placed in a logical flow for the user, and so on.
While I'm pro-AI tooling and use it heavily, and these models usually provide a good starting point, I can't imagine shipping the slop without writing/editing a line of HTML for anything that's interaction-heavy.
I find that I get the best results from 2.5 Pro via Google AI Studio with a low temperature (0.2-0.3).
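The same low-temperature setting carries over to the API; a minimal sketch assuming the google-generativeai Python SDK and the preview model id mentioned elsewhere in the thread:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-2.5-pro-preview-05-06")
resp = model.generate_content(
    "Refactor this function...",              # your prompt
    generation_config={"temperature": 0.25},  # the 0.2-0.3 range mentioned above
)
print(resp.text)
```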
Would be ideal if they incremented the version number or the like.
I have LiteLLM server running locally with Langfuse to view traces. You configure LiteLLM to connect directly to providers' APIs. This has the added benefit of being able to create LiteLLM API keys per project that proxies to different sets of provider API keys to monitor or cap billing usage.
I use https://github.com/LLemonStack/llemonstack/ to spin up local instances of LiteLLM and Langfuse.
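Since the LiteLLM proxy speaks the OpenAI wire format, pointing a project at it is just a base_url swap; a minimal sketch (the port, key name and model alias are assumptions, not the setup described above):

```python
from openai import OpenAI

# Per-project virtual key issued by LiteLLM; the proxy maps the model alias
# to the real provider key and records usage for tracing.
client = OpenAI(base_url="http://localhost:4000", api_key="sk-project-a-virtual-key")
resp = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```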
Deepinfra token usage updates every time you switch to the tab if it is open to the usage page, so it is possible to see updates as often as every second.
And before you ask: yes, for cached content and batch completion discounts you can accommodate both—just needs a bit of logic in your completion-layer code.
If you have less than $10 million in spend, you will be treated worse than cattle, because at least farmers feed their cattle before they are milked.
Not sure where your data is coming from but everything else is pointing to Google supremacy in AI right now. I look forward to some new models from Anthropic, xAi, Meta et al (remains to be seen if OpenAI has anything left apart from bluster). Exciting times.
1 - https://beta.lmarena.ai/leaderboard
2 - https://openrouter.ai/rankings
I have long stopped using OpenAI products, and all oX have been letdowns.
For coding it has been Claude 3.5 -> 3.7 -> Gemini 2.5 for me. For general use it has been chatgpt -> Gemini.
Google has retaken the ML crown for my use cases and it keeps getting better.
Gemini 2.0 flash was also the first LLM I put in production, because for my use case (summarizing news articles and translate them) it was way too fast, accurate and cheap to ignore whereas ChatGPT was consistently too slow and expensive to be even considered.
(aider joke)
Oof. G and others are way behind
I use Cursor when I code myself. But I don't use its chat or agent features. I had replaced VS Code with it, but at this point I could go back to VS Code; I'm just lazy.
Cursor agent/chat are fine if you're bottlenecked by money. I have no idea why or how it uses things like the codebase embedding. An agent on top of a filesystem is a powerful thing. People also like Aider and RooCode for the CLI experience, and I think they are affordable.
To make the most use of these things, you need to guide them and provide them adequate context for every task. For Claude Code I have built a meta management framework that works really well. If I were forced to use cursor I would use the same approach.