Nnd – a TUI debugger alternative to GDB, LLDB (github.com)
166 points by zX41ZdbW 5h ago 53 comments
Show HN: Plexe – ML Models from a Prompt (github.com)
56 points by vaibhavdubey97 3h ago 23 comments
Gemini 2.5 Pro Preview
334 points by meetpateltech 5/6/2025, 3:10:00 PM | 305 comments | developers.googleblog.com ↗
There are still significant limitations: no amount of prompting will get current models to approach abstraction and architecture the way a person does. But I'm finding that these Gemini models can finally replace search and Stack Overflow for a lot of my day-to-day programming.
So it's a great tool in the hands of a creative architect, but it is not one in and of itself, and I don't yet see how it could be.
My pet theory is that the human brain can't understand and formalize its own creativity because you need a higher-order logic to fully capture some other logic. People have objected that Gödel's second incompleteness theorem "can't be applied like this to the brain," but I stubbornly insist it can: the brain implements _some_ formal system, and it can't understand how that system works. Tongue in cheek, somewhat, maybe.
But back to earth: I agree LLMs are a great tool for a creative human mind.
I find this sentiment increasingly worrisome. It's entirely clear that every last human will be beaten on code design in the coming years (I'm not going to argue whether it's 1 or 5 years away; who cares?).
I wish people would just stop holding on to what amounts to nothing, and instead think and talk more about what can be done in a new world. We need good ideas, and I think this could be a place to advance them.
The unfortunate state of open source funding makes building such a simple tool a losing venture.
- Determining what features to make for users
- Forecasting a roadmap that is aligned with business goals
- Translating and prioritizing all of these to a developer (regardless of whether these developers are agentic or human)
Coincidentally, these are the areas that are frequently the largest contributors to a software business's success... not whether you use Next.js with a Go and Elixir backend against a multi-geo, redundant, multi-sharded CockroachDB database, or whether your code is clean/elegant.
The models are very impressive. But issues like these make me feel they are still doing more pattern matching (although there's also some magic, don't get me wrong) than fully reasoning over everything correctly, as you'd expect of a typical human reasoner.
I assume that it's trickier than it seems as it hasn't happened yet.
And that's fine and useful.
That is, if it's true that abstraction and architecture are useful for a given product, then people who know how to do those things will succeed in creating that product, and those who don't will fail. I think this is true for essentially all production software, but a lot of software never reaches production.
Transitioning or entirely recreating "vibecoded" proofs of concept to production software is another skill that will be valuable.
Having a good sense for when to do that transition, or when to start building production software from the start, and especially the ability to influence decision makers to agree with you, is another valuable skill.
I do worry about what the careers of entry level people will look like. It isn't obvious to me how they'll naturally develop any of these skills.
The fact that you called it out as a PoC already puts you many bars above what most vibe coders are doing, which is treating a barely functioning web app as proof that vibe coding is a viable solution for coding in general.
> I do worry about what the careers of entry level people will look like. It isn't obvious to me how they'll naturally develop any of these skills.
Exactly. There isn't really a path forward from vibe coding to anything productizable without actual, deep CS knowledge. And LLMs are not providing that.
They are the perfect "fake it till you make it" example cranked up to 11. They'll bullshit you, but will do it confidently and with proper grammar.
> Many attempts at making them refuse to answer what they don't know caused them to refuse to answer things they did in fact know.
I can see in some contexts that being desirable if it can be a parameter that can be tweaked. I guess it's not that easy, or we'd already have it.
Are we sure they know these things as opposed to being able to consistently guess correctly? With LLMs I'm not sure we even have a clear definition of what it means for it to "know" something.
But also things where guessing was desirable. For example with a riddle it would tell you it did not know or there wasn't enough information. After pressuring it to answer anyway it would correctly solve the riddle.
The official llama 2 finetune was pretty bad with this stuff.
What is the practical difference you're imagining between "consistently correct guess" and "knowledge"?
LLMs aren't databases. We have databases. LLMs are probabilistic inference engines. All they do is guess, essentially. The discussion here is about how to get the guess to "check itself" with a firmer idea of "truth". And it turns out that's hard because it requires that the guessing engine know that something needs to be checked in the first place.
Knowledge has an objective correctness. We know that there is a "right" and "wrong" answer and we know what a "right" answer is. "Consistently correct guesses", based on the name itself, is not reliable enough to actually be trusted. There's absolutely no guarantee that the next "consistently correct guess" is knowledge or a hallucination.
Also, there are whole subfields of philosophy that make your statement here kinda laughably naive. Suffice it to say that, no, knowledge as rigorously understood does not have "an objective correctness".
The fact that you are humanizing an LLM is honestly just plain weird. It does not have feelings. It doesn't care that it has to answer "is it correct?" and saying poor LLM is just trying to tug on heartstrings to make your point.
Internet access also helps.
Also, having markdown files with the stack etc. and any rules helps.
To my surprise, Gemini got it spot on first time.
LLMs just guess, so you have to give it a cheatsheet to help it guess closer to what you want.
Tell me about it. Thankfully I have not experienced it as much with Claude as I did with GPT. It can get quite annoying. GPT kept telling me to use this and that and none of them were real projects.
https://github.com/upstash/context7
But I wonder when we'll be happy. Do we expect colleagues, friends, and family to be 100% laser-accurate 100% of the time? I'd wager we don't. Should we expect that from an artificial intelligence?
You could say that when I use my spanner/wrench to tighten a nut it works 100% of the time, but as soon as I try to use a screwdriver it's terrible and full of problems and can't even reliably do something as trivially easy as tightening a nut, even though a screwdriver works the same way, by using torque to tighten a fastener.
Well that's because one tool is designed for one thing, and one is designed for another.
So when I punched in 1/3 it was exactly 1/3.
- (1e(1e10) + 1) - 1e(1e10)
- sqrt(sqrt(2)) * sqrt(sqrt(2)) * sqrt(sqrt(2)) * sqrt(sqrt(2))
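Those are exactly the cases where binary floating point quietly fails while an exact-arithmetic calculator doesn't. A minimal Python illustration (note: I've swapped in 1e16 for the 1e(1e10) case, since the original value doesn't even fit in a float):

```python
from fractions import Fraction
import math

# (huge + 1) - huge: float64 silently drops the +1 once the magnitude is big enough
print((1e16 + 1) - 1e16)                          # 0.0 -- the 1 vanished
print((Fraction(10)**16 + 1) - Fraction(10)**16)  # 1   -- exact arithmetic keeps it

# sqrt(sqrt(2)) multiplied by itself four times should be exactly 2
x = math.sqrt(math.sqrt(2))
print(x * x * x * x)  # usually a hair off from 2.0 due to rounding; an exact engine returns 2
```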
Mainly I meant to push back against the reflexive comparison to a friend or family member or colleague. AI is a multi-purpose tool that is used for many different kinds of tasks. Some of these tasks are analogues to human tasks, where we should anticipate human error. Others are not, and yet we often ask an LLM to do them anyway.
Usually I’m using a minimum of 200k tokens to start with gemini 2.5.
200k tokens ≈ 1/3 × 200k ≈ 67k words ≈ 67k / 300 ≈ 220 pages
I asked it a complicated question about the Scala ZIO framework that involved subtyping, type inference, etc. - something that would definitely be hard to figure out just from reading the docs. The first answer it gave me was very detailed, very convincing and very wrong. Thankfully I noticed it myself and was able to re-prompt it and I got an answer that is probably right. So it was useful in the end, but only because I realised that the first answer was nonsense.
Never seen it fumble that around
I swear, people act like humans themselves never need to be asked for clarification.
It'd make sense to rename WebDev Arena to React/Tailwind Arena. Its system prompt requires [1] those technologies, and the entire tool breaks when you request vanilla JS or other frameworks. The second-order implications of models competing on this narrow definition of webdev are rather troublesome.
[1] https://blog.lmarena.ai/blog/2025/webdev-arena/#:~:text=PROM...
```python
def whatever():
```

(it adds commented-out code like that all the time, "just in case")

It's terrible.
I'm back to Claude Code.
In any case, it knows what good code looks like. You can say "take this code and remove spurious comments and prefer narrow exception handling over catch-all", and it'll do just fine (in a way it wouldn't do just fine if your prompt told it to write it that way the first time, writing new code and editing existing code are different tasks).
Just end your prompt with "no code comments"
Claude 3.7 Sonnet is much more restrained and does smaller changes.
I've started in a narrow niche of Python/Flask web apps and constrained it to that stack for now, but if you're interested, I've just opened it for signups: https://codeplusequalsai.com
Would love feedback! Especially if you see promising results in not getting huge refactors out of small change requests!
(Edit: I also blogged about how the AST idea works in case you're just that curious: https://codeplusequalsai.com/static/blog/prompting_llms_to_m...)
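For anyone curious what operating on the AST instead of raw text can look like, here's a minimal sketch of the general idea using Python's ast module (my own illustration, not necessarily how the tool above does it): parse the source, apply a narrowly scoped structural edit, and unparse, so a small change request stays a small change.

```python
import ast

source = """
def greet(name):
    return "Hello, " + name

def unrelated():
    return 42
"""

class RenameGreet(ast.NodeTransformer):
    """Stand-in for an LLM-proposed structural edit: rename one function."""
    def visit_FunctionDef(self, node):
        if node.name == "greet":
            node.name = "greet_user"
        self.generic_visit(node)
        return node

tree = RenameGreet().visit(ast.parse(source))
print(ast.unparse(tree))  # only the touched node changes; everything else round-trips
```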
simonw has symbex, which could be useful for you for Python.
"5. You must never output any comments about the progress or type of changes of your refactoring or generation. Example: you must NOT add comments like: 'Added dependency' or 'Changed to new style' or worst of all 'Keeping existing implementation'."
Or something similar that does not rely on negation.
Besides, other models seem to handle negation correctly; I'm not sure why it's so difficult for the Gemini family of models to understand.
I also try to get it to channel that energy into the docstrings, so it isn't buried in the source.
Refactor this. Do not write any comments.
<code to refactor>
As a reminder, your task is to refactor the above code and do not write any comments.
Literally both of those are negations.
If it is ingesting data, there should also be a sample of the data in a comment.
But I've heard "defensive code" used for the kind of code where almost every method validates its input parameters, wraps everything in a try-catch, returns nonsensical default values in failure scenarios, etc. This is a complete waste because the caller won't know what to do with the failed validations or thrown errors, and it's just unnecessary bloat that obfuscates the business logic. Validation, error handling and so on should be done in specific parts of the codebase (bonus points if you can encode the successful validation or the presence/absence of errors in the type system).
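To make the contrast concrete, a toy sketch (my example, not the commenter's) of the style being criticized versus the leaner version:

```python
# The "defensive" style being criticized: every layer validates, swallows errors,
# and returns a made-up default the caller can't do anything sensible with.
def get_total_defensive(order):
    try:
        if order is None or not hasattr(order, "items"):
            return 0.0  # nonsensical default that hides the real problem
        return sum(item.price for item in order.items)
    except Exception:
        return 0.0

# The leaner version: validate at the boundary, let genuine failures surface
# where they can actually be handled, and keep the business logic visible.
def get_total(order):
    return sum(item.price for item in order.items)
```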
Lots of hasattr("") rubbish. I've increased the amount of prompting but it still does this; basically it defers its lack of compile-time knowledge to runtime: "let's hope for the best and see what happens!"
Trying to teach it to FAIL FAST is an uphill struggle.
Oh and yes, returning mock objects if something goes wrong is a favourite.
It truly is an Idiot Savant - but still amazingly productive.
Removed from where? I use the attach-code-folder feature every day from the Gemini web app (with a script that clones the local repo, deletes .git, and removes anything matching a gitignore pattern).
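As a rough sketch of the kind of script described above (not the actual one), copying only git-tracked files into a temp folder gets the same effect, since it skips .git and anything gitignored:

```python
import shutil, subprocess, tempfile
from pathlib import Path

def export_clean_copy(repo: Path) -> Path:
    """Copy only git-tracked files into a temp folder suitable for attaching
    to the Gemini web app; .git and gitignored files are never tracked,
    so they are left out automatically."""
    tracked = subprocess.run(
        ["git", "ls-files"], cwd=repo, capture_output=True, text=True, check=True
    ).stdout.splitlines()
    dest = Path(tempfile.mkdtemp(prefix="repo-export-"))
    for rel in tracked:
        target = dest / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(repo / rel, target)
    return dest

print(export_clean_copy(Path(".")))
```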
Which means if you try to force it to stop, the code quality will drop.
Models use comments to think; asking them to remove comments will affect code quality.
What does that mean?
I don't have a problem with getting lots of comments in the output; I just delete them while reading what it did.
[1] https://storage.googleapis.com/model-cards/documents/gemini-...
[2] https://deepmind.google/technologies/gemini/
Fair enough, one could say, as these were all labeled as preview or experimental. Still, considering that the new model is slightly worse across the board in benchmarks (except for LiveCodeBench), it would have been nice to have the option to stick with the older version. Not everyone is using these models for coding.
I get it, chips are scarce and they want their capacity back, but it breaks trust with developers to just downgrade your model.
Call it gemini-latest and I understand that things will change. Call it *-03-25 and I want the same model that I got on 25th March.
https://livebench.ai/#/
I bet they kept training on coding tasks, made everything worse on the way, and tried to hide it under the rug because of the sunk costs.
They measure the old gemini 2.5 generating proper diffs 92% of the time. I bet this goes up to ~95-98% https://aider.chat/docs/leaderboards/
Question for the google peeps who monitor these threads: Is gemini-2.5-pro-exp (free tier) updated as well, or will it go away?
Also, in the blog post, it says:
Does this mean gemini-2.5-pro-preview-03-25 now uses 05-06? Does the same apply to gemini-2.5-pro-exp-03-25?
Update: I just tried updating the date in the exp model name (gemini-2.5-pro-exp-05-06) and that doesn't work.
I don't have a formal benchmark but there's a notable improvement in code generation due to this alone.
I've had Gemini chug away on plans that have taken ~1 hour to implement (~80M tokens spent). A good portion of that energy was spent fixing mistakes made by cline/aider/roo due to search/replace errors. If this model gets anywhere close to 100% on diffs, then this is a BFD. I estimate it will translate to a 50-75% productivity boost on long-context coding tasks. I hope the initial results I'm seeing hold up!
I'm surprised by the reaction in the rest of the thread: a lot of unproductive complaining, a lot of off-topic stuff, and nothing about the model itself.
Any thoughts from anyone else using the updated model?
Is this maybe not the updated card, even though the blog post claims there is one? Sure, the timestamp is in late April, but I seem to remember that the first model card for 2.5 Pro was only released in the last couple of weeks.
There's something seriously dysfunctional and incompetent about the team that built that web app. What a way to waste the best LLM in the world.
Software that people truly love is impossible to build in there.
edit> It's gemini-2.5-pro-preview-05-06.
edit> Cursor says it doesn't have "good support" yet, but I'm not sure if this is a default message when it doesn't recognise a model. Is this a big deal? Should I wait until it's officially supported by Cursor?
Just trying to save time here for everyone - anyone know the answer?
E.g. call it Gemini Pro 2.5.1.
But I don't want to have to build a network of bots with non-deterministic outputs simply to stay on top of versions.
But yeah, some kind of deterministic way to get alerts would be better.
Although it's not for coding, I have noticed that Gemini 2.5 Pro Deep Research has surpassed Grok's DeepSearch in thoroughness and research quality.
My choice of LLMs was
Coding in cursor: Claude
General questions: Grok, if it fails then Gemini
Deep Research: Gemini (I don't have GPT plus, I heard it's better)
Web chat interfaces are great, but copy/paste gets old fast.
It turns a perfectly readable five-line code snippet into a 30-line snippet full of comments and mostly unnecessary error handling, code which becomes harder to reason about.
But for sysadmin tasks, like dealing with ZFS and LVM, it is absolutely incredible.
- [codes] showing up instead of references,
- raw search tool output sliding across the screen,
- Gemini continually answering questions asked two or more messages earlier while ignoring the most recent one (you need to ask Gemini an unrelated question for it to snap out of this bug for a few minutes),
- weird messages including text irrelevant to any of my chats with Gemini, like baseball,
- confusing its own replies with mine,
- not being able to run its own Python code due to some unsolvable formatting issue,
- timeouts, and more.
I'm assuming large companies are mandating it, but ultimately the work that these LLMs seem poised for would benefit smaller companies most and I don't think they can really afford using them? Are people here paying for a personal subscription and then linking it to their work machines?
Define "smaller"? In small companies, say 10 people, there are no hoops. That is the whole point of small companies!
"They trust me. Dumb ..."
Junior engineers will complete a task to update an API, or fix a bug on the front end, within a couple of days with, let's say, 80 percent certainty they hit the mark (maybe an inflated metric). How are people comparing the output of these models to that of a junior engineer if they generally just say "Here is some of my code, what's wrong with it?" That certainly isn't taking a real ticket and completing it in any capacity.
I am obviously very skeptical, but mostly I want to try one of these models myself; in reality, though, I think my higher-ups would consider them to introduce both risk AND the potential for major slacking off, haha.
The latest SOTA models are definitely at the point where they can absolutely improve workflows and not get in your way too much.
I treat it a lot like an intern, “Here’s an api doc and spec, write me the boilerplate and a general idea about implementation”
Then I go in, review, rip out the cruft, and add what I need.
It almost always gets architecture wrong; don't expect that from it. Small functions and the like, however, it does great.
When it comes to refactoring ask it for suggestions, eat the meat leave the bones.
Also, why doesn't Ctrl+C work??
At first I was very impressed with its coding abilities, switching off of Claude for it, but recently I've been using GPT o3, which I find much more concise and generally better at problem solving when you hit an error.
https://github.com/plandex-ai/plandex
Drop me a line (see profile) if you're interested in beta testing it when it's out.
Currently Claude Code is a big value-add for Claude. Google has nothing equivalent; aider requires far more manual work.
But with my app: you can install the host anywhere and connect to it securely (via SSH forwarding or private VPN or what have you) so that workflow definitely still works!
https://github.com/block/goose
How are they now? Sufficiently good? Competent? Competitive? Or limited? My needs are very consumer oriented, not programming/api stuff.
Not sure where your data is coming from but everything else is pointing to Google supremacy in AI right now. I look forward to some new models from Anthropic, xAi, Meta et al (remains to be seen if OpenAI has anything left apart from bluster). Exciting times.
1 - https://beta.lmarena.ai/leaderboard
2 - https://openrouter.ai/rankings
I find that I get the best results from 2.5 Pro via Google AI Studio with a low temperature (0.2-0.3).
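For API use, the same temperature setting is just a generation parameter. A minimal sketch with the google-generativeai Python SDK (the API key and model name below are placeholders; use whatever preview alias you're on):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-2.5-pro-preview-05-06")  # placeholder model name

resp = model.generate_content(
    "Refactor this function to remove duplication: ...",
    generation_config={"temperature": 0.2},  # low temperature, as suggested above
)
print(resp.text)
```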
Would be ideal if they incremented the version number or the like.
It really is wild to have seen this happen over the last year. The days of traditional "design-to-code" FE work are completely over. I haven't written a line of HTML/CSS in months. If you are still doing this stuff by hand, you need to adapt fast. In conjunction with an agentic coding IDE and a few MCP tools, weeks worth of UI work are now done in hours to a higher level of quality and consistency with practically zero effort.
The only disadvantage to not using these tools would be that your current output is slower. As soon as your employer asks for more or you're looking for a new job, you can just turn on AI and be as fast as everyone who already uses it.
Always "10x"/"100x" more productive with AI, "you will miss out if you don't adopt now"! Build a great company 100x faster and every rational actor in the market will notice, believe you and be begging to adopt your ways of working (and you will become filthy rich as a nice kicker).
The proof of the pudding is in the eating.
Because I don't get paid $150k/yr to write HTML and CSS. I get paid to provide technical solutions to business problems. And "chatbots" are a very useful new tool to aid in that.
However, just today I was building a website for fun with Gemini and had to manually fix some issues with CSS that it struggled with. As often happens, trying to let it repair the damage only sent it into a pit of despair (for me). I fixed the issues in about a glance and 5 minutes. This is not to say it's bad, but sometimes it still makes absurd mistakes and can't find a way to solve them.
Tailwind (with utility classes) is the real key here. It provides a semantic layer over CSS that allows the LLM to reason about how things will actually look. Night and day difference from using stylesheets with custom classes.
However, I feel that there is a big difference between the models. In my tests, using Cursor, Claude 3.7 Sonnet has a much more refined "aesthetic sense" than other models. Many times I ask "make it more beautiful" and it manages to improve, where other models just can't understand it.
If we're talking about just slapping on Tailwind plus a component library (e.g. shadcn/ui, Material), then that's just one step above using no-code solutions. Which, yes, works well. But if someone didn't need customized logic, it was always possible to just hop on Fiverr or use some very simple template-based tools to accomplish this.
If we're talking more advanced logic, understanding aesthetics, etc., then I'd say it's much worse than other coding areas like backend, because front-end work operates on a visual and UX level beyond just code, which is text manipulation (and what LLMs excel at). In other words, I think the results are still very shallow beyond first impressions.
I'm not even a designer, but I care about the consistency of UI design and whether the overall experience is well-organized, aligned properly, things are placed in a logical flow for the user, and so on.
While I'm pro-AI tooling and use it heavily, and these models usually provide a good starting point, I can't imagine shipping the slop without writing/editing a line of HTML for anything that's interaction-heavy.
Framelink MCP (https://github.com/GLips/Figma-Context-MCP)
Playwright MCP (https://github.com/microsoft/playwright-mcp)
Pull down designs via Framelink, optionally enrich with PNG exports of nodes added as image uploads to the prompt, write out the components, test/verify via Playwright MCP.
Gemini has a 1M context size now, so this applies to large mature codebases as well as greenfield. The key thing here is the coding agent being really clever about maintaining its context; you don't need to fit an entire codebase into a single prompt, in the same way that you don't need to fit the entire codebase into your head to make a change. You just need enough context on the structure and form to maintain the correct patterns.
I have LiteLLM server running locally with Langfuse to view traces. You configure LiteLLM to connect directly to providers' APIs. This has the added benefit of being able to create LiteLLM API keys per project that proxies to different sets of provider API keys to monitor or cap billing usage.
I use https://github.com/LLemonStack/llemonstack/ to spin up local instances of LiteLLM and Langfuse.
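For anyone unfamiliar with the setup: LiteLLM exposes an OpenAI-compatible endpoint, so each project just points a standard client at the proxy with its own virtual key. A rough sketch (the port, key, and model alias below are illustrative, assuming the proxy's defaults):

```python
from openai import OpenAI

# Project-specific virtual key issued by the LiteLLM proxy; spend for this key
# is tracked (and can be capped) separately from other projects' keys.
client = OpenAI(
    base_url="http://localhost:4000",        # local LiteLLM proxy (default port)
    api_key="sk-project-alpha-virtual-key",  # virtual key, not a provider key
)

resp = client.chat.completions.create(
    model="gemini-2.5-pro",  # whatever alias the proxy maps to the provider model
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```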
DeepInfra token usage updates every time you switch to the tab (if it's open to the usage page), so it's possible to see updates as often as every second.
And before you ask: yes, for cached content and batch completion discounts you can accommodate both—just needs a bit of logic in your completion-layer code.
If you have less than $10 million in spend, you will be treated worse than cattle, because at least farmers feed their cattle before they are milked.
(aider joke)
Oof. G and others are way behind