The most important thing is to have a strong planning cycle in front of your agent work; if you do that, agents are very reliable. You need a deep research cycle that collects a covering set of code that might need to be modified for a feature, feeds it into Gemini/GPT-5 to get a broad, codebase-level understanding, and then runs a debate cycle on how to address it, with the final artifact being a hyper-detailed plan that goes file by file and outlines the changes required.
Beyond this, you need to maintain good test coverage, and you need to have agents red-team your tests aggressively to make sure they're robust.
If you implement these two steps, your agent performance will skyrocket. The planning phase will produce plans that Claude can iterate on for 3+ hours in some cases, if you tell it to complete the entire task in one shot, and the robust test validation / change-set analysis will catch agents solving an easier problem because they got frustrated, or not following directions.
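Roughly, the plan cycle looks like this as a loop. This is only a sketch: every helper below (collect_candidate_files, ask_model) is a placeholder for your own retrieval and model client, not a real API.

    # Sketch of a plan cycle: collect a covering set of files, get a broad
    # understanding, critique/revise (the debate), emit a file-by-file plan.
    # All helpers are stand-ins; wire them to your own tooling.
    from dataclasses import dataclass

    @dataclass
    class Plan:
        feature: str
        text: str  # hyper-detailed, file-by-file outline of changes

    def collect_candidate_files(feature: str) -> list[str]:
        # Stand-in for grep/embedding search over the repo.
        return ["src/auth/session.py", "tests/test_session.py"]

    def ask_model(prompt: str) -> str:
        # Stand-in for a call to a large-context model (Gemini/GPT-5/Claude).
        return "PLAN: ..."

    def plan_cycle(feature: str, debate_rounds: int = 2) -> Plan:
        files = collect_candidate_files(feature)
        context = "\n".join(f"<file:{path}>" for path in files)  # real code would inline file contents
        draft = ask_model(f"Understand this codebase slice, then draft a plan for: {feature}\n{context}")
        for _ in range(debate_rounds):  # the debate cycle: critique, then revise
            critique = ask_model(f"Critique this plan strictly:\n{draft}")
            draft = ask_model(f"Revise the plan to address the critique:\n{critique}\n\n{draft}")
        return Plan(feature, draft)

    if __name__ == "__main__":
        print(plan_cycle("add rate limiting to the login endpoint").text)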
skydhash · 7h ago
By that point I would have already produced the 20 line diff for the ticket. Huge commits (or change requests) are usually scaffolding, refactoring, or design changes to support new features. You also got generated code and verbose language like CSS. So stuff where the more knowledge you have about the code, the faster you can be.
The daily struggle was always those 10 line diffs where you have to learn a lot (from the stakeholder, by debugging, from the docs).
CuriouslyC · 6h ago
A deep plan cycle will find stuff like this, because it's looking at the whole relevant portion of your codebase at once (and optionally the web, your internal docs, etc). It'll just generate a very short plan for the agent.
The important thing is that this process is entirely autonomous. You create an issue, which hooks the planners; the completion of a plan artifact hooks a test implementer; the completion of tests hooks the code implementer(s) (cheaper models generating multiple solutions and taking the best diff works well); and the completion of a solution + PR hooks code and security review, test red-teaming, etc.
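In code, that hook chain is just events triggering stages. A toy sketch (stage names and handlers are illustrative, not any particular product's API):

    # Toy event-driven pipeline: each completed artifact triggers the next
    # stage. The handlers are stubs you would replace with real agent runs.
    from collections import defaultdict

    HOOKS = defaultdict(list)

    def on(event):
        def register(handler):
            HOOKS[event].append(handler)
            return handler
        return register

    def emit(event, payload):
        for handler in HOOKS[event]:
            handler(payload)

    @on("issue.created")
    def run_planner(issue):
        emit("plan.completed", f"plan for {issue}")           # deep-research planner here

    @on("plan.completed")
    def write_tests(plan):
        emit("tests.completed", (plan, f"tests for {plan}"))  # test implementer here

    @on("tests.completed")
    def implement(artifacts):
        plan, _tests = artifacts
        # cheaper models could produce several candidate diffs; keep the best
        emit("pr.opened", f"best diff for {plan}")

    @on("pr.opened")
    def review(pr):
        print("code review + security review + test red-teaming for:", pr)

    emit("issue.created", "ISSUE-123: add rate limiting")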
kanak8278 · 2h ago
Can anyone help me on how to integrate this with Claude-Code?
I went through it, and I already follow a few things manually, but when I think about integrating most of the parts (not all), I don't know where I should put it for the coding LLM to understand. I fear that if I put everything in Claude.md it will just be too much context for CC.
wheelerwj · 43m ago
In Claude Code you can definitely put most of this into CLAUDE.md: much of it in a global CLAUDE.md and some in the project CLAUDE.md.
Which part are you specifically uncertain about?
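If it helps, a split like this is a reasonable starting point. The contents below are just an illustration of the kind of thing that goes where, not an official template:

    # ~/.claude/CLAUDE.md  (global: preferences that apply to every repo)
    - Always write a short plan before editing files.
    - Prefer small, reviewable diffs; never rewrite whole files.
    - Run the test suite before declaring a task done.

    # ./CLAUDE.md  (project: conventions specific to this codebase)
    - Source lives in src/, tests in tests/ (pytest).
    - Follow the linter's import ordering.
    - Ask before adding new dependencies.

Anything too long for CLAUDE.md can live in separate docs that you point to from it, so the always-loaded context stays small.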
jemiluv8 · 3h ago
Reminds me of when I was demonstrating Claude Code to a friend recently. My friend was a huge Cursor user and was just curious about the CLI tool and such.
In the end, regardless of framework or approach, I believe there is a way of using LLMs that will optimize work for developers. I worked with a tech lead who reviews all PRs and insists on imports being arranged in a specific order. I found it insulting but did it anyway. Now I don't - the bot does.
In the same way, LLMs can be really helpful in planning and building out specific things like REST endpoints, small web components, single functions or classes, and so on.
Glad people are attempting to work on potential solutions like this for approaching work in a way that takes advantage of these new tools.
perrygeo · 6h ago
There's some irony; far from handling the details, LLMs are forcing programmers to adopt hyper-detailed, disciplined practices. They've finally cajoled software developers into writing documentation! Worth noting we've always had the capacity to implement these practices to improve HUMAN collaboration, but rarely bothered.
grork · 5h ago
We’ve ultimately decided to treat the models with more respect, nurturing, and collaborative support than we ever gave our fellow human keyboard smashers: writing all the documentation and detailed guidance, allowing them multiple attempts, all to help the LLMs be successful. But Brenda, the early-in-career new grad? “Please read this poorly written, 5-year-old, incomplete wiki, and don’t ask me questions.”
I’ve been thinking about this for months, and still don’t know what to make of it.
ianbicking · 2h ago
Respect (or lack thereof) goes both ways: it applies to both the writer and the reader. I have frequently felt disrespected by producing documentation, plans, etc. that aren't read. In the end I mostly rely on oral transmission of knowledge, because then at least I can read the room and know whether I'm providing some value to people, and ultimately we're both trapped in the room together and have to invest the same amount of time.
The LLM isn't always smart, but it's always attentive. It rewards that effort in a way that people frequently don't. (Arguably this is a company culture issue, but it's also a widespread issue.)
henrebotha · 3h ago
I would also be motivated to write better documentation if I had a junior dev sitting right next to me, utterly incapable of doing any work unless I document how; but also instantly acting on documentation I produce, and giving me rapid feedback on which parts of the documentation are sending the wrong message.
pydry · 4h ago
The halo effect around LLMs is something crazy.
ianbicking · 2h ago
My experience writing in a professional setting is that people mostly don't read what I write, and the more effort I put into being thorough the less likely that it will be read.
wheelerwj · 40m ago
That is an interesting observation. You're correct: the LLM inherently reads and digests every token you offer it.
627467 · 3h ago
What I keep seeing missing from AI-labor-replacement discussions is that technology may seem to replace human labor, but it doesn't really replace human accountability.
Organizations often seem able to diffuse blame for mistakes within their human bureaucracy, but as bureaucracy is reduced with AI, individuals become more exposed.
This alone, in my view, is sufficient counterpressure against fully replacing humans in organizations.
Shorter reply: if my AI setup fails, I'm the one to blame. If I do a bad job of helping coworkers perform better, is the blame fully mine?
ares623 · 2h ago
I wonder if this is what will kill LLMs in the software development domain.
It turns out that writing and maintaining documentation is just that universally hated.
j45 · 5h ago
Lol, empathy and communication skills are important to develop after all.
rsecora · 1h ago
Back in the day, when business computing emerged (COBOL, mainframes...), the distinction between systems analysts and programmers appeared. Analysts understood business needs; programmers implemented those specs in code.
Years later, the industry evolved to integrate both roles, and new methodologies and new roles appeared.
Now humans write specs, and AI agents write code. Role separation has been a principle of the division of labor since Plato.
Agents are really bad at planning unless the agent is farming out the plan to a deep research tool; as your codebase grows, things are going to end badly.
mehdibl · 7h ago
The most important things you always need to do:
1. Plan, then review the plan.
2. Review the code during changes, before they're even finished, and fix drift as soon as you see it.
3. Then review again.
4. Add tests and use all your quality tools; don't rely 100% on the LLM (see the sketch below).
5. Don't trust LLM reviews of its own produced code, as they are very biased.
These are basic steps that you can adapt as you like.
Avoid a FULLY AUTOMATED AGENT pipeline where you review the code only at the end, unless it's a very small task.
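As an illustration of point 4, a quality gate that doesn't depend on the LLM's own judgment can be as small as this. The specific tools (pytest, ruff) are just examples; substitute whatever your project already uses.

    # Minimal quality gate: run the test suite and a linter, fail loudly if
    # either fails. Tool choices here are examples, not requirements.
    import subprocess
    import sys

    CHECKS = [
        ["pytest", "-q"],        # tests must pass
        ["ruff", "check", "."],  # lint must be clean
    ]

    def main() -> int:
        for cmd in CHECKS:
            print("running:", " ".join(cmd))
            if subprocess.run(cmd).returncode != 0:
                print("FAILED:", " ".join(cmd))
                return 1
        return 0

    if __name__ == "__main__":
        sys.exit(main())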
CuriouslyC · 6h ago
LLMs can review their own code, but you must have a fresh context (so they don't know they wrote it) and you need to instruct them to be very strict. Also, some models are better at code review than others: Gemini/GPT-5 are very good at it as long as you give them sufficient codebase context; Claude is not so great here.
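A rough sketch of the fresh-context trick: the reviewing model only ever sees the diff and the instructions, never the conversation that produced the code. ask_model here is a placeholder for whatever client you use, and the prompt wording is just an example.

    # Fresh-context self-review sketch. ask_model is a stand-in for your model
    # client; the point is that only the diff and the strict instructions are
    # passed in, not the history that generated the code.
    def ask_model(prompt: str) -> str:
        return "REVIEW: ..."  # placeholder

    STRICT_REVIEW_PROMPT = (
        "You are a strict senior reviewer. You did not write this code.\n"
        "Flag correctness bugs, missing tests, security issues, and deviations from the plan.\n"
        "Do not praise the code.\n\nDiff:\n{diff}"
    )

    def review_diff(diff: str) -> str:
        return ask_model(STRICT_REVIEW_PROMPT.format(diff=diff))

    if __name__ == "__main__":
        print(review_diff("--- a/app.py\n+++ b/app.py\n+def login(): ..."))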
jmull · 3h ago
Lucky LLMs. All I get is forwarded, meandering email chains and almost entirely discursive meetings to attend.
sublinear · 8h ago
This may produce some successes, but it's so much more work than just writing the code yourself that it's pointless. This structured way of working with generative AI is so strict that there is no scaling it up, either. It feels like it's been years since this was established to be a waste of time.
If the goal is to start writing code without knowing much, it may be a good way to learn and to establish a similar discipline in yourself for tackling projects. I think there's been research that training wheels don't work either, though. Whatever works and gets people learning to write code for real can't be bad, right?
weego · 8h ago
It's just a function of how much code you need to write and how much uninterrupted time you have.
Editing this kind of configuration has far less cognitive load and loading time, so distractions aren't as destructive to the task as they are when coding. You can also structure your time so that productive agent coding happens while you're doing business-critical tasks like meetings and calls.
I do think this is overkill, though, and it's a bad plan; it's far too early to try to formalize The One Way To Instruct AI How To Code. But every advance is an opportunity to gain career traction, so fair play.
jay-baleine · 7h ago
What tends to get overlooked is the actual development speeds these projects achieve.
The PhiCode runtime for example - a complete programming language with code conversion, performance optimization, and security validation. It was built in 14 days. The commit history provides trackable evidence; manual development of comparable functionality would require months of work as a solo developer.
The "more work" claim doesn't hold up to measurement. AI generates code faster than manual typing while systematic constraints prevent the architectural debt that creates expensive refactoring cycles later. The 5-minute setup phase establishes foundations that enable consistent development throughout the project.
On scalability, the runtime demonstrates 70+ modules maintaining architectural consistency. The 150-line constraint forced modularization that made managing these components feasible - each remains comprehensible and testable in isolation. The approach scales by sharing core context (main entry points, configuration, constants, benchmarks) rather than managing entire codebases.
Teams can collaborate effectively under shared architectural constraints without coordination overhead.
This isn't about training wheels or learning syntax. The methodology treats AI as a systematic development partner focused on architectural thinking rather than ad-hoc prompting. AI handles syntax perfectly - the challenge lies in directing it toward maintainable, scalable solutions at production speed.
Previous attempts at structured AI collaboration may have failed, but this approach addresses specific failure modes through empirical measurement rather than theoretical frameworks.
The perceived 'strictness' provides flexibility within proven constraints. Developers retain complete freedom in implementation approaches, but the constraints prevent common pitfalls like monolithic files or tangled dependencies - like guardrails that keep you on the road.
The project examples and commit histories provide concrete evidence for these development speeds and architectural outcomes.
gravypod · 6h ago
> The PhiCode runtime for example - a complete programming language with code conversion, performance optimization, and security validation. It was built in 14 days. The commit history provides trackable evidence; manual development of comparable functionality would require months of work as a solo developer.
I've been looking at the docs, and something I don't fully understand is what PhiCode Runtime does. It seems like:
1. Mapping of ligatures -> keywords (ex: ƒ -> def).
2. Caching of four types (source content, Python parsing, module imports, and Python bytecode).
3. A call into the phirust-transpiler, which seems to try to convert things into Rust code?
4. An HTTP API for requesting these operations.
A lot of this seems to be done with regexes. Was there a motivation for doing string replacement instead of python -> ast -> conversion -> new ast -> source? What is this code being used for?
CuriouslyC · 6h ago
Claude Code (and Claude in general, which was 99% of what was used here) likes regexes for this sort of thing. You have to tell it to use tree-sitter, or it'll make a brittle solution by default.
jay-baleine · 6h ago
Your four points are correct:
1. Symbol mapping: Yes - ƒ → def, ∀ → for, λ → lambda, π → print, etc. Custom mappings are configurable.
2. Multi-layer caching: Confirmed - source content cache, transpiled Python cache, module import specs, and optimized bytecode with batch writes.
3. PhiRust acceleration: Clarification - it's a Rust-based transpiler that handles the symbol-to-Python conversion for performance, not converting Python to Rust. When files exceed 300KB, the system delegates transpilation to the Rust binary instead of using Python regex processing.
4. HTTP API: Yes - provides endpoints for transpilation, symbol mapping queries, and engine info to enable IDE integration.
The technical decision to use string replacement over AST manipulation came down to measured performance differences.
The benchmarks show 3,000,000+ chars/sec throughput on extreme stress tests and 1,200,000+ chars/sec on typical workloads. AST parsing, transformation, and regeneration introduce overhead that makes real-time symbol conversion impractical for large codebases.
The string replacement preserves exact formatting, comments, and whitespace while maintaining compatibility with any Python syntax, including future language features that AST parsers might not support yet. Each symbol maps directly to its Python equivalent without intermediate representations that could introduce transformation errors.
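As an illustration of the string-replacement approach, a toy version looks like this. The mapping mirrors the examples above; everything else is a sketch, not PhiCode's actual implementation (which also has to decide what to do about symbols inside string literals and comments).

    # Toy symbol-to-keyword transpilation via plain string replacement.
    SYMBOL_MAP = {
        "ƒ": "def",
        "∀": "for",
        "λ": "lambda",
        "π": "print",
    }

    def transpile(source: str) -> str:
        # Replace longer symbols first so no mapping clobbers part of another.
        for symbol in sorted(SYMBOL_MAP, key=len, reverse=True):
            source = source.replace(symbol, SYMBOL_MAP[symbol])
        return source

    if __name__ == "__main__":
        phicode = "ƒ greet(name):\n    π(f'hello, {name}')\n"
        print(transpile(phicode))  # prints the transpiled Python source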
The cache system includes integrity validation to detect corrupted cache entries and automatic cleanup of temporary files. Cache invalidation occurs when source files change, preventing stale transpilation results. Batch write operations with atomic file replacement ensure cache consistency under concurrent access.
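A stripped-down sketch of that kind of cache, content-hash keys plus atomic replacement, might look like this; the file layout and names are made up for illustration, not taken from the runtime.

    # Toy transpile cache: entries are keyed by a hash of the source, so a
    # changed file simply misses the cache, and writes go through a temp file
    # plus os.replace so readers never see a half-written entry.
    import hashlib
    import os
    import tempfile
    from pathlib import Path

    CACHE_DIR = Path(".phicache")  # illustrative location

    def _key(source: str) -> str:
        return hashlib.sha256(source.encode("utf-8")).hexdigest()

    def cache_get(source: str) -> str | None:
        path = CACHE_DIR / _key(source)
        return path.read_text() if path.exists() else None

    def cache_put(source: str, transpiled: str) -> None:
        CACHE_DIR.mkdir(exist_ok=True)
        fd, tmp = tempfile.mkstemp(dir=CACHE_DIR)
        with os.fdopen(fd, "w") as f:
            f.write(transpiled)
        os.replace(tmp, CACHE_DIR / _key(source))  # atomic on the same filesystem

    if __name__ == "__main__":
        src = "ƒ f(): π('hi')"
        if cache_get(src) is None:
            cache_put(src, "def f(): print('hi')")
        print(cache_get(src))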
The runtime offers cognitive improvements for domain-specific development. Mathematical algorithms become more readable when written with actual mathematical notation rather than verbose keywords. It can also help in game development, where certain functions can benefit from different naming (e.g., def → skill, def → special, def → equipment).
The gradual adoption path matters for production environments. Teams can introduce custom syntax incrementally without rewriting existing codebases since the transpiled output remains standard Python. The multi-layer caching system ensures that symbol conversion overhead doesn't impact execution performance.
Domain-specific languages for mathematics, finance, education, or any field where visual clarity improves comprehension. The system maintains full Python compatibility while enabling cognitive improvements through customizable syntax.
UncleEntity · 3h ago
> AST parsing, transformation, and regeneration introduce overhead that makes real-time symbol conversion impractical for large codebases.
I don't really understand why you need to do anything different with a parser than with the regex method; there's no real reason to parse to an AST (with all the Python goodness involved in that) at all when the parser can just do the string replacement the same as whatever PhiRust is doing.
I have this PEG VM (based on the LPEG papers) that I've been poking at for a little while now, and while admittedly I haven't actually tested its speed, I'd be amazed if it couldn't do 3 MB/s... in fact, the main limiting factor seems to be getting bytes off the disk, and the parser runtime is just noise compared to that, with all the 'musttail' shenanigans going on.
And even that is overkill for simple keyword replacement, given all the work done over the years on macro systems needing to be blazing fast -- which is not something I've looked into at all to see how they do their magic, except a brief peek at C's macro rules, which are, let's just say, complicated.
visarga · 5h ago
> The perceived 'strictness' provides flexibility within proven constraints. Developers retain complete freedom in implementation approaches, but the constraints prevent common pitfalls like monolithic files or tangled dependencies - like guardrails that keep you on the road.
I agree; the only way to use AI is to constrain it, to provide a safe space where it can bang against the walls and iterate toward the solution. I use documentation, plans, and tests as the constraint system.
CuriouslyC · 6h ago
It's not. Via back and forth with ChatGPT, plus some templates and a validation service, I can get a detailed spec in place in 10 minutes that will consistently get an agent to work for 3+ hours, with the end result being 85% test coverage, E2E user-story testing, etc., so when I come back to the project I'm only doing acceptance testing.
The velocity that taking yourself out of the loop with analytic guardrails buys is just insane; I can't overstate it. The clear plan/guardrails are important, though; otherwise you end up with a pile of slop that doesn't work and is unmaintainable.