Claude Code: An Agentic Cleanroom Analysis

58 hrishi 52 6/1/2025, 7:04:56 PM southbridge-research.notion.site ↗

Comments (52)

thegeomaster · 1d ago
This is kinda frustrating to read. The style is very busy, and it lacks a clear structure. It's basically an information dump without any acknowledgment of what's important or not. Big-O notation is provided for a lot of operations where you wouldn't really care about Big-O (in a system where calls to an LLM dominate, this is most operations). Big picture story about how Claude Code actually works, as in what happens when I type in a prompt (which I'm very much interested, given how much I use it) is lacking. Some diagrams are so nonsensical they become funny. Look at this: https://southbridge-research.notion.site/Prompt-Engineering-... In general, the prompt engineering page, which deserves maybe the most detailed treatment, is just a dump of prompts and LLM bullet point filler.

I don't want to be overly negative, but I think it's only fair given the author hasn't graced us with their own thoughts, instead offloading the actual writing to an LLM.

triyambakam · 1d ago
Claude Code with Sonnet 4 is so good I've stopped using Aider. This has been hugely productive. I've been able to write agents that Claude Code can spawn and call out to for other models, even.
__mharrison__ · 1d ago
What does it give you that enabling Sonnet as a backend for Aider doesn't?
SatvikBeri · 1d ago
It's much better at breaking down tasks on its own. All the tool use stuff is also deeply integrated. So I can reliably make a plan with Claude Code, then have it keep working on implementing until all tests pass.
ramoz · 1d ago
An entirely different software and agent architecture - built by the creators of Claude themselves.
maleldil · 1d ago
Claude Code is "agentic". Aider isn't. It can plan, use external tools, run the compiler, tests, linters, etc. You can do some of it with Aider, too, but Claude is more independent. The downside is that it can get very expensive, very fast.
thegeomaster · 1d ago
I've personally found that I reject around 80% of suggestions with "No, and tell Claude what to do differently". So it requires a lot of babysitting, and it usually means I cannot do another thing effectively while it's running. For this reason I've considered switching to something less agentic like Aider since it's more predictable. Curious to hear how others work around this.
maleldil · 8h ago
My experience has been quite different. I often give it a large block of instructions and let it run autonomously for a while. When I come back it often did what I expected. It doesn't have good taste with respect to APIs, though, so sometimes I need a heavier hand on that.

I find it helps to have a CLAUDE.md file with instructions and thorough documentation. This is on a ~30k LOC Python codebase with type-checking and tests. YMMV with other languages.
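As a rough illustration of the kind of file being described, a minimal CLAUDE.md might look like this (the contents below are hypothetical, not the commenter's actual file):

```markdown
# CLAUDE.md (hypothetical example)

## Project
- ~30k LOC Python service; entry point is `src/main.py`.

## Conventions
- All code must pass type-checking (`mypy`) and linting before you finish.
- Run the test suite after every change; do not stop until tests pass.
- Prefer small, pure functions; ask before adding new dependencies.

## Docs
- Architecture notes live in `docs/architecture.md`; read them before large changes.
```

Claude Code reads this file automatically at the start of a session, so instructions here don't need to be repeated in every prompt.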

SatvikBeri · 1d ago
I found a huge difference between Sonnet 3.7 and Sonnet 4.0 here. Two weeks ago I rejected most suggestions, now I accept most of them.

In addition, after a 3 hour session I told it to create a CLAUDE.md that would help it program similarly to me, based on my preferences. I then edited that file a bit, and that has helped a lot.

rane · 16h ago
Give better and more precise instructions.

Use the @ thing to prod it to read some relevant files for context (kind of like with Aider).

cedws · 1d ago
Could you briefly explain your workflow? I use Zed’s agent mode and I don’t really understand how people are doing it purely through the CLI. How do you get a decent workflow where you can approve individual hunks? Aren’t you missing out on LSP help doing it in the CLI?
mindwok · 1d ago
Claude Code has a VS Code plugin now that lets you view and approve diffs in the editor. Before it did that, I really don't understand how people got anything of substance done, because it simply isn't reliable enough over large codebases.
phillipcarter · 1d ago
I managed this (and still do) just fine by using source control. Commits are checkpoints and it's trivial (esp. with AI-assisted CLI) to roll things back if needed. Workflow is all about small diffs, with sessions being fine-grained, not about implementing entire features wholesale. Serialize overall plans for feature work as files in the codebase in the interim.
gaut · 1d ago
How is viewing and approving diffs in an editor any less reliable than viewing and approving them through the CLI? It won't make any changes without approval (unless you explicitly grant it auto-approval).
insane_dreamer · 22h ago
Before that it showed you diffs in the console, which worked ok.

Claude Code now also has a PyCharm plugin (and probably other JetBrains IDEs) that also shows you diffs in the PyCharm editor.

scuff3d · 1d ago
If this is what software engineering is going to become I'm finding a new job.
elliotec · 1d ago
Better start now! It’s incredible and unbelievable how productive it is. In my opinion it still takes someone with a staff level of engineering experience to guide it through the hard stuff, but it does in a day with just me what multiple product teams would take months to do, and better.

I’m building a non-trivial platform as a solo project/business and have been working on it since about January. I’ve gotten more done in two nights than I did in 3 months.

I’m sure there are tons of arguments and great points against what I just said, but it’s my current reality and I still can’t believe it. I shelled out the $100/mo after one night of blowing through the $20 credits I used as a trial.

It does struggle with design and front end. But don’t we all.

petetnt · 1d ago
Designers and frontend developers don’t struggle with those. That’s why they are designers and frontend developers.

Before those 3 months you mentioned, how much time did you spend coding on average (at work, or as a hobby), percentage-wise?

elliotec · 23h ago
Of course they do. I’ve been primarily a front end developer for 15 years. Working with designers. Shit takes so many iterations and so much time. Claude is faster but still “struggles” compared to basic rails work and API calls and test writing and whatnot.

I’m not sure how to answer the question on percentage of time coding. I quit my job as a director where coding wasn’t part of the job but have kept up on side stuff and architecture at work. Since the new year when I started this it’s been in bursts, some weeks or nights I’ll go super hard coding and others I’ll focus on other stuff. I go to conferences and study a lot on the subject of the industry so that’s what I do in bursts of the non-coding time.

I hired a virtual assistant to help with the non-coding things so lately it’s been a lot more.

In general I’d estimate at least 50% of my work on this thing since January has been coding but it’s really hard to gauge. Claude over the past 3 days has surpassed my personal coding productivity over the past 3 months though, if it wasn’t clear what I was saying.

john2x · 1d ago
Will these kinds of software end up like programs written with bespoke Lisp macros? Lots of power, but only one person actually knows it by heart.
scuff3d · 22h ago
Hit me up when you release your product. I keep seeing stuff like this and never see any proof. Companies aren't releasing 10x the features/patches/bug fixes/products, open source isn't getting 10x the number of quality PRs, absolutely no real evidence that the massive productivity gains actually exist.

What I've seen is people feel more productive, until the reality of all the subtle problems starts to set in. Even skilled engineers usually only end up with 10 or 20% productivity gains by the time they narrow its use case to where it's actually not total dog shit, or by the time they go back around and fix all the problems.

The highest quality product I know of where the creator has talked about his use of AI is ghostty, and he's not claiming massive improvements, just that it's definitely helpful.

elliotec · 18h ago
I’ll happily let you know when I release. Goal date for public beta is the 15th. I’d love eyes and feedback on it ASAP.

Hopefully it’s obvious that Claude will not have simply written the entire thing but you might get a sense of what it can do quickly as part of a whole - maybe similar to your last sentence but I suppose I am claiming massive improvements (in productivity, no warranty on quality yet).

Also keep in mind I’m entirely solo here. I fully agree with your points that the proof is in the pudding and obviously there’s nuance to all of it. But yeah, I’m not exaggerating with my commentary above.

scuff3d · 17h ago
If you don't mind me asking a couple questions: what percentage of your code would you say is AI-generated, meaning you prompted an AI and it went off and wrote code that you used (with or without modification)?

And how much time would you say you spend wrangling the AI, meaning either reprompting or substantially editing what you get back?

shostack · 1d ago
What are some examples of agents you've created that you regularly use?
rane · 1d ago
Have you been able to interface Claude Code with Gemini 2.5 Pro? I'm finding that Gemini 2.5 Pro is still better at solving certain problems and architecture and it would be great to be able to consult directly in CC.
ramoz · 1d ago
I do it indirectly. Gemini is my architecture go-to, Claude Code is for execution. It's just way more efficient to feed large portions of the codebase at once to Gemini, pump out a plan, and feed it to Claude Code. https://x.com/backnotprop/status/1929020702453100794
SparkyMcUnicorn · 18h ago
My most recent flow is very similar, but I use AiderDesk[0] instead of Prompt Tower for easier creation/editing of plan files.

AiderDesk lets you save snapshots of a point in time, so I create "presets" to restore sets of context files and/or conversation history (you can restore one or both), which is a really nice bonus. You can also add/remove context as needed without the manual copy/pasting work when I forget to include something or accidentally included too much. Its VS Code extension makes adding/removing files from context seamless.

[0] https://github.com/hotovo/aider-desk

triyambakam · 1d ago
Well, a quick hack is to tell Claude Code to leave "AI!" comments in the code, which Aider can be configured to watch for; then Gemini 2.5 Pro can do those tasks. Yes, I really still like Gemini too.
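For reference, this hack relies on aider's watch mode; a sketch of the setup might look like this (the model name is an assumption; check the flags against your aider version):

```shell
# Run aider in watch mode so it scans tracked files for "AI!" comments.
# (--watch-files is aider's watch mode; the model name is illustrative.)
aider --model gemini/gemini-2.5-pro --watch-files

# Then instruct Claude Code to leave trigger comments in the code, e.g.:
#   # rewrite this loop as a comprehension AI!
# Aider notices comments ending in "AI!" and has Gemini apply the edit.
```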
InGoldAndGreen · 1d ago
The "LLM's perspective" section hiding at the end of this Notion is a literal goldmine.
numeri · 1d ago
No, it's completely useless, and puts the entire rest of the analysis in a bad light.

LLMs have next to no understanding of their own internal processes. There's a significant amount of research that demonstrates this. All explanations of an internal thought process in an LLM are completely reverse engineered to fit the final answer (interestingly, humans are also prone to this – seen especially in split brain experiments).

In addition, the degree to which the author must have prompted the LLM to get it to anthropomorphize this hard makes the rest of the project suspect. How many of the results are repeated human prompting until the author liked the results, and how many come from actual LLM intelligence/analysis skill?

sebnado · 1d ago
By saying it's a gold mine, I think OP meant that it's funny, not that it brings valuable insight. i.e.: THEY KNOW -> that made me laugh

And as the article said, "an LLM who just spent thousands of words explaining why they're not allowed to use thousands of words" is just funny to read.

jerpint · 1d ago
The fact that they produce this as a "default" response is an interesting insight regardless of its internal mechanisms. I don't understand my neurons but can still articulate how I feel.
ramoz · 1d ago
It is completely reasonable, and often very useful, to evaluate and interpret instructions with LLMs.

You're stuck on the anthropomorphize semantics, but that wasn't the purpose of the exercise.

mholm · 1d ago
It's sure phrased like one, but I'd be careful to attribute LLM thought process to what it says it's thinking. LLMs are experts at working backwards to justify why they came to an answer, even when it's entirely fabricated
doctoboggan · 1d ago
> even when it's entirely fabricated

I would go further and say it's _always_ fabricated. LLMs are no better able to explain their inner workings than you are able to explain which neurons are firing for a particular thought in your head.

Note, this isn't a statement on the usefulness of LLMs, just their capability. An LLM may eventually be given a tool to enable it to introspect, but IMO it's not natively possible with the LLM architectures today.

gpm · 1d ago
There's a slight exception to this, in that LLMs are able to accurately describe portions of the buffer that are arbitrarily hidden from the user.

An LLM that says "I said orcs are green because I recalled a scene in Lord of the Rings..." is fabricating*. An LLM that says "I talked about white genocide because my system prompt told me to" is very likely telling the truth, because it can literally see the system prompt as it generates the output, even though in the situation I'm referring to the system prompt was hidden from users. It's a logical conclusion from the combination of the system prompt and its previous output (one that anyone could make with the same degree of confidence if they had access to the full buffer).

* Unless it's reading back from a <thinking> section of the buffer that was potentially hidden from the user.

demarq · 1d ago
It's the best thing I've read from an LLM!

It sounds a lot like the Murderbot character in the Apple TV show!

roxolotl · 1d ago
Right… because these things are trained on sci-fi and so when asked to describe an internal monologue they create text that reads like an internal monologue from a sci-fi character.

Maybe there’s genuine sentience there, maybe not. Maybe that text explains what’s happening, maybe not.

demarq · 1d ago
> Maybe that text explains what’s happening, maybe not

It would have been cool to see what prompt was used for that page!

numeri · 1d ago
Yes, so that one can use it for more creative writing exercises. It was pretty creative, I'll give it that.
MoonGhost · 1d ago
> Your browser is not compatible with Notion. Please upgrade to the latest browser version, or visit our help center for more information.

Firefox 113.0.2, how come?

vohk · 1d ago
That's a two year old release now. I think Notion may have a fair point on this one.
ramoz · 1d ago
For large codebases and task management, you can empower Claude Code with a simple filesystem framework. https://x.com/backnotprop/status/1929020702453100794

It sees everything it needs to in one pass, with no extra reasoning or instruction tokens around things like MCP that abstract things away and add hops on the way to a simple understanding of where things stand.

ramoz · 1d ago
There is something here about the native filesystem and tooling, and some type of insight into what agentic software engineering will look like. I mostly feel like an orchestrator, or a validator in the terminal window next to Claude Code where I run tests and related things.

I was never a great terminal developer, I can't even type right, but Claude Code by far provides the best software engineering interface in there in terms of LLM/agent UX.

owebmaster · 1d ago
I have nothing against LLM-generated content. But when publishing, make sure the content is displayed correctly and that it is enjoyable to read.
29athrowaway · 1d ago
If you want an example of a good analytical breakdown of some technology, look at this as an example: https://fabiensanglard.net/quake3/network.php

It is good because it highlights the relevant aspects of the design and you can use this, plus some other resources, to replicate the idea.

fullstackchris · 1d ago
Also, I will say, this is (if we can trust that the findings in these notes are relatively accurate to the real implementation) a PERFECT example of the real level of complexity involved in cutting-edge use of LLMs. It's not just some complex fancy prompt you give to a model in a chat window; there is so much important stuff happening behind the scenes.

Though I suppose the people who complain about LLMs hallucinating / screwing up haven't tried Claude Code or any agentic workflows. Or, it could be their architecture / code is so poorly written and poorly organized that even the LLM itself struggles to modify it properly.
girvo · 1d ago
> or, it could be their architecture / code is so poorly written and poorly organized that even the LLM itself struggles to modify it properly

You wrote this like this is some rare occurrence, and not a description of a bulk of the production code that exists today, even at high level tech companies.

fullstackchris · 1d ago
Interesting... the analysis finds that MCP supports WebSockets as a transport, when there is big drama going on right now because Anthropic said "they will never support that", folks hating SSE, and so on and so forth.
numeri · 1d ago
Is the analysis right, or did the LLM hallucinate this?