You didn't reduce e2e test time. You reduced e2e test coverage by only running the tests that the LLM tells you to run.
un_montagnard · 3h ago
Reminds me of that time when I was doing some POC on an existing code base and I reduced build time by deleting all the unit tests.
sghiassy · 3h ago
+1
At Meta we do similar heuristics on which tests to run per PR. When the system gets it wrong, which is often, it's painful and leads to merged code that broke unrun tests.
unshavedyak · 2h ago
Fwiw I feel like you could still do both, no? I.e. run all tests daily, or at whatever cadence is not painful but still protective, but _also_ use heuristics to improve PR responsiveness.
Like anything, it's about tradeoffs. Though if it were me, I'd simply write a mechanism which deterministically decides which areas of code pertain to which tests and use that simple program to determine which tests run. The algorithm could be loosely as simple as things like code owners and git blame, but relative to some big set of Code->Test mappings that you can have Claude/etc build ahead of time. The difference being it's deterministic between PRs and can be audited by humans for anything obviously poor or w/e.
As much as LLMs are interesting to me, I hate using them in places where I want consistency. CI seems terrible for them... to me at least.
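A minimal sketch of the deterministic selection described above, assuming a reviewed, checked-in map of path globs to E2E tests (the test_map.json name and its shape are invented for illustration):

    import fnmatch
    import json
    import subprocess

    # Hypothetical, human-reviewed mapping checked into the repo, e.g.
    # {"src/checkout/*": ["e2e/checkout.spec.ts"], "src/auth/*": ["e2e/login.spec.ts"]}
    with open("test_map.json") as f:
        test_map = json.load(f)

    # Files touched by the PR, relative to the merge base with main.
    changed = subprocess.run(
        ["git", "diff", "--name-only", "origin/main...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()

    # Deterministic: same diff in, same test list out, auditable by reading the map.
    selected = sorted({
        test
        for pattern, tests in test_map.items()
        for path in changed
        if fnmatch.fnmatch(path, pattern)
        for test in tests
    })

    print("\n".join(selected) if selected else "no mapped tests; run the full suite")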
sevazhidkov · 51m ago
To be fair, the most interesting bugs that can be caught with tests always feel like “I would’ve never guessed that part of the system actually depends on this one”.
manojlds · 3h ago
Why do you respond to the title when that's addressed in the opening?
spaceywilly · 2h ago
Yeah, I’m not seeing any evidence that this actually works. I would’ve liked to see some testing where they intentionally introduce a bug (ideally a tricky bug in a part of the code that isn’t directly changed by the diff) and see if Claude catches it.
A good middle ground could be to allow the diff to land once the “AI quick check” passed, then keep the full test suite running in the background. If they run them side by side for a while and see that the AI quick check caught the failing test every time, I’d be convinced.
jryio · 3h ago
This is a hacky joke. No sane engineer would ever sign off on this. Even for a 1-5 person team, why would I want a probabilistic selection of test execution?
The solution of running only the e2e tests for affected files has been around since long before LLMs. This is a bandage on poor CI.
johnfn · 3h ago
I have worked at large, competent companies, and the problem of "which e2e tests to execute" is significantly more complicated than you seem to suggest it is. I've worked with smart engineers who put a lot of time into this problem only to get middling results.
Yoric · 3h ago
...and I'm not confident at all that Claude can do anything at that level.
johnfn · 3h ago
How does that reconcile with the article, which states:
> Did Claude catch all the edge cases? Yes, and I'm not exaggerating. Claude never missed a relevant E2E test. But it tends to run more tests than needed, which is fine - better safe than sorry.
If you have some particular issue with the author's methodology, you should state that.
cerved · 2h ago
Well since it never broke for some rando on the internet, surely that means it will always work for everyone
johnfn · 2h ago
If you have some particular issue with the article, you should state that. Otherwise, the most charitable interpretation of your position I can come up with is "the article is wrong for some reason I refuse to specify", which doesn't lead to a productive dialogue.
ambicapter · 22m ago
I think you're the one being uncharitable here. The meaning of what he's saying is very clear. You can't say this probabilistic method (using LLMs to decide your e2e test plan) works if you only have a single example of it working.
> There's a handy list to check against the article here: https://dmitriid.com/everything-around-llms-is-still-magical... starting at "For every description of how LLMs work or don't work we know only some, but not all of the following"
It seems to me like we have the answers to all those questions.
- Do we know which projects people work on?
It's pretty easy to discover that OP works on https://livox.com.br/en/, a tool that uses AI to let people with disabilities speak. That sounds like a reasonable project to me.
- Do we know which codebases (greenfield, mature, proprietary etc.) people work on
The e2e tests took 2 hours to run and the website quotes ~40M words. That is not greenfield.
- Do we know the level of expertise the people have?
It seems like they work on nontrivial production apps.
- How much additional work did they have reviewing, fixing, deploying, finishing etc.?
The article says very little.
troupo · 53m ago
> The article says very little.
And that's the crux, isn't it. Because that checklist really is just the tip of the iceberg.
Some people have completely opposite experiences: https://news.ycombinator.com/item?id=45152139
Others question the validity of the approach entirely: https://news.ycombinator.com/item?id=45152668
Oh, don't get me wrong: I like the idea. But I would trust LLMs with this idea about as far as I could throw them.
bgwalter · 2h ago
If the author can keep the whole function code_change -> relevant E2E_TESTS in his head, it seems to be a trivial application.
We don't know the methodology, since the author does not state how he verified that function or how he would verify the function for a large code base.
jampa · 3h ago
I think you might be confusing end-to-end (E2E) tests with other types of testing, such as unit and integration tests. No one is advocating this approach for unit tests, which should still run in their entirety on every pull request.
Running all E2E tests in a pipeline isn't feasible due to time constraints (it takes hours). Most companies just run these tests nightly (and we still do), which means we would still catch any issues that slip through the initial screening. But so far, nothing has.
trenchpilgrim · 3h ago
> The solution to running only e2e tests on affected files has been around long before LLM.
This doesn't work in distributed systems, since changing the behavior of one file that's compiled in one binary can cause a downstream issue in a separate binary that sends a network call to the first. e.g. A programmer makes a behavioral change to binary #1 that falls within defined behavior, but encounters Hyrum's Law because of a valid behavior of binary #2.
hamandcheese · 2h ago
That's easy:
- avoid distributed systems at all costs
- if you can't avoid them, never make breaking API changes
madeofpalk · 2h ago
Determining breaking API changes is the whole point of tests.
trenchpilgrim · 1h ago
While we're at it, give things good names and don't invalidate caches at the wrong times!
lukan · 59m ago
Also always keep your documentation updated and complete.
hamandcheese · 2h ago
I would sign off on it. The only evidence I would need to see is some analysis of whether the risks are worth the benefits.
Risks: missing e2e tests that should have run, letting bugs into production; more time spent chasing down flakes due to non-determinism.
Benefits: increased productivity, catch bugs sooner (since you can run e2e tests more often).
brynary · 3h ago
Historically, this kind of test optimization was done with static analysis to understand dependency graphs and/or runtime data collected from executing the app.
However, those methods are tightly bound to programming languages, frameworks, and interpreters, so they are difficult to support across technology stacks.
This approach substitutes the intelligence of the LLM to make educated guesses about which tests to execute, to achieve the same goal of executing all of the tests that could fail and none of the rest (balancing a precision/recall tradeoff). What’s especially interesting about this to me is that the same technique could be applied to any language or stack with minimal modification.
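As a rough illustration of that substitution (my own sketch, not the article's actual pipeline; the prompt, the e2e/ path, and the use of Claude Code's `claude -p` print mode are assumptions):

    import subprocess

    def git(*args: str) -> str:
        return subprocess.run(["git", *args], capture_output=True, text=True, check=True).stdout

    # Context the model needs: what changed, and which E2E tests exist (path assumed).
    diff = git("diff", "origin/main...HEAD")
    tests = git("ls-files", "e2e/")

    instruction = (
        "think hard: given the diff and the list of E2E test files on stdin, "
        "print only the paths of tests that could plausibly be affected, one per line."
    )

    # `claude -p` runs a single non-interactive prompt; extra context is piped via stdin.
    selected = subprocess.run(
        ["claude", "-p", instruction],
        input=f"DIFF:\n{diff}\n\nTESTS:\n{tests}",
        capture_output=True, text=True, check=True,
    ).stdout.split()

    # Nothing above depends on what language or framework the repo uses.
    print("\n".join(selected))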
Has anyone seen LLMs in other contexts being substituted for traditional analysis to achieve language agnostic results?
wer232essf · 3h ago
That’s a great point — the portability angle is definitely one of the most intriguing aspects here. Traditional static analysis tools are powerful but usually brittle outside of their home ecosystem, and the investment required to build/maintain them across multiple languages is huge. Using an LLM essentially offloads that specialization into the model’s “latent knowledge,” which could make it a lot more accessible for polyglot teams or systems with mixed stacks.
One interesting side effect I’ve noticed when experimenting with LLM-driven heuristics is that, while they lack the determinism of static/runtime analysis, they can sometimes capture semantic relationships that are invisible to traditional tooling. For example, a change in a configuration file or documentation might not show up in a dependency graph, but an LLM can still reason that it’s likely to impact certain classes of tests. That fuzziness can introduce false positives, but it also opens the door to catching categories of risk that would normally be missed.
I think the broader question is how comfortable teams are with probabilistic guarantees in their workflows. For some, the precision/recall tradeoff is acceptable if it means faster feedback and reduced CI bills. For others, the lack of hard guarantees makes it a non-starter. But I do see a pattern emerging where LLM-based analysis isn’t necessarily a replacement for traditional methods, but a complementary layer that can generalize across stacks and fill in gaps where traditional tools don’t reach.
gregorriegler · 21m ago
This is called Test Impact Analysis and it is something worth making deterministic. Like with an algorithm, and without an LLM.
And people have already done this.
For example: SeaLights is a product that does this.
meisel · 3h ago
I don’t think you truly get the “best” of both worlds, because the rate of accidentally omitting a broken test and letting something slip into master is now non-zero (flaky tests aside). This is still a tradeoff. But maybe it’s a good one!
I do wonder if this is as feasible at scale, where breaking master can be extremely costly (although at least it’s not running all tests for all commits, so a broken test won’t break all CI runs). Maybe it could be paired with, say, running all E2E tests post-merge and reporting breakages ASAP.
meisel · 2h ago
Another idea would be to still run the rest of the E2E tests pre-merge, but as a separate job that only makes itself known if a failure occurs.
cerved · 2h ago
I think the point is, E2E takes too long, so it was only run nightly. Now it's also run selectively on PRs, in addition to nightly.
johnfn · 3h ago
> The key phrase here is "think deep". This tells Claude Code not to be lazy with its analysis (while spending more thinking tokens). Without it, the output was very inconsistent. I used to joke that without it, Claude runs in “engineering manager mode” by delegating the work.
This section really stood out to me. I knew that asking GPT-5 to think gets better results, but I didn't know Claude had the same behavior. I'd love to see some sort of % success before and after "think deep" was added. Should I be adding "think deep" to all my non-trivial queries to Claude?
calesennett · 3h ago
In Anthropic's Claude Code Best Practices [1] (section 3.a.2), they note prompting with certain phrases corresponds to increasing levels of "thinking" for Claude Code:
"think" < "think hard" < "think harder" < "ultrathink"
Since this is just a keyword, I'd expect you to prepend the entire prompt with a single line that says "think deep" or whichever think-level you want, rather than put "think deep" somewhere that makes sense in English within the prompt, as they do in the article.
[1] https://www.anthropic.com/engineering/claude-code-best-pract...
johnfn · 3h ago
Wow! Thanks for sharing - I didn't know this!
lupusreal · 3h ago
Man this is useful, I wish I knew earlier. I should really read the documentation. Or have Claude read it for me.
skippyboxedhero · 3h ago
There is "think" and "super think" (there may be something in the middle).
If you use the word "think" at any point, this mode will also trigger...which is sometimes inconvenient.
Plan mode is probably better for these situations though. Claude seems to have got much worse recently, but even before, if it was stuck, then asking it to "super think" never dislodged it (presumably because AI agents have no capacity for creativity; telling them to think more is just looping harder).
The loop of plan -> do is significantly more effective as most AI agents that I have used extensively will get lost on trivial tasks if you just have the "do" phase without a clear plan (some, such as GPT, also appear unable to plan effectively...we have a contract with OpenAI at work, I have no idea how people use it).
aaviator42 · 3h ago
think / think hard / think harder / ultrathink
These are all valid claude commands/tokens to enhance its "thinking" abilities.
troupo · 1h ago
> I'd love to see some sort of % success before and after "think deep" was added. Should I be adding "think deep" to all my non-trivial queries to Claude
Any such % would be meaningless because both are non-deterministic black boxes with literally undefined behavior. Any % you'd be seeing could just be differences in available compute as the US is waking up.
johnfn · 57m ago
I am describing a simple eval. Evals are how you determine how effective an AI system is; they existed long before LLMs were ever a thing, and they are perfectly happy to deal with non-deterministic behavior. You are acting as if LLMs being non-deterministic means you can't say anything about them, but traditional ML systems were non-deterministic long before LLMs, and we had no problem building systems to probabilistically evaluate them.
troupo · 50m ago
A "simple eval" would give you nothing. Anthropoc running out of available compute will affect the output significantly more than a change in the model or "thinking".
> we had no problem building systems to probabilistically evaluate them.
Something tells me the "no problem" in that statement is from Death of Stalin when you get to the actual details of that evaluation: https://youtu.be/kasSSZlBFDs?si=-5kv7LPWi_YStR3C
Don't do this. If you have test time issues that are clogging the build pipeline, you can run a merge queue that creates ghost branches to test patches in sets rather than one by one.
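A toy sketch of that batch-then-bisect idea, assuming failures are independent per PR; run_e2e stands in for "build a ghost branch with these PRs on top of main and run the suite":

    from typing import Callable, List

    def find_breaking_prs(prs: List[str], run_e2e: Callable[[List[str]], bool]) -> List[str]:
        """Test PRs in batches ("ghost branches"), bisecting failing batches to find culprits.

        Assumes failures are independent per PR (interacting PRs need a smarter queue).
        """
        if not prs or run_e2e(prs):
            return []          # batch is green: everything in it can merge
        if len(prs) == 1:
            return prs         # single red PR: kick it out of the queue
        mid = len(prs) // 2
        # Only failing halves get split further, so one culprit in a batch of N
        # costs roughly log2(N) extra suite runs instead of N individual runs.
        return find_breaking_prs(prs[:mid], run_e2e) + find_breaking_prs(prs[mid:], run_e2e)

    # Toy run: pretend PR "#1042" is the one that breaks the E2E suite.
    if __name__ == "__main__":
        fake_suite = lambda batch: "#1042" not in batch
        print(find_breaking_prs(["#1039", "#1040", "#1041", "#1042"], fake_suite))  # ['#1042']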
vinnymac · 1h ago
Agreed, this is just bad engineering. I hope we don’t see more ideas like this bleed into actual release processes.
Outside of a fun project to tinker on and see the results, I wouldn’t use this for anything that runs in production. It is better to use an LLM to assist you in building better static analysis tools, than it is to use the proposed technique in the article.
politelemon · 2h ago
> Yes, and I'm not exaggerating. Claude never missed a relevant E2E test.
I was waiting to see this part demonstrated and validated: for a given PR, whether you created an expected set of tests that should run and then compared it to what actually ran. Without a baseline, as any tester would tell you, the LLM output has been trusted without checking.
jampa · 2h ago
I should have added that to the post, but we verify by running the full test suite in a cron job, so if a bug slips through, we will know. The full suite sometimes breaks, but those failures are due to changes outside the PR (e.g., a vendor goes down).
kelnos · 1h ago
That doesn't verify what you think it does, though. It doesn't verify that the LLM ran all the tests necessary to exercise the code paths that were changed. "Running all necessary tests" is indistinguishable from "we did a good job on these PRs and didn't write any new bugs that would have been caught by the tests". This is the classic situation where you can prove a negative (if all the LLM-selected tests pass, but there's a new failure in the full suite), but can't prove a positive (if all the LLM-selected tests pass, and the full suite also passes, it could just mean there were no new bugs, not that the LLM selected the right tests).
The only way to verify that is to do what GP suggested: for each PR, manually make a list of the tests that you believe should be run, and then compare that to the list of tests that the LLM actually runs. Obviously you aren't going to do this forever, but you should either do it for every single PR for some number of PRs, or pick a representative sample over time instead.
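A cheap way to run that comparison, sketched under the assumption that you log both the hand-picked expected set and the LLM-selected set per PR (the selection_log.json format is invented here):

    import json

    # Assumed log format, one entry per PR:
    # [{"pr": 123, "expected": ["e2e/a.spec.ts"], "selected": ["e2e/a.spec.ts", "e2e/b.spec.ts"]}, ...]
    with open("selection_log.json") as f:
        prs = json.load(f)

    misses = 0
    for entry in prs:
        expected, selected = set(entry["expected"]), set(entry["selected"])
        missed = expected - selected   # tests a human wanted but the LLM skipped
        extra = selected - expected    # over-selection costs minutes, not correctness
        if missed:
            misses += 1
            print(f"PR {entry['pr']}: missed {sorted(missed)} ({len(extra)} extra)")

    recall = 1 - misses / len(prs) if prs else 0.0
    print(f"PR-level recall: {recall:.1%} across {len(prs)} PRs")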
vladdoster · 2h ago
A few questions:
1. Are you seeing a lower number of production hot fixes?
2. Have you tried other models? Or thought about using output from different models and combining the results when they differ?
3. Other than time & cost, what benchmarks in terms of software are considered (i.e., fewer hot fixes, etc.)?
This is really cool, btw
jampa · 2h ago
Thanks!
> Are you seeing a lower number of production hot fixes?
Yes, with E2E tests in general: They are more effective at stopping incidents than other tests, but they require more effort to write. In my estimation, we prevent about 2-3 critical bugs per month from being merged into main (and consequently deployed).
For this project specifically: I think the critical bugs would have been caught in our overnight full E2E run anyway. The biggest gain was that E2E tests took too much time in the pipeline, and finding the root cause of bugs in nightly tests took even more time. When a test fails in the PR, we can quickly fix it before merging.
> Have you tried other models? Or thought about using output from different models and combining the results when they differ?
Not yet, but I think we need to start experimenting. Claude went offline for 30 minutes over the last 2 days, and engineers were blocked from merging because of it. I'm planning to add claude-code-router as a fallback.
manojlds · 3h ago
That's why we have smoke tests. Let your full suite run later.
ares623 · 3h ago
But doesn’t that defeat (part of) the purpose of the E2E tests? i.e. you want to test _unrelated_ parts of the system that might have broken.
johnfn · 3h ago
Sometimes. In my experience they bifurcate into:
- Smoke tests (does this page load at all)
- More narrow tests (does this particular flow work)
I think you're referring to smoke tests, but you likely always want to run smoke tests. It's the narrow tests that you are safe with removing.
duncanfwalker · 3h ago
As other comments have said, I'd prefer other solutions for getting all the tests to run faster. It would be interesting to see if it could be used to prioritise tests: run the tests more likely to fail sooner.
davemo · 3h ago
I can appreciate the effort put into the goal of optimization shared in the post, even if I disagree with the conclusions. All of that effort would be much better directed at doing a manual (or LLM-assisted) audit of the E2E tests and choosing what to prune to reduce CI runtime.
DHH recently described[0] the approach they've taken at BaseCamp, reducing ~180 comprehensive-yet-brittle system tests down to 10 good-enough smoke tests, and it feels much more in spirit with where I would recommend folks invest effort: teams have way more tests than they need for an adequate level of confidence. Code and tests are a liability, and, to paraphrase Kent Beck[1], we should strive to write the minimal amount of tests and code to gain the maximal amount of confidence.
The other wrinkle here is that we're often paying through the nose in costs (complexity, actual dollars spent on CI services) by choosing to run all the tests all the time. It's a noble and worthy goal to figure out how not to do that, _but_, I think the conclusion shouldn't be to throw more $$$ into that money-pit, but rather just use all the power we have in our local dev workstations + trust to verify something is in a shippable state, another idea DHH covers[2] in the Rails World 2025 keynote; the whole thing is worth watching IMO.
[0] - https://youtu.be/gcwzWzC7gUA?si=buSEYBvxcxNkY6I6&t=1752
[1] - https://stackoverflow.com/questions/153234/how-deep-are-your...
[2] - https://youtu.be/gcwzWzC7gUA?si=9zL-xWG4FUxYZMC5&t=1977
Agreed. When you have multiple developers working on the same code, you end up with overlapping test coverage as time goes on. You also end up with test coverage that was initially written with good intentions, but ultimately you'll later find that some of it just isn't necessary for confidence, or isn't even testing what you think it is.
Teams need to periodically audit their tests, figure out what covers what, figure out what coverage is actually useful, and prune stuff that is duplicative and/or not useful.
OP says that ultimately their costs went down: even though using Claude to make these determinations is not cheap, they're saving more than they're paying Claude by running fewer tests (they run tests on a mobile device test farm, and I expect that can get pricey). But ultimately they might be able to save even more money by ditching Claude and deleting tests, or modifying tests to reduce their scope and runtime.
And at this point in the sophistication level of LLMs, I would feel safer not having an LLM decide which tests actually need to run to ensure a PR is safe to merge. I know OP says that so far they believe it's doing the right thing, but a) they mention their methodology for verifying this in a comment here[0], and I don't agree that it's a sound methodology[1], and b) LLMs are not deterministic and repeatable, so one could choose two very different sets of tests if run twice against the exact same PR. The risk of that happening may be acceptable, though; that's for each individual to decide.
[0] https://news.ycombinator.com/item?id=45152504
[1] https://news.ycombinator.com/item?id=45152668
> ...Claude Code strategically examines specific files, searches for patterns, traces dependencies, and incrementally builds up an understanding of your changes.
So, it builds a dependency graph?
I've been playing with graph related things lately and it seems like there might be more efficient ways to do this than asking a daffy robot to do the job instead of a specific (AI crafted?) tool.
One could even get all fancy with said tool and use it to do fun and exciting things with the cross file dependencies like track down unused includes (or whatever) to improve build times.
breppp · 3h ago
Next up: How I saved 53% of storage space by asking an LLM "will this be needed in the future?" for every write request
golergka · 3h ago
All E2E tests should run before deploy. Probably on every commit on develop branch, even. But there really is no need to run all E2E suite on every PR. In this case, the failure mode of this system, where PR automation failed to flag a breakage, is acceptable if it’s rare enough, so probabilistic solutions are OK.
jampa · 3h ago
We still run E2E tests before deployment, but running them on pull requests also eliminates the question: "We want to deploy, but have a bug. Which PR caused it, and how do we fix it?" This approach essentially saves you from having to perform a git bisect and keeps engineers from getting blocked by a bug unrelated to their task.
catlifeonmars · 3h ago
Yeah, this. You can commit optimistically and, worst case, revert a handful of commits (e.g., when the nightly e2e build breaks) and manually root-cause the breaking commit(s).
catigula · 3h ago
>Overall, we're saving money, developer time, and preventing bugs that would make it to production. So it's a win-win-win!
"We" and "win" are both doing a lot of heavy-lifting here, as they are whenever talks about LLM labor destruction.
kelnos · 1h ago
I mean, ok, maybe in a general sense, but do you really think the company is going to hire and employ someone just to sit and decide what E2E tests to run on every PR? No, of course not. What's going to happen instead is that either 1) all tests will continue to run, and productivity will continue to drop, or 2) PR authors will manually decide what tests to run, wasting their time (and they'll get it wrong sometimes, or have subtle unconscious biases that make them less likely to run the tests that take a longer amount of time, and/or the ones more likely to surface bugs in their PR).
So this is in theory a good thing: an LLM replacing a tedious task that no one was going to be hired to do anyway.
And besides, labor destruction could be a truly wonderful thing, if we had a functional, empathetic society in which we a) ensure that people have paths to retrain for new jobs while their basic needs are met by the state (or by the companies doing the labor destruction), and/or b) allow people whose jobs are destroyed to just not work at all, but provide them with enough resources to still have a decent life (UBI plus universal health care basically).
My utopia is one where there is no scarcity, and no one has to work to survive and have a good life. If I could snap my fingers and eliminate the need for every single job, and replace it with post-scarcity abundance, I would. People would build things and do science and help others because it gives them pleasure to do so, not because they have to in order to survive. And for people who just want to live lives of leisure, that would be fine, too. I don't think humanity will ever get to this state, mind you, but I can dream of a better world.
catigula · 1h ago
Why do engineers care about productivity again?
Disposal8433 · 3h ago
Flagged because:
> what if we could run only the relevant E2E tests
The real title should be "Using Claude Code to Reduce E2E Tests by 84%."
kelnos · 1h ago
That's not a good reason to flag; you can just comment on the bad title (as you did), or even email hn@ycombinator.com and ask them to fix it.
Article flags should be reserved for things you don't believe should be on HN at all.