Some engineers on my team at Assembled and I have been a part of the alpha test of Codex, and I'll say it's been quite impressive.
We’ve long used local agents like Cursor and Claude Code, so we didn’t expect too much. But Codex shines in a few areas:
Parallel task execution: You can batch dozens of small edits (refactors, tests, boilerplate) and run them concurrently without context juggling. It's super nice to run a bunch of tasks at the same time (something that's really hard to do in Cursor, Cline, etc.)
It kind of feels like a junior engineer on steroids: you just point it at a file or function, specify the change, and it scaffolds out most of a PR. You still need to do a lot of work to get it production ready, but it's as if you have an infinite number of junior engineers at your disposal, all working on different things.
Model quality is good, but it's hard to say it's that much better than other models. In side-by-side tests with Cursor + Gemini 2.5 Pro, naming, style and logic are relatively indistinguishable, so quality meets our bar but doesn't yet exceed it.
criddell · 3h ago
If you aren't hiring junior engineers to do these kinds of things, where do you think the senior engineers you need in the future will come from?
My kid recently graduated from a very good school with a degree in computer science and what she's told me about the job market is scary. It seems that, relatively speaking, there's a lot of postings for senior engineers and very little for new grads.
My employer has hired recently and the flood of resumes after posting for a relatively low level position was nuts. There was just no hope of giving each candidate a fair chance and that really sucks.
My kid's classmates who did find work did it mostly through personal connections.
voidspark · 1h ago
This is exactly the problem. The top level executives are setting up to retire with billions in the bank, while the workers develop their own replacements before they retire with millions in the bank. Senior developers will be mostly obsolete too.
I have mentored junior developers and found it to be a rewarding part of the job. My colleagues mostly ignore juniors, provide no real guidance, couldn't care less. I see this attitude from others in the comments here, relieved they don't have to face that human interaction anymore. There are too many antisocial weirdos in this industry.
Without a strong moral and cultural foundation the AGI paradigm will be a dystopia. Humans obsolete across all industries.
oytis · 1h ago
> I have mentored junior developers and found it to be a rewarding part of the job.
Can totally relate. Unfortunately the trend for all-senior teams and companies has started long before ChatGPT, so the opportunities have been quite scarce, at least in a professional environment.
criddell · 1h ago
> I have mentored junior developers and found it to be a rewarding part of the job.
That's really awesome. I hope my daughter finds a job somewhere that values professional development. I'd hate for her to quit the industry before she sees just how interesting and rewarding it can be.
I didn't have many mentors when starting out, but the ones I had were so unbelievably helpful both professionally and personally. If I didn't have their advice and encouragement, I don't think I'd still be doing what I'm doing.
aprdm · 4m ago
She can try reaching out to possible mentors / people on LinkedIn. A bit like cold calling. It works; people (usually) want to help and don't mind sharing their experiences / tips. I know I have responded to many random LinkedIn cold messages from recent grads / people in uni.
sam0x17 · 1h ago
Hiring of juniors is basically dead these days, it has been like this for about 10 years, and I hate it. I remember when I was a junior in 2014 there were actually startups that would hire cohorts of juniors (like 10 at a time, fresh-out-of-a-CS-degree sort of folks with almost no applied coding experience) and train them up to senior over a few years; a small number would stay, the rest would go elsewhere, and the company would hire its next batch of juniors. Now no one does this, everyone wants a senior no matter how simple the task. This has caused everyone in the industry to stuff their resume, so you end up in a situation where companies are looking for 10 years of experience in ecosystems that are only 5 years old.
That said, back in the early 00s there was much more of a culture of everyone being expected to be self-taught and doing real web dev probably before they even got to college, so by the time they graduated they were in reality quite senior. This was true for me and a lot of my friends, but I feel like these days there are many CS grads who haven't done a lot of applied stuff. To be fair, though, this was a much easier task in the early 00s: if you knew JS/HTML/CSS/SQL, C++ and maybe some .NET language, that was pretty much it, you could do everything (there were virtually no frameworks). Now there are thousands of frameworks and languages and ecosystems, and you could spend 5+ years learning any one of them. It is no longer possible for one person to learn all of tech; people are much more specialized these days.
But I agree that eventually someone is going to have to start hiring juniors again or there will be no seniors.
dgb23 · 15m ago
I recently read an article about the US having relatively weak occupational training.
To contrast, CH and GER are known to have very robust and regulated apprenticeship programs. Meaning you start working at a much earlier age (16) and go to vocational school at the same time for about 4 years. This path is then supported with all kinds of educational stepping stones later down the line.
There are many software developers who went that route in CH for example, starting with an application development apprenticeship, then getting to technical college in their mid 20's and so on.
I think this model has a lot of advantages. University is for kids who like school and the academic approach to learning. Apprenticeships plus further education or an autodidactic path then casts a much broader net, where you learn practical skills much earlier.
There are several advantages and disadvantages of both paths. In summary I think the academic path provides deeper CS knowledge which can be a force multiplier. The apprenticeship path leads to earlier high productivity and pragmatism.
My opinion is that having both as strongly supported paths creates more opportunities for people and strengthens the economy as a whole.
oytis · 6m ago
I know about this system, but I am not convinced it can work in such a dynamic field as software. When tools change all the time, you need strong fundamentals to stay afloat - which is what universities provide.
Vocational training focusing on immediate fit for the market is great for companies that want to extract maximal immediate value from labour for minimal cost, but longer term is not good for engineers themselves.
polskibus · 4m ago
I think the bigger problem, which started around 2022, is the much lower volume of jobs in software development. Projects were shut down, funding was retracted, even the big wave of migrations to the cloud died down.
Today startups mostly wrap LLMs as this is what VCs expect. Larger companies have smaller IT budgets than before (adjusted for inflation). This is the real problem that causes the jobs shortage.
_bin_ · 2h ago
This is a bit of a game theory problem. "Training senior engineers" is an expensive and thankless task: you bear essentially all the cost, and most of the total benefit accrues to others as a positive externality. Griping at companies that they should undertake to provide this positive externality isn't really a constructive solution.
I think some people are betting on the fact that AI can replace junior devs in 2-5 years and seniors in 10-20, when the old ones are largely gone. But that's sort of beside the point as far as most corporate decision-making.
dorian-graph · 1h ago
This hyper-fixation on replacing engineers in writing code is hilarious, and dangerous, to me. Many people, even in tech companies, have no idea how software is built, maintained, and run.
I think instead we should focus on getting rid of managers and product owners.
jchanimal · 1h ago
The real judge will be survivorship bias, and as a betting man, I might think product owners are the ones with the entrepreneurial spirit to make it to the other side.
QuadmasterXLII · 47m ago
It's obviously intensely correlated: in the vast majority of scenarios, either both are replaced or neither is.
nopinsight · 2h ago
With Agentic RL training and sufficient data, AI operating at the level of average senior engineers should become plausible in a couple to a few years.
Top-tier engineers who integrate a deep understanding of business and user needs into technical design will likely be safe until we get full-fledged AGI.
al_borland · 1h ago
That sounds like a dangerous bet.
_bin_ · 1h ago
As I see it, it's actually the only safe bet.
Case 1: you keep training engineers.
Case 1.1: AGI soon, you don't need juniors or seniors besides a very few. You cost yourself a ton of money that competitors can reinvest into R&D, use to undercut your prices, or return to keep their investors happy.
Case 1.2: No AGI. Wages rise, a lot. You must remain in line with that to avoid losing those engineers you trained.
Case 2: You quit training juniors and let AI do the work.
Case 2.1: AGI soon, you have saved yourself a bundle of cash and remain mostly in line with the market.
Case 2.2: no AGI, you are in the same bidding war for talent as everyone else, the same place you'd have been were you to have spent all that cash to train engineers. You now have a juicier balance sheet with which to enter this bidding war.
The only way out of this, you can probably see, is some sort of external co-ordination, as is the case with most of these situations. The high-EV move is to quit training juniors, by a mile, independently of whether AI can replace senior devs in a decade.
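The dominance argument above can be made concrete with a tiny payoff matrix. The numbers below are invented purely for illustration (they are not from the comment); the point is only that, under this kind of assumption, "skip training" pays at least as well in both worlds.

```python
# Hypothetical payoffs in $M, invented to illustrate the dominance argument:
# whichever world materializes (AGI or no AGI), the "skip training" row
# never pays less than the "train juniors" row.
TRAIN, SKIP = "train juniors", "skip training"

payoff = {  # (strategy, world) -> profit, all numbers assumed
    (TRAIN, "AGI"):    -10,           # training cost sunk, engineers not needed
    (TRAIN, "no AGI"): 100 - 10 - 20, # revenue - training cost - bid-up wages
    (SKIP, "AGI"):       0,           # saved the cost entirely
    (SKIP, "no AGI"):  100 - 25,      # revenue - market-rate senior hires
}

for world in ("AGI", "no AGI"):
    best = max((TRAIN, SKIP), key=lambda s: payoff[(s, world)])
    print(world, "->", best)  # "skip training" wins in both worlds
```

With these (assumed) numbers, skipping training is the dominant strategy, which is exactly why external coordination would be needed to change the equilibrium.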
SketchySeaBeast · 1h ago
Sounds like a bet a later CEO will need to check.
johnjwang · 40m ago
To be clear, we still hire engineers who are early in their careers (and we've found them to be some of the best folks on our team).
All the same principles apply as before: smart, driven, high ownership engineers make a huge difference to a company's success, and I find that the trend is even stronger now than before because of all the tools that these early career engineers have access to. Many of the folks we've hired have been able to spin up on our codebase much faster than in the past.
We're mainly helping them develop taste for what good code / good practices look like.
criddell · 33m ago
> we still hire engineers who are early in their careers
That's really great to hear.
Your experience that a new engineer equipped with modern tools is more effective and productive than in the past is important to highlight. It makes total sense.
startupsfail · 28m ago
More recent models are not without drive and are not stupid either.
There’s still quite a bit of a gap in terms of trust.
oytis · 1h ago
I guess the industry leaders think we'll not need senior engineers either as capabilities evolve.
But also, I think this significantly underestimates what junior engineers do. Junior engineers are people who have spent 4 to 6 years receiving a specialised education at a university - and they normally need to be already good at school math. All they lack is experience applying this education on the job - but they are professionals - educated, proactive and mostly smart.
The market is tough indeed, and as tough as it is for a senior engineer like myself, I don't envy the current cohort of fresh grads. Its being tough is only tangentially related to AI though. The main factor is the general economic slowdown, with AI contributing by diverting already scarce investment from non-AI companies and producing a lot of uncertainty about how many and what kind of employees companies will need in the future. AI's current capabilities are nowhere near having a real economic impact.
I wish your kid and you a lot of patience, grit and luck.
hintymad · 3h ago
> If you aren't hiring junior engineers to do these kinds of things, where do you think the senior engineers you need in the future will come from?
Unfortunately this is not how companies think. I read something more than 20 years ago about outsourcing and manufacturing offshoring. The author asked basically the same question: if we move out the so-called low-end jobs, where do we think we will get the senior engineers? Yet companies continued offshoring, and the West lost talent and know-how while watching our competitor you-know-who become the world leader in increasingly more industries.
lurking_swe · 2h ago
ahh, the classic “i shall please my investors next quarter while ignoring reality, so i can disappoint my shareholders in 10 years”. lol.
As you say, happens all the time. Also doesn’t make sense because so few people are buying individual stocks anyway. Goal should be to consistently outperform over the long term. Wall street tends to be very myopic.
Thinking long term is a hard concept for the bean counters at these tech companies i guess…
echelon · 3h ago
It's happening to Hollywood right now. In the past three years, since roughly 2022, the majority of IATSE folks (film crew, grips, etc.) have seen their jobs disappear to Eastern Europe where the labor costs one tenth of what it does here. And there are no rules for maximum number of consecutive hours worked.
ilaksh · 2h ago
I don't think jobs are necessarily a good plan at all anymore. Figure out how to leverage AIs and robots as cheap labor, and sell services or products. But if someone is trying to get a job, I get the impression that networking helps more than anything.
sandspar · 2h ago
Yeah, the value of the typical job-application meta is trending to zero very quickly. Entrepreneurship has a steep learning curve; you should start learning it as soon as possible. Don't waste your time learning to run in a straight line - we're entering off-road territory.
layer8 · 1h ago
I share your worries, but the time horizon for the supply of senior engineers drying up is just too long for companies to care at this time, in particular if productivity keeps increasing. And it’s completely unclear what the state of the art will be in 20 years; the problem might mostly solve itself.
dgb23 · 33m ago
AI might play a role here. But there's also a lot of economic uncertainty.
It wasn't long ago that the correction in the tech job market started, after it got blown up during and after COVID. The geopolitical situation is very unstable.
I also think there is way more FUD around AI, including coding assistants, than necessary. It typically comes either from people who want to sell it or who want to get in on the hype.
Things are shifting and moving, which creates uncertainty. But it also opens new doors. Maybe it's a time for risk takers, the curious, the daring. Small businesses and new kinds of services might rise from this, like web development came out of the internet revolution. To me, it seems like things are opening up and not closing down.
Besides that, I bet there are more people today who write, read or otherwise deal directly with assembly code than ever before, even though we had higher level languages for many decades.
As for the job market specifically: SWE and CS (adjacent) jobs are still among the fastest growing, coming up in all kinds of lists.
ikiris · 32m ago
Much like everything in the economy currently, externalities are to be shouldered by "others" and if there is no "other" in aggregate, well, it's not our problem. Yet.
DGAP · 2h ago
There aren't going to be senior engineers in the future.
slater · 3h ago
> If you aren't hiring junior engineers to do these kinds of things, where do you think the senior engineers you need in the future will come from?
Money number must always go up. Hiring people costs money. "Oh hey I just read this article, sez you can have A.I. code your stuff, for pennies?"
kypro · 3h ago
> If you aren't hiring junior engineers to do these kinds of things, where do you think the senior engineers you need in the future will come from?
They'll probably just need to learn for longer, and if companies ever get desperate for senior engineers, they can just take the most able/experienced junior/mid-level dev.
But I'd argue that before they do that, if companies can't find skilled labour domestically, they should consider bringing in skilled workers from abroad. There are literally hundreds of millions of Indians who got connected to the internet over the last decade. There's no reason a company should struggle to find senior engineers.
rboyd · 1h ago
India coming online just in time for AI is awkward
voidspark · 1h ago
Perfect answer. Replace Americans with hundreds of millions of Indians. Problem solved.
I hope he forwards your reply to his daughter to cheer her up.
oytis · 1h ago
So basically all education facilities should go abroad too if no one needs Western fresh grads. That will provide a lot of shareholder value, but there are some externalities too.
echelon · 3h ago
The never ending march of progress.
It's probably over for these folks.
There will likely(?, hopefully?) be new adjacent gradients for people to climb.
In any case, I would worry more about your own job prospects. It's coming for everyone.
voidspark · 2h ago
It's his daughter. He is worried about his daughter first and foremost. Weird reply.
echelon · 1h ago
I'm sorry. I was skimming. I had no idea he mentioned his kid.
I was running a quick errand between engineering meetings and saw the first few lines about hiring juniors, and I wrote a couple of comments about how I feel about all of this.
I'm not always guilty of skimming, but today I was.
hintymad · 1h ago
It looks like we are in this interesting cycle: millions of engineers contribute to open source on GitHub. The best of our minds use that code to develop powerful models to replace exactly these engineers. In fact, the more code a group contributes to GitHub, the easier it is for companies to replace that group. Case in point: frontend engineers have been impacted most so far.
Does this mean people will be less incentivized to contribute to open source as time goes by?
P.S., I think the current trend is a wakeup call to us software engineers. We thought we were doing highly creative work, but in reality we spend a lot of time doing the basic job of knowledge workers: retrieving knowledge and interpolating some basic and highly predictable variations. Unfortunately, the current AI is really good at replacing this type of work.
My optimistic view is that in the long term we will invent or expand into more interesting work, but I'm not sure how long we will have to wait. The current generation of software engineers may face high supply and low demand for our profession for years to come.
lispisok · 1h ago
As much as I support community developed software and "free as in freedom", "Open Source" got completely perverted into tricking people to work for free for huge financial benefits for others. Your comment is just one example of that.
For that reason all my silly little side projects are now in private repos. I don't care that the chance somebody builds a business around them is slim to none. Don't think putting a license on them will protect you either: you'd have to know somebody is violating your license before you can even think about doing anything, and that's basically impossible if your code gets ripped into a private codebase and isn't obvious externally.
hintymad · 48m ago
> "Open Source" got completely perverted into tricking people to work for free for huge financial benefits for others
I'm quite conflicted on this assessment. On one hand, I wonder if we would have a better job market if there weren't so many open-sourced systems. We may have had much slower growth, but that growth would have lasted many more years, meaning we might have enjoyed our profession until retirement and beyond. On the other hand, open source did create huge markets, right? Like the "big data" market, the ML market, the distributed systems market, etc. Like the millions of data scientists who could barely use Pandas and SciPy, or the hundreds of thousands of ML engineers who couldn't even be bothered to know what a positive semi-definite matrix is.
Interesting times.
Daishiman · 1h ago
> P.S., I think the current trend is a wakeup call to us software engineers. We thought we were doing highly creative work, but in reality we spend a lot of time doing the basic job of knowledge workers: retrieving knowledge and interpolating some basic and highly predictable variations. Unfortunately, the current AI is really good at replacing this type of work.
Most of the waking hours of most creative work have this type of drudgery. Professional painters and designers spend most of their time replicating ideas that are well fleshed-out. Musicians spend most of their time rehearsing existing compositions.
There is a point to be made that these repetitive tasks are a prerequisite to come up with creative ideas.
rowanG077 · 1h ago
I disagree. AI has shown itself to be most capable in what we consider creative jobs: music creation, voice acting, text/story writing, art creation, video creation and more.
roflyear · 43m ago
If you mean create as in literally, sure. But not in being creative. AI can't solve novel problems yet. The person you're replying to obviously means being creative not literally creating something.
crat3r · 12m ago
What is the qualifier for this? Didn't one of the models recently create a "novel" algorithm for a math problem? I'm not sure this holds water anymore.
electrondood · 1h ago
> doing the basic job of knowledge workers
If you extrapolate and generalize further... what is at risk is any task that involves taking information input (text, audio, images, video, etc.), and applying it to create some information output or perform some action which is useful.
That's basically the definition of work. It's not just knowledge work, it's literally any work.
fourside · 4h ago
> You still need to do a lot of work to get it production ready, but it's as if you have an infinite number of junior engineers at your disposal now all working on different things.
One issue with junior devs is that because they’re not fully autonomous, you have to spend a non trivial amount of time guiding them and reviewing their code. Even if I had easy access to a lot of them, pretty quickly that overhead would become the bottleneck.
Do you think that managing a lot of these virtual devs could get overwhelming, or are they pretty autonomous?
fabrice_d · 4h ago
They wrote "You still need to do a lot of work to get it production ready", so I would say it's not much better than real colleagues. Especially since junior devs will improve to the point that they don't need your hand-holding (remember, you were also a junior at some point), which is not guaranteed to happen with AI tools.
bmcahren · 2h ago
Counter-point A: AI coding assistance tools are rapidly advancing at a clip that is inarguably faster than humans.
Counter-point B: AI does not get tired, does not need space, does not need catering to their experience. AI is fine being interrupted and redirected. AI is fine spending two days on something that gets overwritten and thrown away (no morale loss).
HappMacDonald · 2h ago
Counter-counter-point A: If I work with a human Junior and they make an error or I familiarize them with any quirk of our workflow, and I correct them, they will recall that correction moving forward. An AI assistant either will not remember 5 minutes later (in a different prompt on a related project) and repeat the mistake, or I'll have to take the extra time to code some reminder into the system prompt for every project moving forward.
Advancements in general AI knowledge over time will not correlate to improvements in remembering any matters as colloquial as this.
Counter-counter-point B: AI absolutely needs catering to their experience. Prompter must always learn how to phrase things so that the AI will understand them, adjust things when they get stuck in loops by removing confusing elements from the prompt, etc.
SketchySeaBeast · 1h ago
I find myself thinking about juniors vs AI as babies vs cats. A cat is more capable sooner, you can trust it when you leave the house for two hours, but it'll never grow past shitting in a box and needing to be fed.
rfoo · 4h ago
You don't need to be nice to your virtual junior devs. Saves quite a lot time too.
As long as I spend less time reviewing and guiding than doing it myself, it's a win for me. I don't have any fun doing these things and I'd rather yell at a bunch of "agents". For those who enjoy doing a bunch of small edits, I guess it's the opposite.
HappMacDonald · 2h ago
I'm definitely wary of the concept of dismissing courtesy when working with AI agents, because I certainly don't want to lose that habit when I turn around and have to interact with humans again.
woah · 4h ago
> Parallel task execution: You can batch dozens of small edits (refactors, tests, boilerplate) and run them concurrently without context juggling. It's super nice to run a bunch of tasks at the same time (something that's really hard to do in Cursor, Cline, etc.)
> It kind of feels like a junior engineer on steroids, you just need to point it at a file or function, specify the change, and it scaffolds out most of a PR. You still need to do a lot of work to get it production ready, but it's as if you have an infinite number of junior engineers at your disposal now all working on different things.
What's the benefit of this? It sounds like it's just a gimmick for the "AI will replace programmers" headlines. In reality, LLMs complete their tasks within seconds, and the time consuming part is specifying the tasks and then reviewing and correcting them. What is the point of parallelizing the fastest part of the process?
johnjwang · 3h ago
In my experience, it still does take quite a bit of time (minutes) to run a task on these agentic LLMs (especially with the latest reasoning models), and in Cursor / Cline / other code editor versions of AI, it's enough time for you to get distracted, lose context, and start working on another task.
So the benefit is really that during this "down" time, you can do multiple useful things in parallel. Previously, our engineers were waiting on the Cursor agent to finish, but the parallelization means you're explicitly turning your brain off of one task and moving on to a different task.
woah · 54m ago
In my experience in Cursor with Claude 3.5 and Gemini 2.5, if an agent has run for more than a minute it has usually lost the plot. Maybe model use in Codex is a new breed?
kfajdsl · 3h ago
A single response can take a few seconds, but tasks with agentic flows can be dozens of back and forths. I've had a fairly complicated Roo Code task take 10 minutes (multiple subtasks).
ctoth · 4h ago
> Each task is processed independently in a separate, isolated environment preloaded with your codebase. Codex can read and edit files, as well as run commands including test harnesses, linters, and type checkers. Task completion typically takes between 1 and 30 minutes, depending on complexity, and you can monitor Codex’s progress in real time.
dakiol · 12m ago
This whole "LLMs == junior engineers" thing is so pedantic. Don't we realize that the same way senior engineers think LLMs can just replace junior engineers, high-level executives think that LLMs will soon replace senior ones?
Junior engineers are not cattle. They are the future senior ones, they bring new insights into teams, new perspectives; diversity. I can tell you the times I have learnt so many valuable things from so-called junior engineers (and not only tech-wise things).
LLMs have their place, but ffs, stop with the "junior engineer replacement" shit.
obsolete_wagie · 8m ago
You need someone that's technical to look at the agent output, so senior engineers will be around. Junior engineers are certainly being replaced.
dakiol · 4m ago
Thanks, Sherlock. Now, tell me, when senior engineers start to retire, who will replace them? Ah, yeah, I can hear you say "LLMs!". And LLMs will rewrite themselves so we won't need seniors anymore writing code. And LLMs will write all the code companies need. So obvious, of course. We won't need a single senior because we won't have them, because they are not hired these days anymore. Perfect plan.
Jimmc414 · 4h ago
> We’ve long used local agents like Cursor and Claude Code, so we didn’t expect too much.
If you don't mind, what were the strengths and limitations of Claude Code compared to Codex? You mentioned parallel task execution being a standout feature for Codex - was this a particular pain point with Claude Code? Any other insights on how Claude Code performed for your team would be valuable. We are pleased with Claude Code at the moment and were a bit underwhelmed by comparable Codex CLI tool OAI released earlier this month.
t_a_mm_acq · 4h ago
After realizing CC can operate on the same code base and same file tree in different terminal instances, it's been a significant unlock for us. Most devs have 3 running concurrently: 1. master task list + checks for completion on tasks; 2. operating on the current task + documentation; 3. side quests, bugs, additional context.
Rinse and repeat: once a task is done, update #1 and cycle again. Add another CC window if you need more tasks running concurrently.
The downside is cost, but if that's not an issue, it's great for getting stuff done across distributed teams.
naiv · 3h ago
Do you then have instances 2 and 3 listening to instance 1 with just a prompt? Or how does this work?
runako · 3h ago
> Parallel task execution: You can batch dozens of small edits (refactors, tests, boilerplate) and run them concurrently without context juggling.
This is also part of a recent update to Zed. I typically use Zed with my own Claude API key.
ai-christianson · 3h ago
Is Zed managing the containerized dev environments, or creating multiple worktrees or anything like that? Or are they all sharing the same work tree?
runako · 3h ago
As far as I know, they are sharing a single work tree. So I suppose that could get messy by default.
That said, it might be possible to tell each agent to create a branch and do work there? I haven't tried that.
I haven't seen anything about Zed using containers, but again you might be able to tell each agent to use some container tooling you have in place since it can run commands if you give it permission.
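One way to avoid the shared-work-tree mess is git's built-in worktree feature: each agent gets its own checkout on its own branch of the same repo. A minimal sketch (paths and branch names invented; Python here only drives the git CLI):

```python
# Sketch: one linked git worktree per agent, so parallel edits never
# share a checkout. Assumes git >= 2.28 (for `init -b`).
import subprocess
import tempfile
from pathlib import Path

def run(*args, cwd=None):
    """Run a git command, raising on failure."""
    subprocess.run(args, cwd=cwd, check=True, capture_output=True)

base = Path(tempfile.mkdtemp())
repo = base / "repo"
run("git", "init", "-q", "-b", "main", str(repo))
run("git", "-c", "user.name=demo", "-c", "user.email=demo@example.com",
    "commit", "-q", "--allow-empty", "-m", "init", cwd=repo)

# One linked worktree per agent, each on its own new branch
for agent in ("agent-1", "agent-2"):
    run("git", "worktree", "add", "-q", "-b", agent, str(base / agent), cwd=repo)

print(sorted(p.name for p in base.iterdir()))  # ['agent-1', 'agent-2', 'repo']
```

Each agent can then commit independently in its own directory, and the branches get reviewed and merged like any other PRs.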
_bin_ · 2h ago
I believe cursor now supports parallel tasks, no? I haven't done much with it personally but I have buddies who have.
If you want one idiot's perspective, please hyper-focus on model quality. The barrier right now is not tooling, it's the fact that models are not good enough for a large amount of work. More importantly, they're still closer to interns than junior devs: you must give them a ton of guidance, constant feedback, and a very stern eye for them to do even pretty simple tasks.
I'd like to see something with an o1-preview/pro level of quality that isn't insanely expensive, particularly since a lot of programming isn't about syntax (which most SotA models have down pat) but about understanding the underlying concepts, an area in which they remain weak.
At this point I really don't care if the tooling sucks. Just give me really, really good models that don't cost a kidney.
quantumHazer · 2h ago
The CTO of an AI agents company (which has worked with AI labs) says agents work fine. Nothing new under the sun.
NewEntryHN · 4h ago
The advantage of Cursor is the reduced feedback loop where you watch it live and can intervene at any moment to steer it in the right direction. Is Codex such a superior model that it makes sense to take the direction of a mostly background agent, on which you seemingly have a longer feedback loop?
strangescript · 4h ago
it feels like openai are at a ceiling with their models, codex1 seems to be another RLHF derivative from the same base model. You can see this in their own self-reported o3-high comparison where at 8 tries they converge at the same accuracy.
It also seems very telling they have not mentioned o4-high benchmarks at all. o4-mini exists, so logically there is an o4 full model right?
aorobin · 3h ago
Seems likely that they are waiting to release o4 full results until the gpt-5 release later this year, presumably because gpt-5 is bundled with a roughly o4 level reasoning capability, and they want gpt-5 to feel like a significant release.
losvedir · 1h ago
Do you still think there will be a gpt-5? I thought the consensus was GPT-5 never really panned out and was released with little fanfare as 4.1.
nadis · 3h ago
In the preview video, I appreciated Katy Shi's comment on "I think this is a reflection of where engineering work has moved over the past where a lot of my time now is spent reviewing code rather than writing it."
As I think about what "AI-native" or just the future of building software looks like, it's interesting to me that - right now - developers are still just reading code and tests rather than looking at simulations.
While a new(ish) concept for software development, simulations could provide a wider range of outcomes and, especially for the front end, are far easier to evaluate than just code/tests alone. I'm biased because this is something I've been exploring but it really hit me over the head looking at the Codex launch materials.
ai-christianson · 3h ago
> rather than looking at simulations
You mean like automated test suites?
tough · 3h ago
automated visual fuzzy-testing with some self-reinforcement loops
There are already libraries for QA testing, and VLMs can give critique on a series of screenshots automated by a Playwright script per branch
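A rough sketch of what that loop could look like (every function here is a hypothetical stub - a real version would drive Playwright for the screenshots and an actual vision model for the critique):

```python
# Sketch of the loop: screenshot each page on a branch, ask a VLM to
# critique it, fail the branch if issues come back. Both helpers are
# hypothetical stubs: a real version would wrap Playwright and a
# vision model (e.g. SmolVLM or LLaVA via llama.cpp).
def capture_screenshots(branch):
    return [f"{branch}-home.png", f"{branch}-checkout.png"]

def vlm_critique(screenshot):
    # Stub: pretend the model found nothing wrong.
    return {"screenshot": screenshot, "issues": []}

def visual_qa(branch):
    reports = [vlm_critique(s) for s in capture_screenshots(branch)]
    return all(not r["issues"] for r in reports)

print(visual_qa("feature/new-checkout"))  # True when no issues are flagged
```

The useful property is that the critique comes back structured, so a CI job can fail the branch whenever issues are reported.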
ai-christianson · 2h ago
Cool. Putting vision in the loop is a great idea.
Ambitious idea, but I like it.
tough · 2h ago
SmolVLM, Gemma, LlaVa, in case you wanna play with some of the ones i've tried.
recently both llama.cpp and ollama got better support for them too, which makes this kind of integration with local/self-hosted models now more attainable/less expensive
[I'm one of the co-creators of SWE-bench] The team managed to improve on the already very strong o3 results on SWE-bench, but it's interesting that we're just seeing an improvement of a few percentage points. I wonder if getting to 85% from 75% on Verified is going to take as long as it took to get from 20% to 75%.
Snuggly73 · 3h ago
I can be completely off base, but it feels to me like benchmaxxing is going on with swe-bench.
Can someone please tell me if I am being too paranoid for being OK with Cursor working on parts of my project, but uncomfortable uploading my whole repo to the cloud for Codex to work on?
Or have I misunderstood something?
What if there is a hack and everyone's repos are exposed?
I'm kind of weirded out that no one has a problem with this. I know we use GitHub (mostly), but this would be another point of failure, so a 2x riskier setup?
Sorry, maybe I'm being obtuse
ionwake · 1h ago
I'm sorry if I'm being silly, but I have paid for the Pro version, $200 a month, and every time I click on Try Codex, it takes me to a pricing page with the "Team Plan" https://chatgpt.com/codex#pricing.
Is this still rolling out? I don't need the Team plan too, do I?
I have been using OpenAI products for years now and I am keen to try this, but I have no idea what I am doing wrong.
throwaway314155 · 24m ago
They do this with every major release. Never going to understand why.
jdee · 57m ago
I'm the same, and it appeared for me 2 mins ago. Looks like it's still rolling out
mr_north_london · 58m ago
It's still rolling out
ionwake · 26m ago
Thx for the reply, I'm in London too ( atm )
bionhoward · 3h ago
What about privacy, training opt out?
What about using it for AI / developing models that compete with our new overlords?
Seems like using this is just asking to get rug pulled for competing with em when they release something that competes with your thing. Am I just an old who’s crowing about nothing? It’s ok for them to tell us we own outputs we can’t use to compete with em?
piskov · 3h ago
Watch the video: there is an explicit switch at one of the steps about (not) allowing them to train on your repo.
lurking_swe · 2h ago
That’s nice. And we trust that it does what it says because…? The AI company (openai, anthropic, etc) pinky promised? Have we seen their source code? How do you know they don’t train?
Facebook has been caught in recent DOJ hearings breaking the law with how they run their business, just as one example. They claimed under oath, previously, to not be doing X, and then years later there was proof they did exactly that.
A company's "word" means nothing imo. None of this makes sense if I'm being honest. Unless you personally have a negotiated contract with the provider, and can somehow be certain they are doing what they claim, and can later sue for damages, all of this is just crossing your fingers and hoping for the best.
tough · 2h ago
On the other hand you can enable explicit sharing of your data and get a few million free tokens daily
blixt · 4h ago
They mentioned "microVM" in the live stream. Notably there's no browser or internet access. It makes sense, running specialized Firecracker/Unikraft/etc microkernels is way faster and cheaper so you can scale it up. But there will be a big technical scalability difficulty jump from this to the "agents with their own computers". ChatGPT Operator already does have a browser, so they definitely can do this, but I imagine the demand is orders of magnitudes different.
There must be room for a Modal/Cloudflare/etc infrastructure company that focuses only on providing full-fledged computer environments specifically for AI with forking/snapshotting (pause/resume), screen access, human-in-the-loop support, and so forth, and it would be very lucrative. We have browser-use, etc, but they don't (yet) capture the whole flow.
alvis · 5h ago
I used to work for a bank, and the legal team used to ping us to make tiny changes to the app for compliance-related issues. Now they can fix those themselves. I think they'd be very proud and happy
ajkjk · 4h ago
Hopefully nobody lets legal touch anything without the ability to run the code to test it, plus code reviews. So probably not.
singularity2001 · 4h ago
That will be an interesting new bug tracker: anyone in the company will be able to report any bug or add any feature request. If the model can solve it automatically, perfect; otherwise some human might take over. The interesting question then will be which code changes are legal and within the standards of what the company wants. So non-technical code/issue reviewer will become a super important and ubiquitous job.
SketchySeaBeast · 52m ago
Not just legal/within the standards, but which actually meet the unspoken requirements of the request. "We just need a new checkbox that asks if you're left handed" might seem easy, but then it has ramifications for the Application PDF that gets generated, as well as any systems downstream, and maybe it requires a data conversion of some sort somewhere. I know that the POs I work with miss stuff or assume that the request will just have features by default.
asdev · 2h ago
I promise you the legal team is not pushing any code changes
No comments yet
SketchySeaBeast · 1h ago
Is this the same idea as when we switched to multicore machines? The rate of change in the capabilities of a single agent has slowed enough that now the only way for OpenAI to appear to be making decent progress is to have many?
yanis_t · 5h ago
So it's looking like it's only running in the cloud, that is it will push commits to my remote repo before I have a chance to see if it works?
When I'm using aider, after it makes a commit I immediately run git reset HEAD^ and then git diff (actually I use the GitHub Desktop client to see the diff) to evaluate exactly what it did and whether I like it or not. Then I usually make some adjustments and only after that commit and push.
danielbln · 4h ago
You may want to pass --no-auto-commits to Aider if you peel them off HEAD afterwards anyway.
flakiness · 4h ago
You can think of this as a managed (cloud) version of their codex command line tool, which runs locally on your laptop.
The secret sauce here seems like their new model, but I expect it to come to API at some point.
codemac · 4h ago
Watch the live stream: it shows you the diff as the completed task, and you decide whether or not to generate a GitHub PR when you see the diff.
asdev · 2h ago
Is the point of this to actually assign tasks to an AI to complete end to end? Every task I do with AI requires at least some hand-holding, sometimes reprompting, etc. So I don't see why I would want to run tasks in parallel; I don't think it would increase throughput. Curious if others have better experiences with this
simianwords · 5h ago
I wonder if tools like these are best for semi structured refactors like upgrade to python3, migrate to postgres etc
sudohalt · 3h ago
When it runs the code I assume it does so via a docker container; does anyone know how it is configured? Assuming the user hasn't specified an AGENTS.md file or a Dockerfile in the repo. Does it generate it via LLM based on the repo and what it thinks is needed? Does it use static analysis (package.json, requirements.txt, etc.)? Do they just have a super generic Dockerfile that can handle most envs? A combination of different things?
I think they mentioned it was a similar environment to what it trains on, so maybe they have a default Dockerfile. Of course containers can also install additional packages or at least python packages.
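If it does lean on static analysis, one plausible approach - purely speculative, nothing here describes Codex internals - is to map manifest files to base images:

```python
import os

# Speculative mapping from manifest files to base images; a guess at
# how such a service *could* pick an environment, not a description
# of Codex internals. Image tags are illustrative.
MANIFESTS = {
    "package.json": "node:20",
    "requirements.txt": "python:3.12",
    "Cargo.toml": "rust:1.78",
    "go.mod": "golang:1.22",
}

def guess_base_image(repo_dir):
    for manifest, image in MANIFESTS.items():
        if os.path.exists(os.path.join(repo_dir, manifest)):
            return image
    return "ubuntu:24.04"  # generic fallback when nothing matches
```

Falling back to a generic image plus on-demand package installs would cover most of the long tail.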
nkko · 1h ago
Yes, and one test failed as it missed a pydantic dependency
I remember HN had a repeating popular post on the most important data structures. They are all the basic ones that a first-year college student can learn. The youngest one was the skip list, which was invented in 1990. When I was a student, my class literally read the original paper, implemented the data structure, and analyzed its complexity in our first data structures course.
This seems to imply that software engineering as a profession has been quite mature and saturated for a while, to the point that a model can predict most of the output. Yes, yes, I know there are thousands of advanced algorithms and amazing systems in production. It's just that the market does not need millions of engineers for such advanced skills.
Unless we get yet another new domain like cloud or like internet, I'm afraid the core value of software engineers: trailblazing for new business scenarios, will continue diminishing and being marginalized by AI. As a result, we get way less demand for our job, and many of us will either take a lower pay, or lose our jobs for extended time.
tptacek · 5h ago
Maddening: "codex" is also the name of their open-source Claude-Code-alike, and was previously the name of an at-the-time frontier coding model. It's like they name things just to fuck with us.
tekacs · 5h ago
So -- that client-side thing is _technically_ called `codex-cli` (in the parent 'codex' repo, which looks like a monorepo?).
Still super confusing, though!
I feel like companies working with and shipping LLMs would do well to remember that it's not just humans who get confused by this, but LLMs themselves... it makes for a painful time, sending off a request and noting a third of the way into its reasoning that the model has gotten two things with almost-identical names confused.
tough · 5h ago
they also have a dual implementation in Rust and TypeScript; there's codex-rs in that monorepo
fabmilo · 4h ago
more excited about the rust impl than the typescript one.
tptacek · 4h ago
Besides packaging of their releases, what possible difference could that make in this problem domain?
tough · 3h ago
I just think it's nice to have open source code to reference so maybe he meant just in that -educational- way, certainly more to learn from the rust one than the TS one for most folks? even if the problem-space doesn't require system-level safety code indeed
quantadev · 4h ago
If its name is 'codex-cli', then that means "Codex Command Line Interface", so the name is absolutely Codex.
scottfalconer · 4h ago
Next week: OpenAI rebrands Windsurf as Codex.
manojlds · 5h ago
And with themselves and their models. The Codex open source had prompt to disambiguate it from the model.
asadm · 5h ago
Is there an open source version of this? Something that uses microVMs to git clone my repo, run codex-cli or equivalent, and send me a PR.
> To balance safety and utility, Codex was trained to identify and precisely refuse requests aimed at development of malicious software, while clearly distinguishing and supporting legitimate tasks.
I can't say I am a big fan of neutering these paradigm-shifting tools according to one culture's code of ethics / way of doing business / etc.
One man's revolutionary is another's enemy combatant and all that. What if we need top-notch malware to take down the robot dogs lobbing mortars at our madmaxian compound?!
GolfPopper · 4h ago
>What if we need top-notch malware to take down the robot dogs lobbing mortars at our madmaxian compound?!
I wouldn't sweat it. According to its developers, Codex understands 'malicious software'; it has just been trained to say, "But I won't do that" when such requests are made to it. Judging from the recent past [1][2], getting LLMs to bypass such safeguards is pretty easy.
You gotta think about it in terms of cost vs benefit. How much damage will a malicious AI do, vs how much value will you get out of a non-neutered model?
No comments yet
rowanG077 · 55m ago
Agreed. I'm a big proponent of people being in control of the tools they use. I don't think the approach where a wise dictator enforces that I can't use my flathead screwdriver to screw down a Phillips head screw is good. I think it's actively undermining people.
amarcheschi · 5h ago
If I had to guess, only for the general public they'll be neutered, not for the 3 letters agencies
pixl97 · 4h ago
TLAs have very few of their own coders; they contract everything out. Now I'm sure OAI will lend an unrestricted model to groups that pay large private contracts they won't disclose.
scudsworth · 5h ago
pleased to see a paragraph-long comment in the examples. now thats good coding.
2OEH8eoCRo0 · 2h ago
More generated slop for a real human to sift through. Can I get an ai summary of that comment?
alvis · 5h ago
Is it surprising? Hmm perhaps nope. But is it better than cursor etc? Hmm perhaps it’s a wrong question.
Feels like codex is for product managers to fix bugs without touching any developer resources. Then it’s insanely surprising!
bhl · 5h ago
I've been contracting with a startup. The bottleneck is not the lack of tools; it's agency. There's so much work, it becomes work to assign and organize work.
But now who's going to do that work? Still engineers.
gbalduzzi · 5h ago
It sounds nice, but are product managers able to spot regressions or other potential issues (performance, data protection, legal, etc) in the codex result?
alvis · 5h ago
If codex can analyze the whole code base, I can't see why not? I can even imagine one could set up a CI task requiring that any committed code pass all sorts of legal/data protection requirements too
kenjackson · 3h ago
Exactly this. In fact the product manager should be the one who knows what set of checks needs to be run over the code base. You still need a dev, though, to make sure the last mile is doing what you expect it to do.
RhysabOweyn · 1h ago
I believe that code from one of these things will eventually cause a disaster affecting the capital owners. Then all of a sudden you will need a PE license, ABET degree, 5 years working experience, etc. to call yourself a software engineer. It would not even be historically unique. Charlatans are the reason that lawyers, medical doctors, and civil engineers have to go through lots of education, exams, and vocational training to get into their profession. AI will probably force software engineering as a profession into that category as well.
On the other hand, if your job was writing code at certain companies whose profits were based on shoving ads in front of people then I would agree that no one will care if it is written by a machine or not. The days of those jobs making >$200k a year are numbered.
No comments yet
prhn · 5h ago
Is anyone using any of these tools to write non boilerplate code?
I'm very interested.
In my experience ChatGPT and Gemini are absolutely terrible at these types of things. They are constantly wrong. I know I'm not saying anything new, but I'm waiting to personally experience an LLM that does something useful with any of the code I give it.
These tools aren't useless. They're great as search engines and pointing me in the right direction. They write dumb bash scripts that save me time here and there. That's it.
And it's hilarious to me how these people present these tools. It generates a bunch of code, and then you spend all your time auditing and fixing what is expected to be wrong.
That's not the type of code I'm putting in my company's code base, and I could probably write the damn code more correctly in less time than it takes to review for expected errors.
What am I missing?
Workaccount2 · 5h ago
>What am I missing?
That you are trying to use LLMs to create the giant, sprawling, feature-packed software packages that define the modern software landscape. What's being missed is that any one user might only utilize 5% of the code base on any given day. Software is written to accommodate every need every user could have in one package. Then the users just use the small slice that accommodates their specific needs.
I have now created 5 hyper narrow programs that are used daily by my company to do work. I am not a programmer and my company is not a tech company located in a tech bubble. We are a tiny company that does old school manufacturing.
To give a quick general example, Betty uses Excel to manage payroll. A list of employees, a list of wages, a list of hours worked (which she copies from the time clock software's .csv export that she imports into Excel).
Excel is a few-million-LOC program and costs ~$10/mo. Betty needs maybe 2k LOC to do what she uses Excel for. Something an LLM can do easily: a Python GUI wrapper on an SQLite DB. And she would be blown away at how fast it is, and how it is written for her use specifically.
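For illustration, the core of that hypothetical Betty tool is only a few lines of Python on top of SQLite (the schema, names, and numbers here are invented):

```python
import sqlite3

# Toy schema for the payroll example above (names and numbers invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, hourly_wage REAL);
    CREATE TABLE hours (employee_id INTEGER, week TEXT, hours_worked REAL);
""")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [(1, "Betty", 25.0), (2, "Sam", 22.5)])
conn.executemany("INSERT INTO hours VALUES (?, ?, ?)",
                 [(1, "2025-W20", 40), (2, "2025-W20", 38)])

def payroll(week):
    # Gross pay per employee for the given week: hours joined against wages.
    return conn.execute("""
        SELECT e.name, h.hours_worked * e.hourly_wage
        FROM hours h JOIN employees e ON e.id = h.employee_id
        WHERE h.week = ? ORDER BY e.id""", (week,)).fetchall()

print(payroll("2025-W20"))  # [('Betty', 1000.0), ('Sam', 855.0)]
```

Everything beyond this - the GUI, the CSV import - is scaffolding an LLM handles well.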
How software is written and how it is used will change to accommodate LLMs. We didn't design cars to drive on horse paths, we put down pavement.
kridsdale3 · 5h ago
The Romans put down paved roads to make their horse paths more reliable.
But yes, I hope we get away from the giant conglomeration of everything, ESPECIALLY the reality of people doing 90% of their business inside a Google Chrome window. Move towards the UNIX philosophy of tiny single-purpose programs.
lispisok · 1h ago
A lot of people are deeply invested in these things being better than they really are. From the OpenAI's and Google's spending $100s of billions EACH developing LLMs to VC backed startups promising their "AI agent" can replace entire teams of white collar employees. That's why your experience matches mine and every other developer I personally know but you see comments everywhere making much grander claims.
triMichael · 11m ago
I agree, but I'd add that it's not just the tech giants who want them to be better than they are, but also non-programmers.
IMO LLMs are actually pretty good at writing small scripts. First, it's much more common for a small script to be in the LLM's training data, and second, it's much easier to find and fix a bug. So the LLM actually does allow a non-programmer to write correct code with minimal effort (for some simple task), and then they are blown away thinking writing software is a solved problem. However, these kinds of people have no idea of the difference between a hundred line script where an error is easily found and isn't a big deal and a million line codebase where an error can be invisible and shut everything down.
Worst of all is when the two sides of tech-giants and non-programmers meet. These two sides may sound like opposites but they really aren't. In particular, there are plenty of non-programmers involved at the C-level and the HR levels of tech companies. These people are particularly vulnerable to being wowed by LLMs seemingly able to do complex tasks that in their minds are the same tasks their employees are doing. As a result, they stop hiring new people and tell their current people to "just use LLMs", leading to the current hiring crisis.
browningstreet · 5h ago
I've built a number of personal data-oriented and single purpose tools in Replit. I've constrained my ambitions to what I think it can do but I've added use cases beyond my initial concept.
In short, the tools work. I've built things 10x faster than doing it from scratch. I also have a sense of what else I'll be able to build in a year. I also enjoy not having to add cycles to communicate with external contributors -- I think, then I do, even if there's a bit of wrestling. Wrangling with a coding agent feels a bit like "compile, test, fix, re-compile". Re-compiling generally got faster in subsequent generations of compiler releases.
My company is building internal business functions using AI right now. It works too. We're not putting that stuff in front of our customers yet, but I can see that it'll come. We may put agents into the product that let them build things for themselves.
I get the grumpiness & resistance, but I don't see how it's buying you anything. The puck isn't underfoot.
Cu3PO42 · 5h ago
Occasionally. I find that there is a certain category of task that I can hand over to an LLM and get a result that takes me significantly less time to clean up than it would have taken me to write from scratch.
A recent example from a C# project I was working in. The project used builder classes that were constructed according to specified rules, but all of these builders were written by hand. I wanted to automatically generate these builders, and not using AI, just good old meta-programming.
Now I knew enough to know that I needed a C# source generator, but I had absolutely no experience with writing them. Could I have figured this out in an hour or two? Probably. Did I write a prompt in less than five minutes and get a source generator that worked correctly in the first shot? Also yes. I then spent some time cleaning up that code and understanding the API it uses to hook into everything and was done in half an hour and still learnt something from it.
You can make the argument that this source generator is in itself "boilerplate", because it doesn't contain any special sauce, but I still saved significant time in this instance.
spariev · 5h ago
I think it all depends on your platform and use cases. In my experience AI tools work best with Python and JS/Typescript and some simple use cases (web apps, basic data science etc). Also, I've found they can be of great help with refactorings and cases when you need to do something similar to already existing code, but with a twist or change.
arkmm · 4h ago
I think most code these days is boilerplate, though the composition of boilerplate snippets can become something unique and differentiated.
volkk · 5h ago
you might be missing small things to create more guardrails like effective prompting and maintaining what's been done using files, carefully controlling context, committing often in-between changes, but largely, you're not missing anything. i use AI constantly, but always for subtasks of a larger complicated thing that my brain has thought through. and often use higher cost models to help me abstractly think through complex things/point me in the right directions.
personally, i've always operated in a codebase in a way that i _need_ to understand how things work for me to be productive and make the right decisions. I operate the same way with AI. every change is carefully reviewed, if it's dumb, i make it redo it and explain why it's dumb. and if it gets caught in a loop, i reset the context and try to reframe the problem. overall, i'm definitely more productive, but if you truly want to be hands off--you're in for a very bad time. i've been there.
lastly, some codebases don't work well with AI. I was working on a problem that was a bit more novel/out there and no model could solve it. Just yapped endlessly about these complex, very potentially smart sounding solutions that did absolutely nothing. went all the way to o1-pro. the craziest part to me was the fact that across claude, deepseek and openai, they used the same specific vernacular for this particular problem, which really highlights how a lot of these models are just a mish-mash of the same underlying architecture/internet data. some of these models use responses from other models for their training data, which to me is like incest. you won't get good genetic results
icapybara · 5h ago
It’s probably what you’re asking. You can’t just say “write me an app”, you have to break a big problem into small problems for it.
asadm · 5h ago
yes, think of it as a search engine that auto-applies that Stack Overflow fix to your code.
But I have done larger tasks (writing device drivers) using Gemini.
uludag · 5h ago
I feel things get even worse when you use a more niche language. I get extremely disappointed any time I try to get it do anything useful in Clojure. Even as a search engine, especially when asking it about libraries, these tools completely fail expectation.
I can't even fathom how frustrating such tools would be with poorly written confusing Clojure code using some niche dependency.
That being said, I can imagine a whole class of problems where this could succeed very well at and provide value. Then again, the type of problems that I feel these systems could get right 99% of the time are problems that a skilled developer could fix in minutes.
sottol · 5h ago
I tried using Gemini 2.5 Pro for a side-side-project; it seemed like a good project to explore LLMs and how they'd fit into my workflow. 2-3 weeks later it's around 7k loc of Python auto-generating about 35k loc of C from a JSON spec.
This project is not your typical Webdev project, so maybe that's an interesting case-study. It takes a C-API spec in JSON, loads and processes it in Python and generates a C-library that turns a UI marked up YAML/JSON into C-Api calls to render that UI. [1]
The result is pretty hacky code (by my design, can't/won't use FFI) that's 90% written by Gemini 2.5 Pro Pre/Exp but it mostly worked. It's around 7k lines of Python that generate a 30-40k loc C-library from a JSON LVGL-API-spec to render an LVGL UI from YAML/JSON markup.
I probably spent 2-3 weeks on this, I might have been able to do something similar in maybe 2x the time but this is about 20% of the mental overhead/exhaustion it would have taken me otherwise. Otoh, I would have had a much better understanding of the tradeoffs and maybe a slightly cleaner architecture if I would have to write it. But there's also a chance I would have gotten lost in some of the complexity and never finished (esp since it's a side-project that probably no-one else will ever see).
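For a feel of the general shape of that pipeline (the spec format below is invented for illustration; the real LVGL API spec is far richer than a list of prototypes):

```python
import json

# Invented mini-spec: each entry describes one C function to generate
# bindings for. The real JSON LVGL spec carries much more detail.
spec = json.loads("""
[{"name": "lv_obj_set_width", "ret": "void",
  "args": [["lv_obj_t *", "obj"], ["int32_t", "w"]]}]
""")

def emit_prototypes(functions):
    # Turn each spec entry into a C prototype line.
    lines = []
    for fn in functions:
        args = ", ".join(f"{t} {n}" for t, n in fn["args"])
        lines.append(f"{fn['ret']} {fn['name']}({args});")
    return "\n".join(lines)

print(emit_prototypes(spec))  # void lv_obj_set_width(lv_obj_t * obj, int32_t w);
```

The real work is in the dispatch and marshalling layers on top of this, but the spec-to-C core is the same pattern repeated.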
What worked well:
* It mostly works(!). Unlike previous attempts with Gemini 1.5 where I had to spend about as much or more time fixing than it'd have taken me to write the code. Even adding complicated features after the fact usually works pretty well with minor fixing on my end.
* Lowers mental "load" - you don't have to think so much about how to tackle features, refactors, ...
Other stuff:
* I really did not like Cursor or Windsurf - I half-use VSCode for embedded hobby projects but I don't want to then have another "thing" on top of that. Aider works, but it would probably require some more work to get used to the automatic features. I really need to get used to the tooling, not an insignificant time investment. It doesn't vibe with how I work, yet.
* You can generate a *significant* amount of code in a short time. It doesn't feel like it's "your" code though, it's like joining a startup - a mountain of code, someone else's architecture, their coding style, comment style, ... and,
* there's this "fog of code", where you can sorta bumble around the codebase but don't really 100% understand it. I still have mid/low confidence in the changes I make by hand, even 1 week after the codebase has largely stabilized. Again, it's like getting familiar with someone else's code.
* Code quality is ok but not great (and partially my fault). Probably depends on how you got to the current code - ie how clean was your "path". But since it is easier to "evolve" the whole project (I changed directions once or twice when I sort of hit a wall) it's also easier to end up with a messy-ish codebase. Maybe the way to go is to first explore, then codify all the requirements and start afresh from a clean slate instead of trying to evolve the code-base. But that's also not an insignificant amount of work and also mental load (because now you really need to understand the whole codebase or trust that an LLM can sufficiently distill it).
* I got much better results with very precise prompts. Maybe I'm using it wrong, ie I usually (think I) know what I want and just instruct the LLM instead of having an exploratory chat but the more explicit I am, the more closely the output is to what I'd like to see. I've tried to discuss proposed changes a few times to generate a spec to implement in another session but it takes time and was not super successful. Another thing to practice.
* A bit of a later realization, but modular code and short, self-contained modules are really important though this might depend on your workflow.
To summarize:
* It works.
* It lowers initial mental burden.
* But to get really good results, you still have to put a lot of effort into it.
* At least right now, it seems you will still eventually have to put in the mental effort at some point, normally it's "front-loaded" where you have to do the design and think about it hard, whereas the AI does all the initial work but it becomes harder to cope with the codebase once you reach a certain complexity. Eventually you will have to understand it though even if just to instruct the LLM to make the exact changes you want.
It may depend on what you consider boilerplate. I use them quite a bit for scripting outside of direct product code development. Essentially, AI coding tools have moved this chart's decision making math for me: https://xkcd.com/1205/ The cost to automate manual tasking is now significantly lower so I end up doing more of it.
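That shifted decision math can be sketched in a few lines (the numbers are made up; the point is only that lowering the automation cost flips more tasks over the threshold):

```python
def worth_automating(minutes_per_run, runs_per_week, horizon_weeks,
                     automation_cost_minutes):
    # The xkcd 1205 trade-off: total time the manual task costs over
    # the horizon vs. the one-time cost of automating it.
    manual_total = minutes_per_run * runs_per_week * horizon_weeks
    return manual_total > automation_cost_minutes

# A 5-minute weekly chore over a year vs. 4 hours of scripting:
print(worth_automating(5, 1, 52, 240))  # True: 260 minutes > 240
```

AI assistance shrinks `automation_cost_minutes`, so chores that never used to clear the bar now do.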
IXCoach · 4h ago
Hey there!
Lots missing here, but I had the same issues, it takes iteration and practice. I use claude code in terminal windows, and text expander to save explicit reminders that I have to inject super regularly because anthropic obscures access to system prompts.
For example, I have 3 to 8 paragraph long instructions I will place regularly about not assuming, checking deterministically etc. and for most things I have the agents write a report with a specific instruction set.
I pop the instructions into text expander so I just type - docs when saying go figure this out, and give me the path to the report when done.
They come back with a path, and I copy it and search vscode.
It opens as an md and I use preview mode; it's similar to a Google doc.
And I'll review it. Always, things will be wrong - tons of assumptions, failures to check deterministically, etc. - but I see that in the doc and have it fix it. Correct misunderstandings, update the doc until it's perfect.
From there I'll say add a plan in a table with status for each task based on this ( another text expander snippet with instructions )
And WHEN that's 100% right, I'll say implement and update as you go. The "update as you go" forces it to recognize and remember the scope of the task.
Greatest points of failure in the system is misalignment. Ethics teams got that right. It compounds FAST if allowed. you let them assume things, they state assumptions as facts, that becomes what other agents read and you get true chaos unchecked.
I literally started rebuilding Claude Code from scratch because they block us from accessing system prompts, and I NEED these agents to stop lying to me about things that are not done or are merely assumed, which highlights the true chaos possible when this is applied to system-critical operations in governance or at scale.
I also built my own tool, like Codex, for managing agent tasks and making this simpler, but getting them to use it without getting confused is still a gap.
Let me know if you have any other questions. I am performing the work of 20 engineers as of today; I rewrote 2 years of back-end code, which had required the full-time work of a team of 2 engineers, in 4 weeks by myself with this system... so I am, I guess, quite good at it.
I need to push my edges further into this latest tech; I have not tried the Codex CLI or the new tool yet.
IXCoach · 4h ago
It's a total of about 30 snippets, averaging 6 paragraphs each, that I have to inject. For each role switch it goes through, I have to re-inject them.
It's a pain, but it works.
Even with TDD it will hallucinate the mocks without management, and hallucinate the requirements. Each layer has to be checked atomically, but the TextExpander snippets, done right, can get it about 75% of the way there.
My main project faces 5,000 users, so I can't let the agents run freely, whereas with isolated projects in separate repos I can let them run more freely and then review in GitKraken before committing.
Rudybega · 2h ago
You could just use something like Roo Code with custom modes rather than injecting them manually. The orchestrator mode can decide on the appropriate modes to use for subtasks.
You can customize the system prompts, baseline prompts, and models used for every single mode, and have as many or as few modes as you want.
skovati · 5h ago
I'm curious how many ICs are truly excited about these advances in coding agents. It seems to me the general trend is that we become more like PMs, managing agents and reviewing PRs, all for the sake of productivity gains.
I imagine many engineers are like myself in that they got into programming because they liked tinkering and hacking and implementation details, all of which are likely to be abstracted over in this new era of prompting.
ramoz · 4h ago
I see it differently. Like a kid with legos.
We had to tinker piece by piece to build a miniature castle. Over many hours.
Now I can tinker concept by concept, and build much larger castles, much faster. Like waving a wand, seeing my thoughts come to fruition in near real time.
No vanity lost in my opinion. Possibly more to be gained.
whyowhy3484939 · 44m ago
> build much larger castles, much faster
See, that never was the purpose: going bigger and faster, towards what exactly? Chaos? By the way, we never managed to fully tame manual software development by trained professionals, and now we expect Shangri-La from throwing everything and the kitchen sink into giant inscrutable matrices. This time by amateurs as well. I'm sure this will all turn out very well and very, very productive.
nluken · 2h ago
I think there's a disconnect between what you and the person you're replying to are defining as "tinkering". Your conception of it seems more focused on the end product when, to use your analogy, the original comment seems unconcerned with the size of castles.
If you derive enjoyment from actually assembling the castle, you lose out on that by using the wand that makes it happen instantly. Sure, the wand's castles may be larger, but you don't put a Lego castle together for the finished product.
CapcomGo · 3h ago
I think the bigger issue with this is that the number of developer jobs will shrink.
lherron · 2h ago
Factorio blueprints in action.
chilmers · 4h ago
While I share your reservations, how many millions of people have experienced the exact same disruption to their jobs and industries because of software that we, software engineers, have created? It’s a bit too late, and a touch hypocritical, for us to start complaining about technology now it is disrupting our way of working in a way we don’t like.
kridsdale3 · 5h ago
I do feel that way, so I'll still do bespoke creation when I want to. But this is like a sewing machine: my job is to design fashion, a whole line of it, and I can do that when a machine is making the stitches instead of me using a needle by hand.
davedx · 4h ago
I think the death of our craft is around the corner. It doesn't fill me with joy.
evantbyrne · 33m ago
Software engineering requires a fair amount of intelligence, so if these tools ever get to replacement levels of quality then it's not just developers that will be out of jobs. ARC-AGI-2, the countless anecdotes from professionals I've seen across the industry, and personal experience all very clearly point to a significant gap between the tools that exist today and general intelligence. I would recommend keeping an eye on improvements just because of the sheer capital investments going into it, but I won't be losing any sleep waiting for the rapture.
manojlds · 5h ago
We (dare I say we instead of I) like talking to computers and AI is another computer you talk with. So I am still all excited. It's people that I want to avoid :)
qntmfred · 4h ago
people can still write code by hand for fun
people who want to make software that enables people to accomplish [task] will get the software they need quicker.
awestroke · 5h ago
At the end of the day, it's your job to deliver value. If a tool allows you to deliver more, faster, without sacrificing quality, it's your responsibility to use that tool. You just have to make sure you can fully take responsibility for the end deliverables. And these tools aren't only useful for writing the final code.
whyowhy3484939 · 39m ago
It's actually not. My job description does not say "deliver value", and nobody talks about my work like that, so I'm not quite sure what to make of that.
> without sacrificing quality
Right..
> it's your responsibility to use that tool
Again, it's actually not. It's my responsibility to do my job, not to make my boss' - or his boss' - car nicer. I know that's what we all know will create "job security", but let's not conflate these things. My job is to hold up my end of the bargain. My boss' job is to pay me for doing that. If he deems it necessary to force me to use AI bullshit, I will of course, but it is definitely not my responsibility to do so on my own initiative.
enjoylife · 4h ago
> these tools are not only useful for writing the final code
This sparked a thought about how a large part of the job is often the work needed to demonstrate impact. I think this aspect is overlooked by some of the good engineers not yet taking advantage of AI tooling. LLM loops may not yet be good enough to produce shippable code by themselves, but they are certainly capable of helping reduce the overhead of these up-and-out communicative tasks.
tough · 4h ago
You mean like hacking together a first POC with AI to sell a product/feature internally and get buy-in from the rest of the team before actually shipping the production version?
ilaksh · 5h ago
As someone who works on his own open source agent framework/UI (https://github.com/runvnc/mindroot), it's kind of interesting how announcements from vendors tend to mirror features that I am working on.
For example, in the last month or so, I added a job queue plugin. The ability to run multiple tasks that they demoed today is quite similar. The issue I ran into with users is that without Enterprise plans, complex tasks run into rate limits when trying to run concurrently.
So I am adding an ability to have multiple queues, with each possibly using different models and/or providers, to get around rate limits.
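A minimal sketch of that multi-queue idea, assuming a simple round-robin dispatch policy. All names here (`ProviderQueue`, `dispatch`) are illustrative, not MindRoot's actual API:

```python
from collections import deque
from dataclasses import dataclass, field

# Illustrative sketch: route tasks across several provider queues
# round-robin, so one provider's rate limit doesn't stall the whole batch.

@dataclass
class ProviderQueue:
    name: str                 # e.g. "openai" or "anthropic" (hypothetical labels)
    tasks: deque = field(default_factory=deque)

def dispatch(task: str, queues: list[ProviderQueue], counter: list[int]) -> str:
    """Assign `task` to the next queue in round-robin order; return its name."""
    q = queues[counter[0] % len(queues)]
    counter[0] += 1
    q.tasks.append(task)
    return q.name

queues = [ProviderQueue("openai"), ProviderQueue("anthropic"), ProviderQueue("google")]
counter = [0]
for t in ["refactor auth", "write tests", "update docs", "fix lint"]:
    dispatch(t, queues, counter)

print([len(q.tasks) for q in queues])  # [2, 1, 1]
```

A real version would also track each provider's remaining quota and skip saturated queues, but the round-robin skeleton is the core of it.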
By the way, my system has features somewhat similar not only to the tool they are showing but also to things like Manus. It is quite rough around the edges, though, because I am doing 100% of it myself.
But it is MIT Licensed and it would be great if any developer on the planet wanted to contribute anything.
ianbutler · 4h ago
I'm super curious to see how this actually does at finding significant bugs. We've been working in this space on https://www.bismuth.sh for a while, and one of the things we're focused on is deep validation of the code being output.
There are so many of these "vibe coding" tools, and there has to be real engineering rigor at some point. I saw them demo "find the bug", but the bugs they found were pretty superficial, and that's something we've seen in our internal benchmark from both Devin and Cursor: a lot of noise, false positives, and superficial fixes.
tough · 5h ago
So I just upgraded to the Pro plan, yet https://chatgpt.com/codex doesn't work for me and asks me to "try ChatGPT Pro", showing me the upsell modal even though I'm already on the higher tier.
sigh
jiocrag · 4h ago
Same here. Paying for Pro ($200), but the "try it" link just leads to the Pro sign-up page, where it says I'm already on Pro. Hyper-intelligent coding agents, but they can't make their website work.
tough · 3h ago
> Hyper intelligent coding agents, but can't make their website work.
I know right
Also no human to contact in support... tempted to cancel the sub, lol. I'll give them 24h.
modeless · 5h ago
You mean Pro? It's only in the $200 Pro tier.
tough · 5h ago
Yes, sorry, meant Pro.
I just enabled it under Settings > Connectors > GitHub,
hoping that makes it work.
... still doesn't work. Is it geo-restricted, maybe? Idk.
gizmodo59 · 2h ago
It says "Rolling out to users on the ChatGPT Pro Plan today", so it'll happen throughout the day.
fear91 · 4h ago
Same here, paying for Pro but I just get redirected to vanilla version...
piskov · 3h ago
> will be rolling
≠ available now to all pro users
tough · 3h ago
OK, but I baited the hook and now I'm waiting.
Every -big- release they gatekeep something to Pro. I pay for it like every 3 months, then cancel after the high.
When will I learn?
rapfaria · 5h ago
They said Plus soon, not today.
energy123 · 5h ago
Where can I read OpenAI's promise that it won't use the repos I upload for training?
adamTensor · 5h ago
not buying windsurf then???
motoxpro · 4h ago
This would be the "why" of that acquisition, as this needs a more integrated UI. Judging by the speed at which this came out, it was in the works long before that acquisition.
adamTensor · 4h ago
It's not even clear *if* they are going to buy Windsurf at all. And that's a big if. This might just be the "why" that deal is not happening.
shmoogy · 3h ago
This probably came out to beat Google I/O or something similar; it's an odd Friday release otherwise.
DGAP · 2h ago
If you still don't think software engineering as a high paying job is over, I don't know what to tell you.
I have mentored junior developers and found it to be a rewarding part of the job. My colleagues mostly ignore juniors, provide no real guidance, and couldn't care less. I see this attitude from others in the comments here, relieved that they don't have to face that human interaction anymore. There are too many antisocial weirdos in this industry.
Without a strong moral and cultural foundation the AGI paradigm will be a dystopia. Humans obsolete across all industries.
Can totally relate. Unfortunately the trend for all-senior teams and companies has started long before ChatGPT, so the opportunities have been quite scarce, at least in a professional environment.
That's really awesome. I hope my daughter finds a job somewhere that values professional development. I'd hate for her to quit the industry before she sees just how interesting and rewarding it can be.
I didn't have many mentors when starting out, but the ones I had were so unbelievably helpful both professionally and personally. If I didn't have their advice and encouragement, I don't think I'd still be doing what I'm doing.
That said, back in the early '00s there was much more of a culture where everyone was expected to be self-taught and doing real web dev before they even got to college, so by the time they graduated they were in reality quite senior. This was true for me and a lot of my friends, but I feel like these days there are many CS grads who haven't done a lot of applied work. To be fair, though, it was a much easier task in the early '00s: if you knew JS/HTML/CSS/SQL, C++, and maybe some .NET language, that was pretty much it, and you could do everything (there were virtually no frameworks). Now there are thousands of frameworks, languages, and ecosystems, and you could spend 5+ years learning any one of them. It is no longer possible for one person to learn all of tech; people are much more specialized these days.
But I agree that eventually someone is going to have to start hiring juniors again or there will be no seniors.
To contrast, CH and GER are known to have very robust and regulated apprenticeship programs. Meaning you start working at a much earlier age (16) and go to vocational school at the same time for about 4 years. This path is then supported with all kinds of educational stepping stones later down the line.
There are many software developers who went that route in CH for example, starting with an application development apprenticeship, then getting to technical college in their mid 20's and so on.
I think this model has a lot of advantages. University is for kids who like school and the academic approach to learning. Apprenticeships plus further education or an autodidactic path then casts a much broader net, where you learn practical skills much earlier.
There are several advantages and disadvantages of both paths. In summary I think the academic path provides deeper CS knowledge which can be a force multiplier. The apprenticeship path leads to earlier high productivity and pragmatism.
My opinion is that in combination, both being strongly supported paths, creates more opportunities for people and strengthens the economy as a whole.
Vocational training focusing on immediate fit for the market is great for companies that want to extract maximal immediate value from labour for minimal cost, but longer term is not good for engineers themselves.
Today startups mostly wrap LLMs, because this is what VCs expect. Larger companies have smaller IT budgets than before (adjusted for inflation). This is the real problem causing the job shortage.
I think some people are betting on the fact that AI can replace junior devs in 2-5 years and seniors in 10-20, when the old ones are largely gone. But that's sort of beside the point as far as most corporate decision-making.
I think instead we should focus on getting rid of managers and product owners.
Top-tier engineers who integrate a deep understanding of business and user needs into technical design will likely be safe until we get full-fledged AGI.
Case 1: you keep training engineers.
Case 1.1: AGI soon, you don't need juniors or seniors besides a very few. You cost yourself a ton of money that competitors can reinvest into R&D, use to undercut your prices, or return to keep their investors happy.
Case 1.2: No AGI. Wages rise, a lot. You must remain in line with that to avoid losing those engineers you trained.
Case 2: You quit training juniors and let AI do the work.
Case 2.1: AGI soon, you have saved yourself a bundle of cash and remain mostly in line with the market.
Case 2.2: no AGI, you are in the same bidding war for talent as everyone else, the same place you'd have been were you to have spent all that cash to train engineers. You now have a juicier balance sheet with which to enter this bidding war.
The only way out of this, you can probably see, is some sort of external co-ordination, as is the case with most of these situations. The high-EV move is to quit training juniors, by a mile, independently of whether AI can replace senior devs in a decade.
All the same principles apply as before: smart, driven, high ownership engineers make a huge difference to a company's success, and I find that the trend is even stronger now than before because of all the tools that these early career engineers have access to. Many of the folks we've hired have been able to spin up on our codebase much faster than in the past.
We're mainly helping them develop taste for what good code / good practices look like.
That's really great to hear.
Your experience that a new engineer equipped with modern tools is more effective and productive than in the past is important to highlight. It makes total sense.
There’s still quite a bit of a gap in terms of trust.
But also, I think this significantly underestimates what junior engineers do. Junior engineers are people who have spent 4 to 6 years receiving a specialised education at a university, and they normally need to already be good at school math. All they lack is experience applying this education on the job, but they are professionals: educated, proactive, and mostly smart.
The market is tough indeed, and as tough as it is for a senior engineer like myself, I don't envy the current cohort of fresh grads. Its being tough is only tangentially related to AI, though. The main factor is the general economic slowdown, with AI contributing by diverting already scarce investment from non-AI companies and producing a lot of uncertainty about how many, and what kind of, employees companies will need in the future. AI's current capabilities are nowhere near having a real economic impact.
I wish your kid and you a lot of patience, grit, and luck.
Unfortunately, this is not how companies think. More than 20 years ago I read something about outsourcing and manufacturing offshoring in which the author asked basically the same question: if we move out the so-called low-end jobs, where do we think we will get the senior engineers? Yet companies continued offshoring, and the West lost talent and know-how while watching our competitor, you-know-who, become the world leader in ever more industries.
As you say, happens all the time. Also doesn’t make sense because so few people are buying individual stocks anyway. Goal should be to consistently outperform over the long term. Wall street tends to be very myopic.
Thinking long term is a hard concept for the bean counters at these tech companies, I guess...
It was not long ago that the correction in the tech job market started, after it got blown up during and after COVID. The geopolitical situation is very unstable.
I also think there is far more FUD around AI, including coding assistants, than is warranted, typically coming from people who either want to sell it or want in on the hype.
Things are shifting and moving, which creates uncertainty. But it also opens new doors. Maybe it's a time for risk takers, the curious, the daring. Small businesses and new kinds of services might rise from this, like web development came out of the internet revolution. To me, it seems like things are opening up and not closing down.
Besides that, I bet there are more people today who write, read or otherwise deal directly with assembly code than ever before, even though we had higher level languages for many decades.
As for the job market specifically: SWE and CS (adjacent) jobs are still among the fastest growing, coming up in all kinds of lists.
Money number must always go up. Hiring people costs money. "Oh hey I just read this article, sez you can have A.I. code your stuff, for pennies?"
They'll probably just need to learn for longer, and if companies ever get desperate enough for senior engineers, they'll just take the most able/experienced junior or mid-level dev.
But I'd argue that before they do that, if companies can't find skilled labour domestically, they should consider bringing in skilled workers from abroad. There are literally hundreds of millions of Indians who got connected to the internet over the last decade. There's no reason a company should struggle to find senior engineers.
I hope he forwards your reply to his daughter to cheer her up.
It's probably over for these folks.
There will likely(?, hopefully?) be new adjacent gradients for people to climb.
In any case, I would worry more about your own job prospects. It's coming for everyone.
I was running a quick errand between engineering meetings and saw the first few lines about hiring juniors, and I wrote a couple of comments about how I feel about all of this.
I'm not always guilty of skimming, but today I was.
Does this mean people will be less incentivized to contribute to open source as time goes by?
P.S. I think the current trend is a wake-up call to us software engineers. We thought we were doing highly creative work, but in reality we spend a lot of time doing the basic job of knowledge workers: retrieving knowledge and interpolating some basic and highly predictable variations. Unfortunately, current AI is really good at replacing this type of work.
My optimistic view is that in the long term we will invent or expand into more interesting work, but I'm not sure how long we will have to wait. The current generation of software engineers may face high supply but low demand for our profession for years to come.
For that reason, all my silly little side projects are now in private repos. I don't care that the chance somebody builds a business around them is slim to none. Don't think putting a license on them will protect you either: you'd have to know somebody is violating your license before you can even think about doing anything, and that's basically impossible if it gets ripped into a private codebase and isn't obvious externally.
I'm quite conflicted about this assessment. On one hand, I wonder whether we would have a better job market if there weren't so many open-sourced systems. We might have had much slower growth, but it would have lasted many more years, meaning we might enjoy our profession until retirement and beyond. On the other hand, open source did create large markets, right? Like the "big data" market, the ML market, the distributed-systems market, etc. Like the millions of data scientists who could barely use Pandas and SciPy, or the hundreds of thousands of ML engineers who couldn't even be bothered to know what a positive semi-definite matrix is.
Interesting times.
Most of the waking hours of most creative work consist of this type of drudgery. Professional painters and designers spend most of their time executing ideas that are already well fleshed out. Musicians spend most of their time rehearsing existing compositions.
There is a point to be made that these repetitive tasks are a prerequisite to come up with creative ideas.
If you extrapolate and generalize further... what is at risk is any task that involves taking information input (text, audio, images, video, etc.), and applying it to create some information output or perform some action which is useful.
That's basically the definition of work. It's not just knowledge work, it's literally any work.
One issue with junior devs is that because they’re not fully autonomous, you have to spend a non trivial amount of time guiding them and reviewing their code. Even if I had easy access to a lot of them, pretty quickly that overhead would become the bottleneck.
Do you think that managing a lot of these virtual devs could get overwhelming, or are they pretty autonomous?
Counter-point B: AI does not get tired, does not need space, does not need catering to their experience. AI is fine being interrupted and redirected. AI is fine spending two days on something that gets overwritten and thrown away (no morale loss).
Advancements in general AI knowledge over time will not necessarily translate into improvements in remembering details as colloquial as these.
Counter-counter-point B: AI absolutely needs catering to their experience. Prompter must always learn how to phrase things so that the AI will understand them, adjust things when they get stuck in loops by removing confusing elements from the prompt, etc.
As long as I spend less time reviewing and guiding than doing it myself, it's a win for me. I don't have any fun doing these things, and I'd rather yell at a bunch of "agents". For those who enjoy doing a bunch of small edits, I guess it's the opposite.
> It kind of feels like a junior engineer on steroids, you just need to point it at a file or function, specify the change, and it scaffolds out most of a PR. You still need to do a lot of work to get it production ready, but it's as if you have an infinite number of junior engineers at your disposal now all working on different things.
What's the benefit of this? It sounds like it's just a gimmick for the "AI will replace programmers" headlines. In reality, LLMs complete their tasks within seconds, and the time-consuming part is specifying the tasks and then reviewing and correcting the results. What is the point of parallelizing the fastest part of the process?
So the benefit is really that during this "down" time, you can do multiple useful things in parallel. Previously, our engineers were waiting on the Cursor agent to finish, but the parallelization means you're explicitly turning your brain off of one task and moving on to a different task.
Junior engineers are not cattle. They are the future senior ones, they bring new insights into teams, new perspectives; diversity. I can tell you the times I have learnt so many valuable things from so-called junior engineers (and not only tech-wise things).
LLMs have their place, but ffs, stop with the "junior engineer replacement" shit.
If you don't mind, what were the strengths and limitations of Claude Code compared to Codex? You mentioned parallel task execution being a standout feature for Codex; was this a particular pain point with Claude Code? Any other insights into how Claude Code performed for your team would be valuable. We are pleased with Claude Code at the moment and were a bit underwhelmed by the comparable Codex CLI tool OpenAI released earlier this month.
Rinse and repeat once a task is done; update #1 and cycle again. Add another CC window if you need more tasks running concurrently.
The downside is cost, but if that's not an issue, it's great for getting stuff done across distributed teams.
This is also part of a recent update to Zed. I typically use Zed with my own Claude API key.
That said, it might be possible to tell each agent to create a branch and do work there? I haven't tried that.
I haven't seen anything about Zed using containers, but again you might be able to tell each agent to use some container tooling you have in place since it can run commands if you give it permission.
If you want one idiot's perspective: please hyper-focus on model quality. The barrier right now is not tooling; it's the fact that models are not good enough for a large amount of work. More importantly, they're still closer to interns than junior devs: you must give them a ton of guidance, constant feedback, and a very stern eye for them to do even pretty simple tasks.
I'd like to see something with an o1-preview/pro level of quality that isn't insanely expensive, particularly since a lot of programming isn't about syntax (which most SotA models have down pat) but about understanding the underlying concepts, an area in which they remain weak.
At this point I really don't care if the tooling sucks. Just give me really, really good models that don't cost a kidney.
It also seems very telling that they have not mentioned o4-high benchmarks at all. o4-mini exists, so logically there is a full o4 model, right?
Preview video from Open AI: https://www.youtube.com/watch?v=hhdpnbfH6NU&t=878s
As I think about what "AI-native", or just the future of building software, looks like, it's interesting to me that, right now, developers are still just reading code and tests rather than looking at simulations.
While a new(ish) concept for software development, simulations could cover a wider range of outcomes and, especially for the front end, are far easier to evaluate than code/tests alone. I'm biased because this is something I've been exploring, but it really hit me over the head looking at the Codex launch materials.
You mean like automated test suites?
There are already libraries for QA testing, and VLMs can critique a series of screenshots automated by a Playwright script per branch.
Ambitious idea, but I like it.
https://huggingface.co/blog/smolvlm
Recently, both llama.cpp and Ollama got better support for them too, which makes this kind of integration with local/self-hosted models more attainable and less expensive.
Look at the results from multi swe bench - https://multi-swe-bench.github.io/#/
swe polybench - https://amazon-science.github.io/SWE-PolyBench/
Kotlin bench - https://firebender.com/leaderboard
Or have I misunderstood something? What if there is a hack and everyone's repos are exposed?
I'm kind of weirded out that no one has a problem with this. I know we use GitHub (mostly), but this would be another point of failure, so a 2x riskier setup?
Sorry, maybe I'm being obtuse.
Is this still rolling out? I don't need the Team plan too, do I?
I have been using OpenAI products for years now and I'm keen to try it, but I have no idea what I'm doing wrong.
What about using it for AI / developing models that compete with our new overlords?
Seems like using this is just asking to get rug-pulled for competing with them when they release something that competes with your thing. Am I just an old who's crowing about nothing? It's OK for them to tell us we own outputs we can't use to compete with them?
Facebook has been caught in recent DOJ hearings breaking the law with how they run their business, just as one example. They claimed under oath, previously, to not be doing X, and then years later there was proof they did exactly that.
https://youtu.be/7ZzxxLqWKOE?si=_FD2gikJkSH1V96r
A company's "word" means nothing, IMO. None of this makes sense, if I'm being honest. Unless you personally have a negotiated contract with the provider, can somehow be certain they are doing what they claim, and can later sue for damages, all of this is just crossing your fingers and hoping for the best.
There must be room for a Modal/Cloudflare/etc infrastructure company that focuses only on providing full-fledged computer environments specifically for AI with forking/snapshotting (pause/resume), screen access, human-in-the-loop support, and so forth, and it would be very lucrative. We have browser-use, etc, but they don't (yet) capture the whole flow.
When I'm using aider, after it makes a commit, what I do is immediately run git reset HEAD^ and then git diff (actually I use the GitHub Desktop client to see the diff) to evaluate what exactly it did and whether I like it. Then I usually make some adjustments, and only after that do I commit and push.
The secret sauce here seems like their new model, but I expect it to come to API at some point.
This seems to imply that software engineering as a profession has been quite mature and saturated for a while, to the point that a model can predict most of the output. Yes, yes, I know there are thousands of advanced algorithms and amazing systems in production. It's just that the market does not need millions of engineers with such advanced skills.
Unless we get yet another new domain, like cloud or the internet, I'm afraid the core value of software engineers (trailblazing for new business scenarios) will continue diminishing and being marginalized by AI. As a result, we get way less demand for our jobs, and many of us will either take lower pay or lose our jobs for extended periods.
Still super confusing, though!
I feel like companies working with and shipping LLMs would do well to remember that it's not just humans who get confused by this, but LLMs themselves... it makes for a painful time, sending off a request and noticing a third of the way into its reasoning that the model has gotten two things with almost-identical names confused.
I made one for GitHub Actions, but it's not as real-time and is 2 years old now: https://github.com/asadm/chota
A not open-source option this looks close to is also https://githubnext.com/projects/copilot-workspace (released April 2024, but I'm not sure it's gotten any significant updates since)
But they aren't moving nearly as fast as OpenAI. And it remains to be seen if first mover will mean anything.
(I'm trying something)
what would be an impressive program that an agent should be able to one-shot in one go?
This should be possible today and surely Linus would also see this in the future.
I can't say I am a big fan of neutering these paradigm-shifting tools according to one culture's code of ethics / way of doing business / etc.
One man's revolutionary is another's enemy combatant and all that. What if we need top-notch malware to take down the robot dogs lobbing mortars at our madmaxian compound?!
I wouldn't sweat it. According to its developers, Codex understands 'malicious software'; it has just been trained to say, "But I won't do that" when such requests are made. Judging from the recent past [1][2], getting LLMs to bypass such safeguards is pretty easy.
[1] https://hiddenlayer.com/innovation-hub/novel-universal-bypas... [2] https://cyberpress.org/researchers-bypass-safeguards-in-17-p...
Feels like Codex is for product managers to fix bugs without touching any developer resources. If so, that's insanely surprising!
But now who's going to do that work? Still engineers.
On the other hand, if your job was writing code at certain companies whose profits were based on shoving ads in front of people then I would agree that no one will care if it is written by a machine or not. The days of those jobs making >$200k a year are numbered.
I'm very interested.
In my experience ChatGPT and Gemini are absolutely terrible at these types of things. They are constantly wrong. I know I'm not saying anything new, but I'm waiting to personally experience an LLM that does something useful with any of the code I give it.
These tools aren't useless. They're great as search engines and pointing me in the right direction. They write dumb bash scripts that save me time here and there. That's it.
And it's hilarious to me how these people present these tools. It generates a bunch of code, and then you spend all your time auditing and fixing what is expected to be wrong.
That's not the type of code I'm putting in my company's code base, and I could probably write the damn code more correctly in less time than it takes to review for expected errors.
What am I missing?
That you are trying to use LLMs to create the giant, sprawling, feature-packed codebases that define the modern software landscape. What's being missed is that any one user might only utilize 5% of the code base on any given day. Software is written to accommodate every need every user could have in one package; then each user just uses the small slice that accommodates their specific needs.
I have now created 5 hyper narrow programs that are used daily by my company to do work. I am not a programmer and my company is not a tech company located in a tech bubble. We are a tiny company that does old school manufacturing.
To give a quick general example: Betty uses Excel to manage payroll. A list of employees, a list of wages, a list of hours worked (which she copies from the time-clock software's .csv export into Excel).
Excel is a few-million-LOC program and costs ~$10/mo. Betty needs maybe 2k LOC to do what she uses Excel for, something an LLM can produce easily: a Python GUI wrapper around a SQLite DB. And she would be blown away at how fast it is, and how it is written for her use specifically.
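A minimal sketch of what the core of Betty's narrow tool might look like (names, schema, and wages are invented for illustration; a real version would add the GUI layer and the CSV import):

```python
import sqlite3

# Hypothetical schema: employees with hourly wages, plus hours
# imported from the time-clock CSV.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, wage REAL);
    CREATE TABLE hours (employee_id INTEGER, week TEXT, hours REAL);
""")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [(1, "Betty", 25.0), (2, "Sam", 18.5)])
conn.executemany("INSERT INTO hours VALUES (?, ?, ?)",
                 [(1, "2025-W20", 40), (2, "2025-W20", 32)])

def payroll(week):
    """Return (name, gross pay) for every employee for the given week."""
    return conn.execute("""
        SELECT e.name, ROUND(e.wage * h.hours, 2)
        FROM employees e JOIN hours h ON h.employee_id = e.id
        WHERE h.week = ?
        ORDER BY e.name
    """, (week,)).fetchall()

for name, pay in payroll("2025-W20"):
    print(f"{name}: ${pay}")
```

Wrap that query in a small Tkinter window and it is exactly the kind of hyper-narrow, single-purpose program the comment describes.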
How software is written and how it is used will change to accommodate LLMs. We didn't design cars to drive on horse paths, we put down pavement.
But yes, I hope we get away from the giant conglomeration of everything, ESPECIALLY the reality of people doing 90% of their business inside a Google Chrome window. Move towards the UNIX philosophy of tiny single-purpose programs.
IMO LLMs are actually pretty good at writing small scripts. First, it's much more common for a small script to be in the LLM's training data, and second, it's much easier to find and fix a bug. So the LLM actually does allow a non-programmer to write correct code with minimal effort (for some simple task), and then they are blown away thinking writing software is a solved problem. However, these kinds of people have no idea of the difference between a hundred line script where an error is easily found and isn't a big deal and a million line codebase where an error can be invisible and shut everything down.
Worst of all is when the two sides of tech-giants and non-programmers meet. These two sides may sound like opposites but they really aren't. In particular, there are plenty of non-programmers involved at the C-level and the HR levels of tech companies. These people are particularly vulnerable to being wowed by LLMs seemingly able to do complex tasks that in their minds are the same tasks their employees are doing. As a result, they stop hiring new people and tell their current people to "just use LLMs", leading to the current hiring crisis.
In short, the tools work. I've built things 10x faster than doing it from scratch. I also have a sense of what else I'll be able to build in a year. I also enjoy not having to add cycles to communicate with external contributors -- I think, then I do, even if there's a bit of wrestling. Wrangling with a coding agent feels a bit like "compile, test, fix, re-compile". Re-compiling generally got faster in subsequent generations of compiler releases.
My company is building internal business functions using AI right now. It works too. We're not putting that stuff in front of our customers yet, but I can see that it'll come. We may put agents into the product that let them build things for themselves.
I get the grumpiness & resistance, but I don't see how it's buying you anything. The puck isn't underfoot.
A recent example from a C# project I was working in. The project used builder classes that were constructed according to specified rules, but all of these builders were written by hand. I wanted to automatically generate these builders, and not using AI, just good old meta-programming.
Now I knew enough to know that I needed a C# source generator, but I had absolutely no experience with writing them. Could I have figured this out in an hour or two? Probably. Did I instead write a prompt in less than five minutes and get a source generator that worked correctly on the first try? Also yes. I then spent some time cleaning up that code and understanding the API it uses to hook into everything, and was done in half an hour, and still learnt something from it.
You can make the argument that this source generator is in itself "boilerplate", because it doesn't contain any special sauce, but I still saved significant time in this instance.
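The generate-builders-from-a-spec idea isn't C#-specific. As a rough, invented analog (not the commenter's actual generator), the same trick can be sketched in Python, deriving a fluent builder from a dataclass at runtime:

```python
from dataclasses import dataclass, fields

def make_builder(cls):
    """Generate a fluent builder for a dataclass: one chainable
    with_<field>() setter per field, plus build()."""
    class Builder:
        def __init__(self):
            self._values = {}
        def build(self):
            return cls(**self._values)
    for f in fields(cls):
        # Default arg _name pins the field name at definition time,
        # avoiding Python's late-binding closure pitfall.
        def setter(self, value, _name=f.name):
            self._values[_name] = value
            return self  # chainable
        setattr(Builder, f"with_{f.name}", setter)
    Builder.__name__ = f"{cls.__name__}Builder"
    return Builder

@dataclass
class Order:
    customer: str
    quantity: int

OrderBuilder = make_builder(Order)
order = OrderBuilder().with_customer("ACME").with_quantity(3).build()
print(order)  # Order(customer='ACME', quantity=3)
```

In C# the equivalent runs at compile time via a source generator, but the "derive the boilerplate from the type" principle is identical.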
Personally, I've always operated in a codebase in a way that I _need_ to understand how things work to be productive and make the right decisions. I operate the same way with AI: every change is carefully reviewed, and if it's dumb, I make it redo it and explain why it's dumb. If it gets caught in a loop, I reset the context and try to reframe the problem. Overall, I'm definitely more productive, but if you truly want to be hands-off, you're in for a very bad time. I've been there.
Lastly, some codebases don't work well with AI. I was working on a problem that was a bit more novel/out there, and no model could solve it; they just yapped endlessly about complex, very smart-sounding solutions that did absolutely nothing. I went all the way to o1-pro. The craziest part to me was that across Claude, DeepSeek, and OpenAI, the models used the same specific vernacular for this particular problem, which really highlights how a lot of them are just a mish-mash of the same underlying architecture/internet data. Some of these models use responses from other models as training data, which to me is like incest: you won't get good genetic results.
But I have done larger tasks (write device drivers) using gemini.
I can't even fathom how frustrating such tools would be with poorly written confusing Clojure code using some niche dependency.
That being said, I can imagine a whole class of problems this could succeed at very well and provide value for. Then again, the type of problems I feel these systems could get right 99% of the time are problems a skilled developer could fix in minutes.
This project is not your typical Webdev project, so maybe that's an interesting case-study. It takes a C-API spec in JSON, loads and processes it in Python and generates a C-library that turns a UI marked up YAML/JSON into C-Api calls to render that UI. [1]
The result is pretty hacky code (by my design; I can't/won't use FFI) that's 90% written by Gemini 2.5 Pro Pre/Exp, but it mostly worked. It's around 7k lines of Python that generate a 30-40k LOC C library from a JSON LVGL API spec to render an LVGL UI from YAML/JSON markup.
I probably spent 2-3 weeks on this. I might have been able to do something similar myself in maybe 2x the time, but this was about 20% of the mental overhead/exhaustion it would have taken me otherwise. On the other hand, I would have had a much better understanding of the tradeoffs and maybe a slightly cleaner architecture if I had written it. But there's also a chance I would have gotten lost in some of the complexity and never finished (especially since it's a side project that probably no one else will ever see).
What worked well:
* It mostly works(!). Unlike previous attempts with Gemini 1.5 where I had to spend about as much or more time fixing than it'd have taken me to write the code. Even adding complicated features after the fact usually works pretty well with minor fixing on my end.
* Lowers mental "load" - you don't have to think so much about how to tackle features, refactors, ...
Other stuff:
* I really did not like Cursor or Windsurf - I half-use VSCode for embedded hobby projects but I don't want to then have another "thing" on top of that. Aider works, but it would probably require some more work to get used to the automatic features. I really need to get used to the tooling, not an insignificant time investment. It doesn't vibe with how I work, yet.
* You can generate a *significant* amount of code in a short time. It doesn't feel like it's "your" code though, it's like joining a startup - a mountain of code, someone else's architecture, their coding style, comment style, ... and,
* there's this "fog of code", where you can sorta bumble around the codebase but don't really 100% understand it. I still have mid/low confidence in the changes I make by hand, even 1 week after the codebase has largely stabilized. Again, it's like getting familiar with someone else's code.
* Code quality is ok but not great (and partially my fault). Probably depends on how you got to the current code - ie how clean was your "path". But since it is easier to "evolve" the whole project (I changed directions once or twice when I sort of hit a wall) it's also easier to end up with a messy-ish codebase. Maybe the way to go is to first explore, then codify all the requirements and start afresh from a clean slate instead of trying to evolve the code-base. But that's also not an insignificant amount of work and also mental load (because now you really need to understand the whole codebase or trust that an LLM can sufficiently distill it).
* I got much better results with very precise prompts. Maybe I'm using it wrong, ie I usually (think I) know what I want and just instruct the LLM instead of having an exploratory chat but the more explicit I am, the more closely the output is to what I'd like to see. I've tried to discuss proposed changes a few times to generate a spec to implement in another session but it takes time and was not super successful. Another thing to practice.
* A bit of a later realization, but modular code and short, self-contained modules are really important though this might depend on your workflow.
To summarize:
* It works.
* It lowers initial mental burden.
* But to get really good results, you still have to put a lot of effort into it.
* At least right now, you will still eventually have to put in the mental effort at some point. Normally that effort is front-loaded: you do the design and think about it hard up front. With AI it's reversed: the AI does all the initial work, but it becomes harder to cope with the codebase once it reaches a certain complexity. Eventually you will have to understand it, even if just to instruct the LLM to make the exact changes you want.
[1] https://github.com/thingsapart/lvgl_ui_preview
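As a hedged illustration of the spec-to-code approach described above (drastically simplified versus the real project, with an invented spec shape; the actual generator handles type marshalling, registries, and far more):

```python
import json

# Invented, tiny stand-in for an API spec like the LVGL JSON the project uses.
SPEC = json.loads("""
{"functions": [
  {"name": "lv_label_create", "args": ["lv_obj_t *parent"], "ret": "lv_obj_t *"},
  {"name": "lv_label_set_text", "args": ["lv_obj_t *obj", "const char *text"], "ret": "void"}
]}
""")

def emit_wrappers(spec):
    """Emit one thin C wrapper per spec'd function, as a string of C source."""
    lines = []
    for fn in spec["functions"]:
        params = ", ".join(fn["args"])
        # Parameter name is the last token of each arg, minus pointer stars.
        call_args = ", ".join(a.split()[-1].lstrip("*") for a in fn["args"])
        lines.append(f'{fn["ret"]} gen_{fn["name"]}({params}) {{')
        if fn["ret"] != "void":
            lines.append(f'    return {fn["name"]}({call_args});')
        else:
            lines.append(f'    {fn["name"]}({call_args});')
        lines.append("}")
    return "\n".join(lines)

print(emit_wrappers(SPEC))
```

The real pipeline then maps YAML/JSON UI markup onto these generated entry points, which is where the other tens of thousands of generated lines come from.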
Lots missing here, but I had the same issues; it takes iteration and practice. I use Claude Code in terminal windows, and a text expander to save explicit reminders that I have to inject regularly, because Anthropic obscures access to system prompts.
For example, I have 3 to 8 paragraph long instructions I will place regularly about not assuming, checking deterministically etc. and for most things I have the agents write a report with a specific instruction set.
I pop the instructions into text expander so I just type - docs when saying go figure this out, and give me the path to the report when done.
They come back with a path, and I copy it and search for it in VS Code.
It opens as an .md, and I use preview mode; it's similar to a Google Doc.
And I'll review it. Always, things will be wrong: tons of assumptions, failures to check deterministically, etc. But I see that in the doc and have it fix it, correct misunderstandings, and update the doc until it's perfect.
From there I'll say: add a plan in a table with a status for each task based on this (another text expander snippet with instructions).
And when that's 100% right, I'll say implement and update as you go. The "update as you go" forces it to recognize and remember the scope of the task.
Greatest points of failure in the system is misalignment. Ethics teams got that right. It compounds FAST if allowed. you let them assume things, they state assumptions as facts, that becomes what other agents read and you get true chaos unchecked.
I started rebuilding claude code from scratch literally because they block us from accessing system prompts and I NEED these agents to stop lying to me about things that are not done or assumed, which highlights the true chaos possible when applied to system critical operations in governance or at scale.
I also built my own tool like codex for managing agent tasks and making this simpler, but getting them to use it without getting confused is still a gap.
Let me know if you have any other questions. I am performing the work of 20 engineers as of today: I rewrote 2 years of back-end code, which had required two full-time engineers, in 4 weeks by myself with this system... so I am, I guess, quite good at it.
I need to push my edges further into this latest tech, have not tried codex cli or the new tool yet.
It's a pain, but it works.
Even with TDD it will hallucinate the mocks without management, and hallucinate the requirements. Each layer has to be checked atomically, but the text expander snippets, done right, can get it close to 75% right.
My main project faces 5,000 users, so I can't let the agents run freely, whereas with isolated projects in separate repos I can let them run more freely, then review in GitKraken before committing.
You can customize the system prompts, baseline propmts, and models used for every single mode and have as many or as few as you want.
I imagine many engineers are like myself in that they got into programming because they liked tinkering and hacking and implementation details, all of which are likely to be abstracted over in this new era of prompting.
We had to tinker piece by piece to build a miniature castle. Over many hours.
Now I can tinker concept by concept, and build much larger castles, much faster. Like waving a wand, seeing my thoughts come to fruition in near real time.
No vanity lost in my opinion. Possibly more to be gained.
See, that never was the purpose. Going bigger and faster, towards what exactly? Chaos? By the way, we never managed to fully tame manual software development by trained professionals, and now we expect Shangri-La by throwing everything and the kitchen sink into giant inscrutable matrices, this time by amateurs as well. I'm sure this will all turn out very well and very, very productive.
If you derive enjoyment from actually assembling the castle, you lose out on that by using the wand that makes it happen instantly. Sure, the wand's castles may be larger, but you don't put a Lego castle together for the finished product.
people who want to make software that enables people to accomplish [task] will get the software they need quicker.
> without sacrificing quality
Right..
> it's your responsibility to use that tool
Again, it's actually not. It's my responsibility to do my job, not to make my boss's (or his boss's) car nicer. I know that's what we all know will create "job security", but let's not conflate these things. My job is to do my end of the bargain; my boss's job is paying me for doing that. If he deems it necessary to force me to use AI bullshit, I will of course, but it is definitely not my responsibility to do so on my own initiative.
This sparked a thought about how a large part of the job is often the work needed to demonstrate impact. I think this aspect is often overlooked by some of the good engineers not yet taking advantage of the AI tooling. LLM loops may not yet be good enough to produce shippable code by themselves, but they sure are capable of reducing the overhead of these up-and-out communication tasks.
For example, in the last month or so, I added a job queue plugin. The ability to run multiple tasks that they demoed today is quite similar. The issue I ran into with users is that without Enterprise plans, complex tasks run into rate limits when trying to run concurrently.
So I am adding an ability to have multiple queues, with each possibly using different models and/or providers, to get around rate limits.
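A minimal sketch of how multiple provider-specific queues could dodge per-provider rate limits (all names invented; the plugin's real implementation surely differs):

```python
from collections import deque
from itertools import cycle

class MultiQueue:
    """Round-robin tasks across per-provider queues so one provider's
    rate limit doesn't stall everything else."""
    def __init__(self, providers):
        self.queues = {p: deque() for p in providers}
        self._order = cycle(providers)

    def submit(self, provider, task):
        self.queues[provider].append(task)

    def next_task(self):
        # Try each provider at most once per call, skipping empty queues.
        for _ in range(len(self.queues)):
            p = next(self._order)
            if self.queues[p]:
                return p, self.queues[p].popleft()
        return None

mq = MultiQueue(["openai", "anthropic"])
mq.submit("openai", "refactor module A")
mq.submit("anthropic", "write tests for B")
print(mq.next_task())  # ('openai', 'refactor module A')
print(mq.next_task())  # ('anthropic', 'write tests for B')
```

A real version would also track per-provider rate-limit headers and back off a single queue without blocking the others.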
By the way, my system has features that are somewhat similar not only to this tool they are showing but also things like Manus. It is quite rough around the edges though because I am doing 100% of it myself.
But it is MIT Licensed and it would be great if any developer on the planet wanted to contribute anything.
There's so many of these "vibe coding" tools and there has to be real engineering rigor at some point. I saw them demo "find the bug" but the bugs they found were pretty superficial and thats something we've seen in our internal benchmark from both Devin and Cursor. A lot of noise and false positives or superficial fixes.
sigh
I know right
Also no human to contact for support... tempted to cancel the sub, lol. I'll give them 24h.
I just enabled it under Settings > Connectors > GitHub,
hoping that makes it work.
... still doesn't work. Is it geo-restricted, maybe? idk
≠ available now to all pro users
Every -big- release they gatekeep something to Pro. I pay for it like every 3 months, then cancel after the high.
when will i learn