Developing with GitHub Copilot Agent Mode and MCP

87 miltonlaxer 60 6/30/2025, 8:47:28 PM austen.info ↗

Comments (60)

skydhash · 11h ago

And again, the most convoluted setup for development with an example that fails to demonstrate why you should adopt such practice. It’s like doing a GDB demo with an hello world program. Or doing Linux From Scratch to show how you can browse the web.

The goal of software engineering is not to write code faster. Coding is itself a translation task (and a learning workflow, as you can’t keep everything in your head). What you want is the power of decision, and better decision can be made with better information. There’s nothing in the setup that helps with making decision.

There are roughly six steps in software engineering, done sequentially and iteratively. Requirements gathering to shape the problem, Analysis to understand it, Design to come up with a solution, Coding to implement it, Testing to verify the solution, and Maintenance to keep the solution working. We have methods and tooling that help with each, giving us relevant information based on important parameters that we need to decide upon.

LLMs are example generators. Give it a prompt and it will gives the answer that fits the conversation. It’s an echo chamber powered by a lossy version of the internet. Unlike my linting tool which will show me the error when there’s one and not when I tell it to.

ADDENDUM

It's like an ivory tower filled with yes-men and mirrors that always reply "you're the fairest of them all". My mind is already prone to lie to itself. What I need most is tooling that is not influenced by what I told it, or what others believe in. My browser is not influencing my note taking tool, telling it to note down the first two results it got from google. My editor is not telling the linter to sweep that error under a virtual rug. And QA does not care that I've implemented the most advanced abstraction if the software does not fit the specs.

jcelerier · 10h ago

> The goal of software engineering is not to write code faster

That just really depends on your situation. Here's a case I had just last week: we had artists in residency who suddenly showed up with a new, expensive camera that didn't have any easy to use driver but requires the use of their huge and bulky custom SDK.

Claude whipped a basic working c++ proprietary-camera-sdk-to-open-video-sharing-protocol in, what, 2 minutes? From the first go with a basic prompt? Without that it'd have been at least a couple days of development, likely a day just to go through the humongous docs -- except I had at most two hours to put on this. And I already have experience doing exactly this, having written software that involves realsenses, orbbec, leapmotion, Kinect, and all forms of weird cameras that require the use of their c++ SDK.

So the artists would just not be able to do their residency the way they wanted because they only have 3 days on-site to work too.

Or I'd have spent two days for some code that is very likely to only ever being used once, as part of this residency.

Thus in my line of work, being able to output code that works, faster than humans, is absolutely game changer - this situation I'm describing is not the exception, it's pretty much a weekly occurrence.

skydhash · 10h ago

> Claude whipped a basic working c++ proprietary-camera-sdk-to-open-video-sharing-protocol in, what, 2 minutes? From the first go with a basic prompt? Without that it'd have been at least a couple days of development, likely a day just to go through the humongous docs

That's basically what I said. They are example generators. Their creators have not published the source of the data that goes in their training so we can assume that everything that is accessible from the web (and now from places that use their tools) was used.

So if you're already know the domain to provide the right keywords, and can judge the output to see if it's good enough, it's going to be fine. Especially, as you've said, it's something that you're used to do. But do you need the setup mentioned in TFA?

Most software engineering tasks involved more than getting some basic prototype working. After the 80% work done by the prototype, there's the other 80% to have reliable code. With LLMs, you're stuck with the first 80%, and that already require someone experienced to get there.

stpedgwdgfhgdd · 7h ago

It definitely takes a lot of experience writing code with a LLM. Like a junior engineer it makes tons of (small) mistakes. It takes years of practice to detect the LLM is introducing small bugs that will reveal themselves only after extensive testing or running in prod.

It will be interesting to see how beginning developers will deal with these bugs as they did not write the code and do not have a mental model of the code. Will quality drop? Perhaps some can be compensated by letting the LLM do extensive testing.

leetrout · 10h ago

Similar anecdata:

I was writing some automated infra tests with Terraform and Terratest and I wanted to deploy to my infra. My tests are compiled into a binary and shipped to ECS Fargate as an image to run.

Instead of doing docker in docker to pull and push my images and before googling for an existing lib for managing images directly I asked Claude to write code to pull the layer tarballs from docker hub and push them to my ECR. It did so flawlessly and even knew how to correctly auth to dockerhub with their token exchange on the first try.

I glanced at the code and surmised it would have taken me an hour or two to write and test as I read the docs on the APIs.

I am sure there is a lib somewhere that does this but even that would have likely taken more time than the code gen I got.

No comments yet

stpedgwdgfhgdd · 7h ago

“The goal of software engineering is not to write code faster”

Writing proper code including tests and refactorings takes substantial time.

It is definitely worth it to do this faster, if only to get faster feedback to go back to the first phase; requirements and analysis.

I have experienced this myself, using CC it took me a few hours less to realise i was on the wrong track.

skydhash · 7h ago

Requirements are filters for the set of implementations. The only feedback is the count and the nature of the results. And what you usually do is to either abandoning it or restricting it further. Because the source of the requirements is the business domain which exist outside the technical domain.

Selecting one of the implementation over the other is design, aka making decisions. Sometimes you have to prototype it out to where which parameters is the best. And sometimes a decision can revert an earlier one and you have to investigate the impact radius of that change.

But coding is straightforward translation. The illusion of going faster is that we forego making decisions. Instead we're hoping that the agent makes the correct ones based on some general direction, forgetting than an inch deviation can easily turn into a mile error. The hopeful things would have been an iteration, adding the correct decisions and highlighting the bad ones to avoid. But no one have that kind of patience. And those that use LLMs often finish with a "LGTM" patch.

The normal engineering is to attain a critical mass of decisions and turns that immediately to formal notation which is unambiguous. Then we validate with testing if that abrupt transformation was done properly. But all the decisions were made with proper information.

alterom · 10h ago

Oh, so Claude in this case was a bandaid over a communication problem (the artists not getting the memo about not suddenly showing up with new equipment that you have to support, with no prior discussion, warning, or heads-up).

It absolutely is a game changer.

Now the game for you is to deal with whatever equipment they throw at you, because nobody is going to bother consulting you in advance.

Just use AI, bro.

Good luck next time they show up with gear that Claude can't help you with. Say, because there's no API in the first place, and it's just incompatible the existing flow.

>So the artists would just not be able to do their residency the way they wanted because they only have 3 days on-site to work too.

That, to me, sounds like the good outcome for everyone involved.

It would have been their problem, which they were perfectly capable of solving by suddenly showing up with supported equipment on the job site.

Wanting you to deal with their "suddenly showing up" is not the right thing to want.

If want that, they shouldn't be able to do the residency the way they want.

Saying this as a performing musician: verifying that my gear will work at the venue before the performance is my responsibility, not the sound tech's. Ain't their job to have the right cables or power supplies. I can't fathom showing up with a setup and simply demanding to make it work.

IDK what kind of divas you work with, but what you described is a solid example of a situation when the best tool is saying "no", not using Claude.

The fact that it's a weekly occurrence is an organizational issue, not a software one.

And please — please don't use a chatbot to resolve that one either.

outofpaper · 9h ago

Chill Winston! Artists in residency are not know for being technical. They are not divas demanding support but individuals who are supposed to have access to resources, space, and support that allows them to develop as artists.

The spaces they are working with often benefit from having talented creatives but this isn't a performance gig we're talking about.

darkwater · 9h ago

I think it was just an example. The real meat behind the example is what someone said today and that I'm now stealing: AI helps with a faster tech debt generation. Having $anyone asking for $anything in a hurry it's probably the #1 cause of tech debt. If now with AI all the answers are going to be "yes, sure!" well, tech debt will go up.

jcelerier · 8h ago

the answer was "yes, sure" before AI too. the difference is whether you're going to do overtime staying up to 2AM trying to make things work.

darkwater · 8h ago

Not in my case. But overstaying, at least here in Europe, is not seen as a good thing, while using AI starts to be encouraged by CTOs and CEOs.

jcelerier · 5h ago

as someone from France this has really, really not been my experience when I think of how many times I had to pull on all-nighters for work because of $deadline.

Espressosaurus · 7h ago

That's a good way of putting it. Especially in inexperienced users' hands.

I'm working with a bunch of autonomy guys (think PHD algorithms people) and ChatGPT lets them write code. Which is good.

Except the code is hot garbage. It works for the thing they're trying to do, but it doesn't handle errors, it's not extendable, it's not maintainable, and they'll fight you when you tell them how to make it any of the above because they didn't write it and don't really understand how it works.

There's no management buy-in--startup, so velocity matters more than anything--so I've resorted to letting them have their working garbage patch and I just let them deal with the consequences. Frustrating, but quality doesn't matter sometimes.

alterom · 8h ago

>Artists in residency are not know for being technical.

They use the technology, they make the decisions about technology (like which cameras to use), and they literally can't do their work without the tech —

— they better be at list a little "technical" for their own sake.

The idea of knowing your gear has nothing to do with artistry. This is my rifle. There are many like it, but this one is mine.

And if they're "not known for being technical", they better consult someone who is (that is, you) before making decisions about technology (like choosing equipment for the job).

The problem here wasn't their lack of technical expertise.

The problem was the artists suddenly showing up one day with strange equipment.

The solution here is asking the artists to check in with you about gear as soon as they are signed up, and talking to you BEFORE getting new gear.

That's it. They don't have to be more technical. They need to help you help them.

You said it yourself that this would not have been a problem for you without Claude if you had more time to deal with it.

The problem was a lack of advance warning. And it's a solvable problem.

These artists might not have been divas before that residency.

But the policy of not having to consult the tech before bringing in gear and expecting it to just work, on zero notice, while remaining "not technical" — that turns them into divas.

If they are not making these demands, who does?

If that's you, that sounds rather masochistic. In which case using Claude to dull the pain seems rather counterproductive in the short term.

Seriously though, Claude isn't a solution for setting expectations and communication, which is the real problem here.

You could've still used Claude to write that driver. But you could have also had the several days to do it, and it wouldn't be an example of Claude saves the day any more.

Or — better yet — you could've given the artists a chance to reconsider their gear choices.

There might not have been a reason why they picked that gear in the first place instead of one you could've worked with easily (did you ask?).

You didn't enable the artists to do their job. You enabled them to make uninformed gear decisions.

That's not helping them in the long term; it's just setting them up for failure. Claude can't code around an API that isn't there, or a hardware incompatibility.

And that's before we get to the most important aspect of art: limitation breeds creativity. This sort of babysitting isn't helping the art either.

If the goal is to help them develop as artists, then it seems you're accomplishing the opposite.

Have you at least told them that they've created a problem for you? They'd want to know that. People usually don't want to create problems for others.

As for me — I'm chill AF in the first place; and I'm not against using chatbots to solve problems — it's just that I'm not convinced that Claude is the right LLM to use here.

Perhaps asking ChatGPT about this situation, and how to talk to artists (and shape the equipment policy for your space) would do much more impact on the problem you said is recurring on a weekly basis than using Claude to put more bandaids on a pile of bandaids.

jcelerier · 7h ago

> The problem was the artists suddenly showing up one day with strange equipment.

again, this is not a problem but a basic expectation when you do media arts residencies. Just like it's an expectation when you work in the event industry that you're going to get gigs for making something in the morning for an event happening in the evening.

CPLX · 9h ago

Excellent point! His approach worked in practice, but it would never work in a theoretical situation where proving a point is more important than just solving the problem, so it's obviously worthless.

alterom · 9h ago

"Use the cameras you used last time for this gig, we'll work out something for the next one"

If uttering a sentence like this is a "theoretical" solution for you, I don't know what to tell you, except that you're not going to have a good time in any job until you learn the practicality of saying "no" the hard way.

And if you're living your life where the only "solution" to any problem created by stupidity, miscommunication, and bad planning of other people is saying "yes, sir!” and enabling them to do more if it —

— best of luck growing out of serfdom one day.

diggan · 7h ago

> And if you're living your life where the only "solution" to any problem created by stupidity, miscommunication, and bad planning of other people is saying "yes, sir!” and enabling them to do more if it

I'm not sure if you've ever worked in a "creative" environment filled with artists, either professionally or not, but the goal is almost never to come up with a solution that is technological superior or even "correct", it just have to work for the session, maybe two.

If you're a technical contributor to such an environment, then your "job" is basically to support their whims, that's how the project moves forward. Saying "no" when an artist approaches you isn't really in the job requirements, but of course you can steer them into a different direction, but ultimately they steer the ship, and they have to.

But ultimately your job in those cases are to support the vision of someone else, often by any means necessary and fast too.

jcelerier · 8h ago

> If uttering a sentence like this is a "theoretical" solution for you, I don't know what to tell you, except that you're not going to have a good time in any job until you learn the practicality of saying "no" the hard way.

but no one wants to say no here. the thing is going to happen one way or another, if only because everyone involved is passionate about making sure artists can reach their dreams. if we can't do this, we just close shop.

mtkd · 10h ago

Unnecessarily critical take on a quality write-up

Much of the criticism of AI on HN feels driven by devs who have not fully ingested what is going with MCP, tools etc. right now as not looked deeper than making API calls to an LLM

danielbln · 10h ago

OP's comment also seems to be firmly stuck in 2023 when you'd prompt ChatGPT or whatever. The fact that LLMs today, when strapped into an agentic harness, can do or help with all of these things (ideation, architecture, use linters, validate code, evaluate outputs, and a million other things) seems to elude them.

skydhash · 9h ago

Dothey do requirement gatherings? Like talking to stakeholder and getting their input of what the feature should, translating business jargon to domain terms?

No.

Do they do the analysis? Removing specs that conflict with each other, validating what's possible in the technical domain and in the business domain?

No.

Do they help with design? Helping coming up with the changes that impact the current software the least, fitting in the current architecture and be maintainable in the feature.

All they do is pattern matching on your prompt and the weights they have. Not a true debate or weighing options based on the organization context.

Do they help with coding?

A lot if you're already experienced with the codebase and the domain. But that's the easiest part of the job.

Do they help with testing? Coming up with tests plan, writing test code, running them, analysing the output of the various tools and producing a cohesive report of the defects?

I don't know as I haven't seen any demo on that front.

Do they help with maintenance? Taking the same software and making changes to keep it churning on new platforms, through dependencies updates and bug fixes?

No demo so far.

IanCal · 7h ago

Why do you think any of these should be a challenge for, say, O3/O3 pro?

You pretty much just have to ask and give them access for these things. Talking to a stakeholder and translating jargon and domain terms? Trivial. They can churn through specs and find issues, none of that seems particularly odd to ask of a decent LLM.

> Do they help with testing? Coming up with tests plan, writing test code, running them, analysing the output of the various tools and producing a cohesive report of the defects?

This is pretty standard in agentic coding setups. They'll fix up broken tests, and fix up code when it doesn't pass the test. They can add debug statements & run to find issues, break down code to minimal examples to see what works and then build back up from there.

> Do they help with maintenance? Taking the same software and making changes to keep it churning on new platforms, through dependencies updates and bug fixes?

Yes - dependency updates is probably the easiest. Have it read the changelogs, new api docs and look at failing tests, iterate to have it pass.

These things are progressing surprisingly quickly so if your experience of them is from 2024 then it's quite out of date.

thunky · 8h ago

> Do they do requirement gatherings? Like talking to stakeholder and getting their input of what the feature should, translating business jargon to domain terms? No.

Why not? This is a translation problem so right up its alley.

Give it tool access to communicate directly with stakeholders (via email or chat) and put it in a loop to work with them until the goal is reached (stakeholders are happy). Same as a human would do.

And of course it will still need some steering by a "manager" to make sure it's building the right things.

skydhash · 6h ago

> Why not? This is a translation problem so right up its alley.

Translating a sign can be done with a dictionary. Translating a document is often a huge amount of work due to cultural difference, so you can not make a literal translation of sentences. And sometimes terms don't map to each other. That's when you start to use metaphors (and footnotes).

Even in the same organization, the same term can mean different things. As humans we don't mind when terms have several definitions and the correct one is contextual. But software is always context free. Meaning everything is fixed at its inception and the variables govern flow, not the instruction themselves ("eval" instruction (data as code) is dangerous for a reason).

So the whole process is going from something ambiguous and context dependent, to something that isn't. And we do this by eliminating incorrect definitions. Tell me how LLMs is going to help with that when it has no sense of what correct and what it is not (aka judging truthness).

thunky · 6h ago

> Tell me how LLMs is going to help with that when it has no sense of what correct and what it is not (aka judging truthness).

Same way it works with humans: someone tells it what "correct" means until it gets it right.

steveklabnik · 7h ago

> Dothey do requirement gatherings?

This is true, but they have helped prepare me with good questions to ask during those meetings!

> Do they do the analysis? Removing specs that conflict with each other, validating what's possible in the technical domain and in the business domain?

Yes, I have had LLMs point out missing information or conflicting information in the spec. See above about "good questions to ask stakeholders."

> Do they help with design? Helping coming up with the changes that impact the current software the least, fitting in the current architecture and be maintainable in the feature.

Yes.

I recently had a scenario where I had a refactoring task that I thought I should do, but didn’t really want to. It was cleaning up some error handling. This would involve a lot of changes to my codebase, nothing hard, but it would have taken me a while, and been very boring, and I’m trying to ship features, not polish off the perfect codebase, so I hadn’t done it, even though I still thought I should.

I was able to ask Claude “hey, how expensive would this refactoring be? how many methods would it change? What’s the before/after diffs on a simple affected place, and one of the more complex affected places look like?

Previously, I had to use my hard-won human intuition to make the call about implementing this or not. It’s very fuzzy. With Claude, I was able to very quickly quantify that fuzzy notion into something at least close to accurate: 260 method signatures. Before and after diffs look decent. And this kind of fairly mechanical transformation is something Claude can do much more quickly and just as accurately as I can. So I finally did it.

That I shipped the refactoring is one point. But the real point is that I was able to quickly focus my understanding of the problem, and make a better, more informed decision because of it. My gut was right. But now I knew it was right, without needing to actually try it out.

> Not a true debate or weighing options based on the organization context.

This context is your job to provide. They will take it into account when you provide it.

> Do they help with coding?

Yes.

> Do they help with testing? Coming up with tests plan, writing test code, running them, analysing the output of the various tools and producing a cohesive report of the defects?

Yes, absolutely.

> Do they help with maintenance? Taking the same software and making changes to keep it churning on new platforms, through dependencies updates and bug fixes?

See above about refactoring to improve quality.

stpedgwdgfhgdd · 7h ago

+1. Some refactorings are important but just not urgent enough compared to features. Letting CC do these refactorings makes quite a difference.

At least in the case of lot of automated test coverage and typed language (Go) so it can work independently efficiently.

diggan · 8h ago

I mean, if you "program" (prompt) them to do those stuff, then yeah, they'll do that. But you have to consider the task just like if you handed it over to a person with absolutely zero previous context, and explain what you need from the "requirements gathering", and how it should handle that.

None of the LLMs handle any of those things by themselves, because that's not what they're designed for. They're programmable things that output text, that you can then program to perform those tasks, but only if you can figure out exactly how a human would handle it, and you codify all the things we humans can figure out by ourselves.

skydhash · 8h ago

> But you have to consider the task just like if you handed it over to a person with absolutely zero previous context,

Which no one does. Even when hiring someone, there's the basic premise that they know how they should do the job (interns are there to learn, not to do). And then they are trained for the particular business context, with a good incentive to learn well and then do the job well.

You don't just suddenly wake up and find yourself at an unknown company being asked to code something for a jira task. And if you do find yourself in such situation, the obvious thing is to figure what's going on, not "Sure, I'll do it".

diggan · 8h ago

I don't understand the argument, I haven't said humans act like that, what I said is how you have to treat LLMs if you want to use it for things like that.

If you're somehow under the belief that LLMs will (or should) magically replace a person, I think you've built the wrong understanding of what LLMs are and what they can do.

skydhash · 7h ago

I interact with tools and with people. When with people, there's a shared understanding of the goal and the context (aka, alignment as some people like to called it). With tools, there's no such context needed. Instead I need reproducible results and clear output. And if it's something that I can automate, that it will follow my instructions closely.

LLMs are obviously tools, but their parameters space is so huge that's it's difficult to provide enough to ensure reliable results. With prompting, we have unreliable answers, but with agents, you have actions being made upon those reliable answers. We had that before with people copying and pasting from LLMs output, but now the same action is being automated. And then there's the feedback loop, where the agent is taking input from the same thing it has altered (often wrongly).

So it goes like this: Ambiguous query -> unrealiable information -> agents acting -> unreliable result -> unreliable validation -> final review (which are often skipped). And then the loop.

While with normal tools: Ambiguous requirement -> detailed specs -> formal code -> validation -> report of divergence -> review (which can be skipped) . There are issues in the process (which give us bugs) but we can pinpoint where we did wrong and fix the issue.

diggan · 2h ago

I'm sorry, I'm very lost here, are you responding to the wrong comment or something? Because I don't see how any of that is connected to the conversation from here on up?

skydhash · 2h ago

>>> But you have to consider the task just like if you handed it over to a person with absolutely zero previous context, and explain what you need from the "requirements gathering", and how it should handle that

The most similar thing is software. Which is a list of instructions we give to a computer alongside the data that forms the context for this particular run. Then it goes to process that data and gives us a result. The basic premise is that these instructions need to be formal so that they became context-free. The whole context is the input to the code, and you can use the code whenever.

Natural language is context dependent. And the final result depends on the participants. So what you want is a shared understanding so that instructions are interpreted the same way by every participant. Someone (or the LLM) coming in with zero context is already a failure scenario. But even with the context baked in every participant, misunderstandings will occur.

So what you want is formal notation which removes ambiguity. It's not as flexible as natural language or as expressive, but it's very good at sharing instructions and information.

danielbln · 9h ago

No they don't do requirements gathering, they also don't cook my food and wash my clothing. Some things are out of scope for an LLM.

Yes, they can do analysis, identify conflicting specs, etc. especially with a skilled human in the loop

Yes, they help with design, though this works best if the operator has sufficient knowledge.

The LLM can help significantly by walking through the code base, explaining parts of it in variable depth.

Yes, agentic LLMs can easily write tests, run them, validate the output (again, best used with an experienced operator so that anti-patterns are spotted early).

From your posts I gather you have not yet worked with a strong LLM in an agentic harness, which you can think of as almost a general purpose automation solution that can either handle, or heavily support most if not all of your points that you have mentioned.

troupo · 10h ago

This is the crypto discussion again.

"All our critics are clueless morons who haven't realised the one true meaning of things".

Have you once considered that critics have tried these tools in all these combinations and found them lacking in more ways than one?

diggan · 8h ago

The huge gap between the people who claim "It helps me some/most of the time" and the other people who claim "I've tried everything and it's all bad" is really interesting to me.

Is it a problem of knowledge? Is it a problem of hype that makes people over-estimate their productivity? Is it a problem of UX, where it's hard to figure out how to use these tools correctly? Is it a problem of the user's skills, where low-skilled developers see lots of value but high-skilled developers see no value, or even negative value sometimes?

The experiences seem so different, that I'm having a hard time wrapping my mind around it. I find LLMs useful in some particular instances, but not all of them, and I don't see them as the second coming of Jesus. But then I keep seeing people saying they've tried all the tools, and all the approaches, and they understand prompting, yet they cannot get any value whatsoever from the tools.

This is maybe a bit out there, but would anyone (including parent) be up for sending me a screen recording of exactly what you're doing, if you're one of the people that get no value whatsoever from using LLMs? Or maybe even a video call sharing your screen?

I'm not working in the space, have no products or services to sell, only curious is why this vast gap seemingly exists, and my only motive would be to understand if I'm the one who is missing something, or there are more effective ways to help people understand how they can use LLMs and what they can use them for.

My email is on my profile if anyone is up for it. Invitation open for anyone struggling to get any useful responses from LLMs.

skydhash · 8h ago

I think it's going to be personal. Because people define values in different ways, and the definition depends on the current context. I've used LLMs for things like shellscript, plotting with pyplot, explanations,... But always taking the output with a huge grain of salt. What I'm looking for is not the output itself, but the direction it can give me. But the only value is when I'm pressed for time and can't use a more objective and complete approach.

When you read the manual page for a program, or the documentation for a library, the things described always (99.99999...%) exist. So I can take it as objective truth. The description may be lacking, so I don't have a complete picture, but it's not pure fantasy. And if it turns out that it is, the solution is to drop it and turn back.

So when I act upon it, and the result comes back, I question my approach, not the information. And often I find the flaw quickly. It's slower initially, but the final result is something I have good confidence in.

diggan · 7h ago

> And often I find the flaw quickly. It's slower initially, but the final result is something I have good confidence in.

I guess what I'm looking for are people who don't have that experience, because you seem to be getting some value out of using LLMs at least, if I understand you correctly?

There are others out there who have tried the same approach, and countless of other approaches (self-declared at least) yet get 0 value from them, or negative value. These are the people I'm curious about :)

troupo · 7h ago

> The experiences seem so different, that I'm having a hard time wrapping my mind around it.

Because we only see very disjointed descriptions, with no attempt to quantify what we're talking about.

For every description of how LLMs work or don't work we know only some, but not all of the following:

- Do we know which projects people work on? No

- Do we know which codebases (greenfield, mature, proprietary etc.) people work on? No

- Do we know the level of expertise the people have? Is the expertise in the same domain, codebase, language that they apply LLMs to?

- How much additional work did they have reviewing, fixing, deploying, finishing etc.?

Even if you have one person describing all of the above, you will not be able to compare their experience to anyone else's because you have no idea what others answer for any of those bullet points.

And that's before we get into how all these systems and agents are completely non-deterministic, and works now may not work even 1 minute from now for the exact same problem.

And that's before we ask the question of how a senior engineer's experience with a greenfield project in React with one agent and model can even be compared to a bon-coding designer in a closed-source proprietary codebase in OCaml with a different agent and model (or even the same, because of non-determinism).

skydhash · 6h ago

> And that's before we get into how all these systems and agents are completely non-deterministic,

And that is the main issue. For some the value is reproducible results, for others, as long as they got a good result, it's fine.

It's like coin tossing. You may want tail all the time, because that's your chosen bet. You may prefer tail, but don't mind losing money if it's head. You may not interested in either, but you're doing the tossing and wants to know the techniques that works best for getting tail. Or you're just trying and if it's tail, your reaction is only "That's interesting".

The coin itself does not matter and the tossing is just an action. The output is what get judged. And the judgment will vary based on the person doing it.

So software engineering used to be the pursuit of tail of the time (by putting the coin on the ground, not tossing it). Then LLMs users say it's fine to toss the coin, because you'll get tail eventually. And companies are now pursuing the best coin tossing techniques to get tail. And for some, when the coin tossing gives tail, they only say "that's a nice toss".

troupo · 4h ago

> And companies are now pursuing the best coin tossing techniques to get tail.

With the only difference that the techniques for throwing coins can be verified by comparing the results of the tosses. More generally it's known as forcing https://en.wikipedia.org/wiki/Forcing_(magic)

What we have instead is companies (and people) saying they have perfected the toss not just for a specific coin, but for any objects in general. When it's very hard to prove that it's true even for a single coin :)

That said, I really like your comment :)

risyachka · 9h ago

>> is going with MCP, tools etc.

all these are just tools. there is nothing more to it. there is no etc.

hedgehog · 8h ago

I've used Copilot a bit and found it helpful for both coding and maintenance. My setup is pretty basic, and I only use it in places where the task is tedious and I am confident reviewing the diff or other output is sufficient. Things like:

"Refactor: We are replacing FlogSnarble with FloozBazzle. Review the example usage below and replace all usage across the codebase. <put an example>"

"In the browser console I see the error below. The table headers are also squished to the left while the row contents are squished to the right. Propose a fix. <pasted log and stack trace>."

"Restructure to early exit style and return an optional rather than use exceptions."

"Consolidate sliceCheese and all similar cheese-related utility functions into one file. Include doc comments noting the original location for each function."

By construction the resulting changes pass tests, come with an explainer outlining what was changed and why, and are open in tabs in VS Code for review. Meanwhile I can spend the time reading docs, dealing with house keeping tasks, and improving the design of what I'm doing. Better output, less RSI.

skydhash · 7h ago

The reason I tend not to use LLMs for these taks is that they are great for thinking moments. They're so mechanical that you tend to reflect instead. Also I use Vim and Emacs which are great for that type of works (fast navigation and good editing tools) and it's not as tedious as doing in editors like VS Code and Sublime (which are not great at editing). You can even concoct something with tmux, ripgrep/fzf, and nano that is better than VS Code at this.

kasey_junk · 9h ago

I use llms for each of those steps and modeling agent workflows following them has been very successful for me.

I think I’ve become disgruntled with the anti-llm crowd because every objection seems to boil down to “you are doing software engineering wrong” or “you have just described a workflow that is worse than the default”.

Stop for a minute and start from a different premise. There are people out there who know how to deliver software well, have been doing it for decades and find this tooling immensely productivity enhancing. Presume they know as much as you about the industry and have been just as successful doing it.

This person took the time to very specifically outline their workflow and steps in a clear and repeatable way. Rather than trying it and giving feedback in the same specific way you just said they have no idea what they are doing.

Try imagining that they do and it’s you who are not getting the message and see if you get your a different place.

skydhash · 8h ago

Criticism is not refutation. It's identifying flaws (subjectively or objectively) . I'm all for it if you can show me that those flaws don't exist or are inconsequential.

Workflows are personal and the only one who can judge them are the one who is paying for the work. At most, we can compare them in order to improve our own personal workflow.

My feedback is maybe not clear enough. But here are the main points:

- Too complicated in regards to the example provided, with the actual benefits for the complication not explained.

- Not a great methodology because the answer to the queries are tainted by the query. Like testing for alcohol by putting the liquid in a bottle of vodka. When I search for something that is not there, I expect "no results" or an error message. Not a mirage.

- The process of getting information, making decisions, and then acting is corrupted by putting it only at some irrelevant moments: Before even knowing anything; When presented with a restricted list of options with no understanding of the factors that play in the restriction; and after the work is done.

mvanbaak · 10h ago

I fear for the coming 2 to 3 generation of software engineers. Will they be able to handle problems if the AI is not available or is the source of the problem? Only time will tell.

mtkd · 10h ago

Same was said about dejanews, stackoverflow etc. and intellisense

alterom · 9h ago

Stack overflow didn't create a positive feedback loop where the solution to having to deal with an obscure, badly written, incomprehensible code base is creating an more incomprehensible sloppy code to glue it all together.

Neither did intellisense. If anything, it encouraged structuring your code better so that intellisense would be useful.

Intellisense does little for spaghetti code. And it was my #1 motivation to document the code in a uniform way, too.

The most important impact of tools is that they change the way we think and see the world, and this shapes the world we create with these tools.

When you hold a hammer, everything is a nail, as the saying goes.

And when you hold a gun, you're no longer a mere human; you're a gunman. And the solution space for all sorts of problems starts looking very differently.

The AI debate is not dissimilar to the gun debate.

Yes, both guns and the AI are powerful tools that we have to deal with now that they've been invented. And people wielding these tools have an upper hand over those who don't.

The point that people make in both debates that tends to get ignored by the proponents of these tools is that excessive use of the tools is exacerbating the very problem these tools are ostensibly solving.

Giving guns to all schoolchildren won't solve the problem of high school shootings — it will undeniably make it worse.

And giving the AI to all software developers won't solve the problem of bad, broken code that negatively impacts people who interact with it (as either users or developers).

Finally, a note. Both the gun technology and the AI have been continuously improved since their invention. The progress is undeniable.

Anyone who is thinking about guns in 1850 terms is making a mistake; the Maxim was a game changer. And we're not living in ChatGPT 2.0 times either.

But with all the progress made, the solution space that either tool created hasn't been changing in nature. A problem that wasn't solveable with a flintlock musket or several remains intractable for an AK-74 or an M16.

Improvements in either tech certainly did change the scale at which the tools were applied to resolve all sorts of problems.

And the first half of the 20th century, to this day, provides most of the most brilliant, masterful examples of using guns at scale.

What is also true is that the problems never went away. Nor did better guns made the lives of the common soldier any better.

The work of people like nurse Nightingale did.

And most of that work was that the solution to increasingly devastating battlefield casualties and dropping battlefield effectiveness wasn't giving every soldier a Maxim gun — it was better hygiene and living conditions. Washing hands.

The Maxim gun was a game changer, but it wasn't a solution.

The solution was getting out of the game with stupid prizes (like dying of cholera or typhoid fever). And it was an organizational issue, not a technological one.

* * * * *

To end on a good note, an observation for the AI doomers.

Genocides have predated the guns by millenia, and more people have died by the machete and the bayonet than by any other weapon even in the 20th century. Perhaps the 21st too.

Add disease and famine, and death by gun are a drop in the bucket.

Guns aren't a solution to violence, but they're not, in themselves, a cause of it on a large enough scale.

Mass production of guns made it possible to turn everyone into a soldier (and a target), but the absolute majority of people today have never seen war.

And while guns, by design, are harmful —

— they're also hella fun.

xena · 9h ago

But skydhash, if you don't nuke the anthill how can you be sure the ants are dead? Nuke it from orbit, it's the only way to be sure!

jpalomaki · 7h ago

Not sure how things are with Copilot, but with Claude Code a good alternative for MCP is in some cases old fashioned command line tools.

GitHub has gh, there's open source jira-cli, Cloudflare has wrangler and so on. No configuration needed, just mention on the agent doc that this kind of tool is available. Likely it will figure out the rest.

And if you have more complicated needs, then you can combine the commands, add some jq magic, put to package.json and tell agent to use npm run to execute it. Can be faster than doing it via multiple MCP calls.

skatanski · 7h ago

Really cool article. Personally I think the really cool bit about MCP is that you can very easily write your own server which can talk to the db or call various APIs. That server can run locally and be used by GitHub Copilot for answering questions and executing tasks. I also find it useful in a tight corporate environment where it’s more difficult to get a dedicated LLM API key. You can easily do POCa with what every dev has access to.

luckystarr · 11h ago

Playwright MCP is intriguing. I'll definitely give it a run today. Anybody got any tipps or gotchas?

never_inline · 7h ago

Can someone elucidate how using a full blown browser is improvisation over using say markitdown / pandoc / whatever? Given that most useful coding docs sites are static (made with sphinx or mkdocs or whatever)

Kostarrr · 10h ago

If they didn't change it, Playwright uses the aria (accessibility) representation for their MCP agent. It strongly depends on the web page whether or not that yields good results.

We at Octomind use a mix of augmented screenshots and page representation to guide the agent. If Playwright MCP doesnt work on you page, give our MCP a try. We have a free tier.

mohsen1 · 9h ago

I've had success using BrowserMCP

https://browsermcp.io

It really feels magical when the AI agent can browse and click around to understand the problem at hand

Also, sometimes an interactive command can stop agents from doing things. I wrote a small wrapper to always return so agents never stop from working

https://github.com/mohsen1/agentshell

WhitneyLand · 7h ago

What’s with Copilot “agent mode” anyway, how does it compare to using Claude Code or Gemini CLI?

jonstewart · 7h ago

Just yesterday I was reading a critique of MCP that specifically mentioned the GitHub MCP server as being harder to use (from model perspective) and requiring more tokens than having the agent execute git commands directly. I am surprised to see it listed here and also surprised to see two different web search servers and the time one. I would appreciate more detail from the author about the utility of each MCP server—overloading an agent with servers seems like it could be counterproductive.