A look at Cloudflare's AI-coded OAuth library (neilmadden.blog)
169 points by itsadok | 75 comments | 6/8/2025, 8:50:16 AM
For me this is the key takeaway: you gain real efficiency with LLMs when you are a competent reviewer and, for lack of a better word, leader. If you don't know the subject matter as well as the LLM does, you had better be doing something non-critical, or have the time to distrust it and verify everything.
Both the maximalists and the skeptics confuse the debate by setting up the straw man that people will delegate to LLMs blindly.
The idea that someone clueless about OAuth should develop an OAuth lib with LLM assistance, without learning a lot about the topic, is... just wrong. Don't do that.
But if you're willing to learn, this is rocket fuel.
Feels like there's a logical flaw here, when the issue is that LLMs present the wrong information or miss it altogether. The person trying to learn from it will experience Donald Rumsfeld's "unknown unknowns".
I would not be surprised if we experience an even more dramatic "COBOL moment" a generation from now, but unlike that one, thankfully, I won't be around to experience it.
It was extremely frustrating.
Well, it turns out you can manage just fine.
You shouldn't blindly trust anything. Not what you read, not what people say.
Using LLMs effectively is a skill too, and that does involve deciding when and how to verify information.
With learning, aren't you exposed to the same risks? If there were a typical blind spot for the LLM, it would show up in the learning assistance and in the development assistance alike, thus canceling out (i.e., unknown unknowns)?
Or am I thinking about it wrongly?
One big technique it sounds like the authors of the OAuth library missed is that LLMs are very good at generating tests. A good development process for today's coding agents is to 1) prompt with or create a PRD, 2) break this down into relatively simple tasks, 3) build a plan for how to tackle each task, with listed-out conditions that should be tested, 4) write the tests, so that things are broken, TDD style, and finally 5) write the implementation. The LLM can do all of this, but you can't one-shot it these days; you have to be a human in the loop at every step, correcting when things go off track. It's faster, but it's not a 10x speed-up like you might imagine if you think the LLM is just asynchronously taking a PRD some PM wrote and building it all. We still have jobs for a reason.
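To make steps 4 and 5 concrete, here's a minimal sketch of a test written before the implementation exists (the function name and cases are invented for illustration, not taken from the library):

    // Hypothetical TDD-style test, written before any implementation.
    // `parseBearerToken` deliberately doesn't exist yet, so this suite
    // fails until the implementation step makes it pass.
    import { test } from "node:test";
    import assert from "node:assert";
    import { parseBearerToken } from "./auth.js";

    test("extracts the token from a Bearer Authorization header", () => {
      assert.strictEqual(parseBearerToken("Bearer abc123"), "abc123");
    });

    test("rejects non-Bearer auth schemes", () => {
      assert.strictEqual(parseBearerToken("Basic abc123"), null);
    });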
How do you determine whether the LLM accurately reflects what the high-quality source contains, if you haven't read the source? When learning from humans, we put trust in them to teach us based on a web of trust. How do you determine the level of trust with an LLM?
I don't think that's how things work. In learning tasks, LLMs are sparring partners. You present them with scenarios, and they output a response. Sometimes they hallucinate completely, but they can also update their context to reflect new information. Their output matches what you input.
This does mean that there's a reliance on me being able to determine what are key facts and when I should be asking for a source though. I have not experienced any significant drawbacks when compared to a classic research workflow though, so in my view it's a net speed boost.
However, this does mean that a huge variety of things remain out of reach for me to accomplish, even with LLM "assistance". So there's a decent chance even the speed boost is only perceptual. If nothing else, it does take a significant amount of drudgery out of it all though.
making mistakes is how we learn, and if they are never pointed out...
Sure, having access to legit experts who can tutor you privately on a range of topics would be better, but that's not realistic.
What I find is that if I need to explore some new domain within a field I'm broadly familiar with, just thinking through what the LLM is saying is sufficient for verification, since I can look for internal consistency and check against things I know already.
When exploring a new topic, my questions are often superficial enough for me to be confident that the answers are very common in the training data.
When exploring a new topic that's also somewhat niche or goes into a lot of detail, I use the LLM first to get a broad overview and then drill down by asking for specific sources and using the LLM as an assistant to consume authoritative material.
LLMs will tell you one or two lies for every 20 facts. It's a hard way to learn. They can't even get their URLs right...
That was my experience growing up in school too, except you got punished one way or another for speaking up or trying to correct the teacher. If I speak up with the LLM, it either explains why what it said is true, or corrects itself, zero emotions involved.
> They can't even get their URLs right...
Famously never happens with humans.
If you are in class and you incorrectly argue that there is a mistake in an explanation of derivatives or physics, but you are the one in error, your teacher, hopefully, will not say: "Oh, I am sorry, you are absolutely correct. Thank you for your advice."
I cannot see us living in a world of ignorance where there are literally zero engineers and no one on the planet understands what's been generated. Weirdly we could end up in a place where engineering skills are niche and extremely lucrative.
Fast forward 30 years and modern civilisation is entirely dependent on our AIs.
Will deep insight and innovation from a human perspective perhaps come to a stop?
What is new is that you'll need the wisdom to figure out when the tool can do the whole job, and where you need to intervene and supervise it closely.
So humans won't be doing any less thinking, rather they'll be putting their thinking to work in better ways.
Experts will become those who use LLMs to learn, rather than to write code or solve tasks for them, so they can build that skill.
For example, at one point a human + computer would have been the strongest combo in chess; now you'd be insane to let a human critique a chess bot, because they're so unlikely to add value, and statistically a human in the loop would be far more likely to introduce error. Similar things can be said of fields like machine vision, etc.
Software is about to become much higher quality and be written at much, much lower cost.
I don’t think we are anywhere close to doing that.
A good analogy might be how machines gradually replaced textile workers in the 19th century. Were the machines better? Or was there a way to quantitatively measure the quality of their output? No. But at the end of the day, companies which embraced the technology were more productive than those which didn't, and the quality didn't decrease enough (if it did at all) that customers would no longer do business with them, so these companies won out.
The same will naturally happen in software over the next few years. You'd be a moron to hire a human expert for $200,000 to critique a cybersecurity-optimised model which costs maybe a hundredth of the cost of employing a human... And this would likely be true even if we assume the human will catch the odd thing the model wouldn't, because there's no such thing as perfect security; it's always a trade-off between cost and acceptable risk.
Bookmark this and come back in a few years. I made similar predictions when ChatGPT first came out: that within a few years agents would be picking up tickets and raising PRs. Everyone said LLMs were just stochastic parrots and this would not happen; well, now it has, and increasingly companies are writing more and more code with AI. At my company it's a little over 50% at the mo, but this is increasing every month.
In addition to the ability to review output effectively, I find the more closely I’m able to describe what I want in the way another expert in that domain would, the better the LLM output. Which isn’t really that surprising for a statistical text generation engine.
For example, I'm horrible at math, always have been, so writing math-heavy code is difficult for me; I'll confess to not understanding math well enough. If I'm coding with an LLM and having it write math-heavy code, I write a bunch of unit tests that describe what I expect the function to return, write a short description, and give it to the LLM. Once the function is written, run the tests, and if they pass, great.
I might not 100% understand what the function does internally, and it's not used for any life-preserving stuff either (typically end up having to deal with math for games), but I do understand what it outputs, and what I need to input, and in many cases that's good enough. Working in a company/with people smarter than you tends to make you end up in this situation anyways, LLMs or not.
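As a sketch of what those behavioral tests look like (the function here is invented for the example), the tests pin down inputs and outputs even when I don't fully follow the internals:

    // Hypothetical tests pinning down an LLM-written math helper.
    // `lerp` (linear interpolation) stands in for the math-heavy code.
    import { test } from "node:test";
    import assert from "node:assert";
    import { lerp } from "./mathUtils.js";

    test("lerp returns the endpoints at t=0 and t=1", () => {
      assert.strictEqual(lerp(2, 10, 0), 2);
      assert.strictEqual(lerp(2, 10, 1), 10);
    });

    test("lerp returns the midpoint at t=0.5", () => {
      assert.strictEqual(lerp(2, 10, 0.5), 6);
    });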
Though if in the future I end up needing to change the math-heavy stuff in the function, I'm kind of locked into using LLMs for understanding and changing it, which obviously feels less good. But the alternative is not doing it at all, so another tradeoff I suppose.
I still wouldn't use this approach for essential/"important" stuff, but more like utility functions.
People don't learn how a car works before buying one, they just take it to a mechanic when it breaks. Most people don't know how to build a house, they have someone else build it and assume it was done well.
I fully expect people to similarly have LLMs do what the person doesn't know how and assume the machine knew what to do.
I think it's caused by not having a strong enough system prompt. Once you've built up a somewhat reusable system prompt for coding or for infra work, adding to it bit by bit while using a specific model (since different models respond differently to prompts), you end up getting better and better responses.
So if you notice it putting plaintext credentials in the code, add to the system prompt not to do that. With LLMs you really do get what you ask for, and if you fail to specify something, the LLM will do whatever the probabilities tell it to; you can steer this by being more specific.
Imagine you're talking to a very literal and pedantic engineer who argues a lot on HN and having to be very precise with your words, and you're like 80% of the way there :)
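As a rough illustration (my own wording, not a known-good prompt), the accumulated system prompt might look something like:

    You are assisting with production code and infrastructure.
    - Never put plaintext credentials, tokens, or keys in code or config;
      read them from environment variables or a secrets manager.
    - Prefer the project's existing dependencies over adding new ones.
    - If a requirement is ambiguous, state your assumption before coding.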
This is going to be a powerful feedback loop which you might call regression to the intellectual mean.
On any task, most training data is going to represent the middle (or beginning) of knowledge about a topic. Most k8s examples will skip best practices, most react apps will be from people just learning react, etc.
If you want the LLM to do best practices in every knowledge domain (assuming best practices can be consistently well defined), then you have to push it away from the mean of every knowledge domain simultaneously (or else work with specialized fine tuned models).
As you continue to add training data it will tend to regress toward the middle because that's where most people are on most topics.
For complicated reasons the whole database is coming through on 1 topic, so I’m doing some fairly complicated parallelization to squeeze out enough performance.
I'd say overall the AI was close to a 2x speed-up. It mostly saved me time when I forgot the Go syntax for something vs. looking it up.
However, there were at least 4 subtle bugs (and many more unsubtle ones) that I think anyone who wasn’t very familiar with Kafka or multithreaded programming would have pushed to prod. As it is, they took me a while to uncover.
On larger longer lived codebases, I’ve seen something closer to a 10-20% improvement.
All of this is using the latest models.
Overall this is at best the kind of productivity boost we got from moving to memory managed languages. Definitely not something that is going to replace engineers with PMs vibe coding anytime soon (based on rate of change I’ve seen over the last 3 years).
My real worry is that this is going to make mid-level technical tornadoes, who in my experience are the most damaging kind of programmer, 10x as productive, because they won't know how to spot subtle bugs, or care about stopping them.
I don’t see how senior and staff engineers are going to be able to keep up with the inevitable flood of reviews.
I also worry about the junior-to-senior pipeline in a world where it's even easier to get something up that mostly works; we already have this problem today with copy-paste programmers, and we've just made copy-paste programming even easier.
I think the market will eventually sort this all out, but I worry that it could take decades.
On the other hand, when you have to write something yourself, you drop down into a slow, thinking state where you pay a lot more attention to details. This means that you will catch bugs you wouldn't otherwise think of. That's why people recommend writing toy versions of the tools you are using: writing it yourself teaches a lot better than just reading materials about it. This is related to how our cognition works.
This is what keeps me up at night. Not that security holes will inevitably be introduced, or that the models will make mistakes, but that the knowledge and information we have as a society is basically going to get frozen in time to what was popular on the internet before LLMs.
Same here. For some of the services I pay for, say, the e-mail provider, the fact that they openly deny using LLMs for coding would be a plus for me.
Wow. Anecdotally it's my understanding that OAuth is ... tricky ... but wow.
Some would say it's a dumpster fire. I've never read the spec or implemented it.
That’s why companies go for things that are “battle tested” like vibe coding. ;)
Joke aside, I like how Anthropic is using their own product in a pragmatic fashion. I'm wondering if they'll use it for their MCP authentication API.
In the Github repo Cloudflare says:
"...Claude's output was thoroughly reviewed by Cloudflare engineers with careful attention paid to security and compliance with standards..."
My conclusion is that as a development team, they learned little since 2017: https://news.ycombinator.com/item?id=13718752
I’m very confident I would have noticed this bias in a first pass of reviewing the code. The very first thing you do in a security review is look at where you use `crypto`, what its inputs are, and what you do with its outputs, very carefully. On seeing that %, I would have checked characters.length and found it to be 62, not a factor of 256; so you need to mess around with base conversion, or change the alphabet, or some other such trick.
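For anyone who hasn't seen this class of bug, here's a sketch of the problem in TypeScript (reconstructed from the description above, not Cloudflare's actual code):

    // 62-character alphabet: 256 % 62 = 8, so a plain `%` mapping makes
    // the first 8 characters slightly more likely than the rest.
    const ALPHABET =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    // BAD: biased, because 256 is not a multiple of 62.
    function biasedToken(len: number): string {
      const bytes = crypto.getRandomValues(new Uint8Array(len));
      return Array.from(bytes, (b) => ALPHABET[b % ALPHABET.length]).join("");
    }

    // OK: rejection sampling discards bytes >= 248 (the largest multiple
    // of 62 that fits in a byte), restoring a uniform distribution.
    function unbiasedToken(len: number): string {
      const limit = 256 - (256 % ALPHABET.length); // 248
      let out = "";
      while (out.length < len) {
        const [b] = crypto.getRandomValues(new Uint8Array(1));
        if (b < limit) out += ALPHABET[b % ALPHABET.length];
      }
      return out;
    }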
This bothers me and makes me lose confidence in the review performed.
So, tl;dr: most of the issues the author has are with the library's design, not the implementation?
'Yes, this does come across as a bit “vibe-coded”, despite what the README says, but so does a lot of code I see written by humans. LLM or not, we have to give a shit.'
If what most people do is "vibe coding" in general, the current definition of vibe coding is essentially meaningless. Instead, the author is making the distinction between "interim workable" and "stainless/battle tested" which is another dimension of code entirely. To describe that as vibe coding causes me to view the author's intent with suspicion.
I read it as: done by AI but not checked by humans.
> There are some tests, and they are OK, but they are woefully inadequate for what I would expect of a critical auth service. Testing every MUST and MUST NOT in the spec is a bare minimum, not to mention as many abuse cases as you can think of, but none of that is here from what I can see: just basic functionality tests.
and
> There are some odd choices in the code, and things that lead me to believe that the people involved are not actually familiar with the OAuth specs at all. For example, this commit adds support for public clients, but does so by implementing the deprecated “implicit” grant (removed in OAuth 2.1).
As Madden concludes, "LLM or not, we have to give a shit."
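To make "testing every MUST and MUST NOT" concrete, a conformance test could take roughly this shape (handler and module names invented for the sketch):

    // Hypothetical spec test: RFC 6749 requires the token endpoint to
    // reject unsupported grant types with an `unsupported_grant_type` error.
    import { test } from "node:test";
    import assert from "node:assert";
    import { handleTokenRequest } from "./oauthProvider.js";

    test("token endpoint rejects unknown grant_type", async () => {
      const res = await handleTokenRequest(
        new Request("https://auth.example.com/token", {
          method: "POST",
          headers: { "content-type": "application/x-www-form-urlencoded" },
          body: "grant_type=telepathy",
        }),
      );
      assert.strictEqual(res.status, 400);
      const body = await res.json();
      assert.strictEqual(body.error, "unsupported_grant_type");
    });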
It does not illustrate that at all.
> Claude's output was thoroughly reviewed by Cloudflare engineers with careful attention paid to security and compliance with standards.
> To emphasize, *this is not "vibe coded"*. Every line was thoroughly reviewed and cross-referenced with relevant RFCs, by security experts with previous experience with those RFCs.
— https://github.com/cloudflare/workers-oauth-provider
The humans who worked on it very, very clearly took responsibility for code quality. That they didn’t get it 100% right does not mean that they “blindly offloaded responsibility”.
Perhaps you can level that accusation at other people doing different things, but Cloudflare explicitly placed the responsibility for this on the humans.
It's not important whose responsibility led to the mistakes; it's important to understand that we're creating a responsibility gap.