A look at Cloudflare's AI-coded OAuth library
202 points by itsadok | 125 comments | 6/8/2025, 8:50:16 AM | neilmadden.blog
For me this is the key takeaway. You gain proper efficiency using LLMs when you are a competent reviewer, and for lack of a better word, leader. If you don't know the subject matter as well as the LLM, you better be doing something non-critical, or have the time to not trust it and verify everything.
The maximalists and skeptics both are confusing the debate by setting up this straw man that people will be delegating to LLMs blindly.
The idea that someone clueless about OAuth should develop an OAuth lib with LLM support without learning a lot about the topic is... Just wrong. Don't do that.
But if you're willing to learn, this is rocket fuel.
It was extremely frustrating.
https://en.wikipedia.org/wiki/Low-background_steel
Well it turns out you can manage just fine.
You shouldn't blindly trust anything. Not what you read, not what people say.
Using LLMs effectively is a skill too, and that does involve deciding when and how to verify information.
You missed the full context: you would never be able to trust a bunch of amateur randos self-policing their content. Turns out it's not perfect but better than a very small set of professionals; usually there's enough expertise out there, it's just widely distributed. The challenge this time is 1. the scale, 2. the rate of growth, 3. the decline in expertise.
>> Using LLMs effectively is a skill too, and that does involve deciding when and how to verify information.
How do you verify when ALL the sources share the same AI-generated root, and ALL of the independent (i.e. human) experts have aged out and no longer exist?
Why would that happen? There's demand for high quality, trustworthy information and that's not going away.
When asking an LLM coding questions, for example, you can ask for sources and it'll point you to documentation. It won't always be the correct link, but you can prod it more and usually get it, or fall back to searching the docs the old fashioned way.
Before AI generated results, the first page of Google was SEO-optimised crap blogs. The internet has been hard to search for a while.
You don't get knowledge by ONLY talking to LLMs, but they're a great tool.
Feels like there's a logical flaw here, when the issue is that LLMs are presenting the wrong information or missing it altogether. The person trying to learn from it will experience Donald Rumsfeld's "unknown unknowns".
I would not be surprised if we experience an even more dramatic "Cobol Moment" a generation from now, but unlike that one thankfully I won't be around to experience it.
I’ve grown the most when I start with things I sort of know and I work to expand my understanding.
With learning, aren’t you exposed to the same risks? Such that if there was a typical blind spot for the LLM, it would show up in the learning assistance and in the development assistance, thus canceling out (i.e unknown unknowns)?
Or am I thinking about it wrongly?
One big technique it sounds like the authors of the OAuth library missed is that LLMs are very good at generating tests. A good development process for today’s coding agents is to 1) prompt with or create a PRD, 2) break this down into relatively simple tasks, 3) build a plan for how to tackle each task, with listed-out conditions that should be tested, 4) write the tests, so that things are broken, TDD style, and finally 5) write the implementation. The LLM can do all of this, but you can’t one-shot it these days; you have to be a human in the loop at every step, correcting when things go off track. It’s faster, but it’s not a 10x speed up like you might imagine if you think the LLM is just asynchronously taking a PRD some PM wrote and building it all. We still have jobs for a reason.
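To make step 4 concrete, here's the kind of test I mean, written before any implementation exists (just an illustrative sketch; names like `redeemAuthCode` are made up, not from any real project):

    // auth-code.test.ts -- written first, TDD style, from the conditions
    // listed in the plan. `redeemAuthCode` is a hypothetical function.
    import { describe, it, expect } from "vitest";
    import { redeemAuthCode } from "./auth-code";

    describe("redeemAuthCode", () => {
      it("rejects an expired authorization code", async () => {
        const expired = { code: "abc123", expiresAt: Date.now() - 60_000 };
        await expect(redeemAuthCode(expired)).rejects.toThrow("expired");
      });

      it("rejects a code that has already been redeemed", async () => {
        const reused = { code: "abc123", expiresAt: Date.now() + 60_000, redeemed: true };
        await expect(redeemAuthCode(reused)).rejects.toThrow("already redeemed");
      });
    });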
How do you determine whether the LLM accurately reflects what the high-quality source contains, if you haven't read the source? When learning from humans, we put trust in them to teach us, based on a web of trust. How do you determine the level of trust with an LLM?
But this is only part of the story. When learning from another human, you'll also actively try to gauge whether they're trustworthy based on general linguistic markers, and will try to find and poke holes in what they're saying so that you can question intelligently.
This is not much different from what you'd do with an LLM, which is why it's such a problem that they're more convincing than correct pretty often. But it's not an insurmountable issue. The other issue is that their trustworthiness will vary in a different way than a human's, so you need experience to know when they're possibly just making things up. But just based on feel, I think this experience is definitely possible to gain.
Bonus: the high quality source is going to be mostly AI written anyway
I’m still on the lookout for a great model for this.
This does mean that there's a reliance on me being able to determine what are key facts and when I should be asking for a source though. I have not experienced any significant drawbacks when compared to a classic research workflow though, so in my view it's a net speed boost.
However, this does mean that a huge variety of things remain out of reach for me to accomplish, even with LLM "assistance". So there's a decent chance even the speed boost is only perceptual. If nothing else, it does take a significant amount of drudgery out of it all though.
I don't think that's how things work. In learning tasks, LLMs are sparring partners. You present them with scenarios, and they output a response. Sometimes they hallucinate completely, but they can also update their context to reflect new information. Their output matches what you input.
You are getting a stylised view of a topic from an entity that lacks the deep understanding needed to fully distill the information. But it is enough knowledge for you to feel confident, which is still valuable but also dangerous.
And I assure you that many, many people are delegating to LLMs blindly e.g. it's a huge problem in the UK legal system right now because of all the invented case law references.
Isn't this how every child learns?
Unless his father happens to be king of Macedonia, of course.
making mistakes is how we learn, and if they are never pointed out...
Sure, having access to legit experts who can tutor you privately on a range of topics would be better, but that's not realistic.
What I find is that if I need to explore some new domain within a field I'm broadly familiar with, just thinking through what the LLM is saying is sufficient for verification, since I can look for internal consistency and check against things I know already.
When exploring a new topic, often times my questions are superficial enough for me to be confident that the answers are very common in the training data.
When exploring a new topic that's also somewhat niche or goes into a lot of detail, I use the LLM first to get a broad overview and then drill down by asking for specific sources and using the LLM as an assistant to consume authoritative material.
You know that it's possible to ask models for dissenting opinions, right? Nothing's stopping you.
> and if they are never pointed out...
They do point out mistakes though?
We’ve gone from skeptics saying LLMs can’t code, to they can’t code well, to they can’t produce human-level code, to they are riddled with hallucinations, to now “but they can’t one-shot code a library without any bugs or flaws” and “but they can only one-shot code, they can’t edit well”, even though recent coding utilities have been proving that wrong as well. And still they say they are useless.
Some people just don’t hear themselves or see how AI is constantly moving their bar.
LLMs will tell you 1 or 2 lies for each 20 facts. It's a hard way to learn. They can't even get their URLs right...
That was my experience growing up in school too, except you got punished one way or another for speaking up/trying to correct the teacher. If I speak up with the LLM, they either explain why what they said is true, or correct themselves, 0 emotions involved.
> They can't even get their URLs right...
Famously never happens with humans.
If you are in class and you incorrectly argue that there is a mistake in an explanation of derivatives or physics, but you are the one in error, your teacher, hopefully, will not say: "Oh, I am sorry, you are absolutely correct. Thank you for your advice."
- Confident synthesis of incompatible sources: LLM: “Einstein won the 1921 Nobel Prize for his theory of relativity, which he presented at the 1915 Solvay Conference.”
Or
- Fabricated but plausible citations: LLM: “According to Smith et al., 2022, Nature Neuroscience, dolphins recognise themselves in mirrors.” There is no such paper... the model invents both the authors and the journal reference.
And this is the danger of coding with LLMs....
What matters is how X reacts when you point out it wasn't correct, at least in my opinion, and was the difference I was trying to highlight.
An LLM, by contrast, will invent a flawless looking but nonexistent citation. Even a below average teacher doesn’t churn out fresh fabrications every tenth sentence.
Because a teacher usually cites recognizable material, you can check the textbook and recover quickly. With an LLM you first have to discover the source never existed. That verification cost grows with the complexity of the task you are trying to achieve.
An LLM will give you a perfect paragraph about the AWS Database Migration Service, the list of supported databases, and then include in there a data flow like on-prem to on-prem that is not supported... Relying on an LLM is like flying with a friendly copilot who has multiple personality disorder. You don't know which day he will forget to take his meds :-)
Stressful and mentally exhausting in a different kind of way....
Me: “explain why radioactive half-life changes with temperature”
ChatGPT 4o: “ Short answer: It doesn’t—at least not significantly. Radioactive Half-Life is (Almost Always) Temperature-Independent”
…and then it goes on to give a few edge cases where there’s a tiny effect.
I cannot see us living in a world of ignorance where there are literally zero engineers and no one on the planet understands what's been generated. Weirdly we could end up in a place where engineering skills are niche and extremely lucrative.
Fast forward 30 years and modern civilisation is entirely dependent on our AIs.
Will deep insight and innovation from a human perspective perhaps come to a stop?
Tools will only amplify human skills. Sure, not everyone will choose to use tools for anything meaningful, but those people aren't the ones driving human insight and innovation today anyway.
What is new is that you'll need the wisdom to figure out when the tool can do the whole job, and where you need to intervene and supervise it closely.
So humans won't be doing any less thinking, rather they'll be putting their thinking to work in better ways.
Experts will become those who use LLMs to learn, not to write code or solve tasks for them, so they can build that skill.
For example, at one point a human + computer would have been the strongest combo in chess; now you'd be insane to allow a human to critique a chess bot because they're so unlikely to add value, and statistically a human in the loop would be far more likely to introduce error. Similar things can be said in fields like machine vision, etc.
Software is about to become much higher quality and be written at much, much lower cost.
I don’t think we are anywhere close to doing that.
A good analogy might be how machines gradually replaced textile workers in the 19th century. Were the machines better? Or was there a way to quantitatively measure the quality of their output? No. But at the end of the day companies which embraced the technology were more productive than those which didn't, and the quality didn't decrease enough (if it did at all) that customers would no longer do business with them – so these companies won out.
The same will naturally happen in software over the next few years. You'd be a moron to hire a human expert for $200,000 to critique a cybersecurity-optimised model which costs maybe a hundredth of the cost of employing a human... And this would likely be true even if we assume the human will catch the odd thing the model wouldn't, because there's no such thing as perfect security – it's always a trade-off between cost and acceptable risk.
Bookmark this and come back in a few years. I made similar predictions when ChatGPT first came out that within a few years agents would be picking up tickets and raising PRs. Everyone said LLMs were just stochastic parrots and this would not happen, well now it has and increasingly companies are writing more and more code with AI. At my company it's a little over 50% at the mo, but this is increasing every month.
In addition to the ability to review output effectively, I find the more closely I’m able to describe what I want in the way another expert in that domain would, the better the LLM output. Which isn’t really that surprising for a statistical text generation engine.
For example, I'm horrible at math, always been, so writing math-heavy code is difficult for me, I'll confess to not understanding math well enough. If I'm coding with an LLM and making it write math-heavy code, I write a bunch of unit tests to describe what I expect the function to return, write a short description and give it to the LLM. Once the function is written, run the tests and if it passes, great.
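As a sketch of what those tests look like (a made-up example, not a function I actually shipped): I pin down the behaviour I can reason about, inputs and expected outputs, and let the LLM fill in the internals.

    // Hypothetical example: the tests describe what I expect `lerpAngle`
    // to return; the LLM writes the implementation behind it.
    import { describe, it, expect } from "vitest";
    import { lerpAngle } from "./math";

    describe("lerpAngle", () => {
      it("interpolates halfway between two angles", () => {
        expect(lerpAngle(0, 90, 0.5)).toBeCloseTo(45);
      });

      it("takes the shortest path across the 360/0 boundary", () => {
        // Assuming results are normalised to the [0, 360) range.
        expect(lerpAngle(350, 10, 0.5)).toBeCloseTo(0);
      });
    });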
I might not 100% understand what the function does internally, and it's not used for any life-preserving stuff either (typically end up having to deal with math for games), but I do understand what it outputs, and what I need to input, and in many cases that's good enough. Working in a company/with people smarter than you tends to make you end up in this situation anyways, LLMs or not.
Though if in the future I end up needing to change the math-heavy stuff in the function, I'm kind of locked into using LLMs for understanding and changing it, which obviously feels less good. But the alternative is not doing it at all, so another tradeoff I suppose.
I still wouldn't use this approach for essential/"important" stuff, but more like utility functions.
People don't learn how a car works before buying one, they just take it to a mechanic when it breaks. Most people don't know how to build a house, they have someone else build it and assume it was done well.
I fully expect people to similarly have LLMs do what the person doesn't know how and assume the machine knew what to do.
Because LLMs are not competent professionals to whom you might outsource tasks in your life. LLMs are statistical engines that make up answers all the time, even when the LLM “knows” the correct answer (i.e., has the correct answer hidden away in its weights.)
I don’t know about you, but I’m able to validate something is true much more quickly and efficiently if it is a subject I know well.
For instance, I used Next.js to build a simple login page with Google auth. It worked great, even though I only had basic knowledge of Node.js and a bit of React.
Then I tried adding a database layer using Prisma to persist users. That's where things broke. The integration didn't work, seemingly due to recent Prisma versions or subtle breaking updates. I found similar issues discussed on GitHub and Reddit, but solving them required shifting into full manual debugging mode.
My takeaway: even with improved models, fast-moving frameworks and toolchains can break workflows in ways that LLMs/ML (at least today) can't reason through or fix reliably. It's not always about missing domain knowledge, it's that the moving parts aren't in sync with the model yet.
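For context, the step that broke was conceptually tiny, roughly this kind of thing (a simplified sketch with a hypothetical schema, not my actual code), which is what made the debugging feel so disproportionate:

    // Sketch only: assumes a Prisma `User` model with a unique `email` field.
    // Upsert the signed-in Google user so they persist across sessions.
    import { PrismaClient } from "@prisma/client";

    const prisma = new PrismaClient();

    export async function persistUser(profile: { email: string; name?: string }) {
      // Create the user on first sign-in, update the name on later sign-ins.
      return prisma.user.upsert({
        where: { email: profile.email },
        update: { name: profile.name },
        create: { email: profile.email, name: profile.name },
      });
    }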
I think it's caused by you not having a strong enough system prompt. Once you've built up a slightly reusable system prompt for coding or for infra work, where you bit by bit build it up while using a specific model (since different models respond differently to prompts), you end up getting better and better responses.
So if you notice it putting plaintext credentials in the code, add to the system prompt to not do that. With LLMs you really get what you ask for, and if you fail to specify something, the LLM will do whatever the probabilities tell it to, but you can steer this by being more specific.
Imagine you're talking to a very literal and pedantic engineer who argues a lot on HN and having to be very precise with your words, and you're like 80% of the way there :)
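For example, a starting point might look something like this (an illustrative snippet only; the wording and rules are things you'd tune to your own model and stack):

    // A reusable system prompt, grown rule by rule as you notice failure
    // modes. The rules below are examples, not a recommended set.
    const CODING_SYSTEM_PROMPT = `
    You are assisting with production TypeScript code.
    - Never put credentials, API keys, or tokens in source code; read them from environment variables.
    - Prefer the standard library over adding new dependencies.
    - When an instruction is ambiguous, ask a clarifying question instead of guessing.
    - Add error handling for every external call (network, filesystem, database).
    `;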
This is going to be a powerful feedback loop which you might call regression to the intellectual mean.
On any task, most training data is going to represent the middle (or beginning) of knowledge about a topic. Most k8s examples will skip best practices, most react apps will be from people just learning react, etc.
If you want the LLM to do best practices in every knowledge domain (assuming best practices can be consistently well defined), then you have to push it away from the mean of every knowledge domain simultaneously (or else work with specialized fine tuned models).
As you continue to add training data it will tend to regress toward the middle because that's where most people are on most topics.
For complicated reasons the whole database is coming through on 1 topic, so I’m doing some fairly complicated parallelization to squeeze out enough performance.
I’d say overall the AI was close to a 2x speed up. It mostly saved me time when I forgot the Go syntax for something vs looking it up.
However, there were at least 4 subtle bugs (and many more unsubtle ones) that I think anyone who wasn’t very familiar with Kafka or multithreaded programming would have pushed to prod. As it is, they took me a while to uncover.
On larger longer lived codebases, I’ve seen something closer to a 10-20% improvement.
All of this is using the latest models.
Overall this is at best the kind of productivity boost we got from moving to memory managed languages. Definitely not something that is going to replace engineers with PMs vibe coding anytime soon (based on rate of change I’ve seen over the last 3 years).
My real worry is that this is going to make mid level technical tornadoes, who in my experience are the most damaging kind of programmer, 10x as productive because they won’t know how to spot or care about stopping subtle bugs.
I don’t see how senior and staff engineers are going to be able to keep up with the inevitable flood of reviews.
I also worry about the junior-to-senior pipeline in a world where it’s even easier to get something up that mostly works—we already have this problem today with copy-paste programmers, and we’ve just made copy-paste programming even easier.
I think the market will eventually sort this all out, but I worry that it could take decades.
This seems to match my experience in "important" work too: a real improvement, but not changing the essence of software development. Brooks's "No Silver Bullet" strikes again...
Leaning on and heavily relying on a black box that hallucinates gibberish to “learn”, perform your work, and review your work.
All the while it literally consumes ungodly amounts of energy and is used as pretext to get rid of people.
Really cool stuff! I’m sure it’s 10x’ing your life!
All speculation, but I'd be curious to see it evaluated - does the LLM do better edits on egregiously commented code?
On the other hand, when you have to write something yourself you drop down to a slow, thinking state where you pay attention to details a lot more. This means that you will catch bugs you wouldn't otherwise think of. That's why people recommend writing toy versions of the tools you are using: writing it yourself teaches a lot better than just reading materials about it. This is related to how our cognition works.
I have a lot of experience reviewing code -- more than I ever really wanted. It has... turned me cynical and bitter, to the point that I never believe anything is right, no matter who wrote it or how nice it looks, because I've seen so many ways things can go wrong. So I tend to review every line, simulate it in my head, and catch things. I kind of hate it, because it takes so long for me to be comfortable approving anything, and my reviewees hate it too, so they tend to avoid sending things to me.
I think I agree that if I'd written the code by hand, it would be less likely to have bugs. Maybe. I'm not sure, because I've been known to author some pretty dumb bugs of my own. But yes, total Kenton brain cycles spent on each line would be higher, certainly.
On the other hand, though, I probably would not have been the one to write this library. I just have too much on my plate (including all those reviews). So it probably would have been passed off to a more junior engineer, and I would have reviewed their work. Would I have been more or less critical? Hard to say.
But one thing I definitely disagree with is the idea that humans would have produced bug-free code. I've seen way too many bugs in my time to take that seriously. Hate to say it but most of the bugs I saw Claude produce are mistakes I'd totally expect an average human engineer could make.
Aside, since I know some people are thinking it: At this time, I do not believe LLM use will "replace" any human engineers at Cloudflare. Our hiring of humans is not determined by how much stuff we have to do, because we basically have infinite stuff we want to do. The limiting factor is what we have budget for. If each human becomes more productive due to LLM use, and this leads to faster revenue growth, this likely allows us to hire more people, not fewer. (Disclaimer: As with all of my comments, this is my own opinion / observation, not an official company position.)
> I’m also an expert in OAuth
I'll admit I think Neil is significantly more of an expert than me, so I'm delighted he took a pass at reviewing the code! :)
I'd like to respond to a couple of the points though.
> The first thing that stuck out for me was what I like to call “YOLO CORS”, and is not that unusual to see: setting CORS headers that effectively disable the same origin policy almost entirely for all origins:
I am aware that "YOLO CORS" is a common novice mistake, but that is not what is happening here. These CORS settings were carefully considered.
We disable CORS restrictions specifically for the OAuth API (token exchange, client registration) endpoints and for the API endpoints that are protected by OAuth bearer tokens.
This is valid because none of these endpoints are authorized by browser credentials (e.g. cookies). The purpose of CORS is to make sure that a malicious website cannot exercise your credentials against some other website by sending a request to it and expecting the browser to add your cookies to that request. These endpoints, however, do not use browser credentials for authentication.
Or to put it another way, the endpoints which have open CORS headers are either control endpoints which are intentionally open to the world, or they are API endpoints which are protected by an OAuth bearer token. Bearer tokens must be added explicitly by the client; the browser never adds one automatically. So, in order to receive a bearer token, the client must have been explicitly authorized by the user to access the service. CORS isn't protecting anything in this case; it's just getting in the way.
(Another purpose of CORS is to protect confidentiality of resources which are not available on the public internet. For example, you might have web servers on your local network which lack any authorization, or you might unwisely use a server which authorizes you based on IP address. Again, this is not a concern here since the endpoints in question don't provide anything interesting unless the user has explicitly authorized the client.)
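To illustrate the distinction (a simplified sketch, not the library's actual code): the open headers only appear on endpoints where authorization comes from an explicit bearer token and the browser never attaches credentials of its own.

    // Simplified sketch of a Workers handler (not the library's actual
    // code): open CORS on an endpoint authorized only by a bearer token.
    export default {
      async fetch(request: Request): Promise<Response> {
        const corsHeaders = {
          "Access-Control-Allow-Origin": "*",
          "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
          "Access-Control-Allow-Headers": "Authorization, Content-Type",
        };

        if (request.method === "OPTIONS") {
          return new Response(null, { headers: corsHeaders });
        }

        // Auth comes from an explicit bearer token, never from cookies,
        // so the permissive CORS policy doesn't expose any credentials.
        const auth = request.headers.get("Authorization") ?? "";
        if (!auth.startsWith("Bearer ")) {
          return new Response("Unauthorized", { status: 401, headers: corsHeaders });
        }

        // ... validate the token and serve the API request ...
        return new Response(JSON.stringify({ ok: true }), {
          headers: { "Content-Type": "application/json", ...corsHeaders },
        });
      },
    };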
Aside: Long ago I was actually involved in an argument with the CORS spec authors, arguing that the whole spec should be thrown away and replaced with something that explicitly recognizes bearer tokens as the right way to do any cross-origin communications. It is almost never safe to open CORS on endpoints that use browser credentials for auth, but it is almost always safe to open it on endpoints that use bearer tokens. If we'd just recognized and embraced that all along I think it would have saved a lot of confusion and frustration. Oh well.
> A more serious bug is that the code that generates token IDs is not sound: it generates biased output.
I disagree that this is a "serious" bug. The tokens clearly have enough entropy in them to be secure (and the author admits this). Yes, they could pack more entropy per byte. I noticed this when reviewing the code, but at the time decided:
1. It's secure as-is, just not maximally efficient. 2. We can change the algorithm freely in the future. There is no backwards-compatibility concern.
So, I punted.
Though if I'd known this code was going to get 100x more review than anything I've ever written before, I probably would have fixed it... :)
> according to the commit history, there were 21 commits directly to main on the first day from one developer, no sign of any code review at all
Please note that the timestamps at the beginning of the commit history as shown on GitHub are misleading because of a history rewrite that I performed later on to remove some files that didn't really belong in the repo. GitHub appears to show the date of the rebase whereas `git log` shows the date of actual authorship (where these commits are spread over several days starting Feb 27).
> I had a brief look at the encryption implementation for the token store. I mostly like the design! It’s quite smart.
Thank you! I'm quite proud of this design. (Of course, the AI would never have come up with it itself, but it was pretty decent at filling in the details based on my explicit instructions.)
But to be clear, I had no idea how to write good prompts. I basically just wrote like I would write to a human. That seemed to work.
I didn't think of that, though. I didn't have an agenda here, I just put the note in the readme about it being LLM-generated only because I thought it was interesting.
This is what keeps me up at night. Not that security holes will inevitably be introduced, or that the models will make mistakes, but that the knowledge and information we have as a society is basically going to get frozen in time to what was popular on the internet before LLMs.
Same here. For some of the services I pay, say the e-mail provider, the fact that they openly deny using LLMs for coding would be a plus for me.
Wow. Anecdotally it's my understanding that OAuth is ... tricky ... but wow.
Some would say it's a dumpster fire. I've never read the spec or implemented it.
That’s why companies go for things that are “battle tested” like vibe coding. ;)
Joke aside—I like how Anthropic is using their own product in a pragmatic fashion. I’m wondering if they’ll use it for their MCP authentication API.
In the GitHub repo Cloudflare says:
"...Claude's output was thoroughly reviewed by Cloudflare engineers with careful attention paid to security and compliance with standards..."
My conclusion is that as a development team, they learned little since 2017: https://news.ycombinator.com/item?id=13718752
I’m very confident I would have noticed this bias in a first pass of reviewing the code. The very first thing you do in a security review is look at where you use `crypto`, what its inputs are, and what you do with its outputs, very carefully. On seeing that %, I would have checked characters.length and found it to be 62, not a factor of 256; so you need to mess around with base conversion, or change the alphabet, or some other such trick.
This bothers me and makes me lose confidence in the review performed.
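Concretely, the pattern at issue and one simple fix look roughly like this (a sketch, not the library's actual code):

    // With a 62-character alphabet, `byte % 62` maps 256 values onto 62
    // buckets unevenly: the first 8 characters occur 5/256 of the time,
    // the rest 4/256.
    const ALPHABET =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    function biasedToken(length: number): string {
      const bytes = crypto.getRandomValues(new Uint8Array(length));
      return Array.from(bytes, (b) => ALPHABET[b % ALPHABET.length]).join("");
    }

    // One unbiased alternative: rejection sampling. Discard bytes >= 248
    // (the largest multiple of 62 below 256) instead of wrapping them.
    function unbiasedToken(length: number): string {
      const limit = 256 - (256 % ALPHABET.length); // 248
      let out = "";
      while (out.length < length) {
        const [b] = crypto.getRandomValues(new Uint8Array(1));
        if (b < limit) out += ALPHABET[b % ALPHABET.length];
      }
      return out;
    }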
[1] https://news.ycombinator.com/item?id=44205697
Moreover, after developing the code, I have multiple LLMs critique the code, file by file, or even method by method.
When I say multiple, I mean a non-reasoning one, a reasoning large one, and a next-gen reasoning small one, preferably by multiple vendors.
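The loop itself is simple; something along these lines (a sketch: `critique` is a placeholder for whichever vendor SDKs you actually wire up, and the model names are invented):

    // Sketch of a multi-model review pass over one file.
    import { readFile } from "node:fs/promises";

    // Placeholder: wire this up to each vendor's SDK (a chat-completions
    // style call with `model` and `prompt`), returning the model's reply.
    async function critique(model: string, prompt: string): Promise<string> {
      throw new Error(`no API client configured for ${model}`);
    }

    // One non-reasoning model, one large reasoning model, one small
    // next-gen reasoning model, preferably from different vendors.
    const MODELS = ["vendor-a/fast", "vendor-b/reasoning-large", "vendor-c/reasoning-small"];

    export async function reviewFile(path: string): Promise<string[]> {
      const code = await readFile(path, "utf8");
      const prompt = `Critique this file for bugs, security issues, and spec violations:\n\n${code}`;
      // Collect independent critiques, then compare where they disagree.
      return Promise.all(MODELS.map((model) => critique(model, prompt)));
    }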
So, tl;dr: most of the issues the author has with the person who made the library are about the design, not the implementation?
'Yes, this does come across as a bit “vibe-coded”, despite what the README says, but so does a lot of code I see written by humans. LLM or not, we have to give a shit.'
If what most people do is "vibe coding" in general, the current definition of vibe coding is essentially meaningless. Instead, the author is making the distinction between "interim workable" and "stainless/battle tested" which is another dimension of code entirely. To describe that as vibe coding causes me to view the author's intent with suspicion.
I read it as: done by AI but not checked by humans.
> There are some tests, and they are OK, but they are woefully inadequate for what I would expect of a critical auth service. Testing every MUST and MUST NOT in the spec is a bare minimum, not to mention as many abuse cases as you can think of, but none of that is here from what I can see: just basic functionality tests.
and
> There are some odd choices in the code, and things that lead me to believe that the people involved are not actually familiar with the OAuth specs at all. For example, this commit adds support for public clients, but does so by implementing the deprecated “implicit” grant (removed in OAuth 2.1).
As Madden concludes "LLM or not, we have to give a shit."
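To make "testing every MUST" concrete, here is the kind of spec-driven test meant (a hypothetical example, not one from the repo): RFC 6749 §5.2 says the token endpoint responds with an `unsupported_grant_type` error when the grant type isn't supported, so there should be a test pinning exactly that down.

    // Hypothetical spec-driven test; `handleTokenRequest` stands in for
    // whatever entry point the library exposes for the token endpoint.
    import { describe, it, expect } from "vitest";
    import { handleTokenRequest } from "./oauth-provider";

    describe("token endpoint (RFC 6749 §5.2)", () => {
      it("rejects an unsupported grant_type with the spec-mandated error", async () => {
        const request = new Request("https://example.com/token", {
          method: "POST",
          headers: { "Content-Type": "application/x-www-form-urlencoded" },
          body: "grant_type=telepathy&code=abc123",
        });

        const response = await handleTokenRequest(request);
        const body = await response.json();

        expect(response.status).toBe(400);
        expect(body.error).toBe("unsupported_grant_type");
      });
    });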
It does not illustrate that at all.
> Claude's output was thoroughly reviewed by Cloudflare engineers with careful attention paid to security and compliance with standards.
> To emphasize, *this is not "vibe coded"*. Every line was thoroughly reviewed and cross-referenced with relevant RFCs, by security experts with previous experience with those RFCs.
— https://github.com/cloudflare/workers-oauth-provider
The humans who worked on it very, very clearly took responsibility for code quality. That they didn’t get it 100% right does not mean that they “blindly offloaded responsibility”.
Perhaps you can level that accusation at other people doing different things, but Cloudflare explicitly placed the responsibility for this on the humans.
It's not important whose responsibility led to mistakes; it's important to understand we're creating a responsibility gap.