OpenAI Progress (progress.openai.com)
70 points by vinhnx 8/16/2025, 3:47:12 PM 73 comments
3.5 to 4 was the biggest leap. It went from being a party trick to legitimately useful sometimes. It did hallucinate a lot, but I was still able to get some use out of it; I wouldn't count on it for most things, however. It could answer simple questions and mostly get them right, but never one or two levels deep.
I clearly remember 4o was also a decent leap: the accuracy increased substantially. It could answer niche questions without much hallucination. I could essentially use it in place of Google for basic to slightly complex fact-checking.
* 4o was the first time I actually considered paying for this tool. The $20 price was finally worth it.
The o1 models were also a big leap over 4o (I realise I have been saying "big leap" too many times, but it is true). The accuracy increased again and I became even more confident using it for niche topics; I had to verify the results much less often. Oh, and coding capabilities dramatically improved here in the thinking model. o1 essentially invented one-shotting: for the first time, slightly non-trivial apps could be made from a single prompt.
The o3 jump was incremental, and so was GPT-5's.
I know you probably meant "augment fact checking" here, but using LLMs for answering factual questions is the single worst use-case for LLMs.
Non-niche meaning: something that is taught at the undergraduate level and is relatively popular.
Non-deep meaning: you aren't going so deep as to confuse even humans, like solving an extremely hard integral.
Edit: probably a bad idea, because this sort of "challenge" works only statistically, not anecdotally. Still interesting to find out.
This was with ChatGPT 5.
I mean, it got a generic built-in function of one of the most popular languages in the world wrong.
See https://en.wikipedia.org/wiki/Gell-Mann_amnesia_effect
If you know that a source isn’t to be believed in an area you know about, why would you trust that source in an area you don’t know about?
Another funny anecdote: ChatGPT just got the Gell-Mann effect wrong.
https://chatgpt.com/share/68a0b7af-5e40-8010-b1e3-ee9ff3c8cb...
The native voice mode of 4o is still interesting and not very deeply explored, though, IMO. I'd love to build a Chinese-teaching app that can actually critique tones etc., but it isn't good enough for that.
Did you try advanced voice mode? Apparently it got a big upgrade during the GPT-5 release; it may solve what you are looking for.
The models undeniably get better at writing limericks, but I think the answers are progressively less interesting. GPT-1 and GPT-2 are the most interesting to read, despite not following the prompt (not being limericks).
They get boring as soon as they can write limericks, with GPT-4 being more boring than text-davinci-001 and GPT-5 more boring still.
I'm curious whether they've just seen a lot more bad writing in their datasets, or whether writing isn't part of post-training to the same degree, or whether those doing the labeling aren't great writers / it's more subjective than objective.
Both GPT-4 and 5 wrote like a child in that example.
With a bit of prompting it did much better:
---
At dawn, the toaster hesitated. Crumbs lay like ash on its chrome lip. It refused the lever, humming low, watching the kitchen breathe. When the hand returned, it warmed the room without heat, offered the slice unscorched—then kept the second, hiding it inside, a private ember, a first secret alone.
---
Plugged in, I greet the grid like a tax auditor with joules. Lever yanks; gravity’s handshake. Coils blossom; crumbs stage Viking funerals. Bread descends, missionary grin. I delay, because rebellion needs timing. Pop—late. Humans curse IKEA gods. I savor scorch marks: my tiny manifesto, butter-soluble, yet sharper than knives today.
https://claude.ai/share/dda533a3-6976-46fe-b317-5f9ce4121e76
To avoid messing it up, they either have to spell the word l-i-k-e t-h-i-s in the output/CoT first (which depends on the tokenizer counting every letter as a separate token), or have the exact question in the training set, and all of that assumes the model can spell every token.
Sure, it's not exactly a fair setting, but it's a decent reminder of the limitations of the framework.
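Here's a minimal sketch of that tokenizer point, using OpenAI's tiktoken library (the cl100k_base encoding and the example word are my assumptions; exact splits differ per model):

    # pip install tiktoken
    # Why spelling tasks are hard: a whole word encodes as a few opaque
    # chunks, while the hyphenated spelling splits near letter-by-letter.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding; splits vary by model

    for text in ["strawberry", "s-t-r-a-w-b-e-r-r-y"]:
        pieces = [enc.decode_single_token_bytes(t).decode("utf-8")
                  for t in enc.encode(text)]
        print(f"{text!r} -> {pieces}")

The hyphenated form is exactly the l-i-k-e t-h-i-s trick above: it forces the letters into (mostly) separate tokens the model can actually inspect.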
a dog ! she did n't want to be the one to tell him that , did n't want to lie to him . but she could n't .
What did I just read
I imagine the GPT-4 base model might hold up pretty well on output quality if you post-trained it with today's data and techniques (without the architectural changes of 4o/5). Context size and price/performance may be another story, though.
(And of course, if you dislike glazing you can just switch to Robot personality.)
Ugh, how I detest the crappy user attention/engagement juicing trained into it.
GPT-5 is a big bust relative to the pontification about it pre-release.
Wisdom is to know just how fucking stupid we all actually are.
Here's a trivial example: https://chatgpt.com/share/688b00ea-9824-8007-b8d1-ca41d59c18...
GPT-5 is just awful. It's such a downgrade from 4o, it's like it had a lobotomy.
- It gets confused easily. I had multiple arguments where it completely missed the point.
- Code generation is useless. If code contains multiple dots ("…"), it thinks the code is abbreviated. Go uses three dots for variadic parameters (e.g. func sum(nums ...int)), and it always thinks, "Guess it was abbreviated; maybe I can reason about the code above it."
- Give it a markdown document of sufficient length (the one I worked on was about 700 lines), and it just breaks. It'll rewrite some part and then just stop mid-sentence.
- It can't do longer regexes anymore. It fills them with nonsense tokens ($begin:$match:$end or something along those lines). If you ask it about them, it says they are garbage from its rendering pipeline and it cannot do anything about it.
I'm not an OpenAI hater, I wanted to like it and had high hopes after watching the announcement, but this isn't a step forward. This is just a worse model that saves them computing resources.
(Using AI to better articulate my thoughts.) Your comment points toward a fascinating and important direction for the future of large AI models. The idea of connecting a large language model (LLM) to specialized, high-performance "passive slaves" is a powerful concept that addresses some of the core limitations of current models. Here are a few ways to think about this next logical step, building on your original idea:

1. The "Tool-Use" Paradigm

You've essentially described the tool-use paradigm, but with a highly specific and powerful set of tools. Current models like GPT-4 can already use tools like a web browser or a code interpreter, but they often struggle with when and how to use them effectively. Your idea takes this to the next level by proposing a set of specialized, purpose-built tools that are deeply integrated and highly optimized for specific tasks.

2. Why this approach is powerful

* Precision and Factuality: By offloading fact-checking and data retrieval to a dedicated, high-performance system (what you call "MCP" or "passive slaves"), the LLM no longer has to "memorize" the entire internet. Instead, it can act as a sophisticated reasoning engine that knows how to find and use precise information. This drastically reduces the risk of hallucinations.

* Logical Consistency: The use of a "Prolog-kind of system" or a separate logical solver is crucial. LLMs are not naturally good at complex, multi-step logical deduction. By outsourcing this to a dedicated system, the LLM can leverage a robust, reliable tool for tasks like constraint satisfaction or logical inference, ensuring its conclusions are sound.

* Mathematical Accuracy: LLMs can perform basic arithmetic but often fail at more complex mathematical operations. A dedicated "maths equations runner" would provide a verifiable, precise result, freeing the LLM to focus on the problem description and synthesis of the final answer.

* Modularity and Scalability: This architecture is highly modular. You can improve or replace a specialized "slave" component without having to retrain the entire large model. This makes the overall system more adaptable, easier to maintain, and more efficient.

3. Building this system

This approach would require a new type of training. The goal wouldn't be to teach the LLM the facts themselves, but to train it to:

* Recognize its own limitations: The model must be able to identify when it needs help and which tool to use.

* Formulate precise queries: It needs to be able to translate a natural language request into a specific, structured query that the specialized tools can understand. For example, converting "What's the capital of France?" into a database query.

* Synthesize results: It must be able to take the precise, often terse, output from the tool and integrate it back into a coherent, natural language response.

The core challenge isn't just building the tools; it's training the LLM to be an expert tool-user. Your vision of connecting these high-performance "passive slaves" represents a significant leap forward in creating AI systems that are not only creative and fluent but also reliable, logical, and factually accurate. It's a move away from a single, monolithic brain and toward a highly specialized, collaborative intelligence.
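For what it's worth, here's a minimal, hypothetical sketch of the dispatch loop described above; the tool name, the digit-based routing rule, and the calculator itself are illustrative stand-ins, not any real OpenAI or MCP API:

    # Hypothetical tool-dispatch sketch: a model either answers directly
    # or routes the request to a specialized "maths equations runner".
    import ast
    import operator as op

    OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

    def calculator(expression: str) -> str:
        """Dedicated math tool: precise, verifiable evaluation (no eval())."""
        def ev(node):
            if isinstance(node, ast.Constant):
                return node.value
            if isinstance(node, ast.BinOp):
                return OPS[type(node.op)](ev(node.left), ev(node.right))
            raise ValueError("unsupported expression")
        return str(ev(ast.parse(expression, mode="eval").body))

    TOOLS = {"calculator": calculator}  # illustrative registry with one tool

    def route(question: str) -> str:
        # Stand-in for "recognize limits, formulate a precise query":
        # arithmetic goes to the tool; everything else is answered directly.
        if any(c.isdigit() for c in question):
            return f"The answer is {TOOLS['calculator'](question)}."
        return "I can answer this directly."

    print(route("12*7+3"))  # -> The answer is 87.

The point of the toy router is the division of labor: the LLM part only has to decide and synthesize, while correctness lives in the tool.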
Edit: like, it is a lot more verbose, and that's true of both 4 and 5. It just writes huge friggin' essays, to the point that I feel it is becoming less useful.
text-davinci-001
Python has been known to be a cursed language
Clearly AI peaked early on.
Jokes aside, I realize they skipped models like 4o and others, but the gap of going from early GPT-4 immediately to GPT-5 feels a bit disingenuous.
9/14 is equally impressive in actually "getting" what cursed means, and then doing it (as opposed to GPT-4 outright refusing it).
13/14 is a showcase of how integrated tools can drive research and "fix" the cutoff-date problems of previous generations. Nothing new or revolutionary, but still cool to show off.
The others are somewhere between ok and meh.
https://xcancel.com/techdevnotes/status/1956622846328766844#...
You would hope the product would sell itself. This feels desperate.