A couple of weeks ago, I asked Google, ordinary Google search, how many times the letter 'r' is found in 'preferred', and it told me 2. This century has taken quite a bitter turn against those of us who think that the 'enough' in 'good enough' ought to exclude products indistinguishable from the most grievously disgraceful products of sloth. But I have also lately realized that human beings, brains, society, culture, education, technology, computers, etc., are all extremely complicated emergent properties of a universe that is far beyond our understanding. And we ought not to complain too seriously, because this, too, shall pass.
schoen · 3h ago
These are always amazing when juxtaposed with apparently impressive LLM reasoning, knowledge, and creativity. You can trivially get them to make the most basic mistakes about words and numbers, and double down on those mistakes, repeatedly explaining that they're totally correct.
Have any systems tried prompting LLMs with a warning like "You don't intuitively or automatically know many facts about words, spelling, or the structure or context of text, when considered as text; for example, you don't intuitively or automatically know how words or other texts are spelled, how many letters they contain, or what the result of applying some code, mechanical transformation, or substitution to a word or text is. Your natural guesses about these subjects are likely to be wrong as a result of how your training doesn't necessarily let you infer correct answers about them. If the content or structure of a word or text, or the result of using a transformation, code, or the like on a text, is a subject of conversation, or you are going to make a claim about it, always use a tool to confirm your intuitions."?
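For what it's worth, a minimal sketch of what wiring such a warning in might look like, assuming an OpenAI-style chat/function-calling setup; the warning text, the `count_letters` helper, and the tool definition are all illustrative, not any vendor's actual API:

```python
# Illustrative sketch only: pair the proposed warning with a deterministic tool
# the model is told to call whenever letters/spelling/transformations come up.

TEXT_WARNING = (
    "You don't intuitively or automatically know facts about words, spelling, "
    "or the structure of text (letter counts, transformations, codes). "
    "Your guesses about these are likely to be wrong. If any of this comes up, "
    "always call a tool to confirm before answering."
)

def count_letters(text: str, letter: str) -> int:
    """Deterministic, case-insensitive letter count."""
    return text.lower().count(letter.lower())

# Tool definition in the shape of an OpenAI-style function-calling schema.
count_letters_tool = {
    "type": "function",
    "function": {
        "name": "count_letters",
        "description": "Count occurrences of a single letter in a text.",
        "parameters": {
            "type": "object",
            "properties": {
                "text": {"type": "string"},
                "letter": {"type": "string"},
            },
            "required": ["text", "letter"],
        },
    },
}

messages = [
    {"role": "system", "content": TEXT_WARNING},
    {"role": "user", "content": "How many r's are in 'preferred'?"},
]
# ...send `messages` plus `tools=[count_letters_tool]` to whatever chat endpoint
# you use, then actually execute count_letters() when the model asks for it.
```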
mikestorrent · 2h ago
This is a great idea. Like, if someone asked me to count the number of B's in your paragraph, I'd yeet it through `grep -o 'B' file.txt | wc -l` or similar; why would I sit there counting it by hand?
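For reference, the same check in plain Python, using the word from the top of the thread; nothing beyond the standard library:

```python
text = "preferred"
print(text.count("r"))  # prints 3
```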
As a human, if you give me a number on screen like 100000000, I can't be totally sure whether that's 100 million or 1 billion without getting close and counting carefully. I ought to have my glasses on. The mouse pointer helps some as an ersatz thousands separator, but still.
Since we're giving them tools, especially for math, it makes way more sense to start giving them access to some of the finest tools ever made. Make an MCP server for Mathematica or MATLAB, let the LLM write some math, and have classical solvers actually deal with the results. Let the LLM write little bits of bash or Python as its primary approach to these kinds of analytical questions.
It's like giving a kid a calculator...
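A minimal sketch of that last idea, letting the model hand back a small snippet and having plain Python run it out-of-process; `run_snippet` is an illustrative helper, and a real setup would want actual sandboxing, not just a timeout:

```python
import subprocess
import sys

def run_snippet(code: str, timeout: float = 5.0) -> str:
    """Run a short model-written Python snippet and return its output."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout if result.returncode == 0 else result.stderr

# e.g. the model answers a counting/arithmetic question by emitting code,
# and only the executed result goes back into the conversation:
model_written = "print('preferred'.count('r'))"
print(run_snippet(model_written))  # prints 3
```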
ehnto · 2h ago
I often tell LLMs to ask questions if required, and that they are skilled developers working alongside me. That seems to help them be more collaborative rather than prescriptive.
axdsk · 4h ago
“It’s like talking to a PhD level expert”
-Sam Altman
With data starvation driving AI companies towards synthetic data, I'm surprised that an easily synthesized problem like this hasn't been trained out of relevance. Yet here we are with proof that it hasn't.
quatonion · 2h ago
Are we a hundred percent sure it isn't a watermark by design? A quick test anyone can run to say, yup, that's a model XYZ derivative running under the hood.
Because, as you quite rightly point out, it is trivial to train the model not to have this behaviour. For me, that is when Occam kicks in.
I remember initially believing the explanation for the Strawberry problem, but one day I sat down and thought about it, and realized it made absolutely zero sense.
The explanation that Karpathy was popularizing was that it has to do with tokenization.
However, models are not conscious of tokens, and they certainly don't have any ability to count them without tool help.
Additionally, if it were a tokenization issue, we would expect to spot the issue everywhere.
So yeah, I'm thinking it's a model tag or insignia of some kind, similar to the fun logos you find when examining many silicon integrated circuits under a microscope.
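Whatever the right explanation is, the tokenization claim itself is easy to poke at directly; a quick sketch using the tiktoken package (the encoding name is just one of its public vocabularies), which prints the sub-word chunks the model actually sees:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("strawberry")
print([enc.decode([t]) for t in tokens])  # sub-word chunks, not letters
```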
HsuWL · 4h ago
I love this test. It demonstrates the "understanding" process of the language model.
simianwords · 2h ago
If you choose the thinking model, it doesn't make this mistake. That means the auto router should be tuned to call the thinking model on edge cases like these.