> While sparring with o4-mini was thrilling, its progress was also alarming. Ono and He express concern that the o4-mini’s results might be trusted too much. “There’s proof by induction, proof by contradiction, and then proof by intimidation,” He says. “If you say something with enough authority, people just get scared. I think o4-mini has mastered proof by intimidation; it says everything with so much confidence.”

I think there is a pitfall in treating “reasoning” as a single uniform category, as this article does. It is not surprising to hear that models are good at casting a wide net and fitting many different ideas together by association. However, checking that the pieces actually fit together without unexpected interactions requires formal reasoning, not just correlating the literature.

Reubend · 10h ago

> “There’s proof by induction, proof by contradiction, and then proof by intimidation,” He says. “If you say something with enough authority, people just get scared. I think o4-mini has mastered proof by intimidation; it says everything with so much confidence.”

Proof validation is the perfect solution to this, and indeed I would love to see future improvements to LLMs which allow them to formalize their proofs with a feedback loop from something like Lean or Coq so that they can ensure that hallucinations haven't occurred.
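
To make the idea concrete, here is a minimal sketch of such a feedback loop in Python. Everything in it is an assumption for illustration: `prove_with_feedback`, `propose`, and `check` are hypothetical names, and the toy functions stand in for a model and a proof assistant rather than calling any real Lean, Coq, or LLM API.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces, not a real Lean/Coq/LLM API.
Prover = Callable[[str, str], str]                # (statement, feedback) -> candidate proof
Checker = Callable[[str, str], Tuple[bool, str]]  # (statement, proof) -> (accepted, error text)


def prove_with_feedback(statement: str, propose: Prover, check: Checker,
                        max_rounds: int = 3) -> Optional[str]:
    """Return the first candidate the checker accepts, or None if none is accepted."""
    feedback = ""
    for _ in range(max_rounds):
        candidate = propose(statement, feedback)
        accepted, message = check(statement, candidate)
        if accepted:
            return candidate
        # Hand the checker's error back to the prover for the next attempt.
        feedback = message
    return None


def toy_propose(statement: str, feedback: str) -> str:
    # Stand-in for a model: a placeholder first, a "repair" once it sees feedback.
    return "rfl" if feedback else "sorry"


def toy_check(statement: str, proof: str) -> Tuple[bool, str]:
    # Stand-in for a proof assistant: accepts only the proof "rfl".
    if proof == "rfl":
        return True, ""
    return False, "proof is incomplete"


result = prove_with_feedback("2 + 2 = 4", toy_propose, toy_check)
```

The point of the design is that only the checker decides what counts as a proof, so a confident but wrong answer never leaves the loop; at worst the caller gets `None`.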

alimw · 2h ago

You can already try this in Cursor. It doesn't work too well right now, but perhaps that's just because no one has tuned the loop.

AlexErrant · 9h ago

> “I came up with a problem which experts in my field would recognize as an open question in number theory—a good Ph.D.-level problem,” he says. He asked o4-mini to solve the question... o4-mini presented a correct but sassy solution

I wonder who gets first author credits on that paper.

The Secret Meeting Where Mathematicians Struggled to Outsmart AI

Comments (4)