How Confident Are You, ChatGPT?

Comments (1)

martianlantern · 1h ago

Very insightful post. This may work in the IMO setting because mathematical problems are inherently binary, if we ignore some things like the incompleteness theorem. In contrast, subjective tasks, such as evaluating a painting or rating a poem, lack absolute truth. How would such reasoners estimate confidence in these cases, and to what extent could RL techniques that are effective on the IMO transfer to real-world problems?
