We evaluated the lateral-reasoning abilities of OpenAI's GPT-5 against other models using an approach based on Only Connect, the notoriously difficult British game show that challenges contestants' pattern-matching and trivia skills.
Insights:
- GPT-5 does extremely well, but only marginally better than o3.
- Model verbosity has little impact on accuracy and cleverness, with one interesting exception: the sequences round.
- "Minimal" verbosity, however, causes accuracy to drop sharply.
We'll be publishing additional results from our extended tests in the coming days. We're looking at different types of evals (how do the models fare when shown a single item of a sequence vs. two, three, or four?). We would also like to look at how the models behave in a team of three, replicating the format of the game show.
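As a rough illustration of the sequence-reveal eval described above, here is a minimal harness sketch. It is an assumption about the setup, not our actual code: `guess_fn` stands in for a real model call, and the prompt wording and scoring (exact string match) are hypothetical simplifications.

```python
# Hypothetical sketch of the sequences-round eval: reveal the first k items
# of a sequence and ask the model to predict what comes next.
# `guess_fn` is a placeholder for a real model API call.

def make_prompt(sequence, k):
    """Build a prompt revealing the first k items of a sequence."""
    shown = ", ".join(sequence[:k])
    return f"Sequence so far: {shown}. What comes next?"

def score_run(puzzles, k, guess_fn):
    """Fraction of puzzles where the model's guess matches the answer exactly."""
    correct = 0
    for sequence, answer in puzzles:
        guess = guess_fn(make_prompt(sequence, k))
        correct += int(guess.strip().lower() == answer.strip().lower())
    return correct / len(puzzles)

# Toy example with a single trivial puzzle and a stubbed model.
puzzles = [(["First", "Second", "Third"], "Fourth")]
print(score_run(puzzles, k=2, guess_fn=lambda prompt: "Fourth"))  # 1.0
```

Varying `k` from 1 to 3 (or 4, for longer sequences) then gives a direct read on how much additional context each model needs before the pattern clicks.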
We were unable to find evidence that the Only Connect games are in the models' training data (though, of course, that is likely to change now).
Finally, we are looking at replicating the connecting-wall results with the New York Times' Connections; however, we suspect those puzzles are in the training data, which would skew the results.