Evaluating GPT5's reasoning ability using the Only Connect game show

1 point by scrollaway on 8/12/2025, 1:52:51 PM | 1 comment (ingram.tech)

Comments (1)

scrollaway · 2h ago

We evaluated the lateral reasoning abilities of OpenAI's GPT-5 against other models, using an approach based on the notoriously difficult British game show Only Connect, which tests contestants' pattern-matching and trivia skills.

Insights:

- GPT-5 does extremely well, but only marginally better than o3.

- Model verbosity has little impact on accuracy and cleverness, except, interestingly, in the sequences round; "minimal" verbosity, however, causes accuracy to drop sharply.

We'll be publishing additional results in the coming days from our extended tests. We're looking at different types of evals (how do the models fare with a single item in a sequence vs. 2, 3, 4). We would also like to look at how the models behave in a team of 3, replicating the format of the game show.
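As a rough illustration of the "single item vs. 2, 3, 4" comparison, here is a minimal Python sketch that groups answers by how many items of a sequence were shown and reports accuracy per group. This is not the evaluation code behind these results: the `accuracy_by_clues` function, the record layout, and the sample records are all hypothetical placeholders.

```python
from collections import defaultdict


def accuracy_by_clues(results):
    """Return {clues_shown: fraction correct}.

    `results` is an iterable of (clues_shown, correct) pairs, a
    hypothetical record layout chosen only for this sketch.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for clues_shown, correct in results:
        totals[clues_shown] += 1
        hits[clues_shown] += int(correct)
    # Sort keys so the output reads 1, 2, 3, 4 clues in order
    return {k: hits[k] / totals[k] for k in sorted(totals)}


# Placeholder records for illustration only, not real measurements
sample = [(1, False), (1, True), (2, True), (3, True), (4, True)]
print(accuracy_by_clues(sample))
```

Keeping the number of items shown as an explicit field makes it straightforward to add further breakdowns later, for example per model or per round.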

We were unable to find evidence that the Only Connect games are in the training materials (which of course is likely to change now). Finally, we are looking at replicating the connecting wall results with the New York Times' Connections; however, we suspect those puzzles are in the training materials, which would skew the results.