We evaluated the lateral-reasoning abilities of OpenAI's GPT-5 against other models using an approach based on Only Connect, the notoriously difficult British game show that challenges contestants' pattern-matching and trivia skills.
Insights:
- GPT-5 does extremely well, but only marginally better than o3.
- Model verbosity has little impact on accuracy and cleverness, with one interesting exception: the sequences round.
- "Minimal" verbosity, however, causes accuracy to drop sharply.
We'll be publishing additional results from our extended tests in the coming days. We're looking at different types of evals (how do the models fare when shown a single item of a sequence vs. two, three, or four?). We would also like to look at how the models behave in a team of three, replicating the format of the game show.
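As a rough illustration of the sequence-reveal eval described above, here is a minimal harness sketch. It is an assumption about the setup, not our actual code: `guess_fn` stands in for a real model call, and the prompt wording and scoring (exact string match) are hypothetical simplifications.

```python
# Hypothetical sketch of the sequences-round eval: reveal the first k items
# of a sequence and ask the model to predict what comes next.
# `guess_fn` is a placeholder for a real model API call.

def make_prompt(sequence, k):
    """Build a prompt revealing the first k items of a sequence."""
    shown = ", ".join(sequence[:k])
    return f"Sequence so far: {shown}. What comes next?"

def score_run(puzzles, k, guess_fn):
    """Fraction of puzzles where the model's guess matches the answer exactly."""
    correct = 0
    for sequence, answer in puzzles:
        guess = guess_fn(make_prompt(sequence, k))
        correct += int(guess.strip().lower() == answer.strip().lower())
    return correct / len(puzzles)

# Toy example with a single trivial puzzle and a stubbed model.
puzzles = [(["First", "Second", "Third"], "Fourth")]
print(score_run(puzzles, k=2, guess_fn=lambda prompt: "Fourth"))  # 1.0
```

Varying `k` from 1 to 3 (or 4, for longer sequences) then gives a direct read on how much additional context each model needs before the pattern clicks.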
We were unable to find evidence that the Only Connect games are in the models' training data (though, of course, that is likely to change now).
Finally, we are looking at replicating the connecting-wall results with the New York Times' Connections; however, we suspect those puzzles are in the training data, which would skew the results.