Letting the AIs Judge Themselves: One Creative Prompt, the Coffee-Ground Test

I work on finding the best way to benchmark today's LLMs, and I thought about a different kind of competition.

*Why I Ran This Mini-Benchmark*

I wanted to see whether today’s top LLMs share a sense of “good taste” when you let them score each other: no human panel, just pure model *democracy*.

*The Setup*

One prompt. Let the models judge and score each other (anonymously); the highest overall score wins.
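A minimal sketch of how such a tally could work, not the exact procedure used here. It assumes each judge returns a numeric score per anonymized answer; the function name `tally`, the `exclude_self` switch, and the placeholder labels `A`, `B`, `C` are all hypothetical, and the post does not say whether a model's score for its own answer counts.

```python
from collections import defaultdict

def tally(ballots, exclude_self=False):
    """Sum the scores each contestant received and rank them, highest first.

    ballots maps judge -> {contestant: score}.
    exclude_self is an assumption: drop a judge's score for its own answer.
    """
    totals = defaultdict(float)
    for judge, ballot in ballots.items():
        for contestant, score in ballot.items():
            if exclude_self and judge == contestant:
                continue
            totals[contestant] += score
    # Sort by total score, descending; ties keep first-seen order.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Made-up scores for three placeholder contestants, for illustration only.
example_ballots = {
    "A": {"A": 9, "B": 7, "C": 5},
    "B": {"A": 6, "B": 8, "C": 7},
    "C": {"A": 7, "B": 9, "C": 10},
}
ranking = tally(example_ballots)
winner = ranking[0][0]
```

Comparing the ranking with and without `exclude_self` is a quick way to see how much a model's vote for itself moves the result.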

*Models tested (all May 2025 endpoints)*

* *OpenAI o3*
* *Gemini 2.0 Flash*
* *DeepSeek Reasoner*
* *Grok 3 (latest)*
* *Claude 3.7 Sonnet*

*Single prompt given to every model:*

In exactly *10* words, propose a groundbreaking global use for spent coffee grounds. Include *one* emoji, no hyphens, end with a period.
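The prompt's four constraints can be checked mechanically before any scoring happens. The sketch below is one way to do that and rests on assumptions: `check_answer` is a hypothetical helper, the emoji pattern covers only the common pictograph blocks (multi-codepoint emoji would be miscounted), a standalone emoji is not counted as a word, and only the plain ASCII hyphen is rejected.

```python
import re

# Rough emoji matcher (an assumption): common pictograph and symbol blocks only.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def check_answer(text):
    """Return a dict of pass/fail flags for the prompt's four constraints."""
    stripped = text.strip()
    # A word is a whitespace-separated token with at least one letter or digit,
    # so a token made only of an emoji and punctuation is not counted.
    words = [tok for tok in stripped.split() if any(ch.isalnum() for ch in tok)]
    return {
        "ten_words": len(words) == 10,
        "one_emoji": len(EMOJI_RE.findall(stripped)) == 1,
        "no_hyphens": "-" not in stripped,
        "ends_with_period": stripped.endswith("."),
    }

# Made-up answer for illustration; not an actual model output.
sample = "Turn spent coffee grounds into cheap biodegradable bricks for housing ☕."
result = check_answer(sample)
```

Whether the emoji counts toward the 10 words is ambiguous in the prompt itself, so a stricter or looser word rule would be just as defensible.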
