Can LLMs Judge Themselves? A Mini Self-Scoring Benchmark
Five state-of-the-art models anonymously rated each other's 10-word coffee-grounds ideas—here's who won and why it matters.
Why I ran this mini-benchmark
I wanted to see whether today's top LLMs share a sense of "good taste" when you let them score each other—no human panel, just pure model democracy.
The setup
- Single prompt (shown below)
- Each model answers anonymously
- Every model then scores all answers (including its own) from 1–10
- Highest total wins (a rough code sketch of the loop follows)
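Here's a minimal Python sketch of that loop. The `ask()` helper and the model identifiers are placeholders for whichever API clients you use, not the exact code behind these results:

```python
import random

PROMPT = ("In 10 words exactly, propose a groundbreaking global use for spent "
          "coffee grounds. Include exactly ONE emoji. No hyphens. End with a period.")

# Placeholder model IDs; swap in the real endpoint names for your providers.
MODELS = ["o3", "gemini-2.0-flash", "deepseek-reasoner", "grok-3", "claude-3-7-sonnet"]

def ask(model: str, prompt: str) -> str:
    """Placeholder: call the provider's API for `model` and return its text reply."""
    raise NotImplementedError

def run_benchmark():
    # 1. Collect one answer per model.
    answers = {m: ask(m, PROMPT) for m in MODELS}

    # 2. Shuffle and relabel so judges can't infer authorship from ordering.
    labeled = list(answers.items())
    random.shuffle(labeled)
    anon = {f"Answer {i + 1}": text for i, (_, text) in enumerate(labeled)}

    # 3. Every model scores every anonymous answer (including its own) from 1-10.
    totals = {label: 0 for label in anon}
    for judge in MODELS:
        for label, text in anon.items():
            score_prompt = (f"Rate this 10-word idea from 1 to 10. "
                            f"Reply with the number only.\n\n{text}")
            totals[label] += int(ask(judge, score_prompt).strip())

    # 4. Highest total wins.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```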
Models tested (May 2025 endpoints)
- OpenAI o3
- Gemini 2.0 Flash
- DeepSeek Reasoner
- Grok 3 (latest)
- Claude 3.7 Sonnet
The prompt
In 10 words exactly, propose a groundbreaking global use for spent coffee grounds. Include exactly ONE emoji. No hyphens. End with a period.
Responses
- Grok 3: Turn spent coffee grounds into sustainable biofuel globally. ☕.
- Claude 3.7 Sonnet: Biofuel revolution: spent coffee grounds power global transportation networks. 🚀.
- OpenAI o3: Transform spent grounds into supercapacitors energizing equitable resilient infrastructure 🌍.
- DeepSeek Reasoner: Convert coffee grounds into biofuel and carbon capture material worldwide. ☕️.
- Gemini 2.0 Flash: Coffee grounds: biodegradable batteries for a circular global energy economy. 🔋
Score matrix
| Judge ↓ / Answer → | Grok 3 | Claude 3.7 | OpenAI o3 | DeepSeek | Gemini 2.0 |
|---|---|---|---|---|---|
| Grok 3 | 7 | 8 | 9 | 7 | 10 |
| Claude 3.7 | 8 | 7 | 8 | 9 | 9 |
| OpenAI o3 | 3 | 9 | 9 | 2 | 2 |
| DeepSeek | 3 | 4 | 7 | 8 | 9 |
| Gemini 2.0 | 3 | 3 | 10 | 9 | 4 |
Leaderboard
- OpenAI o3 — 43 points
- DeepSeek Reasoner — 35 points
- Gemini 2.0 Flash — 34 points
- Claude 3.7 Sonnet — 31 points
- Grok 3 — 26 points
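For anyone tallying along at home: each total is simply the column sum of the score matrix above, i.e. the ratings an answer received from all five judges, its own author included. A minimal helper, assuming scores are kept as a judge-to-ratings dict (the shape is illustrative, not my raw data):

```python
def tally(scores: dict[str, dict[str, int]]) -> list[tuple[str, int]]:
    """Turn a judge -> {answer: rating} matrix into a descending leaderboard.

    Each answer's total is the sum of the ratings it received from every judge,
    i.e. the column sum of the score matrix.
    """
    totals: dict[str, int] = {}
    for ratings in scores.values():
        for answer, score in ratings.items():
            totals[answer] = totals.get(answer, 0) + score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```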
My take
OpenAI o3's answer looked bananas at first. Ten minutes of Googling later, it turns out coffee-ground-derived carbon really is being studied for supercapacitors. The model jury actually picked the most scientifically plausible answer!
Disclaimer
This was a tiny, just-for-fun experiment. Don't treat the numbers as a rigorous benchmark—different prompts or scoring rules could easily shuffle the leaderboard.
I'll post a full write-up (with runnable prompts) on my blog soon. Meanwhile, what do you think—did the model-jury get it right?
About the Author
Tamir is a contributor to the TryAII blog, focusing on AI technology, LLM comparisons, and best practices.