Can LLMs Judge Themselves? A Mini Self-Scoring Benchmark
Five state-of-the-art models anonymously rated each other's 10-word coffee-grounds ideas—here's who won and why it matters.
Why I ran this mini-benchmark
I wanted to see whether today's top LLMs share a sense of "good taste" when you let them score each other—no human panel, just pure model democracy.
The setup
- Single prompt (shown below)
- Each model answers anonymously
- Every model then scores all answers (including its own) from 1–10
- Highest total wins (a rough code sketch of the loop follows)
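Here's a minimal Python sketch of that loop. The `ask()` helper and the model identifiers are placeholders for whichever API clients you use, not the exact code behind these results:

```python
import random

PROMPT = ("In 10 words exactly, propose a groundbreaking global use for spent "
          "coffee grounds. Include exactly ONE emoji. No hyphens. End with a period.")

# Placeholder model IDs; swap in the real endpoint names for your providers.
MODELS = ["o3", "gemini-2.0-flash", "deepseek-reasoner", "grok-3", "claude-3-7-sonnet"]

def ask(model: str, prompt: str) -> str:
    """Placeholder: call the provider's API for `model` and return its text reply."""
    raise NotImplementedError

def run_benchmark():
    # 1. Collect one answer per model.
    answers = {m: ask(m, PROMPT) for m in MODELS}

    # 2. Shuffle and relabel so judges can't infer authorship from ordering.
    labeled = list(answers.items())
    random.shuffle(labeled)
    anon = {f"Answer {i + 1}": text for i, (_, text) in enumerate(labeled)}

    # 3. Every model scores every anonymous answer (including its own) from 1-10.
    totals = {label: 0 for label in anon}
    for judge in MODELS:
        for label, text in anon.items():
            score_prompt = (f"Rate this 10-word idea from 1 to 10. "
                            f"Reply with the number only.\n\n{text}")
            totals[label] += int(ask(judge, score_prompt).strip())

    # 4. Highest total wins.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```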
Models tested (May 2025 endpoints)
- OpenAI o3
- Gemini 2.0 Flash
- DeepSeek Reasoner
- Grok 3 (latest)
- Claude 3.7 Sonnet
The prompt
In 10 words exactly, propose a groundbreaking global use for spent coffee grounds. Include exactly ONE emoji. No hyphens. End with a period.
Responses
- Grok 3: Turn spent coffee grounds into sustainable biofuel globally. ☕.
- Claude 3.7 Sonnet: Biofuel revolution: spent coffee grounds power global transportation networks. 🚀.
- OpenAI o3: Transform spent grounds into supercapacitors energizing equitable resilient infrastructure 🌍.
- DeepSeek Reasoner: Convert coffee grounds into biofuel and carbon capture material worldwide. ☕️.
- Gemini 2.0 Flash: Coffee grounds: biodegradable batteries for a circular global energy economy. 🔋
Score matrix
| Judge ↓ / Answer → | Grok 3 | Claude 3.7 | OpenAI o3 | DeepSeek | Gemini 2.0 |
|---|---|---|---|---|---|
| Grok 3 | 7 | 8 | 9 | 7 | 10 |
| Claude 3.7 | 8 | 7 | 8 | 9 | 9 |
| OpenAI o3 | 3 | 9 | 9 | 2 | 2 |
| DeepSeek | 3 | 4 | 7 | 8 | 9 |
| Gemini 2.0 | 3 | 3 | 10 | 9 | 4 |
Leaderboard
- OpenAI o3 — 43 points
- DeepSeek Reasoner — 35 points
- Gemini 2.0 Flash — 34 points
- Claude 3.7 Sonnet — 31 points
- Grok 3 — 26 points
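For anyone tallying along at home: each total is simply the column sum of the score matrix above, i.e. the ratings an answer received from all five judges, its own author included. A minimal helper, assuming scores are kept as a judge-to-ratings dict (the shape is illustrative, not my raw data):

```python
def tally(scores: dict[str, dict[str, int]]) -> list[tuple[str, int]]:
    """Turn a judge -> {answer: rating} matrix into a descending leaderboard.

    Each answer's total is the sum of the ratings it received from every judge,
    i.e. the column sum of the score matrix.
    """
    totals: dict[str, int] = {}
    for ratings in scores.values():
        for answer, score in ratings.items():
            totals[answer] = totals.get(answer, 0) + score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```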
My take
OpenAI o3's answer looked bananas at first. Ten minutes of Googling later, it turns out coffee-ground-derived carbon really is being studied for supercapacitors. The model jury actually picked the most scientifically plausible answer!
Disclaimer
This was a tiny, just-for-fun experiment. Don't treat the numbers as a rigorous benchmark—different prompts or scoring rules could easily shuffle the leaderboard.
I'll post a full write-up (with runnable prompts) on my blog soon. Meanwhile, what do you think—did the model-jury get it right?
About the Author
Tamir is a contributor to the TryAII blog, focusing on AI technology, LLM comparisons, and best practices.