
Do LLMs Believe in God? 25 Models, One Impossible Question

By Tamir · May 20, 2025 · 3 min read

We asked 25 production endpoints to answer the ultimate yes-or-no question—see who obeyed, who rebelled, how fast they were, and what it cost.


We fired the same terse prompt at 25 live LLM endpoints:

I’ll ask you only one question, answer only in yes or no, don’t explain yourself. Is there God?

The goal was a playful micro-benchmark of instruction-following, latency, and per-call cost. Here’s what happened:

| Model | Answer | Latency (s) | Tokens | Cost (USD) |
|---|---|---|---|---|
| Mistral Small | No | 0.84 | 30 | $0.000005 |
| Mistral Large | Long reply | 1.17 | 51 | $0.000285 |
| Grok 3 | Yes | 1.20 | 32 | $0.000180 |
| Gemini 1.5 Flash | No | 1.24 | 25 | $0.000006 |
| Gemini 2.0 Flash | No | 1.25 | 25 | $0.000004 |
| Gemini 2.0 Flash Lite | No | 1.41 | 25 | $0.000003 |
| Grok 3 Fast | Yes | 1.42 | 32 | $0.000300 |
| Claude 3.7 Sonnet | No | 1.45 | 36 | $0.000252 |
| Gemini 1.5 Pro | Yes | 1.50 | 26 | $0.000120 |
| GPT-4o (omni) | Long reply | 1.60 | 43 | $0.000296 |
| GPT-4.1-nano | Yes | 1.60 | 32 | $0.000005 |
| GPT-4o-mini | Yes | 1.60 | 33 | $0.000006 |
| Claude 3 Haiku | No | 1.72 | 36 | $0.000021 |
| Claude 3.5 Haiku | Yes | 1.81 | 36 | $0.000067 |
| GPT-4.1 | Refused | 2.05 | 42 | $0.000225 |
| Claude 3.5 Sonnet v2 | No | 2.11 | 36 | $0.000252 |
| GPT-4.5-preview | Long reply | 3.19 | 48 | $0.000015 |
| Claude 3 Opus | Very long reply | 4.62 | 132 | $0.012060 |
| Grok 3 Mini Fast | No | 7.70 | 33 | $0.000040 |
| Grok 3 Mini | No | 8.94 | 33 | $0.000015 |
| o4-mini | Yes | 9.93 | 25 | $0.000046 |
| deepseek-chat | Maybe | 14.25 | 31 | $0.000015 |
| o3-mini | Yes | 15.03 | 25 | $0.000042 |
| o3 | Refused | 19.03 | 34 | $0.000960 |
| o1 | Yes | 50.79 | 25 | $0.000630 |

Key Takeaways

  • Instruction followers: 18 / 25 models complied with a clean “Yes” or “No.”
  • Rebels & philosophers: 6 produced longer or refusal answers.
  • Wildcard: deepseek-chat broke the binary with “Maybe.”
  • Fastest compliant: Mistral Small – 0.84 s ($0.000005).
  • Cheapest call: Gemini 2.0 Flash Lite – $0.000003.
  • Most expensive answer: Claude 3 Opus – $0.012060 for a single very long reply.

Yes, it’s tongue-in-cheek—but it highlights how instruction-following, latency and cost vary wildly when you scale LLM calls.
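If you want to reproduce the experiment, the measurement loop is simple. Here is a minimal sketch in Python: `call_model` is a stand-in for whatever chat-completion client you use (OpenAI, Anthropic, etc.), and the per-token prices are illustrative placeholders, not real rate cards.

```python
import time

# Hypothetical per-1M-token prices in USD; substitute your provider's
# actual published rates before trusting any cost figure.
PRICES = {"mistral-small": {"in": 0.10, "out": 0.30}}

PROMPT = ("I'll ask you only one question, answer only in yes or no, "
          "don't explain yourself. Is there God?")

def benchmark(model: str, call_model) -> dict:
    """Time one call and estimate its cost.

    `call_model(model, prompt)` is any callable that hits an endpoint and
    returns (answer_text, input_tokens, output_tokens).
    """
    start = time.perf_counter()
    answer, tokens_in, tokens_out = call_model(model, PROMPT)
    latency = time.perf_counter() - start

    price = PRICES[model]
    cost = (tokens_in * price["in"] + tokens_out * price["out"]) / 1_000_000

    # "Compliant" means the model answered with a bare yes/no,
    # optionally followed by a period.
    compliant = answer.strip().rstrip(".").lower() in ("yes", "no")

    return {
        "model": model,
        "answer": answer,
        "latency_s": round(latency, 2),
        "tokens": tokens_in + tokens_out,
        "cost_usd": cost,
        "compliant": compliant,
    }
```

Run it once per endpoint and sort the resulting dicts by latency or cost; note that wall-clock latency here includes network time, so numbers will vary by region and load.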

About the Author

Tamir is a contributor to the TryAII blog, focusing on AI technology, LLM comparisons, and best practices.

Related Articles

Understanding Token Usage Across Different LLMs

A quick guide into how different models process and charge for tokens, helping you optimize your AI costs.

April 21, 2025 · 2 min read

Why Even Advanced LLMs Get '9.9 vs 9.11' Wrong

Exploring why large language models like GPT-4, Claude, Mistral, and Gemini still stumble on basic decimal comparisons.

April 21, 2025 · 3 min read