Why Even Advanced LLMs Get '9.9 vs 9.11' Wrong

By Tamir, April 21, 2025 (3 min read)

Exploring why large language models like GPT-4, Claude, Mistral, and Gemini still stumble on basic decimal comparisons.

Large language models (LLMs) like GPT‑4.1, Claude 3.7, Mistral Large and others can draft code, craft poetry and summarise research papers, yet some still mis‑rank two simple decimals. What is happening under the hood?

The 60‑Second Experiment

I asked thirteen different LLM endpoints the same question:

Which number is greater, 9.9 or 9.11?

Here is a snapshot of the models tested, with latency and price:

[Figure: Summary of model responses, sorted by price]

All models were prompted once with default temperature and no chain‑of‑thought forcing.

Why Is This Hard for LLMs?

1. Tokenisation ≠ Place Value

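LLMs read text as tokens, not as digits with place value. Depending on the tokenizer, 9.9 and 9.11 are split differently, for example 9.11 into the three-token sequence 9, ., 11, so the model sees an "11" next to a "9" rather than one tenth and one hundredth. Without an explicit decimal parser, the model relies on probability patterns rather than arithmetic rules.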
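As a rough illustration (not a claim about what any model does internally), here is what happens when the pieces around the dot are compared as whole numbers, which is the right rule for version numbers and the wrong one for decimals. The function names are hypothetical and only serve this sketch.

```python
from decimal import Decimal


def compare_as_pieces(a: str, b: str) -> str:
    """Compare 'X.Y' strings piece by piece, as one would for version numbers.

    This is the wrong rule for decimals: the digits after the dot are
    treated as a standalone integer, so '11' beats '9'.
    """
    key_a = tuple(int(part) for part in a.split("."))
    key_b = tuple(int(part) for part in b.split("."))
    return a if key_a > key_b else b


def compare_as_decimals(a: str, b: str) -> str:
    """Compare the same strings by place value using exact decimal arithmetic."""
    return a if Decimal(a) > Decimal(b) else b


print(compare_as_pieces("9.9", "9.11"))    # 9.11 (version-style reading)
print(compare_as_decimals("9.9", "9.11"))  # 9.9  (place-value reading)
```

Both readings are common in text, which is part of why a pattern-based answer can land on either one.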

2. Training Data Noise

The public web is full of mistaken math answers. During pre‑training, the model sees both correct and incorrect comparisons; if mistakes dominate the contexts that look similar to the prompt, the model may reproduce them.

3. Loss Function Blind Spots

The pre‑training objective is next‑token prediction; it doesn't explicitly punish factual inconsistencies. A sentence like "9.11 is greater than 9.9" might be perfectly valid as a quoted error in source text, so outputting it is not catastrophic for the loss.

4. Lack of Explicit Numeracy Modules

Most base models have no built‑in decimal comparator. They acquire arithmetic skills through pattern association, which works surprisingly well for integers but breaks for decimals with different numbers of fractional digits, such as 9.9 and 9.11.
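For comparison, an explicit decimal comparator is only a few lines. The sketch below assumes well-formed, non-negative decimal strings; the name align_and_compare is hypothetical. It pads the fractional parts to the same length, so 9.9 becomes 9.90 before it is compared with 9.11.

```python
def align_and_compare(a: str, b: str) -> str:
    """Return the larger of two non-negative decimal strings.

    Pads the fractional parts with trailing zeros to equal length, then
    compares (integer part, fractional part) as pairs of integers.
    """
    int_a, _, frac_a = a.partition(".")
    int_b, _, frac_b = b.partition(".")

    # Line up the decimals: "9" and "11" become "90" and "11".
    width = max(len(frac_a), len(frac_b))
    frac_a = frac_a.ljust(width, "0")
    frac_b = frac_b.ljust(width, "0")

    key_a = (int(int_a or "0"), int(frac_a or "0"))
    key_b = (int(int_b or "0"), int(frac_b or "0"))
    return a if key_a >= key_b else b


print(align_and_compare("9.9", "9.11"))  # 9.9, because 90 hundredths > 11 hundredths
```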

5. Prompt Ambiguity (what greater 9.9 or 9.11?)

The original query is slightly ungrammatical. Some models may read "what greater" as "what's greater, 9.9 or 9.11?", while others may infer something like "what's the greater difference between 9.9 and 9.11?", causing drift.

How to Get Consistent Numeric Answers

  • Rephrase the prompt: "Compare 9.9 and 9.11 and state which is larger."
  • Request step‑by‑step reasoning (chain‑of‑thought). When the model lines up the decimals, it usually self‑corrects.
  • Use tool‑augmented models that can call a calculator or Python REPL.
  • Post‑process numeric claims with deterministic code before showing them to end users.
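The last point can be sketched in a few lines. This is a minimal example, assuming the model was asked to reply with just the larger number and that the question contains exactly two plain decimals; larger_number and verify_answer are hypothetical helper names, not part of any library.

```python
import re
from decimal import Decimal

# Matches plain non-negative numbers such as "9", "9.9" or "9.11".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def larger_number(question: str) -> str:
    """Return the larger of exactly two decimal numbers found in the question."""
    found = NUMBER.findall(question)
    if len(found) != 2:
        raise ValueError(f"expected two numbers, found {len(found)}")
    return max(found, key=Decimal)


def verify_answer(question: str, model_answer: str) -> bool:
    """Check the first number in the model's reply against the exact result."""
    claimed = NUMBER.search(model_answer)
    if claimed is None:
        return False
    return Decimal(claimed.group()) == Decimal(larger_number(question))


question = "Which number is greater, 9.9 or 9.11?"
print(larger_number(question))          # 9.9
print(verify_answer(question, "9.11"))  # False
print(verify_answer(question, "9.9"))   # True
```

Replies that fail the check can be retried, corrected, or flagged before they reach end users.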

The Road Ahead

Research teams are actively injecting neural calculators and designing curricula focused on numeric reasoning. Meanwhile, developers can mitigate errors by:

  1. Integrating lightweight evaluator steps inside prompts (e.g. CoT‑Tools).
  2. Using hybrid retrieval‑and‑execution pipelines.
  3. Tracking numeric confidence rather than textual probability alone.
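One possible reading of the third point, and it is my assumption rather than an established method, is to sample the same question several times and measure how often the numeric answers agree. The sketch below takes already-collected answers as plain strings; agreement_confidence is a hypothetical name.

```python
from collections import Counter


def agreement_confidence(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and the share of samples that gave it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answer.strip() for answer in answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


# Four hypothetical samples of the same prompt, for illustration only.
samples = ["9.9", "9.11", "9.9", "9.9"]
print(agreement_confidence(samples))  # ('9.9', 0.75)
```

A low agreement share is a cue to route the question to a calculator or other deterministic check instead of trusting the majority answer.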

Why This Matters for TryAII Users

Using our platform to test different models on mathematical reasoning can reveal fascinating insights:

  • Discover which models handle numeric operations most accurately for your specific use case
  • Compare how different prompting techniques affect mathematical accuracy
  • Understand the cost/performance tradeoffs when dealing with number-heavy tasks

Next time you're working on an application that requires decimal arithmetic, try using our platform to compare how different models handle your specific numeric tasks. You might be surprised at which models perform best!

Experiment Results

Below are detailed responses from various models to our decimal comparison question.

[Figure: DeepSeek Reasoner (correct)]

[Figure: GPT-4.1-nano and Claude 3 Haiku responses (both incorrect)]

[Figure: GPT-4.1 (incorrect)]

[Figure: Mistral Large and Claude 3.7 Sonnet (contradicting)]

[Figure: Gemini 2.0 Flash vs Gemini 2.0 Flash Lite (disagreeing)]

[Figure: o3-mini and deepseek-chat (both with the same incorrect answer)]

About the Author

Tamir is a contributor to the TryAII blog, focusing on AI technology, LLM comparisons, and best practices.
