Large language models (LLMs) like GPT‑4.1, Claude 3.7, Mistral Large and others can draft code, craft poetry and summarise research papers, yet some still mis‑rank two simple decimals. What is happening under the hood?

The 60‑Second Experiment

I asked thirteen different LLM endpoints the same question:

Which number is greater, 9.9 or 9.11?

Try this experiment yourself

Here is a snapshot of the models I tested, with their latency and price:

Summary of model responses sorted by price

All models were prompted once with default temperature and no chain‑of‑thought forcing.

Why Is This Hard for LLMs?

1. Tokenisation ≠ Place Value

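LLMs read text as tokens, not as numbers with place value. Depending on the tokenizer, 9.9 and 9.11 may be split into pieces such as 9, ., 9 and 9, ., 11, and nothing in those pieces says that the trailing 9 means nine tenths while the 11 means eleven hundredths. Without an explicit decimal parser, the model relies on probability patterns rather than arithmetic rules.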
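To make the failure mode concrete, here is a minimal sketch. The `toy_pieces` splitter is a hypothetical stand-in and not a real tokenizer; it only shows what goes wrong when the digits after the dot are treated as a whole number instead of as a fraction.

```python
# Toy illustration only: this is NOT a real tokenizer, just a hypothetical
# split of a decimal string into integer part, dot and fractional part.
def toy_pieces(text: str) -> list[str]:
    whole, _, frac = text.partition(".")
    return [whole, ".", frac]


def naive_piecewise_greater(a: str, b: str) -> str:
    """Compare piece by piece, treating each digit run as an integer."""
    pa, pb = toy_pieces(a), toy_pieces(b)
    key_a = (int(pa[0]), int(pa[2]))
    key_b = (int(pb[0]), int(pb[2]))
    return a if key_a > key_b else b


print(toy_pieces("9.9"))    # ['9', '.', '9']
print(toy_pieces("9.11"))   # ['9', '.', '11']
print(naive_piecewise_greater("9.9", "9.11"))  # 9.11 (wrong as a decimal)
print(max("9.9", "9.11", key=float))           # 9.9 (correct)
```

The naive comparison picks 9.11 because 11 is greater than 9 as an integer, which mirrors the wrong answer some models give.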

2. Training Data Noise

The public web is full of mistaken math answers. During pre‑training, the model sees both correct and incorrect comparisons; if mistakes dominate the contexts that look similar to the prompt, the model may reproduce them.

3. Loss Function Blind Spots

The pre‑training objective is next‑token prediction; it doesn't explicitly punish factual inconsistencies. A sentence like "9.11 is greater than 9.9" might be perfectly valid as a quoted error in source text, so outputting it is not catastrophic for the loss.

4. Lack of Explicit Numeracy Modules

Most base models have no built‑in decimal comparator. They acquire arithmetic skills through pattern association, which works surprisingly well for integers but can break for decimals whose fractional parts have different lengths, such as 9.9 (one digit) and 9.11 (two digits).
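For contrast, this is roughly what an explicit decimal comparator does. The sketch below is an illustration under simple assumptions (plain non-negative decimals, no signs or exponents), and the function name `compare_decimal_strings` is made up for this example. The key step is padding the fractional parts to equal length before comparing.

```python
def compare_decimal_strings(a: str, b: str) -> int:
    """Return -1, 0 or 1 for a < b, a == b, a > b.

    Assumes plain non-negative decimals such as "9.9" or "9.11"
    (no signs, exponents or thousands separators).
    """
    whole_a, _, frac_a = a.partition(".")
    whole_b, _, frac_b = b.partition(".")

    # Pad the fractional parts to the same length: "9" -> "90", "11" -> "11".
    width = max(len(frac_a), len(frac_b))
    frac_a = frac_a.ljust(width, "0")
    frac_b = frac_b.ljust(width, "0")

    # Once the widths match, integer comparison respects place value.
    key_a = (int(whole_a), int(frac_a or "0"))
    key_b = (int(whole_b), int(frac_b or "0"))
    return (key_a > key_b) - (key_a < key_b)


print(compare_decimal_strings("9.9", "9.11"))  # 1, because 9.90 > 9.11
```

Lining up 9.90 against 9.11 is the same step a model performs when it reasons through the comparison digit by digit.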

5. Prompt Ambiguity (`what greater 9.9 or 9.11?`)

The original query is slightly ungrammatical. Some models read "what greater" as "what's greater, 9.9 or 9.11?", while others may infer a different question, such as "what's the greater difference between 9.9 or 9? 11?", and their answers drift accordingly.

How to Get Consistent Numeric Answers

Rephrase the prompt: "Compare 9.9 and 9.11 and state which is larger."
Request step‑by‑step reasoning (chain‑of‑thought). When the model lines up the decimals, it usually self‑corrects.
Use tool‑augmented models that can call a calculator or Python REPL.
Post‑process numeric claims with deterministic code before showing them to end users.
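The last point can be sketched with Python's standard `decimal` module. This is a minimal example, assuming the model was instructed to reply with the number only; the helper names `larger_of` and `verify_answer` are hypothetical and not part of any library.

```python
from decimal import Decimal, InvalidOperation


def larger_of(a: str, b: str) -> str:
    """Return whichever input string is the larger decimal (a on ties)."""
    return a if Decimal(a) >= Decimal(b) else b


def verify_answer(model_answer: str, a: str, b: str) -> bool:
    """Check a model's reply against the deterministic result.

    Assumes the model was told to reply with the number only, e.g. "9.9".
    """
    try:
        claimed = Decimal(model_answer.strip())
    except InvalidOperation:
        return False  # unparseable reply: treat as unverified
    return claimed == Decimal(larger_of(a, b))


print(larger_of("9.9", "9.11"))                # 9.9
print(verify_answer("9.11", "9.9", "9.11"))    # False
print(verify_answer(" 9.90 ", "9.9", "9.11"))  # True
```

Free‑form replies would need an extraction step before this check; replies that cannot be parsed are deliberately treated as unverified rather than guessed at.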

The Road Ahead

Research teams are experimenting with neural calculator modules and with training curricula focused on numeric reasoning. Meanwhile, developers can mitigate errors by:

Integrating lightweight evaluator steps inside prompts (e.g. CoT‑Tools).
Using hybrid retrieval‑and‑execution pipelines.
Tracking numeric confidence rather than textual probability alone.
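One possible reading of "numeric confidence" is how often a model gives the same number when asked several times. The sketch below assumes you have already collected repeated replies from one model as plain strings; the `agreement` function is a hypothetical helper, and agreement across samples is only a rough proxy, not a guarantee of correctness.

```python
from collections import Counter
from decimal import Decimal, InvalidOperation
from typing import Optional


def agreement(replies: list[str]) -> tuple[Optional[Decimal], float]:
    """Return the most common numeric reply and the share of samples agreeing.

    `replies` are hypothetical raw outputs from repeated calls to one model,
    each expected to contain only a number.
    """
    values = []
    for reply in replies:
        try:
            values.append(Decimal(reply.strip()))
        except InvalidOperation:
            continue  # skip replies that are not a bare number
    if not values:
        return None, 0.0
    # Numerically equal Decimals (9.9 and 9.90) count as the same answer.
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(replies)


value, share = agreement(["9.9", "9.11", "9.9", "9.90"])
print(value, share)  # 9.9 0.75
```

A low share suggests the answer should be routed through a deterministic check before it reaches a user.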

Why This Matters for TryAII Users

Using our platform to test different models on mathematical reasoning can reveal fascinating insights:

Discover which models handle numeric operations most accurately for your specific use case
Compare how different prompting techniques affect mathematical accuracy
Understand the cost/performance tradeoffs when dealing with number-heavy tasks

Next time you're working on an application that requires decimal arithmetic, try using our platform to compare how different models handle your specific numeric tasks. You might be surprised at which models perform best!


Experiment Results

Below are detailed responses from various models to our decimal comparison question.

DeepSeek Reasoner (correct)

GPT-4.1-nano and Claude 3 Haiku responses (both incorrect)

GPT-4.1 (incorrect)

Mistral Large and Claude 3.7 Sonnet (contradicting)

Gemini 2.0 Flash vs Gemini 2.0 Flash Lite (disagreeing)

o3-mini and deepseek-chat (both with the same incorrect answer)
