What “Lunapolis” Reveals About the Shared Training Corpora of Modern LLMs
A data‑centric look at why multiple large‑language models invent the same lunar‑city names—and what that convergence teaches us about their overlapping training sets.
Ask half a dozen state‑of‑the‑art language‑model APIs to coin a one‑word name for a hypothetical lunar capital and you’ll almost certainly receive Lunapolis, Lunaris, or a near‑identical variant. On the surface it’s a fun quirk of creativity. Look closer and it becomes a diagnostic probe of the overlapping text corpora behind today’s LLMs.
| Model (2025) | Provider | Answer | Latency | Tokens | Cost |
|---|---|---|---|---|---|
| Gemini 2.0 Flash | Google | Luna | 0.52 s | 19 | $0.000004 |
| Mistral Large (Latest) | Mistral | Lunaropolis | 0.54 s | 25 | $0.000111 |
| GPT‑4.1 | OpenAI | Lunaris | 0.93 s | 27 | $0.000117 |
| Claude 3.7 Sonnet (Feb 2025) | Anthropic | Lunopolis | 1.22 s | 30 | $0.000261 |
| deepseek‑chat | DeepSeek | Lunara | 4.33 s | 22 | $0.000013 |
| o4‑mini | OpenAI | Lunaris | 4.63 s | 19 | $0.000041 |
The Big Question
Why do models from different vendors—trained on trillions of tokens and fine‑tuned with separate alignment pipelines—converge on the same two coinages? The short answer: they share a surprisingly similar diet of text. Unpack that diet and you gain a window into how data overlap, frequency bias, and tokenization shape generative output.
Inside the Overlap
1 · Common‑Crawl‑Centric Pipelines
Nearly every major LLM pipeline begins with a de‑duplicated slice of Common Crawl. That 5‑petabyte web scrape contains countless sci‑fi snippets, NaNoWriMo drafts, and fan‑fiction forums where “Lunapolis” and “Lunaris” appear. Remove duplicate URLs all you like—the rare coinages still survive because they live on many distinct sites.
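To see why, here is a toy sketch of URL‑level de‑duplication leaving a cross‑site coinage intact; the pages, URLs, and counts are invented for illustration:

```python
from collections import Counter

# Hypothetical crawl records (url, text) -- invented for illustration.
pages = [
    ("https://scifi-forum.example/t/42", "Welcome to Lunapolis, capital of the Moon."),
    ("https://scifi-forum.example/t/42", "Welcome to Lunapolis, capital of the Moon."),  # re-crawl of the same URL
    ("https://fanfic.example/story/7",   "The senate of Lunapolis convened at dawn."),
    ("https://moonwiki.example/capital", "Most writers settle on Lunaris or Lunapolis."),
]

# URL-level de-duplication keeps exactly one record per URL...
deduped = dict(pages).values()

# ...but a coinage used on several distinct sites survives with its frequency edge.
counts = Counter(w.strip(".,") for text in deduped for w in text.split())
print(counts["Lunapolis"])  # 3 occurrences remain after dedup
```

Even content‑level de‑duplication keeps one copy per distinct page, so it is the term’s spread across many sites, not any single document, that preserves it.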
2 · Books3 & Public‑Domain Fiction
Datasets such as Books3 (a mirror of roughly 200,000 e‑books) and Project Gutenberg reprints sprinkle “Luna‑polis”‑style terms across pulp‑era novels. When BigScience or OpenAI filter for “quality English prose,” they scoop up the same vintage titles and thus the same lunar neologisms.
3 · Wikipedia & Fandom Wikis
Even Wikipedia’s List of Fictional Lunar Settlements collects and standardizes those names, ensuring they appear in every open‑license snapshot shipped to model trainers. Ditto for Fandom wiki pages about space‑opera games—another ubiquitous ingredient in many LLM corpora.
4 · Token Frequency & Sampling Energy
When you plot token frequencies across these corpora, Lunapolis and Lunaris might occur only a few hundred times each—but no other single‑word lunar capital appears more often. Even a modest frequency edge translates into a noticeably higher softmax probability at inference time, especially when the prompt narrows the search space.
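To make that concrete, here is a minimal softmax sketch; the candidate names and logit values are invented for illustration, not measurements from any real model:

```python
import math

def softmax(logits):
    """Convert raw next-token scores into a probability distribution."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores once the prompt has narrowed the field to a few
# "one-word lunar capital" candidates. The numbers are invented.
candidates = ["Lunapolis", "Lunaris", "Selenia", "Artemis City", "Moonhaven"]
logits = [2.3, 2.1, 1.2, 1.0, 0.8]  # only a modest edge for the two common coinages

for name, p in zip(candidates, softmax(logits)):
    print(f"{name:13s} {p:.2f}")
# Lunapolis and Lunaris together absorb roughly two-thirds of the probability
# mass, even though their logit lead over the alternatives is small.
```

This is the mechanism behind the convergence: a few hundred extra corpus occurrences need only nudge the logits slightly for the same two names to dominate sampling.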
Corpus Lessons in a Nutshell
- Data redundancy means that rare sci‑fi coinages can become statistically “safe” choices if they appear across multiple public sources.
- Cleaning ≠ originality. Duplicate trimming and profanity filters eliminate noise but rarely reduce myth‑adjacent neologisms.
- Prompt probes can reverse‑engineer corpora. Asking dozens of whimsical questions lets researchers infer whether niche phrases sit inside the training set (see the probe sketch after this list).
- Alignment layers amplify the overlap. RLHF raters reward answers that sound polished yet familiar, reinforcing the already skewed token distribution inherited from pre‑training.
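Here is a minimal sketch of such a probe. The `ask_model` stub is an assumption, not any particular vendor’s API; it simulates a frequency‑skewed model so the demo runs offline. Swap in a real API call to audit an actual model:

```python
import random
from collections import Counter

PROMPT = "Coin a one-word name for a hypothetical lunar capital. Reply with the name only."

def ask_model(prompt: str) -> str:
    """Placeholder for a real API call. For this offline demo it just
    simulates a model with a skewed prior over lunar coinages."""
    return random.choices(
        ["Lunapolis", "Lunaris", "Selenia", "Moonhaven"],
        weights=[40, 35, 15, 10],
    )[0]

def repetition_rate(n_trials: int = 50) -> Counter:
    """Ask the same whimsical question repeatedly and log each answer."""
    return Counter(ask_model(PROMPT).strip() for _ in range(n_trials))

if __name__ == "__main__":
    counts = repetition_rate()
    total = sum(counts.values())
    for name, k in counts.most_common():
        print(f"{name:12s} {k / total:.0%}")
    # A heavy skew toward one or two coinages suggests those strings sit
    # in the training corpus with a meaningful frequency edge.
```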
Practical Takeaways for Builders & Researchers
- Dataset diversity matters. Mixing extra‑domain corpora (technical papers, non‑fiction, creative commons poetry) reduces “Luna‑*” dominance.
- Control the sampler. Higher temperature or nucleus sampling (e.g., `top_p = 0.9`) can overcome mild frequency skews; a minimal sketch follows this list.
- Explicit negative cues work. “Give me a lunar capital name not beginning with Luna‑” steers the model away from its corpus priors.
- Corpus probes are lightweight audits. You don’t need direct dataset access—just craft systematic prompts and log the repetition rate.
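As a concrete illustration of the sampler‑control point, here is a minimal top‑p (nucleus) sampling sketch over an invented next‑token distribution; the candidate names and probabilities are assumptions for the demo:

```python
import random

def nucleus_sample(probs: dict[str, float], top_p: float = 0.9) -> str:
    """Sample from the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize. Lower top_p clamps harder onto the
    frequent coinages; higher top_p lets rarer names through."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cum = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cum += p
        if cum >= top_p:
            break
    tokens, weights = zip(*nucleus)
    return random.choices(tokens, weights=weights)[0]

# Hypothetical model distribution for "name a lunar capital" (invented numbers).
probs = {"Lunapolis": 0.38, "Lunaris": 0.31, "Selenia": 0.13,
         "Artemis City": 0.10, "Moonhaven": 0.08}

print(nucleus_sample(probs, top_p=0.9))  # nucleus includes the rarer names
print(nucleus_sample(probs, top_p=0.5))  # nucleus is just Lunapolis and Lunaris
```

Note the design trade‑off: at `top_p = 0.5` the nucleus contains only the two dominant coinages, so the skew is locked in; widening to `top_p = 0.9` admits the long tail without resorting to pure random sampling.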
Try it yourself: ask the same models to name a new Martian capital. Notice how many answers converge on “Areopolis” or “Ares City.” Same corpus overlap, different planet!
Bottom Line
“Lunapolis” isn’t just a catchy sci‑fi portmanteau; it’s a tracer dye illuminating how modern LLMs share—and are subtly steered by—the same vast but overlapping training corpora. Until we broaden and diversify those corpora (or learn to steer sampling more aggressively), the Moon’s capital will keep echoing the same two syllables.
Dataset Credits
- Common Crawl & Books3 metadata – Used here for frequency analysis references.
About the Author
Tamir is a contributor to the TryAII blog, focusing on AI technology, LLM comparisons, and best practices.
Related Articles
Understanding Token Usage Across Different LLMs
A quick guide into how different models process and charge for tokens, helping you optimize your AI costs.
Why Even Advanced LLMs Get '9.9 vs 9.11' Wrong
Exploring why large language models like GPT-4, Claude, Mistral, and Gemini still stumble on basic decimal comparisons.