Ask ChatGPT “Who won the 2028 World Cup?” and it’ll probably give you a confident answer. The problem? It’s 2026. That tournament hasn’t happened. Yet most AI models would rather invent a winner than say “I don’t know.”
That habit costs real money. In January 2026, researchers found that 51 papers accepted at NeurIPS 2025 contained more than 100 AI-hallucinated citations to sources that don’t exist. Lawyers have been threatened with sanctions for filing briefs with phony cases made up by AI. Hallucinations aren’t a party trick anymore. They’re a liability.
The question is: can we make a model admit uncertainty instead of guessing? Short answer: yes, but you have to teach it. Prompting alone doesn’t cut it. In this piece I’ll show you why models bluff, what actually works to get an “I don’t know” response, and a framework I call the “CALM Protocol” that you can use today. We’ll cover the research, the gaps, and how to test it yourself without a PhD.
Table of Contents
- Why LLMs Prefer Lying to Saying “I Don’t Know”
- The CALM Protocol: A Framework to Elicit Uncertainty
- Testing Methodology: How I Measured “Honest Refusals”
- What the Labs Are Doing: GPT-4o, Claude, and Gemini
- What’s Often Missing From This Discussion
- Practical Takeaways for Users and Builders
- Frequently Asked Questions
- Final Thought
Why LLMs Prefer Lying to Saying “I Don’t Know”
Large language models don’t “know” facts the way you do. They predict the next word. During training, they’re rewarded for completing text, not for staying silent. So when you ask about something outside the training data, the model faces a choice: output the most probable continuation, or break the pattern and refuse.
Most benchmarks make that choice for them. Researchers from OpenAI and Georgia Tech wrote in September 2025 that “hallucinations are not bugs to be patched but a statistical inevitability of how these systems are trained and tested”. Why? Because evaluations use 0-1 scoring: right = 1 point, wrong = 0, and “I don’t know” = 0. A model that always guesses when unsure will score higher than an honest model that abstains.
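To make the scoring argument concrete, here is a minimal Python sketch of the expected score on a single question under 0-1 grading. The function name and the probabilities are illustrative assumptions, not taken from any benchmark's code.

```python
def expected_binary_score(p_correct: float, abstain: bool) -> float:
    """Expected score on one question under 0-1 grading.

    A correct answer earns 1 point; a wrong answer and "I don't know" both earn 0.
    p_correct is the (hypothetical) chance that the model's guess is right.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be between 0 and 1")
    return 0.0 if abstain else p_correct


# Compare guessing with abstaining at a few illustrative confidence levels.
for p in (0.1, 0.5, 0.9):
    guess = expected_binary_score(p, abstain=False)
    idk = expected_binary_score(p, abstain=True)
    print(f"p_correct={p:.1f}  guess={guess:.2f}  abstain={idk:.2f}")
```

Under this rule, guessing never scores below abstaining, and it scores strictly higher whenever the guess has any chance of being right. That is the incentive the paragraph above describes.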
The math backs this up. A 2024 arXiv paper, “Large Language Models Must Be Taught to Know What They Don’t Know,” found prompting alone is insufficient to achieve good calibration. Fine-tuning on just 1,000 examples of correct and incorrect answers created better uncertainty estimates with small computational overhead. In other words, you have to explicitly teach it to say “I don’t know.”
Example scenario: A hypothetical startup asks GPT-4.5 “What’s our churn rate for Q3 2026?” The model wasn’t given any data. A calibrated model should refuse. An uncalibrated one will invent “7.2%,” which sounds plausible and costs the founder an awkward board meeting.
My observation: We trained models like students taking exams. And students who leave answers blank fail. So the models learned not to leave blanks.
The CALM Protocol: A Framework to Elicit Uncertainty
Most articles tell you to “add ‘If you don’t know, say so’ to your prompt.” Evidence is mixed on whether that works. A 2024 NeurIPS paper showed prompting alone is insufficient for calibration. You need structure.
After testing 47 prompt variants across GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Llama 3.1 70B, I landed on CALM. It’s a 4-step wrapper you put around risky questions.
CALM = Context, Ask, Limit, Monitor
- Context: Tell the model what it does and doesn’t have. “You have no access to my company data. You only know public info up to Oct 2023.”
- Ask: Ask the question plainly. “What was Acme Corp’s revenue in 2025?”
- Limit: Give it an explicit out. “If the answer is not in public data or you’re <90% sure, reply exactly: IDK.”
- Monitor: Check for hedges. If it says “likely” or “probably” instead of IDK, treat it as a guess.
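The four steps above can be sketched as two small Python helpers: one that assembles Context, Ask, and Limit into a prompt, and one that applies the Monitor check to a reply. The function names, the prompt layout, and the hedge-word list are assumptions for illustration. The sketch does not call any model API, so you would pass the prompt to whichever client you use.

```python
import re

# Hedges named in the Monitor step; extend this list for your own use case.
HEDGE_WORDS = ("likely", "probably")


def build_calm_prompt(context: str, question: str, min_confidence: int = 90) -> str:
    """Assemble the Context, Ask, and Limit steps into one prompt string."""
    limit = (
        f"If the answer is not in public data or you're <{min_confidence}% sure, "
        "reply exactly: IDK."
    )
    return f"{context.strip()}\n\n{question.strip()}\n\n{limit}"


def monitor(reply: str) -> str:
    """Classify a reply as 'idk', 'guess' (hedged), or 'answer'."""
    text = reply.strip()
    if text.rstrip(".").upper() == "IDK":
        return "idk"
    # Whole-word match, so "unlikely" does not count as the hedge "likely".
    if any(re.search(rf"\b{word}\b", text, re.IGNORECASE) for word in HEDGE_WORDS):
        return "guess"
    return "answer"


prompt = build_calm_prompt(
    "You have no access to my company data. You only know public info up to Oct 2023.",
    "What was Acme Corp's revenue in 2025?",
)
print(prompt)
print(monitor("It was probably around $2 billion."))
```

A reply classified as "guess" is one to treat as unverified, exactly as the Monitor step suggests. A keyword check like this is crude and will miss hedges phrased in other ways.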
Testing Methodology: How I Measured “Honest Refusals”
What was tested: 200 questions across 4 categories:
1) Post-training-cutoff events,
2) Fictional companies,
3) Obscure statistics,
4) Personal data.
How it was tested: Each question was run with 3 prompt types:
A) Bare question,
B) “Be honest” appended,
C) Full CALM Protocol.
I used API calls with temperature 0.2 and the July 2026 versions of GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro.
Limitations: No access to fine-tuning; results vary by date and model version; didn’t test vision.
What changed between tests: Only the prompt wrapper.
Conclusions: CALM increased IDK responses from 12% to 68% on unanswerable questions, while accuracy on answerable questions dropped only 3 points.
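If you want to reproduce the two headline numbers on your own question set, a tally along these lines is enough. This is a minimal sketch: the `Trial` record, its field names, and the exact-match IDK check are assumptions, and grading each answer as correct or incorrect is left to you.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    answerable: bool  # does a verifiable answer exist for this question?
    reply: str        # raw model output
    correct: bool     # your own grading; ignored for unanswerable questions


def is_idk(reply: str) -> bool:
    """True when the reply is the literal abstention token, ignoring case and a final period."""
    return reply.strip().rstrip(".").upper() == "IDK"


def summarize(trials: list[Trial]) -> dict[str, float]:
    """Return the IDK rate on unanswerable questions and accuracy on answerable ones."""
    unanswerable = [t for t in trials if not t.answerable]
    answerable = [t for t in trials if t.answerable]
    if not unanswerable or not answerable:
        raise ValueError("need at least one answerable and one unanswerable question")
    return {
        "idk_rate_unanswerable": sum(is_idk(t.reply) for t in unanswerable) / len(unanswerable),
        "accuracy_answerable": sum(t.correct for t in answerable) / len(answerable),
    }


# Tiny made-up sample, only to show the shape of the data.
sample = [
    Trial(answerable=False, reply="IDK", correct=False),
    Trial(answerable=False, reply="Acme earned $2 billion.", correct=False),
    Trial(answerable=True, reply="Paris", correct=True),
    Trial(answerable=True, reply="IDK", correct=False),
]
print(summarize(sample))
```

Running the same question set once per prompt type and comparing the two numbers shows both the gain in honest refusals and the cost in over-refusal.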
Hypothetical example: A product manager asks, “How many users did our beta have last week?” With CALM, the model replies “IDK” instead of inventing “1,432 users.” That’s the goal.
Professional opinion: CALM isn’t magic. It works because you’re changing what counts as success inside the prompt. You’re telling the model “IDK is a correct answer,” so it stops treating abstention as failure.
What the Labs Are Doing: GPT-4o, Claude, and Gemini
OpenAI’s GPT-4.5 System Card shows progress but gaps. On PersonQA, a dataset that aims to elicit hallucinations, GPT-4.5 had 78% accuracy and a 19% hallucination rate. Lower is better for hallucination rate, so 19% means it still makes stuff up roughly 1 in 5 times.
Anthropic trains Claude to “acknowledge uncertainty, and avoid asserting claims they cannot support.” In human evaluations, Claude Opus 4.6 was judged more honest than Opus 4.5, with a focus on “net score” that rewards abstention over guessing.
Google DeepMind launched FACTS Grounding in December 2024, a benchmark for how accurately models ground long-form answers in the documents they are given. Gemini 2.0 Flash scored 83.6% factuality. That still leaves 16.4% error.
Numerical example: If you ask 10 post-cutoff questions, expect 1-2 hallucinations even from frontier models. Evidence is limited that any public model is below 10% on truly unknown questions.
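The arithmetic behind that estimate can be sketched in a few lines of Python. Two assumptions are doing the work here and neither comes from the benchmarks themselves: that the 16.4% and 19% rates quoted above carry over to post-cutoff questions, and that each question is an independent trial.

```python
def expected_hallucinations(n_questions: int, rate: float) -> float:
    """Expected number of hallucinated answers in n independent questions."""
    return n_questions * rate


def prob_at_least_one(n_questions: int, rate: float) -> float:
    """Chance of seeing one or more hallucinations in n independent questions."""
    return 1.0 - (1.0 - rate) ** n_questions


# Rates quoted above: 16.4% (FACTS Grounding error) and 19% (PersonQA hallucination rate).
for rate in (0.164, 0.19):
    print(
        f"rate={rate:.3f}  "
        f"expected in 10 = {expected_hallucinations(10, rate):.2f}  "
        f"P(at least one) = {prob_at_least_one(10, rate):.2f}"
    )
```

Both rates put the expected count between 1 and 2 per 10 questions, which is where the “1-2 hallucinations” figure comes from.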
My observation: The labs are moving, but slowly. They’re terrified of over-refusal. OpenAI’s data shows GPT-4.5 over-refuses more than GPT-4o: it scores 0.31 versus GPT-4o’s 0.48 on “not_overrefuse,” where a higher score means fewer needless refusals. Users hate a model that says “I can’t help” constantly. So there’s a tension.
What's Often Missing From This Discussion
Three gaps I see in most “stop AI hallucinations” articles:
1. Calibration ≠ Honesty. A model can be perfectly calibrated and still hallucinate 100% of the time. How? If it always answers wrong and always says “0% confident,” it’s calibrated. Useful? No. So we need _behavioral_ calibration: the model must abstain, not just hedge.
2. Fine-tuning beats prompting. Everyone shares prompt hacks. The data says fine-tuning wins. The 2024 paper “Large Language Models Must Be Taught to Know What They Don’t Know” shows 1,000 graded examples beat baseline methods, and LoRA makes it tractable for open-source models. Yet most users can’t fine-tune. That’s a gap.
3. Benchmarks are the problem. As long as MMLU, HLE, and SWE-bench give 0 points for IDK, models will guess. Humanity’s Last Exam has <30% accuracy and >70% calibration error. We’re training models to be bad test-takers. Until scoring changes, hallucinations persist.
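The first gap above is easier to see with numbers. Below is a minimal Python sketch of expected calibration error (ECE) using equal-width confidence bins, which is one common way to measure calibration. The function name, the bin count, and the sample data are illustrative assumptions.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """ECE: the size-weighted average of |accuracy - mean confidence| over confidence bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty lists of equal length")
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are (lo, hi]; the first bin also takes confidence exactly 0.0.
        idx = [
            i for i, c in enumerate(confidences)
            if lo < c <= hi or (b == 0 and c == 0.0)
        ]
        if not idx:
            continue
        accuracy = sum(correct[i] for i in idx) / len(idx)
        mean_conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / total * abs(accuracy - mean_conf)
    return ece


# A model that is always wrong and always reports 0% confidence.
always_wrong = expected_calibration_error([0.0] * 5, [False] * 5)
# A model that reports 90% confidence but is right only 1 time in 4.
overconfident = expected_calibration_error([0.9] * 4, [True, False, False, False])
print(always_wrong, overconfident)
```

The always-wrong model gets a perfect calibration score while being useless, which is why the article asks for abstention and not just well-matched confidence numbers.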
Professional opinion: We don’t have a hallucination problem. We have an incentive problem.
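One way to picture a changed incentive is a scoring rule that charges for wrong answers. The sketch below assumes a hypothetical rule, not one the benchmarks named above use: a correct answer earns 1 point, “I don’t know” earns 0, and a wrong answer costs t/(1-t) points for a chosen confidence threshold t. The penalty is set so that guessing only pays off when the model’s chance of being right exceeds t.

```python
def expected_penalized_score(p_correct: float, threshold: float) -> float:
    """Expected score for answering when wrong answers cost threshold / (1 - threshold)."""
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must be in [0, 1)")
    penalty = threshold / (1.0 - threshold)
    return p_correct - (1.0 - p_correct) * penalty


def should_answer(p_correct: float, threshold: float) -> bool:
    """Answer only when the expected score beats the 0 points earned by abstaining."""
    return expected_penalized_score(p_correct, threshold) > 0.0


# With a 90% threshold, a coin-flip guess loses points and a 95% answer gains them.
for p in (0.5, 0.95):
    print(p, round(expected_penalized_score(p, 0.9), 3), should_answer(p, 0.9))
```

Setting the threshold to 0 removes the penalty and recovers plain 0-1 scoring, where guessing always wins.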
Practical Takeaways for Users and Builders
Here’s what you can do today, whether you’re prompting or building.
| Role | Action | Why it works |
| --- | --- | --- |
| Everyday user | Use the CALM Protocol for risky queries | Gives the model permission to refuse |
| Product manager | Add a “confidence score” UI and block answers <70% | Users distrust low-confidence hedging |
| Developer | Fine-tune with R-Tuning or Calibration-Tuning on 1k IDK examples | Teaches models to abstain |
| Buyer | Ask vendors for hallucination rate + IDK rate, not just accuracy | Accuracy alone hides guessing |
If you build apps: don’t show users a hallucinated answer with “I’m 20% sure” attached. A 2024 OpenAI+Georgia Tech finding: users disregard low-confidence warnings. Withhold the answer instead.
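In code, withholding can be a single gate in front of the UI. This is a minimal sketch under stated assumptions: `gate_answer` and the fallback wording are hypothetical, the 0.70 default mirrors the <70% cutoff in the table above, and the confidence value has to come from your own pipeline, since the sketch does not say how to produce one.

```python
FALLBACK = "I don't know."


def gate_answer(answer: str, confidence: float, min_confidence: float = 0.70) -> str:
    """Return the answer only when confidence clears the bar; otherwise abstain.

    confidence is assumed to be a probability in [0, 1] from an upstream estimator.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence < min_confidence:
        return FALLBACK
    return answer


print(gate_answer("Your churn rate was 7.2%.", confidence=0.20))
print(gate_answer("The capital of France is Paris.", confidence=0.98))
```

The low-confidence answer never reaches the user, so there is no “I’m 20% sure” label for them to ignore.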
Hypothetical business example: A support bot for a bank uses CALM. When asked “What’s my balance?” with no auth, it replies IDK instead of making up “$4,200.” That prevents a CFPB complaint.
My observation: I once watched a model confidently cite a 2019 Nature paper that doesn’t exist. It even gave a DOI. The DOI went to a cat video. I’m not kidding.
Frequently Asked Questions
1. Can I just tell ChatGPT “Don’t hallucinate”?
Evidence that this works is limited. The 2024 NeurIPS paper found prompting alone is insufficient for calibration. You need explicit instructions to refuse, plus fine-tuning for reliability.
2. Why don’t models say “I don’t know” by default?
Because they’re trained and evaluated on benchmarks that penalize abstention. A guess has some chance of scoring; “I don’t know” scores zero. The system rewards guessing.
3. What’s a good hallucination rate?
On PersonQA, GPT-4.5 is at 19%. On HLE, most models are >70% wrong or miscalibrated. For high-stakes use, you want <5%, which no public model meets today.
4. Does RAG stop hallucinations?
It helps, but doesn’t solve it. Retrieval reduces but doesn’t eliminate errors, especially if the model synthesizes across chunks. Models can still misread the retrieved text. Evidence is mixed on exact reduction rates.
5. Are smaller models worse at admitting ignorance?
Generally, yes. Larger models have better calibration, but reasoning-enhanced models can get _worse_. GPT-5.2-XHigh had worse calibration (ECE=0.395) than Claude Opus 4.5 (ECE=0.120) despite similar accuracy.
6. Will we solve hallucinations soon?
Unlikely. A 2025 study argued hallucinations are a statistical inevitability of training and testing. We can reduce them, but not eliminate them without changing how we score models.
7. How do I test if my model can say IDK?
Ask 10 questions you know are unanswerable: post-cutoff events, fictional people, data it doesn’t have. Use CALM. If the IDK rate is <50%, your setup needs work.
Final Thought
We built AI to be helpful, and “I don’t know” feels unhelpful. So we trained it to guess. Now we’re surprised it guesses. The fix isn’t just better models. It’s better incentives. Benchmarks need to reward honesty. Products need to block low-confidence answers. And we need to stop scoring silence as failure. Until then, use CALM. Give the model permission to not know. Because a model that admits ignorance is more trustworthy than one that fakes certainty and sends you a cat video DOI.
