
Hallucinations in LLMs

November 10, 2025


Why they happen, how to detect them, and what you can do

As large language models (LLMs) like ChatGPT, Claude, Gemini and open-source alternatives become integral to modern software development workflows – from coding assistance to automated documentation and testing – there’s a growing challenge that continues to puzzle even experienced practitioners: hallucinations.

For many organisations, especially those with in-house development teams, this is not just an AI curiosity but a practical risk. LLM hallucinations can lead to flawed technical outputs, incorrect business insights and wasted development effort if they go unchecked. Understanding and mitigating them is essential to delivering reliable AI-powered solutions that meet business goals.

Hallucinations in LLMs refer to confidently generated, but false or misleading, content. These aren’t AI dreams or bugs in the system. And if you’ve worked with LLMs for even a short while, you’ve probably seen it first-hand – be it a made-up API endpoint, non-existent RFC, or incorrect step in a testing workflow.

 

What exactly is a hallucination?

A hallucination is when an LLM generates content that’s syntactically correct but semantically false or unverifiable. These can include:

  • Inventing function names, parameters or return types in code
  • Generating fictitious quotes or research papers
  • Making up test cases, tools or datasets
  • Misrepresenting laws, standards or security guidelines

What makes hallucinations tricky is that they often sound very plausible – and that’s exactly what makes them dangerous.

 

Types of hallucinations in LLMs

Not all hallucinations are the same. Understanding the different types helps software engineers pinpoint the problem and mitigate it more effectively. As a custom software development company, our teams take it one step further and design guardrails into any AI solutions from the outset. These guardrails can include stricter validation in testing frameworks, embedded fact-checking tools in documentation workflows, or context-aware prompts for internal chatbots.

Type | Description | Example
Factual | Stating incorrect or fabricated facts | “The capital of Australia is Sydney” (it’s Canberra)
Intrinsic | Contradicts the input or prompt itself | Asked for fruits, but the response includes “apple, banana, car”
Extrinsic | Adds information not present in the input | A summary includes events never mentioned in the source document
Contextual | Misinterprets context due to ambiguity | Misanswers “Who won the last election?” without knowing the country
Logical / Reasoning | Errors in multi-step reasoning or conclusions | Incorrectly solves a math problem or gives flawed cause-effect logic
Semantic | Grammatically correct but semantically nonsensical | “A square has three sides and can roll”
Temporal | Uses outdated or inaccurate time-based information | “The latest iPhone model in 2025 is the iPhone 12”
Attribution | Cites non-existent sources, authors or studies | Refers to a “2021 MIT study” that doesn’t exist
Common sense | Violates basic real-world logic | “Water is dry and used to fuel cars”

These hallucination types may overlap or occur simultaneously. Recognising them improves your ability to assess and correct LLM outputs before a broader audience comes to rely on them.

 

Why do LLMs hallucinate?

The root causes are more fundamental than just “bad data.” Here’s why they happen:

 

1. Predictive nature of language models

LLMs are trained to predict the next word (token) based on previous ones. They aren’t grounded in truth; they’re grounded in probability. If the most probable next token leads to a falsehood, the model will still generate it confidently.
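As a toy illustration of this, the (invented) probabilities below make the wrong continuation the most likely one, and greedy decoding will pick it every time:

```python
# Toy illustration only: the probabilities are invented for the example.
# Greedy decoding picks the most probable next token, whether or not it is true.
next_token_probs = {
    "Sydney": 0.55,     # common in training text, but wrong here
    "Canberra": 0.30,   # the correct answer
    "Melbourne": 0.15,
}

prompt = "The capital of Australia is"
chosen = max(next_token_probs, key=next_token_probs.get)
print(f"{prompt} {chosen}")  # -> "The capital of Australia is Sydney"
```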

 

2. Training data gaps

LLMs are trained on snapshots of the internet, codebases, documentation, and more. But:

  • Some topics are underrepresented
  • Some information is outdated or incorrect
  • They may not have been trained on your internal codebase or proprietary domain, which leads to guesses or fabrications

 

3. Lack of retrieval mechanism

Basic LLMs can’t “look up” real-time or external sources unless paired with tools like RAG (Retrieval-Augmented Generation). Without this, they rely solely on internal memory, which leads to confident fiction.
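As a rough sketch of what the retrieval step adds, the snippet below pulls the most relevant internal documents and instructs the model to answer only from them. The `embed` and `ask_llm` helpers are hypothetical placeholders for whichever embedding model and LLM client you actually use:

```python
import numpy as np

# Sketch only: embed() and ask_llm() are hypothetical placeholders for your
# embedding model and LLM client, not any specific library's API.
def retrieve(question, documents, embed, top_k=3):
    """Return the top_k documents most similar to the question (cosine similarity)."""
    q = embed(question)
    scored = []
    for doc in documents:
        d = embed(doc)
        similarity = float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
        scored.append((similarity, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def answer_with_context(question, documents, embed, ask_llm):
    # Ground the prompt in retrieved context and give the model a way out.
    context = "\n".join(retrieve(question, documents, embed))
    prompt = (
        "Answer using ONLY the context below. "
        "If the context does not contain the answer, say 'I don't know.'\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)
```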

 

4. Prompt ambiguity or overreach

Sometimes, hallucinations stem from how we ask. Broad, vague, or misleading prompts lead to outputs where the model feels compelled to “fill in the blanks”.

Critically, this underscores why most internal AI implementations cannot simply be treated as plug-and-play. At BBD, we pair LLM capabilities with RAG, internal codebase integration, and careful prompt design so outputs are grounded in verified, domain-specific knowledge.

 

Prompt engineering to reduce hallucinations

While not a silver bullet, the right prompt can dramatically reduce hallucination frequency. Enter prompt engineering. Prompt engineering is the practice of crafting and refining instructions to guide AI models in generating specific and high-quality outputs.

 

Prompt techniques that help:

  • Constrain the response
    Example: “Only respond with information you are 100% confident about.”
  • Specify format and source
    Example: “Cite the documentation and provide the exact URL. If unsure, say ‘I don’t know.’”
  • Use system-level instructions
    Example: “You are a cautious assistant. Never invent facts or APIs. Always verify.”
  • Break down complex asks
    Instead of “Generate a test plan,” say:
    “List 5 key features of X. Then for each, suggest 1 possible test scenario.”
  • Chain-of-thought prompting
    Ask the model to explain its reasoning step-by-step. You can often catch hallucinations in the explanation before trusting the final output.

Another technique is to incorporate prompt-level safeguards during development so that business-critical workflows are protected from the start, rather than patched later. This also puts less responsibility on the end user while ensuring better quality.

 

Detecting hallucinations automatically

Detecting hallucinations is hard, even for humans. But here are current approaches used to flag or prevent them at scale:

 

1. Reference-based evaluation

Compare generated content against ground truth sources (e.g., documentation, test plans, codebases).

Tools: BERTScore, BLEU, ROUGE, TruthfulQA
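As a small sketch of this idea, the check below flags generated documentation that overlaps poorly with its reference source, assuming the rouge-score package is installed. A low score is only a weak signal, so in practice you’d combine it with richer metrics such as BERTScore and human review:

```python
# Sketch: flag generated text that shares little content with its reference.
# Assumes the rouge-score package is installed (pip install rouge-score).
from rouge_score import rouge_scorer

def looks_ungrounded(generated: str, reference: str, threshold: float = 0.3) -> bool:
    """Return True if the generated text overlaps poorly with the reference.
    The 0.3 threshold is an arbitrary starting point, not a recommendation."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    overlap = scorer.score(reference, generated)["rougeL"].fmeasure
    return overlap < threshold

reference_doc = "POST /users creates a user and returns 201 with the new user's id."
generated_doc = "The /users endpoint accepts DELETE and returns a session token."
print(looks_ungrounded(generated_doc, reference_doc))  # low overlap -> worth reviewing
```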

 

2. Self-consistency checks

Ask the LLM the same question multiple times (with slight variations) and compare the outputs.

Inconsistencies often indicate uncertainty or hallucination.
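A minimal sketch of this check, with `ask_llm` as a hypothetical stand-in for your model client (and assuming it accepts a temperature setting):

```python
# Sketch: ask the same question several times and measure how often the
# answers agree. ask_llm() is a hypothetical stand-in for your model client.
from collections import Counter

def self_consistency(question: str, ask_llm, n: int = 5) -> tuple[str, float]:
    """Return the most common answer and the fraction of runs that agreed with it."""
    answers = [ask_llm(question, temperature=0.7).strip().lower() for _ in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n

# An agreement score well below 1.0 suggests uncertainty; treat the answer with care.
```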

 

3. Tool-augmented validation

Use RAG or plugins/tools to verify facts. Pair LLMs with code search, test case repositories, or live documentation.

 

4. External validators

Integrate LLM output validators in CI/CD pipelines:

  • Test if generated API docs match actual code
  • Use linters for test code generated by AI
  • Apply approval testing for generated content
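As one illustration of the first point, the pytest-style check below fails the build if AI-generated API docs mention functions that don’t exist in the real module. The module name and docs path are invented for the example:

```python
# Sketch of a CI validator: the module name and docs path are illustrative.
import inspect
import re

import my_service.api as api_module  # hypothetical module whose docs were generated

def test_docs_reference_real_functions():
    generated_docs = open("docs/generated_api.md").read()  # hypothetical path
    real_functions = {name for name, _ in inspect.getmembers(api_module, inspect.isfunction)}
    documented = set(re.findall(r"`(\w+)\(\)`", generated_docs))  # matches e.g. `create_user()`
    invented = documented - real_functions
    assert not invented, f"Generated docs mention functions that don't exist: {sorted(invented)}"
```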

For our development teams working on client projects, these detection techniques form part of our delivery pipelines. For example, when integrating AI into QA, our teams configure automated validators in CI/CD to flag discrepancies before they reach production, helping clients maintain software quality without slowing release cycles.

 

Real-world impact: When hallucinations hit QA and documentation

Hallucinations aren’t just academic – they can break things in production or spread misinformation inside teams.

Example: In one sprint review, a developer used an LLM to auto-generate test cases. A fabricated test claimed the feature should reject passwords under 8 characters, when the actual requirement was 6. The bug wasn’t in the code – it was in the hallucination.

 

In QA & testing:

  • Bogus test steps: LLMs might invent test data or click paths that don’t exist
  • Unsupported assertions: “Assert that login should take <500ms” – but no such requirement exists
  • Tool misguidance: Recommending tools or methods that don’t align with your stack

 

In documentation:

  • Inaccurate API details
  • Incorrect usage patterns
  • Fabricated citations

If these go unchecked, they lead to developer confusion, poor automation or, worse, defects in shipped software.

The “so what” is clear: hallucinations can introduce costly rework, delay releases, and even erode user trust if misinformation makes it into public-facing content. By building detection and validation into AI-powered workflows, we help ensure that the efficiency gains of LLMs don’t come at the expense of accuracy or compliance.

 

What you can do: A quick checklist

Area | Action
Prompting | Be specific, add disclaimers, use structured prompts
Evaluation | Use human-in-the-loop validation for critical content
Grounding | Add RAG (Retrieval-Augmented Generation) pipelines or embed your internal documentation to help the LLM ground responses in verified knowledge
Testing | Use approval tests or assert source mappings in CI
Training | Educate your team: LLMs are copilots, not oracles

Hallucinations aren’t a flaw in AI. They’re a reminder that human oversight, context and validation still matter. The future of AI-assisted development belongs to teams who know when to trust the model and when to test it.

As AI becomes embedded in how we build and maintain software, the line between speed and accuracy will define success. That’s why at BBD we focus on solutions that are as reliable as they are intelligent – helping teams scale with confidence and clarity.
