
Hallucinations in LLMs

November 10, 2025


Why they happen, how to detect them, and what you can do

As large language models (LLMs) like ChatGPT, Claude, Gemini and open-source alternatives become integral to modern software development workflows – from coding assistance to automated documentation and testing – there’s a growing challenge that continues to puzzle even experienced practitioners: hallucinations.

For many organisations, especially those with in-house development teams, this is not just an AI curiosity but a practical risk. LLM hallucinations can lead to flawed technical outputs, incorrect business insights and wasted development effort if they go unchecked. Understanding and mitigating them is essential to delivering reliable AI-powered solutions that meet business goals.

Hallucinations in LLMs refer to confidently generated, but false or misleading, content. These aren’t AI dreams or bugs in the system. And if you’ve worked with LLMs for even a short while, you’ve probably seen it first-hand – be it a made-up API endpoint, non-existent RFC, or incorrect step in a testing workflow.

 

What exactly is a hallucination?

A hallucination is when an LLM generates content that’s syntactically correct but semantically false or unverifiable. These can include:

  • Inventing function names, parameters or return types in code
  • Generating fictitious quotes or research papers
  • Making up test cases, tools or datasets
  • Misrepresenting laws, standards or security guidelines

What makes hallucinations tricky is that they often sound very plausible – and that’s exactly what makes them dangerous.

 

Types of hallucinations in LLMs

Not all hallucinations are the same. Understanding the different types helps software engineers pinpoint the problem and mitigate it more effectively. As a custom software development company, our teams take it one step further and design guardrails into any AI solutions from the outset. These guardrails can include stricter validation in testing frameworks, embedded fact-checking tools in documentation workflows, or context-aware prompts for internal chatbots.

Type | Description | Example
Factual | Stating incorrect or fabricated facts | “The capital of Australia is Sydney” (it’s Canberra)
Intrinsic | Contradicts the input or prompt itself | Asked for fruits, but the response includes “apple, banana, car”
Extrinsic | Adds information not present in the input | A summary includes events never mentioned in the source document
Contextual | Misinterprets context due to ambiguity | Misanswers “Who won the last election?” without knowing the country
Logical / Reasoning | Errors in multi-step reasoning or conclusions | Incorrectly solves a math problem or gives flawed cause-effect logic
Semantic | Grammatically correct but semantically nonsensical | “A square has three sides and can roll”
Temporal | Uses outdated or inaccurate time-based information | “The latest iPhone model in 2025 is the iPhone 12”
Attribution | Cites non-existent sources, authors or studies | Refers to a “2021 MIT study” that doesn’t exist
Common sense | Violates basic real-world logic | “Water is dry and used to fuel cars”

These hallucination types may overlap or occur simultaneously. Recognising them improves your ability to assess and correct LLM outputs before a broader audience comes to rely on them.

 

Why do LLMs hallucinate?

The root causes are more fundamental than just “bad data.” Here’s why they happen:

 

1. Predictive nature of language models

LLMs are trained to predict the next word (token) based on previous ones. They aren’t grounded in truth; they’re grounded in probability. If the most probable next token leads to a falsehood, the model will still generate it confidently.
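As a toy illustration of this, the (invented) probabilities below make the wrong continuation the most likely one, and greedy decoding will pick it every time:

```python
# Toy illustration only: the probabilities are invented for the example.
# Greedy decoding picks the most probable next token, whether or not it is true.
next_token_probs = {
    "Sydney": 0.55,     # common in training text, but wrong here
    "Canberra": 0.30,   # the correct answer
    "Melbourne": 0.15,
}

prompt = "The capital of Australia is"
chosen = max(next_token_probs, key=next_token_probs.get)
print(f"{prompt} {chosen}")  # -> "The capital of Australia is Sydney"
```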

 

2. Training data gaps

LLMs are trained on snapshots of the internet, codebases, documentation, and more. But:

  • Some topics are underrepresented
  • Some information is outdated or incorrect
  • They may not have been trained on your internal codebase or proprietary domain, which leads to guesses or fabrications

 

3. Lack of retrieval mechanism

Basic LLMs can’t “look up” real-time or external sources unless paired with tools like RAG (Retrieval-Augmented Generation). Without this, they rely solely on internal memory, which leads to confident fiction.
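As a rough sketch of what the retrieval step adds, the snippet below pulls the most relevant internal documents and instructs the model to answer only from them. The `embed` and `ask_llm` helpers are hypothetical placeholders for whichever embedding model and LLM client you actually use:

```python
import numpy as np

# Sketch only: embed() and ask_llm() are hypothetical placeholders for your
# embedding model and LLM client, not any specific library's API.
def retrieve(question, documents, embed, top_k=3):
    """Return the top_k documents most similar to the question (cosine similarity)."""
    q = embed(question)
    scored = []
    for doc in documents:
        d = embed(doc)
        similarity = float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
        scored.append((similarity, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def answer_with_context(question, documents, embed, ask_llm):
    # Ground the prompt in retrieved context and give the model a way out.
    context = "\n".join(retrieve(question, documents, embed))
    prompt = (
        "Answer using ONLY the context below. "
        "If the context does not contain the answer, say 'I don't know.'\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)
```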

 

4. Prompt ambiguity or overreach

Sometimes, hallucinations stem from how we ask. Broad, vague, or misleading prompts lead to outputs where the model feels compelled to “fill in the blanks”.

Critically, this underscores why most internal AI implementations cannot simply be treated as plug-and-play. At BBD, we pair LLM capabilities with RAG, internal codebase integration, and careful prompt design so outputs are grounded in verified, domain-specific knowledge.

 

Prompt engineering to reduce hallucinations

While not a silver bullet, the right prompt can dramatically reduce hallucination frequency. Enter prompt engineering. Prompt engineering is the practice of crafting and refining instructions to guide AI models in generating specific and high-quality outputs.

 

Prompt techniques that help:

  • Constrain the response
    Example: “Only respond with information you are 100% confident about.”
  • Specify format and source
    Example: “Cite the documentation and provide the exact URL. If unsure, say ‘I don’t know.’”
  • Use system-level instructions
    Example: “You are a cautious assistant. Never invent facts or APIs. Always verify.”
  • Break down complex asks
    Instead of “Generate a test plan,” say:
    “List 5 key features of X. Then for each, suggest 1 possible test scenario.”
  • Chain-of-thought prompting
    Ask the model to explain its reasoning step-by-step. You can often catch hallucinations in the explanation before trusting the final output.

Another technique is to incorporate prompt-level safeguards during development so that business-critical workflows are protected from the start, rather than patched later. This also puts less responsibility on the end user while ensuring better quality.

 

Detecting hallucinations automatically

Detecting hallucinations is hard, even for humans. But here are current approaches used to flag or prevent them at scale:

 

1. Reference-based evaluation

Compare generated content against ground truth sources (e.g., documentation, test plans, codebases).

Tools: BERTScore, BLEU, ROUGE, TruthfulQA
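As a small sketch of this idea, the check below flags generated documentation that overlaps poorly with its reference source, assuming the rouge-score package is installed. A low score is only a weak signal, so in practice you’d combine it with richer metrics such as BERTScore and human review:

```python
# Sketch: flag generated text that shares little content with its reference.
# Assumes the rouge-score package is installed (pip install rouge-score).
from rouge_score import rouge_scorer

def looks_ungrounded(generated: str, reference: str, threshold: float = 0.3) -> bool:
    """Return True if the generated text overlaps poorly with the reference.
    The 0.3 threshold is an arbitrary starting point, not a recommendation."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    overlap = scorer.score(reference, generated)["rougeL"].fmeasure
    return overlap < threshold

reference_doc = "POST /users creates a user and returns 201 with the new user's id."
generated_doc = "The /users endpoint accepts DELETE and returns a session token."
print(looks_ungrounded(generated_doc, reference_doc))  # low overlap -> worth reviewing
```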

 

2. Self-consistency checks

Ask the LLM the same question multiple times (with slight variations) and compare the outputs.

Inconsistencies often indicate uncertainty or hallucination.
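A minimal sketch of this check, with `ask_llm` as a hypothetical stand-in for your model client (and assuming it accepts a temperature setting):

```python
# Sketch: ask the same question several times and measure how often the
# answers agree. ask_llm() is a hypothetical stand-in for your model client.
from collections import Counter

def self_consistency(question: str, ask_llm, n: int = 5) -> tuple[str, float]:
    """Return the most common answer and the fraction of runs that agreed with it."""
    answers = [ask_llm(question, temperature=0.7).strip().lower() for _ in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n

# An agreement score well below 1.0 suggests uncertainty; treat the answer with care.
```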

 

3. Tool-augmented validation

Use RAG or plugins/tools to verify facts. Pair LLMs with code search, test case repositories, or live documentation.

 

4. External validators

Integrate LLM output validators in CI/CD pipelines:

  • Test if generated API docs match actual code
  • Use linters for test code generated by AI
  • Apply approval testing for generated content
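As one illustration of the first point, the pytest-style check below fails the build if AI-generated API docs mention functions that don’t exist in the real module. The module name and docs path are invented for the example:

```python
# Sketch of a CI validator: the module name and docs path are illustrative.
import inspect
import re

import my_service.api as api_module  # hypothetical module whose docs were generated

def test_docs_reference_real_functions():
    generated_docs = open("docs/generated_api.md").read()  # hypothetical path
    real_functions = {name for name, _ in inspect.getmembers(api_module, inspect.isfunction)}
    documented = set(re.findall(r"`(\w+)\(\)`", generated_docs))  # matches e.g. `create_user()`
    invented = documented - real_functions
    assert not invented, f"Generated docs mention functions that don't exist: {sorted(invented)}"
```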

For our development teams working on client projects, these detection techniques form part of our delivery pipelines. For example, when integrating AI into QA, our teams configure automated validators in CI/CD to flag discrepancies before they reach production, helping clients maintain software quality without slowing release cycles.

 

Real-world impact: When hallucinations hit QA and documentation

Hallucinations aren’t just academic – they can break things in production or spread misinformation inside teams.

Example: In one sprint review, a developer used an LLM to auto-generate test cases. A fabricated test claimed the feature should reject passwords under 8 characters, when the actual requirement was 6. The bug wasn’t in the code – it was in the hallucination.

 

In QA & testing:

  • Bogus test steps: LLMs might invent test data or click paths that don’t exist
  • Unsupported assertions: “Assert that login should take <500ms” – but no such requirement exists
  • Tool misguidance: Recommending tools or methods that don’t align with your stack

 

In documentation:

  • Inaccurate API details
  • Incorrect usage patterns
  • Fabricated citations

If these go unchecked, they lead to developer confusion, poor automation or, worse, defects in shipped software.

The “so what” is clear: hallucinations can introduce costly rework, delay releases, and even erode user trust if misinformation makes it into public-facing content. By building detection and validation into AI-powered workflows, we help ensure that the efficiency gains of LLMs don’t come at the expense of accuracy or compliance.

 

What you can do: A quick checklist

Area | Action
Prompting | Be specific, add disclaimers, use structured prompts
Evaluation | Use human-in-the-loop validation for critical content
Grounding | Add RAG (Retrieval-Augmented Generation) pipelines or embed your internal documentation to help the LLM ground responses in verified knowledge
Testing | Use approval tests or assert source mappings in CI
Training | Educate your team: LLMs are copilots, not oracles

Hallucinations aren’t a flaw in AI. They’re a reminder that human oversight, context and validation still matter. The future of AI-assisted development belongs to teams who know when to trust the model and when to test it.

As AI becomes embedded in how we build and maintain software, the line between speed and accuracy will define success. That’s why at BBD we focus on solutions that are as reliable as they are intelligent – helping teams scale with confidence and clarity.
