Understanding and Measuring Hallucinations in Large Language Models

Introduction

Large Language Models (LLMs) have made significant strides in natural language processing, but they are not without their flaws. One of the most discussed issues is “hallucination,” where the model generates responses that are factually incorrect or irrelevant to the prompt. In this blog, we will delve into the concept of hallucinations in LLMs, explore how to measure them, and discuss various metrics used in their evaluation.

What Are Hallucinations in Large Language Models?

Definition

In the context of LLMs, hallucinations refer to instances where the model generates outputs that do not align with reality or the given prompt. These can be broadly categorized into two types: inaccurate responses and irrelevant responses.

Types of Hallucinations
  • Inaccurate Responses: These are factually incorrect outputs. For example, an LLM might state that “The capital of France is Berlin,” which is incorrect.
  • Irrelevant Responses: These are outputs that, while possibly accurate in isolation, do not pertain to the given prompt. For instance, if asked about the weather, an LLM might respond with information about the stock market.

The Semantics of Hallucination

Inaccurate Responses

Inaccurate responses are straightforward hallucinations where the LLM provides information that is factually incorrect. This often happens due to the model’s training data limitations or misunderstanding of the prompt.

Irrelevant Responses

Irrelevant responses are trickier, as they might be factually correct but not relevant to the prompt. This occurs when the LLM fails to maintain context or misunderstands the prompt’s intent.

Semantic Similarity vs. Relevance

It’s essential to distinguish between semantic similarity and relevance. A response might be semantically similar to the prompt but still irrelevant. For example, the prompt “Tell me about Shakespeare’s plays” might receive a response about Shakespeare’s birthplace. While related, it does not address the specific query about his plays.

Example:

  • Prompt: “Tell me about Shakespeare’s plays.”
  • Response (Inaccurate): “Shakespeare wrote ‘War and Peace.’”
  • Response (Irrelevant): “Shakespeare was born in Stratford-upon-Avon.”
  • Response (Relevant): “Shakespeare’s plays include ‘Hamlet,’ ‘Macbeth,’ and ‘Othello.’”

Measuring Hallucinations in LLMs

Prompt-Response Relevance

To measure the relevance of a response to a given prompt, various heuristics can be used. These evaluate how well the generated response aligns with the prompt’s intent; common choices include the BLEU score and the BERT score.
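
As a rough illustration, the sketch below shows the general pattern: score each prompt/response pair with a relevance heuristic and flag low-scoring pairs for review. The word-overlap `relevance_score` function and the 0.2 threshold are placeholders invented for this example; in practice a metric such as the BLEU score or the BERT score (sketched later in this post) would take their place.

```python
# Minimal sketch of the general pattern: score each prompt/response pair with
# some relevance heuristic and flag low-scoring pairs for review.
# `relevance_score` is a toy placeholder, not a real metric.

def relevance_score(prompt: str, response: str) -> float:
    """Placeholder heuristic: fraction of prompt words echoed in the response."""
    prompt_words = set(prompt.lower().split())
    response_words = set(response.lower().split())
    if not prompt_words:
        return 0.0
    return len(prompt_words & response_words) / len(prompt_words)

def flag_possible_hallucination(prompt: str, response: str, threshold: float = 0.2) -> bool:
    """Flag responses whose relevance score falls below a chosen threshold."""
    return relevance_score(prompt, response) < threshold

print(flag_possible_hallucination(
    "Tell me about Shakespeare's plays.",
    "Shakespeare was born in Stratford-upon-Avon.",
))  # True: the response shares almost no vocabulary with the prompt
```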

Related page: BLEU Vs BERT: Choosing The Right Metric For Evaluating LLM Prompt Responses

Response Self-Similarity for the Same Prompt

Another approach to measuring hallucinations is to check the consistency of responses to the same prompt over multiple generations, for example by comparing sentence embeddings or using LLM self-similarity. A high degree of variability across responses can indicate a tendency towards hallucination.

Related page: Sentence Embeddings Vs. LLM Self-Similarity: Battle Of The Hallucination Detectors

Heuristics for Evaluating Hallucinations

BLEU Score

The BLEU (Bilingual Evaluation Understudy) score is commonly used for evaluating the quality of machine-generated text. It measures the n-gram overlap between the generated response and one or more reference texts, with an emphasis on precision.
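
Here is a minimal sketch of computing BLEU with NLTK, assuming the `nltk` package is installed; tokenization is naive whitespace splitting, and smoothing is enabled so short texts do not collapse to a zero score.

```python
# BLEU compares n-gram overlap between a candidate text and reference text(s).
# Assumes `pip install nltk`.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Shakespeare's plays include Hamlet, Macbeth, and Othello.".split()
candidate = "Shakespeare wrote plays such as Hamlet and Macbeth.".split()

# Smoothing avoids zero scores when higher-order n-grams have no matches.
smoothing = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smoothing)
print(f"BLEU: {score:.3f}")  # higher means more n-gram overlap with the reference
```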

BERT Score

The BERT (Bidirectional Encoder Representations from Transformers) score uses pre-trained transformer models to compare contextual embeddings of the generated response and the reference text, measuring semantic similarity and providing a more nuanced assessment than n-gram overlap metrics like BLEU.
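
A minimal sketch using the third-party `bert-score` package, assuming it is installed (it downloads a transformer model on first use); the example sentences are illustrative only.

```python
# BERT score compares contextual token embeddings of candidate and reference text.
# Assumes `pip install bert-score`.
from bert_score import score

candidates = ["Shakespeare was born in Stratford-upon-Avon."]
references = ["Shakespeare's plays include Hamlet, Macbeth, and Othello."]

# Returns precision, recall, and F1 tensors, one entry per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1[0].item():.3f}")
```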

Sentence Embeddings

Sentence embeddings involve encoding sentences into dense vector representations. These vectors can then be compared to measure the similarity between responses and prompts or among multiple responses to the same prompt.
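
A minimal sketch using the `sentence-transformers` package, assuming it is installed; the `all-MiniLM-L6-v2` model is one common choice rather than a requirement. It encodes a prompt and candidate responses into vectors and compares them with cosine similarity.

```python
# Encode prompt and responses into dense vectors, then compare with cosine similarity.
# Assumes `pip install sentence-transformers`.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

prompt = "Tell me about Shakespeare's plays."
responses = [
    "Shakespeare's plays include Hamlet, Macbeth, and Othello.",
    "Shakespeare was born in Stratford-upon-Avon.",
]

prompt_emb = model.encode(prompt, convert_to_tensor=True)
response_embs = model.encode(responses, convert_to_tensor=True)
similarities = util.cos_sim(prompt_emb, response_embs)[0]

for response, sim in zip(responses, similarities):
    print(f"{sim.item():.3f}  {response}")  # lower similarity suggests lower relevance
```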

LLM Self-Similarity

LLM self-similarity involves evaluating the consistency of the model’s responses to identical prompts. Techniques like cosine similarity between sentence embeddings can be used to measure this.
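
A minimal self-similarity sketch, again assuming `sentence-transformers` is available; the hard-coded responses stand in for multiple generations sampled from the same prompt. It averages the pairwise cosine similarities, so a low score signals inconsistent answers.

```python
# Embed several responses sampled for the *same* prompt and average their
# pairwise cosine similarities; a low average suggests inconsistent answers.
# Assumes `pip install sentence-transformers`.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sampled_responses = [
    "Shakespeare's plays include Hamlet, Macbeth, and Othello.",
    "Among Shakespeare's plays are Hamlet and Macbeth.",
    "Shakespeare wrote the novel War and Peace.",
]

embeddings = model.encode(sampled_responses, convert_to_tensor=True)
pairwise = [
    util.cos_sim(embeddings[i], embeddings[j]).item()
    for i, j in combinations(range(len(sampled_responses)), 2)
]
self_similarity = sum(pairwise) / len(pairwise)
print(f"Average self-similarity: {self_similarity:.3f}")
```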

Conclusion

Hallucinations in LLMs pose a significant challenge to their reliability and usability. Understanding the types of hallucinations and how to measure them is crucial for improving these models. By using heuristics like the BLEU score, the BERT score, sentence embeddings, and self-similarity, we can better evaluate and mitigate the impact of hallucinations.

