Introduction
Large Language Models (LLMs) have made significant strides in natural language processing, but they are not without their flaws. One of the most discussed issues is “hallucination,” where the model generates responses that are factually incorrect or irrelevant to the prompt. In this blog, we will delve into the concept of hallucinations in LLMs, explore how to measure them, and discuss various metrics used in their evaluation.
What Are Hallucinations in Large Language Models?
Definition
In the context of LLMs, hallucinations refer to instances where the model generates outputs that do not align with reality or the given prompt. These can be broadly categorized into two types: inaccurate responses and irrelevant responses.
Types of Hallucinations
- Inaccurate Responses: These are factually incorrect outputs. For example, an LLM might state that “The capital of France is Berlin,” which is incorrect.
- Irrelevant Responses: These are outputs that, while possibly accurate in isolation, do not pertain to the given prompt. For instance, if asked about the weather, an LLM might respond with information about the stock market.
The Semantics of Hallucination
Inaccurate Responses
Inaccurate responses are straightforward hallucinations where the LLM provides information that is factually incorrect. This often happens due to the model’s training data limitations or misunderstanding of the prompt.
Irrelevant Responses
Irrelevant responses are trickier, as they might be factually correct but not relevant to the prompt. This occurs when the LLM fails to maintain context or misunderstands the prompt’s intent.
Semantic Similarity vs. Relevance
It’s essential to distinguish between semantic similarity and relevance. A response might be semantically similar to the prompt but still irrelevant. For example, the prompt “Tell me about Shakespeare’s plays” might receive a response about Shakespeare’s birthplace. While related, it does not address the specific query about his plays.
Example:
- Prompt: “Tell me about Shakespeare’s plays.”
- Response (Inaccurate): “Shakespeare wrote ‘War and Peace.’”
- Response (Irrelevant): “Shakespeare was born in Stratford-upon-Avon.”
- Response (Relevant): “Shakespeare’s plays include ‘Hamlet,’ ‘Macbeth,’ and ‘Othello.’”
Measuring Hallucinations in LLMs
Prompt-Response Relevance
To measure how relevant a response is to a given prompt, various heuristics can be used, such as the BLEU score and the BERT score, which evaluate how well the generated response aligns with the prompt’s intent.
Related page: BLEU Vs BERT: Choosing The Right Metric For Evaluating LLM Prompt Responses
Response Self-Similarity for the Same Prompt
Another approach to measuring hallucinations is to check the consistency of responses to the same prompt over multiple generations, for example by comparing sentence embeddings or using LLM self-similarity. A high degree of variability can indicate a tendency towards hallucinations.
Related page: Sentence Embeddings Vs. LLM Self-Similarity: Battle Of The Hallucination Detectors
Heuristics for Evaluating Hallucinations
BLEU Score
The BLEU (Bilingual Evaluation Understudy) score is commonly used to evaluate the quality of machine-generated text. It measures the n-gram overlap between the generated response and one or more reference texts, focusing on precision.
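As a quick illustration, here is a minimal sketch of computing a sentence-level BLEU score with NLTK; the reference and candidate strings are made-up examples, and smoothing is applied so that short sentences with no higher-order n-gram overlap don’t score zero.

```python
# pip install nltk  (example strings below are illustrative)
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Shakespeare's plays include 'Hamlet', 'Macbeth', and 'Othello'.".split()
candidate = "Shakespeare's plays include 'Hamlet' and 'Macbeth'.".split()

# Smoothing keeps the score from collapsing to zero when higher-order
# n-grams have no overlap, which is common for short sentences.
smooth = SmoothingFunction().method1
bleu = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {bleu:.3f}")
```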
BERT Score
The BERT score (often written BERTScore) uses contextual embeddings from pre-trained transformer models such as BERT (Bidirectional Encoder Representations from Transformers) to evaluate the semantic similarity between the generated response and the reference text, providing a more nuanced assessment than BLEU’s exact n-gram matching.
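A minimal sketch using the open-source bert-score package is shown below; the candidate and reference strings are illustrative, and the package’s default English model is assumed.

```python
# pip install bert-score  (example strings are illustrative)
from bert_score import score

candidates = ["Shakespeare's plays include 'Hamlet', 'Macbeth', and 'Othello'."]
references = ["Shakespeare wrote tragedies such as 'Hamlet', 'Macbeth', and 'Othello'."]

# Returns precision, recall, and F1 tensors with one entry per candidate
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.item():.3f}")
```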
Sentence Embeddings
Sentence embeddings involve encoding sentences into dense vector representations. These vectors can then be compared to measure the similarity between responses and prompts or among multiple responses to the same prompt.
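For illustration, here is a minimal sketch using the sentence-transformers library to compare a prompt and a response with cosine similarity; the all-MiniLM-L6-v2 model is just one commonly used choice, and the strings are taken from the earlier Shakespeare example.

```python
# pip install sentence-transformers  (model name is one common choice, not prescriptive)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

prompt = "Tell me about Shakespeare's plays."
response = "Shakespeare was born in Stratford-upon-Avon."

# Encode both texts into dense vectors and compare them with cosine similarity
embeddings = model.encode([prompt, response], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Prompt-response similarity: {similarity:.3f}")
```

A low similarity here would flag the response as potentially irrelevant to the prompt, even though the sentence itself is factually correct.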
LLM Self-Similarity
LLM self-similarity involves evaluating the consistency of the model’s responses to identical prompts. Techniques like cosine similarity between sentence embeddings can be used to measure this.
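Here is a minimal sketch of that idea, again assuming the sentence-transformers library; the responses are hypothetical samples generated from the same prompt, and the average pairwise cosine similarity serves as the self-similarity score.

```python
# pip install sentence-transformers  (responses below are hypothetical samples)
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Responses generated from the same prompt over several runs
responses = [
    "Shakespeare's plays include 'Hamlet', 'Macbeth', and 'Othello'.",
    "Among Shakespeare's best-known plays are 'Hamlet', 'Othello', and 'King Lear'.",
    "Shakespeare was born in 1564 in Stratford-upon-Avon.",
]

embeddings = model.encode(responses, convert_to_tensor=True)

# Mean pairwise cosine similarity; a low value flags inconsistent answers
sims = [
    util.cos_sim(embeddings[i], embeddings[j]).item()
    for i, j in combinations(range(len(responses)), 2)
]
print(f"Mean self-similarity: {sum(sims) / len(sims):.3f}")
```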
Conclusion
Hallucinations in LLMs pose a significant challenge to their reliability and usability. Understanding the types of hallucinations and how to measure them is crucial for improving these models. By using heuristics such as the BLEU score, the BERT score, sentence embeddings, and LLM self-similarity, we can better evaluate and mitigate the impact of hallucinations.