Explanation: BLEU (Bilingual Evaluation Understudy) score measures how closely a machine-generated text matches one or more reference texts. It’s commonly used in machine translation but can also apply to evaluating responses from language models.
Imagine two people answering the same question: one gives a factual reference answer, while the other uses a language model. To judge how accurate the model's answer is, BLEU compares the words and phrases in both answers; the more of them the two responses share, the higher the score, indicating a better match.
Think of it like this: the higher the BLEU score, the closer the language model's response is to the reference answer. It's a way to measure how well a machine-generated response matches a human-written one.
Purpose: To quantify how much of the content in the generated response overlaps with the reference text, focusing on n-grams (sequences of words).
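An n-gram is simply a run of n consecutive words. As a minimal illustration (assuming NLTK is installed), the bigrams of a short sentence can be listed with NLTK's ngrams helper:
from nltk.util import ngrams
tokens = "The capital of France is Paris.".split()
# Bigrams (2-grams): consecutive pairs of tokens
print(list(ngrams(tokens, 2)))
# [('The', 'capital'), ('capital', 'of'), ('of', 'France'), ('France', 'is'), ('is', 'Paris.')]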
Example:
- Prompt: “What is the capital of France?”
- Reference Response: “The capital of France is Paris.”
- Generated Response 1 (Success): “The capital of France is Paris.” (High BLEU Score because it matches the reference exactly)
- Generated Response 2 (Failure): “The capital of France is Berlin.” (Low BLEU Score because it has no overlap with the reference text)
Implementation:
from nltk.translate.bleu_score import sentence_bleu
# Reference responses, one list of references per prompt
references = [
    ["Shakespeare's plays include 'Hamlet,' 'Macbeth,' and 'Othello.'"],
    ["The capital of France is Paris."],
    ["A computer processes data using a central processing unit (CPU) and memory."]
]
# Example prompts and three candidate responses per prompt
prompts = [
    "Tell me about Shakespeare's plays.",
    "What is the capital of France?",
    "How does a computer work?"
]
responses = [
    ["Shakespeare wrote 'War and Peace.'", "Shakespeare was born in Stratford-upon-Avon.", "Shakespeare's plays include 'Hamlet,' 'Macbeth,' and 'Othello.'"],
    ["The capital of France is Berlin.", "France is a country in Europe.", "The capital of France is Paris."],
    ["A computer is a type of bird.", "The history of computers dates back to the 19th century.", "A computer processes data using a central processing unit (CPU) and memory."]
]
# Calculate a BLEU score for each candidate response against its prompt's reference
for i, prompt in enumerate(prompts):
    print(f"Prompt: {prompt}")
    for j, response in enumerate(responses[i]):
        score = sentence_bleu(references[i], response)
        print(f"  Response {j + 1}: {response}")
        print(f"  Reference: {references[i]}")
        print(f"  BLEU Score: {score:.4f}")
Prerequisites:
- Install NLTK: pip install nltk
- Import the sentence_bleu function from nltk.translate.bleu_score.
✨ Another library you can use to calculate BLEU scores is Hugging Face Evaluate, a library for easily evaluating machine learning models and datasets. It gives you access to dozens of evaluation methods for different domains (NLP, Computer Vision, Reinforcement Learning, and more!) ✨
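As a rough sketch of what that looks like (assuming the evaluate package is installed via pip install evaluate), the BLEU metric can be loaded and computed like this:
import evaluate
bleu = evaluate.load("bleu")
# predictions: list of generated strings; references: one list of reference strings per prediction
results = bleu.compute(
    predictions=["The capital of France is Paris."],
    references=[["The capital of France is Paris."]],
)
print(results["bleu"])  # 1.0 for an exact match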
Expected Output: A numerical score (BLEU score) for each response, indicating the level of precision in capturing the reference text’s content. Higher scores indicate better performance.
Prompt: Tell me about Shakespeare's plays.
Response 1: Shakespeare wrote 'War and Peace.'
Reference: ["Shakespeare's plays include 'Hamlet,' 'Macbeth,' and 'Othello.'"]
BLEU Score: 0.2118
Response 2: Shakespeare was born in Stratford-upon-Avon.
Reference: ["Shakespeare's plays include 'Hamlet,' 'Macbeth,' and 'Othello.'"]
BLEU Score: 0.2052
Response 3: Shakespeare's plays include 'Hamlet,' 'Macbeth,' and 'Othello.'
Reference: ["Shakespeare's plays include 'Hamlet,' 'Macbeth,' and 'Othello.'"]
BLEU Score: 1.0000
Prompt: What is the capital of France?
Response 1: The capital of France is Berlin.
Reference: ['The capital of France is Paris.']
BLEU Score: 0.7923
Response 2: France is a country in Europe.
Reference: ['The capital of France is Paris.']
BLEU Score: 0.3578
Response 3: The capital of France is Paris.
Reference: ['The capital of France is Paris.']
BLEU Score: 1.0000
Prompt: How does a computer work?
Response 1: A computer is a type of bird.
Reference: ['A computer processes data using a central processing unit (CPU) and memory.']
BLEU Score: 0.0964
Response 2: The history of computers dates back to the 19th century.
Reference: ['A computer processes data using a central processing unit (CPU) and memory.']
BLEU Score: 0.2503
Response 3: A computer processes data using a central processing unit (CPU) and memory.
Reference: ['A computer processes data using a central processing unit (CPU) and memory.']
BLEU Score: 1.0000
The Advantages and Disadvantages of Using BLEU Score for Evaluating Prompt Response Relevance in Large Language Models
When it comes to evaluating the relevance of responses generated by Large Language Models (LLMs), the BLEU (Bilingual Evaluation Understudy) score is a commonly used metric. Originally designed for assessing the quality of machine translations, BLEU has been adapted for various natural language processing tasks, including evaluating LLM outputs. However, like any metric, it has its strengths and weaknesses.
Advantages of BLEU Score
- Quantitative Measure: BLEU provides a numerical score between 0 and 1, making it easy to compare the performance of different models or configurations. A higher score indicates better performance, which simplifies the evaluation process.
- Language-Independent: The BLEU score can be applied to responses in various languages, making it versatile for multilingual LLM evaluation.
- Correlation with Human Judgment: BLEU scores have been shown to correlate reasonably well with human judgments of translation quality, particularly in the context of machine translation. This makes it a useful proxy for more labor-intensive human evaluations.
- Efficiency: Computing the BLEU score is relatively fast and straightforward, which is beneficial for large-scale evaluations where speed and computational efficiency are crucial.
- N-gram Precision: BLEU captures n-gram precision, which helps in assessing the fluency and coherence of the generated text by comparing it to reference texts (a short worked sketch follows below).
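To make the n-gram precision idea concrete, here is a minimal sketch (plain Python, no NLTK) that computes modified unigram and bigram precision for one candidate against one reference:
from collections import Counter

def modified_precision(reference_tokens, candidate_tokens, n):
    # Count candidate n-grams, then clip each count by its count in the reference
    cand = Counter(zip(*[candidate_tokens[i:] for i in range(n)]))
    ref = Counter(zip(*[reference_tokens[i:] for i in range(n)]))
    clipped = sum(min(count, ref[ngram]) for ngram, count in cand.items())
    return clipped / max(sum(cand.values()), 1)

reference = "The capital of France is Paris.".split()
candidate = "The capital of France is Berlin.".split()
print(modified_precision(reference, candidate, 1))  # 5 of 6 unigrams match the reference
print(modified_precision(reference, candidate, 2))  # 4 of 5 bigrams match the reference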
Disadvantages of BLEU Score
- Lack of Semantic Analysis: BLEU focuses on exact word matches and does not consider the meaning or context of the words. This can lead to misleading evaluations where semantically correct but differently worded responses receive low scores (see the short sketch after this list).
- Overemphasis on Precision: The metric tends to favor shorter, more concise responses that match the reference n-grams, potentially at the expense of completeness and depth. This can be problematic for tasks requiring detailed or nuanced responses.
- Dependency on Reference Texts: BLEU requires one or more reference texts for comparison. The quality and comprehensiveness of these references can significantly impact the BLEU score, and they may not cover all possible correct responses.
- Insensitivity to Word Importance: All words are treated equally in BLEU calculations, which means that critical content words and less important function words are given the same weight. This can distort the evaluation of the response’s relevance.
- Limited Applicability to Creative Tasks: For open-ended or creative prompts, BLEU may not be suitable as there can be multiple valid responses that differ significantly from the reference. This limitation makes it less effective for evaluating tasks like storytelling or complex question answering.
- Lack of Grammatical Assessment: BLEU does not explicitly evaluate the grammatical correctness or fluency of the generated text, which are important aspects of response quality.
- Potential for Gaming: Models can be optimized to produce high BLEU scores without necessarily improving the overall quality or relevance of the responses. This can lead to overfitting to the evaluation metric rather than genuine improvements in performance.
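The first point is easy to demonstrate: a paraphrase that is factually identical to the reference still scores poorly because the surface n-grams differ. A minimal sketch, using word-level tokens and NLTK smoothing as before:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
smoothie = SmoothingFunction().method1
reference = "The capital of France is Paris.".split()
paraphrase = "Paris is the capital city of France.".split()
# Same meaning, different wording -> few shared n-grams -> low BLEU
score = sentence_bleu([reference], paraphrase, smoothing_function=smoothie)
print(f"BLEU for a correct paraphrase: {score:.4f}")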
In conclusion, while the BLEU score offers a quick and quantitative way to assess the relevance of LLM-generated responses, it has significant limitations. It is best used in conjunction with other metrics and human evaluation to provide a more comprehensive assessment of LLM performance.
Summary: BLEU scores in machine learning compare computer-generated responses with human-generated responses to measure accuracy. A higher BLEU score means the computer response is closer to the human one, serving as a grading system for machine-generated content.