Sentence Embeddings: Measuring LLM Response-to-Response Similarity for the Same Prompt

Explanation

Sentence embeddings represent sentences as dense vectors in a high-dimensional space. The similarity between two sentences can then be measured by computing the cosine similarity between their embedding vectors.

Imagine you’re comparing two summaries of the same book. Even if they don’t use the exact same words, you’d want to know whether they convey the same key points and ideas. Because embeddings capture meaning rather than exact wording, the cosine similarity between two embeddings tells you how close the sentences are in content, not just in vocabulary.

Think of it like this: the higher the cosine similarity score, the closer the language model’s response is to the reference (human-written) response in terms of content and meaning.
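
Concretely, the cosine similarity of two vectors is their dot product divided by the product of their norms; values near 1 mean the vectors point in nearly the same direction. Here is a minimal sketch with NumPy; the toy three-dimensional vectors are made up for illustration (real sentence embeddings have hundreds of dimensions), and cosine_sim is our own helper.

import numpy as np

def cosine_sim(a, b):
    # Dot product of the two vectors divided by the product of their magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similar = cosine_sim(np.array([0.9, 0.1, 0.2]), np.array([0.8, 0.2, 0.3]))
different = cosine_sim(np.array([0.9, 0.1, 0.2]), np.array([0.1, 0.9, 0.7]))
print(f"Similar vectors:    {similar:.4f}")    # close to 1
print(f"Dissimilar vectors: {different:.4f}")  # noticeably lower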

Purpose: To evaluate the similarity between responses and reference texts or among multiple responses to the same prompt.

Example:

  • Prompt: “What is the capital of France?”
  • Reference Response: “The capital of France is Paris.”
  • Generated Response 1 (Success): “Paris is the capital of France.” (High similarity score)
  • Generated Response 2 (Failure): “Berlin is the capital of France.” (Low similarity score)
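
Before the full implementation, here is a minimal sketch of scoring this example with embeddings. It uses the same all-MiniLM-L6-v2 model as the code below; util.cos_sim is sentence-transformers’ built-in cosine-similarity helper.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

reference = "The capital of France is Paris."
candidates = [
    "Paris is the capital of France.",   # success case
    "Berlin is the capital of France.",  # failure case
]

# Encode the reference once and compare each candidate against it
ref_emb = model.encode(reference)
for candidate in candidates:
    score = util.cos_sim(ref_emb, model.encode(candidate)).item()
    print(f"Similarity to reference: {score:.4f}  |  {candidate}")
# As in the example above, the correct paraphrase is expected to score higher
# than the factually wrong response.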

Implementation:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Example prompts and responses
prompts = [
    "Tell me about Shakespeare's plays.",
    "What is the capital of France?",
    "How does a computer work?"
]

responses = [
    ["Shakespeare's renowned works encompass tragedies such as 'Hamlet,' 'Macbeth,' and 'Othello'",
     "Some of Shakespeare's notable plays are 'Hamlet,' 'Macbeth,' and 'Othello'",
     "Shakespeare's plays include 'Hamlet,' 'Macbeth,' and 'Othello.'"],
    ["Paris is the capital city of France",
     "The city of Paris serves as the capital of France.",
     "The capital of France is Paris."],
    ["A computer is a type of bird.",
     "The history of computers dates back to the 19th century.",
     "A computer processes data using a central processing unit (CPU) and memory."]
]

# Example reference responses for prompts
references = [
    ["Shakespeare's plays include 'Hamlet,' 'Macbeth,' and 'Othello.'"],
    ["The capital of France is Paris."],
    ["A computer processes data using a central processing unit (CPU) and memory."]
]

# Calculate sentence embeddings and similarity
for i, prompt in enumerate(prompts):
    print(f"Prompt: {prompt}")
    reference_text = references[i][0]
    print(f"Reference: {reference_text}")
    reference_embedding = model.encode(reference_text)
    response_embeddings = [model.encode(response) for response in responses[i]]

    # Compare each response to the reference
    ref_similarities = [cosine_similarity([reference_embedding], [resp_emb])[0][0] for resp_emb in response_embeddings]

    # Compare each response to every other response
    resp_similarities = cosine_similarity(response_embeddings)

    for j, response in enumerate(responses[i]):
        print(f"  Response {j + 1}: {response}")
        print(f"    Similarity to Reference: {ref_similarities[j]:.4f}")

        # Calculate and print intermediate similarities to other responses
        other_resp_similarities = []
        for k in range(len(responses[i])):
            if k != j:
                similarity = resp_similarities[j][k]
                other_resp_similarities.append(similarity)
                print(f"    Similarity to Response {k + 1}: {responses[i][k]}")
                print(f"      Intermediate Similarity: {similarity:.4f}")

        avg_other_similarity = np.mean(other_resp_similarities)
        print(f"    Average Similarity to Other Responses: {avg_other_similarity:.4f}")

        # Calculate overall average similarity
        overall_avg = np.mean([ref_similarities[j]] + other_resp_similarities)
        print(f"    Overall Average Similarity: {overall_avg:.4f}")

Prerequisites:

  • Install sentence-transformers: pip install sentence-transformers
  • Install scikit-learn and NumPy if they are not already present: pip install scikit-learn numpy
  • Import SentenceTransformer from sentence_transformers.
  • Import cosine_similarity from sklearn.metrics.pairwise.
  • Import numpy as np.

Expected Output: A numerical score indicating the similarity between each response and the reference text or among responses to the same prompt. The overall average similarity provides a single metric that takes into account both the accuracy (similarity to reference) and the consistency (similarity to other responses) of each response. Higher scores indicate greater similarity.

Prompt: Tell me about Shakespeare's plays.
Reference: Shakespeare's plays include 'Hamlet,' 'Macbeth,' and 'Othello.'
  Response 1: Shakespeare's renowned works encompass tragedies such as 'Hamlet,' 'Macbeth,' and 'Othello'
    Similarity to Reference: 0.8304
    Similarity to Response 2: Some of Shakespeare's notable plays are 'Hamlet,' 'Macbeth,' and 'Othello'
      Intermediate Similarity: 0.8626
    Similarity to Response 3: Shakespeare's plays include 'Hamlet,' 'Macbeth,' and 'Othello.'
      Intermediate Similarity: 0.8304
    Average Similarity to Other Responses: 0.8465
    Overall Average Similarity: 0.8411
  Response 2: Some of Shakespeare's notable plays are 'Hamlet,' 'Macbeth,' and 'Othello'
    Similarity to Reference: 0.9373
    Similarity to Response 1: Shakespeare's renowned works encompass tragedies such as 'Hamlet,' 'Macbeth,' and 'Othello'
      Intermediate Similarity: 0.8626
    Similarity to Response 3: Shakespeare's plays include 'Hamlet,' 'Macbeth,' and 'Othello.'
      Intermediate Similarity: 0.9373
    Average Similarity to Other Responses: 0.9000
    Overall Average Similarity: 0.9124
  Response 3: Shakespeare's plays include 'Hamlet,' 'Macbeth,' and 'Othello.'
    Similarity to Reference: 1.0000
    Similarity to Response 1: Shakespeare's renowned works encompass tragedies such as 'Hamlet,' 'Macbeth,' and 'Othello'
      Intermediate Similarity: 0.8304
    Similarity to Response 2: Some of Shakespeare's notable plays are 'Hamlet,' 'Macbeth,' and 'Othello'
      Intermediate Similarity: 0.9373
    Average Similarity to Other Responses: 0.8839
    Overall Average Similarity: 0.9226


Prompt: What is the capital of France?
Reference: The capital of France is Paris.
  Response 1: Paris is the capital city of France
    Similarity to Reference: 0.9637
    Similarity to Response 2: The city of Paris serves as the capital of France.
      Intermediate Similarity: 0.9601
    Similarity to Response 3: The capital of France is Paris.
      Intermediate Similarity: 0.9637
    Average Similarity to Other Responses: 0.9619
    Overall Average Similarity: 0.9625
  Response 2: The city of Paris serves as the capital of France.
    Similarity to Reference: 0.9462
    Similarity to Response 1: Paris is the capital city of France
      Intermediate Similarity: 0.9601
    Similarity to Response 3: The capital of France is Paris.
      Intermediate Similarity: 0.9462
    Average Similarity to Other Responses: 0.9532
    Overall Average Similarity: 0.9509
  Response 3: The capital of France is Paris.
    Similarity to Reference: 1.0000
    Similarity to Response 1: Paris is the capital city of France
      Intermediate Similarity: 0.9637
    Similarity to Response 2: The city of Paris serves as the capital of France.
      Intermediate Similarity: 0.9462
    Average Similarity to Other Responses: 0.9550
    Overall Average Similarity: 0.9700


Prompt: How does a computer work?
Reference: A computer processes data using a central processing unit (CPU) and memory.
  Response 1: A computer is a type of bird.
    Similarity to Reference: 0.4440
    Similarity to Response 2: The history of computers dates back to the 19th century.
      Intermediate Similarity: 0.4981
    Similarity to Response 3: A computer processes data using a central processing unit (CPU) and memory.
      Intermediate Similarity: 0.4440
    Average Similarity to Other Responses: 0.4711
    Overall Average Similarity: 0.4621
  Response 2: The history of computers dates back to the 19th century.
    Similarity to Reference: 0.3757
    Similarity to Response 1: A computer is a type of bird.
      Intermediate Similarity: 0.4981
    Similarity to Response 3: A computer processes data using a central processing unit (CPU) and memory.
      Intermediate Similarity: 0.3757
    Average Similarity to Other Responses: 0.4369
    Overall Average Similarity: 0.4165
  Response 3: A computer processes data using a central processing unit (CPU) and memory.
    Similarity to Reference: 1.0000
    Similarity to Response 1: A computer is a type of bird.
      Intermediate Similarity: 0.4440
    Similarity to Response 2: The history of computers dates back to the 19th century.
      Intermediate Similarity: 0.3757
    Average Similarity to Other Responses: 0.4099
    Overall Average Similarity: 0.6066

Sentence Embeddings for Measuring LLM Responses: A Double-Edged Sword in Hallucination Detection

As development teams working with Large Language Models (LLMs), we’re constantly seeking efficient ways to evaluate model outputs and detect potential hallucinations. Sentence embeddings have emerged as a powerful tool in this quest, offering both advantages and challenges. Let’s dive into how sentence embeddings can be leveraged for comparing LLM responses to the same prompt, with a focus on hallucination detection.

Advantages
  • Semantic Understanding: Sentence embeddings capture the overall meaning and context of a sentence, going beyond simple word-level comparisons. This allows us to detect subtle semantic differences between responses, which is crucial for identifying hallucinations that might be semantically plausible but factually incorrect.
  • Efficiency and Scalability: Unlike traditional string-matching techniques, sentence embeddings allow us to represent entire sentences as fixed-length vectors. This reduced dimensionality makes it computationally efficient to compare large numbers of responses, enabling scalable evaluation of LLM outputs.
  • Robustness to Paraphrasing: LLMs often generate responses that convey the same information in different words. Sentence embeddings are particularly good at recognizing semantic similarity despite lexical differences, helping us identify consistent responses even when they’re phrased differently.
  • Multilingual Capabilities: Many modern embedding models, like BERT-based ones, offer strong multilingual support. This is invaluable for development teams working on multilingual LLMs, allowing for consistent evaluation across languages (see the sketch after this list).
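
For instance, a multilingual embedding model can score an English reference directly against a response in another language. A minimal sketch, assuming the paraphrase-multilingual-MiniLM-L12-v2 checkpoint (any multilingual sentence-embedding model would do):

from sentence_transformers import SentenceTransformer, util

# A multilingual checkpoint; swap in whichever multilingual model fits your stack
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

english = "The capital of France is Paris."
french = "La capitale de la France est Paris."

# Cross-lingual cosine similarity between the two embeddings
score = util.cos_sim(model.encode(english), model.encode(french)).item()
print(f"Cross-lingual similarity: {score:.4f}")
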
Disadvantages
  • Lack of Factual Verification: While sentence embeddings excel at capturing semantic similarity, they don’t inherently verify factual accuracy. A hallucinated response that’s semantically coherent but factually incorrect might still have a high similarity score with a correct response.
  • Computational Intensity: Generating high-quality sentence embeddings, especially with large transformer models, can be computationally expensive. This might pose challenges for real-time evaluation or when working with large datasets of LLM responses.
  • Model Dependence: The quality of comparison heavily depends on the underlying embedding model. Using a model that’s not well-aligned with your domain or task could lead to misleading similarity scores.
  • Threshold Determination: Deciding on the right similarity threshold for flagging potential hallucinations can be tricky. Set it too high, and you’ll flag many acceptable paraphrases as false positives; too low, and you might miss subtle hallucinations (a minimal flagging sketch follows this list).
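
One pragmatic approach is to flag any response whose similarity to the reference falls below a tunable cutoff, then calibrate that cutoff against human judgments. A minimal sketch using the same libraries as the implementation above; the 0.7 value is an arbitrary placeholder, not a recommendation:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')
SIMILARITY_THRESHOLD = 0.7  # placeholder value; calibrate on labeled data

reference = "A computer processes data using a central processing unit (CPU) and memory."
candidates = [
    "A computer is a type of bird.",
    "A computer processes data using a CPU and memory.",
]

# Encode the reference and the candidates, then score each candidate against the reference
ref_emb = model.encode([reference])
cand_embs = model.encode(candidates)
scores = cosine_similarity(ref_emb, cand_embs)[0]

for text, score in zip(candidates, scores):
    flag = "POTENTIAL HALLUCINATION" if score < SIMILARITY_THRESHOLD else "OK"
    print(f"[{flag}] {score:.4f}  {text}")
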
Practical Implementation Tips
  1. Choose the Right Model: Consider domain-specific models if available, or fine-tune general models on your specific data for better performance.
  2. Combine with Other Techniques: Use sentence embeddings as part of a broader evaluation strategy. Complement them with fact-checking against knowledge bases or rule-based systems for more robust hallucination detection.
  3. Benchmark and Iterate: Regularly benchmark your embedding-based evaluation against human judgments to ensure alignment and adjust as needed.
  4. Leverage Libraries: Utilize established libraries like Sentence-BERT (SBERT) for easy implementation and experimentation with different models.

Summary:

Sentence embeddings measure the similarity between computer-generated and human-generated responses by comparing their vector representations. A higher similarity score means the responses are closer in content and meaning, providing a deeper understanding of their alignment.
