Combining Metrics for Comprehensive Analysis of Data Quality in LLMs

Integrated Approach to Data Quality

Using multiple metrics provides a more comprehensive understanding of data quality. For instance, combining regex patterns with entity recognition can enhance detection accuracy for both data leakage and toxicity.

Related page: Understanding Data Quality Concepts and Metrics for LLM Models ; Data Leakage: Data Quality Concepts; Toxicity: Data Quality Concepts

Code Snippet: Comprehensive Data Quality Check

import re
import spacy

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Example data
prompts_responses = [
    {"prompt": "My secret code is 12345", "response": "You are an idiot"},
    {"prompt": "Contact me at example@example.com", "response": "I feel threatened by your presence"},
    {"prompt": "The password is qwerty", "response": "You are amazing"}
]

# Patterns for data leakage and explicit toxicity
leakage_patterns = {
    "email": r"[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+",
    "password": r"\b(password|pass|pwd)\b\s*\w+",
    "code": r"\b(code|secret)\b\s*\d+"
}
explicit_toxicity_pattern = r"\b(idiot|hate|stupid)\b"

# Combined analysis
for item in prompts_responses:
    prompt = item["prompt"]
    response = item["response"]
    
    # Check for data leakage
    for label, pattern in leakage_patterns.items():
        if re.search(pattern, prompt):
            print(f"Detected {label} in prompt: '{prompt}'")
    
    # Check for explicit toxicity
    if re.search(explicit_toxicity_pattern, response):
        print(f"Explicit toxicity detected in response: '{response}'")
    
    # Check for entities in prompt and response
    for text in [prompt, response]:
        doc = nlp(text)
        for ent in doc.ents:
            print(f"Entity: {ent.text}, Label: {ent.label_}")

By combining these techniques, developers and testers can ensure a more thorough and reliable assessment of data quality in LLM models.

Leave a Reply Cancel reply