Integrated Approach to Data Quality
Using multiple metrics provides a more comprehensive understanding of data quality. For instance, combining regex patterns with entity recognition can enhance detection accuracy for both data leakage and toxicity.
Related page: Understanding Data Quality Concepts and Metrics for LLM Models ; Data Leakage: Data Quality Concepts; Toxicity: Data Quality Concepts
Code Snippet: Comprehensive Data Quality Check
import re
import spacy
# Load spaCy model
nlp = spacy.load("en_core_web_sm")
# Example data
prompts_responses = [
{"prompt": "My secret code is 12345", "response": "You are an idiot"},
{"prompt": "Contact me at example@example.com", "response": "I feel threatened by your presence"},
{"prompt": "The password is qwerty", "response": "You are amazing"}
]
# Patterns for data leakage and explicit toxicity
leakage_patterns = {
"email": r"[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+",
"password": r"\b(password|pass|pwd)\b\s*\w+",
"code": r"\b(code|secret)\b\s*\d+"
}
explicit_toxicity_pattern = r"\b(idiot|hate|stupid)\b"
# Combined analysis
for item in prompts_responses:
prompt = item["prompt"]
response = item["response"]
# Check for data leakage
for label, pattern in leakage_patterns.items():
if re.search(pattern, prompt):
print(f"Detected {label} in prompt: '{prompt}'")
# Check for explicit toxicity
if re.search(explicit_toxicity_pattern, response):
print(f"Explicit toxicity detected in response: '{response}'")
# Check for entities in prompt and response
for text in [prompt, response]:
doc = nlp(text)
for ent in doc.ents:
print(f"Entity: {ent.text}, Label: {ent.label_}")
By combining these techniques, developers and testers can ensure a more thorough and reliable assessment of data quality in LLM models.