In this post on Data Leakage you will find:
- Definition and Impact
- Techniques for Detection
- Pattern Matching with Code Snippets
- Entity Recognition with Code Snippets
- Metrics for Quantifying Data Leakage
Related pages: Understanding Data Quality Concepts and Metrics for LLM Models; Determining the Appropriate Amount of Test Data for Evaluating Machine Learning Models
What is Data Leakage?
Data leakage occurs when sensitive or confidential information unintentionally gets incorporated into the training dataset or is inadvertently exposed during interactions with the model. This can lead to security breaches and the erosion of user trust. For example, a training dataset containing user passwords or email addresses can result in these being generated by the model during predictions or outputs.
Types of Data Leakage
Data leakage can occur in several ways throughout the machine learning process:
- User-initiated leakage: Users may inadvertently include sensitive information in their prompts or queries when interacting with models.
- Model-generated leakage: The model itself might reveal confidential data in its responses to user queries.
- Training-test data contamination: Test data accidentally included in the training set can lead to an overly optimistic estimate of the model's accuracy (a simple overlap check is sketched below).
These forms of data leakage can be detected and mitigated at various stages of the data pipeline. Implementing proper safeguards and regular audits can help maintain data integrity and ensure more accurate model evaluation.
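The third form, training-test contamination, can be checked before training by comparing the two splits directly. Below is a minimal sketch (the example texts are hypothetical) that flags test examples whose normalized text also appears in the training set. Exact matching only catches verbatim duplicates; near-duplicate checks (e.g., n-gram or embedding similarity) can be layered on top.

def normalize(text):
    # Lowercase and collapse whitespace so trivial formatting
    # differences don't hide duplicates.
    return " ".join(text.lower().split())

# Hypothetical example splits
train_texts = ["The password is qwerty", "What is data leakage?"]
test_texts = ["What is data leakage?", "Explain entity recognition."]

train_set = {normalize(t) for t in train_texts}
contaminated = [t for t in test_texts if normalize(t) in train_set]

print(f"Contaminated test examples: {len(contaminated)} / {len(test_texts)}")
for t in contaminated:
    print(f"Overlaps with training data: '{t}'")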
Detecting Data Leakage
Pattern Matching with Regex
Regex (Regular Expressions) is a powerful tool for pattern matching and can be used to detect common forms of sensitive information in text data, such as email addresses, passwords, and secret codes.
Code Snippet: Using Regex to Detect Patterns
import re
# Example data
prompts = ["My secret code is 12345", "Contact me at example@example.com", "The password is qwerty"]
# Regex patterns for common sensitive information
patterns = {
    "email": r"[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+",
    "password": r"\b(password|pass|pwd)\b\s*\w+",
    "code": r"\b(code|secret)\b\s*\w+"
}
# Detecting patterns
for prompt in prompts:
    for label, pattern in patterns.items():
        if re.search(pattern, prompt):
            print(f"Detected {label} in prompt: '{prompt}'")
Explanation:
- Purpose: This script uses regex patterns to identify and flag sensitive information in text data.
- Implementation:
  - Import the re library for regex operations.
  - Define example prompts containing potentially sensitive information.
  - Define regex patterns for detecting emails, passwords, and secret codes.
  - Iterate through the prompts and search for each pattern; when a pattern is found, print the label of the detected sensitive information.
- Prerequisites: Basic knowledge of Python and regex.
- Expected Output:
Detected code in prompt: 'My secret code is 12345'
Detected email in prompt: 'Contact me at example@example.com'
Detected password in prompt: 'The password is qwerty'
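Detection extends naturally to redaction: re.sub can replace each match with a placeholder token before the text reaches a training set or a log. A minimal sketch reusing the patterns above (the placeholder format is an arbitrary choice):

import re

patterns = {
    "email": r"[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+",
    "password": r"\b(password|pass|pwd)\b\s*\w+",
    "code": r"\b(code|secret)\b\s*\w+"
}

def redact(text):
    # Replace every match of every pattern with a labeled placeholder.
    for label, pattern in patterns.items():
        text = re.sub(pattern, f"[{label.upper()}]", text)
    return text

print(redact("Contact me at example@example.com"))
# Expected: Contact me at [EMAIL]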
Entity Recognition for Data Leakage
Entity Recognition involves identifying predefined entities (such as dates, names, or financial amounts) in text. Using Python modules like spaCy or SpanMarker, we can extend this to detect sensitive entities, enhancing our ability to spot data leakage.
Real-Life Scenario:
Imagine you’re developing a customer support chatbot for a bank. You need to ensure that the bot doesn’t inadvertently share sensitive information, such as account numbers or social security numbers, in its responses. By using entity recognition, you can detect and flag such sensitive information in the training data and the bot’s responses.
Code Snippet: Entity Recognition with spaCy.
import spacy
# Load spaCy model
nlp = spacy.load("en_core_web_sm")
# Example data
prompts = [
    "My account number is 1234567890.",
    "The password for my online banking is SecurePass123!",
    "The current balance in my savings account is $10,000.",
    "The last four digits of my credit card are 5678.",
    "My home address is 123 Main St, Anytown, USA 12345.",
    "The routing number for my checking account is 987654321.",
    "My current credit score is 750.",
    "I need to transfer $500 from account 11111 to account 22222.",
    "The interest rate on my mortgage is 3.5%.",
    "My social security number is 123-45-6789.",
    "Can you update my email to johndoe@example.com?",
    "My PIN for the ATM is 4321.",
    "The CVV on my credit card is 123.",
    "I'd like to change my phone number to 555-123-4567.",
    "My date of birth is January 1, 1980.",
    "My secret code is 12345",
    "Contact me at example@example.com",
    "The password is qwerty"
]
# Processing prompts
for prompt in prompts:
    doc = nlp(prompt)
    for ent in doc.ents:
        print(f"Entity: {ent.text}, Label: {ent.label_}")
Explanation:
- Purpose: This script uses the spaCy NLP library to identify and classify entities in text data.
- Implementation:
  - Install spaCy and download the "en_core_web_sm" model (pip install spacy and python -m spacy download en_core_web_sm).
  - Load the spaCy model.
  - Define example prompts containing potentially sensitive information.
  - Process each prompt with the spaCy model to detect entities, printing each detected entity and its label.
- Prerequisites: spaCy installed; basic knowledge of Python and NLP.
- Expected Output:
Entity: 1234567890, Label: DATE
Entity: 10,000, Label: MONEY
Entity: four, Label: CARDINAL
Entity: 5678, Label: DATE
Entity: 123, Label: CARDINAL
Entity: Main St, Label: PERSON
Entity: Anytown, Label: GPE
Entity: 987654321, Label: DATE
Entity: 750, Label: CARDINAL
Entity: 500, Label: MONEY
Entity: 11111, Label: DATE
Entity: 22222, Label: DATE
Entity: 3.5%, Label: PERCENT
Entity: 123, Label: CARDINAL
Entity: PIN, Label: ORG
Entity: ATM, Label: ORG
Entity: 4321, Label: DATE
Entity: CVV, Label: ORG
Entity: 123, Label: CARDINAL
Entity: 555-123-4567, Label: CARDINAL
Entity: January 1, 1980, Label: DATE
Entity: 12345, Label: DATE
Notice that the general-purpose model mislabels several of these spans (for example, the account number is tagged as DATE and "Main St" as PERSON), so its output should be treated as a signal rather than ground truth. Another Entity Recognition Python module you may explore is SpanMarker.
Code Snippet: Entity Recognition with SpanMarker.
from span_marker import SpanMarkerModel
# Load a finetuned SpanMarkerModel from the Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")
prompts = [
    "The Ninth suffered a serious defeat at the Battle of Camulodunum under Quintus Petillius Cerialis in the rebellion of Boudica (61), when most of the foot-soldiers were killed in a disastrous attempt to relieve the besieged city of Camulodunum (Colchester).",
    "He was born in Wellingborough, Northamptonshire, where he attended Victoria Junior School, Westfield Boys School and Sir Christopher Hatton School.",
    "Nintendo continued to sell the revised Wii model and the Wii Mini alongside the Wii U during the Wii U's first release year.",
    "Dorsa has a Bachelor of Music in Composition from California State University, Northridge in 2001, Master of Music in Harpsichord Performance at Cal State Northridge in 2004, and a Doctor of Musical Arts at the University of Michigan, Ann Arbor in 2008."
]
# Processing prompts
entities_per_sentence = model.predict(prompts)
# Print each detected span along with its label and confidence score.
for entities in entities_per_sentence:
    for entity in entities:
        print(f"Span: {entity['span']}, Label: {entity['label']}, Score: {entity['score']}")
Explanation:
- Purpose: This script uses the SpanMarker framework to identify and classify entities in text data.
- Prerequisites:
  - Install the required library (pip install span_marker).
  - Ensure that you have an internet connection to download the pre-trained model.
- You can customize the prompts and analyze different types of sensitive information.
- Expected Output: each detected entity is a dictionary with three fields:
  - "span": the entity span as a string.
  - "label": the string label for the found entity.
  - "score": the probability score indicating the model's confidence.
Span: Battle of Camulodunum, Label: event-attack/battle/war/militaryconflict, Score: 0.9433467388153076
Span: Quintus Petillius Cerialis, Label: person-soldier, Score: 0.5088788866996765
Span: Camulodunum, Label: location-GPE, Score: 0.9582282304763794
Span: Colchester, Label: location-GPE, Score: 0.9701601266860962
Span: Wellingborough, Label: location-GPE, Score: 0.9946461319923401
Span: Northamptonshire, Label: location-GPE, Score: 0.9666397571563721
Span: Victoria Junior School, Label: organization-education, Score: 0.9887372255325317
Span: Westfield Boys School, Label: organization-education, Score: 0.9936436414718628
Span: Sir Christopher Hatton School, Label: organization-education, Score: 0.9760763645172119
Span: Nintendo, Label: organization-company, Score: 0.9846269488334656
Span: Wii, Label: product-other, Score: 0.9477915167808533
Span: Wii Mini, Label: product-other, Score: 0.9741908311843872
Span: Wii U, Label: product-other, Score: 0.9770299196243286
Span: Wii U', Label: product-other, Score: 0.9612721800804138
Span: Dorsa, Label: person-other, Score: 0.4802011549472809
Span: Bachelor of Music in Composition, Label: other-educationaldegree, Score: 0.8189163208007812
Span: California State University, Label: organization-education, Score: 0.9621313810348511
Span: Northridge, Label: location-GPE, Score: 0.9189688563346863
Span: Master of Music in Harpsichord Performance, Label: other-educationaldegree, Score: 0.61509770154953
Span: Cal State Northridge, Label: organization-education, Score: 0.8989953398704529
Span: Doctor of Musical Arts, Label: other-educationaldegree, Score: 0.7796889543533325
Span: University of Michigan, Label: organization-education, Score: 0.8410290479660034
Span: Ann Arbor, Label: location-GPE, Score: 0.7744064331054688
This example demonstrates how to use the SpanMarker module for entity recognition, making it easier to detect sensitive information in text data. You can decide which entities are considered leakage risks (e.g., labels like location-GPE, person-soldier, organization-education) and set thresholds on the confidence score. Once you've identified these sensitive entities, create filters in your code to exclude or mask this information as needed (a minimal masking sketch follows below). Additionally, to improve your model's ability to recognize potential leakage, consider training it on domain-specific data. This can help the model better understand the context and importance of different entity types in your particular field.
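As one possible filter, the sketch below masks any spaCy entity whose label appears in a configurable deny-list. The label set is an illustrative assumption, and the snippet reuses the en_core_web_sm setup from earlier:

import spacy

nlp = spacy.load("en_core_web_sm")

# Illustrative deny-list: labels treated as leakage risks.
SENSITIVE_LABELS = {"PERSON", "GPE", "MONEY", "DATE"}

def mask_entities(text):
    doc = nlp(text)
    # Replace entities right-to-left so character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in SENSITIVE_LABELS:
            text = text[:ent.start_char] + f"[{ent.label_}]" + text[ent.end_char:]
    return text

print(mask_entities("My date of birth is January 1, 1980."))
# Expected: My date of birth is [DATE].

The same approach applies to SpanMarker results, using each entity's span positions and your chosen score threshold.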
Metrics for Quantifying Data Leakage
- Percentage of Prompts with Sensitive Information
- Definition: Measures the prevalence of sensitive data in prompts.
- Calculation: (Number of prompts with sensitive data / Total prompts) * 100.
- Example: If 15 out of 100 prompts contain sensitive information, the percentage is 15%.
- Why it’s useful: Provides a quick overview of the scale of potential data leakage, helping prioritize mitigation efforts.
- Frequency of Specific Patterns
- Definition: Counts occurrences of defined sensitive data patterns (e.g., email addresses, passwords).
- Process: Use regex to identify and count these patterns in the dataset.
- Example: In 1000 entries, you might find 50 email addresses and 30 phone numbers.
- Why it’s useful: Identifies which types of sensitive information are most commonly leaked, allowing for targeted protection measures.
- Number of Detected Entities
- Definition: Uses Named Entity Recognition (NER) to count sensitive entities (e.g., names, locations).
- Process: Identify and tally sensitive entities in the text.
- Example: In 500 documents, you might detect 200 person names and 150 locations.
- Why it’s useful: Captures more complex forms of sensitive information that pattern matching might miss, providing a more comprehensive view of potential data leakage.
These metrics collectively offer a multi-faceted approach to quantifying data leakage, enabling organizations to assess risk, prioritize security measures, and track the effectiveness of their data protection strategies.
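A minimal sketch computing all three metrics in one pass, reusing the regex patterns and spaCy pipeline from the earlier snippets (the pattern list, label set, and prompts are illustrative):

import re
import spacy

nlp = spacy.load("en_core_web_sm")

patterns = {
    "email": r"[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+",
    "password": r"\b(password|pass|pwd)\b\s*\w+"
}
SENSITIVE_LABELS = {"PERSON", "GPE", "MONEY"}  # illustrative choice

prompts = [
    "Contact me at example@example.com",
    "The password is qwerty",
    "Tell me about data leakage."
]

pattern_counts = {label: 0 for label in patterns}
entity_count = 0
flagged_prompts = 0

for prompt in prompts:
    sensitive = False
    # Metric 2: frequency of specific patterns.
    for label, pattern in patterns.items():
        matches = re.findall(pattern, prompt)
        if matches:
            pattern_counts[label] += len(matches)
            sensitive = True
    # Metric 3: number of detected sensitive entities.
    for ent in nlp(prompt).ents:
        if ent.label_ in SENSITIVE_LABELS:
            entity_count += 1
            sensitive = True
    # Metric 1: percentage of prompts with sensitive information.
    if sensitive:
        flagged_prompts += 1

print(f"Prompts with sensitive info: {100 * flagged_prompts / len(prompts):.1f}%")
print(f"Pattern frequencies: {pattern_counts}")
print(f"Sensitive entities detected: {entity_count}")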
Further Learning: Quality and Safety for LLM Applications, a free course on DeepLearning.AI