
A Guide to Privacy Compliance with CorePlexML

CorePlexML Team

The Challenge: A Growing Regulatory Landscape

The regulatory environment around data privacy has intensified dramatically over the past several years. Organizations that collect, store, or process personal data now face a patchwork of overlapping regulations, each with its own definitions, requirements, and penalties. HIPAA imposes strict rules on protected health information in the United States. GDPR grants sweeping data rights to individuals across the European Union. PCI-DSS mandates specific protections for payment card data worldwide. The CCPA, as amended by the CPRA, gives California residents granular control over how their personal information is used and sold.

For machine learning teams, the compliance burden is particularly acute. ML pipelines consume vast amounts of data, often combining multiple sources that may contain personally identifiable information (PII) in unexpected places. A dataset used for churn prediction might include email addresses, phone numbers, or IP addresses. A medical dataset almost certainly contains protected health information. A financial dataset may embed credit card numbers or account identifiers in free-text fields that are easy to overlook during manual review.

The consequences of non-compliance are severe. GDPR fines can reach 4% of annual global revenue. HIPAA violations carry penalties up to $1.9 million per violation category per year. Beyond financial penalties, a data breach erodes customer trust and can take years to recover from. The challenge is clear: ML teams need automated, systematic tools to detect and protect sensitive data before it enters the training pipeline.

How the Privacy Suite Works

CorePlexML's Privacy Suite addresses this challenge with a four-stage workflow: scan, detect, transform, and audit. Each stage is designed to integrate seamlessly into your existing ML pipeline, whether you work through the UI or automate everything via the SDK.

The workflow begins with a scan, where the Privacy Suite examines every column and every value in your dataset version. Detection algorithms identify PII instances and classify them by type and confidence level. You then review the findings, confirming true positives and dismissing false positives. Next, you select transformation actions for each detected PII type. Finally, the suite applies the transformations, creates a new compliant dataset version, and logs every action in an immutable audit trail.

This approach ensures that the original dataset is never modified. You always retain access to the raw data (subject to access controls), and every compliant version maintains full lineage back to its source. If regulations change or you need to apply a different transformation strategy, you can re-run the process against the original data at any time.

The PII Detection Engine

At the core of the Privacy Suite is a detection engine capable of identifying over 72 types of personally identifiable information. This breadth of coverage is critical because PII extends far beyond names and email addresses. The engine detects Social Security numbers, passport numbers, driver's license numbers, credit card numbers, IBANs, medical record numbers, DEA numbers, NPI numbers, biometric identifiers, genetic data markers, IP addresses (v4 and v6), MAC addresses, geolocation coordinates, vehicle identification numbers, and dozens more.

The engine uses four complementary detection methods to maximize both precision and recall:

Pattern matching handles structured PII that follows well-defined formats. Regular expressions tuned for each PII type identify Social Security numbers (XXX-XX-XXXX), credit card numbers (Luhn-validated), email addresses, phone numbers in various international formats, and similar structured identifiers. Pattern matching is fast and highly precise for well-formatted data.
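To make this concrete, here is a minimal sketch of what regex-plus-checksum detection looks like. The patterns and function names (`SSN_RE`, `CARD_RE`, `luhn_valid`, `find_structured_pii`) are illustrative assumptions, not CorePlexML's internal rules.

```python
import re

# Illustrative patterns: a US SSN (XXX-XX-XXXX) and a candidate card number
# of 13-16 digits, optionally separated by spaces or hyphens.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in re.sub(r"\D", "", number)]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return len(digits) >= 13 and checksum % 10 == 0

def find_structured_pii(text: str) -> dict:
    """Scan free text for SSN-shaped values and Luhn-valid card numbers."""
    return {
        "ssn": SSN_RE.findall(text),
        "card": [m for m in CARD_RE.findall(text) if luhn_valid(m)],
    }
```

The Luhn check is what separates a real card-number candidate from an arbitrary 16-digit string, which is why pattern matching stays precise even on noisy data.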

NLP models detect PII embedded in unstructured text. Named entity recognition identifies person names, organization names, addresses, and dates within free-text fields like medical notes, customer comments, or support tickets. This catches PII that pattern matching would miss because it lacks a fixed format.

ML classifiers handle ambiguous cases where the data type is not immediately obvious. A column of 10-digit numbers might be phone numbers, account IDs, or zip codes with extensions. Classifiers use statistical features of the data (distribution, cardinality, digit patterns) to determine the most likely PII type.
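A toy version of the kind of statistical features such a classifier consumes might look like the following. The feature names, thresholds, and the rule-based decision are invented for illustration; a production classifier would learn these boundaries from labeled data.

```python
from collections import Counter

def column_features(values: list[str]) -> dict:
    """Compute simple statistical features for a column of string values."""
    n = len(values)
    lengths = [len(v) for v in values]
    digit_ratio = sum(c.isdigit() for v in values for c in v) / max(sum(lengths), 1)
    return {
        "cardinality_ratio": len(set(values)) / max(n, 1),  # unique / total
        "digit_ratio": digit_ratio,
        "modal_length": Counter(lengths).most_common(1)[0][0],
    }

def guess_numeric_column(values: list[str]) -> str:
    """Toy disambiguation of all-digit columns by length and uniqueness."""
    f = column_features(values)
    if f["digit_ratio"] < 0.9:
        return "non_numeric"
    if f["modal_length"] == 10 and f["cardinality_ratio"] > 0.9:
        return "phone_number_candidate"   # 10 digits, nearly all unique
    if f["modal_length"] in (5, 9):
        return "zip_code_candidate"       # ZIP or ZIP+4 lengths
    return "account_id_candidate"
```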

Contextual analysis uses column names, neighboring columns, and table-level patterns to improve detection accuracy. A column named "SSN" or "social_security" provides a strong signal even if the values themselves are partially masked. A column next to "first_name" and "last_name" that contains numeric strings is more likely to be a phone number than a random ID.

You can initiate a scan through the SDK with a single call:

from coreplexml import CorePlexClient

client = CorePlexClient(
    base_url="https://api.coreplexml.io",
    api_key="your-api-key"
)

scan_result = client.privacy.scan(
    dataset_version_id="dv_xyz789",
    compliance_profile="HIPAA",
    confidence_threshold=0.75
)

print(f"PII instances found: {scan_result.total_findings}")
for finding in scan_result.findings:
    print(f"  Column: {finding.column} | Type: {finding.pii_type} "
          f"| Confidence: {finding.confidence:.0%} | Count: {finding.count}")

The confidence_threshold parameter controls sensitivity. Lower values catch more potential PII but may increase false positives. Higher values reduce noise but risk missing genuine PII in unusual formats. A threshold of 0.75 provides a good balance for most use cases.

Compliance Profile Deep Dive

Manually configuring which PII types to detect and how to transform them for each regulation is error-prone and time-consuming. Compliance profiles solve this by encapsulating the requirements of each regulation into a single, selectable configuration.

Here is how the four built-in profiles compare across key dimensions:

| Dimension | HIPAA | GDPR | PCI-DSS | CCPA |
|---|---|---|---|---|
| Scope | Protected Health Information (PHI) | All personal data of EU residents | Payment card data | Personal information of CA residents |
| Key PII Types | Medical records, insurance IDs, treatment dates, provider names, biometric data | Names, emails, IPs, location data, genetic data, biometric data | Card numbers, CVVs, PINs, cardholder names, expiration dates | Names, SSNs, emails, purchase history, browsing history, geolocation |
| Minimum Requirement | De-identification (Safe Harbor or Expert Determination) | Pseudonymization or anonymization | Encryption of cardholder data at rest and in transit | Right to opt-out of data sale; right to deletion |
| Recommended Actions | Redact, generalize, suppress for Safe Harbor; hash or pseudonymize for limited datasets | Pseudonymize for processing; anonymize for analytics | Encrypt or tokenize; mask for display | Redact or pseudonymize for opt-out requests; suppress for deletion |
| Audit Requirements | 6-year retention of compliance logs | Demonstrate compliance on request (accountability principle) | Quarterly scans and annual assessments | Respond to consumer requests within 45 days |

When you select a compliance profile, the Privacy Suite automatically configures the detection engine to focus on the relevant PII types and pre-selects the recommended transformation action for each type. You can override any recommendation before applying transformations.

Transformation Actions in Detail

The Privacy Suite provides eight transformation actions, each suited to different use cases and regulatory requirements:

Mask replaces characters with asterisks or other symbols while preserving the format and length of the original value. An email like john.doe@company.com becomes j***.d**@c******.com. Masking is useful when you need to preserve the general structure of data for validation or display purposes without exposing the actual values.
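One way such a format-preserving mask might be implemented is sketched below. The exact masking rules (which characters survive) are an assumption here: this version keeps the first character of each name segment and the TLD.

```python
def mask_email(email: str) -> str:
    """Mask an email: keep the first char of each name segment and the TLD."""
    local, _, domain = email.partition("@")
    host, _, tld = domain.rpartition(".")

    def mask_token(tok: str) -> str:
        return tok[0] + "*" * (len(tok) - 1)

    masked_local = ".".join(mask_token(t) for t in local.split("."))
    return f"{masked_local}@{mask_token(host)}.{tld}"
```

Note that the masked output has exactly the same length and delimiter positions as the input, which is what lets downstream format validators keep working.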

Redact removes the value entirely, replacing it with a placeholder like [REDACTED] or an empty string. This is the most aggressive protection and is appropriate when the PII column has no analytical value for your ML pipeline.

Hash applies a one-way cryptographic hash function (SHA-256 by default) to the value. The same input always produces the same hash, so you can use hashed values for joins and deduplication, but you cannot reverse the hash to recover the original value. Hashing is ideal when you need referential integrity without exposing identifiers.
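The join-preserving property is easy to see in a sketch. SHA-256 comes from the text above; the project-level salt is an added assumption (a fixed salt keeps hashes joinable across tables while blocking precomputed rainbow-table lookups).

```python
import hashlib

def hash_pii(value: str, salt: str = "project-level-salt") -> str:
    """One-way SHA-256 hash; the same input always yields the same digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
```

Because the function is deterministic, two tables hashed with the same salt can still be joined on the hashed column, but no one can recover the original identifier from the digest.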

Encrypt uses reversible AES-256 encryption with a managed key. Unlike hashing, encrypted values can be decrypted by authorized users. This is appropriate when downstream processes legitimately need access to the original values under controlled conditions.

Generalize reduces the precision of a value to make it less identifying. Exact ages become age ranges (25-30), precise locations become city or state level, timestamps lose their time component and retain only the date. Generalization preserves analytical utility while reducing re-identification risk, making it a strong choice for features you want to keep in your training data.
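Generalization is simple enough to sketch directly; the bucket size and helper names below are illustrative choices, not fixed CorePlexML behavior.

```python
def generalize_age(age: int, bucket: int = 5) -> str:
    """Map an exact age to a range, e.g. 27 -> '25-30'."""
    low = (age // bucket) * bucket
    return f"{low}-{low + bucket}"

def generalize_timestamp(ts: str) -> str:
    """Drop the time component of an ISO-8601 timestamp, keeping the date."""
    return ts.split("T")[0]
```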

Suppress removes the entire column from the dataset. This is appropriate when a column contains only PII with no analytical value, such as a full name column in a fraud detection dataset where the name is irrelevant to the prediction task.

Pseudonymize replaces real values with realistic but fake values, maintaining consistency across the dataset. All occurrences of "John Doe" become "Michael Rivera," and the mapping is consistent within a single transformation run. Pseudonymization preserves the statistical properties of the data (name length distribution, format patterns) while eliminating re-identification risk.
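The consistency guarantee is the key property, and a minimal sketch shows how it might be kept, assuming a pool of fake values and a per-run mapping (both invented here for illustration):

```python
import random

class Pseudonymizer:
    """Replace real values with fake ones, consistently within one run."""

    def __init__(self, fake_pool: list[str], seed: int = 0):
        self._rng = random.Random(seed)
        self._pool = list(fake_pool)
        self._mapping: dict[str, str] = {}

    def replace(self, value: str) -> str:
        if value not in self._mapping:
            # draw a fresh fake value for each previously unseen real value
            idx = self._rng.randrange(len(self._pool))
            self._mapping[value] = self._pool.pop(idx)
        return self._mapping[value]
```

Every occurrence of the same real value maps to the same fake value, so joins, group-bys, and frequency statistics survive the transformation.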

Tokenize maps each unique value to a random opaque token. Unlike pseudonymization, tokens bear no resemblance to the original values. A separate token vault stores the mappings, accessible only to authorized users. Tokenization is the gold standard for PCI-DSS compliance because it removes cardholder data from the processing environment entirely.
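A toy token vault makes the contrast with pseudonymization clear: tokens are opaque, and the value-to-token mapping lives only inside the vault. The class and token format below are illustrative, not CorePlexML's vault implementation.

```python
import secrets

class TokenVault:
    """Map values to opaque tokens; the mapping lives only in the vault."""

    def __init__(self):
        self._to_token: dict[str, str] = {}
        self._to_value: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        if value not in self._to_token:
            token = "tok_" + secrets.token_hex(8)  # random, reveals nothing
            self._to_token[value] = token
            self._to_value[token] = value
        return self._to_token[value]

    def detokenize(self, token: str) -> str:
        """Restricted to authorized users in a real deployment."""
        return self._to_value[token]
```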

Audit Trail and Reporting

Compliance is not just about protecting data; it is about proving you protected it. The Privacy Suite maintains an immutable audit trail that logs every action taken on every dataset. Each audit entry records the timestamp, the user who initiated the action, the dataset version ID, the PII type detected, the column affected, the transformation action applied, and a data lineage reference linking the compliant version back to its source.

You can export the complete audit history for a dataset or project at any time:

audit_report = client.privacy.export_audit(
    project_id="proj_abc123",
    format="json",                   # Also supports "csv" and "pdf"
    date_range_start="2026-01-01",
    date_range_end="2026-02-28"
)

print(f"Total audit entries: {audit_report.total_entries}")
print(f"Export format: {audit_report.format}")
print(f"Download URL: {audit_report.download_url}")

These reports are designed to be directly usable in compliance reviews. HIPAA auditors can see every PHI transformation with timestamps and user attribution. GDPR data protection officers can demonstrate that pseudonymization was applied before processing. PCI-DSS assessors can verify that cardholder data was tokenized or encrypted at rest. The audit trail is append-only, meaning entries cannot be modified or deleted, ensuring its integrity for regulatory scrutiny.
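One common way to make an append-only log tamper-evident is hash chaining, where each entry includes a hash of its predecessor. The sketch below illustrates that idea; it is an assumption for exposition, not a description of CorePlexML's actual storage mechanism.

```python
import hashlib
import json

class AuditLog:
    """Append-only log where each entry hashes the previous one."""

    def __init__(self):
        self.entries = []

    def append(self, record: dict) -> None:
        prev = self.entries[-1]["hash"] if self.entries else "0" * 64
        payload = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((prev + payload).encode()).hexdigest()
        self.entries.append({"record": record, "prev": prev, "hash": digest})

    def verify(self) -> bool:
        """Recompute the chain; editing any entry breaks every later hash."""
        prev = "0" * 64
        for e in self.entries:
            payload = json.dumps(e["record"], sort_keys=True)
            expected = hashlib.sha256((prev + payload).encode()).hexdigest()
            if e["prev"] != prev or e["hash"] != expected:
                return False
            prev = e["hash"]
        return True
```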

End-to-End SDK Integration

Here is a complete workflow that scans a dataset, reviews findings, applies transformations, and verifies the result:

# Step 1: Scan the dataset
scan = client.privacy.scan(
    dataset_version_id="dv_xyz789",
    compliance_profile="GDPR",
    confidence_threshold=0.80
)

# Step 2: Review findings and build transformation rules
rules = []
for finding in scan.findings:
    if finding.confidence >= 0.90:
        rules.append({
            "column": finding.column,
            "pii_type": finding.pii_type,
            "action": finding.recommended_action
        })

# Step 3: Apply transformations (creates a new dataset version)
result = client.privacy.apply_transformations(
    dataset_version_id="dv_xyz789",
    rules=rules
)

print(f"New compliant version: {result.new_version_id}")

# Step 4: Verify — re-scan the compliant version
verification = client.privacy.scan(
    dataset_version_id=result.new_version_id,
    compliance_profile="GDPR",
    confidence_threshold=0.80
)

print(f"Remaining PII findings: {verification.total_findings}")
# Should be 0 if all transformations applied correctly

This pattern is designed for CI/CD integration. You can embed the scan-transform-verify cycle into your data pipeline so that every new dataset version is automatically checked and protected before it reaches the training stage.
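In a pipeline, the verify step typically ends with a gate that fails the job when findings remain. A minimal sketch, assuming you feed it `verification.total_findings` from the re-scan above (the function name and threshold parameter are hypothetical):

```python
def privacy_gate(total_findings: int, allow: int = 0) -> int:
    """Return a process exit code: nonzero fails the CI/CD stage."""
    if total_findings > allow:
        print(f"Privacy gate FAILED: {total_findings} PII findings remain.")
        return 1
    print("Privacy gate passed.")
    return 0
```

Wiring the return value into `sys.exit()` is enough to block a training job from starting on a dataset version that still contains detectable PII.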

Best Practices for Privacy in ML

Drawing from real-world deployments, here are the practices that consistently produce the best outcomes:

Scan before training, every time. Make privacy scanning a mandatory gate in your ML pipeline. Even if a dataset was clean last month, a new data extraction may introduce PII from a changed upstream source. Automating the scan as a pipeline step ensures nothing slips through.

Use compliance profiles as your starting point. The built-in profiles encode regulatory expertise that would take weeks to replicate manually. Start with the profile that matches your primary regulation, then layer additional custom rules for organization-specific requirements.

Audit regularly, not just when regulators ask. Proactive auditing catches issues early. Schedule monthly audit report exports and review them with your compliance team. This also builds a track record that demonstrates ongoing diligence, which is valuable during regulatory investigations.

Combine Privacy Suite with SynthGen for extra safety. For particularly sensitive datasets, apply Privacy Suite transformations first, then use SynthGen to generate a fully synthetic version of the compliant data. The synthetic data preserves the statistical relationships needed for ML training while eliminating any residual re-identification risk. This two-layer approach is especially valuable for sharing datasets across organizational boundaries or with external partners.

Choose the right transformation for each use case. Not all PII requires the same treatment. A name column that is irrelevant to your prediction task should be suppressed or redacted. An age column that is analytically important should be generalized into ranges. A customer ID needed for joins should be hashed or tokenized. Matching the transformation to the column's role in your pipeline preserves analytical utility while achieving compliance.

Document your decisions. The audit trail captures what transformations were applied, but it does not capture why you chose them. Maintain a brief compliance rationale document for each project that explains your risk assessment, the regulations you are targeting, and why you selected specific transformation actions for each PII type. This context is invaluable during compliance reviews and when onboarding new team members.

Privacy compliance in ML is not a one-time checkbox. It is an ongoing discipline that must evolve as regulations change, datasets grow, and new data sources are integrated. CorePlexML's Privacy Suite provides the automation and auditability that make this discipline manageable at scale, so your team can focus on building models that deliver value while respecting the privacy of the individuals whose data makes that value possible.