Synthetic data generation
Generate millions of statistically faithful synthetic records using CTGAN, CopulaGAN, TVAE, and Gaussian Copula. Share data with a measured, reduced privacy risk.

4 generation engines for every data type
Choose the optimal engine for your data distribution. Each engine excels in different scenarios.
CTGAN
Conditional Tabular GAN. Best for mixed-type data (numeric + categorical). Handles class imbalance and missing values natively.
CopulaGAN
Copula-based GAN for complex multivariate correlations. Preserves non-linear relationships between features better than standard GANs.
TVAE
Tabular Variational Autoencoder. Excels with high-dimensional data and complex distributions. Balanced speed and quality.
Gaussian Copula
Statistical model best suited to continuous data. Models each column's marginal distribution and captures dependence between columns through a correlation structure, so it handles monotonic relationships well but can miss more complex ones. Fastest training time of the four engines.
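To make the copula idea concrete, here is a minimal two-column sketch using only the Python standard library. It is an illustration of the general Gaussian copula technique, not CorePlexML's implementation; the function names (`normal_scores`, `fit_sample_gaussian_copula`) are hypothetical, and the sketch assumes numeric columns without heavy ties.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def normal_scores(values):
    """Map a column to standard-normal scores through its empirical ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly inside (0, 1).
        scores[idx] = _STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def empirical_quantile(sorted_values, u):
    """Inverse empirical CDF: the observed value at probability u."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def fit_sample_gaussian_copula(col_a, col_b, num_rows, seed=0):
    """Fit a two-column Gaussian copula and draw num_rows synthetic pairs."""
    rng = random.Random(seed)
    # Dependence is summarised by one correlation in normal-score space.
    rho = pearson(normal_scores(col_a), normal_scores(col_b))
    rho = max(-1.0, min(1.0, rho))  # guard against floating-point overshoot
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(num_rows):
        # Draw a correlated pair of standard normals.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Map each normal back to the original marginal distribution.
        rows.append((
            empirical_quantile(sorted_a, _STD_NORMAL.cdf(z1)),
            empirical_quantile(sorted_b, _STD_NORMAL.cdf(z2)),
        ))
    return rows
```

Because the marginals are resampled from the observed values and the dependence is reduced to a single correlation, fitting amounts to sorting and one pass over the data, which is why this family of models trains quickly.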
Key Capabilities
Everything you need to get the most out of this module.
4 AI Engines
CTGAN, CopulaGAN, TVAE, and Gaussian Copula — choose the best engine for your data distribution.
Quality Metrics
KL divergence, correlation preservation, and coverage metrics ensure statistical fidelity.
Privacy Scoring
Re-identification risk assessment, k-anonymity verification, and privacy guarantees.
Scale
Generate 10M+ synthetic records. Batch processing with configurable output sizes.
Statistical fidelity meets privacy guarantees
Every synthetic dataset is scored for both data quality and privacy protection.
KL Divergence
Measures how far the synthetic distributions diverge from the original data. A value of 0 means the distributions are identical; lower values indicate higher fidelity to the source distribution.
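As a rough sketch of how such a score can be computed for a single categorical column, the function below estimates KL(real || synthetic) from value counts. The direction of the divergence, the `smoothing` constant, and the function name are assumptions made for illustration; the platform's exact formula is not documented here. Numeric columns would need to be binned first.

```python
import math
from collections import Counter


def kl_divergence(real, synthetic, smoothing=1e-9):
    """Estimate KL(real || synthetic) over the categories seen in either sample."""
    categories = set(real) | set(synthetic)
    real_counts, synth_counts = Counter(real), Counter(synthetic)
    real_total = len(real) + smoothing * len(categories)
    synth_total = len(synthetic) + smoothing * len(categories)
    total = 0.0
    for category in categories:
        # Smoothing avoids log(0) when a category is missing from one sample.
        p = (real_counts[category] + smoothing) / real_total
        q = (synth_counts[category] + smoothing) / synth_total
        total += p * math.log(p / q)
    return total


real_plan = ["basic"] * 70 + ["pro"] * 25 + ["enterprise"] * 5
skewed_plan = ["basic"] * 95 + ["pro"] * 5
print(kl_divergence(real_plan, real_plan))
print(kl_divergence(real_plan, skewed_plan))
```

The second call scores much worse than the first because the skewed sample drops the "enterprise" category entirely.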
Correlation Preservation
Validates that feature relationships in synthetic data mirror the original. Critical for maintaining data utility in downstream models.
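One simple way to check this, sketched below, is to compare the pairwise Pearson correlations of the original and synthetic columns and report the absolute gap for each pair. The `correlation_gap` helper and the dict-of-lists data layout are hypothetical, and Pearson correlation is only one possible choice of dependence measure.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def correlation_gap(real, synthetic):
    """Return {(col_a, col_b): |corr_real - corr_synthetic|} for every column pair.

    Both arguments map a column name to a list of numeric values.
    """
    gaps = {}
    for a, b in combinations(sorted(real), 2):
        real_corr = pearson(real[a], real[b])
        synth_corr = pearson(synthetic[a], synthetic[b])
        gaps[(a, b)] = abs(real_corr - synth_corr)
    return gaps
```

A gap near 0 for every pair suggests the relationships carried over; a large gap on one pair points at the features a downstream model is most likely to learn incorrectly.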
Coverage Metrics
Checks that synthetic data covers the full range of each marginal distribution. Flags mode collapse and edge-case gaps.
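Coverage can be sketched with two small helpers: one for categorical columns (what share of the original categories appear at all) and one for numeric columns (what share of the original value range is spanned). Both function names and formulas are illustrative assumptions, not the platform's documented metrics.

```python
def category_coverage(real, synthetic):
    """Fraction of the original categories that appear in the synthetic column."""
    real_categories = set(real)
    return len(real_categories & set(synthetic)) / len(real_categories)


def range_coverage(real, synthetic):
    """Fraction of the original numeric range spanned by the synthetic column."""
    lo_real, hi_real = min(real), max(real)
    lo_synth, hi_synth = min(synthetic), max(synthetic)
    if hi_real == lo_real:
        # Degenerate case: the original column holds a single value.
        return 1.0 if lo_synth <= lo_real <= hi_synth else 0.0
    overlap = min(hi_real, hi_synth) - max(lo_real, lo_synth)
    return max(0.0, overlap) / (hi_real - lo_real)
```

A category coverage well below 1.0 is a typical symptom of mode collapse, where the generator stops producing rare categories.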
Re-identification Risk
Estimates the likelihood of tracing synthetic records back to original individuals. Scored from 0 (lowest risk) to 1 (highest risk).
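How the platform computes this score is not described here. As a deliberately simple proxy, the sketch below reports the share of synthetic rows that exactly reproduce an original row; the `exact_match_rate` name is hypothetical. It yields a value between 0 and 1, but it only catches verbatim copies, not near matches.

```python
def exact_match_rate(real_rows, synthetic_rows):
    """Share of synthetic rows that are verbatim copies of an original row.

    Rows are tuples of column values in the same column order.
    """
    if not synthetic_rows:
        return 0.0
    real_set = set(real_rows)
    copies = sum(1 for row in synthetic_rows if row in real_set)
    return copies / len(synthetic_rows)
```

A production-grade assessment would typically also look at how close each synthetic row is to its nearest original row, since a near copy can be as revealing as an exact one.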
k-Anonymity Verification
Verifies that every combination of quasi-identifier values is shared by at least k records, so that no record is unique on those attributes in the synthetic dataset.
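The check itself is straightforward to sketch: group the rows by their quasi-identifier values and look at the smallest group. The helper names and the list-of-dicts row layout below are assumptions for illustration.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


def groups_below_k(rows, quasi_identifiers, k):
    """List the quasi-identifier combinations shared by fewer than k rows."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return sorted(combo for combo, size in groups.items() if size < k)


rows = [
    {"age_band": "30-39", "zip3": "941", "diagnosis": "A"},
    {"age_band": "30-39", "zip3": "941", "diagnosis": "B"},
    {"age_band": "40-49", "zip3": "100", "diagnosis": "C"},
]
print(k_anonymity(rows, ["age_band", "zip3"]))
print(groups_below_k(rows, ["age_band", "zip3"], k=2))
```

Here the single row in the ("40-49", "100") group is what keeps the dataset from reaching k = 2.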
Differential Privacy
Configurable epsilon and delta privacy parameters. Provides a mathematical bound on how much any single original record can influence the synthetic data.
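To give a feel for what epsilon and delta control, the sketch below uses the classic Gaussian mechanism, which adds noise with standard deviation sensitivity * sqrt(2 ln(1.25/delta)) / epsilon and is valid for 0 < epsilon < 1. This illustrates the parameters only; how CorePlexML applies them during training or generation is not described here, and the function names are hypothetical.

```python
import math
import random


def gaussian_sigma(epsilon, delta, sensitivity=1.0):
    """Noise scale of the classic Gaussian mechanism for (epsilon, delta)-DP."""
    if not 0 < epsilon < 1:
        raise ValueError("this calibration requires 0 < epsilon < 1")
    if not 0 < delta < 1:
        raise ValueError("delta must be in (0, 1)")
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon


def private_count(true_count, epsilon, delta, rng):
    """Release a count with Gaussian noise; a count has sensitivity 1."""
    return true_count + rng.gauss(0.0, gaussian_sigma(epsilon, delta))


rng = random.Random(42)
for eps in (0.1, 0.5, 0.9):
    sigma = gaussian_sigma(eps, delta=1e-5)
    print(f"epsilon={eps}: sigma={sigma:.2f}, noisy count={private_count(1000, eps, 1e-5, rng):.1f}")
```

Smaller epsilon values mean more noise and stronger privacy, so the parameter is a direct trade-off against fidelity.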
Generate synthetic data programmatically
Train models, generate records, and validate quality — all from the SDK.
from coreplexml import CorePlexMLClient

client = CorePlexMLClient(
    base_url="https://api.coreplexml.io",
    api_key="sk_your_api_key"
)

# Train a CTGAN model on your dataset
model = client.synthgen.create_model(
    project_id="proj_abc",
    dataset_version_id="dsv_customer_q1",
    name="customer-synth-v1",
    model_type="ctgan",
    config={"epochs": 300, "batch_size": 500}
)

# Wait for training
client.synthgen.wait(model["id"])

# Generate 100,000 synthetic records
synthetic = client.synthgen.generate(
    model_id=model["id"],
    num_rows=100_000,
    seed=42
)
print(f"Generated: {synthetic['num_rows']} rows")
print(f"KL Divergence: {synthetic['quality']['kl_divergence']:.4f}")
print(f"Re-ID Risk: {synthetic['privacy']['reidentification_risk']:.4f}")

# Download synthetic dataset
client.synthgen.download(model["id"], output_path="synthetic_data.csv")

SynthGen API
Endpoints for model training, data generation, and quality assessment.
/api/synthgen/models: Train a synthetic data model (CTGAN, CopulaGAN, TVAE, Gaussian Copula)
/api/synthgen/models/{id}: Get model details, training status, and quality metrics
/api/synthgen/models/{id}/generate: Generate synthetic records (10M+ rows supported)
/api/synthgen/models/{id}: Delete a synthetic data model
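For callers who prefer raw HTTP over the SDK, the sketch below builds (but does not send) a training request with the standard library. Only the base URL and path come from this page. The POST method, the Bearer authorization header, and the JSON body fields (mirrored from the SDK's `create_model` arguments) are all assumptions, so check them against the API reference before use.

```python
import json
import urllib.request

BASE_URL = "https://api.coreplexml.io"


def build_train_request(api_key, project_id, dataset_version_id, name, model_type, config):
    """Build a request for /api/synthgen/models without sending it."""
    # Assumption: the body mirrors the SDK's create_model keyword arguments.
    payload = {
        "project_id": project_id,
        "dataset_version_id": dataset_version_id,
        "name": name,
        "model_type": model_type,
        "config": config,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/api/synthgen/models",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: bearer-token authentication.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",  # Assumption: training is started with POST.
    )


request = build_train_request(
    api_key="sk_your_api_key",
    project_id="proj_abc",
    dataset_version_id="dsv_customer_q1",
    name="customer-synth-v1",
    model_type="ctgan",
    config={"epochs": 300, "batch_size": 500},
)
print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` and decoding the JSON response.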
Ready to get started?
Start building with CorePlexML today. Free tier available — no credit card required.