Synthetic Data Generation: CTGAN, CopulaGAN, and TVAE Compared
Introduction
Access to high-quality data is the single biggest constraint in modern machine learning. Privacy regulations like GDPR and HIPAA restrict how real data can be shared across teams and environments. Development and QA teams need realistic test data that does not expose customer information. Imbalanced datasets produce biased models that fail on minority classes. Synthetic data generation addresses all three of these problems by creating artificial records that preserve the statistical properties of real data without containing any actual individual's information.
CorePlexML's SynthGen module provides three deep generative engines for tabular synthetic data: CTGAN, CopulaGAN, and TVAE. Each engine takes a different approach to learning and reproducing the joint distribution of your data. Choosing the right one depends on your dataset characteristics, your quality requirements, and how much time you can afford to spend on training.
This guide compares all three engines, walks through quality evaluation metrics, and demonstrates end-to-end SDK workflows for generating and validating synthetic data.
The Three Engines
CTGAN: Conditional Tabular GAN
CTGAN uses a generative adversarial network architecture specifically designed for tabular data. Unlike image-focused GANs, CTGAN introduces a conditional generator that handles the unique challenges of mixed-type columns. Continuous columns are modeled with a variational Gaussian mixture, while categorical columns use a conditional vector to ensure proper representation of all categories.
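The mode-specific treatment of continuous columns can be illustrated outside of CorePlexML: fit a Gaussian mixture to a column, then represent each value by the mode it most likely belongs to plus its offset within that mode. The sketch below uses scikit-learn and invented data; it mirrors the idea from the CTGAN paper (including its convention of scaling offsets by four standard deviations), not CorePlexML's internal implementation.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# A bimodal continuous column, e.g. transaction amounts (invented data)
rng = np.random.default_rng(0)
column = np.concatenate([rng.normal(20, 5, 500), rng.normal(200, 30, 500)])

# Fit a variational Gaussian mixture, as CTGAN does per continuous column
gm = BayesianGaussianMixture(n_components=10, max_iter=200, random_state=0)
gm.fit(column.reshape(-1, 1))

# Encode each value as (most likely mode, offset normalized within that mode)
modes = gm.predict(column.reshape(-1, 1))
means = gm.means_[modes, 0]
stds = np.sqrt(gm.covariances_[modes, 0, 0])
normalized = (column - means) / (4 * stds)  # CTGAN scales by 4 std devs

print(f"active modes: {len(np.unique(modes))}")
```

The unused mixture components get near-zero weight, so a deliberately generous `n_components=10` still resolves to the two real modes in practice.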
CTGAN excels at handling imbalanced datasets. The conditional training mechanism samples from all categories during training, preventing the generator from ignoring rare classes. This makes it the default choice when your dataset contains significant class imbalance, such as fraud detection data where positive cases represent less than 1% of records.
The tradeoff is training stability. GANs are notoriously susceptible to mode collapse, where the generator learns to produce only a subset of the real data's diversity. CTGAN mitigates this through training-by-sampling, but you may still need to tune the number of epochs and batch size to achieve stable convergence on complex datasets.
Best for: Mixed tabular data with categorical and continuous columns, imbalanced classes, datasets requiring faithful representation of minority groups.
CopulaGAN: Correlation-Preserving Generation
CopulaGAN extends CTGAN by adding a copula-based transformation layer that explicitly models the dependency structure between columns before GAN training begins. A copula captures the correlation between variables independently of their marginal distributions, meaning CopulaGAN first learns how columns relate to each other, then uses the GAN to generate data that respects those relationships.
This two-stage approach produces synthetic data with significantly better inter-column correlation preservation compared to vanilla CTGAN. If your dataset contains columns with strong dependencies, such as age and income in a lending dataset, or temperature and pressure in a sensor dataset, CopulaGAN will reproduce those relationships more accurately.
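The copula idea can be sketched independently of the SDK: map each column to uniform via its empirical CDF, convert to normal scores, and estimate the correlation in that space, where it is decoupled from each column's marginal shape. A minimal Gaussian-copula illustration with SciPy, using invented age/income data (this shows the statistical concept, not CopulaGAN's actual training code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Two dependent columns, e.g. age and income (invented data)
age = rng.normal(40, 10, 2000)
income = np.exp(0.05 * age + rng.normal(0, 0.2, 2000))  # skewed marginal

def to_normal_scores(x):
    # Empirical CDF -> uniform -> standard normal
    ranks = stats.rankdata(x) / (len(x) + 1)
    return stats.norm.ppf(ranks)

# The dependency is estimated in normal-score space,
# independent of each column's marginal distribution
z = np.column_stack([to_normal_scores(age), to_normal_scores(income)])
copula_corr = np.corrcoef(z, rowvar=False)[0, 1]

# Sampling: draw correlated normals, then invert back through each marginal
cov = np.array([[1.0, copula_corr], [copula_corr, 1.0]])
samples = rng.multivariate_normal([0.0, 0.0], cov, size=2000)
u = stats.norm.cdf(samples)
synth_age = np.quantile(age, u[:, 0])
synth_income = np.quantile(income, u[:, 1])
```

Even though income is a nonlinear, skewed function of age, the normal-score correlation recovers the strong dependency, and inverting through the empirical quantiles restores each marginal's original shape.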
The additional copula fitting step adds training time, and CopulaGAN can be sensitive to the number of distinct values in categorical columns. Very high cardinality categorical features may require preprocessing before training.
Best for: Financial and actuarial data, any dataset where preserving statistical correlations between columns is critical, sensor and time-series-derived tabular data.
TVAE: Tabular Variational Autoencoder
TVAE takes a fundamentally different approach from the GAN-based engines. Instead of training a generator and discriminator in an adversarial loop, TVAE uses a variational autoencoder that learns a compressed latent representation of your data and generates new records by sampling from that latent space.
The practical advantage is speed and stability. TVAE trains significantly faster than both CTGAN and CopulaGAN because it avoids the adversarial training dynamic that can slow convergence. There is no mode collapse risk. Training is also more predictable: loss curves decrease steadily rather than oscillating between generator and discriminator.
TVAE works especially well on larger datasets where GAN training time becomes a bottleneck. For datasets with more than 100,000 rows, TVAE can produce usable synthetic data in a fraction of the time required by CTGAN.
The tradeoff is that TVAE may not capture complex multi-modal distributions as precisely as CTGAN, and its correlation preservation is generally not as strong as CopulaGAN's.
Best for: Large datasets, rapid prototyping, scenarios where training speed matters more than maximum distributional fidelity.
Comparison Table
| Criterion | CTGAN | CopulaGAN | TVAE |
|----------------------------|--------------|---------------|---------------|
| Training Speed | Moderate | Slowest | Fastest |
| Correlation Preservation | Good | Best | Adequate |
| Mode Collapse Risk | Moderate | Moderate | None |
| Imbalanced Class Handling | Best | Good | Adequate |
| Training Stability | Moderate | Moderate | Most Stable |
| Best Dataset Size | Small-Medium | Small-Medium | Medium-Large |
| Best Use Case | General-purpose | Correlated data | High-volume generation |
Quality Metrics
Generating synthetic data is only useful if you can verify its quality. CorePlexML evaluates synthetic data across three dimensions.
Statistical Similarity
The Kolmogorov-Smirnov (KS) test compares the cumulative distribution of each column between real and synthetic data. A KS statistic close to zero indicates that the synthetic column's distribution closely matches the real one. CorePlexML runs this test per-column and reports an aggregate similarity score. A score above 0.85 generally indicates good statistical fidelity.
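The per-column comparison can be reproduced locally with SciPy before involving the platform. In this sketch the data is invented, and converting the statistic to a similarity score via 1 - statistic is an assumption for illustration, not documented CorePlexML behavior:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
real = rng.normal(50, 10, 5000)       # a real continuous column (invented)
synthetic = rng.normal(51, 11, 5000)  # a close but imperfect synthetic column

# Two-sample KS test: statistic is the max gap between the empirical CDFs
ks_stat, p_value = stats.ks_2samp(real, synthetic)
similarity = 1.0 - ks_stat  # one plausible way to express it as a score
print(f"KS statistic: {ks_stat:.3f}, similarity: {similarity:.3f}")
```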
Privacy Metrics
The Distance to Closest Record (DCR) metric measures how far each synthetic record is from its nearest neighbor in the real dataset. If the DCR is too small, the synthetic data may be memorizing real records rather than generating novel ones. CorePlexML flags any synthetic records that fall below a configurable distance threshold and reports the overall DCR distribution.
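DCR can be checked locally with a nearest-neighbor search before sharing anything. The sketch below uses scikit-learn and invented data, and plants one memorized record to show how flagging works; the threshold value is illustrative, not a recommendation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
real = rng.normal(0, 1, (1000, 5))      # real records (invented)
synthetic = rng.normal(0, 1, (200, 5))  # synthetic records (invented)
synthetic[0] = real[0]                  # plant one memorized record

# Distance from each synthetic record to its closest real record
nn = NearestNeighbors(n_neighbors=1).fit(real)
distances, _ = nn.kneighbors(synthetic)
dcr = distances.ravel()

threshold = 0.1  # illustrative; set per your compliance requirements
flagged = np.where(dcr < threshold)[0]
print(f"median DCR: {np.median(dcr):.3f}, flagged: {flagged.tolist()}")
```

In real use, scale or encode columns consistently first, since raw Euclidean distance is dominated by whichever columns have the largest numeric range.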
Utility Metrics
The ML Efficacy test answers the most practical question: can a model trained on synthetic data perform comparably to one trained on real data? CorePlexML trains the same algorithm on both the real and synthetic datasets, then evaluates both models on a held-out real test set. The ratio of synthetic-trained performance to real-trained performance gives you a direct measure of how useful the synthetic data is for downstream ML tasks.
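The efficacy idea can be sketched with scikit-learn: train the same algorithm on real and on synthetic training data, evaluate both on a held-out real test set, and take the ratio. The dataset, the noise-based stand-in for generator output, and the choice of model here are all illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in "real" data; in practice this is your actual dataset
X, y = make_classification(n_samples=4000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Stand-in "synthetic" data: real training data plus noise, to mimic an
# imperfect generator (a real workflow would use SynthGen output instead)
rng = np.random.default_rng(0)
X_synth = X_train + rng.normal(0, 0.3, X_train.shape)
y_synth = y_train

def auc_of(X_fit, y_fit):
    # Same algorithm and hyperparameters for both training sets
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_fit, y_fit)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

real_auc = auc_of(X_train, y_train)
synth_auc = auc_of(X_synth, y_synth)
print(f"efficacy ratio: {synth_auc / real_auc:.3f}")
```

A ratio near 1.0 means the synthetic data is nearly as useful for this task as the real data; the key detail is that both models are scored on real, held-out records.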
SDK Workflow
Here is an end-to-end workflow using the CorePlexML Python SDK.
```python
from coreplexml import CorePlexMLClient

client = CorePlexMLClient(
    base_url="https://api.coreplexml.io",
    api_key="sk_your_api_key"
)

# Step 1: Create a SynthGen model on an existing dataset version
model = client.synthgen.create_model(
    dataset_version_id="dv_abc123",
    engine="CTGAN",
    epochs=300,
    batch_size=500
)
print(f"Training job started: {model['job_id']}")

# Step 2: Poll until training completes
status = client.jobs.wait(model["job_id"], timeout=3600)

# Step 3: Generate synthetic records
synthetic = client.synthgen.generate(
    model_id=model["model_id"],
    num_rows=10000
)
print(f"Generated {synthetic['row_count']} rows")

# Step 4: Evaluate quality
evaluation = client.synthgen.evaluate(
    model_id=model["model_id"],
    metric="ks_test"
)
print(f"KS similarity score: {evaluation['aggregate_score']}")

# Step 5: Run ML efficacy test
efficacy = client.synthgen.evaluate(
    model_id=model["model_id"],
    metric="ml_efficacy"
)
print(f"Efficacy ratio: {efficacy['ratio']}")
```
To try a different engine, change the engine parameter to "CopulaGAN" or "TVAE". All other API calls remain identical.
Combining with the Privacy Suite
A powerful pattern is to chain the Privacy Suite with SynthGen for maximum data protection. First, scan the real dataset for PII. Then apply anonymization transforms (masking, generalization, pseudonymization). Finally, train a SynthGen model on the anonymized data.
```python
# Scan for PII
scan = client.privacy.scan(dataset_version_id="dv_abc123")
print(f"Found {scan['pii_count']} PII columns")

# Apply anonymization
anon = client.privacy.transform(
    dataset_version_id="dv_abc123",
    profile="HIPAA"
)

# Train synthetic model on anonymized data
model = client.synthgen.create_model(
    dataset_version_id=anon["output_version_id"],
    engine="CopulaGAN",
    epochs=300
)
```
This produces synthetic data that is two layers removed from real individuals: anonymized first, then synthesized. The resulting dataset can typically be shared freely across teams and environments without privacy concerns.
Best Practices
Epoch tuning. Start with 300 epochs for CTGAN and CopulaGAN. Monitor the loss curves via the training job logs. If the discriminator loss flatlines early, increase epochs. If it oscillates wildly, reduce the learning rate or increase batch size. For TVAE, 150-200 epochs is usually sufficient.
Data preprocessing. SynthGen handles mixed types natively, but you will get better results if you clean your data first. Remove columns with more than 90% missing values. Consolidate rare categories that appear fewer than 5 times. Normalize column names to avoid encoding issues.
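The preprocessing steps above can be sketched with pandas; the thresholds mirror the recommendations, and the column names and data are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "Annual Income": rng.normal(60000, 15000, 1000),
    "mostly_missing": [np.nan] * 950 + list(rng.normal(0, 1, 50)),
    "city": rng.choice(["NYC", "LA", "SF"], 1000),
})
df.loc[:2, "city"] = ["Rare Town A", "Rare Town B", "Rare Town C"]

# Drop columns with more than 90% missing values
df = df.loc[:, df.isna().mean() <= 0.9]

# Consolidate categories that appear fewer than 5 times
counts = df["city"].value_counts()
rare = counts[counts < 5].index
df["city"] = df["city"].replace(list(rare), "OTHER")

# Normalize column names to avoid encoding issues
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
print(df.columns.tolist())
```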
Validation strategy. Always hold out a test set from your real data before training the synthetic model. Use this test set for the ML efficacy evaluation. Never evaluate synthetic quality using the same data that trained the generator.
Privacy verification. Run the DCR metric on every synthetic dataset before sharing it externally. Set a minimum distance threshold appropriate to your compliance requirements.
When NOT to Use Synthetic Data
Synthetic data is not a universal solution. Avoid it when exact record-level accuracy matters, such as financial auditing or regulatory reporting where every row must trace to a real transaction. Do not use synthetic data as a replacement for collecting more real data when the real data is simply too small: generators trained on fewer than 500 rows rarely produce useful output. And be cautious when your downstream task is sensitive to rare edge cases that the generator may not have learned to reproduce.
Synthetic data is a tool for augmentation, privacy, and testing. It complements real data rather than replacing it. Used correctly with the right engine and proper validation, it can dramatically accelerate your ML development cycle while keeping your data program compliant and secure.