ML Engineering · 10 min read

A/B Testing ML Models in Production

CorePlexML Team

Beyond Offline Metrics

Every data scientist has experienced the gap between offline evaluation and production reality. A model achieves a strong AUC on the holdout set, gets deployed with confidence, and then underperforms the model it replaced on the metrics that actually matter to the business. The reason is straightforward: offline metrics measure predictive accuracy on a static snapshot of data. Business outcomes depend on how predictions interact with user behavior, downstream systems, and real-world conditions that are difficult to replicate in a test harness.

A/B testing bridges this gap by comparing models on live production traffic. Instead of trusting a single number computed on historical data, you observe how each model performs on the same population of users, at the same time, under the same conditions. The result is a direct, apples-to-apples comparison of business impact.

Consider a churn prediction model. Two candidates might have nearly identical AUC on your test set, say 0.87 and 0.88. But in production, the model with 0.87 AUC might produce better-calibrated probabilities that lead to more effective retention campaigns. The only way to discover this is to test both models on real users and measure the outcome you care about: retained customers.

A/B testing also protects against subtler failure modes. A model might perform well on average but degrade badly for a specific user segment. It might introduce latency that affects conversion rates. It might interact poorly with a caching layer or a feature store. These are production phenomena that offline evaluation cannot detect.

Statistical Foundations

Before configuring an A/B test, it is worth understanding the statistical framework that underlies it. CorePlexML supports both frequentist and Bayesian approaches, and the choice affects how you interpret results and make decisions.

Frequentist Testing

The frequentist approach formulates a null hypothesis (typically that both models perform equally) and calculates the probability of observing the measured difference if the null hypothesis were true. This probability is the p-value. If the p-value falls below your significance level (commonly 0.05), you reject the null hypothesis and conclude that the difference is statistically significant.

The key parameters are the significance level (alpha), which controls the false positive rate, and the statistical power (1 minus beta), which controls the probability of detecting a real difference when one exists. These two parameters, combined with the minimum detectable effect (the smallest difference you consider meaningful), determine the required sample size.
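To build intuition for how these parameters interact, the standard two-proportion power calculation can be sketched in a few lines of Python. This is a generic statistics formula, not part of the CorePlexML SDK, and the baseline rate and effect sizes below are illustrative:

```python
import math
from statistics import NormalDist

def required_sample_size(p_baseline, min_detectable_effect, alpha=0.05, power=0.80):
    """Samples needed per variant for a two-sided two-proportion z-test."""
    p1 = p_baseline
    p2 = p_baseline + min_detectable_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # critical value for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Detecting a 1-point lift on a 20% baseline takes vastly more data
# than detecting a 5-point lift:
print(required_sample_size(0.20, 0.01))  # ~25,600 per variant
print(required_sample_size(0.20, 0.05))  # ~1,100 per variant
```

Halving the minimum detectable effect roughly quadruples the required sample size, which is why committing to a realistic effect size up front matters so much.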

Bayesian Testing

The Bayesian approach starts with a prior belief about each model's performance and updates it as data arrives. Instead of a p-value, you get a posterior probability that one model is better than the other. This is often more intuitive for decision-makers: rather than saying the p-value is 0.03, you can say there is a 97% probability that Model B outperforms Model A.

Bayesian testing also handles the peeking problem more gracefully. In frequentist testing, repeatedly checking results and stopping early inflates the false positive rate. Bayesian posteriors can be checked at any time without this penalty, though you still need sufficient data for the posterior to be reliable.
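As a rough illustration of the mechanics, the posterior probability for a binary metric such as retention can be estimated by sampling from Beta posteriors. This is a generic sketch assuming uniform Beta(1, 1) priors, not CorePlexML's internal implementation, and the counts are made up:

```python
import random

def posterior_prob_b_beats_a(successes_a, trials_a, successes_b, trials_b,
                             n_draws=100_000, seed=7):
    """Estimate P(rate_B > rate_A) by Monte Carlo over Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_draws):
        # Beta(successes + 1, failures + 1) is the posterior under a Beta(1, 1) prior
        a = rng.betavariate(successes_a + 1, trials_a - successes_a + 1)
        b = rng.betavariate(successes_b + 1, trials_b - successes_b + 1)
        wins += b > a
    return wins / n_draws

# 1,040 retained of 5,000 (control) vs 1,150 of 5,000 (challenger):
prob = posterior_prob_b_beats_a(1040, 5000, 1150, 5000)
print(f"P(challenger > control) = {prob:.3f}")
```

The output is exactly the kind of statement stakeholders find intuitive: a direct probability that the challenger is better, rather than a p-value.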

Setting Up an A/B Test

CorePlexML's A/B testing framework is built into the deployment system. You create an A/B test by specifying two or more model variants, a traffic split, and the metric you want to optimize. Here is a complete setup using the SDK:

from coreplexml import CorePlexClient

client = CorePlexClient(
    base_url="https://api.coreplexml.io",
    api_key="your-api-key"
)

ab_test = client.ab_tests.create(
    project_id="proj_abc123",
    name="Churn Model Q1 Comparison",
    variants=[
        {
            "name": "control",
            "model_id": "model_xgb_v3",
            "traffic_percentage": 50
        },
        {
            "name": "challenger",
            "model_id": "model_gbm_v4",
            "traffic_percentage": 50
        }
    ],
    primary_metric="retention_rate",
    secondary_metrics=["prediction_latency_p99", "auc"],
    statistical_method="bayesian",
    confidence_level=0.95,
    minimum_sample_size=5000,
    maximum_duration_days=14
)

print(f"A/B Test ID: {ab_test.id}")
print(f"Status: {ab_test.status}")
print(f"Required samples per variant: {ab_test.required_sample_size}")

The primary_metric is the business outcome you are optimizing for. Unlike offline metrics such as AUC or RMSE, this should be a metric that directly reflects business value: conversion rate, revenue per user, customer retention, or any custom metric you define. The secondary_metrics are guardrail metrics that you monitor to ensure the winning model does not degrade other important dimensions. A model that improves retention by 2% but doubles latency might not be a net positive.

The confidence_level parameter sets the threshold for declaring a winner. At 0.95, the system requires 95% confidence (frequentist) or 95% posterior probability (Bayesian) before declaring a result. Higher confidence levels require more data but reduce the risk of a false positive.

Traffic Splitting and Assignment

Traffic splitting in ML A/B tests requires more care than in typical web experiments. Each user must be consistently assigned to the same variant for the duration of the test. If a user receives predictions from Model A on Monday and Model B on Tuesday, their outcomes cannot be cleanly attributed to either model.

CorePlexML handles this through deterministic hashing. Each prediction request includes a user identifier (or session identifier), which is hashed to produce a consistent variant assignment. The same user always sees the same model, regardless of when or how many times they make requests.

# Making predictions during an A/B test
# The variant is selected automatically based on user_id
prediction = client.deployments.predict(
    deployment_id=ab_test.deployment_id,
    features={
        "tenure_months": 14,
        "monthly_charges": 72.50,
        "contract_type": "month-to-month",
        "tech_support": "no"
    },
    user_id="user_98765"
)

print(f"Prediction: {prediction.value}")
print(f"Variant served: {prediction.variant}")
print(f"Model: {prediction.model_id}")

You can also configure traffic splits that are not 50/50. A 90/10 split lets you test a risky new model with minimal exposure while still collecting enough data to reach significance, though reaching it will take longer than with an even split. For multi-variant tests comparing three or more models, splits can be arbitrary as long as they sum to 100.
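To make the assignment mechanics concrete, here is one way deterministic hashing with arbitrary splits might be implemented. This is an illustrative sketch; CorePlexML's actual hashing scheme may differ:

```python
import hashlib

def assign_variant(user_id, test_id, variants):
    """Sticky variant assignment; variants is a list of (name, percentage)."""
    # Salting with the test ID lets the same user land in different
    # buckets across different experiments.
    digest = hashlib.sha256(f"{test_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable value in [0, 100)
    cumulative = 0
    for name, percentage in variants:
        cumulative += percentage
        if bucket < cumulative:
            return name
    raise ValueError("traffic percentages must sum to 100")

splits = [("control", 90), ("challenger", 10)]
# The same user is always routed to the same variant:
assert assign_variant("user_98765", "ab_test_1", splits) == \
       assign_variant("user_98765", "ab_test_1", splits)
```

Because the assignment depends only on the hash, no per-user state needs to be stored, and the split holds in expectation over any reasonably large population.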

Monitoring an Active Test

Once the test is running, CorePlexML continuously tracks metrics for each variant and provides a real-time dashboard. You can also query the current state programmatically:

status = client.ab_tests.get_status(test_id=ab_test.id)

for variant in status.variants:
    print(f"\n--- {variant.name} ---")
    print(f"  Samples: {variant.sample_count}")
    print(f"  Primary metric ({status.primary_metric}): {variant.primary_value:.4f}")
    print(f"  95% CI: [{variant.confidence_interval[0]:.4f}, "
          f"{variant.confidence_interval[1]:.4f}]")
    for metric_name, metric_value in variant.secondary_values.items():
        print(f"  {metric_name}: {metric_value:.4f}")

print(f"\nRelative improvement: {status.relative_improvement:+.2%}")
print(f"P-value: {status.p_value:.4f}")
print(f"Posterior probability (challenger > control): "
      f"{status.posterior_probability:.4f}")
print(f"Estimated days remaining: {status.estimated_days_remaining}")

The estimated_days_remaining field is calculated from the current accumulation rate and the remaining sample size needed for significance. This helps you plan around the test and communicate timelines to stakeholders.

Sample Size and Duration

Insufficient sample size is the most common reason A/B tests produce misleading results. CorePlexML calculates the minimum required sample size based on your significance level, power, and the minimum detectable effect you specify. If you want to detect a 1% improvement in retention rate with 95% confidence and 80% power, you will need far more samples than if you are looking for a 5% improvement.

As a practical guideline, most ML A/B tests need at least 1,000 samples per variant for metrics with low variance (like binary classification outcomes), and 5,000 or more for metrics with high variance (like revenue or session duration). CorePlexML displays a progress bar showing how far each variant is toward the required sample count.

Analyzing Results

When the test reaches the required sample size, CorePlexML generates a comprehensive results report. The report includes the estimated effect size, confidence intervals, p-values or posterior probabilities, and a recommendation.

results = client.ab_tests.get_results(test_id=ab_test.id)

print(f"Test: {results.name}")
print(f"Duration: {results.duration_days} days")
print(f"Total samples: {results.total_samples}")
print(f"\nWinner: {results.winner}")
print(f"Effect size: {results.effect_size:+.4f}")
print(f"Relative improvement: {results.relative_improvement:+.2%}")
print(f"P-value: {results.p_value:.4f}")
print(f"Posterior probability: {results.posterior_probability:.4f}")
print(f"95% Confidence interval: [{results.ci_lower:+.4f}, "
      f"{results.ci_upper:+.4f}]")
print(f"\nRecommendation: {results.recommendation}")

The effect size tells you not just whether the difference is statistically significant, but whether it is practically meaningful. A model that improves click-through rate by 0.01% might be statistically significant with enough data, but the business impact is negligible. CorePlexML flags results where the effect size falls below a minimum practical threshold you define.

The confidence interval gives you a range of plausible values for the true effect. If the interval is tight and does not include zero, you can be confident in both the direction and magnitude of the improvement. If the interval is wide, the estimate is uncertain and you may want to continue the test.
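The decision rule described above can be sketched as a simple check. The helper below is hypothetical, not an SDK function, and the threshold values are illustrative:

```python
def evaluate_effect(ci_lower, ci_upper, min_practical_effect):
    """Combine statistical and practical significance into one verdict."""
    if ci_lower <= 0.0 <= ci_upper:
        return "inconclusive: interval includes zero, consider extending the test"
    # The end of the interval nearest zero is the worst plausible effect
    effect_floor = min(abs(ci_lower), abs(ci_upper))
    if effect_floor < min_practical_effect:
        return "significant but below the practical threshold"
    return "significant and practically meaningful"

print(evaluate_effect(0.012, 0.052, 0.01))   # clears both bars
print(evaluate_effect(-0.004, 0.031, 0.01))  # interval includes zero
```

Judging the worst end of the interval, rather than the point estimate, is the conservative reading: it asks whether even the least favorable plausible effect would still be worth shipping.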

Guardrail Metrics

The secondary metrics you specified serve as guardrail metrics. Even if the challenger wins on the primary metric, CorePlexML checks that guardrail metrics have not degraded beyond acceptable limits. For example, if prediction latency increases by 50%, the system flags this as a concern even if retention improved:

for guardrail in results.guardrail_checks:
    status_label = "PASS" if guardrail.passed else "FAIL"
    print(f"  [{status_label}] {guardrail.metric}: "
          f"{guardrail.control_value:.4f} -> "
          f"{guardrail.challenger_value:.4f} "
          f"(threshold: {guardrail.threshold})")

Declaring a Winner and Promoting

When you are satisfied with the results, you can declare a winner and promote the winning model to serve all traffic. This transitions the deployment from A/B test mode to standard single-model mode:

promotion = client.ab_tests.declare_winner(
    test_id=ab_test.id,
    winner_variant="challenger",
    promotion_strategy="canary",
    promotion_steps=[25, 50, 75, 100],
    notes="GBM v4 showed 3.2% improvement in retention rate "
          "with no degradation in latency or AUC."
)

print(f"Promotion ID: {promotion.id}")
print(f"Strategy: {promotion.strategy}")
print(f"Current traffic: {promotion.current_step}%")

The promotion itself can use any deployment strategy. A canary promotion is recommended even after a successful A/B test, because it adds an extra layer of safety during the transition from split traffic to full traffic on the winning model. The A/B test validated the model on a subset of traffic; the canary promotion validates it at scale.

Best Practices

Define your primary metric before the test starts. Choosing the metric after seeing results introduces bias. Commit to the metric, the significance level, and the minimum detectable effect in advance, and document these decisions.

Do not peek and stop early. If you are using frequentist testing, checking results repeatedly and stopping when you see significance inflates your false positive rate dramatically. Either commit to a fixed sample size or use the Bayesian method, which is more robust to interim analysis.
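A quick simulation makes the inflation concrete: run many A/A comparisons (two identical variants) and stop at the first peek that looks "significant". This is a generic illustration, independent of the SDK:

```python
import random

def peeking_aa_test(rng, n_total=2000, peek_every=200, p=0.2, z_crit=1.96):
    """Run one A/A test, peeking every `peek_every` samples per arm."""
    a_succ = b_succ = n = 0
    while n < n_total:
        for _ in range(peek_every):
            a_succ += rng.random() < p
            b_succ += rng.random() < p
        n += peek_every
        pooled = (a_succ + b_succ) / (2 * n)
        se = (2 * pooled * (1 - pooled) / n) ** 0.5
        if se > 0 and abs(a_succ / n - b_succ / n) / se > z_crit:
            return True  # declared "significant" despite identical variants
    return False

rng = random.Random(0)
trials = 1000
fp_rate = sum(peeking_aa_test(rng) for _ in range(trials)) / trials
print(f"False positive rate with 10 peeks: {fp_rate:.3f}")  # well above the nominal 0.05
```

Even though each individual look uses the nominal 5% threshold, taking ten looks and stopping at the first crossing multiplies the chance of a spurious winner several times over.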

Run the test long enough to capture temporal patterns. User behavior varies by day of week, time of month, and season. A test that runs only on weekdays might miss weekend patterns. Aim for at least one full business cycle, typically one to two weeks minimum.

Use guardrail metrics. A model that wins on your primary metric but degrades latency, fairness, or coverage may cause more harm than good. Define guardrails in advance and honor them.

Account for novelty and primacy effects. Users sometimes respond differently to a new model simply because it is new, not because it is better. These effects fade over time. If your test shows a strong initial effect that diminishes, consider extending the test duration.

Document and archive results. Every A/B test produces institutional knowledge about what works and what does not. CorePlexML stores full test histories including configurations, metrics, and outcomes so your team can review past experiments and build on prior learnings.

Production ML is ultimately about delivering business value, not optimizing abstract metrics. A/B testing is the most reliable method for connecting model changes to real-world outcomes. By investing in rigorous experimentation, you transform model deployment from a leap of faith into a data-driven decision.

For more on CorePlexML's deployment and testing infrastructure, visit the MLOps features page.