Safe Model Rollouts with Canary Deployments
The Deployment Problem
Deploying a machine learning model to production is fundamentally different from deploying traditional software. When you ship a new version of a web application, you can write integration tests that verify the output is correct for known inputs. When you ship a new model, correctness is probabilistic. The model might perform beautifully on your test set but behave unexpectedly on production data that falls outside the distribution it was trained on. A feature that was always present during training might arrive as null in production. A categorical value the model has never seen might appear for the first time. The interactions between model predictions and downstream business logic might produce edge cases that no offline evaluation anticipated.
These failure modes make ML deployments inherently riskier than typical software releases. The consequences range from mildly embarrassing (a recommendation engine suggesting irrelevant products) to financially catastrophic (a fraud detection model allowing a wave of fraudulent transactions through). The deployment strategy you choose determines how much risk you accept and how quickly you can recover when something goes wrong.
Deployment Strategies Compared
CorePlexML supports four deployment strategies, each suited to different risk profiles and operational requirements. Understanding when to use each one is critical to building a reliable production ML practice.
Direct Deployment
Direct deployment replaces the current model with the new one instantly. All traffic switches to the new model in a single operation. This is the fastest strategy but offers no safety net. If the new model is broken, every user is affected immediately, and rollback requires a second deployment.
Direct deployment is appropriate only for development environments, for non-critical models where brief degradation is acceptable, or when you have already validated the model extensively through other means (such as a completed shadow deployment).
Canary Deployment
Canary deployment is the workhorse strategy for production ML. It routes a small fraction of traffic to the new model while the existing model continues serving the majority. You monitor the new model's behavior on real traffic and gradually increase its share if metrics hold steady. If metrics degrade, the system automatically rolls back before significant damage occurs.
The name comes from the coal mining practice of sending a canary into the mine to detect toxic gases. The canary model tests the production environment with minimal blast radius.
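Under the hood, the traffic split can be as simple as deterministic hashing. The sketch below is an illustration of the idea, not CorePlexML's internals: hashing the user ID (rather than sampling randomly) pins each user to one model for the duration of a stage, which keeps per-user metrics comparable.

```python
import hashlib

def route_request(user_id: str, canary_percentage: float) -> str:
    """Deterministically route a request to 'canary' or 'baseline'."""
    digest = hashlib.sha256(user_id.encode()).digest()
    # Map the hash into 10,000 buckets; canary_percentage of them go to the canary.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "canary" if bucket < canary_percentage * 100 else "baseline"

# With a 5% canary, roughly 1 in 20 users land on the new model,
# and a given user always lands on the same side of the split.
assignments = [route_request(f"user-{i}", 5.0) for i in range(10_000)]
canary_share = assignments.count("canary") / len(assignments)
```

Because the assignment is a pure function of the user ID, retries and repeated requests from the same user never flip between models mid-stage.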
Blue-Green Deployment
Blue-green deployment maintains two parallel environments. The blue environment runs the current model and the green environment runs the new one. When you switch, all traffic moves from blue to green atomically. If problems emerge, you switch back to blue instantly.
The advantage over canary is that rollback is instantaneous and complete. There is no partial state, no split traffic, no need to wait for metrics to converge. The disadvantage is cost: both environments must be running simultaneously, doubling compute requirements during the transition window.
Shadow Deployment
Shadow deployment runs the new model alongside the current one, processing every request, but never serving its predictions to users. Both models' outputs are logged and compared offline. This is the safest strategy because the new model has zero production impact, but it requires the most resources (every request runs through two models) and produces no business outcome data (since shadow predictions are never acted upon).
How Canary Deployments Work
A canary deployment in CorePlexML follows a structured progression through traffic stages. At each stage, the system evaluates the canary model's health against configurable thresholds. If the canary passes, it advances to the next stage. If it fails, it rolls back automatically.
Here is a typical canary configuration:
from coreplexml import CorePlexClient

client = CorePlexClient(
    base_url="https://api.coreplexml.io",
    api_key="your-api-key"
)

deployment = client.deployments.create(
    project_id="proj_abc123",
    model_id="model_gbm_v4",
    name="Fraud Detector v4 (Canary)",
    strategy="canary",
    canary_config={
        "stages": [
            {"traffic_percentage": 5, "duration_minutes": 60},
            {"traffic_percentage": 25, "duration_minutes": 120},
            {"traffic_percentage": 50, "duration_minutes": 180},
            {"traffic_percentage": 100, "duration_minutes": 0}
        ],
        "health_checks": {
            "error_rate_threshold": 0.01,
            "latency_p99_threshold_ms": 200,
            "prediction_quality_metric": "auc",
            "prediction_quality_threshold": 0.85,
            "min_samples_per_stage": 500
        },
        "auto_advance": True,
        "auto_rollback": True,
        "rollback_on_error_spike": True
    }
)

print(f"Deployment ID: {deployment.id}")
print(f"Current stage: {deployment.current_stage}")
print(f"Traffic on canary: {deployment.canary_traffic_percentage}%")

The stages array defines the traffic progression. Each stage specifies a traffic percentage and a minimum duration. The canary must survive the full duration at each stage before advancing. This ensures that transient spikes do not trigger premature advancement, and that the model is tested on enough traffic to reveal patterns that only emerge at scale.
The min_samples_per_stage parameter adds a sample count requirement on top of the duration. Even if 60 minutes have passed, the canary will not advance from 5% traffic until it has processed at least 500 requests. This prevents advancement in low-traffic periods where the duration has elapsed but the statistical evidence is insufficient.
Health Checks in Detail
CorePlexML evaluates canary health across three dimensions: reliability, performance, and prediction quality.
Reliability Metrics
Error rate measures the fraction of prediction requests that result in errors (HTTP 5xx, timeouts, malformed responses). A healthy model should have an error rate near zero. The error_rate_threshold parameter defines the maximum acceptable rate. If the canary's error rate exceeds this threshold during any evaluation window, rollback is triggered.
Request success rate is the complement of error rate but is sometimes easier to reason about. An error rate threshold of 0.01 is equivalent to requiring a 99% success rate.
Performance Metrics
Latency percentiles capture how quickly the model responds to prediction requests. CorePlexML tracks p50, p95, and p99 latency. The p99 is particularly important for canary evaluation because it reveals tail latency problems that affect the worst-case user experience. A model might have acceptable median latency but occasionally take several seconds to respond due to input patterns that trigger expensive computation paths.
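CorePlexML computes these percentiles server-side; to make the tail-latency point concrete, here is a nearest-rank percentile sketch over a distribution whose median looks healthy while the p99 does not:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample value such that
    at least p percent of samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]

# 100 requests: a healthy median hides a heavy tail that p99 exposes.
latencies_ms = [40.0] * 90 + [120.0] * 8 + [2400.0] * 2
p50 = percentile(latencies_ms, 50)  # 40.0ms: the typical request looks fine
p95 = percentile(latencies_ms, 95)  # 120.0ms: still acceptable
p99 = percentile(latencies_ms, 99)  # 2400.0ms: would breach a 200ms threshold
```

This is why canary evaluation weights p99 heavily: a model can pass every median-based check while still delivering multi-second responses to a small slice of users.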
# Checking canary health during deployment
health = client.deployments.get_canary_health(
    deployment_id=deployment.id
)

print(f"Stage: {health.current_stage}")
print(f"Canary traffic: {health.canary_traffic_percentage}%")
print(f"Samples processed: {health.canary_samples}")
print("\nReliability:")
print(f"  Error rate: {health.error_rate:.4f} "
      f"(threshold: {health.error_rate_threshold})")
print("\nPerformance:")
print(f"  Latency p50: {health.latency_p50_ms:.1f}ms")
print(f"  Latency p95: {health.latency_p95_ms:.1f}ms")
print(f"  Latency p99: {health.latency_p99_ms:.1f}ms "
      f"(threshold: {health.latency_p99_threshold_ms}ms)")
print("\nPrediction Quality:")
print(f"  AUC: {health.prediction_quality:.4f} "
      f"(threshold: {health.prediction_quality_threshold})")
print(f"\nOverall: {health.status}")
print(f"Next advancement in: {health.minutes_until_advancement} minutes")
Prediction Quality Metrics
Prediction quality compares the canary model's predictions against the baseline model's predictions or against ground truth labels when available. For classification models, CorePlexML can compute AUC, accuracy, precision, and recall in real time by collecting prediction-outcome pairs. For regression models, it tracks RMSE and MAE.
When ground truth labels are delayed (which is common in many ML applications), CorePlexML falls back to comparing the canary's prediction distribution against the baseline's. Significant distributional divergence, even without labels, can indicate a problem.
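One common way to quantify such label-free divergence is the Population Stability Index (PSI) over binned prediction scores; the sketch below is illustrative and not necessarily the statistic CorePlexML uses internally. It assumes scores lie in [0, 1].

```python
import math
from collections import Counter

def psi(baseline_scores, canary_scores, bins=10):
    """Population Stability Index between two score distributions.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 large shift."""
    def bin_fractions(scores):
        counts = Counter(min(int(s * bins), bins - 1) for s in scores)
        total = len(scores)
        # A small floor avoids log(0) when a bin is empty.
        return [max(counts.get(b, 0) / total, 1e-6) for b in range(bins)]

    p = bin_fractions(baseline_scores)
    q = bin_fractions(canary_scores)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

baseline = [i / 1000 for i in range(1000)]          # uniform scores
shifted = [min(s + 0.3, 0.999) for s in baseline]   # systematic shift upward
stable_psi = psi(baseline, baseline)   # ~0: identical distributions
drifted_psi = psi(baseline, shifted)   # well above 0.25: large shift
```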
Automated Rollback
Automated rollback is the safety mechanism that makes canary deployments practical at scale. Without it, a human operator would need to monitor every deployment continuously and intervene manually if problems emerge. With auto-rollback, the system detects problems and recovers without human intervention.
Rollback is triggered when any health check metric crosses its configured threshold. The process is immediate: traffic is redirected entirely to the baseline model, the canary model is taken offline, and an alert is sent to all configured notification channels with details about what triggered the rollback.
# Configure rollback alerts
client.alerts.create(
    deployment_id=deployment.id,
    name="Canary rollback notification",
    event_type="canary_rollback",
    channels=[
        {
            "type": "slack",
            "webhook_url": "https://hooks.slack.com/services/T00/B00/xxx"
        },
        {
            "type": "email",
            "address": "ml-team@company.com"
        }
    ]
)
The rollback alert includes the trigger reason (which metric exceeded which threshold), the duration the canary was active, the number of requests it processed, and a comparison of metrics between the canary and baseline. This information is critical for post-mortem analysis: understanding why the canary failed guides the investigation that leads to a fix.
CorePlexML also supports a rollback_on_error_spike mode that triggers rollback on sudden error rate increases even if the absolute error rate remains below the threshold. This catches scenarios where the error rate jumps from 0% to 0.5% in a short window, which, while still below a 1% threshold, may indicate a systematic problem that is worsening.
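A spike detector of this kind can be sketched as a comparison between a short recent window and a longer trailing window; the margins below are hypothetical, not CorePlexML defaults.

```python
def error_spike(recent_errors: int, recent_total: int,
                trailing_errors: int, trailing_total: int,
                min_increase: float = 0.003,
                ratio: float = 5.0) -> bool:
    """Flag a spike when the recent error rate both rises by an absolute
    margin and is several times the trailing rate, even while it is
    still below the absolute error_rate_threshold."""
    recent_rate = recent_errors / max(recent_total, 1)
    trailing_rate = trailing_errors / max(trailing_total, 1)
    jumped = recent_rate - trailing_rate >= min_increase
    # The floor on the trailing rate guards against division-like blowups
    # when the baseline error rate is exactly zero.
    exploded = recent_rate >= ratio * max(trailing_rate, 1e-6)
    return jumped and exploded

# 0% -> 0.5% in the latest window: still under a 1% threshold, but a spike.
spike = error_spike(recent_errors=5, recent_total=1000,
                    trailing_errors=0, trailing_total=20000)
```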
Blue-Green Deployments
For situations where you need instant, complete switchover with no partial traffic state, blue-green deployment is the appropriate choice:
deployment = client.deployments.create(
    project_id="proj_abc123",
    model_id="model_gbm_v4",
    name="Fraud Detector v4 (Blue-Green)",
    strategy="blue-green",
    blue_green_config={
        "rollback_window_minutes": 30,
        "health_checks": {
            "error_rate_threshold": 0.005,
            "latency_p99_threshold_ms": 150
        },
        "auto_rollback": True
    }
)

# Switch traffic from blue to green
client.deployments.switch(deployment_id=deployment.id)

# If needed, rollback is instant
# client.deployments.rollback(deployment_id=deployment.id)
The rollback_window_minutes parameter defines how long both environments remain active after the switch. During this window, the system monitors the green environment and automatically rolls back to blue if health checks fail. After the window closes without issues, the blue environment is decommissioned.
Blue-green is preferred over canary when the cost of serving even a small percentage of traffic with a bad model is unacceptable. Real-time financial systems, safety-critical applications, and models where individual bad predictions have high consequences are good candidates for blue-green deployment with a short monitoring window.
Shadow Deployments
Shadow deployment eliminates production risk entirely by never serving the new model's predictions to users:
deployment = client.deployments.create(
    project_id="proj_abc123",
    model_id="model_gbm_v4",
    name="Fraud Detector v4 (Shadow)",
    strategy="shadow",
    shadow_config={
        "comparison_window_days": 7,
        "log_predictions": True,
        "compare_metrics": ["auc", "precision", "recall", "latency_p99"],
        "async_inference": True
    }
)

# After the comparison window, review results
comparison = client.deployments.get_shadow_comparison(
    deployment_id=deployment.id
)

print(f"Comparison period: {comparison.start_date} to {comparison.end_date}")
print(f"Total requests compared: {comparison.total_requests}")
print(f"\n{'Metric':<20} {'Baseline':<12} {'Shadow':<12} {'Delta':<12}")
print("-" * 56)
for metric in comparison.metrics:
    print(f"{metric.name:<20} {metric.baseline_value:<12.4f} "
          f"{metric.shadow_value:<+12.4f} {metric.delta:<+12.4f}")
The async_inference flag is important for production performance. When enabled, the shadow model processes requests asynchronously, so the user-facing response time is determined solely by the baseline model. The shadow model's predictions are computed in the background and logged for later analysis. This eliminates the latency overhead of running two models in series.
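The fire-and-forget pattern implied by async_inference can be sketched with asyncio; the stand-in model functions below are illustrative, not CorePlexML internals. The key point is that the handler awaits only the baseline model, while the shadow call runs as a background task.

```python
import asyncio

shadow_log = []       # stand-in for the shadow prediction log
background = set()    # strong references so background tasks are not GC'd

async def baseline_predict(features):
    # Stand-in for the production model: the fast path the user waits on.
    return {"score": 0.12}

async def shadow_predict(features):
    # Stand-in for the shadow model: may be slower without hurting users.
    await asyncio.sleep(0.05)
    shadow_log.append({"features": features, "score": 0.31})

async def handle_request(features):
    """User-facing latency is the baseline's alone: the shadow call is
    scheduled as a background task rather than awaited in-line."""
    task = asyncio.create_task(shadow_predict(features))
    background.add(task)
    task.add_done_callback(background.discard)
    return await baseline_predict(features)

async def main():
    result = await handle_request({"amount": 42.0})
    await asyncio.sleep(0.1)  # let the background shadow call drain
    return result

result = asyncio.run(main())
```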
Shadow deployment is ideal when you are making a major change to your model architecture, training data, or feature set, and you want to validate production behavior before exposing users to the new model. After the comparison window, you can promote the shadow model to a canary or blue-green deployment for the final transition.
A Real-World Deployment Workflow
In practice, most teams use a combination of strategies in sequence. Here is a workflow that balances safety with deployment velocity:
Stage 1: Shadow validation. Deploy the new model as a shadow alongside the current production model. Run for 3-7 days, comparing predictions and latency. Review the comparison report for anomalies, distributional shifts, or unexpected prediction patterns.
Stage 2: Canary rollout. If the shadow comparison is clean, promote to a canary deployment. Start at 5% traffic with strict health check thresholds. Advance through 25%, 50%, and 100% over the course of 24-48 hours. Auto-rollback protects against regressions at each stage.
Stage 3: Post-deployment monitoring. After the canary reaches 100%, the new model is the production model. Continue monitoring with standard alerts for data drift, performance degradation, and error rates. The model registry records the full deployment history for audit purposes.
# Promote shadow to canary after successful comparison
promotion = client.deployments.promote(
    deployment_id=shadow_deployment.id,
    target_strategy="canary",
    canary_config={
        "stages": [
            {"traffic_percentage": 5, "duration_minutes": 120},
            {"traffic_percentage": 25, "duration_minutes": 240},
            {"traffic_percentage": 50, "duration_minutes": 360},
            {"traffic_percentage": 100, "duration_minutes": 0}
        ],
        "auto_advance": True,
        "auto_rollback": True
    }
)

print(f"Promotion ID: {promotion.id}")
print("Strategy: shadow -> canary")
print(f"Starting at: {promotion.initial_traffic_percentage}%")
Best Practices
Start conservatively. Begin canary deployments at 5% or less. The cost of a slower rollout is measured in hours. The cost of a bad deployment reaching 100% of traffic can be measured in revenue and user trust.
Set meaningful thresholds. Health check thresholds should be based on your baseline model's actual performance, not arbitrary numbers. If your current model has a p99 latency of 120ms, setting the threshold at 200ms is reasonable. Setting it at 50ms will cause false rollbacks.
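One way to operationalize this is to derive thresholds from the baseline's measured metrics plus a headroom multiplier. The multipliers below are illustrative defaults, not CorePlexML recommendations:

```python
def derive_thresholds(baseline_p99_ms: float,
                      baseline_error_rate: float,
                      latency_headroom: float = 1.5,
                      error_headroom: float = 2.0,
                      error_floor: float = 0.001) -> dict:
    """Derive canary health thresholds from the baseline's measured
    performance rather than from arbitrary round numbers."""
    return {
        "latency_p99_threshold_ms": round(baseline_p99_ms * latency_headroom),
        # Headroom over the baseline error rate, with a floor so a
        # near-zero baseline doesn't produce an unattainable threshold.
        "error_rate_threshold": max(baseline_error_rate * error_headroom,
                                    error_floor),
    }

# Baseline p99 of 120ms yields a 180ms threshold, not an arbitrary 50ms.
thresholds = derive_thresholds(baseline_p99_ms=120.0,
                               baseline_error_rate=0.002)
```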
Monitor business metrics, not just model metrics. A model might have acceptable AUC and latency but still cause problems downstream. If your model feeds into a pricing engine, monitor the pricing outcomes, not just the model's prediction accuracy.
Use auto-advance for standard deployments. Manual advancement through canary stages is appropriate for the first deployment of a new model type, but for routine model updates (retraining on fresh data, minor feature changes), auto-advance reduces operational burden without sacrificing safety.
Keep both models warm. During a canary deployment, the baseline model should remain fully loaded and ready to absorb 100% of traffic at any moment. Do not scale down baseline resources until the canary has completed all stages and the deployment is finalized.
Review rollback post-mortems. Every rollback is a learning opportunity. CorePlexML logs the complete history of each deployment, including which metric triggered the rollback, at what stage, and with what values. Reviewing these post-mortems helps you improve both your models and your deployment configurations over time.
Safe model deployment is not about eliminating risk entirely. It is about controlling your exposure, measuring outcomes on real traffic, and having automated safeguards that protect your users and your business when things do not go as planned. Canary deployments provide that balance, and combined with blue-green and shadow strategies, they give you a complete toolkit for production ML at any scale.
For more details on CorePlexML's deployment strategies and monitoring capabilities, visit the MLOps features page.