Detecting Model Drift and Setting Up Production Alerts
The Silent Failure of ML Models
Machine learning models degrade in production, and they do it silently. Unlike a web service that crashes with a stack trace or a database that runs out of disk space, a drifting model continues to return predictions. The HTTP status codes are all 200. The latency is within bounds. Every health check passes. But the predictions are becoming progressively less accurate as the relationship between the input data and the real world shifts beneath the model's feet.
This is drift, and it is among the most common causes of ML model failure in production. A model trained on last year's customer behavior may not capture this year's preferences. A fraud detection model trained before a new attack vector emerged will miss the new pattern entirely. A demand forecasting model trained during stable economic conditions will produce unreliable predictions during a recession.
Drift detection and alerting are the practices that turn silent failure into observable, actionable events. Instead of discovering degradation weeks later through declining business metrics, you detect it within hours and respond before significant damage occurs.
Types of Drift
Understanding the different forms of drift is essential for configuring effective monitoring. Each type has different causes, different detection methods, and different remediation strategies.
Data Drift
Data drift, also called covariate shift, occurs when the distribution of input features changes between training time and inference time. The model's learned relationship between features and target may still hold, but the model is now operating on data that looks different from what it was trained on.
For example, a model trained on transaction data where the average transaction amount was 50 dollars may start seeing transactions with an average of 200 dollars due to seasonal purchasing patterns or a change in product mix. The feature distributions have shifted, and the model may be making predictions in regions of the feature space where it has limited training data.
Data drift is the easiest form of drift to detect because it requires only the input features, not the ground truth labels. You can compare the current feature distributions against the training distributions and flag significant deviations.
Concept Drift
Concept drift occurs when the relationship between the input features and the target variable changes. The features may still have the same distributions, but the mapping from features to outcomes has shifted. This is more dangerous than data drift because the model's predictions become systematically wrong even though the inputs look normal.
A classic example is credit scoring. The factors that predict default may change during an economic downturn. Income and employment history meant one thing in a stable economy and something different during a recession. The feature distributions might not change dramatically (people still have the same range of incomes), but the predictive relationship between income and default risk has shifted.
Concept drift requires ground truth labels to detect directly. Since labels are often delayed (you do not know if a customer churned until the churn window closes), concept drift detection typically relies on monitoring prediction quality metrics over sliding windows and looking for sustained degradation.
Prediction Drift
Prediction drift focuses on the model's output distribution rather than its input distribution. If the model's predictions shift significantly (for example, the average predicted churn probability increases from 8% to 15%), this may indicate an underlying problem even if no specific feature drift or concept drift has been identified.
Prediction drift is a useful aggregate signal. It can catch problems that feature-level drift detection misses, such as interactions between features that have each shifted only slightly but together produce a large shift in predictions.
Population Stability Index
The Population Stability Index (PSI) is the primary metric CorePlexML uses for drift detection. PSI measures how much a feature's distribution has changed between two time periods by comparing the frequency of values across bins.
For a numeric feature, PSI divides the value range into bins (typically 10-20) based on the training distribution's quantiles. For each bin, it calculates the proportion of values in the training data and the proportion in the current data, then computes a weighted divergence. For categorical features, each category is treated as a bin.
The formula for a single bin is: (actual_proportion - expected_proportion) * ln(actual_proportion / expected_proportion). The total PSI is the sum of this term across all bins.
PSI values have well-established interpretation thresholds. A PSI below 0.1 indicates no significant drift and the distributions are essentially stable. A PSI between 0.1 and 0.25 indicates moderate drift that warrants investigation. A PSI above 0.25 indicates significant drift that likely requires action, such as retraining the model or investigating the data source.
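As a concrete illustration, the PSI calculation above can be written in a few lines of NumPy. This standalone sketch is not part of the CorePlexML SDK; bin edges come from the reference sample's quantiles, as described above, and a small epsilon guards against empty bins, which would make the log term undefined:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """Compute PSI between a reference (training) sample and a current sample."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Quantile-based bin edges from the reference data, so each bin
    # holds roughly equal reference mass
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover the full real line

    expected_prop = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_prop = np.histogram(actual, bins=edges)[0] / len(actual)

    # Clip to avoid division by zero / log(0) for empty bins
    expected_prop = np.clip(expected_prop, eps, None)
    actual_prop = np.clip(actual_prop, eps, None)

    # Per-bin term: (actual - expected) * ln(actual / expected), summed
    return float(np.sum((actual_prop - expected_prop)
                        * np.log(actual_prop / expected_prop)))

rng = np.random.default_rng(0)
train = rng.normal(50, 10, 10_000)    # training-time feature values
same = rng.normal(50, 10, 10_000)     # same distribution: PSI near 0
shifted = rng.normal(65, 10, 10_000)  # clear shift: PSI well above 0.25

print(round(population_stability_index(train, same), 3))
print(round(population_stability_index(train, shifted), 3))
```

Running this against a stable sample yields a PSI comfortably below 0.1, while the shifted sample lands far above the 0.25 action threshold.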
Setting Up Drift Monitoring
CorePlexML's drift monitoring runs continuously in the background, comparing current prediction request data against the training distribution at configurable intervals:
from coreplexml import CorePlexClient

client = CorePlexClient(
    base_url="https://api.coreplexml.io",
    api_key="your-api-key"
)

# Configure drift monitoring for a deployment
drift_config = client.monitoring.configure_drift(
    deployment_id="dep_fraud_v3",
    config={
        "evaluation_window_minutes": 60,
        "reference_dataset_version_id": "dsv_train_2025q4",
        "features_to_monitor": "all",
        "psi_bins": 20,
        "thresholds": {
            "feature_psi_warning": 0.1,
            "feature_psi_critical": 0.25,
            "overall_drift_warning": 0.15,
            "overall_drift_critical": 0.30
        },
        "prediction_drift": {
            "enabled": True,
            "window_size_hours": 24,
            "comparison_window_size_hours": 168
        }
    }
)

print(f"Drift monitoring configured for deployment: {drift_config.deployment_id}")
print(f"Monitoring {drift_config.feature_count} features")
print(f"Evaluation interval: every {drift_config.evaluation_window_minutes} minutes")
The features_to_monitor parameter can be set to "all" to monitor every input feature, or you can specify a list of feature names to focus on the most important ones. Monitoring all features provides comprehensive coverage but can generate noise if some features are naturally volatile. A common middle ground is to monitor all features but only alert on the top features by importance.
The prediction_drift section enables monitoring of the model's output distribution. It compares the prediction distribution over a recent window (24 hours in this example) against a longer baseline window (168 hours, or one week). This catches shifts in model behavior that might not be visible at the individual feature level.
The Alert System
Drift detection is only useful if it triggers a response. CorePlexML's alert system supports six configurable metrics, each targeting a different aspect of model health:
- drift_psi: Fires when the overall PSI score or any individual feature's PSI exceeds a threshold
- accuracy_degradation: Fires when prediction accuracy drops below a defined level (requires ground truth labels)
- error_rate: Fires when the percentage of failed prediction requests exceeds a threshold
- latency_p99: Fires when the 99th percentile response time exceeds a limit
- model_staleness: Fires when the model has not been retrained within a defined period
- prediction_anomaly: Fires when the prediction distribution deviates significantly from the expected range
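Each of these metrics is paired with a comparison condition and a threshold in an alert rule. Conceptually, evaluating a rule is a simple dispatch on the condition; this hypothetical evaluator (not part of the SDK, shown only to make the rule semantics concrete) illustrates the idea:

```python
# Hypothetical sketch of how a metric/condition/threshold triple
# decides whether a rule fires. Not CorePlexML's implementation.
CONDITIONS = {
    "greater_than": lambda value, threshold: value > threshold,
    "less_than": lambda value, threshold: value < threshold,
}

def should_fire(rule: dict, current_metrics: dict) -> bool:
    """Look up the rule's metric and apply its comparison condition."""
    value = current_metrics[rule["metric"]]
    return CONDITIONS[rule["condition"]](value, rule["threshold"])

rule = {"metric": "drift_psi", "condition": "greater_than", "threshold": 0.15}
print(should_fire(rule, {"drift_psi": 0.22}))  # True: PSI exceeds threshold
print(should_fire(rule, {"drift_psi": 0.08}))  # False: within bounds
```

Note that metrics like accuracy_degradation use less_than, since lower values are worse, while drift and latency metrics use greater_than.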
Creating Alert Channels
Before creating alert rules, you configure the channels through which alerts are delivered. CorePlexML supports Slack, email, and generic webhooks:
# Create a Slack alert channel
slack_channel = client.alerts.create_channel(
    project_id="proj_abc123",
    name="ML Team Slack",
    channel_type="slack",
    config={
        "webhook_url": "https://hooks.slack.com/services/T00/B00/xxxxx",
        "channel_name": "#ml-alerts",
        "mention_on_critical": "@ml-oncall"
    }
)

# Create an email alert channel
email_channel = client.alerts.create_channel(
    project_id="proj_abc123",
    name="ML Team Email",
    channel_type="email",
    config={
        "recipients": [
            "ml-team@company.com",
            "data-engineering@company.com"
        ],
        "subject_prefix": "[CorePlexML Alert]"
    }
)

# Create a webhook channel for integration with PagerDuty or custom systems
webhook_channel = client.alerts.create_channel(
    project_id="proj_abc123",
    name="PagerDuty Integration",
    channel_type="webhook",
    config={
        "url": "https://events.pagerduty.com/v2/enqueue",
        "method": "POST",
        "headers": {
            "Content-Type": "application/json",
            "Authorization": "Token token=your-pagerduty-token"
        },
        "payload_template": {
            "routing_key": "your-routing-key",
            "event_action": "trigger",
            "payload": {
                "summary": "{{alert_name}}: {{metric_value}}",
                "severity": "{{severity}}",
                "source": "coreplexml"
            }
        }
    }
)
Creating Alert Rules
Alert rules connect a metric condition to one or more channels. Each rule defines what to monitor, when to fire, and where to send notifications:
# Alert on high drift
drift_alert = client.alerts.create_rule(
    deployment_id="dep_fraud_v3",
    name="Fraud model drift warning",
    metric="drift_psi",
    condition="greater_than",
    threshold=0.15,
    severity="warning",
    channels=[slack_channel.id],
    evaluation_window_minutes=60,
    cooldown_minutes=240,
    description="Overall PSI drift score exceeds warning threshold"
)

# Alert on critical drift with escalation
drift_critical = client.alerts.create_rule(
    deployment_id="dep_fraud_v3",
    name="Fraud model drift critical",
    metric="drift_psi",
    condition="greater_than",
    threshold=0.30,
    severity="critical",
    channels=[slack_channel.id, email_channel.id, webhook_channel.id],
    evaluation_window_minutes=30,
    cooldown_minutes=60,
    description="Overall PSI drift score exceeds critical threshold"
)

# Alert on accuracy degradation
accuracy_alert = client.alerts.create_rule(
    deployment_id="dep_fraud_v3",
    name="Fraud model accuracy drop",
    metric="accuracy_degradation",
    condition="less_than",
    threshold=0.90,
    severity="critical",
    channels=[slack_channel.id, webhook_channel.id],
    evaluation_window_minutes=120,
    cooldown_minutes=360,
    description="Model accuracy has dropped below 90% threshold"
)

# Alert on model staleness
staleness_alert = client.alerts.create_rule(
    deployment_id="dep_fraud_v3",
    name="Fraud model staleness",
    metric="model_staleness",
    condition="greater_than",
    threshold=30,
    severity="warning",
    channels=[email_channel.id],
    evaluation_window_minutes=1440,
    cooldown_minutes=1440,
    description="Model has not been retrained in 30+ days"
)
Severity Levels and Escalation
CorePlexML supports three severity levels: info, warning, and critical. The severity level determines the urgency of the notification and can be used to route alerts to different channels. A common pattern is to send info and warning alerts to Slack only, while critical alerts go to Slack, email, and PagerDuty simultaneously.
The cooldown_minutes parameter prevents alert fatigue by suppressing duplicate notifications for a configured period after an alert fires. Without cooldown, a drifting model would generate a new alert every evaluation cycle until the drift is resolved, potentially flooding your channels with hundreds of identical notifications.
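The cooldown behavior amounts to remembering when an alert last fired and suppressing repeats inside the window. This small stateful sketch illustrates the mechanics (it is an illustration of the behavior, not the platform's implementation):

```python
from datetime import datetime, timedelta

class CooldownGate:
    """Suppress repeat notifications within cooldown_minutes of the last fire."""

    def __init__(self, cooldown_minutes: int):
        self.cooldown = timedelta(minutes=cooldown_minutes)
        self.last_fired = None

    def allow(self, now: datetime) -> bool:
        """Return True (and record the fire) if the cooldown has elapsed."""
        if self.last_fired is None or now - self.last_fired >= self.cooldown:
            self.last_fired = now
            return True
        return False

gate = CooldownGate(cooldown_minutes=240)
t0 = datetime(2026, 3, 1, 12, 0)
print(gate.allow(t0))                       # True: first alert fires
print(gate.allow(t0 + timedelta(hours=1)))  # False: inside the 4-hour cooldown
print(gate.allow(t0 + timedelta(hours=5)))  # True: cooldown has elapsed
```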
Suppression Windows
During planned maintenance, data migrations, or known data pipeline changes, you may want to temporarily suppress alerts to avoid false positives:
# Create a suppression window for planned maintenance
suppression = client.alerts.create_suppression(
    deployment_id="dep_fraud_v3",
    name="Q1 data migration window",
    start_time="2026-03-15T02:00:00Z",
    end_time="2026-03-15T06:00:00Z",
    suppress_severities=["warning", "info"],
    notes="Planned data pipeline migration. "
          "Critical alerts remain active."
)
Notice that the suppression window only suppresses warning and info alerts. Critical alerts remain active even during maintenance, because truly critical issues (like a 50% error rate) need immediate attention regardless of planned activities.
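A suppression check reduces to a time-window and severity test. Roughly (a simplified sketch, not the platform's internal logic):

```python
from datetime import datetime, timezone

def is_suppressed(alert_severity: str, alert_time: datetime, window: dict) -> bool:
    """True if the alert falls inside the window and its severity is suppressed."""
    in_window = window["start_time"] <= alert_time < window["end_time"]
    return in_window and alert_severity in window["suppress_severities"]

window = {
    "start_time": datetime(2026, 3, 15, 2, 0, tzinfo=timezone.utc),
    "end_time": datetime(2026, 3, 15, 6, 0, tzinfo=timezone.utc),
    "suppress_severities": ["warning", "info"],
}

during = datetime(2026, 3, 15, 3, 30, tzinfo=timezone.utc)
after = datetime(2026, 3, 15, 7, 0, tzinfo=timezone.utc)
print(is_suppressed("warning", during, window))   # True: suppressed
print(is_suppressed("critical", during, window))  # False: critical stays active
print(is_suppressed("warning", after, window))    # False: window has closed
```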
Auto-Retraining Policies
Detecting drift is only half the solution. Closing the loop means automatically retraining the model when drift is detected, validating the new model, and promoting it to production if it meets quality thresholds.
CorePlexML supports three retraining triggers: scheduled, drift-based, and performance-based.
Scheduled Retraining
Scheduled retraining runs on a fixed cadence regardless of whether drift has been detected. This ensures model freshness for applications where data accumulates at a predictable rate:
scheduled_policy = client.retraining.create_policy(
    deployment_id="dep_fraud_v3",
    name="Weekly fraud model refresh",
    trigger="schedule",
    schedule_config={
        "frequency": "weekly",
        "day_of_week": "sunday",
        "hour_utc": 3,
        "timezone": "UTC"
    },
    training_config={
        "max_runtime_secs": 900,
        "max_models": 30,
        "stopping_metric": "AUC",
        "balance_classes": True
    },
    validation_config={
        "min_improvement": 0.0,
        "validation_dataset_version_id": "dsv_holdout_2026q1",
        "metrics_to_compare": ["auc", "precision", "recall"]
    },
    auto_promote=True,
    promotion_strategy="canary"
)
Drift-Triggered Retraining
Drift-triggered retraining activates when the monitoring system detects significant drift, providing a reactive response to environmental changes:
drift_policy = client.retraining.create_policy(
    deployment_id="dep_fraud_v3",
    name="Drift-triggered fraud retrain",
    trigger="drift",
    drift_config={
        "metric": "drift_psi",
        "threshold": 0.25,
        "sustained_minutes": 120
    },
    training_config={
        "max_runtime_secs": 600,
        "max_models": 20,
        "stopping_metric": "AUC",
        "balance_classes": True
    },
    validation_config={
        "min_improvement": 0.005,
        "metrics_to_compare": ["auc", "precision"]
    },
    auto_promote=False,
    max_retrains_per_week=3
)
The sustained_minutes parameter requires the drift to persist for a minimum duration before triggering retraining. This prevents spurious retraining from transient data anomalies that resolve on their own. The max_retrains_per_week parameter caps the number of automated retrains to prevent runaway resource consumption during periods of high volatility.
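Together these two guards can be sketched as a single decision function. This is a simplified illustration of the logic, not CorePlexML's implementation; it assumes drift is sampled as (timestamp, PSI) pairs:

```python
from datetime import datetime, timedelta

def should_retrain(drift_history, threshold, sustained_minutes,
                   retrain_times, max_retrains_per_week, now):
    """Decide whether a drift-triggered retrain should be enqueued.

    drift_history: list of (timestamp, psi) samples, oldest first.
    retrain_times: timestamps of previous automated retrains.
    """
    # Weekly cap: count automated retrains in the trailing 7 days
    week_ago = now - timedelta(days=7)
    if sum(t >= week_ago for t in retrain_times) >= max_retrains_per_week:
        return False

    # Sustained check: PSI must stay above threshold for the whole window
    window_start = now - timedelta(minutes=sustained_minutes)
    window = [psi for ts, psi in drift_history if ts >= window_start]
    return bool(window) and all(psi > threshold for psi in window)

now = datetime(2026, 3, 1, 12, 0)
# PSI held at 0.31 for the last two hours: sustained drift
history = [(now - timedelta(minutes=m), 0.31) for m in (120, 90, 60, 30, 0)]
print(should_retrain(history, 0.25, 120, [], 3, now))  # True

# A single spike 30 minutes ago, otherwise stable: transient anomaly
blip = [(now - timedelta(minutes=m), 0.05) for m in (120, 90, 60)] \
     + [(now - timedelta(minutes=30), 0.31)]
print(should_retrain(blip, 0.25, 120, [], 3, now))  # False
```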
Performance-Based Retraining
Performance-based retraining triggers when prediction quality drops below a defined threshold. Unlike drift-based retraining, which responds to input changes, performance-based retraining responds to outcome changes:
performance_policy = client.retraining.create_policy(
    deployment_id="dep_fraud_v3",
    name="Performance-triggered fraud retrain",
    trigger="performance",
    performance_config={
        "metric": "auc",
        "threshold": 0.88,
        "evaluation_window_days": 7,
        "min_samples": 10000
    },
    training_config={
        "max_runtime_secs": 600,
        "max_models": 20,
        "stopping_metric": "AUC"
    },
    validation_config={
        "min_improvement": 0.01,
        "metrics_to_compare": ["auc", "precision", "recall"]
    },
    auto_promote=True,
    promotion_strategy="canary"
)
The min_samples requirement ensures that the performance metric is computed on enough data to be reliable. Without this guard, a small number of mislabeled examples could trigger unnecessary retraining.
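The guard itself is a two-part check: enough labeled samples first, metric comparison second. A minimal sketch (illustrative only, not the SDK's internal code):

```python
def performance_trigger(metric_value: float, sample_count: int,
                        threshold: float, min_samples: int) -> bool:
    """Fire only when the metric is backed by enough labeled samples."""
    if sample_count < min_samples:
        return False  # too few labels to trust the metric
    return metric_value < threshold

print(performance_trigger(0.85, 12_000, threshold=0.88, min_samples=10_000))  # True: degraded
print(performance_trigger(0.85, 500, threshold=0.88, min_samples=10_000))     # False: unreliable
print(performance_trigger(0.91, 12_000, threshold=0.88, min_samples=10_000))  # False: healthy
```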
Retraining Validation
Whether triggered by schedule, drift, or performance, every automated retraining job goes through a validation step before the new model can be promoted. CorePlexML compares the candidate model's metrics against the current production model and applies the min_improvement threshold.
If auto_promote is true and the candidate meets the improvement threshold, it is deployed using the specified promotion_strategy (typically canary). If the candidate does not meet the threshold, it is registered in the model registry with a "development" stage and an alert is sent notifying the team that automated retraining did not produce an improvement.
This validation step is critical. Retraining on drifted data does not guarantee a better model. If the drift is caused by a data quality issue (a broken pipeline, incorrect labels, or corrupted features), retraining on bad data will produce a bad model. The validation step catches this by requiring the new model to outperform the current one on a held-out validation set.
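The promotion decision described above can be sketched as follows. Note one simplifying assumption in this illustration: it requires every compared metric to beat production by min_improvement, whereas the platform's exact comparison rule may differ:

```python
def promotion_decision(candidate: dict, production: dict, min_improvement: float,
                       metrics_to_compare: list, auto_promote: bool) -> str:
    """Compare candidate metrics against production and pick an outcome."""
    improved = all(
        candidate[m] - production[m] >= min_improvement
        for m in metrics_to_compare
    )
    if improved and auto_promote:
        return "promote_canary"
    if improved:
        return "await_manual_promotion"
    # Candidate did not beat production: keep it out of production
    # and alert the team that retraining produced no improvement
    return "register_as_development"

prod = {"auc": 0.91, "precision": 0.83, "recall": 0.76}
cand = {"auc": 0.92, "precision": 0.85, "recall": 0.78}
print(promotion_decision(cand, prod, 0.005,
                         ["auc", "precision", "recall"], True))
# -> promote_canary
```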
The End-to-End Workflow
When all components are configured, the monitoring and retraining system operates as a closed loop:
- The drift monitoring system evaluates feature distributions and prediction distributions at configured intervals.
- When drift exceeds a threshold, an alert fires through the configured channels, notifying the team.
- If a retraining policy is configured for the detected trigger, a retraining job is automatically enqueued.
- The retraining job trains a new model using the latest available data.
- The validation step compares the candidate against the current production model.
- If the candidate meets the improvement threshold and auto-promote is enabled, it is deployed via canary rollout.
- The canary deployment monitors the new model's health and either advances to full traffic or rolls back automatically.
- The model registry records the entire chain: the drift event, the retraining job, the validation results, the deployment, and the outcome.
This closed loop transforms model maintenance from a manual, reactive process into an automated, proactive one. The team is still notified at every step, and they retain the ability to intervene (by disabling auto-promote or adjusting thresholds), but the routine work of monitoring, retraining, and deploying happens without human intervention.
Best Practices
Combine multiple retraining triggers. Scheduled retraining provides a baseline freshness guarantee, drift-triggered retraining responds to environmental changes, and performance-based retraining catches degradation from any cause. Together, they provide comprehensive coverage.
Set conservative auto-promote thresholds. A minimum improvement of 0% means the new model only needs to match the current one. A threshold of 1% or higher provides a stronger signal that the new model is genuinely better, not just different.
Monitor the monitors. Review your drift metrics and alert history regularly. If you are getting frequent false alarms, your thresholds may be too sensitive. If you never get alerts, your thresholds may be too lax, or your monitoring might not be working as expected.
Use suppression windows judiciously. Suppression windows are a necessary tool for maintenance periods, but they should be short and well-documented. A forgotten suppression window that lasts weeks defeats the purpose of monitoring entirely.
Document your escalation paths. When a critical alert fires at 3 AM, the on-call engineer needs to know exactly what to do. Document the response procedures for each alert type: who to contact, what dashboards to check, and what remediation actions are available.
Drift detection and alerting are not optional for production ML. They are the observability layer that turns a deployed model from a black box into a managed system. Invest in configuring them thoughtfully, and they will protect your models, your users, and your business from the silent failures that make production ML so challenging.
For more on CorePlexML's monitoring and retraining infrastructure, visit the MLOps features page.