MLOps Best Practices: From Model to Production
Why MLOps Matters
There is a well-documented gap between building a machine learning model that performs well in a notebook and operating that model reliably in production. Research from industry surveys consistently shows that the majority of ML projects never make it to production, and among those that do, many degrade silently over time as the data they were trained on drifts away from the data they encounter in the real world.
MLOps, short for Machine Learning Operations, is the set of practices that bridges this gap. It encompasses deployment automation, continuous monitoring, version management, and automated retraining. Without MLOps, a deployed model is a static artifact that grows stale from the moment it goes live. With MLOps, it becomes a living system that adapts, alerts, and improves.
CorePlexML's MLOps module provides a complete framework for production ML, covering the full lifecycle from the moment a model leaves the training pipeline to its eventual retirement. In this guide, we will walk through each component in depth: deployment strategies, monitoring and drift detection, A/B testing, auto-retraining workflows, and model registry management.
Deployment Strategies
The deployment strategy you choose determines how your new model replaces or augments the existing one. Each strategy makes a different trade-off between speed, safety, and resource usage. CorePlexML supports four strategies natively.
Direct Deployment
Direct deployment is the simplest approach: the new model immediately replaces the current one for all traffic. There is no gradual rollout and no parallel running. The switch is instantaneous.
```python
from coreplexml import CorePlexClient

client = CorePlexClient(
    base_url="https://api.coreplexml.io",
    api_key="your-api-key"
)

deployment = client.deployments.create(
    project_id="proj_abc123",
    model_id="model_best_xgb",
    name="Churn Predictor v2",
    strategy="direct"
)
```
Direct deployment is best suited for development and staging environments, for low-risk models where a brief period of degraded performance is acceptable, or for situations where you have already validated the model extensively offline and are confident in its production behavior. The advantage is speed and simplicity. The risk is that if the new model performs poorly on production data, all users are affected immediately.
Canary Deployment
Canary deployment mitigates this risk by routing only a small fraction of production traffic to the new model while the existing model continues serving the majority. You monitor the new model's metrics on real traffic, and if everything looks good, you gradually increase its traffic share until it serves 100%.
```python
deployment = client.deployments.create(
    project_id="proj_abc123",
    model_id="model_best_xgb",
    name="Churn Predictor v2",
    strategy="canary",
    traffic_percentage=5,             # Start with 5% of traffic
    auto_rollback=True,               # Enable automatic rollback
    rollback_threshold=0.03,          # Roll back if AUC drops by 3%
    promotion_steps=[5, 25, 50, 100]  # Gradual traffic increase
)
```
The promotion_steps parameter defines the traffic percentages the deployment will progress through. At each step, CorePlexML evaluates the canary model's performance against the baseline. If the canary meets or exceeds the baseline within the rollback threshold, it advances to the next step. If performance degrades beyond the threshold, the system automatically rolls back to the baseline model and sends an alert to your configured notification channels.
Canary deployments are the recommended default for production environments. They provide a safety net against unexpected model behavior while still allowing you to ship improvements frequently. The trade-off is that the rollout takes longer (typically hours to days depending on your traffic volume and step configuration) and requires sufficient traffic at each step to produce statistically meaningful metric comparisons.
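The per-step advance-or-rollback rule described above can be sketched in plain Python. This is an illustrative reading of the documented behavior, not the SDK's internal implementation; the function names are made up for the example.

```python
def canary_step_decision(baseline_auc, canary_auc, rollback_threshold):
    """Advance if the canary's metric is within rollback_threshold of the
    baseline (or better); roll back otherwise. Illustrative only."""
    if canary_auc >= baseline_auc - rollback_threshold:
        return "advance"
    return "rollback"


def run_promotion(baseline_auc, canary_aucs, steps, rollback_threshold):
    """Walk through the promotion steps; stop at 100% traffic or on the
    first rollback, returning the outcome and the step where it happened."""
    for step, canary_auc in zip(steps, canary_aucs):
        decision = canary_step_decision(baseline_auc, canary_auc, rollback_threshold)
        if decision == "rollback":
            return ("rolled_back", step)
    return ("promoted", steps[-1])
```

For example, with a baseline AUC of 0.90 and a 0.03 threshold, a canary measuring 0.89, 0.91, 0.90, 0.90 across the steps `[5, 25, 50, 100]` would be promoted, while a drop to 0.80 at any step would trigger a rollback at that step.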
Blue-Green Deployment
Blue-green deployment maintains two complete, identical environments. The "blue" environment runs the current model, and the "green" environment runs the new one. When you are ready to switch, all traffic moves from blue to green atomically.
```python
deployment = client.deployments.create(
    project_id="proj_abc123",
    model_id="model_best_xgb",
    name="Churn Predictor v2",
    strategy="blue-green",
    auto_rollback=True,
    rollback_window_minutes=30  # Monitor for 30 min before confirming
)
```
The key advantage of blue-green is that rollback is instantaneous. If the green environment shows problems after the switch, you redirect traffic back to blue immediately with no downtime and no partial state. This makes it ideal for mission-critical models where any degradation in prediction quality has immediate business impact, such as real-time fraud detection or dynamic pricing engines.
The trade-off is resource cost. During the transition period (and the rollback window), both environments must be running simultaneously, effectively doubling your compute and memory requirements. For large models or high-throughput endpoints, this can be significant.
Shadow Deployment
Shadow deployment takes the most conservative approach: the new model runs alongside the current one and processes every request, but its predictions are never served to users. Instead, both models' predictions are logged and compared offline.
```python
deployment = client.deployments.create(
    project_id="proj_abc123",
    model_id="model_best_xgb",
    name="Churn Predictor v2 (Shadow)",
    strategy="shadow",
    comparison_window_days=7  # Collect data for 7 days
)
```
Shadow deployment is the safest strategy because the new model has zero impact on production outcomes. It is particularly valuable when you are making a significant change, such as switching from a classification model to a regression model, changing the feature set substantially, or deploying a model trained on a fundamentally different dataset. After the comparison window, you analyze the logged predictions to decide whether to promote the shadow model to active service.
The cost is latency and compute. Every request runs through two models instead of one, which increases response time and resource usage. However, shadow predictions can be computed asynchronously if real-time comparison is not needed, mitigating the latency impact.
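The offline analysis at the end of the comparison window can be as simple as measuring how often the two models agree and how each scores against (possibly delayed) ground-truth labels. A minimal sketch, with invented function names rather than any platform-provided report:

```python
def compare_shadow_logs(primary_preds, shadow_preds, labels):
    """Offline comparison of logged primary vs. shadow predictions.

    Returns the agreement rate between the two models and each model's
    accuracy against ground-truth labels. Illustrative sketch only.
    """
    n = len(labels)
    agreement = sum(p == s for p, s in zip(primary_preds, shadow_preds)) / n
    primary_acc = sum(p == y for p, y in zip(primary_preds, labels)) / n
    shadow_acc = sum(s == y for s, y in zip(shadow_preds, labels)) / n
    return {
        "agreement": agreement,
        "primary_accuracy": primary_acc,
        "shadow_accuracy": shadow_acc,
    }
```

High agreement combined with equal-or-better shadow accuracy is the pattern that typically supports promoting the shadow model to active service.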
Monitoring and Drift Detection
A model that performed well during training can degrade in production for reasons that have nothing to do with the model itself. The world changes. Customer behavior shifts. Upstream data pipelines introduce new formats or new null patterns. Monitoring is the practice of continuously measuring model health so you can detect and respond to these changes before they cause business impact.
Performance Metrics
CorePlexML tracks three categories of performance metrics continuously:
Prediction quality metrics measure how accurate the model's outputs are. For classification models, this includes accuracy, precision, recall, F1 score, and AUC. For regression models, this includes RMSE, MAE, and R-squared. These metrics require ground truth labels, which may arrive with a delay (e.g., you know whether a customer churned only after the churn window closes). CorePlexML supports delayed label ingestion and retroactively computes quality metrics when labels become available.
Latency metrics measure how quickly the model responds to prediction requests. The platform tracks p50, p95, and p99 latency percentiles, as well as average response time. Latency spikes can indicate model complexity issues, resource contention, or infrastructure problems.
Throughput metrics measure the volume of predictions served over time. Sudden drops in throughput may indicate upstream pipeline failures, while unexpected spikes may signal a need to scale resources.
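The latency percentiles mentioned above can be computed with a simple nearest-rank rule. This stdlib-only sketch also shows why percentiles matter more than averages for latency: a single slow outlier barely moves the mean but dominates the tail.

```python
import math

def percentile(latencies_ms, pct):
    """Nearest-rank percentile: the smallest value with at least pct%
    of samples at or below it. A common convention for latency reporting."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Ten requests, two of them pathological outliers.
latencies = [12, 15, 11, 250, 14, 13, 16, 12, 18, 900]
p50 = percentile(latencies, 50)  # typical request
p95 = percentile(latencies, 95)  # tail latency, dominated by outliers
```

Here p50 is 14 ms while p95 lands on the 900 ms outlier, which is exactly the kind of gap that latency alerting is designed to surface.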
Data Drift
Data drift occurs when the statistical distribution of input features changes between training time and inference time. For example, if your model was trained on customer data with an average age of 35, but your user base shifts younger over time, the model is now making predictions on data it was not optimized for. CorePlexML uses statistical tests (Kolmogorov-Smirnov for numeric features, chi-squared for categorical features) to compare current feature distributions against the training distribution and flags significant deviations.
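The Kolmogorov-Smirnov check for numeric features compares the empirical CDFs of the training sample and the live sample and takes the maximum gap between them. A stdlib-only sketch of the statistic (in practice you would likely reach for `scipy.stats.ks_2samp`, which also returns a p-value):

```python
import bisect

def ks_statistic(train_sample, live_sample):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    between the empirical CDFs of two samples. Large values indicate the
    live feature distribution has drifted from the training distribution."""
    a, b = sorted(train_sample), sorted(live_sample)
    max_gap = 0.0
    for x in sorted(set(a) | set(b)):
        f_a = bisect.bisect_right(a, x) / len(a)  # fraction of train <= x
        f_b = bisect.bisect_right(b, x) / len(b)  # fraction of live <= x
        max_gap = max(max_gap, abs(f_a - f_b))
    return max_gap
```

Identical distributions give a statistic of 0; fully disjoint distributions give 1. The drift score threshold in the alert examples later in this guide plays an analogous role: higher means more divergence.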
Concept Drift
Concept drift is more subtle and more dangerous. It occurs when the relationship between input features and the target variable changes, even if the feature distributions themselves remain stable. For example, the same customer profile that indicated high churn risk six months ago might now indicate low risk because the company improved its retention program. Concept drift makes the model's learned patterns obsolete. CorePlexML detects concept drift by monitoring prediction quality metrics over sliding windows and flagging sustained degradation.
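The sliding-window detection just described can be sketched as a simple rule: flag drift only when quality stays below the baseline for several consecutive windows, so transient dips do not fire alerts. The parameter names here are illustrative, not platform settings.

```python
def detect_sustained_degradation(window_accuracies, baseline, tolerance, patience):
    """Flag concept drift when accuracy stays below (baseline - tolerance)
    for `patience` consecutive evaluation windows. A sketch of the
    sliding-window quality check described above."""
    consecutive = 0
    for acc in window_accuracies:
        if acc < baseline - tolerance:
            consecutive += 1
            if consecutive >= patience:
                return True
        else:
            consecutive = 0  # a healthy window resets the streak
    return False
```

With a 0.90 baseline, 0.05 tolerance, and a patience of three windows, the sequence 0.90, 0.88, 0.83, 0.82, 0.81 triggers the flag, while alternating dips that recover do not.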
Alert Configuration
When drift or performance degradation is detected, CorePlexML can notify your team through multiple channels:
```python
alert = client.alerts.create(
    deployment_id=deployment.id,
    name="Churn model drift alert",
    metric="data_drift_score",
    condition="greater_than",
    threshold=0.15,                # Drift score above 0.15
    channels=[
        {"type": "slack", "webhook_url": "https://hooks.slack.com/..."},
        {"type": "email", "address": "ml-team@company.com"},
        {"type": "webhook", "url": "https://api.company.com/alerts"}
    ],
    evaluation_window_minutes=60,  # Check every hour
    cooldown_minutes=240           # Don't re-alert for 4 hours
)
```
The cooldown_minutes parameter prevents alert fatigue by suppressing duplicate notifications for a configured period after an alert fires. The evaluation_window_minutes parameter controls how frequently the metric is checked. For production-critical models, a shorter window (15-30 minutes) provides faster response times, while a longer window (60-120 minutes) reduces noise from transient fluctuations.
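The cooldown behavior amounts to a two-part rule: the metric must breach the threshold, and enough time must have passed since the last alert. A minimal sketch of that logic (times are in minutes for simplicity; this is an illustration, not the platform's scheduler):

```python
def should_fire(last_fired_at, now, metric_value, threshold, cooldown_minutes):
    """Fire only if the metric breaches the threshold AND the cooldown
    since the last alert has elapsed."""
    if metric_value <= threshold:
        return False  # no breach: the "greater_than" condition is not met
    if last_fired_at is not None and now - last_fired_at < cooldown_minutes:
        return False  # breach, but still inside the cooldown window
    return True
```

With the 0.15 threshold and 240-minute cooldown from the example above, a drift score of 0.20 fires once, stays silent for the next four hours of evaluations, and fires again only after the cooldown elapses.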
A/B Testing
While canary deployments compare a new model against a baseline primarily for safety (does it perform at least as well?), A/B testing is designed for rigorous comparison (which model performs better?). CorePlexML's A/B testing framework lets you split traffic between two or more model versions and measure their relative performance with statistical significance.
A/B testing is particularly valuable when you have multiple candidate models that perform similarly in offline evaluation and you want to determine which one delivers better business outcomes in production. For example, two models might have nearly identical AUC on your test set, but one might produce better-calibrated probability estimates that lead to more effective targeting in a marketing campaign.
To run an A/B test, you configure a traffic split between model variants and define the success metric you want to optimize. CorePlexML collects predictions and outcomes for each variant, computes the metric for each, and performs a statistical significance test (typically a two-proportion z-test for classification metrics or a t-test for continuous metrics) to determine whether the observed difference is real or due to random variation.
The platform reports the estimated effect size, the confidence interval, and the p-value. It also provides a recommended minimum sample size based on your desired statistical power and minimum detectable effect, so you know how long to run the test before drawing conclusions.
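For classification outcomes like conversion or churn, the two-proportion z-test mentioned above can be computed with the standard library alone. This is a textbook sketch of the statistic, not the platform's internal implementation:

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-proportion z-test: is variant B's success rate different from
    variant A's? Returns (z, two_sided_p) using a pooled standard error."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, expressed via erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

For example, 120 conversions out of 1,000 for variant A versus 150 out of 1,000 for variant B gives z of about 1.96 and a p-value of roughly 0.05, right at the conventional significance boundary, which is exactly the kind of borderline case where the platform's minimum-sample-size guidance matters.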
Auto-Retraining Workflows
Monitoring tells you when a model is degrading. Auto-retraining closes the loop by automatically training a replacement model when degradation is detected.
Drift-Triggered Retraining
This is the most responsive approach. When the monitoring system detects significant data drift or concept drift, it automatically enqueues a retraining job using the latest available data.
```python
policy = client.retraining.create_policy(
    deployment_id=deployment.id,
    trigger="drift",
    drift_metric="data_drift_score",
    drift_threshold=0.20,  # Retrain when drift exceeds 0.20
    training_config={
        "max_runtime_secs": 600,
        "max_models": 20,
        "stopping_metric": "AUC"
    },
    auto_promote=False     # Require manual review before promotion
)
```
Setting auto_promote=False is a safety measure. The retraining job produces a new model version, but it does not automatically replace the current deployment. Instead, the model is registered in the model registry with a status of "candidate," and your team is notified to review it before promotion. This prevents a scenario where a retrained model on drifted data actually performs worse than the current model.
Schedule-Based Retraining
For use cases where data accumulates at a predictable rate, schedule-based retraining ensures models are refreshed on a regular cadence regardless of whether drift has been detected.
```python
policy = client.retraining.create_policy(
    deployment_id=deployment.id,
    trigger="schedule",
    schedule="weekly",      # Options: daily, weekly, monthly
    day_of_week="sunday",   # For weekly schedule
    hour_utc=3,             # Run at 3:00 AM UTC
    training_config={
        "max_runtime_secs": 900,
        "max_models": 30,
        "stopping_metric": "AUC"
    },
    auto_promote=True       # Promote automatically if metrics improve
)
```
With auto_promote=True, the system automatically compares the new model's validation metrics against the current production model's. If the new model matches or beats the current one, it is promoted through the deployment's configured strategy (canary, blue-green, etc.). If it is worse, the candidate is shelved and an alert is sent. This provides hands-off model freshness while still guarding against regressions.
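The auto-promote comparison reduces to a small decision rule. A sketch under the assumption that a single primary validation metric drives the decision (the metric and return values here are illustrative):

```python
def promotion_decision(candidate_metrics, production_metrics, primary_metric="auc"):
    """Compare a retrained candidate against the production model on the
    primary validation metric: promote on ties or improvements, shelve
    (and alert) on regressions. Illustrative sketch only."""
    cand = candidate_metrics[primary_metric]
    prod = production_metrics[primary_metric]
    return "promote" if cand >= prod else "shelve_and_alert"
```

Note that "promote" here means entering the configured rollout strategy, not instantly replacing the production model, so a canary step can still catch a candidate that looked good offline.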
Performance-Based Retraining
This trigger activates when a specific prediction quality metric drops below a defined threshold, regardless of whether drift has been detected.
```python
policy = client.retraining.create_policy(
    deployment_id=deployment.id,
    trigger="performance",
    performance_metric="accuracy",
    performance_threshold=0.85,  # Retrain if accuracy drops below 85%
    evaluation_window_days=7,    # Measured over a 7-day window
    training_config={
        "max_runtime_secs": 600,
        "max_models": 20,
        "stopping_metric": "AUC"
    },
    auto_promote=True
)
```
Performance-based retraining is a catch-all that responds to any form of degradation, whether caused by drift, data quality issues, or other factors. It pairs well with drift-triggered retraining: the drift trigger catches gradual distributional changes early, while the performance trigger catches sudden quality drops from any cause.
The Automatic Promotion Pipeline
When auto-promotion is enabled, the retraining system follows a structured pipeline. First, the new model is trained and validated on a holdout set. Second, the model is registered in the model registry with its metrics, lineage, and training configuration. Third, the system compares the candidate's metrics against the current production model. If the candidate meets the promotion criteria, it is deployed using the deployment's configured strategy (for example, a canary rollout starting at 5%). If the canary succeeds, the candidate becomes the new production model. The entire pipeline runs without human intervention but produces a detailed audit log at every step.
Model Registry
The model registry is the backbone of reproducibility and governance in CorePlexML's MLOps framework. Every model produced by an experiment or retraining job is automatically registered with its complete lineage: the dataset version it was trained on, the experiment configuration, the hyperparameters, the validation metrics, the training duration, and the parent model it was derived from (if applicable).
Each model in the registry has a lifecycle status: candidate (newly trained, awaiting review), staging (approved for testing), production (actively serving traffic), or archived (retired from service). Status transitions are logged with timestamps and the user or automation that initiated them.
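A minimal sketch of how such lifecycle transitions might be validated and logged. The transition table below is one plausible reading of the statuses described above, not the platform's exact rules:

```python
# Hypothetical transition table: which status changes are legal.
ALLOWED_TRANSITIONS = {
    "candidate": {"staging", "archived"},
    "staging": {"production", "archived"},
    "production": {"archived"},
    "archived": set(),
}

def transition(model, new_status, actor):
    """Validate a registry status change and append it to the model's
    audit history, recording who (or what automation) initiated it."""
    if new_status not in ALLOWED_TRANSITIONS[model["status"]]:
        raise ValueError(f"cannot move {model['status']} -> {new_status}")
    model["history"].append((model["status"], new_status, actor))
    model["status"] = new_status
    return model
```

Enforcing transitions through a single function like this is what makes the audit trail trustworthy: every status change, whether human-initiated or triggered by auto-promotion, passes through the same validation and logging path.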
The registry also tracks promotion history, recording every time a model was deployed, to which endpoint, with which strategy, and what happened during the rollout. This history is invaluable for debugging production issues ("when did this model go live?"), for compliance audits ("who approved this model for production?"), and for organizational learning ("which training configurations consistently produce the best models?").
Key Takeaways
Building reliable ML systems in production is fundamentally different from training models in notebooks. Here are the practices that consistently separate successful ML deployments from those that degrade and fail:
Start with canary deployments. Unless you have a strong reason for another strategy, canary deployment provides the best balance of speed and safety for most production use cases. It lets you ship improvements frequently while automatically protecting against regressions.
Set up monitoring before you deploy, not after. Configure alerts for data drift, concept drift, and prediction quality degradation before your model starts serving traffic. The first few hours and days after deployment are when problems are most likely to surface and easiest to diagnose.
Configure at least two retraining triggers. Drift-triggered retraining catches gradual changes early, while performance-based retraining catches sudden drops from any cause. Together, they provide comprehensive coverage against model staleness.
Use the model registry as your single source of truth. Resist the temptation to deploy models from notebooks or ad-hoc scripts. Every model that serves production traffic should be registered with full lineage and versioning. This discipline pays dividends in debugging, compliance, and team coordination.
Run A/B tests for consequential model changes. When the stakes are high, do not rely on offline metrics alone. Split traffic between the old and new model, measure real business outcomes, and wait for statistical significance before committing. The extra days of testing frequently reveal production behaviors that offline evaluation missed.
Treat MLOps as infrastructure, not overhead. The monitoring, alerting, retraining, and registry capabilities described in this guide may seem like additional work on top of model development. In practice, they dramatically reduce the total cost of operating ML systems by catching problems early, automating routine maintenance, and providing the audit trails that governance requires. Teams that invest in MLOps spend less time firefighting and more time building new capabilities.
Production ML is a continuous process, not a one-time event. The model you deploy today will need to be monitored, updated, and eventually replaced. CorePlexML's MLOps framework provides the automation and observability to manage this lifecycle at scale, so your models remain reliable long after they leave the training pipeline.
Further Reading
For deeper dives into specific MLOps topics covered in this guide, see our dedicated articles:
- A/B Testing ML Models in Production — statistical foundations, SDK setup, and best practices for model experimentation
- Safe Model Rollouts with Canary Deployments — progressive traffic ramp, blue-green swaps, and shadow deployment strategies
- Model Registry: Version, Stage, and Govern — semantic versioning, model cards, lineage tracking, and governance workflows
- Detecting Model Drift and Setting Up Alerts — PSI monitoring, multi-channel alerts, and auto-retraining policies
- Enterprise Security: RBAC, SSO, and Multi-Tenancy — role-based access, identity federation, and audit logging