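MLOps (Machine Learning Operations) applies DevOps-style practices to ML: managing the full ML lifecycle (data, training, deployment, monitoring) reliably and at scale. It addresses the operational challenges specific to deploying and maintaining ML in production.
The ML Lifecycle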
ML projects involve a full lifecycle (not just training a model):
1. DATA → collect, clean, label, version data (data is foundational)
2. TRAINING → develop, train, and evaluate models (experimentation, tuning)
3. DEPLOYMENT → put the model into production (serving predictions/inference)
4. MONITORING → track performance in production; detect issues
5. MAINTENANCE → retrain/update models as data and performance change
→ an ongoing cycle, not a one-time effort
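The five stages above can be sketched as a small loop. This is a toy illustration, not a real pipeline: the "model" just predicts the mean label, and every name here (data_version, train, predict, evaluate, run_cycle, max_error) is hypothetical and chosen only for this example.

```python
import hashlib
import json
import statistics


def data_version(rows):
    """Stage 1 (DATA): fingerprint the dataset so a run can be tied to it."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def train(rows):
    """Stage 2 (TRAINING): a toy 'model' that stores the mean label."""
    return {"mean": statistics.fmean(y for _, y in rows)}


def predict(model, x):
    """Stage 3 (DEPLOYMENT): serve a prediction (the toy model ignores x)."""
    return model["mean"]


def evaluate(model, rows):
    """Stage 4 (MONITORING): mean absolute error on labeled production rows."""
    return statistics.fmean(abs(y - predict(model, x)) for x, y in rows)


def run_cycle(model, rows, max_error):
    """Stage 5 (MAINTENANCE): retrain only when the monitored error is too high.

    Returns (model, retrained_flag).
    """
    if evaluate(model, rows) > max_error:
        return train(rows), True
    return model, False


if __name__ == "__main__":
    initial = [(0, 1.0), (1, 1.2), (2, 0.8)]
    drifted = [(3, 3.0), (4, 3.2), (5, 2.8)]

    model = train(initial)
    print("trained on data version", data_version(initial))

    model, retrained = run_cycle(model, drifted, max_error=0.5)
    print("retrained:", retrained)
```

Calling run_cycle on each new batch of labeled data is what makes this a cycle rather than a one-time effort: the model is replaced only when monitoring says it has degraded.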
What MLOps Addresses
MLOps → practices/tools to manage the ML lifecycle reliably (DevOps for ML):
✓ REPRODUCIBILITY → version data, code, AND models; track experiments
✓ AUTOMATION → automate training, testing, deployment pipelines (CI/CD for ML)
✓ DEPLOYMENT → reliably serve models; scaling, versioning, rollback
✓ MONITORING → track model performance, data DRIFT (data changing over time → model
degrades), errors → know when to retrain
✓ COLLABORATION → data scientists, ML engineers, and ops work from shared pipelines
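As one way to make the MONITORING item concrete, data drift on a single numeric feature can be scored with the Population Stability Index (PSI), which compares the binned distribution of a reference (training) sample against a live (production) sample. The sketch below uses only the standard library. The function names, the equal-width binning, and the 0.25 threshold are assumptions for illustration; 0.25 is a commonly cited rule of thumb, not a value from these notes.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference and a live sample.

    Bins are equal-width over the range of the reference sample; live values
    outside that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for a constant feature

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def needs_retraining(expected, actual, threshold=0.25):
    """Flag drift when PSI exceeds the (assumed) threshold."""
    return psi(expected, actual) > threshold


if __name__ == "__main__":
    reference = [i / 100 for i in range(100)]      # feature values at training time
    live = [0.5 + i / 200 for i in range(100)]     # production values, shifted upward
    print("PSI:", round(psi(reference, live), 3))
    print("retrain?", needs_retraining(reference, live))
```

A check like this would typically run on a schedule against recent production inputs, with a positive result triggering the retraining pipeline or an alert for review.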
