Evaluating ML models means measuring how well they perform, using appropriate metrics (accuracy, precision, recall, etc.) on test data the model hasn't seen. Proper evaluation is essential for knowing whether a model generalizes to new data rather than just fitting the data it was trained on.
Evaluating on unseen data
→ evaluate on a TEST set the model did NOT train on → measures GENERALIZATION (real performance)
→ training accuracy alone is misleading (a model can memorize training data)
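→ train/validation/test split (validation for tuning and model selection, test for the final check); cross-validation averages over several splits → more reliable performance estimates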
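A minimal sketch of the points above, in plain Python with no ML library. The helper names (`train_test_split`, `k_fold_indices`, `MemorizingModel`, `accuracy`) and the toy data are assumptions made up for illustration; they are not from these notes or from any particular library. The toy model simply memorizes its training rows, so it scores perfectly on training data, and only the held-out rows show how it behaves on inputs it has not seen.

```python
import random
from collections import Counter


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows, then hold out a fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs; every index is validated exactly once."""
    start = 0
    for fold in range(k):
        size = n // k + (1 if fold < n % k else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n))
        yield train_idx, val_idx
        start += size


class MemorizingModel:
    """Toy model: looks up inputs it has seen, otherwise guesses the majority label."""

    def fit(self, rows):
        self.table = {x: y for x, y in rows}
        self.default = Counter(y for _, y in rows).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.table.get(x, self.default)


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model.predict(x) == y for x, y in rows) / len(rows)


# Toy data (hypothetical): the label is 1 when x is a multiple of 3.
data = [(x, int(x % 3 == 0)) for x in range(100)]

train_rows, test_rows = train_test_split(data, test_fraction=0.2, seed=0)
model = MemorizingModel().fit(train_rows)
train_acc = accuracy(model, train_rows)  # perfect: every training row was memorized
test_acc = accuracy(model, test_rows)    # unseen inputs fall back to the majority guess
print(f"train accuracy = {train_acc:.2f}, test accuracy = {test_acc:.2f}")

# 5-fold cross-validation: average the held-out accuracy over 5 different splits.
# train_rows is already shuffled, so contiguous folds are fine here.
cv_scores = []
for train_idx, val_idx in k_fold_indices(len(train_rows), k=5):
    fold_model = MemorizingModel().fit([train_rows[i] for i in train_idx])
    cv_scores.append(accuracy(fold_model, [train_rows[i] for i in val_idx]))
print(f"cross-validated accuracy = {sum(cv_scores) / len(cv_scores):.2f}")
```

The test set is touched once, at the end; the cross-validation loop only ever uses the training rows, which is what keeps the final test score an honest estimate.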
Common metrics
CLASSIFICATION:
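ACCURACY → % of predictions that are correct (misleading for IMBALANCED data: e.g. if 99% of cases are 'not fraud', always predicting 'not fraud' scores 99% while catching no fraud)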
PRECISION → of predicted positives, how many are actually positive (prioritize when false positives are costly)
RECALL → of actual positives, how many were found (prioritize when false negatives, i.e. missed cases, are costly)
F1 → harmonic mean of precision and recall (high only when both are high)
CONFUSION MATRIX → counts of true positives, false positives, true negatives, and false negatives
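The classification metrics above all come from the four confusion-matrix counts, which a short sketch can make concrete. The function names (`confusion_counts`, `classification_metrics`) and the example labels are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not the only possible choice.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 computed from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Imbalanced example: 99 'not fraud' (0) and 1 'fraud' (1).
y_true_fraud = [0] * 99 + [1]
always_negative = [0] * 100  # a "model" that never flags fraud
lazy = classification_metrics(y_true_fraud, always_negative)
print(lazy)

# Balanced example with counts that are easy to check by hand:
# tp=3, fn=1, fp=1, tn=5
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
balanced = classification_metrics(y_true, y_pred)
print(balanced)
```

The always-negative predictor reaches 0.99 accuracy while its recall and F1 are 0.0, which is the imbalanced-data trap in concrete form.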
REGRESSION:
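MAE → mean absolute error (average size of the miss, in the target's units)
MSE/RMSE → mean squared error and its square root (large misses are penalized more; RMSE is back in the target's units)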
→ choose metrics that fit the problem (accuracy isn't always right)
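For the regression metrics, a small sketch shows why the choice of metric matters. The function names and the two sets of predictions are made up for illustration; they are an assumption, not data from any real model.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: squaring makes large misses count for more."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: MSE brought back to the target's units."""
    return math.sqrt(mse(y_true, y_pred))


y_true = [10.0, 10.0, 10.0, 10.0]
pred_even = [11.0, 11.0, 11.0, 11.0]   # off by 1 on every point
pred_spiky = [10.0, 10.0, 10.0, 14.0]  # exact three times, off by 4 once

for name, pred in [("even", pred_even), ("spiky", pred_spiky)]:
    print(name, mae(y_true, pred), mse(y_true, pred), rmse(y_true, pred))
```

Both sets of predictions have an MAE of 1.0, but the one with a single large miss has an RMSE of 2.0 against 1.0 for the other. Which of the two counts as "better" depends on whether occasional large errors are worse for the problem than many small ones.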
