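In machine learning, data is critical: the quality and quantity of the training data largely determine model performance. The principle of "garbage in, garbage out" applies in full here: even an excellent algorithm fails on poor data, and good data often matters more than the choice of algorithm.
Why data matters so much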
ML models LEARN from data → the data fundamentally shapes what they learn:
→ GARBAGE IN, GARBAGE OUT → poor data → poor model (no algorithm fixes bad data)
→ good DATA is often MORE impactful than the algorithm (data > model tweaks, often)
→ models can only be as good as the data they learn from
→ data is frequently the most important factor in ML success
Data quality
✓ ACCURATE/correct → wrong labels/values → the model learns wrong things
✓ RELEVANT → data representative of the real problem/distribution
✓ CLEAN → handle missing values, errors, duplicates, noise
✓ UNBIASED → biased data → biased model (perpetuates/amplifies bias — a serious issue)
✓ CONSISTENT, well-labeled → good labels are crucial for supervised learning
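
Several items on this checklist (missing values, duplicates, inconsistent labels) can be checked mechanically before training. The sketch below is one minimal way to do that in plain Python, assuming the dataset is a list of dict records. The function name `audit_dataset`, the field names, and the toy records are all hypothetical and chosen only for illustration. It does not cover relevance or bias, which need knowledge of the real problem.

```python
from collections import Counter


def audit_dataset(rows, feature_keys, label_key):
    """Return simple data-quality counts for a list of dict records."""
    missing = 0               # records with a missing feature or label
    seen = Counter()          # occurrences of each (features, label) pair
    labels_by_features = {}   # features -> set of labels seen for them
    label_counts = Counter()  # class balance among complete records

    for row in rows:
        values = tuple(row.get(k) for k in feature_keys)
        label = row.get(label_key)
        if label is None or any(v is None for v in values):
            missing += 1
            continue
        seen[(values, label)] += 1
        labels_by_features.setdefault(values, set()).add(label)
        label_counts[label] += 1

    # Every repeat beyond the first copy counts as one duplicate.
    duplicates = sum(n - 1 for n in seen.values())
    # Identical features with different labels signal inconsistent labeling.
    conflicts = sum(1 for labels in labels_by_features.values() if len(labels) > 1)

    return {
        "missing": missing,
        "duplicates": duplicates,
        "conflicting_labels": conflicts,
        "label_counts": dict(label_counts),
    }


# Toy records (invented for this example).
rows = [
    {"size": 1.0, "color": "red", "label": "apple"},
    {"size": 1.0, "color": "red", "label": "apple"},     # exact duplicate
    {"size": 1.0, "color": "red", "label": "cherry"},    # same features, different label
    {"size": None, "color": "green", "label": "apple"},  # missing value
    {"size": 2.0, "color": "yellow", "label": "banana"},
]
report = audit_dataset(rows, ["size", "color"], "label")
print(report)
```

On the toy records this reports one incomplete record, one duplicate, and one feature combination with conflicting labels. The `label_counts` entry gives a rough view of class balance, which is only a first hint about bias and not a full check.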
