Learning Goals
3 min
- Explain why a single split can give a misleading score.
- Run k-fold cross-validation with cross_val_score.
- Report mean ± std accuracy, not a single number.
- Use stratify to keep class balance in splits.
Warm-Up · The Lucky Split
5 min

    from sklearn.datasets import load_wine
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_wine(return_X_y=True)

    # Same model and data each time; only the random split changes.
    for seed in [0, 1, 2, 3, 4]:
        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=seed)
        acc = KNeighborsClassifier().fit(Xtr, ytr).score(Xte, yte)
        print(f"seed {seed}: {acc:.1%}")

    seed 0: 75.6%
    seed 1: 68.9%
    seed 2: 82.2%
    seed 3: 71.1%
    seed 4: 77.8%

Same model, same data, yet accuracy swings by about 13 percentage points (68.9% to 82.2%) just by changing which rows landed in the test set. Reporting any single one of these numbers would be misleading.
One split is one sample of one random experiment. Cross-validation runs the experiment k times on different splits and reports the average — a far more trustworthy estimate of real-world accuracy.
New Concept · k-Fold Cross-Validation
14 min
The idea
5-fold CV: split the data into 5 roughly equal parts (folds). Run 5 times; each time, 1 fold is the test set and the other 4 are the training set. Every row is tested exactly once. Average the 5 scores.

    fold 1: [TEST][ train train train train ]
    fold 2: [train][TEST][ train train train ]
    fold 3: [train train][TEST][ train train ]
    fold 4: [train train train][TEST][ train ]
    fold 5: [train train train train][TEST]

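To make the fold bookkeeping concrete, here is a minimal hand-rolled sketch in plain Python. The helper `kfold_indices` is a hypothetical name for illustration, not scikit-learn's implementation, and it cuts contiguous folds without shuffling or stratifying, which real code would normally want. The row count 178 is the size of the wine dataset.

```python
def kfold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # The first n_rows % k folds get one extra row, so sizes differ by at most 1.
    sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_rows) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


n_rows, k = 178, 5
times_tested = [0] * n_rows   # how often each row lands in a test fold
fold_sizes = []

for fold, (train_idx, test_idx) in enumerate(kfold_indices(n_rows, k), start=1):
    fold_sizes.append(len(test_idx))
    for i in test_idx:
        times_tested[i] += 1
    print(f"fold {fold}: train={len(train_idx)} rows, test={len(test_idx)} rows")

print("every row tested exactly once:", all(c == 1 for c in times_tested))
```

Because 178 is not divisible by 5, three folds hold 36 rows and two hold 35, which is why the parts are only roughly equal.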
cross_val_score

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.datasets import load_wine

    X, y = load_wine(return_X_y=True)

    scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
    print(scores.round(3))  # the 5 fold scores
    print(f"{scores.mean():.1%} ± {scores.std():.1%}")

    [0.694 0.75 0.806 0.743 0.743]
    74.7% ± 3.6%

Report the mean ± std. The std tells you how stable the model is — a big std means the model is fragile to the data split.
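As a check on the arithmetic behind mean ± std, the sketch below recomputes both from the rounded fold scores printed above, using only the standard library. One detail worth knowing: NumPy's `scores.std()` defaults to the population formula (divide by k), which corresponds to `statistics.pstdev`; the sample formula (divide by k - 1) gives a slightly larger value.

```python
import statistics

# Rounded fold scores copied from the cross_val_score output above.
fold_scores = [0.694, 0.75, 0.806, 0.743, 0.743]

mean = statistics.fmean(fold_scores)
pop_std = statistics.pstdev(fold_scores)    # divide by k, like NumPy's default std()
sample_std = statistics.stdev(fold_scores)  # divide by k - 1, like std(ddof=1)

print(f"{mean:.1%} ± {pop_std:.1%}")         # 74.7% ± 3.6%
print(f"sample std: {sample_std:.1%}")      # sample std: 4.0%
```

With only 5 folds the two formulas differ noticeably, so it helps to say which one a report uses.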
Stratified splits — keep class balance
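For classification, scikit-learn uses stratified folds by default: when cv is an integer and the estimator is a classifier, cross_val_score splits with StratifiedKFold, so each fold has approximately the same class proportions as the whole dataset. To stratify a single split: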

    from sklearn.model_selection import train_test_split

    Xtr, Xte, ytr, yte = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)

Without stratify, a rare class might land entirely in train OR test by bad luck.
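What stratifying does can be sketched by hand: sample the test rows class by class, so each class contributes its own share. The function `stratified_split` and the 90/10 label list below are hypothetical illustrations in plain Python, not how scikit-learn implements `stratify`.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx), drawing the test rows class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)  # this class's share of the test set
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


# Hypothetical imbalanced data: 90 rows of class 0, 10 rows of the rare class 1.
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=0)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Whatever the seed, the rare class keeps its 10% share on both sides (8 rows in train, 2 in test), which a purely random split does not guarantee.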
The golden rule
Use CV to CHOOSE a model / settings. Keep a final, untouched test set for the ONE final score. Never tune on the test set — that leaks information.
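One way the golden rule might look in code, as a sketch rather than a prescribed workflow: hold out a test set first, choose a setting with CV on the remaining rows, then score once. The candidate list for n_neighbors and the names `X_dev` / `y_dev` are assumptions made for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

# Lock away the final test set first; nothing below touches it until the end.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Choose n_neighbors with 5-fold CV on the development rows only.
candidates = [1, 3, 5, 7, 9]  # illustrative values, not a recommendation
cv_means = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X_dev, y_dev, cv=5).mean()
    for k in candidates
}
best_k = max(cv_means, key=cv_means.get)

# Refit on all development rows, then score ONCE on the untouched test set.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_dev, y_dev)
final_score = final_model.score(X_test, y_test)
print(f"best k by CV: {best_k}, final test accuracy: {final_score:.1%}")
```

If the final score disappoints, going back to try more settings against the same test set would reintroduce the leak the rule warns about.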
Worked Example · Honest Model Comparison
12 min

    # honest_compare.py: CV instead of one lucky split
    from sklearn.datasets import load_wine
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    X, y = load_wine(return_X_y=True)

    models = {
        "KNN": KNeighborsClassifier(),
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "Random Forest": RandomForestClassifier(random_state=0),
        "Logistic Reg": LogisticRegression(max_iter=5000),
    }

    # One row per model: mean and std of the 5 fold accuracies.
    print(f"{'model':<16} {'mean':>6} {'std':>6}")
    for name, m in models.items():
        s = cross_val_score(m, X, y, cv=5)
        print(f"{name:<16} {s.mean():>6.1%} {s.std():>6.1%}")

Sample output

    model              mean    std
    KNN               74.7%   3.6%
    Decision Tree     88.8%   5.1%
    Random Forest     97.8%   2.0%
    Logistic Reg      94.9%   5.2%

Read the diff
Random Forest wins on both counts — highest mean AND lowest std (most stable). KNN's low score is a hint it needs feature scaling (Lesson 18). This table is honest in a way the single-split leaderboard from Lesson 9 was not.
Try It Yourself
13 min
Run cross_val_score on iris with a decision tree. Print mean ± std.
Compare cv=3, cv=5, cv=10 for one model. Does the mean shift much? What about the std?
For KNN, use cross-validation (not a single split) to pick the best n_neighbors from 1-20. Plot mean accuracy vs k.
Hint
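
    import matplotlib.pyplot as plt
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # X, y: the features and labels loaded earlier (for example with load_wine).
    ks = range(1, 21)
    means = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
             for k in ks]

    plt.plot(ks, means, marker="o")
    plt.xlabel("k")
    plt.ylabel("CV accuracy")
    plt.show()

    # Ties go to the smallest k, because index() returns the first maximum.
    print("best k:", ks[means.index(max(means))])
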
Mini-Challenge · Reliable Leaderboard
8 min
Rebuild the Lesson 9 leaderboard but with cross-validation. Sort by mean, and flag any model whose std is > 5% as "unstable".
Show one possible solution
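
    from sklearn.model_selection import cross_val_score

    def cv_leaderboard(X, y, models, cv=5):
        # Collect (name, mean, std) of the fold scores for every model.
        rows = []
        for name, m in models.items():
            s = cross_val_score(m, X, y, cv=cv)
            rows.append((name, s.mean(), s.std()))
        # Best mean first; flag models whose std exceeds 5 percentage points.
        for name, mean, std in sorted(rows, key=lambda r: -r[1]):
            flag = " ⚠️ unstable" if std > 0.05 else ""
            print(f" {name:<16} {mean:.1%} ± {std:.1%}{flag}")
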
Non-negotiables: CV not single split, sorted by mean, instability flag on high-variance models.
Recap
3 min
One split is a coin toss; cross-validation runs k splits and averages them for an honest estimate. Report mean ± std. Use stratified folds for classification. Tune with CV, but keep a final untouched test set for the one true score, and never tune on it. Next: what accuracy actually measures (and when it lies).
Vocabulary Card
- cross-validation
- Splitting data into k folds, training/testing k times, averaging the scores.
- fold
- One of the k equal parts the data is divided into.
- stratify
- Keeping each split's class proportions equal to the whole dataset's.
- variance (of a model)
- How much its score changes across splits; high std = fragile.
Homework
4 min
On any dataset, produce a cross-validated leaderboard of 4 models. Write a short note: which model you'd ship and why, considering BOTH mean accuracy and stability (std).
Use cv_leaderboard from the mini-challenge. A good note picks the model with the best mean unless a close runner-up is much more stable.
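The selection rule in that last sentence can be written down as a small sketch. The function `pick_model` and its thresholds are assumptions chosen for illustration: "close" means within 2 percentage points of the best mean, and "much more stable" means at most half the leader's std. The lesson does not prescribe these values. The first leaderboard reuses the sample output from the worked example; the second is made up.

```python
def pick_model(rows, close=0.02, much_more_stable=0.5):
    """rows: list of (name, mean, std). Thresholds are illustrative assumptions."""
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    best_name, best_mean, best_std = ranked[0]
    for name, mean, std in ranked[1:]:
        is_close = best_mean - mean <= close
        is_much_stabler = std <= best_std * much_more_stable
        if is_close and is_much_stabler:
            return name  # a close runner-up that is much more stable
    return best_name


# Sample output from the worked example above.
lesson_rows = [
    ("KNN", 0.747, 0.036),
    ("Decision Tree", 0.888, 0.051),
    ("Random Forest", 0.978, 0.020),
    ("Logistic Reg", 0.949, 0.052),
]

# Hypothetical leaderboard where the runner-up is close and far steadier.
made_up_rows = [("A", 0.950, 0.060), ("B", 0.940, 0.020)]

lesson_choice = pick_model(lesson_rows)
made_up_choice = pick_model(made_up_rows)
print(lesson_choice)    # Random Forest
print(made_up_choice)   # B
```

A written note should still explain the choice in words; the thresholds only make the trade-off explicit.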