PY-L5-10 · Train-Test Split & Cross-Validation

Learning Goals

3 min

Explain why a single split can give a misleading score.
Run k-fold cross-validation with cross_val_score.
Report mean ± std accuracy, not a single number.
Use stratify to keep class balance in splits.

Warm-Up · The Lucky Split

5 min

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
for seed in [0, 1, 2, 3, 4]:
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=seed)
    acc = KNeighborsClassifier().fit(Xtr, ytr).score(Xte, yte)
    print(f"seed {seed}: {acc:.1%}")

seed 0: 75.6%
seed 1: 68.9%
seed 2: 82.2%
seed 3: 71.1%
seed 4: 77.8%

Same model, same data, yet accuracy swings about 13 points (from 68.9% to 82.2%) just by changing which rows landed in the test set. Reporting any single one would be dishonest.

Today's big idea

One split is one sample of one random experiment. Cross-validation runs the experiment k times on different splits and reports the average — a far more trustworthy estimate of real-world accuracy.

New Concept · k-Fold Cross-Validation

14 min

The idea

5-fold CV: split data into 5 roughly equal parts.
Run 5 times — each time, 1 part is the test, 4 are training.
Every row is tested exactly once. Average the 5 scores.

  fold 1:  [TEST][ train  train  train  train ]
  fold 2:  [train][TEST][ train  train  train ]
  fold 3:  [train  train][TEST][ train  train ]
  fold 4:  [train  train  train][TEST][ train ]
  fold 5:  [train  train  train  train][TEST]
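To see the diagram in code, here is a minimal sketch that runs the five folds by hand. It assumes KFold with shuffle=True and random_state=0 (both are illustrative choices; the wine rows are stored in class order, so unshuffled plain folds would be lopsided). The counter times_tested is a hypothetical helper name, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

# Assumption: shuffled folds with a fixed seed, chosen only for this sketch.
folds = KFold(n_splits=5, shuffle=True, random_state=0)

times_tested = np.zeros(len(y), dtype=int)  # how often each row sat in a test fold
fold_scores = []
fold_sizes = []

for fold_no, (train_idx, test_idx) in enumerate(folds.split(X), start=1):
    # Train on 4 parts, score on the 1 part held out for this round.
    model = KNeighborsClassifier().fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))
    fold_sizes.append(len(test_idx))
    times_tested[test_idx] += 1
    print(f"fold {fold_no}: train={len(train_idx)} test={len(test_idx)} "
          f"acc={fold_scores[-1]:.3f}")

print("every row tested exactly once:", bool((times_tested == 1).all()))
print(f"mean accuracy: {np.mean(fold_scores):.1%}")
```

The fold sizes differ by at most one row because 178 rows do not divide evenly by 5.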

cross_val_score

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)

print(scores.round(3))               # the 5 fold scores
print(f"{scores.mean():.1%} ± {scores.std():.1%}")

[0.694 0.75  0.806 0.743 0.743]
74.7% ± 3.6%

Report the mean ± std. The std tells you how stable the model is — a big std means the model is fragile to the data split.
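As a quick check of the arithmetic, this sketch recomputes the summary from the five rounded fold scores printed above. NumPy's std() divides by n by default (ddof=0); the ddof=1 line is an added comparison for illustration, not something the lesson requires.

```python
import numpy as np

# The five fold scores from the output above, as printed (rounded to 3 places).
scores = np.array([0.694, 0.75, 0.806, 0.743, 0.743])

summary = f"{scores.mean():.1%} ± {scores.std():.1%}"
print(summary)                                        # 74.7% ± 3.6%

# Sample std (divides by n - 1) comes out slightly larger with only 5 folds.
print(f"sample std: {scores.std(ddof=1):.1%}")        # sample std: 4.0%
```

Whichever version you report, say which one it is; with only 5 folds the two differ noticeably.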

Stratified splits — keep class balance

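For classification, scikit-learn uses stratified folds by default: when cv is an integer and the model is a classifier, cross_val_score splits with StratifiedKFold, so each fold has roughly the same class proportions as the whole. To stratify a single split: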

from sklearn.model_selection import train_test_split
Xtr, Xte, ytr, yte = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

Without stratify, a rare class might land entirely in train OR test by bad luck.
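One way to check this for yourself is to compare class proportions in the test set with and without stratify. The sketch below is illustrative: class_shares is a hypothetical helper, and random_state=0 is an arbitrary seed, so a different seed will give different "plain" numbers.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

def class_shares(labels):
    """Fraction of rows in each of the 3 wine classes, in class order."""
    return np.bincount(labels, minlength=3) / len(labels)

shares = {"whole": class_shares(y)}
for label, strat in [("plain", None), ("stratified", y)]:
    # Only the test labels are needed to inspect the class mix.
    _, _, _, yte = train_test_split(
        X, y, test_size=0.25, stratify=strat, random_state=0)
    shares[label] = class_shares(yte)

for label, s in shares.items():
    print(f"{label:<11}", s.round(2))
```

The stratified row should track the whole-dataset row closely, while the plain row is free to drift.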

The golden rule

Use CV to CHOOSE a model / settings.
Keep a final, untouched test set for the ONE final score.
Never tune on the test set — that leaks information.
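The three rules can be sketched as one short workflow. This is a minimal illustration under assumptions: the shortlist of n_neighbors values, the 25% hold-out size, and the names X_dev and X_final are all hypothetical choices, not part of the lesson.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

# 1. Lock away a final test set before looking at any scores.
X_dev, X_final, y_dev, y_final = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# 2. Choose a setting with CV on the development rows only.
candidates = [1, 5, 9, 15]  # hypothetical shortlist of n_neighbors values
cv_means = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                       X_dev, y_dev, cv=5).mean()
    for k in candidates
}
best_k = max(cv_means, key=cv_means.get)

# 3. Refit on all development rows and score the untouched test set once.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_dev, y_dev)
final_acc = final_model.score(X_final, y_final)

print(f"chosen n_neighbors: {best_k} (CV mean {cv_means[best_k]:.1%})")
print(f"final test accuracy: {final_acc:.1%}")
```

If you go back and change settings after seeing the final number, the test set is no longer untouched and the score stops being an honest estimate.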

Worked Example · Honest Model Comparison

12 min

# honest_compare.py — CV instead of one lucky split
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_wine(return_X_y=True)

models = {
    "KNN":           KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Reg":  LogisticRegression(max_iter=5000),
}

print(f"{'model':<16} {'mean':>6} {'std':>6}")
for name, m in models.items():
    s = cross_val_score(m, X, y, cv=5)
    print(f"{name:<16} {s.mean():>6.1%} {s.std():>6.1%}")

Sample output

model              mean    std
KNN               74.7%   3.6%
Decision Tree     88.8%   5.1%
Random Forest     97.8%   2.0%
Logistic Reg      94.9%   5.2%

Read the diff

Random Forest wins on both counts — highest mean AND lowest std (most stable). KNN's low score is a hint it needs feature scaling (Lesson 18). This table is honest in a way the single-split leaderboard from Lesson 9 was not.

Try It Yourself

13 min

01 🟢 5-fold score

Run cross_val_score on iris with a decision tree. Print mean ± std.

02 🟡 Does k matter?

Compare cv=3, cv=5, cv=10 for one model. Does the mean shift much? What about the std?

03 🔴 Tune k with CV

For KNN, use cross-validation (not a single split) to pick the best n_neighbors from 1 to 20. Plot mean CV accuracy against n_neighbors. Here k means the number of neighbors, not the number of folds.

Hint

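# Assumes the wine data used earlier in the lesson; swap in your own X, y.
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
ks = range(1, 21)  # candidate n_neighbors values
means = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
         for k in ks]
plt.plot(ks, means, marker="o")
plt.xlabel("n_neighbors (k)"); plt.ylabel("CV accuracy"); plt.show()
print("best k:", ks[means.index(max(means))])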

Mini-Challenge · Reliable Leaderboard

8 min

Rebuild the Lesson 9 leaderboard but with cross-validation. Sort by mean, and flag any model whose std is > 5% as "unstable".

Show one possible solution

from sklearn.model_selection import cross_val_score

def cv_leaderboard(X, y, models, cv=5):
    """Print models sorted by mean CV accuracy, flagging std above 5 points."""
    rows = []
    for name, m in models.items():
        s = cross_val_score(m, X, y, cv=cv)
        rows.append((name, s.mean(), s.std()))
    # highest mean first
    for name, mean, std in sorted(rows, key=lambda r: -r[1]):
        flag = "  ⚠️ unstable" if std > 0.05 else ""
        print(f"  {name:<16} {mean:.1%} ± {std:.1%}{flag}")

Non-negotiables: CV not single split, sorted by mean, instability flag on high-variance models.

Recap

3 min

One split is a coin toss; cross-validation runs k splits and averages for an honest estimate. Report mean ± std. Use stratified folds for classification. Tune with CV, but keep a final untouched test set for the one true score — never tune on it. Next: what accuracy actually measures (and when it lies).

Vocabulary Card

cross-validation: Splitting data into k folds, training/testing k times, averaging the scores.
fold: One of the k roughly equal parts the data is divided into.
stratify: Keeping each split's class proportions equal to the whole dataset's.
variance (of a model): How much its score changes across splits; high std = fragile.

Homework

4 min

On any dataset, produce a cross-validated leaderboard of 4 models. Write a short note: which model you'd ship and why — considering BOTH mean accuracy and stability (std).
