Learning Goals
3 min- Understand why exact-equality tests fail for ML.
- Test data prep, shapes, and ranges deterministically.
- Use seeds + metric thresholds instead of exact predictions.
- Test the pipeline around the model, not the model's "truth".
Warm-Up · "Correct" Is Fuzzy
5 min# you CAN'T do this — ML isn't exact, and retraining changes weights: assert model.predict(X)[0] == "cat" # might be "dog" tomorrow # you CAN test these: assert preds.shape == (100,) # right shape assert set(preds) <= {0, 1, 2} # only valid classes assert accuracy >= 0.85 # meets a quality BAR
Don't test that a model gives a specific answer — test the things that MUST be true: data is prepped correctly, outputs have the right shape and range, and quality clears a threshold. Most ML bugs are actually data/pipeline bugs, which ARE testable like normal code.
New Concept · What You CAN Test
14 min1. Test the data prep (deterministic, normal code)
def test_prep_shapes(): X, y = prepare(df, label="churn") assert X.shape[0] == y.shape[0] # same number of rows assert not X.isna().any().any() # no missing values left assert set(y.unique()) <= {0, 1} # label encoded correctly
2. Test output shape & range
def test_predict_output(): preds = model.predict(X_test) assert preds.shape == (len(X_test),) # one prediction per row assert set(preds) <= {0, 1, 2} # only valid classes probs = model.predict_proba(X_test) assert (probs.sum(axis=1) == pytest.approx(1.0)).all() # probs sum to 1
3. Test reproducibility with a seed
def test_deterministic_with_seed(): m1 = RandomForestClassifier(random_state=42).fit(X, y) m2 = RandomForestClassifier(random_state=42).fit(X, y) assert (m1.predict(X) == m2.predict(X)).all() # same seed → same model
4. Test a quality THRESHOLD, not an exact score
def test_model_meets_bar(): acc = cross_val_score(model, X, y, cv=5).mean() assert acc >= 0.80, f"accuracy {acc:.2f} below the 0.80 bar" # a "smoke test" that the model hasn't badly regressed
5. Test invariants / metamorphic properties
"shuffling the input rows shouldn't change predictions" "scaling all features by a constant shouldn't change a tree's output" "a known obvious case should be classified correctly" e.g. a clearly-spam email → predicted spam
def test_obvious_case(): assert classify("FREE MONEY CLICK NOW WIN PRIZE") == "spam"
Worked Example · Testing a Classifier Pipeline
12 min# test_ml.py — pin down everything except the model's "opinions" import pytest, numpy as np from sklearn.datasets import load_iris from sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import train_test_split, cross_val_score @pytest.fixture(scope="module") def data(): return load_iris(return_X_y=True) @pytest.fixture(scope="module") def model(data): X, y = data return RandomForestClassifier(random_state=42).fit(X, y) def test_prediction_shape(model, data): X, y = data preds = model.predict(X) assert preds.shape == (len(X),) def test_only_valid_classes(model, data): X, y = data assert set(model.predict(X)) <= {0, 1, 2} def test_probabilities_sum_to_one(model, data): X, y = data probs = model.predict_proba(X) assert probs.sum(axis=1) == pytest.approx(np.ones(len(X))) def test_reproducible(data): X, y = data a = RandomForestClassifier(random_state=42).fit(X, y).predict(X) b = RandomForestClassifier(random_state=42).fit(X, y).predict(X) assert (a == b).all() def test_quality_bar(data): X, y = data acc = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5).mean() assert acc >= 0.90, f"iris should be easy; got {acc:.2f}"
$ pytest test_ml.py -v test_prediction_shape PASSED test_only_valid_classes PASSED test_probabilities_sum_to_one PASSED test_reproducible PASSED test_quality_bar PASSED 5 passed
Read the diff
Not one test asserts "this flower is species 1". Instead: shape, valid classes, probabilities summing to 1, reproducibility under a seed, and a quality floor. These tests catch the real ML bugs — wrong output shape, leaked labels, a model that silently regressed below the bar — without being brittle about individual predictions.
Try It Yourself
13 minFor a model you trained in Level 5, test the prediction shape and that outputs are valid classes only.
Add a threshold test: CV accuracy ≥ some bar. Lower the bar slightly below your real score so it passes but would fail on a big regression.
Write a test asserting an obvious case is classified correctly (a clearly-positive review → positive), and that shuffling input rows doesn't change the set of predictions.
Mini-Challenge · Catch a Data Leak with a Test
8 minWrite a test that flags suspiciously-perfect accuracy (≥ 0.99 on a hard problem) as a likely data-leak red flag, OR a test that asserts the label column is NOT present in the feature matrix. Catch the Level-5 leakage bug with a test.
Show one possible solution
def test_no_label_in_features(): X, y = prepare(df, label="target") assert "target" not in X.columns # leakage guard def test_accuracy_not_suspiciously_perfect(): acc = cross_val_score(model, X, y, cv=5).mean() # on a genuinely hard problem, ~1.0 usually means leakage assert acc < 0.999, "near-perfect accuracy — check for data leakage!"
Non-negotiables: a structural leak guard (label not in X) and/or a "too good to be true" sanity check. These catch the most damaging ML bug — leakage — automatically.
Recap
3 minDon't test exact predictions — test what must hold: prep correctness (shapes, no NaN, encoded labels), output shape/range, reproducibility under a seed, a quality threshold, and metamorphic properties / obvious cases. Add leakage guards. Most ML bugs are data/pipeline bugs, which test like normal code. Next: end-to-end testing through a real browser.
Vocabulary Card
- non-determinism
- ML outputs vary with random seeds/retraining — so exact-equality tests are brittle.
- seed
- Fixing randomness (
random_state) to make training reproducible in tests. - threshold test
- Asserting a metric clears a bar rather than equals a value.
- metamorphic test
- Asserting a property that must hold (shuffling rows doesn't change results).
Homework
4 minTake a Level-5 ML model. Write a test suite covering: prep shapes + no-NaN, output shape + valid classes, reproducibility (seed), a quality threshold, an obvious-case metamorphic test, and a leakage guard. No test should assert an individual prediction.
Model on test_ml.py + the leakage guards. The discipline: every test pins something that MUST be true, never a specific model opinion.