PY-L6-38 · Testing AI / ML Code

Learning Goals

3 min

Understand why exact-equality tests fail for ML.
Test data prep, shapes, and ranges deterministically.
Use seeds + metric thresholds instead of exact predictions.
Test the pipeline around the model, not the model's "truth".

Warm-Up · "Correct" Is Fuzzy

5 min

# you CAN'T do this — ML isn't exact, and retraining changes weights:
assert model.predict(X)[0] == "cat"      # might be "dog" tomorrow

# you CAN test these:
assert preds.shape == (100,)              # right shape
assert set(preds) <= {0, 1, 2}            # only valid classes
assert accuracy >= 0.85                   # meets a quality BAR

Today's big idea

Don't test that a model gives a specific answer; test the things that MUST be true: data is prepped correctly, outputs have the right shape and range, and quality clears a threshold. Many ML bugs are really data/pipeline bugs, which ARE testable like normal code.

New Concept · What You CAN Test

14 min

1. Test the data prep (deterministic, normal code)

def test_prep_shapes():
    X, y = prepare(df, label="churn")
    assert X.shape[0] == y.shape[0]          # same number of rows
    assert not X.isna().any().any()          # no missing values left
    assert set(y.unique()) <= {0, 1}         # label encoded correctly
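
The test above relies on a `prepare` helper that this lesson does not show. A minimal sketch of what such a helper might look like, assuming a pandas DataFrame with a yes/no label column. The column names, the median fill, and the "yes" → 1 encoding are illustrative assumptions, not the course's actual implementation:

```python
import pandas as pd


def prepare(df: pd.DataFrame, label: str):
    """Hypothetical prep step: split off the label, fill gaps, encode to 0/1."""
    df = df.dropna(subset=[label])             # rows without a label are unusable
    y = (df[label] == "yes").astype(int)       # assumed encoding: "yes" -> 1, else 0
    X = df.drop(columns=[label])
    X = X.fillna(X.median(numeric_only=True))  # assumed strategy: per-column median
    return X, y


# A tiny hand-made frame stands in for real data.
df = pd.DataFrame({
    "age":    [34, None, 51, 29],
    "visits": [3, 7, None, 1],
    "churn":  ["yes", "no", "no", None],
})
X, y = prepare(df, label="churn")
```

Because the input frame is tiny and hand-made, every assertion in `test_prep_shapes` can be checked by eye, which is what makes prep code testable like any other function.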

2. Test output shape & range

def test_predict_output():
    preds = model.predict(X_test)
    assert preds.shape == (len(X_test),)     # one prediction per row
    assert set(preds) <= {0, 1, 2}           # only valid classes
    probs = model.predict_proba(X_test)
    assert probs.sum(axis=1) == pytest.approx(1.0)   # every row of probs sums to 1
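
The shape and range checks can also be bundled into one reusable helper. The sketch below is one possible version; `check_predictions` is a hypothetical name, and the probability-shape check assumes the model saw every valid class during training. It uses only NumPy, so hand-written arrays can stand in for real model output:

```python
import numpy as np


def check_predictions(preds, probs, n_rows, valid_classes):
    """Raise AssertionError if model outputs break a structural contract."""
    preds = np.asarray(preds)
    probs = np.asarray(probs)
    assert preds.shape == (n_rows,), f"expected {n_rows} predictions, got {preds.shape}"
    assert set(preds.tolist()) <= set(valid_classes), "unexpected class label"
    assert probs.shape == (n_rows, len(valid_classes)), f"bad probability shape {probs.shape}"
    assert ((probs >= 0) & (probs <= 1)).all(), "probability outside [0, 1]"
    assert np.allclose(probs.sum(axis=1), 1.0), "probability rows do not sum to 1"


# Hand-written outputs stand in for model.predict / model.predict_proba.
good_preds = [0, 2, 1]
good_probs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.2, 0.5, 0.3]]
check_predictions(good_preds, good_probs, n_rows=3, valid_classes=[0, 1, 2])
```

In a real suite the helper would be called with `model.predict(X_test)` and `model.predict_proba(X_test)`.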

3. Test reproducibility with a seed

def test_deterministic_with_seed():
    m1 = RandomForestClassifier(random_state=42).fit(X, y)
    m2 = RandomForestClassifier(random_state=42).fit(X, y)
    assert (m1.predict(X) == m2.predict(X)).all()   # same seed → same model
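
Seeding matters for the data split as well as for the model. A small sketch, using a made-up array and a hypothetical `split` wrapper, showing that `train_test_split` with a fixed `random_state` returns the same split every time:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features, alternating binary labels.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)


def split(seed):
    """Hypothetical wrapper so every test splits the data the same way."""
    return train_test_split(X, y, test_size=0.25, random_state=seed)


def test_split_is_reproducible():
    first = split(seed=42)
    second = split(seed=42)
    # train_test_split returns X_train, X_test, y_train, y_test
    for a, b in zip(first, second):
        assert np.array_equal(a, b)
```

If the split is not seeded, a threshold test can pass or fail depending on which rows happen to land in the test set.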

4. Test a quality THRESHOLD, not an exact score

def test_model_meets_bar():
    acc = cross_val_score(model, X, y, cv=5).mean()
    assert acc >= 0.80, f"accuracy {acc:.2f} below the 0.80 bar"
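
One way to choose a bar that is not arbitrary is to compare against a trivial baseline. The sketch below is an assumption layered on top of the lesson, not part of it: it uses scikit-learn's `DummyClassifier` (which always predicts the most frequent class) on the iris dataset, and the 0.20 margin is an illustrative value:

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def test_model_beats_baseline():
    X, y = load_iris(return_X_y=True)
    # Baseline: always predict the most frequent class.
    baseline = cross_val_score(
        DummyClassifier(strategy="most_frequent"), X, y, cv=5
    ).mean()
    acc = cross_val_score(
        RandomForestClassifier(random_state=42), X, y, cv=5
    ).mean()
    margin = 0.20  # assumed margin; tune it to the problem
    assert acc >= baseline + margin, (
        f"accuracy {acc:.2f} does not clear baseline {baseline:.2f} + {margin}"
    )
```

A model that cannot beat this baseline has learned nothing useful, whatever its absolute score looks like.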
    # a "smoke test" that the model hasn't badly regressed

5. Test invariants / metamorphic properties

"shuffling the input rows should only reorder the predictions, not change them"
"retraining a tree on features scaled by a positive constant shouldn't change its predictions"
"a known obvious case should be classified correctly"
  e.g. a clearly-spam email → predicted spam

def test_obvious_case():
    assert classify("FREE MONEY CLICK NOW WIN PRIZE") == "spam"
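
The first two properties can be written as tests too. A minimal sketch on the iris dataset, assuming a random forest for the row-order check and a single decision tree for the scaling check. The scale factor 2.0 is chosen as a power of two so the rescaling is exact in floating point:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)


def test_row_order_does_not_matter():
    model = RandomForestClassifier(random_state=42).fit(X, y)
    perm = np.random.default_rng(0).permutation(len(X))
    # Predicting on shuffled rows should give the same answers,
    # shuffled the same way.
    assert (model.predict(X[perm]) == model.predict(X)[perm]).all()


def test_tree_ignores_feature_scale():
    # Retrain on rescaled data: a tree splits on the ORDER of feature values,
    # which a positive scale factor preserves, so both trees should agree.
    scale = 2.0
    original = DecisionTreeClassifier(random_state=0).fit(X, y)
    rescaled = DecisionTreeClassifier(random_state=0).fit(X * scale, y)
    assert (original.predict(X) == rescaled.predict(X * scale)).all()
```

Note that the scaling property holds only when the tree is retrained on the scaled data. Feeding scaled inputs to a tree trained on unscaled data would change its predictions.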

Worked Example · Testing a Classifier Pipeline

12 min

# test_ml.py — pin down everything except the model's "opinions"
import numpy as np
import pytest
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

@pytest.fixture(scope="module")
def data():
    return load_iris(return_X_y=True)

@pytest.fixture(scope="module")
def model(data):
    X, y = data
    return RandomForestClassifier(random_state=42).fit(X, y)

def test_prediction_shape(model, data):
    X, y = data
    preds = model.predict(X)
    assert preds.shape == (len(X),)

def test_only_valid_classes(model, data):
    X, y = data
    assert set(model.predict(X)) <= {0, 1, 2}

def test_probabilities_sum_to_one(model, data):
    X, y = data
    probs = model.predict_proba(X)
    assert probs.sum(axis=1) == pytest.approx(np.ones(len(X)))

def test_reproducible(data):
    X, y = data
    a = RandomForestClassifier(random_state=42).fit(X, y).predict(X)
    b = RandomForestClassifier(random_state=42).fit(X, y).predict(X)
    assert (a == b).all()

def test_quality_bar(data):
    X, y = data
    acc = cross_val_score(RandomForestClassifier(random_state=42),
                          X, y, cv=5).mean()
    assert acc >= 0.90, f"iris should be easy; got {acc:.2f}"

$ pytest test_ml.py -v
test_prediction_shape PASSED
test_only_valid_classes PASSED
test_probabilities_sum_to_one PASSED
test_reproducible PASSED
test_quality_bar PASSED
5 passed

Read the diff

Not one test asserts "this flower is species 1". Instead: shape, valid classes, probabilities summing to 1, reproducibility under a seed, and a quality floor. These tests catch common ML bugs (wrong output shape, invalid class labels, non-reproducible training, a model that silently regressed below the bar) without being brittle about individual predictions.

Try It Yourself

13 min

01 🟢 Shape + range tests

For a model you trained in Level 5, test the prediction shape and that outputs are valid classes only.

02 🟡 Quality bar

Add a threshold test: CV accuracy ≥ some bar. Set the bar slightly below your real score so the test passes now but would fail on a big regression.

03 🔴 Metamorphic test

Write a test asserting an obvious case is classified correctly (a clearly-positive review → positive), and that shuffling input rows doesn't change the set of predictions.

Mini-Challenge · Catch a Data Leak with a Test

8 min

Write a test that flags suspiciously-perfect accuracy (≥ 0.99 on a hard problem) as a likely data-leak red flag, OR a test that asserts the label column is NOT present in the feature matrix. Catch the Level-5 leakage bug with a test.

Show one possible solution

def test_no_label_in_features():
    X, y = prepare(df, label="target")
    assert "target" not in X.columns        # leakage guard

def test_accuracy_not_suspiciously_perfect():
    X, y = prepare(df, label="target")
    acc = cross_val_score(model, X, y, cv=5).mean()
    # on a genuinely hard problem, ~1.0 usually means leakage
    assert acc < 0.99, "near-perfect accuracy; check for data leakage!"

Non-negotiables: a structural leak guard (label not in X) and/or a "too good to be true" sanity check. These catch the most damaging ML bug — leakage — automatically.
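
The column-name guard only works when features carry names. For plain arrays, a rough equivalent is to look for a feature column that reproduces the label exactly. The sketch below is one possible approach on synthetic data; `find_leaky_columns` is a hypothetical helper, and it catches only verbatim copies of the label, not transformed or partially leaked versions:

```python
import numpy as np


def find_leaky_columns(X, y):
    """Return indices of feature columns that reproduce the label exactly."""
    X = np.asarray(X)
    y = np.asarray(y)
    return [j for j in range(X.shape[1]) if np.array_equal(X[:, j], y)]


# Synthetic data: 50 rows, 3 random features, random binary labels.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=50)
clean = rng.normal(size=(50, 3))
leaky = np.column_stack([clean, y])  # label accidentally appended as column 3


def test_no_feature_copies_the_label():
    assert find_leaky_columns(clean, y) == []
```

Running `find_leaky_columns(leaky, y)` on the deliberately broken matrix flags column 3, which is the kind of failure the guard is meant to surface.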

Recap

3 min

Don't test exact predictions; test what must hold: prep correctness (shapes, no NaN, encoded labels), output shape/range, reproducibility under a seed, a quality threshold, and metamorphic properties / obvious cases. Add leakage guards. Many ML bugs are really data/pipeline bugs, which test like normal code. Next: end-to-end testing through a real browser.

Vocabulary Card

non-determinism: ML outputs vary with random seeds/retraining — so exact-equality tests are brittle.
seed: Fixing randomness (random_state) to make training reproducible in tests.
threshold test: Asserting a metric clears a bar rather than equals a value.
metamorphic test: Asserting how the output must respond to a known change in the input (shuffling rows only reorders the predictions).

Homework

4 min

Take a Level-5 ML model. Write a test suite covering: prep shapes + no-NaN, output shape + valid classes, reproducibility (seed), a quality threshold, a metamorphic test, an obvious-case test, and a leakage guard. Apart from the obvious case, no test should assert an individual prediction.
