PY-L5-09 · ML 101 — Train, Test, Predict

Learning Goals

3 min

Install scikit-learn; load a built-in dataset.
Follow the universal recipe: model.fit(X, y) → model.predict(X_new) → model.score(X, y).
Understand why we split data into train and test sets.
Swap one model for another by changing a single line.

Warm-Up · The Recipe

5 min

pip install scikit-learn
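A quick way to confirm the install worked is to import the package and print its version. Note that the package installs as scikit-learn but is imported as sklearn; the version you see depends on your environment.

```python
# Sanity check: if this import fails, the install did not succeed.
import sklearn

version = sklearn.__version__
print("scikit-learn", version)
```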

Every supervised model in scikit-learn:

  model = SomeModel()        # 1. choose
  model.fit(X_train, y_train) # 2. learn from training data
  preds = model.predict(X_test) # 3. predict on new data
  model.score(X_test, y_test)  # 4. how good?

Today's big idea

scikit-learn gives every algorithm the same interface. Learn fit / predict / score once and KNN, trees, forests, regression all work identically. The hard part is the data, not the API.
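As a rough sketch of how far the shared interface reaches, the same four steps also drive a regression problem. The dataset (load_diabetes) and the model (LinearRegression) are chosen here purely for illustration and are not part of this lesson's own examples.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Illustrative regression dataset: numeric target instead of class labels.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()          # 1. choose
model.fit(X_train, y_train)         # 2. learn from training data
preds = model.predict(X_test)       # 3. predict on new data
r2 = model.score(X_test, y_test)    # 4. how good? (R^2 for regressors)

print(f"score():    {r2:.3f}")
print(f"r2_score(): {r2_score(y_test, preds):.3f}")  # same number, computed explicitly
```

Only the dataset and the constructor differ from the classification recipe; fit, predict and score are called exactly the same way.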

New Concept · fit, predict, score

14 min

Load data

from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
X = iris.data       # 150 × 4 features
y = iris.target     # 0,1,2 — three species
print(X.shape, y.shape)
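Before modelling, it helps to look at what was loaded. A minimal sketch, assuming pandas is installed (as_frame=True needs it):

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

print(X.head())                        # first five rows of measurements
print(list(X.columns))                 # the four feature names
print(list(iris.target_names))         # species names for labels 0, 1, 2
print(y.value_counts().sort_index())   # how many samples per class
```

The class counts matter later: a perfectly balanced dataset like this one makes plain accuracy a reasonable first metric.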

Split into train and test

Never test on data the model trained on — it would just recite memorised answers. Hold some back:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# 80% to learn from, 20% kept secret for the exam
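To see why the held-back data matters, the sketch below scores a model on the rows it trained on and on the rows it never saw. It uses a 1-nearest-neighbour classifier (KNeighborsClassifier, introduced in the next step) because each training row's nearest neighbour is itself, so it can simply look up the memorised answer. The choice of k=1 is an illustration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)   # 120 rows to learn from, 30 held back

# k=1: every training row is its own nearest neighbour.
memoriser = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_acc = memoriser.score(X_train, y_train)  # graded on questions it has seen
test_acc = memoriser.score(X_test, y_test)     # graded on unseen questions
print(f"train accuracy: {train_acc:.1%}")
print(f"test accuracy:  {test_acc:.1%}")
```

The training score says nothing about new flowers; only the test score estimates how the model behaves on data it has not seen.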

fit — learn the pattern

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

predict — apply it to new data

preds = model.predict(X_test)
print(preds[:5])         # e.g. [1 0 2 1 1]
print(y_test[:5].values) # the true answers

score — measure accuracy

acc = model.score(X_test, y_test)
print(f"accuracy: {acc:.1%}")    # e.g. 96.7%

For classifiers, score returns accuracy (fraction correct). For regressors, it returns R² (Lesson 15). Same method, sensible default per task.
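To make "fraction correct" concrete, the sketch below computes accuracy three ways: by hand from the predictions, with sklearn.metrics.accuracy_score, and with model.score. All three should agree.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
preds = model.predict(X_test)

by_hand = np.mean(preds == y_test)          # share of predictions matching the truth
by_metric = accuracy_score(y_test, preds)   # the same calculation as a library call
by_score = model.score(X_test, y_test)      # predict + accuracy in one step

print(by_hand, by_metric, by_score)
```

So score is a convenience: it calls predict internally and compares the result with the labels you pass in.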

Swapping models — one line

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()    # only this line changed
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

Worked Example · Compare Three Models

12 min

# compare.py — same data, three models, one loop
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "KNN (k=3)":      KNeighborsClassifier(n_neighbors=3),
    "Decision Tree":  DecisionTreeClassifier(random_state=42),
    "Logistic Reg":   LogisticRegression(max_iter=200),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)
    print(f"  {name:<16} {acc:.1%}")

Sample output (illustrative; your exact figures may differ with the split and your scikit-learn version)

  KNN (k=3)        96.7%
  Decision Tree    100.0%
  Logistic Reg     100.0%

Read the diff

Three completely different algorithms, identical code — only the constructor changed. That's the power of scikit-learn's uniform interface. (Don't over-read the 100%: iris is famously easy, and the test set is tiny. Lesson 10 shows how to measure more honestly.)
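The "tiny test set" caveat can be checked directly. With 30 test flowers, each single mistake moves accuracy by about 3.3 percentage points, so the score depends noticeably on which rows happen to land in the test set. A small sketch that repeats the split with ten different random_state values (the range of seeds is an arbitrary choice for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = []
for seed in range(10):
    # Same model, same data, only the random split changes.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

print([f"{s:.1%}" for s in scores])
print(f"min {min(scores):.1%}, max {max(scores):.1%}")
```

If the minimum and maximum differ, a single split's number should be read as one sample from that range rather than as the model's true accuracy.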

Try It Yourself

13 min

01 🟢 Your first model

Load iris, split, fit a KNN, print the test accuracy.

02 🟡 Predict a new flower

After fitting, predict the species of a flower with measurements [5.1, 3.5, 1.4, 0.2]. Map the number back to a species name.

Hint

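import numpy as np
names = load_iris().target_names
# predict expects a 2-D array: one row per flower, hence the double brackets
p = model.predict(np.array([[5.1, 3.5, 1.4, 0.2]]))
# (if you fitted on a DataFrame, scikit-learn may warn about missing feature names; the prediction is still valid)
print(names[p[0]])    # 'setosa'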

03 🔴 The k sweep

Loop n_neighbors from 1 to 20, record test accuracy for each, and report the best k.

Hint

best_k, best_acc = None, 0
for k in range(1, 21):
    m = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = m.score(X_test, y_test)
    if acc > best_acc:
        best_k, best_acc = k, acc
print(f"best k = {best_k} ({best_acc:.1%})")
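On a small test set several values of k often tie for the top score, and the loop above silently reports only the smallest of them. A variant that keeps every result and lists all the tied winners might look like this (the names accs and tied are hypothetical, introduced only for this sketch):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Record the test accuracy for every k instead of only the running best.
accs = {}
for k in range(1, 21):
    m = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accs[k] = m.score(X_test, y_test)

best_acc = max(accs.values())
tied = [k for k, acc in accs.items() if acc == best_acc]  # every k that reaches the top score

print(f"best accuracy: {best_acc:.1%}")
print(f"reached by k = {tied}")
```

A long list of tied values is a hint that this test set is too small to separate the candidates.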

Mini-Challenge · The Model Leaderboard

8 min

Build a function leaderboard(X, y, models) that splits the data, fits each model, and prints a sorted accuracy table. Test it on the wine dataset (load_wine) with four different models.

Show one possible solution

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def leaderboard(X, y, models):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)
    rows = []
    for name, m in models.items():
        m.fit(Xtr, ytr)
        rows.append((name, m.score(Xte, yte)))
    for name, acc in sorted(rows, key=lambda r: -r[1]):
        print(f"  {name:<18} {acc:.1%}")

X, y = load_wine(return_X_y=True)
leaderboard(X, y, {
    "KNN":            KNeighborsClassifier(),
    "Decision Tree":  DecisionTreeClassifier(random_state=0),
    "Random Forest":  RandomForestClassifier(random_state=0),
    "Logistic Reg":   LogisticRegression(max_iter=5000),
})

Non-negotiables: one split shared across all models, sorted output. Notice Random Forest usually tops the board — we'll see why in Lesson 14.

Recap

3 min

The universal recipe: fit(X_train, y_train), predict(X_test), score(X_test, y_test). Always split train/test so you measure generalisation, not memorisation. Swapping algorithms is a one-line change. Tomorrow we make the split smarter with cross-validation.

Vocabulary Card

fit: Train the model — learn parameters from training data.
predict: Apply the trained model to new inputs.
score: Default metric — accuracy for classifiers, R² for regressors.
train/test split: Holding back data the model never saw, to measure real performance.

Homework

4 min

Use your own prepped dataset (or load_breast_cancer). Run the leaderboard with at least three models. Write down: which model won, the accuracy, and one reason the others might have lost.

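# Assumes leaderboard() and the model imports from the Mini-Challenge solution are already defined.
from sklearn.datasets import load_breast_cancer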
X, y = load_breast_cancer(return_X_y=True)
leaderboard(X, y, {
    "KNN":           KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Reg":  LogisticRegression(max_iter=5000),
})
# Typically Logistic Reg / RF ~96%, KNN lower because it's
# sensitive to unscaled features (fix that in Lesson 18).
