PY-L5-12 · Your First Classifier — K-Nearest Neighbours

Learning Goals

3 min

Explain KNN in one sentence: classify by majority vote of nearest neighbours.
Understand why a distance-based model needs its features scaled.
Pick k and see how it changes the decision boundary.
Build a scale-then-KNN pipeline.

Warm-Up · Ask the Neighbours

5 min

New point: ?
Its 5 nearest neighbours: 🐱 🐱 🐶 🐱 🐶
Majority vote → 🐱 (3 cats vs 2 dogs)

That's the whole algorithm. "Tell me your friends and I'll tell you who you are." KNN doesn't really "train" — it just memorises the data and computes distances at prediction time.
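The vote from the warm-up can be sketched in a few lines. This is only an illustration: the `neighbours` list is typed in by hand (the strings stand in for the emoji), whereas a real KNN would first find those neighbours by distance.

```python
from collections import Counter

# Labels of the 5 nearest neighbours from the warm-up, typed in by hand.
neighbours = ["cat", "cat", "dog", "cat", "dog"]

votes = Counter(neighbours)                # tally the labels: cat=3, dog=2
winner, count = votes.most_common(1)[0]    # the most frequent label wins
print(winner, count)                       # prints: cat 3
```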

Today's big idea

KNN classifies by proximity. Because everything depends on distance, a feature with large numeric values (salary, in the tens of thousands) will dominate one with small values (age, in the tens). You MUST scale features first, or the big-number feature wins by accident.

New Concept · KNN + Scaling

14 min

The plain version

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Xtr, ytr)
print(knn.score(Xte, yte))

Why scaling matters

Two features:  age (20-60)   salary (20000-100000)
Distance is dominated by salary — age barely counts.
After scaling both to mean 0 / std 1, they contribute equally.
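To see this with numbers, here is a small sketch. The five people are made up for illustration (they are not from any dataset); the ranges match the age and salary ranges above. Person B has a very different age from A but a similar salary, while C is almost the same age with a somewhat different salary.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical people, columns are [age, salary].
people = np.array([
    [25,  50_000],   # A
    [60,  51_000],   # B: very different age, similar salary
    [26,  58_000],   # C: almost the same age, different salary
    [40,  20_000],
    [35, 100_000],
], dtype=float)

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw units: the salary difference swamps the age difference.
raw_ab = dist(people[0], people[1])
raw_ac = dist(people[0], people[2])

# After scaling each column to mean 0 / std 1, age counts too.
scaled = StandardScaler().fit_transform(people)
scaled_ab = dist(scaled[0], scaled[1])
scaled_ac = dist(scaled[0], scaled[2])

print("raw    A-B:", round(raw_ab, 1), " A-C:", round(raw_ac, 1))
print("scaled A-B:", round(scaled_ab, 2), " A-C:", round(scaled_ac, 2))
```

On raw values B looks like A's nearer neighbour despite the 35-year age gap; after scaling, C is the nearer one.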

The right way — a Pipeline

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

clf = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5),
)
clf.fit(Xtr, ytr)
print(clf.score(Xte, yte))

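A Pipeline chains the steps so that calling fit learns the scaler's mean and standard deviation from the training data only, then applies that same transform at prediction time. The test set never influences the scaling, which prevents data leakage; inside cross-validation, the scaler is refit on each training fold.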
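One way to convince yourself is to inspect the fitted scaler inside the pipeline. The sketch below assumes the same iris split as above; `make_pipeline` names each step after its lowercased class name, so the scaler is reachable as `"standardscaler"`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
clf.fit(Xtr, ytr)

# Pull the fitted scaler back out of the pipeline.
scaler = clf.named_steps["standardscaler"]

# The stored per-feature means should equal the training-set means.
uses_train_only = np.allclose(scaler.mean_, Xtr.mean(axis=0))
print("scaler fitted on training rows only:", uses_train_only)
```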

Choosing k

small k (1)   very flexible, follows noise (overfits)
large k (50)  very smooth, may ignore real structure (underfits)
sweet spot    found by cross-validation
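As one possible way to find that sweet spot, the sketch below uses `GridSearchCV` on the wine dataset. The list of candidate k values is an arbitrary choice for illustration, and a plain loop over `cross_val_score` would work just as well.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# "<step name>__<parameter>" reaches into a pipeline step.
# The candidate values are an arbitrary illustrative choice.
param_grid = {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9, 11, 15, 21]}

search = GridSearchCV(pipe, param_grid, cv=5)   # 5-fold CV for every candidate k
search.fit(X, y)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print("best k:", best_k, "CV accuracy:", round(search.best_score_, 3))
```

Because the scaler sits inside the pipeline, it is refit on each training fold during the search.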

Odd k avoids ties

For two classes, use an odd k so the vote can't tie. With three or more classes (iris has three), an odd k can still tie, for example 2-2-1 when k=5.

Worked Example · Scale Saves the Day

12 min

# knn_scaling.py — the dramatic difference scaling makes
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)   # features on wildly different scales

raw   = KNeighborsClassifier(n_neighbors=5)
piped = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

print("KNN, no scaling :", cross_val_score(raw,   X, y, cv=5).mean().round(3))
print("KNN, scaled     :", cross_val_score(piped, X, y, cv=5).mean().round(3))

Sample output

KNN, no scaling : 0.691
KNN, scaled     : 0.961

Read the diff

Same algorithm, same k, same data: scaling lifted accuracy from 69% to 96%. The wine features include "proline" (hundreds) and "hue" (around 1); without scaling, proline drowned out everything else. When KNN scores poorly, check for unscaled features first.
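To check the scale gap yourself, a short sketch like the following prints the range of the two features mentioned above. It assumes the feature names `"proline"` and `"hue"` as they appear in `load_wine().feature_names`.

```python
from sklearn.datasets import load_wine

wine = load_wine()
names = list(wine.feature_names)

ranges = {}
for name in ["proline", "hue"]:
    col = wine.data[:, names.index(name)]        # pick the column by name
    ranges[name] = col.max() - col.min()
    print(f"{name:8s} min={col.min():8.2f}  max={col.max():8.2f}  std={col.std():8.2f}")

# How many times wider proline's range is than hue's.
print("range ratio:", round(ranges["proline"] / ranges["hue"]))
```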

Try It Yourself

13 min

01 🟢 Pipeline KNN

Build a scale+KNN pipeline on iris. Print CV accuracy.

02 🟡 Best k via CV

Sweep k=1..20 inside the pipeline using cross-validation. Plot CV accuracy vs k; report the best.

03 🔴 Hand-rolled KNN

Implement knn_predict(X_train, y_train, x_new, k) from scratch with NumPy (distances → k smallest → majority vote). Check it matches sklearn on a few points.

Hint

import numpy as np
from collections import Counter

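def knn_predict(Xtr, ytr, x, k=5):
    # Euclidean distance from x to every training row
    d = np.sqrt(((Xtr - x) ** 2).sum(axis=1))
    # labels of the k closest rows
    nearest = ytr[np.argsort(d)[:k]]
    # the most frequent label wins the vote
    return Counter(nearest).most_common(1)[0][0]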
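For the "check it matches sklearn" part, one possible comparison is sketched below. It repeats the hint function so the snippet runs on its own and reuses the iris split from earlier. Agreement may fall slightly short of 100% when votes or distances tie, because the two implementations are not guaranteed to break ties the same way.

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_predict(Xtr, ytr, x, k=5):
    d = np.sqrt(((Xtr - x) ** 2).sum(axis=1))     # distance to every training row
    nearest = ytr[np.argsort(d)[:k]]              # labels of the k closest rows
    return Counter(nearest).most_common(1)[0][0]  # majority vote

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

# Predict every test point with both implementations.
ours = np.array([knn_predict(Xtr, ytr, x, k=5) for x in Xte])
theirs = KNeighborsClassifier(n_neighbors=5).fit(Xtr, ytr).predict(Xte)

agreement = (ours == theirs).mean()
print(f"agreement with sklearn: {agreement:.0%}")
```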

Mini-Challenge · Decision Boundary

8 min

Using two features of iris, plot the KNN decision boundary for k=1 and k=15 side by side. See how small k makes a jagged boundary (overfit) and large k makes a smooth one.

Show the structure

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()            # load once, reuse for data and target
X = iris.data[:, :2]          # 2 features so we can plot
y = iris.target

xx, yy = np.meshgrid(np.linspace(X[:,0].min()-1, X[:,0].max()+1, 200),
                     np.linspace(X[:,1].min()-1, X[:,1].max()+1, 200))
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(11, 4))
for ax, k in zip(axes, [1, 15]):
    m = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    Z = m.predict(grid).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)
    ax.scatter(X[:,0], X[:,1], c=y, edgecolor="k", s=20)
    ax.set_title(f"k = {k}")
fig.savefig("boundary.png", dpi=130)

Non-negotiables: a mesh grid, predict over it, compare two k values. Small k = wiggly (overfit), large k = smooth (may underfit).

Recap

3 min

KNN classifies by majority vote of the k nearest neighbours. It relies on distance, so always scale features (a Pipeline with StandardScaler does it safely). Small k overfits, large k underfits — tune with CV, use odd k for two classes. Next: a model that learns explicit rules — the decision tree.

Vocabulary Card

KNN: K-Nearest Neighbours — predict from the majority class of the k closest training points.
StandardScaler: Scales each feature to mean 0, std 1 — essential for distance-based models.
Pipeline: Chains preprocessing + model so steps run in order and fit only on training data.
decision boundary: The line/surface where the predicted class changes.

Homework

4 min

On any dataset with features on different scales, show the before/after: KNN CV accuracy without scaling vs with a pipeline. Find the best k via CV. Write one sentence on what scaling did for you.
