Learning Goals
3 min
- Explain KNN in one sentence: classify by majority vote of nearest neighbours.
- Understand why distance means features must be scaled.
- Pick k and see how it changes the decision boundary.
- Build a scale-then-KNN pipeline.
Warm-Up · Ask the Neighbours
5 min
New point: ?
Its 5 nearest neighbours: 🐱 🐱 🐶 🐱 🐶
Majority vote → 🐱 (3 cats vs 2 dogs)
That's the whole algorithm. "Tell me your friends and I'll tell you who you are." KNN doesn't really "train" — it just memorises the data and computes distances at prediction time.
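The vote itself is one line of Python. A minimal sketch of the warm-up, assuming the five neighbours have already been found (the labels are the warm-up's cats and dogs, written as strings):

```python
from collections import Counter

# Labels of the 5 nearest neighbours from the warm-up, nearest first.
neighbours = ["cat", "cat", "dog", "cat", "dog"]

votes = Counter(neighbours)               # count how often each label appears
winner, count = votes.most_common(1)[0]   # label with the most votes

print(winner, count)  # cat 3
```

Finding those neighbours is the part that needs distances, which is where scaling comes in.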
KNN classifies by proximity. Because it's all about distance, a feature measured in big units (salary in thousands) will dominate one in small units (age). You MUST scale features first, or the big-number feature wins by accident.
New Concept · KNN + Scaling
14 min
The plain version
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Xtr, ytr)
print(knn.score(Xte, yte))
Why scaling matters
Two features:
  age (20-60)
  salary (20000-100000)
Distance is dominated by salary; age barely counts.
After scaling both to mean 0 / std 1, they contribute equally.
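To make the age/salary point concrete, here is a small sketch with made-up numbers. The population is randomly generated within the ranges above, and the three people (query, a, b) are hypothetical; none of this comes from a real dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical population: age in 20-60, salary in 20000-100000 (random, for illustration only).
rng = np.random.default_rng(0)
population = np.column_stack([
    rng.uniform(20, 60, size=500),
    rng.uniform(20_000, 100_000, size=500),
])

query = np.array([25.0, 50_000.0])  # the new point
a = np.array([60.0, 51_000.0])      # very different age, almost the same salary
b = np.array([26.0, 58_000.0])      # almost the same age, somewhat different salary

def euclid(p, q):
    """Plain Euclidean distance between two points."""
    return float(np.sqrt(((p - q) ** 2).sum()))

# Raw units: the salary gap decides everything, so a looks closer.
raw_a, raw_b = euclid(query, a), euclid(query, b)

# Standardised units: both features count, so b looks closer.
scaler = StandardScaler().fit(population)
q_s, a_s, b_s = scaler.transform(np.vstack([query, a, b]))
scaled_a, scaled_b = euclid(q_s, a_s), euclid(q_s, b_s)

print("raw    : a =", round(raw_a, 1), " b =", round(raw_b, 1))
print("scaled : a =", round(scaled_a, 2), " b =", round(scaled_b, 2))
```

In raw units the 35-year age gap to a is invisible next to a 1000 salary gap; after scaling, the nearest neighbour flips from a to b.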
The right way — a Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

clf = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5),
)
clf.fit(Xtr, ytr)
print(clf.score(Xte, yte))
A Pipeline chains steps so the scaler is fit on training data only and applied consistently — this prevents data leakage from the test set.
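One way to convince yourself of this is to look inside the fitted pipeline. A sketch, assuming the same iris split as above: the scaler's learned means should equal the training-set means, not the means of the full dataset. The variable names uses_train_only and matches_full_data are just for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
clf.fit(Xtr, ytr)

# make_pipeline names each step after its lowercased class name.
scaler = clf.named_steps["standardscaler"]

# The scaler only ever saw Xtr, so its means are the training means.
uses_train_only = np.allclose(scaler.mean_, Xtr.mean(axis=0))
matches_full_data = np.allclose(scaler.mean_, X.mean(axis=0))

print("means come from training data :", uses_train_only)
print("means match the full dataset  :", matches_full_data)
```

When clf.score(Xte, yte) runs, the test rows are transformed with those training statistics and never refit.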
Choosing k
small k (1): very flexible, follows noise (overfits)
large k (50): very smooth, may ignore real structure (underfits)
sweet spot: found by cross-validation
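A quick way to see both extremes is to compare training accuracy with cross-validated accuracy. This is a sketch on iris; the k values (1, 5, 50) and the use of cross_validate with return_train_score are illustrative choices, not the lesson's own code.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

summary = {}  # k -> (mean training accuracy, mean validation accuracy)
for k in (1, 5, 50):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    res = cross_validate(model, X, y, cv=5, return_train_score=True)
    summary[k] = (res["train_score"].mean(), res["test_score"].mean())
    print(f"k={k:>2}  train={summary[k][0]:.3f}  cv={summary[k][1]:.3f}")
```

With k=1 every training point is its own nearest neighbour, so training accuracy looks near-perfect regardless of how well the model generalises. The cross-validated column is the one to trust when picking k.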
Odd k avoids ties
For two classes, use an odd k so the vote can't tie.
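A tiny example of a tied vote, using four hypothetical 1-D points chosen so that no two distances are equal. predict_proba shows the share of neighbours voting for each class.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up 1-D training data: two points per class.
X = np.array([[0.0], [1.0], [2.5], [4.5]])
y = np.array([0, 0, 1, 1])
query = np.array([[2.0]])  # distances to the four points: 2.0, 1.0, 0.5, 2.5

# k=2: one neighbour from each class, so the vote is split 50/50.
proba_k2 = KNeighborsClassifier(n_neighbors=2).fit(X, y).predict_proba(query)[0]

# k=3: the third neighbour breaks the tie in favour of class 0.
proba_k3 = KNeighborsClassifier(n_neighbors=3).fit(X, y).predict_proba(query)[0]

print("k=2 vote shares:", proba_k2)
print("k=3 vote shares:", proba_k3)
```

With an even k the classifier still returns some label, but the choice is decided by a tie-breaking rule rather than by the data. With more than two classes an odd k no longer guarantees a clear winner.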
Worked Example · Scale Saves the Day
12 min
# knn_scaling.py: the dramatic difference scaling makes
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)  # features on wildly different scales

raw = KNeighborsClassifier(n_neighbors=5)
piped = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

print("KNN, no scaling :", cross_val_score(raw, X, y, cv=5).mean().round(3))
print("KNN, scaled :", cross_val_score(piped, X, y, cv=5).mean().round(3))
Sample output
KNN, no scaling : 0.691
KNN, scaled : 0.961
Read the diff
Same algorithm, same k, same data: scaling raised cross-validated accuracy from 69% to 96%. The wine features include "proline" (hundreds to over a thousand) and "hue" (around 1); without scaling, proline drowned out everything else. This single habit resolves many "why is my KNN bad?" questions.
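You can check the scale gap directly. A short sketch that compares the spread of proline and hue before and after StandardScaler (the index lookups by feature name are just a convenience for this illustration):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine = load_wine()
names = list(wine.feature_names)
X = wine.data

i_proline, i_hue = names.index("proline"), names.index("hue")

raw_std = X.std(axis=0)                                     # spread in original units
scaled_std = StandardScaler().fit_transform(X).std(axis=0)  # spread after scaling

print("raw std    proline:", raw_std[i_proline].round(2), " hue:", raw_std[i_hue].round(2))
print("scaled std proline:", scaled_std[i_proline].round(2), " hue:", scaled_std[i_hue].round(2))
```

Before scaling, a typical difference in proline is orders of magnitude larger than a typical difference in hue, so it dominates the squared distance. After scaling, every feature has a spread of 1.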
Try It Yourself
13 min
1. Build a scale+KNN pipeline on iris. Print CV accuracy.
2. Sweep k=1..20 inside the pipeline using cross-validation. Plot CV accuracy vs k; report the best.
3. Implement knn_predict(X_train, y_train, x_new, k) from scratch with NumPy (distances → k smallest → majority vote). Check it matches sklearn on a few points.
Hint
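import numpy as np
from collections import Counter

def knn_predict(Xtr, ytr, x, k=5):
    # Euclidean distance from x to every training row
    d = np.sqrt(((Xtr - x) ** 2).sum(axis=1))
    # labels of the k closest training points
    nearest = ytr[np.argsort(d)[:k]]
    # majority vote
    return Counter(nearest).most_common(1)[0][0]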
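For the "check it matches sklearn" step, one possible sketch is below. It repeats the hint's function so it runs on its own, and it assumes unscaled iris with the same split as earlier. The name agreement is made up for this example.

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_predict(Xtr, ytr, x, k=5):
    d = np.sqrt(((Xtr - x) ** 2).sum(axis=1))     # distances to all training rows
    nearest = ytr[np.argsort(d)[:k]]              # labels of the k closest
    return Counter(nearest).most_common(1)[0][0]  # majority vote

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

# Reference predictions from sklearn with the same k and default Euclidean metric.
sk = KNeighborsClassifier(n_neighbors=5).fit(Xtr, ytr)
sk_pred = sk.predict(Xte)

# Our from-scratch predictions, one test point at a time.
ours = np.array([knn_predict(Xtr, ytr, x, k=5) for x in Xte])

agreement = float((ours == sk_pred).mean())
print("fraction of test points where both agree:", agreement)
```

Expect near-total agreement. The two can still differ on the occasional point where votes or distances tie, because each implementation breaks ties in its own way.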
Mini-Challenge · Decision Boundary
8 min
Using two features of iris, plot the KNN decision boundary for k=1 and k=15 side by side. See how small k makes a jagged boundary (overfit) and large k makes a smooth one.
Show the structure
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X = load_iris().data[:, :2]  # 2 features so we can plot
y = load_iris().target

xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200),
)
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(11, 4))
for ax, k in zip(axes, [1, 15]):
    m = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    Z = m.predict(grid).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k", s=20)
    ax.set_title(f"k = {k}")
fig.savefig("boundary.png", dpi=130)
Non-negotiables: a mesh grid, predict over it, compare two k values. Small k = wiggly (overfit), large k = smooth (may underfit).
Recap
3 min
KNN classifies by majority vote of the k nearest neighbours. It relies on distance, so always scale features (a Pipeline with StandardScaler does it safely). Small k overfits and large k underfits, so tune k with CV and use an odd k for two classes. Next: a model that learns explicit rules, the decision tree.
Vocabulary Card
- KNN: K-Nearest Neighbours. Predict from the majority class of the k closest training points.
- StandardScaler: Scales each feature to mean 0, std 1. Essential for distance-based models.
- Pipeline: Chains preprocessing + model so steps run in order and fit only on training data.
- decision boundary: The line/surface where the predicted class changes.
Homework
4 min
On any dataset with features on different scales, show the before/after: KNN CV accuracy without scaling vs with a pipeline. Find the best k via CV. Write one sentence on what scaling did for you.
Reuse knn_scaling.py and add the k-sweep from Try-It #2. The sentence should note the accuracy gain scaling produced.