Learning Goals
3 min
- Explain how K-Means iterates: assign → move centres → repeat.
- Run KMeans, read cluster labels and centres.
- Pick k with the elbow method and silhouette score.
- Remember: scale features, and there's no "right" answer to grade against.
Warm-Up · Group the Dots
5 min
K-Means with k=3:
1. Drop 3 random "centres" onto the data.
2. Assign each point to its nearest centre.
3. Move each centre to the average of its points.
4. Repeat 2-3 until centres stop moving.
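The four steps above can be sketched in plain NumPy. This is a minimal illustration, not how scikit-learn implements K-Means: the function name kmeans_sketch, the choice to start from k randomly picked data points, and the made-up blob data are all assumptions for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain K-Means loop: assign points, move centres, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the starting centres (an assumption).
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: distance from every point to every centre, then nearest centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centre to the mean of its points
        # (a centre with no points stays where it is).
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        # Step 4: stop once the centres no longer move.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

# Three well-separated blobs of made-up 2-D points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ([0, 0], [5, 5], [0, 5])])
labels, centres = kmeans_sketch(X, k=3)
print(centres.round(2))
```

A single random start like this can settle on a poor grouping, which is one reason the library version restarts several times.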
No labels needed. The algorithm finds structure purely from how points cluster in space. Used for customer segmentation, image colour reduction, document grouping.
Unsupervised learning finds patterns without answers. You can't measure "accuracy" — there are no true labels. Instead you judge clusters by how tight and well-separated they are (inertia, silhouette) and by whether they make business sense.
New Concept · KMeans & Choosing k
14 min
Run it
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = StandardScaler().fit_transform(load_iris().data)  # scale first!
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels[:10])                # cluster id per point: e.g. [1 1 1 ...]
print(km.cluster_centers_.shape)  # (3, 4): 3 centres in 4-D
n_init=10 runs the algorithm from 10 different random starts and keeps the run with the lowest inertia, which protects against an unlucky start.
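To see why restarts help, the sketch below compares single-start runs with a ten-start run on the scaled iris data. It sets init="random" (scikit-learn's default is the smarter "k-means++" seeding) purely to make the start-to-start variation easier to see; the seeds and variable names are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = StandardScaler().fit_transform(load_iris().data)

# One purely random start per seed: the final inertia can differ from run to run.
single_runs = [
    KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X).inertia_
    for seed in range(5)
]
print([round(v, 1) for v in single_runs])

# Ten random starts; KMeans keeps the lowest-inertia result among them.
best_of_ten = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X).inertia_
print(round(best_of_ten, 1))
```

If the single-run inertias differ, some starts got stuck in a worse grouping; the ten-start run reports the best of its own ten attempts.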
The elbow method — picking k
import matplotlib.pyplot as plt

inertias = []
ks = range(1, 9)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # total within-cluster squared distance

plt.plot(ks, inertias, marker="o")
plt.xlabel("k")
plt.ylabel("inertia")
plt.title("Elbow method")
plt.savefig("elbow.png", dpi=150)
Inertia always falls as k rises. Look for the "elbow" — where adding clusters stops helping much. That bend is a good k.
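Inertia is just a sum of squared distances, so it can be recomputed by hand as a sanity check. The sketch below does that for k=3 on scaled iris and compares the result with km.inertia_; the variable names own_centre and manual_inertia are made up for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = StandardScaler().fit_transform(load_iris().data)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up each point's own cluster centre, then sum the squared distances.
own_centre = km.cluster_centers_[km.labels_]
manual_inertia = np.sum((X - own_centre) ** 2)

print(round(manual_inertia, 3), round(km.inertia_, 3))
```

The two numbers should agree up to floating-point rounding. With more centres, every point sits closer to one of them, which is why inertia alone cannot choose k.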
Silhouette score — a number for cluster quality
from sklearn.metrics import silhouette_score

for k in [2, 3, 4, 5]:
    lbl = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, lbl), 3))
Silhouette ranges -1 to 1; higher = tighter, better-separated clusters. Pick the k with the highest silhouette.
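The single silhouette number is an average over all points, so it can hide one weak cluster. As a rough sketch, silhouette_samples returns the per-point values, which can then be averaged per cluster; the variable names here are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = StandardScaler().fit_transform(load_iris().data)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# One silhouette value per point, each between -1 and 1.
per_point = silhouette_samples(X, labels)

# Average per cluster: a low value flags a loose or overlapping cluster.
for c in np.unique(labels):
    print(c, round(per_point[labels == c].mean(), 3))

# The overall score is the mean of the per-point values.
overall = silhouette_score(X, labels)
print(round(overall, 3))
```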
Reading clusters
Clusters are just IDs (0, 1, 2) — they have no inherent meaning. You interpret them: "cluster 2 is high-spend, low-frequency customers." That interpretation is where the value is.
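One practical aid to interpretation: centres fitted on scaled data are in standard-deviation units, which are hard to read. A minimal sketch, assuming the scaler object is kept around, converts them back to the original units with inverse_transform (iris is used here only as a stand-in dataset).

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

iris = load_iris()
scaler = StandardScaler().fit(iris.data)  # keep the scaler so it can be undone later
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaler.transform(iris.data))

# Undo the scaling so each centre reads in the original units (cm for iris).
centres = pd.DataFrame(
    scaler.inverse_transform(km.cluster_centers_),
    columns=iris.feature_names,
).round(2)
print(centres)
```

Each row is one cluster's "typical" sample, which is usually the quickest starting point for naming it.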
Worked Example · Segment & Profile
12 min
# segment.py: cluster customers and profile each group
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# fake customer data: annual spend, visits per month
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "spend": np.r_[rng.normal(200, 40, 60), rng.normal(800, 80, 60), rng.normal(450, 50, 60)],
    "visits": np.r_[rng.normal(2, 0.6, 60), rng.normal(3, 0.8, 60), rng.normal(12, 2, 60)],
})

Xs = StandardScaler().fit_transform(df)
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xs)

# profile: average behaviour per cluster
profile = df.groupby("cluster").agg(
    n=("spend", "size"),
    avg_spend=("spend", "mean"),
    avg_visits=("visits", "mean"),
).round(1)
print(profile)
Sample output
          n  avg_spend  avg_visits
cluster
0        60      201.3         2.0   ← occasional low spenders
1        60      798.7         3.0   ← big-ticket rare buyers
2        60      450.2        12.1   ← frequent mid spenders
Read the diff
K-Means found three groups with no labels at all. The profiling step (groupby on the cluster id) is what turns anonymous IDs into actionable segments: "cluster 2 visits often — target them with a loyalty programme." Clustering finds the groups; you give them meaning.
Try It Yourself
13 min
Cluster iris (ignore the labels) into 3 groups using 2 features. Scatter the points coloured by cluster, and mark the centres.
Plot the elbow curve and print silhouette scores for k=2..6. Do they agree on the best k?
Load a photo (Lesson 5), reshape pixels to (n, 3), cluster colours into k=8, and replace each pixel with its cluster centre colour. You've compressed the palette to 8 colours.
Hint
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

arr = np.array(Image.open("cat.jpg").convert("RGB"))  # force 3 channels so the reshape works
pixels = arr.reshape(-1, 3)
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(pixels)
# replace each pixel with the colour of its cluster centre
new = km.cluster_centers_[km.labels_].astype(np.uint8).reshape(arr.shape)
Mini-Challenge · Auto-Pick k
8 min
Write best_k(X, k_range) that returns the k with the highest silhouette score, plus a one-line profile of the resulting clusters. Test it on any dataset.
Show one possible solution
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k(X, k_range=range(2, 8)):
    scores = {}
    for k in k_range:
        lbl = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, lbl)
    best = max(scores, key=scores.get)
    print("silhouettes:", {k: round(s, 3) for k, s in scores.items()})
    print(f"best k = {best} (silhouette {scores[best]:.3f})")
    return best
Non-negotiables: scale before clustering, use silhouette to choose, return the best k. There's no "accuracy" here — cluster quality is the score.
Recap
3 min
K-Means finds k cluster centres by iterating assign-and-move. No labels, so no accuracy: judge with the elbow (inertia), silhouette score, and business sense. Always scale features. Clusters are anonymous IDs until you profile and name them. That ends the classic-ML toolkit; next we engineer features, then build two full projects.
Vocabulary Card
- clustering: Unsupervised grouping of similar samples, with no labels.
- centroid: A cluster's centre, i.e. the mean of its assigned points.
- inertia: Total squared distance from points to their centroids; lower = tighter clusters.
- silhouette: A score from −1 to 1 of how well-separated clusters are; higher is better.
Homework
4 min
Take any unlabelled (or label-ignored) dataset. Use best_k to choose k, cluster, then profile and name each cluster in plain English. Submit the profile table + your cluster names.
Combine best_k with the profiling groupby from segment.py. The naming step is the point — "budget regulars", "big spenders", etc.