PY-L5-17 · Clustering — K-Means (Unsupervised)

Learning Goals

3 min

Explain how K-Means iterates: assign → move centres → repeat.
Run KMeans, read cluster labels and centres.
Pick k with the elbow method and silhouette score.
Remember: scale features, and there's no "right" answer to grade against.

Warm-Up · Group the Dots

5 min

K-Means with k=3:
1. Drop 3 random "centres" onto the data.
2. Assign each point to its nearest centre.
3. Move each centre to the average of its points.
4. Repeat 2-3 until centres stop moving.

No labels needed. The algorithm finds structure purely from how points cluster in space. Used for customer segmentation, image colour reduction, document grouping.
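The four warm-up steps can be sketched in plain NumPy. This is a teaching sketch, not scikit-learn's implementation: the function name kmeans_sketch, the choice to start from k randomly picked data points, and the toy blob data are all assumptions made for illustration.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain assign-and-move loop (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the starting centres (an assumption)
    centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: distance from every point to every centre, then nearest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centre to the mean of its points
        # (an empty cluster simply keeps its old centre)
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        # Step 4: stop once the centres no longer move
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

# toy data: three blobs in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ([0, 0], [5, 5], [0, 5])])
labels, centres = kmeans_sketch(X, k=3)
print(centres.round(2))
```

A single random start like this can land in a poor arrangement, which is one reason the library version restarts several times.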

Today's big idea

Unsupervised learning finds patterns without answers. You can't measure "accuracy" — there are no true labels. Instead you judge clusters by how tight and well-separated they are (inertia, silhouette) and by whether they make business sense.

New Concept · KMeans & Choosing k

14 min

Run it

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = StandardScaler().fit_transform(load_iris().data)   # scale first!

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels[:10])              # cluster id per point: e.g. [1 1 1 ...]
print(km.cluster_centers_.shape) # (3, 4) — 3 centres in 4-D

n_init=10 runs the algorithm from 10 different random starts and keeps the run with the lowest inertia, which guards against an unlucky start.
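To see why restarts help, one option is to run several single-start fits and compare their inertias. The sketch below assumes init="random" and seeds 0 to 9 purely for illustration; how much the runs differ depends on the data.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = StandardScaler().fit_transform(load_iris().data)

# One random start per seed: each run may settle in a different local optimum
single_runs = [
    KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X).inertia_
    for seed in range(10)
]
print("worst single start:", round(max(single_runs), 2))
print("best single start: ", round(min(single_runs), 2))

# Keeping the lowest-inertia run is the same selection rule n_init applies
best_of_ten = min(single_runs)
```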

The elbow method — picking k

import matplotlib.pyplot as plt

inertias = []
ks = range(1, 9)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)   # sum of squared distances from points to their centre

plt.plot(ks, inertias, marker="o")
plt.xlabel("k"); plt.ylabel("inertia"); plt.title("Elbow method")
plt.savefig("elbow.png", dpi=150)

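Inertia keeps falling as k rises (with one cluster per point it would reach zero), so the lowest value is not the goal. Look for the "elbow", the point where adding clusters stops helping much. That bend is a reasonable candidate for k.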
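When the bend is hard to see on the plot, printing the percentage drop at each step can help. This is only a rough aid under the same iris setup as above, not a formal rule for locating the elbow.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = StandardScaler().fit_transform(load_iris().data)

ks = list(range(1, 9))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# Sanity check: for standardised data, inertia at k=1 is n_samples * n_features
print("k=1 inertia:", round(inertias[0], 1))

# Percentage of inertia removed by going from k-1 to k clusters
for k, prev, cur in zip(ks[1:], inertias[:-1], inertias[1:]):
    drop = 100 * (prev - cur) / prev
    print(f"k={k}: inertia {cur:7.1f}  (down {drop:4.1f}% from k={k - 1})")
```

Large drops followed by small ones suggest the elbow sits where the drops shrink.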

Silhouette score — a number for cluster quality

from sklearn.metrics import silhouette_score
for k in [2, 3, 4, 5]:
    lbl = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, lbl), 3))

Silhouette ranges -1 to 1; higher = tighter, better-separated clusters. It needs at least two clusters, which is why the loop starts at k=2. Pick the k with the highest silhouette.
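The score is an average of per-point values. For one point, a is its mean distance to the other points in its own cluster, b is its mean distance to the nearest other cluster, and the point's silhouette is (b - a) / max(a, b). The sketch below works this out by hand for a single point (index 0, chosen arbitrarily) and compares it with scikit-learn's silhouette_samples.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

X = StandardScaler().fit_transform(load_iris().data)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

i = 0                                        # the point to inspect (arbitrary choice)
dists = np.linalg.norm(X - X[i], axis=1)     # distance from point i to every point

same = labels == labels[i]
same[i] = False                              # exclude the point itself
a = dists[same].mean()                       # mean distance inside its own cluster

# mean distance to each other cluster; keep the nearest one
b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])

s_manual = (b - a) / max(a, b)
s_sklearn = silhouette_samples(X, labels)[i]
print(round(s_manual, 4), round(s_sklearn, 4))
```

silhouette_score is simply the mean of these per-point values over the whole dataset.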

Reading clusters

Clusters are just IDs (0, 1, 2) — they have no inherent meaning. You interpret them: "cluster 2 is high-spend, low-frequency customers." That interpretation is where the value is.
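One practical aid to interpretation: because the data was scaled, the centres are in scaled units too. A minimal sketch, assuming you kept the fitted scaler, converts them back to the original units with inverse_transform so each centre can be read directly.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

iris = load_iris()
scaler = StandardScaler().fit(iris.data)     # keep the scaler so it can be undone later
X = scaler.transform(iris.data)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Centres live in scaled space; undo the scaling to read them in centimetres
centres_cm = pd.DataFrame(
    scaler.inverse_transform(km.cluster_centers_),
    columns=iris.feature_names,
).round(2)
centres_cm.index.name = "cluster"
print(centres_cm)
```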

Worked Example · Segment & Profile

12 min

# segment.py — cluster customers and profile each group
import numpy as np, pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# fake customer data: annual spend, visits per month
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "spend":  np.r_[rng.normal(200, 40, 60),  rng.normal(800, 80, 60),
                    rng.normal(450, 50, 60)],
    "visits": np.r_[rng.normal(2, 0.6, 60),   rng.normal(3, 0.8, 60),
                    rng.normal(12, 2, 60)],
})

Xs = StandardScaler().fit_transform(df)
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xs)

# profile: average behaviour per cluster
profile = df.groupby("cluster").agg(
    n=("spend", "size"),
    avg_spend=("spend", "mean"),
    avg_visits=("visits", "mean"),
).round(1)
print(profile)

Sample output

          n  avg_spend  avg_visits
cluster
0        60      201.3         2.0   ← occasional low spenders
1        60      798.7         3.0   ← big-ticket rare buyers
2        60      450.2        12.1   ← frequent mid spenders

Read the output

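K-Means found three groups with no labels at all. The profiling step (groupby on the cluster id) is what turns anonymous IDs into actionable segments: in the sample output, "cluster 2 visits often, so target them with a loyalty programme." The id numbers themselves are arbitrary and can change with a different random_state, so name segments from the profile rather than from the id. Clustering finds the groups; you give them meaning.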
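Since the ids carry no meaning, one way to make names stable is to rank clusters by a profiled value and attach names to the rank. The sketch below rebuilds the same style of fake data as segment.py; the segment names and the choice to rank by average spend are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# same style of fake data as segment.py
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "spend":  np.r_[rng.normal(200, 40, 60), rng.normal(800, 80, 60), rng.normal(450, 50, 60)],
    "visits": np.r_[rng.normal(2, 0.6, 60), rng.normal(3, 0.8, 60), rng.normal(12, 2, 60)],
})
Xs = StandardScaler().fit_transform(df)
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xs)

# Rank clusters by average spend so the name follows behaviour, not the raw id
order = df.groupby("cluster")["spend"].mean().sort_values().index
names = ["low spend", "mid spend", "high spend"]      # hypothetical segment names
df["segment"] = df["cluster"].map(dict(zip(order, names)))

print(df.groupby("segment")[["spend", "visits"]].mean().round(1))
```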

Try It Yourself

13 min

01 🟢 Cluster & plot

Cluster iris (ignore the labels) into 3 groups using 2 features. Scatter the points coloured by cluster, and mark the centres.

02 🟡 Elbow + silhouette

Plot the elbow curve and print silhouette scores for k=2..6. Do they agree on the best k?

03 🔴 Image colour quantisation

Load a photo (Lesson 5), reshape pixels to (n, 3), cluster colours into k=8, and replace each pixel with its cluster centre colour. You've compressed the palette to 8 colours.

Hint

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

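arr = np.array(Image.open("cat.jpg").convert("RGB"))   # force 3 channels (RGBA or greyscale would break the reshape)
pixels = arr.reshape(-1, 3)                            # one row per pixel: (n, 3)
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(pixels)
# swap every pixel for its cluster centre colour, then restore the image shape
new = km.cluster_centers_[km.labels_].round().astype(np.uint8).reshape(arr.shape)
Image.fromarray(new).save("cat_8_colours.png")         # output file name is arbitrary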

Mini-Challenge · Auto-Pick k

8 min

Write best_k(X, k_range) that returns the k with the highest silhouette score, plus a one-line profile of the resulting clusters. Test it on any dataset.

Show one possible solution

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k(X, k_range=range(2, 8)):
    scores = {}
    for k in k_range:
        lbl = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, lbl)
    best = max(scores, key=scores.get)
    print("silhouettes:", {k: round(s, 3) for k, s in scores.items()})
    print(f"best k = {best} (silhouette {scores[best]:.3f})")
    return best

Non-negotiables: scale before clustering (best_k expects an X that is already scaled), use silhouette to choose, return the best k. There's no "accuracy" here; cluster quality is the score.
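The challenge also asks for a one-line profile of the resulting clusters. A minimal sketch of that part is below; one_line_profile is a hypothetical helper, and k=2 is only a placeholder for whatever value best_k returns on your data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

def one_line_profile(labels):
    """Hypothetical helper: cluster sizes as a single string."""
    ids, counts = np.unique(labels, return_counts=True)
    return ", ".join(f"cluster {i}: n={n}" for i, n in zip(ids, counts))

X = StandardScaler().fit_transform(load_iris().data)    # scale before clustering
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)  # k=2 is a placeholder
print(one_line_profile(labels))
```

Cluster sizes are the simplest possible profile; per-cluster feature means, as in the worked example, say more.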

Recap

3 min

K-Means finds k cluster centres by iterating assign-and-move. No labels, so no accuracy — judge with the elbow (inertia), silhouette score, and business sense. Always scale features. Clusters are anonymous IDs until you profile and name them. That ends the classic-ML toolkit; next we engineer features, then build two full projects.

Vocabulary Card

clustering: Unsupervised grouping of similar samples, with no labels.
centroid: A cluster's centre — the mean of its assigned points.
inertia: Total squared distance from points to their centroids; lower = tighter clusters.
silhouette: A −1..1 score of how well-separated clusters are; higher is better.
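The inertia definition on the card can be checked by hand: look up the centroid assigned to each point, square the distances, and add them up. The sketch below does this on scaled iris and compares the result with the inertia_ attribute.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = StandardScaler().fit_transform(load_iris().data)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# centroid assigned to each point, then squared distances summed over all points
assigned = km.cluster_centers_[km.labels_]
inertia_manual = ((X - assigned) ** 2).sum()
print(round(inertia_manual, 3), round(km.inertia_, 3))
```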

Homework

4 min

Take any unlabelled (or label-ignored) dataset. Use best_k to choose k, cluster, then profile and name each cluster in plain English. Submit the profile table + your cluster names.
