Learning Goals
3 min
- Read a confusion matrix (TP, FP, FN, TN).
- Define accuracy, precision, recall, F1 — and when each matters.
- See why accuracy is misleading on imbalanced data.
- Use classification_report and confusion_matrix.
Warm-Up · The 99% Trap
5 min
A rare disease affects 1 in 100 people. A "model" that always predicts "healthy" is 99% accurate, yet it catches zero sick patients. Useless, despite the headline number.
y_true = [0]*99 + [1]   # 99 healthy, 1 sick
y_pred = [0]*100        # model says everyone is healthy
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.0%}")   # 99%, and it missed the one sick person
Accuracy alone hides the kind of mistakes a model makes. On imbalanced or high-stakes problems, you need precision and recall — they tell you which errors are happening.
New Concept · The Confusion Matrix
14 min
Four outcomes (for a yes/no model)
                     PREDICTED
                     positive              negative
ACTUAL   positive    TP ✅                 FN ❌ (missed it)
         negative    FP ❌ (false alarm)   TN ✅

TP = true positive    FN = false negative (miss)
FP = false positive   TN = true negative
The metrics
accuracy  = (TP + TN) / (TP + FP + FN + TN)                  "overall correctness"
precision = TP / (TP + FP)                                   "when it says YES, is it right?"
recall    = TP / (TP + FN)                                   "of all real YESes, how many caught?"
F1        = 2 * precision * recall / (precision + recall)    "harmonic mean of P and R, a balance of the two"
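As a rough sketch of how the four counts and the formulas above fit together, the snippet below tallies TP, FP, FN and TN by hand for the warm-up data. The helper names `confusion_counts` and `safe_div` are made up for this illustration and are not part of scikit-learn.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1          # said YES, was YES
        elif p == positive:
            fp += 1          # said YES, was NO (false alarm)
        elif t == positive:
            fn += 1          # said NO, was YES (miss)
        else:
            tn += 1          # said NO, was NO
    return tp, fp, fn, tn


def safe_div(num, den):
    """Return num / den, or None when the ratio is undefined (den == 0)."""
    return num / den if den else None


# Same toy data as the warm-up: 99 healthy (0), 1 sick (1), model always says 0.
y_true = [0] * 99 + [1]
y_pred = [0] * 100

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = safe_div(tp, tp + fp)   # no positive predictions, so undefined
recall = safe_div(tp, tp + fn)

print(tp, fp, fn, tn)                 # 0 0 1 99
print(accuracy, precision, recall)    # 0.99 None 0.0
```

Recall of 0.0 exposes what 99% accuracy hid. Precision is 0/0 here because the model never predicts the positive class; this sketch reports that as None, whereas scikit-learn by default warns and reports 0.0.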
Which to optimise?
- Recall matters most when missing a positive is costly: cancer screening, fraud, security alerts. Better a false alarm than a miss.
- Precision matters most when a false alarm is costly: spam filter (don't bin a real email), "you may be pregnant" ads.
- F1 when you need a single number balancing both.
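To see why F1 uses the harmonic mean rather than a plain average, here is a small sketch comparing the two. The precision and recall pairs are invented for illustration, and `f1` is a hypothetical helper, not a library call.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical scores chosen for illustration only.
cases = [(0.9, 0.9), (1.0, 0.1), (0.5, 0.5)]
for p, r in cases:
    arithmetic = (p + r) / 2
    print(f"P={p:.1f} R={r:.1f}  mean={arithmetic:.2f}  F1={f1(p, r):.2f}")
```

With perfect precision but 10% recall, the plain average is 0.55 while F1 is about 0.18: the harmonic mean stays low unless both numbers are reasonably high.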
In scikit-learn
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds))
              precision    recall  f1-score   support
           0       0.97      0.99      0.98        71
           1       0.98      0.95      0.96        43
    accuracy                           0.97       114
support is how many real samples of each class there were: your imbalance check, built in.
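One way to use the support column as an imbalance check is to turn it into class shares and a majority-class baseline. The sketch below assumes a label list `y_test` built to match the support figures in the sample report; in practice you would use your own test labels.

```python
from collections import Counter

# Hypothetical test labels matching the support column above:
# 71 samples of class 0 and 43 samples of class 1.
y_test = [0] * 71 + [1] * 43

support = Counter(y_test)
total = len(y_test)
for label in sorted(support):
    share = support[label] / total
    print(f"class {label}: support={support[label]}  share={share:.1%}")

# A baseline that always predicts the majority class scores this accuracy.
majority_baseline = max(support.values()) / total
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

If a model's accuracy is barely above the majority-class baseline, the headline number says very little, which is the warm-up trap in numeric form.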
Worked Example · Read a Real Scorecard
12 min
# scorecard.py: full evaluation, not just accuracy
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, classification_report,
                             ConfusionMatrixDisplay)
import matplotlib.pyplot as plt

X, y = load_breast_cancer(return_X_y=True)
# In this dataset: 0 = malignant, 1 = benign
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2,
                                      stratify=y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(Xtr, ytr)
preds = model.predict(Xte)

# round() works whether score() returns a Python float or a NumPy scalar
print("accuracy:", round(model.score(Xte, yte), 3))
print("\nconfusion matrix:\n", confusion_matrix(yte, preds))
print("\n", classification_report(yte, preds,
                                  target_names=["malignant", "benign"]))

ConfusionMatrixDisplay.from_predictions(yte, preds,
                                        display_labels=["malignant", "benign"])
plt.savefig("confusion.png", dpi=150)
Sample output
accuracy: 0.965
confusion matrix:
 [[40  2]
 [ 2 70]]
              precision    recall  f1-score   support
   malignant       0.95      0.95      0.95        42
      benign       0.97      0.97      0.97        72
    accuracy                           0.96       114
Read the diff
Here, a false negative means calling a malignant tumour "benign" — the dangerous error. The matrix shows 2 such misses. For a medical tool you'd push recall on the malignant class higher even at the cost of more false alarms. Accuracy alone (96.5%) would never reveal that trade-off.
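The per-class numbers in the report can be recovered directly from the matrix, which makes the reading above concrete. This is a minimal sketch using the sample-output matrix as a literal, with rows as actual classes and columns as predicted classes (scikit-learn's layout).

```python
# Confusion matrix from the sample output: rows = actual, columns = predicted,
# in label order [malignant, benign].
labels = ["malignant", "benign"]
cm = [[40, 2],
      [2, 70]]

for i, name in enumerate(labels):
    correct = cm[i][i]
    actual_total = sum(cm[i])                     # row sum: all real samples of this class
    predicted_total = sum(row[i] for row in cm)   # column sum: all predictions of this class
    recall = correct / actual_total
    precision = correct / predicted_total
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")

# Malignant tumours called benign: the dangerous misses.
missed_malignant = cm[0][1]
print("missed malignant cases:", missed_malignant)
```

Row sums give the support and drive recall; column sums drive precision. The off-diagonal entry in the malignant row is the count of dangerous misses.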
Try It Yourself
13 min
Train any classifier and print its classification_report. Identify the class with the lowest recall.
Given TP=40, FP=2, FN=2, TN=70, compute accuracy, precision, recall, F1 with a calculator-style script. Check against sklearn.
Hint
TP, FP, FN, TN = 40, 2, 2, 70
acc = (TP+TN)/(TP+FP+FN+TN)
prec = TP/(TP+FP); rec = TP/(TP+FN)
f1 = 2*prec*rec/(prec+rec)
print(round(acc,3), round(prec,3), round(rec,3), round(f1,3))
Use predict_proba and a custom threshold. Lower the threshold for the positive class and watch recall rise while precision falls.
Hint
probs = model.predict_proba(Xte)[:, 1]   # P(class 1)
for thresh in [0.3, 0.5, 0.7]:
    preds = (probs >= thresh).astype(int)
    print(thresh, classification_report(yte, preds, output_dict=True)["1"])
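If you want to see the threshold trade-off without training anything, the sketch below runs the same sweep on a tiny hand-made example. The labels and probabilities are invented for illustration, and `precision_recall_at` is a hypothetical helper.

```python
# Hypothetical predicted probabilities for the positive class, with true labels.
y_true = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
probs = [0.95, 0.85, 0.75, 0.60, 0.55, 0.40, 0.35, 0.20, 0.10, 0.05]


def precision_recall_at(threshold):
    """Predict positive when probability >= threshold, then score the result."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


for thresh in [0.3, 0.5, 0.7]:
    precision, recall = precision_recall_at(thresh)
    print(f"threshold={thresh}: precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, a threshold of 0.3 catches every positive (recall 1.00) at the cost of two false alarms (precision 0.71), while 0.7 produces no false alarms (precision 1.00) but misses two positives (recall 0.60).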
Mini-Challenge · Pick the Metric
8 min
For each scenario, state whether you'd optimise precision, recall, or F1, and why in one sentence:
- Airport explosive detector.
- Email spam filter.
- Recommending which of 1000 products to show first.
- Predicting which patients need a follow-up call.
Show suggested answers
1. Recall: never miss a real threat; false alarms are acceptable.
2. Precision: never bin a real email; a little spam getting through is OK.
3. Precision @ top-k: the few you surface must be relevant.
4. Recall (lean): better to call a few extra than miss someone who needs care.
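Answer 3 mentions precision at top-k, which the lesson has not defined. A minimal sketch, assuming the common convention of dividing the number of relevant items in the first k positions by k; `precision_at_k` and the ranking below are hypothetical.

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k ranked items that are relevant (1) rather than not (0)."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_relevance[:k]
    return sum(top_k) / k


# Hypothetical ranking: 1 = relevant product, 0 = not relevant, best-scored first.
ranked = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
print(precision_at_k(ranked, 3))   # 2 of the top 3 are relevant
print(precision_at_k(ranked, 5))   # 3 of the top 5 are relevant
```

Only the items actually shown count toward the score, which matches the product-ranking scenario: relevant products buried at position 500 do not help the user.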
Recap
3 min
Accuracy hides the type of mistake. The confusion matrix shows TP/FP/FN/TN. Precision = "when it says yes, is it right"; recall = "did it catch all the real yeses"; F1 balances them. Pick the metric the real-world cost demands. Always read classification_report, not just the headline number.
Vocabulary Card
- confusion matrix: Table of predicted vs actual classes, the source of all metrics.
- precision: Of predicted positives, how many are truly positive.
- recall: Of actual positives, how many were caught.
- F1 score: Harmonic mean of precision and recall, one balanced number.
Homework
4 min
On an imbalanced dataset (or make one imbalanced by dropping rows), train a model and produce: confusion matrix image, classification report, and a paragraph stating which metric you'd report to a stakeholder and why.
Reuse the scorecard.py structure. The paragraph should name the costly error type for your dataset and justify the chosen metric accordingly.
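If your dataset is balanced, one possible way to make it imbalanced by dropping rows is sketched below. The helper `make_imbalanced`, the toy data, and the 10% keep fraction are all assumptions for illustration; adapt them to your own features and labels.

```python
import random


def make_imbalanced(X, y, minority_label, keep_fraction, seed=0):
    """Keep every other row but only a random fraction of rows with minority_label."""
    rng = random.Random(seed)   # fixed seed so the result is reproducible
    kept = [
        (row, label)
        for row, label in zip(X, y)
        if label != minority_label or rng.random() < keep_fraction
    ]
    X_new = [row for row, _ in kept]
    y_new = [label for _, label in kept]
    return X_new, y_new


# Hypothetical balanced toy data: 500 rows per class, one feature each.
X = [[i] for i in range(1000)]
y = [0] * 500 + [1] * 500

X_imb, y_imb = make_imbalanced(X, y, minority_label=1, keep_fraction=0.1)
print("class 0 rows:", y_imb.count(0))
print("class 1 rows:", y_imb.count(1))
```

Drop rows before the train/test split and keep stratify=y in train_test_split, so that both splits preserve the new class ratio.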