PY-L5-11 · Accuracy, Precision, Recall

Learning Goals

3 min

Read a confusion matrix (TP, FP, FN, TN).
Define accuracy, precision, recall, F1 — and when each matters.
See why accuracy is misleading on imbalanced data.
Use classification_report and confusion_matrix.

Warm-Up · The 99% Trap

5 min

A rare disease affects 1 in 100 people. A "model" that always predicts "healthy" is 99% accurate — and catches zero sick patients. Useless, despite the headline number.

y_true = [0]*99 + [1]      # 99 healthy, 1 sick
y_pred = [0]*100           # model says everyone is healthy
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.0%}")   # 99% — and it missed the one sick person

Today's big idea

Accuracy alone hides the kind of mistakes a model makes. On imbalanced or high-stakes problems, you need precision and recall — they tell you which errors are happening.

New Concept · The Confusion Matrix

14 min

Four outcomes (for a yes/no model)

                    PREDICTED
                  positive   negative
ACTUAL positive    TP ✅       FN ❌ (missed it)
       negative    FP ❌       TN ✅
                   (false alarm)

TP = true positive   FN = false negative (miss)
FP = false positive  TN = true negative
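
The four cells can also be counted directly from two lists of labels. The sketch below assumes 1 means positive and 0 means negative; the labels are made-up illustration data, not from the lesson.

```python
# Count the four outcomes by hand for a yes/no model (1 = positive, 0 = negative).
# y_true and y_pred are made-up illustration data.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
TP = sum(t == 1 and p == 1 for t, p in pairs)  # real yes, predicted yes
FN = sum(t == 1 and p == 0 for t, p in pairs)  # real yes, predicted no (miss)
FP = sum(t == 0 and p == 1 for t, p in pairs)  # real no, predicted yes (false alarm)
TN = sum(t == 0 and p == 0 for t, p in pairs)  # real no, predicted no

print(f"TP={TP} FN={FN} FP={FP} TN={TN}")
```

One thing to watch: scikit-learn's confusion_matrix orders rows and columns by sorted label, so for 0/1 labels it prints [[TN, FP], [FN, TP]], with the negative class first. That is the reverse of the diagram above.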

The metrics

accuracy  = (TP + TN) / everything       "overall correctness"
precision = TP / (TP + FP)               "when it says YES, is it right?"
recall    = TP / (TP + FN)               "of all real YESes, how many caught?"
F1        = 2 * P * R / (P + R)          "harmonic mean of P and R"
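
Applied to the warm-up's always-"healthy" model, these formulas show what accuracy hides. This is a minimal sketch using scikit-learn's metric functions. The zero_division=0 argument is there because the model never predicts the positive class, so precision is 0/0 and would otherwise trigger a warning.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0] * 99 + [1]   # 99 healthy, 1 sick (same data as the warm-up)
y_pred = [0] * 100        # model says everyone is healthy

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)                       # TP / (TP + FN) = 0 / 1
prec = precision_score(y_true, y_pred, zero_division=0)  # 0 / 0, reported as 0
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

Recall of zero is the number that exposes the model: it caught none of the real positives.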

Which to optimise?

Recall matters most when missing a positive is costly: cancer screening, fraud, security alerts. Better a false alarm than a miss.
Precision matters most when a false alarm is costly: spam filter (don't bin a real email), "you may be pregnant" ads.
F1 when you need a single number balancing both.
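
Why a harmonic mean rather than a plain average? A short sketch, using a hypothetical model that almost never says yes but is always right when it does. The helper name f1 and the numbers are illustrative assumptions.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical model: perfect precision, but it catches only 1% of real positives.
precision, recall = 1.0, 0.01

arithmetic = (precision + recall) / 2
harmonic = f1(precision, recall)

print(f"arithmetic mean: {arithmetic:.3f}")
print(f"F1:              {harmonic:.3f}")
```

The plain average lands near 0.5 and looks respectable, while F1 stays close to the weaker of the two scores. That is the sense in which F1 "balances" precision and recall: one cannot compensate for the other.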

In scikit-learn

from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds))

              precision    recall  f1-score   support
           0       0.97      0.99      0.98        71
           1       0.98      0.95      0.96        43
    accuracy                           0.97       114

support is how many real samples of each class there were — your imbalance check, built in.
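
The support column is just a count of the true labels per class. As a rough sketch, the same check can be done before training with collections.Counter. Here y_test is rebuilt from the class counts in the report above rather than taken from a real split.

```python
from collections import Counter

# Stand-in for a real test set: same class counts as the report above.
y_test = [0] * 71 + [1] * 43

support = Counter(y_test)
total = sum(support.values())

for label, count in sorted(support.items()):
    print(f"class {label}: support={count} ({count / total:.0%} of test set)")
```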

Worked Example · Read a Real Scorecard

12 min

# scorecard.py — full evaluation, not just accuracy
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, classification_report,
                             ConfusionMatrixDisplay)
import matplotlib.pyplot as plt

X, y = load_breast_cancer(return_X_y=True)
# In this dataset: 0 = malignant, 1 = benign
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2,
                                      stratify=y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(Xtr, ytr)
preds = model.predict(Xte)

print("accuracy:", round(model.score(Xte, yte), 3))  # built-in round() works whether score() returns a Python or NumPy float
print("\nconfusion matrix:\n", confusion_matrix(yte, preds))
print("\n", classification_report(yte, preds,
                                   target_names=["malignant", "benign"]))

ConfusionMatrixDisplay.from_predictions(yte, preds,
    display_labels=["malignant", "benign"])
plt.savefig("confusion.png", dpi=150)

Sample output

accuracy: 0.965

confusion matrix:
 [[40  2]
 [ 2 70]]

              precision    recall  f1-score   support
   malignant       0.95      0.95      0.95        42
      benign       0.97      0.97      0.97        72
    accuracy                           0.96       114

Read the result

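Here malignant is the class that matters, so treat it as the positive class: a false negative means calling a malignant tumour "benign", which is the dangerous error. The top row of the matrix shows 2 such misses (40 of the 42 malignant cases were caught, recall 0.95). For a medical tool you'd push recall on the malignant class higher, even at the cost of more false alarms. Accuracy alone (96.5%) would never reveal that trade-off.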
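
The per-class numbers in the report can be recovered from the matrix itself. The sketch below hard-codes the sample matrix and reads it the way scikit-learn lays it out: rows are actual classes, columns are predicted classes, in label order (malignant = 0 first). The scores dictionary is an illustrative name.

```python
# Confusion matrix from the sample output: rows = actual, columns = predicted,
# in label order [malignant (0), benign (1)].
cm = [[40, 2],
      [2, 70]]
names = ["malignant", "benign"]

scores = {}
for i, name in enumerate(names):
    actual_total = sum(cm[i])                    # row sum: all real samples of this class
    predicted_total = sum(row[i] for row in cm)  # column sum: all predictions of this class
    precision = cm[i][i] / predicted_total
    recall = cm[i][i] / actual_total
    scores[name] = (precision, recall)
    print(f"{name:>9}: precision={precision:.2f} recall={recall:.2f}")

missed_malignant = cm[0][1]  # actual malignant, predicted benign
print("malignant tumours called benign:", missed_malignant)
```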

Try It Yourself

13 min

01 🟢 Print the report

Train any classifier and print its classification_report. Identify the class with the lowest recall.

02 🟡 Compute by hand

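Given TP=40, FP=2, FN=2, TN=70, compute accuracy, precision, recall, F1 with a calculator-style script. These are the worked example's counts with malignant as the positive class, so check your precision, recall and F1 against the malignant row of sklearn's report.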

Hint

TP, FP, FN, TN = 40, 2, 2, 70
acc = (TP+TN)/(TP+FP+FN+TN)
prec = TP/(TP+FP); rec = TP/(TP+FN)
f1 = 2*prec*rec/(prec+rec)
print(round(acc,3), round(prec,3), round(rec,3), round(f1,3))

03 🔴 Trade precision for recall

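Use predict_proba and a custom threshold instead of the default 0.5. Lower the threshold for the positive class and watch recall rise while precision falls. Note that in the worked example's model, column 1 of predict_proba is P(benign), so the hint trades precision for recall on the benign class.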

Hint

probs = model.predict_proba(Xte)[:, 1]   # P(class 1)
for thresh in [0.3, 0.5, 0.7]:
    preds = (probs >= thresh).astype(int)
    print(thresh, classification_report(yte, preds, output_dict=True)["1"])
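
To see the trade-off without training anything, here is a small self-contained sketch. The probabilities and labels are hypothetical, and precision_recall_at is an illustrative helper rather than a scikit-learn function.

```python
def precision_recall_at(y_true, probs, thresh):
    """Precision and recall for the positive class (1) at a given threshold."""
    preds = [int(p >= thresh) for p in probs]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, preds))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, preds))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, preds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical predicted probabilities of class 1, with the true labels.
probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.65, 0.80, 0.95]
y_true = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

results = {}
for thresh in [0.3, 0.5, 0.7]:
    results[thresh] = precision_recall_at(y_true, probs, thresh)
    precision, recall = results[thresh]
    print(f"threshold={thresh}: precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, the lowest threshold catches every positive at the price of two false alarms, and the highest threshold raises no false alarms but misses most positives.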

Mini-Challenge · Pick the Metric

8 min

For each scenario, state whether you'd optimise precision, recall, or F1, and why in one sentence:

1. Airport explosive detector.
2. Email spam filter.
3. Recommending which of 1000 products to show first.
4. Predicting which patients need a follow-up call.

Suggested answers

1. Recall — never miss a real threat; false alarms are acceptable.
2. Precision — never bin a real email; a little spam getting through is OK.
3. Precision @ top-k — the few you surface must be relevant.
4. Recall (lean) — better to call a few extra than miss someone who needs care.
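
Answer 3 mentions precision @ top-k, which is precision measured only over the first k items a ranking surfaces. A minimal sketch, assuming the items are already sorted best-first and marked 1 (relevant) or 0 (not). The function name and the data are illustrative.

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k ranked items that are relevant (1) rather than not (0)."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_relevance[:k]
    return sum(top_k) / k

# Hypothetical ranking of 10 products, best-scored first: 1 = relevant, 0 = not.
ranked = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

for k in (3, 5, 10):
    print(f"precision@{k} = {precision_at_k(ranked, k):.2f}")
```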

Recap

3 min

Accuracy hides the type of mistake. The confusion matrix shows TP/FP/FN/TN. Precision = "when it says yes, is it right"; recall = "did it catch all the real yeses"; F1 balances them. Pick the metric the real-world cost demands. Always read classification_report, not just the headline number.

Vocabulary Card

confusion matrix: Table of predicted vs actual classes — the source of all metrics.
precision: Of predicted positives, how many are truly positive.
recall: Of actual positives, how many were caught.
F1 score: Harmonic mean of precision and recall — one balanced number.

Homework

4 min

On an imbalanced dataset (or make one imbalanced by dropping rows), train a model and produce: confusion matrix image, classification report, and a paragraph stating which metric you'd report to a stakeholder and why.
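
One possible way to "make one imbalanced by dropping rows" is to keep every row of one class and only a random fraction of the other. The sketch below uses plain lists and the standard library. make_imbalanced, its parameters, and the toy data are assumptions for illustration, not part of scikit-learn.

```python
import random

def make_imbalanced(X, y, minority_label, keep_fraction, seed=0):
    """Keep all rows of the other classes and a random fraction of minority_label rows."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    n_keep = max(1, round(len(minority_idx) * keep_fraction))
    kept_minority = set(rng.sample(minority_idx, n_keep))
    keep = [i for i, label in enumerate(y)
            if label != minority_label or i in kept_minority]
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy balanced dataset: 50 rows per class, one feature each.
X = [[i] for i in range(100)]
y = [0] * 50 + [1] * 50

X_small, y_small = make_imbalanced(X, y, minority_label=1, keep_fraction=0.1)
print("class 0:", y_small.count(0), "class 1:", y_small.count(1))
```

Drop the rows before calling train_test_split, and keep stratify=y so that both splits preserve the new class ratio.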
