PY-L5-16 · Logistic Regression — Predicting Yes/No

Learning Goals

3 min

Understand the sigmoid: turns any number into a 0-1 probability.
Fit LogisticRegression; read predict vs predict_proba.
Move the decision threshold to trade precision for recall.
Remember to scale features: the default L2 penalty and the solver both work best when features share a comparable scale.

Warm-Up · The Sigmoid Squash

5 min

import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(-5))   # 0.007  → very unlikely
print(sigmoid(0))    # 0.5    → coin flip
print(sigmoid(5))    # 0.993  → very likely

Linear regression can output any number, even -200 or 3000, which is nonsense for a probability. The sigmoid squashes the linear output into the open interval (0, 1), so it reads as "probability of yes".

Today's big idea

Logistic regression computes a linear score, then sigmoids it into a probability. Predict "yes" when the probability passes a threshold (default 0.5). Move that threshold and you trade precision for recall.
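To make the "linear score → sigmoid → probability" chain concrete, here is a minimal sketch that rebuilds predict_proba by hand. It assumes the same breast cancer data and pipeline used later in this lesson; the names manual_proba and model_proba are just illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(max_iter=5000)).fit(Xtr, ytr)

score = clf.decision_function(Xte)          # linear score w·x + b, any real number
manual_proba = sigmoid(score)               # squashed into (0, 1)
model_proba = clf.predict_proba(Xte)[:, 1]  # P(class 1) straight from the model

# For a two-class problem the two should agree up to floating-point error.
print(np.allclose(manual_proba, model_proba))
```

So predict_proba is not a separate mechanism: for two classes it is the sigmoid of the score that decision_function returns.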

New Concept · predict_proba & Thresholds

14 min

Fit (with scaling — it's linear)

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

clf = make_pipeline(StandardScaler(),
                    LogisticRegression(max_iter=5000)).fit(Xtr, ytr)
print(round(clf.score(Xte, yte), 3))   # test accuracy; round() works for Python and NumPy floats
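One way to see why scaling matters is to compare how many solver iterations the fit needs with and without StandardScaler. This is a rough sketch on the same split; the variable names are illustrative, and the exact counts depend on your scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

# Same model twice: once on raw features, once on standardized features.
raw = LogisticRegression(max_iter=5000).fit(Xtr, ytr)
scaled = make_pipeline(StandardScaler(),
                       LogisticRegression(max_iter=5000)).fit(Xtr, ytr)

# n_iter_ holds the number of iterations the solver actually used.
raw_iters = int(raw.n_iter_[0])
scaled_iters = int(scaled.named_steps["logisticregression"].n_iter_[0])

print("iterations without scaling:", raw_iters)
print("iterations with scaling:   ", scaled_iters)
print("accuracy without scaling:  ", round(raw.score(Xte, yte), 3))
print("accuracy with scaling:     ", round(scaled.score(Xte, yte), 3))
```

On this dataset the unscaled fit usually needs far more iterations, because the raw features span very different ranges.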

Probabilities, not just labels

probs = clf.predict_proba(Xte)     # shape (n, 2): [P(class0), P(class1)]
print(probs[:3].round(3))
print(clf.predict(Xte)[:3])         # the 0/1 decision at threshold 0.5

[[0.01 0.99]
 [0.97 0.03]
 [0.12 0.88]]
[1 0 1]

Custom threshold

# Be more cautious about calling something "benign" (class 1):
p_benign = probs[:, 1]
strict = (p_benign >= 0.7).astype(int)   # need 70% confidence for "benign"

Raising the threshold for "benign" means fewer false "benign" calls (usually higher precision for benign) but more borderline cases flagged as malignant (lower recall for benign). This is the precision/recall dial from Lesson 11.
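To check the trade-off with numbers, this sketch scores the default 0.5 cutoff against the stricter 0.7 cutoff for "benign". It refits the same pipeline so it runs on its own; the results dictionary is only an illustrative container.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(max_iter=5000)).fit(Xtr, ytr)

p_benign = clf.predict_proba(Xte)[:, 1]   # P(class 1) = P(benign)

results = {}
for t in (0.5, 0.7):
    pred = (p_benign >= t).astype(int)
    # precision/recall here are for the "benign" class (label 1)
    results[t] = (int(pred.sum()),
                  precision_score(yte, pred),
                  recall_score(yte, pred))
    n, p, r = results[t]
    print(f"threshold {t}: {n} benign calls, precision {p:.2f}, recall {r:.2f}")
```

Recall for benign can only stay the same or drop as the threshold rises, because every case called benign at 0.7 was also called benign at 0.5.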

Coefficients still interpretable

lr = clf.named_steps["logisticregression"]
# One coefficient per standardized feature.
# Positive coefficient → pushes toward class 1 as the feature rises
print(lr.coef_[0][:5].round(3))
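Reading the coefficients is easier with feature names attached. The sketch below ranks features by absolute coefficient; since the inputs are standardized, each value is the change in log-odds per one standard deviation of that feature. The ranking by magnitude is just one reasonable way to summarise them.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(max_iter=5000)).fit(Xtr, ytr)

lr = clf.named_steps["logisticregression"]
coefs = lr.coef_[0]                        # shape (n_features,)
order = np.argsort(np.abs(coefs))[::-1]    # largest magnitude first

for i in order[:5]:
    # class 1 is benign, class 0 is malignant in this dataset
    direction = "benign" if coefs[i] > 0 else "malignant"
    print(f"{data.feature_names[i]:<25} {coefs[i]:+.2f}  (pushes toward {direction})")
```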

Multiclass too

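Logistic regression handles 3+ classes automatically, via softmax (multinomial) or one-vs-rest depending on the solver and settings. The fit/predict calls stay the same, and predict_proba returns one column per class.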
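As a quick illustration, here is the same pipeline on the three-class iris dataset (chosen here only as a convenient example). Each row of predict_proba now has three probabilities that sum to 1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(max_iter=5000)).fit(Xtr, ytr)

probs = clf.predict_proba(Xte)      # shape (n, 3): one column per class
print(probs.shape)
print(probs[:3].round(3))
print(probs.sum(axis=1)[:3])        # each row sums to 1
print(clf.predict(Xte)[:3])         # the class with the highest probability
```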

Worked Example · Threshold Tuning

12 min

# threshold.py — see precision/recall trade as the threshold moves
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

X, y = load_breast_cancer(return_X_y=True)
# treat malignant (0) as the "positive" we care about catching
y_pos = (y == 0).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X, y_pos, stratify=y_pos, random_state=0)

clf = make_pipeline(StandardScaler(),
                    LogisticRegression(max_iter=5000)).fit(Xtr, ytr)
proba = clf.predict_proba(Xte)[:, 1]   # P(malignant)

print(f"{'thresh':>7} {'precision':>10} {'recall':>8}")
for t in [0.3, 0.4, 0.5, 0.6, 0.7]:
    pred = (proba >= t).astype(int)
    p = precision_score(yte, pred)
    r = recall_score(yte, pred)
    print(f"{t:>7} {p:>10.2f} {r:>8.2f}")

Sample output

 thresh  precision   recall
    0.3       0.89     1.00
    0.4       0.93     0.98
    0.5       0.95     0.95
    0.6       0.97     0.93
    0.7       1.00     0.88

Read the diff

Lower threshold → catch every malignant case (recall 1.00) at the cost of more false alarms (precision 0.89). For cancer screening you'd pick a low threshold — missing a tumour is far worse than a false alarm. The model gave you a probability; you choose the operating point based on real-world cost.
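"Based on real-world cost" can be made explicit by pricing each kind of mistake. In the sketch below the costs (a missed tumour counts 10, a false alarm counts 1) are made-up numbers for illustration only; plug in your own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
y_pos = (y == 0).astype(int)               # malignant is the positive class
Xtr, Xte, ytr, yte = train_test_split(X, y_pos, stratify=y_pos, random_state=0)
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(max_iter=5000)).fit(Xtr, ytr)
proba = clf.predict_proba(Xte)[:, 1]       # P(malignant)

COST_FN = 10.0   # hypothetical cost of missing a malignant case
COST_FP = 1.0    # hypothetical cost of a false alarm

costs = {}
for t in [0.3, 0.4, 0.5, 0.6, 0.7]:
    pred = (proba >= t).astype(int)
    fn = int(((pred == 0) & (yte == 1)).sum())   # missed tumours
    fp = int(((pred == 1) & (yte == 0)).sum())   # false alarms
    costs[t] = COST_FN * fn + COST_FP * fp
    print(f"{t:>7} FN={fn:>2} FP={fp:>2} cost={costs[t]:>6.1f}")

best_t = min(costs, key=costs.get)         # threshold with the lowest total cost
print("lowest-cost threshold:", best_t)
```

Changing the two cost constants moves the chosen threshold, which is the whole point: the model stays the same, the operating point follows the costs.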

Try It Yourself

13 min

01 🟢 Probabilities

Fit logistic regression on any 2-class dataset. Print predict_proba for the first 5 test samples.

02 🟡 Plot the sigmoid fit

Use one feature only. Scatter the data (0/1) and overlay the model's predicted probability curve across that feature's range.

03 🔴 ROC curve

Plot the ROC curve and compute AUC with sklearn.metrics.roc_curve / roc_auc_score. What does AUC = 0.99 mean?

Hint

import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, RocCurveDisplay
print("AUC:", round(roc_auc_score(yte, proba), 3))
RocCurveDisplay.from_predictions(yte, proba)
plt.show()   # needed when running as a script

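AUC ≈ 1.0 means the model ranks positives above negatives almost perfectly across all thresholds. Put differently, AUC = 0.99 means a randomly chosen positive scores higher than a randomly chosen negative about 99% of the time.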
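The ranking reading of AUC can be checked directly: compare every positive with every negative and count how often the positive scores higher (ties count as half). This sketch uses the breast cancer data with benign as class 1; the names pos, neg and pairwise are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(max_iter=5000)).fit(Xtr, ytr)
proba = clf.predict_proba(Xte)[:, 1]

pos = proba[yte == 1]                      # scores of true positives
neg = proba[yte == 0]                      # scores of true negatives
diff = pos[:, None] - neg[None, :]         # every positive vs every negative

# Fraction of pairs ranked correctly, with ties counted as half.
pairwise = (diff > 0).mean() + 0.5 * (diff == 0).mean()
auc = roc_auc_score(yte, proba)

print("pairwise ranking score:", round(float(pairwise), 4))
print("roc_auc_score:         ", round(float(auc), 4))
```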

Mini-Challenge · Pick the Operating Point

8 min

For a spam filter, a false positive (real email → spam) is very costly; a false negative (spam → inbox) is mild. Sweep thresholds and pick the one that gives precision ≥ 0.99 with the highest possible recall. Report it.

Show one possible solution

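import numpy as np
from sklearn.metrics import precision_score, recall_score

# Reuses yte and proba from the worked example as a stand-in for spam scores.
best = None
for t in np.linspace(0.5, 0.99, 50):
    pred = (proba >= t).astype(int)
    if pred.sum() == 0:          # nothing predicted positive: precision undefined
        continue
    p = precision_score(yte, pred)
    r = recall_score(yte, pred)
    if p >= 0.99 and (best is None or r > best[2]):
        best = (t, p, r)
if best is None:                 # no threshold met the precision floor
    print("no threshold in the sweep reached precision 0.99")
else:
    print(f"chosen threshold {best[0]:.2f}: precision {best[1]:.2f}, recall {best[2]:.2f}")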

Non-negotiables: search thresholds, enforce the precision floor, maximise recall subject to it. This is how products tune classifiers to business cost.
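Instead of a hand-picked grid, sklearn.metrics.precision_recall_curve evaluates every threshold the scores can produce. The sketch below wraps the same "precision floor, then maximise recall" rule in a helper; pick_threshold is a hypothetical name, and the four-sample data is a toy example.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def pick_threshold(y_true, scores, min_precision=0.99):
    """Return (threshold, precision, recall) with the best recall that
    still meets the precision floor, or None if no threshold qualifies."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The final precision/recall pair has no matching threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    ok = precision >= min_precision
    if not ok.any():
        return None
    idx = np.flatnonzero(ok)
    best = idx[np.argmax(recall[idx])]
    return float(thresholds[best]), float(precision[best]), float(recall[best])


# Toy data: two negatives, two positives.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
print(pick_threshold(y_true, scores, min_precision=0.99))
```

With the worked example's variables you would call pick_threshold(yte, proba) and predict positive when proba is at or above the returned threshold.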

Recap

3 min

Logistic regression is a classifier: linear score → sigmoid → probability. predict_proba gives the probability; the threshold (default 0.5) turns it into a label. Move the threshold to trade precision for recall based on real-world cost. Scale features so the penalty and the solver treat them evenly. Fast, interpretable, a great baseline. Next: feature engineering, the prep that powers all of these.

Vocabulary Card

sigmoid: S-shaped function mapping any number to (0, 1) — a probability.
predict_proba: Returns class probabilities instead of a hard label.
threshold: The probability cutoff for predicting the positive class; your precision/recall dial.
ROC / AUC: Curve of true-positive vs false-positive rate; AUC summarises ranking quality (1 = perfect).

Homework

4 min

On a 2-class dataset, fit logistic regression, plot the ROC curve with AUC, and pick a threshold for a stated cost scenario (you choose precision-critical or recall-critical). Justify your choice in one sentence.
