PY-L5-15 · Linear Regression — Predicting a Number

Learning Goals

3 min

Fit a LinearRegression and read its slope(s) and intercept.
Predict a continuous value for new inputs.
Evaluate with R², MAE, and RMSE — not accuracy.
Interpret each coefficient as "effect per unit of this feature".

Warm-Up · y = mx + c

5 min

You met this line in maths class. Linear regression just finds the best m and c from data: the line that minimises the total squared vertical distance to the points.

price = m × size + c

m (slope)     = RM per extra square foot
c (intercept) = base price at size 0
With many features: price = m1·size + m2·rooms + m3·age + c
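To make "best m and c" concrete, here is a minimal sketch that computes the least-squares line by hand for one feature. The sizes and prices are made-up numbers for illustration only, and `total_squared_error` is a hypothetical helper name, not something from scikit-learn.

```python
import numpy as np

# Made-up data: house size (sq ft) and price (RM thousand), roughly on a line.
size = np.array([600.0, 800.0, 1000.0, 1200.0, 1500.0])
price = np.array([210.0, 260.0, 330.0, 370.0, 460.0])

# Closed-form least squares for a single feature:
#   m = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   c = mean_y - m * mean_x
m = np.sum((size - size.mean()) * (price - price.mean())) / np.sum((size - size.mean()) ** 2)
c = price.mean() - m * size.mean()

def total_squared_error(slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return np.sum((price - (slope * size + intercept)) ** 2)

print(f"m = {m:.3f}, c = {c:.1f}")
print("best line, total squared error  :", round(total_squared_error(m, c), 1))
print("nudged slope, total squared error:", round(total_squared_error(m * 1.05, c), 1))
```

Any other slope or intercept, such as the nudged slope in the last line, gives a larger total squared error. That is what "best" means here.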

Today's big idea

Regression outputs a number, so accuracy makes no sense — you measure how far off you are. R² says "what fraction of the variation did I explain?"; MAE/RMSE say "by how much am I wrong, on average?".

New Concept · Fit, Predict, Measure

14 min

Fit

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target          # y = median house value
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

lin = LinearRegression().fit(Xtr, ytr)

Read the coefficients

import pandas as pd
coefs = pd.Series(lin.coef_, index=X.columns)
print(coefs.round(3))
print("intercept:", round(lin.intercept_, 3))

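Each coefficient is "how much the prediction changes per one-unit increase in that feature, holding others fixed". Because the target is in $100k units, a coefficient of 0.4 would mean +$40k per unit of that feature. A positive coefficient on income (the MedInc column) means higher income → higher predicted house value.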
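The "per one-unit increase, holding others fixed" reading can be checked directly. The sketch below uses small synthetic data (not the housing set) with known effects, assumed purely for illustration, then raises one feature by exactly 1 and compares the change in prediction with the coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (not the housing set): two features with known true effects.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 1.0 + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X_demo, y_demo)

# Take one sample and raise feature 0 by exactly one unit, leaving feature 1 alone.
base = X_demo[:1].copy()
bumped = base.copy()
bumped[0, 0] += 1.0

change = model.predict(bumped)[0] - model.predict(base)[0]
print("coefficient on feature 0:", round(model.coef_[0], 3))
print("change in prediction    :", round(change, 3))
```

The two printed numbers match: for a linear model, the coefficient is exactly the change in prediction per unit of that feature.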

Predict

preds = lin.predict(Xte)
print(preds[:5].round(2))

The right metrics

# root_mean_squared_error requires scikit-learn 1.4 or newer
from sklearn.metrics import r2_score, mean_absolute_error, root_mean_squared_error

print("R²  :", round(r2_score(yte, preds), 3))           # 1.0 = perfect, 0 = no better than the mean
print("MAE :", round(mean_absolute_error(yte, preds), 3)) # average absolute error
print("RMSE:", round(root_mean_squared_error(yte, preds), 3)) # penalises big errors more

R²  : 0.591
MAE : 0.53
RMSE: 0.75

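R² 0.59 means the model explains about 59% of the variation in house value. MAE 0.53 means it's off by about $53k on average (target is in $100k units). RMSE is always at least as large as MAE (equal only when every error has the same size), and the gap widens when a few predictions miss badly, because squaring punishes the occasional big miss.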
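To see what the three metrics actually compute, here is a small sketch that works them out by hand with NumPy on four made-up values. The numbers are illustrative only, not taken from the housing data.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

# Tiny made-up example: four actual values and four predictions.
actual = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.5, 2.0, 2.5, 6.0])      # the last one is a big miss

errors = actual - predicted
mae = np.mean(np.abs(errors))                    # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))             # square, average, then square-root
ss_res = np.sum(errors ** 2)                     # squared error of the model
ss_tot = np.sum((actual - actual.mean()) ** 2)   # squared error of "always predict the mean"
r2 = 1 - ss_res / ss_tot

print(f"MAE  = {mae:.3f}")    # 0.750
print(f"RMSE = {rmse:.3f}")   # 1.061
print(f"R²   = {r2:.3f}")     # 0.100
```

The single miss of 2.0 pulls RMSE well above MAE. It also drags R² down to 0.1, barely better than always predicting the mean.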

Worked Example · Predict & Plot

12 min

# regression.py — fit, score, and a predicted-vs-actual plot
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import matplotlib.pyplot as plt

data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

lin = LinearRegression().fit(Xtr, ytr)
preds = lin.predict(Xte)

print("R² :", round(r2_score(yte, preds), 3))
print("MAE:", round(mean_absolute_error(yte, preds), 3))

# predicted vs actual — perfect model would lie on the diagonal
plt.figure(figsize=(6, 6))
plt.scatter(yte, preds, alpha=0.2, s=10)
lim = [yte.min(), yte.max()]
plt.plot(lim, lim, "r--", linewidth=2)
plt.xlabel("actual"); plt.ylabel("predicted")
plt.title("Predicted vs actual house value")
plt.tight_layout(); plt.savefig("pred_vs_actual.png", dpi=150)

Read the plot

The predicted-vs-actual scatter is the key regression diagnostic. Points hug the red diagonal where the model is right; the cloud's spread shows the error. You'll spot a vertical stripe at the right edge: California's target is capped at about $500k, so many differently priced districts share the same actual value. The linear model knows nothing about that cap and can predict above it (or even below zero). Diagnostics like this reveal a model's blind spots.
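The effect of a capped target is easier to see in isolation. The sketch below is a synthetic stand-in, not the housing data: the cap of 5.0, the slope of 0.7 and the noise level are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for a capped target: the true value grows with x,
# but anything above CAP is recorded as CAP (hypothetical cap of 5.0).
rng = np.random.default_rng(0)
CAP = 5.0
x = rng.uniform(0, 10, size=500)
true_value = 0.7 * x + rng.normal(scale=0.5, size=500)
recorded = np.minimum(true_value, CAP)

model = LinearRegression().fit(x.reshape(-1, 1), recorded)
preds = model.predict(x.reshape(-1, 1))

# Every capped sample has the same recorded value but its own prediction.
at_cap = recorded == CAP
print("samples recorded at the cap:", at_cap.sum())
print("their predictions range from", preds[at_cap].min().round(2),
      "to", preds[at_cap].max().round(2))
```

All capped samples share one actual value but get a range of predictions, so in a predicted-vs-actual plot they line up as a vertical stripe.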

Try It Yourself

13 min

01 🟢 Single-feature line

Fit a regression using ONLY the income feature (the MedInc column). Plot the data and the fitted line.

02 🟡 Interpret coefficients

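Print all coefficients sorted by magnitude. Which feature has the biggest effect on price? Does its sign make sense? Keep in mind that each coefficient is per unit of its own feature, so raw magnitudes are only directly comparable when the features are on similar scales.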

03 🔴 Forest vs linear

Compare LinearRegression and RandomForestRegressor on R² (via cross_val_score with scoring="r2"). Which wins, and why might that be?

Hint

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
for name, m in [("linear", LinearRegression()),
                ("forest", RandomForestRegressor(n_estimators=100, random_state=0))]:
    s = cross_val_score(m, X, y, cv=5, scoring="r2").mean()
    print(name, round(s, 3))

The forest usually wins because house price isn't perfectly linear — but it loses the simple, interpretable coefficients.

Mini-Challenge · A Regression Report

8 min

Write regression_report(model, Xte, yte) that prints R², MAE, RMSE and the 3 worst-predicted samples (largest absolute error). Use it on any regressor.

Show one possible solution

import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, root_mean_squared_error

def regression_report(model, Xte, yte):
    preds = model.predict(Xte)
    print(f"R²  : {r2_score(yte, preds):.3f}")
    print(f"MAE : {mean_absolute_error(yte, preds):.3f}")
    print(f"RMSE: {root_mean_squared_error(yte, preds):.3f}")
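    actual = np.asarray(yte)               # works for a pandas Series or a NumPy array
    err = np.abs(actual - preds)
    worst = np.argsort(err)[-3:][::-1]     # indices of the 3 largest errors, biggest first
    print("worst 3 predictions:")
    for i in worst:
        print(f"  actual {actual[i]:.2f}  predicted {preds[i]:.2f}")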

Non-negotiables: all three metrics + the worst-error samples (regression's version of inspecting mistakes).

Recap

3 min

Regression predicts numbers. Linear regression fits the best line/plane; coefficients = effect per unit of each feature (interpretable!). Evaluate with R² (variance explained), MAE (average error), RMSE (penalises big misses) — never accuracy. The predicted-vs-actual plot is your go-to diagnostic. Next: logistic regression, same shape but for yes/no.

Vocabulary Card

regression: Predicting a continuous numeric output.
coefficient: The learned effect of a feature — change in prediction per unit change in the feature.
R²: Fraction of the target's variation the model explains (1 = perfect, 0 = no better than predicting the mean, negative = worse than that).
MAE / RMSE: Mean absolute error / root-mean-squared error, both in the target's units; RMSE punishes large errors more.

Homework

4 min

Find a regression dataset (or use California housing). Fit linear regression, print the regression report, plot predicted-vs-actual, and interpret the two largest coefficients in plain English.
