Learning Goals
3 min
- Fit a LinearRegression and read its slope(s) and intercept.
- Predict a continuous value for new inputs.
- Evaluate with R², MAE, and RMSE — not accuracy.
- Interpret each coefficient as "effect per unit of this feature".
Warm-Up · y = mx + c
5 min
You met this line in maths class. Linear regression just finds the best m and c from data: the line that minimises the total squared vertical distance between the line and the points.
    price = m × size + c
    m (slope)     = RM per extra square foot
    c (intercept) = base price at size 0

With many features:

    price = m1·size + m2·rooms + m3·age + c
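To make "best m and c" concrete, here is a minimal sketch of the closed-form least-squares fit for a single feature. The sizes and prices are made-up illustration values, not taken from any dataset, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Return (m, c) that minimise the sum of squared vertical errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x  # the best line passes through (mean_x, mean_y)
    return m, c

# Made-up data lying exactly on price = 300 * size + 50_000
sizes = [500, 750, 1000, 1250, 1500]          # square feet
prices = [300 * s + 50_000 for s in sizes]    # RM

m, c = fit_line(sizes, prices)
print(f"m = {m:.1f} RM per extra square foot, c = {c:.1f} RM")
```

Because the toy points sit exactly on a line, the fit recovers the slope 300 and intercept 50,000 used to generate them. With noisy real data the same formula returns the line with the smallest total squared error.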
Regression outputs a number, so accuracy makes no sense — you measure how far off you are. R² says "what fraction of the variation did I explain?"; MAE/RMSE say "by how much am I wrong, on average?".
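The "fraction of the variation explained" reading of R² can be checked by hand. The sketch below uses made-up targets and a hypothetical helper `r2` that compares the model's squared error with the squared error of always guessing the mean.

```python
def r2(actual, predicted):
    """R² = 1 - (model's squared error) / (squared error of always guessing the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [2.0, 4.0, 6.0, 8.0]                 # made-up targets, mean = 5.0

r2_good = r2(actual, [3.0, 4.0, 6.0, 7.0])    # close to the truth
r2_mean = r2(actual, [5.0, 5.0, 5.0, 5.0])    # always predict the mean
r2_perfect = r2(actual, actual)               # no error at all

print(r2_good, r2_mean, r2_perfect)
```

The three cases give R² of 0.9, 0.0 and 1.0. A model that does worse than the mean would produce a negative R².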
New Concept · Fit, Predict, Measure
14 min
Fit

    from sklearn.datasets import fetch_california_housing
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    data = fetch_california_housing(as_frame=True)
    X, y = data.data, data.target   # y = median house value
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)
    lin = LinearRegression().fit(Xtr, ytr)
Read the coefficients
    import pandas as pd

    coefs = pd.Series(lin.coef_, index=X.columns)
    print(coefs.round(3))
    print("intercept:", round(lin.intercept_, 3))
Each coefficient is "how much the prediction changes per one-unit increase in that feature, holding others fixed". A positive coefficient on income means higher income → higher predicted house value.
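The "per one-unit increase, holding others fixed" reading can be verified directly. This is a small sketch on synthetic data (an assumption for illustration, not the housing set): raise one feature by exactly one unit and compare the change in prediction with that feature's coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustration only): y = 3*a - 2*b + 5 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 5 + rng.normal(scale=0.1, size=200)

lin = LinearRegression().fit(X, y)

base = np.array([[1.0, 2.0]])
bumped = base + np.array([[1.0, 0.0]])   # feature 0 up by one unit, feature 1 unchanged

delta = lin.predict(bumped)[0] - lin.predict(base)[0]
print("coefficient of feature 0:", round(lin.coef_[0], 4))
print("change in prediction    :", round(delta, 4))
```

The two printed numbers agree, whatever starting point `base` you choose. That is what makes a linear model's coefficients easy to interpret.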
Predict
    preds = lin.predict(Xte)
    print(preds[:5].round(2))
The right metrics
    # root_mean_squared_error requires scikit-learn 1.4 or newer
    from sklearn.metrics import r2_score, mean_absolute_error, root_mean_squared_error

    print("R² :", round(r2_score(yte, preds), 3))                   # 1.0 = perfect, 0 = no better than the mean
    print("MAE :", round(mean_absolute_error(yte, preds), 3))       # average absolute error
    print("RMSE:", round(root_mean_squared_error(yte, preds), 3))   # penalises big errors more
    R² : 0.591
    MAE : 0.53
    RMSE: 0.75
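R² 0.59 means the model explains about 59% of the variance in house value on the test set. MAE 0.53 means a typical prediction is off by about $53k (the target is in $100k units). RMSE is never smaller than MAE, and the two are equal only when every error has the same size; the gap here (0.75 vs 0.53) comes from RMSE punishing the occasional big miss.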
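A quick way to see how MAE and RMSE relate is to compute both on two made-up error patterns with the same average size. The helpers `mae` and `rmse` below are hypothetical names that take a list of errors (actual minus predicted).

```python
import math

def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

even = [2, -2, 2, -2]    # every prediction is off by 2
spiky = [0, 0, 0, 8]     # three perfect predictions and one big miss

print("even :", mae(even), rmse(even))
print("spiky:", mae(spiky), rmse(spiky))
```

Both patterns have an MAE of 2.0. RMSE is also 2.0 when the errors are all the same size, but it rises to 4.0 when the same total error is concentrated in one large miss.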
Worked Example · Predict & Plot
12 min

    # regression.py: fit, score, and a predicted-vs-actual plot
    from sklearn.datasets import fetch_california_housing
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score, mean_absolute_error
    import matplotlib.pyplot as plt

    data = fetch_california_housing(as_frame=True)
    X, y = data.data, data.target
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

    lin = LinearRegression().fit(Xtr, ytr)
    preds = lin.predict(Xte)
    print("R² :", round(r2_score(yte, preds), 3))
    print("MAE:", round(mean_absolute_error(yte, preds), 3))

    # predicted vs actual: a perfect model would lie on the diagonal
    plt.figure(figsize=(6, 6))
    plt.scatter(yte, preds, alpha=0.2, s=10)
    lim = [yte.min(), yte.max()]
    plt.plot(lim, lim, "r--", linewidth=2)
    plt.xlabel("actual")
    plt.ylabel("predicted")
    plt.title("Predicted vs actual house value")
    plt.tight_layout()
    plt.savefig("pred_vs_actual.png", dpi=150)
Read the plot
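The predicted-vs-actual scatter is the key regression diagnostic. Points hug the red diagonal where the model is right; the cloud's spread around it shows the error. You'll also spot a vertical stripe at the right-hand edge: California's target is capped at about $500k, so every district above the cap is recorded at the same value, while the linear model's predictions for those districts are not capped and land both above and below it. Diagnostics like this reveal blind spots in the data and the model.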
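What a capped target does to a linear fit can be reproduced on toy data. The sketch below is an assumption-laden illustration, not the housing set: the "true" value equals the feature x, but it is recorded as at most `cap`, and a straight line is then fitted to the recorded values.

```python
import numpy as np

cap = 6.0
x = np.linspace(0, 10, 101)        # toy feature
actual = np.minimum(x, cap)        # recorded target: true value x, clipped at the cap

m, c = np.polyfit(x, actual, 1)    # least-squares straight line
preds = m * x + c

at_cap = actual == cap             # samples whose recorded value sits at the cap
print("samples recorded at the cap:", at_cap.sum())
print("their predictions range from", round(preds[at_cap].min(), 2),
      "to", round(preds[at_cap].max(), 2))
print("predictions above the cap:", (preds > cap).sum())
```

All capped samples share one actual value, yet their predictions spread over a range that extends beyond the cap. On a predicted-vs-actual plot this shows up as a straight stripe of points at the capped actual value.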
Try It Yourself
13 min
Fit a regression using ONLY the income feature (the MedInc column). Plot the data and the fitted line.
Print all coefficients sorted by magnitude. Which feature has the biggest effect on price? Does its sign make sense?
Compare LinearRegression and RandomForestRegressor on R² (via cross_val_score with scoring="r2"). Which wins, and why might that be?
Hint
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # X, y are the full California housing features and target loaded earlier
    for name, m in [("linear", LinearRegression()),
                    ("forest", RandomForestRegressor(n_estimators=100, random_state=0))]:
        s = cross_val_score(m, X, y, cv=5, scoring="r2").mean()
        print(name, round(s, 3))
The forest usually wins because house price isn't perfectly linear — but it loses the simple, interpretable coefficients.
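The effect of non-linearity is easiest to see on data where it is extreme. The sketch below uses a synthetic, deliberately curved relationship (an assumption for illustration; the housing data is far less clear-cut) and compares the same two models with the same scoring.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic, deliberately non-linear data: y = x**2 + noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.2, size=300)

scores = {}
for name, model in [("linear", LinearRegression()),
                    ("forest", RandomForestRegressor(n_estimators=50, random_state=0))]:
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(name, round(scores[name], 3))
```

A straight line cannot follow a U-shaped curve, so the linear model scores near zero here while the forest scores close to 1. The forest has no single slope to report, which is the interpretability trade-off.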
Mini-Challenge · A Regression Report
8 min
Write regression_report(model, Xte, yte) that prints R², MAE, RMSE and the 3 worst-predicted samples (largest absolute error). Use it on any regressor.
Show one possible solution
    import numpy as np
    # root_mean_squared_error requires scikit-learn 1.4 or newer
    from sklearn.metrics import r2_score, mean_absolute_error, root_mean_squared_error

    def regression_report(model, Xte, yte):
        preds = model.predict(Xte)
        print(f"R² : {r2_score(yte, preds):.3f}")
        print(f"MAE : {mean_absolute_error(yte, preds):.3f}")
        print(f"RMSE: {root_mean_squared_error(yte, preds):.3f}")

        actual = np.asarray(yte)             # works for a pandas Series or a plain array
        err = np.abs(actual - preds)
        worst = np.argsort(err)[-3:][::-1]   # positions of the 3 largest errors, biggest first
        print("worst 3 predictions:")
        for i in worst:
            print(f"  actual {actual[i]:.2f}  predicted {preds[i]:.2f}")
Non-negotiables: all three metrics + the worst-error samples (regression's version of inspecting mistakes).
Recap
3 min
Regression predicts numbers. Linear regression fits the best line/plane; coefficients = effect per unit of each feature (interpretable!). Evaluate with R² (variance explained), MAE (average error), RMSE (penalises big misses), never accuracy. The predicted-vs-actual plot is your go-to diagnostic. Next: logistic regression, same shape but for yes/no.
Vocabulary Card
- regression: Predicting a continuous numeric output.
- coefficient: The learned effect of a feature, i.e. the change in prediction per unit change in that feature with the others held fixed.
- R²: Fraction of the target's variation the model explains (1 = perfect, 0 = mean-only).
- MAE / RMSE: Mean absolute error / root mean squared error, both in the target's units; RMSE punishes large errors more.
Homework
4 min
Find a regression dataset (or use California housing). Fit linear regression, print the regression report, plot predicted-vs-actual, and interpret the two largest coefficients in plain English.
Combine regression.py + regression_report. The coefficient interpretation should read like "each extra room adds about RM X to the predicted price".