Project Goals
3 min
- Build a regression pipeline for mixed-type property data.
- Evaluate with RMSE and R² (cross-validated).
- Engineer price-relevant features from the inputs only (sqft per room, age buckets); anything computed from the price itself would leak the target.
- Present predictions and uncertainty honestly.
Warm-Up · Make / Find the Data
5 min
If you don't have a real Malaysian property CSV, generate a realistic synthetic one so the lesson runs anywhere:
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)  # fixed seed so everyone gets the same data
    n = 600

    df = pd.DataFrame({
        "sqft": rng.integers(600, 3000, n),
        "rooms": rng.integers(1, 6, n),
        "age": rng.integers(0, 40, n),
        "type": rng.choice(["apartment", "terrace", "bungalow"], n),
        "city": rng.choice(["KL", "Penang", "JB", "Ipoh"], n),
    })

    # price (RM) = additive effect of each feature + Gaussian noise
    base = (
        df["sqft"] * 350
        + df["rooms"] * 25000
        - df["age"] * 4000
        + df["type"].map({"apartment": 0, "terrace": 80000, "bungalow": 250000})
        + df["city"].map({"KL": 200000, "Penang": 120000, "JB": 60000, "Ipoh": 0})
    )
    df["price"] = (base + rng.normal(0, 50000, n)).round(-3).clip(lower=80000)
    df.to_csv("property.csv", index=False)
Regression projects follow the same pipeline as classification — only the model class (Regressor) and metrics (RMSE/R², not accuracy) change. A price prediction is useless without an error estimate: "RM 480k ± 60k" is honest; "RM 480,213" pretends to a precision you don't have.
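To make the "RM 480k ± 60k" habit concrete, here is a minimal sketch of a formatting helper. The name format_estimate and the RM 10,000 rounding step are assumptions made for illustration, not part of the lesson's code.

```python
def format_estimate(price, band, step=10_000):
    """Round a price and its error band to a coarse step, shown in thousands of RM.

    `step` (RM 10,000 by default) is an illustrative choice: pick a step
    that is no finer than the size of your error band.
    """
    rounded_price = round(price / step) * step
    rounded_band = round(band / step) * step
    return f"RM {rounded_price / 1000:,.0f}k ± {rounded_band / 1000:,.0f}k"


print(format_estimate(480_213, 58_400))  # prints: RM 480k ± 60k
```

Rounding the price to the same coarse step as the band stops the output from implying more precision than the model has.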
Plan · Regression Pipeline
14 min
The pieces (same as Titanic, Regressor at the end)
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.ensemble import RandomForestRegressor

    num_cols = ["sqft", "rooms", "age"]
    cat_cols = ["type", "city"]

    # scale the numeric columns, one-hot encode the categorical ones
    pre = ColumnTransformer([
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])

    # preprocessing and model chained into a single estimator
    model = Pipeline([
        ("prep", pre),
        ("reg", RandomForestRegressor(n_estimators=300, random_state=0)),
    ])
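The handle_unknown="ignore" setting matters once the model meets a category it never saw in training. The sketch below shows the behaviour on the encoder alone; "Melaka" is a made-up unseen city used only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on the four cities used in the lesson's data.
train = pd.DataFrame({"city": ["KL", "Penang", "JB", "Ipoh"]})
enc = OneHotEncoder(handle_unknown="ignore").fit(train)

# A city seen during fit gets exactly one 1 in its row.
seen = enc.transform(pd.DataFrame({"city": ["KL"]})).toarray()

# An unseen city (hypothetical example) becomes a row of zeros instead of an error.
unseen = enc.transform(pd.DataFrame({"city": ["Melaka"]})).toarray()

print(enc.categories_[0])  # category order learned during fit (sorted)
print(seen)
print(unseen)
```

An all-zero row means the model predicts as if the property had no city effect at all, so the pipeline keeps running but the estimate for an unseen city deserves extra caution.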
Metrics for regression
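    from sklearn.model_selection import cross_val_score

    # X = feature columns, y = price, as prepared in property.py
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

    # the scorer returns negative RMSE, so negate it to get RMSE in RM
    rmse = -cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    ).mean()

    print(f"R²: {r2:.3f} RMSE: RM {rmse:,.0f}")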
scikit-learn maximises scores, so error metrics are "negative" — flip the sign back. RMSE in RM is interpretable: "typical error of ±RMSE".
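To see what these metrics measure, it can help to compute them by hand on a few numbers. This is a small sketch with four made-up prices; the values are illustrative and do not come from the lesson's dataset.

```python
import numpy as np

# Made-up actual and predicted prices in RM, for illustration only.
actual = np.array([400_000, 550_000, 700_000, 900_000], dtype=float)
predicted = np.array([420_000, 500_000, 730_000, 880_000], dtype=float)

errors = predicted - actual

# RMSE: square the errors, average, take the root (large misses count more).
rmse = np.sqrt(np.mean(errors ** 2))

# MAE: average absolute error (every ringgit of error counts the same).
mae = np.mean(np.abs(errors))

# R²: 1 minus (model's squared error / squared error of always predicting the mean).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSE: RM {rmse:,.0f}  MAE: RM {mae:,.0f}  R²: {r2:.3f}")
```

RMSE is never smaller than MAE, and the gap widens when a few predictions miss badly, which is why reporting both is informative.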
Feature engineering ideas
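    # floor area per room: a rough proxy for how spacious each room is
    X["sqft_per_room"] = X["sqft"] / X["rooms"]
    # other ideas: age buckets, an is_new flag, etc.
    # add each new column to num_cols or cat_cols, otherwise the ColumnTransformer drops it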
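The ideas above (a per-room ratio, age buckets, an is_new flag) can be sketched on a three-row toy frame. The bucket edges and the "two years or younger counts as new" threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Toy rows with the same numeric columns as the lesson's data.
X = pd.DataFrame({
    "sqft": [900, 1500, 2400],
    "rooms": [2, 3, 5],
    "age": [0, 12, 35],
})

# Ratio feature: floor area per room.
X["sqft_per_room"] = X["sqft"] / X["rooms"]

# Binary flag: assumed threshold of 2 years or younger.
X["is_new"] = (X["age"] <= 2).astype(int)

# Age buckets with assumed edges: (-1, 5], (5, 20], (20, 100].
X["age_bucket"] = pd.cut(
    X["age"], bins=[-1, 5, 20, 100], labels=["new", "mid", "old"]
)

print(X)
```

In the pipeline, sqft_per_room and is_new would join the numeric column list and age_bucket the categorical one; ColumnTransformer drops any column it is not told about by default.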
Build · property.py
12 min
    # property.py: predict & present
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestRegressor
    # root_mean_squared_error needs scikit-learn 1.4 or newer
    from sklearn.metrics import mean_absolute_error, root_mean_squared_error
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.read_csv("property.csv")
    y = df.pop("price")  # target
    X = df               # everything else is a feature

    pre = ColumnTransformer([
        ("num", StandardScaler(), ["sqft", "rooms", "age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["type", "city"]),
    ])
    model = Pipeline([
        ("prep", pre),
        ("reg", RandomForestRegressor(n_estimators=300, random_state=0)),
    ])

    # cross-validated R² on the full dataset
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"CV R²: {r2:.3f}")

    # hold out 20% for a final error estimate
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)
    model.fit(Xtr, ytr)
    pred = model.predict(Xte)
    rmse = root_mean_squared_error(yte, pred)
    print(f"test RMSE : RM {rmse:,.0f}")
    print(f"test MAE : RM {mean_absolute_error(yte, pred):,.0f}")

    # predicted vs actual; points on the dashed line are perfect predictions
    plt.figure(figsize=(6, 6))
    plt.scatter(yte, pred, alpha=0.4, s=12)
    lim = [yte.min(), yte.max()]
    plt.plot(lim, lim, "r--")
    plt.xlabel("actual (RM)")
    plt.ylabel("predicted (RM)")
    plt.title("Property price: predicted vs actual")
    plt.tight_layout()
    plt.savefig("property_pred.png", dpi=150)

    # predict a new property, with an error band taken from the test RMSE
    new = pd.DataFrame([{"sqft": 1500, "rooms": 3, "age": 5,
                         "type": "terrace", "city": "KL"}])
    p = model.predict(new)[0]
    print(f"\nestimate: RM {p:,.0f} (± RM {rmse:,.0f})")
Sample output
    CV R²: 0.942
    test RMSE : RM 58,400
    test MAE : RM 44,100

    estimate: RM 742,000 (± RM 58,400)
Read the diff
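The estimate comes with a ± band derived from the test RMSE, so never quote a price to the nearest ringgit. Read the band as a typical error, not a guarantee: if the errors are roughly normal, about two-thirds of predictions land within ±1 RMSE and the rest miss by more. The high R² here is partly because our synthetic data is linear plus noise; real property data is noisier (location micro-effects, renovation, timing), so expect lower R² and wider bands in the wild.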
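An alternative to quoting ± RMSE is to read a band straight off the held-out residuals. The sketch below uses residual quantiles, which is a swapped-in technique rather than the lesson's method, and simulated numbers stand in for yte and model.predict(Xte) so that it runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for held-out actual prices and model predictions (simulated, not real results).
actual = rng.uniform(300_000, 1_200_000, 200)
predicted = actual + rng.normal(0, 50_000, 200)

residuals = actual - predicted

# Empirical 80% band: the 10th and 90th percentiles of the residuals.
lo, hi = np.quantile(residuals, [0.10, 0.90])

# Share of held-out errors that fall inside ±1 RMSE.
rmse = np.sqrt(np.mean(residuals ** 2))
within_one_rmse = np.mean(np.abs(residuals) <= rmse)

estimate = 742_000  # the point estimate from the sample output
print(f"estimate: RM {estimate:,.0f}")
print(f"80% band: RM {estimate + lo:,.0f} to RM {estimate + hi:,.0f}")
print(f"share of errors within ±1 RMSE: {within_one_rmse:.0%}")
```

The quantile band does not assume the errors are symmetric, and its coverage (80% here) is stated explicitly, which is easier to explain to a non-technical reader than "± RMSE".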
Extensions
13 min
- Compare LinearRegression and RandomForestRegressor on R². Which fits this data better, and why?
- Plot feature importances. Does the model agree that sqft and city drive price?
- Compute RMSE separately for each city. Is the model better in some cities than others? Why might that be?
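For the per-city exercise, one possible shape of the computation is a groupby over squared errors. The sketch below uses six made-up rows; in your own script, the actual and pred columns would come from yte and model.predict(Xte), and city from Xte["city"].

```python
import pandas as pd

# Made-up held-out results, for illustration only.
results = pd.DataFrame({
    "city":   ["KL", "KL", "Penang", "Penang", "Ipoh", "Ipoh"],
    "actual": [800_000, 650_000, 500_000, 620_000, 300_000, 350_000],
    "pred":   [770_000, 690_000, 510_000, 600_000, 305_000, 345_000],
})

# Squared error per row, then mean per city, then square root = RMSE per city.
results["sq_err"] = (results["pred"] - results["actual"]) ** 2
rmse_by_city = results.groupby("city")["sq_err"].mean().pow(0.5)

# Row counts matter: an RMSE computed from a handful of rows is itself noisy.
n_by_city = results.groupby("city").size()

print(pd.DataFrame({"rmse": rmse_by_city.round(0), "n": n_by_city}))
```

When comparing cities, also look at RMSE relative to the typical price there: the same RM error is a larger share of a cheaper market.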
Stretch · A Tiny Price App
8 min
Save the trained pipeline with joblib.dump(model, "model.pkl"). Write a small script that loads it and predicts a price from command-line args. (In Lesson 44 you'll wrap a model like this in a Flask web app.)
Show the save/load pattern
    import joblib
    import pandas as pd

    joblib.dump(model, "model.pkl")  # model = the fitted pipeline from property.py

    # later / elsewhere:
    loaded = joblib.load("model.pkl")
    new = pd.DataFrame([{"sqft": 1200, "rooms": 2, "age": 10,
                         "type": "apartment", "city": "Penang"}])
    print("RM", round(loaded.predict(new)[0]))
A whole pipeline pickles as one object — prep + model travel together. That's why pipelines beat doing prep manually.
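One way the command-line predictor could look is sketched below with argparse. The flag names, the --model option, and the helper names build_parser, args_to_frame, and main are assumptions for illustration; the column names match the lesson's data.

```python
import argparse

import joblib
import pandas as pd


def build_parser():
    """Command-line flags mirroring the five feature columns (flag names are illustrative)."""
    parser = argparse.ArgumentParser(description="Estimate a property price in RM.")
    parser.add_argument("--sqft", type=int, required=True)
    parser.add_argument("--rooms", type=int, required=True)
    parser.add_argument("--age", type=int, required=True)
    parser.add_argument("--type", required=True,
                        choices=["apartment", "terrace", "bungalow"])
    parser.add_argument("--city", required=True,
                        choices=["KL", "Penang", "JB", "Ipoh"])
    parser.add_argument("--model", default="model.pkl",
                        help="path to the pipeline saved with joblib.dump")
    return parser


def args_to_frame(args):
    """Build the one-row DataFrame the pipeline expects, with training column names."""
    return pd.DataFrame([{
        "sqft": args.sqft, "rooms": args.rooms, "age": args.age,
        "type": args.type, "city": args.city,
    }])


def main(argv=None):
    args = build_parser().parse_args(argv)
    model = joblib.load(args.model)  # only load pickles you created yourself
    price = model.predict(args_to_frame(args))[0]
    print(f"estimate: RM {price:,.0f}")


# In a real script, finish with:
# if __name__ == "__main__":
#     main()
```

Saved as, say, predict.py (a hypothetical filename) with the guard uncommented, it would be run as: python predict.py --sqft 1200 --rooms 2 --age 10 --type apartment --city Penang. Restricting --type and --city with choices rejects typos up front instead of silently encoding them as unknown categories.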
Recap
3 min
Regression = same pipeline, Regressor at the end, RMSE/R² instead of accuracy. Always present predictions with an error band. Save the whole pipeline with joblib so prep + model deploy as one. You now have two complete projects, and the classic-ML toolkit is done. Next: neural networks.
Homework
4 min
Finish the property predictor. Save the pipeline with joblib, then write a tiny predictor script. Report CV R², test RMSE, the top features, and one honest limitation of your model.
Combine property.py with the joblib save/load. A good limitation: "the model can't see renovation quality, exact street, or market timing — so individual estimates can be off by more than the average RMSE."