PY-L5-20 · Project — Malaysian Property Price Predictor

Project Goals

3 min

Build a regression pipeline for mixed-type property data.
Evaluate with RMSE and R² (cross-validated).
Engineer price-relevant features from the inputs only (sqft per room, age buckets); a feature built from the price itself would leak the target.
Present predictions and uncertainty honestly.

Warm-Up · Make / Find the Data

5 min

If you don't have a real Malaysian property CSV, generate a realistic synthetic one so the lesson runs anywhere:

import numpy as np
import pandas as pd
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "sqft":     rng.integers(600, 3000, n),
    "rooms":    rng.integers(1, 6, n),
    "age":      rng.integers(0, 40, n),
    "type":     rng.choice(["apartment", "terrace", "bungalow"], n),
    "city":     rng.choice(["KL", "Penang", "JB", "Ipoh"], n),
})
base = (df["sqft"] * 350 + df["rooms"] * 25000
        - df["age"] * 4000
        + df["type"].map({"apartment": 0, "terrace": 80000, "bungalow": 250000})
        + df["city"].map({"KL": 200000, "Penang": 120000, "JB": 60000, "Ipoh": 0}))
df["price"] = (base + rng.normal(0, 50000, n)).round(-3).clip(lower=80000)
df.to_csv("property.csv", index=False)

Today's big idea

Regression projects follow the same pipeline as classification — only the model class (Regressor) and metrics (RMSE/R², not accuracy) change. A price prediction is useless without an error estimate: "RM 480k ± 60k" is honest; "RM 480,213" pretends to a precision you don't have.
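One way to make the "honest format" concrete is a tiny formatting helper. This is only a sketch: format_estimate is a hypothetical name, not part of the lesson's script, and rounding to the nearest RM 1,000 is an assumed choice.

```python
def format_estimate(price: float, error: float) -> str:
    """Round price and error to the nearest RM 1,000 and show them together."""
    price_k = round(price / 1000)
    error_k = round(error / 1000)
    return f"RM {price_k:,}k ± {error_k:,}k"


# The over-precise figure from the text, paired with an example error size.
print(format_estimate(480_213, 58_400))  # RM 480k ± 58k
```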

Plan · Regression Pipeline

14 min

The pieces (same as Titanic, Regressor at the end)

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

num_cols = ["sqft", "rooms", "age"]
cat_cols = ["type", "city"]

pre = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
model = Pipeline([("prep", pre),
                  ("reg", RandomForestRegressor(n_estimators=300, random_state=0))])

Metrics for regression

from sklearn.model_selection import cross_val_score
X, y = df.drop(columns="price"), df["price"]   # df from the warm-up
r2  = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
rmse = -cross_val_score(model, X, y, cv=5,
                        scoring="neg_root_mean_squared_error").mean()
print(f"R²: {r2:.3f}   RMSE: RM {rmse:,.0f}")

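scikit-learn scorers follow a "higher is better" convention, so error metrics come back negated (hence neg_root_mean_squared_error); flip the sign back before reporting. RMSE is in the same unit as the target (RM), so it reads as a typical error size, with large misses weighted more heavily than in MAE.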
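To see what the two metrics measure, they can be computed by hand on a few numbers. A minimal sketch with NumPy; the four prices below are invented for illustration and do not come from the lesson's data.

```python
import numpy as np

# Hypothetical actual and predicted prices in RM (made up for this example).
actual = np.array([400_000, 550_000, 700_000, 900_000], dtype=float)
predicted = np.array([430_000, 520_000, 760_000, 880_000], dtype=float)

errors = predicted - actual

# RMSE: square the errors, average them, take the root. The unit is RM again.
rmse = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus (model's squared error / squared error of always guessing the mean).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSE: RM {rmse:,.0f}   R²: {r2:.3f}")
```

An R² of 0 means "no better than predicting the average price every time"; it can go negative for a model that is worse than that.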

Feature engineering ideas

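X["sqft_per_room"] = X["sqft"] / X["rooms"]   # average room size; it uses no price, so the name should not say "price"
# Add every new column to num_cols or cat_cols: ColumnTransformer drops unlisted columns by default.
# Other ideas: age buckets, is_new, etc.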
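The age-based ideas could look like the sketch below. The column names is_new and age_bucket, the 2-year cut-off, and the bucket edges are all assumptions chosen for illustration; tune them to your data.

```python
import pandas as pd

# Three toy rows using the lesson's column names.
X = pd.DataFrame({"sqft": [800, 1500, 2600],
                  "rooms": [2, 3, 5],
                  "age": [0, 12, 35]})

# Binary flag: treat anything up to 2 years old as "new" (assumed cut-off).
X["is_new"] = (X["age"] <= 2).astype(int)

# Buckets: (-1, 5] -> new, (5, 20] -> mid, (20, 100] -> old (assumed edges).
X["age_bucket"] = pd.cut(X["age"], bins=[-1, 5, 20, 100],
                         labels=["new", "mid", "old"])

print(X)
```

If you keep these, is_new would go into the numeric column list and age_bucket into the categorical one so the pipeline actually sees them.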

Build · property.py

12 min

# property.py — predict & present
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, root_mean_squared_error  # needs scikit-learn >= 1.4
import matplotlib.pyplot as plt

df = pd.read_csv("property.csv")
y = df.pop("price")
X = df

pre = ColumnTransformer([
    ("num", StandardScaler(), ["sqft", "rooms", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["type", "city"]),
])
model = Pipeline([("prep", pre),
                  ("reg", RandomForestRegressor(n_estimators=300, random_state=0))])

r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
print(f"CV R²: {r2:.3f}")

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(Xtr, ytr)
pred = model.predict(Xte)
print(f"test RMSE : RM {root_mean_squared_error(yte, pred):,.0f}")
print(f"test MAE  : RM {mean_absolute_error(yte, pred):,.0f}")

# predicted vs actual
plt.figure(figsize=(6, 6))
plt.scatter(yte, pred, alpha=0.4, s=12)
lim = [yte.min(), yte.max()]
plt.plot(lim, lim, "r--")
plt.xlabel("actual (RM)"); plt.ylabel("predicted (RM)")
plt.title("Property price: predicted vs actual")
plt.tight_layout(); plt.savefig("property_pred.png", dpi=150)

# predict a new property, with an error band
new = pd.DataFrame([{"sqft": 1500, "rooms": 3, "age": 5,
                     "type": "terrace", "city": "KL"}])
p = model.predict(new)[0]
rmse = root_mean_squared_error(yte, pred)
print(f"\nestimate: RM {p:,.0f}  (± RM {rmse:,.0f})")

Sample output (illustrative; your exact numbers will differ)

CV R²: 0.942
test RMSE : RM 58,400
test MAE  : RM 44,100

estimate: RM 742,000  (± RM 58,400)

Read the diff

The estimate comes with a ± band taken from the test RMSE, so never quote a price to the nearest ringgit. Read the band as a typical error size, not a guarantee: individual predictions can miss by more. The high R² here is partly because our synthetic prices are a linear formula plus noise; real property data is noisier (location micro-effects, renovation, timing), so expect lower R² and wider bands in the wild.
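A ± RMSE band is one option. Another, sketched here as an alternative rather than as the lesson's method, is to take quantiles of the test residuals, which gives a band you can describe in plain words ("80% of test errors fell in this range"). The arrays below are simulated stand-ins for yte and pred so the snippet runs on its own, and the point estimate is a hypothetical value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-ins for yte and pred from property.py (not real results).
actual = rng.uniform(300_000, 1_200_000, 120)
predicted = actual + rng.normal(0, 55_000, 120)

# Residual = how far the truth was from the prediction on held-out data.
residuals = actual - predicted
lo, hi = np.quantile(residuals, [0.10, 0.90])  # middle 80% of test errors

estimate = 700_000  # hypothetical point prediction for a new property
print(f"RM {estimate + lo:,.0f} to RM {estimate + hi:,.0f} "
      f"(80% of test errors fell in this range)")
```

This assumes errors on new properties look like errors on the test set, which is the same assumption the ± RMSE band makes.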

Extensions

13 min

01 🟢 Linear vs forest

Compare LinearRegression and RandomForestRegressor on R². Which fits this data better, and why?

02 🟡 Importance chart

Plot feature importances. Does the model agree that sqft and city drive price?

03 🔴 Per-city error

Compute RMSE separately for each city. Is the model better in some cities than others? Why might that be?
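One way to approach this is to put the city, the actual price, and the prediction side by side and group. A minimal sketch with made-up numbers standing in for Xte["city"], yte, and pred:

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for Xte["city"], yte and pred (not real results).
results = pd.DataFrame({
    "city":      ["KL", "KL", "Penang", "Penang", "Ipoh", "Ipoh"],
    "actual":    [900_000, 650_000, 700_000, 520_000, 400_000, 310_000],
    "predicted": [860_000, 690_000, 730_000, 500_000, 395_000, 330_000],
})

# Squared error per row, then mean per city, then the root.
results["sq_err"] = (results["predicted"] - results["actual"]) ** 2
per_city = np.sqrt(results.groupby("city")["sq_err"].mean())

print(per_city.round(0))
```

Check the row count per city too (results.groupby("city").size()): an RMSE from a handful of test rows is itself noisy.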

Stretch · A Tiny Price App

8 min

Save the trained pipeline with joblib.dump(model, "model.pkl"). Write a small script that loads it and predicts a price from command-line args. (In Lesson 44 you'll wrap a model like this in a Flask web app.)

Show the save/load pattern

import joblib
joblib.dump(model, "model.pkl")

# later / elsewhere (a fresh script needs its own imports):
import joblib
import pandas as pd
loaded = joblib.load("model.pkl")
new = pd.DataFrame([{"sqft": 1200, "rooms": 2, "age": 10,
                     "type": "apartment", "city": "Penang"}])
print("RM", round(loaded.predict(new)[0]))

A whole pipeline pickles as one object — prep + model travel together. That's why pipelines beat doing prep manually.
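For the command-line part of the stretch task, the main job is turning flags into the one-row DataFrame the pipeline expects. Below is a sketch using argparse; parse_property and the flag names are hypothetical choices, and the command line is simulated with a list so the snippet runs anywhere.

```python
import argparse

import pandas as pd


def parse_property(argv=None) -> pd.DataFrame:
    """Turn command-line flags into a one-row DataFrame with the training columns."""
    parser = argparse.ArgumentParser(description="Describe a property to price.")
    parser.add_argument("--sqft", type=int, required=True)
    parser.add_argument("--rooms", type=int, required=True)
    parser.add_argument("--age", type=int, required=True)
    parser.add_argument("--type", required=True,
                        choices=["apartment", "terrace", "bungalow"])
    parser.add_argument("--city", required=True,
                        choices=["KL", "Penang", "JB", "Ipoh"])
    args = parser.parse_args(argv)  # argv=None makes argparse read sys.argv
    return pd.DataFrame([vars(args)])


# Simulated command line. In a real script, call parse_property() with no
# argument and pass the returned row to the loaded pipeline's predict().
row = parse_property(["--sqft", "1200", "--rooms", "2", "--age", "10",
                      "--type", "apartment", "--city", "Penang"])
print(row)
```

Restricting --type and --city with choices makes a typo fail loudly at the command line; without it, handle_unknown="ignore" would quietly encode the unknown value as all zeros.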

Recap

3 min

Regression = same pipeline, Regressor at the end, RMSE/R² instead of accuracy. Always present predictions with an error band. Save the whole pipeline with joblib so prep + model deploy as one. You now have two complete projects — and the classic-ML toolkit is done. Next: neural networks.

Homework

4 min

Finish the property predictor. Save the pipeline with joblib, then write a tiny predictor script. Report CV R², test RMSE, the top features, and one honest limitation of your model.
