PY-L5-14 · Random Forests — Many Trees, One Vote

Learning Goals

3 min

Explain bagging: many trees on bootstrap samples, voting together.
Train a RandomForestClassifier in two lines.
Read forest feature importances (more stable than one tree).
Tune n_estimators and max_depth.

Warm-Up · Wisdom of the Crowd

5 min

Ask one person to guess the number of sweets in a jar — wildly off. Ask 100 people and average — eerily close. That's the random forest principle: many imperfect, diverse guessers beat one expert.

Random forest = 100s of decision trees, where each tree:
  - trains on a random BOOTSTRAP sample of rows (with replacement)
  - considers only a random SUBSET of features at each split
Final prediction = majority vote (classification) or average (regression)
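
The recipe above can be sketched by hand with plain decision trees. This is an illustration only, not how you would normally build a forest: the dataset choice, n_trees = 25, and the variable names are assumptions made for the example. Note that scikit-learn's own RandomForestClassifier averages the trees' class probabilities rather than counting hard votes, which usually gives the same answer.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

rng = np.random.default_rng(0)
n_trees = 25            # small and odd on purpose (no tied votes); illustrative
trees = []
unique_fractions = []

for i in range(n_trees):
    # Bootstrap: draw len(X_train) row indices WITH replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    unique_fractions.append(len(np.unique(idx)) / len(X_train))
    # max_features="sqrt": every split considers a random subset of features
    t = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    t.fit(X_train[idx], y_train[idx])
    trees.append(t)

# Majority vote: labels are 0/1, so a mean vote above 0.5 means class 1 wins
votes = np.array([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

single_acc = np.mean([t.score(X_test, y_test) for t in trees])
vote_acc = (forest_pred == y_test).mean()
print(f"avg single-tree accuracy: {single_acc:.3f}")
print(f"majority-vote accuracy:   {vote_acc:.3f}")
print(f"distinct rows per bootstrap sample: {np.mean(unique_fractions):.0%}")
```

Each bootstrap sample contains only about two thirds of the distinct training rows, so every tree sees a slightly different dataset.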

Today's big idea

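Each tree overfits differently. Because random rows and random features leave the trees' errors only weakly correlated, voting cancels much of the noise and keeps the signal. Bootstrap sampling plus voting is "bagging" (bootstrap aggregating); a random forest adds the random feature subsets on top, and it works well with very little tuning.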
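
A quick simulation shows why voting helps. It is a toy model, not a forest: it assumes 101 voters who are each right 65% of the time and whose mistakes are fully independent (both numbers are made up for illustration). Real trees are partly correlated, so the gain in practice is smaller than this idealized case.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_voters, p_correct = 10_000, 101, 0.65   # illustrative values

# correct[i, j] is True when voter j gets case i right; all draws independent
correct = rng.random((n_cases, n_voters)) < p_correct

individual_acc = correct.mean()
# The majority is right when more than half of the voters are right
majority_acc = (correct.sum(axis=1) > n_voters / 2).mean()

print(f"one voter:       {individual_acc:.3f}")
print(f"majority of {n_voters}: {majority_acc:.3f}")
```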

New Concept · The Forest

14 min

Two lines to a strong model

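from sklearn.ensemble import RandomForestClassifier

# assumes X_train, X_test, y_train, y_test already exist (e.g. from train_test_split)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))   # accuracy on the held-out rows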

Why it beats a single tree

single tree:  high accuracy on train, twitchy on test (high variance)
forest:       averages many trees → much lower variance, similar bias
              → better test accuracy, more stable across splits
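
One common way to make "lower variance, similar bias" precise is the formula below. It rests on a simplifying assumption: every tree's prediction T_b(x) has the same variance σ², and any two trees have the same correlation ρ. The symbols B (number of trees), σ², and ρ are introduced here for illustration and do not appear elsewhere in the lesson.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

Adding trees shrinks the second term toward zero, but the first term stays. That is why the random feature subsets matter: they lower ρ, the part that more trees alone cannot remove.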

Key knobs

n_estimators   how many trees (more = steadier but slower; gains usually flatten by ~100-300)
max_depth      cap each tree's depth (None = grow fully, usual default)
max_features   features considered per split ("sqrt" is the classifier default; the regressor defaults to all features)
n_jobs=-1      use all CPU cores (trees train independently → parallel)
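
To see one knob in action, here is a small sketch that sweeps max_depth while holding the other settings fixed. The dataset and the depth values are choices made for this example, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

results = {}
for depth in [1, 3, 5, None]:          # None = grow each tree fully
    rf = RandomForestClassifier(
        n_estimators=100,
        max_depth=depth,
        max_features="sqrt",
        n_jobs=-1,
        random_state=0,
    )
    results[depth] = cross_val_score(rf, X, y, cv=5).mean()
    print(f"max_depth={depth!s:<5} CV accuracy={results[depth]:.3f}")
```

On many datasets the scores change only a little once the trees are a few levels deep, which is part of why forests need little tuning.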

Feature importances: now much steadier

import pandas as pd
# feature_names: the list of column names for X, in the same order as its columns
imp = pd.Series(rf.feature_importances_, index=feature_names)
print(imp.sort_values(ascending=False).head())

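Averaged over many trees, forest importances are far more stable than a single tree's, which makes them a useful first look at which features matter. They are still impurity-based, so correlated features share the credit, and a high score shows association, not cause.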
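
The stability claim can be checked with a rough experiment: refit a single tree and a forest on several bootstrap resamples of the same data and see how much each feature's importance moves. The number of refits (10) and the "spread" summary are assumptions made for this sketch, not a standard metric.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

tree_imps, forest_imps = [], []
for seed in range(10):
    # Refit on a fresh bootstrap resample to mimic "slightly different data"
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx])
    forest = RandomForestClassifier(
        n_estimators=100, random_state=seed, n_jobs=-1
    ).fit(X[idx], y[idx])
    tree_imps.append(tree.feature_importances_)
    forest_imps.append(forest.feature_importances_)

# Std of each feature's importance across the refits, averaged over features
tree_spread = np.std(tree_imps, axis=0).mean()
forest_spread = np.std(forest_imps, axis=0).mean()
print(f"single tree spread: {tree_spread:.4f}")
print(f"forest spread:      {forest_spread:.4f}")
```

A smaller spread means the ranking changes less when the data changes a little.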

Regression too

from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=200, random_state=0)
# same fit / predict / score — score returns R²
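
A runnable version of the regression case, as a sketch. It uses scikit-learn's built-in diabetes dataset, which is a choice made for this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rfr = RandomForestRegressor(n_estimators=200, random_state=0, n_jobs=-1)
rfr.fit(X_train, y_train)

r2 = rfr.score(X_test, y_test)     # R² on held-out rows, not accuracy
pred = rfr.predict(X_test[:3])     # each value is the average over all trees
print(f"test R²: {r2:.3f}")
print("first predictions:", pred.round(1))
```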

Worked Example · Tree vs Forest

12 min

# tree_vs_forest.py
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

X, y = load_breast_cancer(return_X_y=True)

tree   = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0, n_jobs=-1)

for name, m in [("single tree", tree), ("random forest", forest)]:
    s = cross_val_score(m, X, y, cv=5)
    print(f"  {name:<14} {s.mean():.1%} ± {s.std():.1%}")

# Top features from the forest
forest.fit(X, y)
names = load_breast_cancer().feature_names
imp = pd.Series(forest.feature_importances_, index=names)
print("\ntop features:")
print(imp.sort_values(ascending=False).head(5).round(3))

Sample output

  single tree    91.4% ± 2.3%
  random forest  96.0% ± 1.6%

top features:
worst perimeter      0.142
worst concave points 0.139
worst radius         0.111
mean concave points  0.104
worst area           0.092

Read the diff

The forest beats the single tree on both mean accuracy (96.0% vs 91.4%) and stability (std of 1.6% vs 2.3%). The importances also make medical sense: "worst" (largest) cell measurements dominate. Perimeter, radius, and area are closely related measurements, so they share the credit between them. Random forests are a sensible default for tabular data: strong, robust, and rarely in need of heavy tuning.

Try It Yourself

13 min

01 🟢 Train a forest

Train a 100-tree forest on any dataset. Print CV accuracy.

02 🟡 Does more trees help?

Plot CV accuracy for n_estimators = 1, 5, 10, 50, 100, 300. Where do the gains flatten?

Hint

for n in [1, 5, 10, 50, 100, 300]:
    s = cross_val_score(RandomForestClassifier(n_estimators=n, random_state=0),
                        X, y, cv=5).mean()
    print(n, round(s, 3))

03 🔴 Importance chart

Plot the top 10 feature importances as a horizontal bar chart. Interpret the top three.

Mini-Challenge · Beat the Baseline

8 min

Take any dataset. Establish a baseline with a single tree, then try to beat it with a tuned random forest (sweep n_estimators and max_depth via cross-validation). Report the improvement.

Show one possible solution

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

grid = GridSearchCV(
    RandomForestClassifier(random_state=0, n_jobs=-1),
    {"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    cv=5,
)
grid.fit(X, y)
print("best params:", grid.best_params_)
print("best CV acc:", round(grid.best_score_, 3))

Non-negotiables: a baseline number, a CV-tuned forest, and the reported gain. GridSearchCV automates the "try every combination with CV" loop.
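
The three non-negotiables can be put together in one short script. This is a sketch under assumptions: it uses the breast-cancer data from the worked example, and the grid is deliberately smaller than the one above so that it runs quickly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 1. Baseline: a single tree, scored with the same 5-fold CV
baseline = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=5
).mean()

# 2. CV-tuned forest (small illustrative grid)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0, n_jobs=-1),
    {"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=5,
)
grid.fit(X, y)

# 3. Reported gain, in percentage points
gain = (grid.best_score_ - baseline) * 100
print(f"baseline tree: {baseline:.3f}")
print(f"tuned forest:  {grid.best_score_:.3f}  {grid.best_params_}")
print(f"gain:          {gain:+.1f} percentage points")
```

Because the best score is picked from the same folds used for tuning, treat it as slightly optimistic; a held-out test set gives a cleaner final number.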

Recap

3 min

A random forest grows many diverse trees (random rows + random features) and votes. The diversity cancels much of each tree's overfitting, giving higher, more stable accuracy with almost no tuning. Importances are stable enough to guide interpretation, though they show association, not cause. For most tabular problems, a random forest is a strong default. Next: predicting numbers with linear regression.

Vocabulary Card

ensemble: A model made of many models whose predictions are combined.
bagging: Bootstrap Aggregating — train each model on a random resample, then vote/average.
n_estimators: Number of trees in the forest.
GridSearchCV: Tries every hyperparameter combination with cross-validation to find the best.

Homework

4 min

On your dataset, compare single tree vs forest (CV mean ± std), plot the forest's top-10 importances, and tune n_estimators + max_depth with GridSearchCV. Write the final accuracy and the best params.
