Learning Goals
3 min
- Explain bagging: many trees on bootstrap samples, voting together.
- Train a RandomForestClassifier in two lines.
- Read forest feature importances (more stable than one tree).
- Tune n_estimators and max_depth.
Warm-Up · Wisdom of the Crowd
5 min
Ask one person to guess the number of sweets in a jar: wildly off. Ask 100 people and average: eerily close. That's the random forest principle: many imperfect, diverse guessers beat one expert.
Random forest = 100s of decision trees, where each tree:
- trains on a random BOOTSTRAP sample of rows (with replacement)
- considers only a random SUBSET of features at each split
Final prediction = majority vote (classification) or average (regression)
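Each tree overfits differently. Because the trees see different rows and different features, their errors are only weakly correlated, so voting cancels much of the noise and keeps the signal. Training on bootstrap samples and combining the results is "bagging" (bootstrap aggregating); a random forest adds the random feature subsets on top. It works well with almost no tuning.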
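The recipe above can be sketched by hand to see what the library automates. This is a minimal illustration, not how scikit-learn is implemented: the dataset (`load_breast_cancer`), the 25-tree count, and the variable names are all assumptions chosen for the demo. Note also that scikit-learn's own forest averages each tree's predicted class probabilities rather than counting hard votes, so its results will differ slightly from this sketch.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # illustrative choice; odd, so a binary vote cannot tie
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw len(X_train) row indices WITH replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes every split consider a random feature subset
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Majority vote: labels are 0/1, so the mean vote is the share of trees voting 1
votes = np.array([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

single_acc = np.mean([t.score(X_test, y_test) for t in trees])
bagged_acc = (bagged_pred == y_test).mean()
print(f"average single tree: {single_acc:.3f}")
print(f"voted ensemble:      {bagged_acc:.3f}")
```

Comparing the two printed numbers shows how much the vote adds over a typical individual tree on this particular split.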
New Concept · The Forest
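14 min
Two lines to a strong model
    from sklearn.ensemble import RandomForestClassifier

    # X_train, X_test, y_train, y_test are assumed to come from an earlier train/test split
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_train, y_train)
    print(rf.score(X_test, y_test))  # accuracy on the held-out test set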
Why it beats a single tree
single tree: high accuracy on train, twitchy on test (high variance)
forest: averages many trees → much lower variance, similar bias
→ better test accuracy, more stable across splits
Key knobs
n_estimators   how many trees (more = better but slower; ~100-300 is plenty)
max_depth      cap each tree's depth (None = grow fully, usual default)
max_features   features considered per split ("sqrt" is the classic default)
n_jobs=-1      use all CPU cores (trees train independently → parallel)
Feature importances: now trustworthy
    import pandas as pd

    # feature_names: the column names of X, in order (assumed to be defined already)
    imp = pd.Series(rf.feature_importances_, index=feature_names)
    print(imp.sort_values(ascending=False).head())
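Averaged over many trees, forest importances are far more stable than a single tree's, which makes them a practical tool for understanding which features matter. One caveat: these impurity-based scores can overstate features with many distinct values, so read them as a guide rather than proof.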
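The averaging can be inspected directly, because a fitted forest exposes its individual trees through `estimators_`. The sketch below assumes the breast cancer dataset and a 100-tree forest purely for illustration; it lists each feature's forest-level importance next to how much that importance varies from tree to tree.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(data.data, data.target)

# One row of importances per tree: shape (n_trees, n_features)
per_tree = np.array([t.feature_importances_ for t in rf.estimators_])

summary = pd.DataFrame(
    {
        "forest": rf.feature_importances_,          # average over all trees
        "std_across_trees": per_tree.std(axis=0),   # tree-to-tree disagreement
    },
    index=data.feature_names,
)
print(summary.sort_values("forest", ascending=False).head().round(3))
```

A large spread next to a feature means individual trees disagree about it, which is exactly the instability that averaging smooths out.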
Regression too
    from sklearn.ensemble import RandomForestRegressor

    rfr = RandomForestRegressor(n_estimators=200, random_state=0)
    # same fit / predict / score; score returns R²
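To see the regressor end to end, here is a small runnable sketch. The dataset (`load_diabetes`, bundled with scikit-learn) and the default train/test split are assumptions made for the example, not part of the lesson's own material.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rfr = RandomForestRegressor(n_estimators=200, random_state=0)
rfr.fit(X_train, y_train)

# For regressors, score() returns R² rather than accuracy
r2 = rfr.score(X_test, y_test)
first_preds = rfr.predict(X_test[:3])  # each prediction is the average over all trees
print(f"test R^2: {r2:.3f}")
print("first predictions:", first_preds.round(1))
```

The workflow is the same as for the classifier; only the meaning of the score changes.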
Worked Example · Tree vs Forest
12 min
    # tree_vs_forest.py
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    import pandas as pd

    X, y = load_breast_cancer(return_X_y=True)

    tree = DecisionTreeClassifier(random_state=0)
    forest = RandomForestClassifier(n_estimators=200, random_state=0, n_jobs=-1)

    for name, m in [("single tree", tree), ("random forest", forest)]:
        s = cross_val_score(m, X, y, cv=5)
        print(f" {name:<14} {s.mean():.1%} ± {s.std():.1%}")

    # Top features from the forest
    forest.fit(X, y)
    names = load_breast_cancer().feature_names
    imp = pd.Series(forest.feature_importances_, index=names)
    print("\ntop features:")
    print(imp.sort_values(ascending=False).head(5).round(3))
Sample output
     single tree    91.4% ± 2.3%
     random forest  96.0% ± 1.6%

    top features:
    worst perimeter         0.142
    worst concave points    0.139
    worst radius            0.111
    mean concave points     0.104
    worst area              0.092
Read the diff
The forest beats the single tree on both mean (96% vs 91%) and stability (smaller std). And the importances make medical sense — "worst" (largest) cell measurements dominate. Random forests are the sensible default for tabular data: strong, robust, and barely need tuning.
Try It Yourself
13 min
Train a 100-tree forest on any dataset. Print CV accuracy.
Plot CV accuracy for n_estimators = 1, 5, 10, 50, 100, 300. Where do the gains flatten?
Hint
    for n in [1, 5, 10, 50, 100, 300]:
        s = cross_val_score(RandomForestClassifier(n_estimators=n, random_state=0), X, y, cv=5).mean()
        print(n, round(s, 3))
Plot the top 10 feature importances as a horizontal bar chart. Interpret the top three.
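If the plotting step is the sticking point, one possible starting sketch follows. It assumes matplotlib is installed and uses the breast cancer dataset as a stand-in for your own data; the output filename is hypothetical. The interpretation of the top three features is still yours to write.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window instead
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()  # stand-in dataset; swap in your own X, y, and column names
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(data.data, data.target)

imp = pd.Series(rf.feature_importances_, index=data.feature_names)
top10 = imp.sort_values(ascending=False).head(10)

fig, ax = plt.subplots(figsize=(6, 4))
# barh draws the first row at the bottom, so reverse to put the top feature first
top10[::-1].plot.barh(ax=ax)
ax.set_xlabel("importance")
ax.set_title("Top 10 forest feature importances")
fig.tight_layout()
fig.savefig("top10_importances.png")  # hypothetical output filename
```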
Mini-Challenge · Beat the Baseline
8 min
Take any dataset. Establish a baseline with a single tree, then try to beat it with a tuned random forest (sweep n_estimators and max_depth via cross-validation). Report the improvement.
Show one possible solution
    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestClassifier

    grid = GridSearchCV(
        RandomForestClassifier(random_state=0, n_jobs=-1),
        {"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
        cv=5,
    )
    grid.fit(X, y)
    print("best params:", grid.best_params_)
    print("best CV acc:", round(grid.best_score_, 3))
Non-negotiables: a baseline number, a CV-tuned forest, and the reported gain. GridSearchCV automates the "try every combination with CV" loop.
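The three non-negotiables can be put together in one script. The sketch below is one way to do it, assuming the breast cancer dataset and a deliberately small grid so it runs quickly; both are illustrative choices. Keep in mind that `best_score_` is picked as the maximum over the grid on the same folds, so the reported gain is slightly optimistic.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 1. Baseline number: a single untuned tree, scored with 5-fold CV
baseline = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()

# 2. CV-tuned forest (small illustrative grid)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=5,
)
grid.fit(X, y)

# 3. Reported gain: tuned forest's CV accuracy minus the baseline
gain = grid.best_score_ - baseline
print(f"baseline tree : {baseline:.3f}")
print(f"tuned forest  : {grid.best_score_:.3f}  {grid.best_params_}")
print(f"gain          : {gain:+.3f}")
```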
Recap
3 min
A random forest grows many diverse trees (random rows + random features) and votes. The diversity cancels much of each tree's overfitting, giving higher, more stable accuracy with almost no tuning. Its importances are far more stable than a single tree's. For most tabular problems, a random forest is the right default. Next: predicting numbers with linear regression.
Vocabulary Card
- ensemble: A model made of many models whose predictions are combined.
- bagging: Bootstrap aggregating. Train each model on a random resample, then vote/average.
- n_estimators: Number of trees in the forest.
- GridSearchCV: Tries every hyperparameter combination with cross-validation to find the best.
Homework
4 min
On your dataset, compare single tree vs forest (CV mean ± std), plot the forest's top-10 importances, and tune n_estimators + max_depth with GridSearchCV. Write the final accuracy and the best params.
Combine tree_vs_forest.py with the GridSearchCV block from the mini-challenge.