PY-L5-19 · Project — Predict Titanic Survival

Project Goals

3 min

Run the whole supervised pipeline on a real, messy dataset.
Build a ColumnTransformer + model pipeline.
Evaluate with cross-validation and a classification report.
Interpret which factors mattered for survival.

Warm-Up · The Data

5 min

import seaborn as sns
df = sns.load_dataset("titanic")    # fetched online on first use, then cached locally
print(df[["survived","pclass","sex","age","fare","embarked"]].head())
print(df.isna().sum())              # age and others have gaps

   survived  pclass     sex   age     fare embarked
0         0       3    male  22.0   7.2500        S
1         1       1  female  38.0  71.2833        C
...
age          177   ← big gap to handle
embarked       2

Today's big idea

Real data is messy: missing ages, categorical sex/port, numeric fare. Your pipeline must handle all of it. The story the model uncovers — "women and first-class survived more" — is historically accurate, and a lesson in how data encodes society.

Plan · The Six Steps

14 min

1. LOAD     seaborn titanic
2. EXPLORE  survival rate by sex, class (a quick groupby)
3. SELECT   features: pclass, sex, age, sibsp, parch, fare, embarked
4. PREP     impute age/embarked, scale numerics, one-hot categoricals
5. TRAIN    pipeline + RandomForest, cross-validated
6. EVALUATE classification report + feature importances

Step 2 — explore first

print(df.groupby("sex")["survived"].mean().round(2))
print(df.groupby("pclass")["survived"].mean().round(2))

sex
female    0.74
male      0.19
pclass
1    0.63
2    0.47
3    0.24

Before training, you already expect sex and pclass to dominate. If the trained model ranks them near the bottom, suspect a bug in the pipeline before you trust the model.
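The two groupbys look at one factor at a time. A pivot table shows both together, which helps when the factors interact. A minimal sketch on a hand-made toy frame (the rows are invented for illustration, not taken from the Titanic data):

```python
import pandas as pd

# Hypothetical toy rows, NOT the real Titanic data.
toy = pd.DataFrame({
    "sex":      ["female", "female", "male", "male", "female", "male", "male", "female"],
    "pclass":   [1, 3, 1, 3, 1, 3, 3, 3],
    "survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the survival rate inside each (sex, pclass) cell.
rates = toy.pivot_table(index="sex", columns="pclass",
                        values="survived", aggfunc="mean")
print(rates.round(2))
```

On the real data, the same call with df in place of toy gives a sex-by-class grid of survival rates.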

Steps 3-4 — the pipeline pieces

num_cols = ["age", "sibsp", "parch", "fare"]
cat_cols = ["pclass", "sex", "embarked"]
# numeric: median impute + scale
# categorical: most_frequent impute + one-hot
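To see what each piece does before wiring them together, here is a small sketch that runs the four transformers by hand on a toy frame. The toy values are made up; only the transformer classes and strategies come from the plan above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one gap in each column.
toy = pd.DataFrame({
    "age":      [20.0, np.nan, 40.0, 30.0],
    "embarked": ["S", "C", np.nan, "S"],
})

# numeric: the median of [20, 40, 30] is 30, so the gap becomes 30
age_filled = SimpleImputer(strategy="median").fit_transform(toy[["age"]])
# then rescale to mean 0 and standard deviation 1
age_scaled = StandardScaler().fit_transform(age_filled)

# categorical: "S" is the most frequent port, so the gap becomes "S"
emb_filled = SimpleImputer(strategy="most_frequent").fit_transform(toy[["embarked"]])
# then one column per category, in sorted order: C, S
enc = OneHotEncoder(handle_unknown="ignore")
emb_onehot = enc.fit_transform(emb_filled).toarray()

print(age_filled.ravel())       # imputed ages
print(age_scaled.ravel().round(2))
print(emb_filled.ravel())       # imputed ports
print(enc.categories_[0], emb_onehot, sep="\n")
```

In the real pipeline you never call these by hand: the ColumnTransformer runs each chain on its own column list.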

Build · titanic.py

12 min

# titanic.py — full ML pipeline
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

df = sns.load_dataset("titanic")
y = df["survived"]
X = df[["pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]]

num_cols = ["age", "sibsp", "parch", "fare"]
cat_cols = ["pclass", "sex", "embarked"]

pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                      ("sc", StandardScaler())]), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

model = Pipeline([("prep", pre),
                  ("clf", RandomForestClassifier(n_estimators=200, random_state=0))])

# cross-validated accuracy
cv = cross_val_score(model, X, y, cv=5)
print(f"CV accuracy: {cv.mean():.1%} ± {cv.std():.1%}")

# final fit + report on a held-out test set
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2,
                                      stratify=y, random_state=0)
model.fit(Xtr, ytr)
print(classification_report(yte, model.predict(Xte),
                            target_names=["died", "survived"]))

# feature importances (names come out of the ColumnTransformer)
names = model.named_steps["prep"].get_feature_names_out()
imp = pd.Series(model.named_steps["clf"].feature_importances_, index=names)
print("\ntop factors:")
print(imp.sort_values(ascending=False).head(5).round(2))

Sample output

CV accuracy: 81.0% ± 2.4%

              precision    recall  f1-score   support
        died       0.83      0.86      0.84       110
    survived       0.77      0.72      0.75        69
    accuracy                           0.81       179

top factors:
cat__sex_male       0.27
num__fare           0.21
num__age            0.18
cat__pclass_3       0.10
num__sibsp          0.06

Read the results

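The model agrees with your exploration: sex leads, followed by fare, age, and third class. About 81% accuracy is solid for Titanic. The importances let you tell the story honestly, with one caution: they show how much the forest relied on a feature, not the direction of its effect. Also notice that recall for "survived" (0.72) is lower than for "died" (0.86), meaning the model misses about 28% of the actual survivors. That's where you'd focus next.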
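If the recall column is still fuzzy, compute it once by hand. The sketch below uses ten made-up labels (not the model's predictions) so that every count is easy to check.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for 10 passengers (1 = survived); invented for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# For labels [0, 1], ravel() returns tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)        # of the real survivors, how many did we find?
precision = tp / (tp + fp)     # of the predicted survivors, how many were real?

print(f"tp={tp} fn={fn} fp={fp} tn={tn}")            # tp=3 fn=1 fp=1 tn=5
print(f"recall={recall:.2f} precision={precision:.2f}")  # recall=0.75 precision=0.75
```

One missed survivor out of four gives recall 0.75; the report's 0.72 is the same calculation on the 69 survivors in the test set.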

Extensions

13 min

01 🟢 Engineer family_size

Add a feature family_size = sibsp + parch + 1. Does CV accuracy improve?
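A common slip here is adding the column to X but forgetting to list it in num_cols. ColumnTransformer drops unlisted columns by default, so the model would never see the new feature. A small sketch on toy rows (values invented; "passthrough" stands in for the impute-and-scale chain to keep it short):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer

# Hypothetical toy rows, not the real dataset.
X_toy = pd.DataFrame({
    "sibsp": [1, 0, 3],
    "parch": [0, 0, 2],
    "fare":  [7.25, 71.28, 31.0],
})
# The passenger counts too, hence the + 1.
X_toy = X_toy.assign(family_size=X_toy["sibsp"] + X_toy["parch"] + 1)

old_cols = ["sibsp", "parch", "fare"]
new_cols = old_cols + ["family_size"]

# Unlisted columns are dropped (the default is remainder="drop").
without = ColumnTransformer([("num", "passthrough", old_cols)]).fit_transform(X_toy)
with_fs = ColumnTransformer([("num", "passthrough", new_cols)]).fit_transform(X_toy)

print(X_toy["family_size"].tolist())   # [2, 1, 6]
print(without.shape, with_fs.shape)    # (3, 3) (3, 4)
```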

02 🟡 Try other models

Swap RandomForest for LogisticRegression and a GradientBoostingClassifier. Compare CV accuracy.
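One way to organize the comparison is a dictionary of candidates and a loop, so every model gets the same preprocessing and the same folds. The sketch below shows only the loop shape: it uses synthetic data from make_classification and a bare StandardScaler as the "prep" step, both stand-ins for the project's X, y, and ColumnTransformer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: replace with the Titanic X and y).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

candidates = {
    "forest":   RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic": LogisticRegression(max_iter=1000),
    "boosting": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, clf in candidates.items():
    # Same preprocessing step for every candidate; only "clf" changes.
    pipe = Pipeline([("prep", StandardScaler()), ("clf", clf)])
    cv = cross_val_score(pipe, X_demo, y_demo, cv=5)
    scores[name] = cv.mean()
    print(f"{name:9s} {cv.mean():.1%} ± {cv.std():.1%}")
```

When the means differ by less than the fold-to-fold spread, treat the models as tied rather than declaring a winner.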

03 🔴 Predict a new passenger

Build a one-row DataFrame for "a 10-year-old girl in first class, fare 80, embarked C" and predict her survival probability.

Hint

new = pd.DataFrame([{
    "pclass": 1, "sex": "female", "age": 10,
    "sibsp": 0, "parch": 1, "fare": 80, "embarked": "C",
}])
# columns follow model.classes_, here [0, 1]: P(died), P(survived)
print(model.predict_proba(new).round(2))

Stretch · Beat 83%

8 min

Try to push CV accuracy past 83%: engineer features (is_alone, fare bins), tune the forest with GridSearchCV, or try gradient boosting. A title-from-name feature needs the Kaggle CSV, because seaborn's copy of the dataset has no name column. Document what helped and what didn't.
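When you tune a model that sits inside a Pipeline, parameter names take the step name as a prefix, joined by a double underscore: clf__max_depth reaches the max_depth of the "clf" step. A minimal sketch, assuming synthetic data and a two-by-two grid picked only to keep the run short (the grid values are not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: replace with the Titanic X and y).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline([("prep", StandardScaler()),
                 ("clf", RandomForestClassifier(random_state=0))])

# "<step name>__<parameter>" addresses a parameter inside the pipeline.
param_grid = {
    "clf__n_estimators": [50, 100],
    "clf__max_depth": [3, None],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"best CV accuracy: {search.best_score_:.1%}")
```

Because the whole pipeline is refit inside each fold, imputation and scaling never see the validation rows, so the tuned score stays comparable to your earlier CV number.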

Recap

3 min

You ran a complete ML project: explore → prep (impute, scale, encode in a ColumnTransformer) → train (cross-validated forest) → evaluate (report + importances) → interpret. The model recovered the real history. This six-step shape recurs in nearly every supervised project you'll do. Next: the same shape, for predicting a number.

Homework

4 min

Complete the Titanic project and write a one-page report: your CV accuracy, the classification report, the top factors, and a paragraph telling the survival story the data reveals (and one caveat about reading too much into it).
