Project Goals
3 min
- Run the whole supervised pipeline on a real, messy dataset.
- Build a ColumnTransformer + model pipeline.
- Evaluate with cross-validation and a classification report.
- Interpret which factors mattered for survival.
Warm-Up · The Data
5 min
import seaborn as sns

df = sns.load_dataset("titanic")  # fetched online on first use, then cached locally
print(df[["survived", "pclass", "sex", "age", "fare", "embarked"]].head())
print(df.isna().sum())  # age and others have gaps
   survived  pclass     sex   age     fare embarked
0         0       3    male  22.0   7.2500        S
1         1       1  female  38.0  71.2833        C
...
age         177   ← big gap to handle
embarked      2
Real data is messy: missing ages, categorical sex/port, numeric fare. Your pipeline must handle all of it. The story the model uncovers — "women and first-class survived more" — is historically accurate, and a lesson in how data encodes society.
Plan · The Six Steps
14 min
1. LOAD      seaborn titanic
2. EXPLORE   survival rate by sex, class (a quick groupby)
3. SELECT    features: pclass, sex, age, sibsp, parch, fare, embarked
4. PREP      impute age/embarked, scale numerics, one-hot categoricals
5. TRAIN     pipeline + RandomForest, cross-validated
6. EVALUATE  classification report + feature importances
Step 2 — explore first
print(df.groupby("sex")["survived"].mean().round(2))
print(df.groupby("pclass")["survived"].mean().round(2))
sex
female    0.74
male      0.19
pclass
1    0.63
2    0.47
3    0.24
Before training, you already expect sex and pclass to dominate. If your trained model disagrees, something is wrong.
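Grouping by both keys at once shows how sex and class interact, which the two separate groupbys cannot. The sketch below is a minimal illustration on a small hand-made frame (`toy`, hypothetical rows rather than real passengers) so it runs without a download; on the real data the same line should work with `df` in place of `toy`.

```python
import pandas as pd

# Hypothetical toy rows standing in for the Titanic frame (not real passengers).
toy = pd.DataFrame({
    "sex":      ["female", "female", "female", "male", "male", "male", "male", "female"],
    "pclass":   [1, 1, 3, 1, 3, 3, 3, 3],
    "survived": [1, 1, 0, 1, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the survival rate.
# unstack("pclass") moves the class level into columns, giving a sex x class grid.
rates = toy.groupby(["sex", "pclass"])["survived"].mean().unstack("pclass")
print(rates.round(2))
```

Cells with no passengers show up as NaN, which is itself useful: it tells you a sex/class combination is too thin to say anything about.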
Steps 3-4 — the pipeline pieces
num_cols = ["age", "sibsp", "parch", "fare"]
cat_cols = ["pclass", "sex", "embarked"]
# numeric: median impute + scale
# categorical: most_frequent impute + one-hot
Build · titanic.py
12 min
# titanic.py: full ML pipeline
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

df = sns.load_dataset("titanic")
y = df["survived"]
X = df[["pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]]

num_cols = ["age", "sibsp", "parch", "fare"]
cat_cols = ["pclass", "sex", "embarked"]

pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                      ("sc", StandardScaler())]), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

model = Pipeline([("prep", pre),
                  ("clf", RandomForestClassifier(n_estimators=200, random_state=0))])

# cross-validated accuracy
cv = cross_val_score(model, X, y, cv=5)
print(f"CV accuracy: {cv.mean():.1%} ± {cv.std():.1%}")

# final fit + report on a held-out test set
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
model.fit(Xtr, ytr)
print(classification_report(yte, model.predict(Xte), target_names=["died", "survived"]))

# feature importances (names come out of the ColumnTransformer)
names = model.named_steps["prep"].get_feature_names_out()
imp = pd.Series(model.named_steps["clf"].feature_importances_, index=names)
print("\ntop factors:")
print(imp.sort_values(ascending=False).head(6).round(3))
Sample output
CV accuracy: 81.0% ± 2.4%
              precision    recall  f1-score   support
        died       0.83      0.86      0.84       110
    survived       0.77      0.72      0.75        69
    accuracy                           0.81       179

top factors:
cat__sex_male    0.27
num__fare        0.21
num__age         0.18
cat__pclass_3    0.10
num__sibsp       0.06
Read the diff
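The model agrees with your exploration: sex leads, with fare and class (closely linked, since first-class tickets cost more) and age following. About 81% accuracy is solid for Titanic. The importances let you tell the story honestly, with one caution: impurity-based importances tend to favor continuous features such as fare and age, so treat the ranking as a guide rather than a measurement. Also notice that "survived" recall (0.72) is lower than "died" recall (0.86), meaning the model misses more than a quarter of the actual survivors. That's where you'd focus next.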
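Recall for a class is the share of its actual members that the model found: true positives divided by true positives plus false negatives. The sketch below works that out by hand from a confusion matrix and checks it against `recall_score`. The ten labels are hypothetical, chosen for easy arithmetic, and are not the model's real predictions.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels for ten passengers: 1 = survived, 0 = died.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# Rows are true classes, columns are predicted classes;
# for binary labels ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall for "survived": of the actual survivors, what share did we catch?
recall_by_hand = tp / (tp + fn)
print(tn, fp, fn, tp)
print(recall_by_hand, recall_score(y_true, y_pred))
```

Applied to the sample report, a recall of 0.72 on 69 survivors works out to roughly 50 found and 19 missed.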
Extensions
13 min
Add a feature family_size = sibsp + parch + 1. Does CV accuracy improve?
Swap RandomForestClassifier for LogisticRegression, then for GradientBoostingClassifier. Compare CV accuracy.
Build a one-row DataFrame for "a 10-year-old girl in first class, fare 80, embarked C" and predict her survival probability.
Hint
new = pd.DataFrame([{
    "pclass": 1, "sex": "female", "age": 10,
    "sibsp": 0, "parch": 1, "fare": 80, "embarked": "C",
}])
print(model.predict_proba(new).round(2))
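For the family_size extension, the new column is plain column arithmetic. Here is a minimal sketch on hypothetical toy rows; the `is_alone` flag is one possible derived feature and is an assumption about how you might define it.

```python
import pandas as pd

# Hypothetical toy rows; on the real data, apply the same lines to a copy of X.
toy = pd.DataFrame({"sibsp": [0, 1, 3], "parch": [0, 2, 1]})

# The passenger, plus siblings/spouses aboard, plus parents/children aboard.
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1

# Assumed definition: alone means no relatives aboard, so family_size is 1.
toy["is_alone"] = (toy["family_size"] == 1).astype(int)
print(toy)
```

Remember to add any new column name to num_cols (or cat_cols). A ColumnTransformer drops columns that are not listed in any of its branches by default, so an unlisted feature would never reach the model.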
Stretch · Beat 83%
8 min
Try to push CV accuracy past 83%: engineer features (title from name, is_alone, fare bins), tune the forest with GridSearchCV, or try gradient boosting. Document what helped and what didn't.
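Tuning a model that sits inside a pipeline uses parameter names of the form step name, double underscore, parameter name. The sketch below shows that pattern with GridSearchCV. It runs on synthetic data from `make_classification` so it works offline, and the grid values are arbitrary starting points, not recommended settings. On the project, pass the `model` pipeline and the real `X`, `y` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for this sketch, not the Titanic set).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline([
    ("prep", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# "clf__max_depth" means: the max_depth parameter of the step named "clf".
grid = {
    "clf__n_estimators": [50, 100],
    "clf__max_depth": [3, None],
}

# Each of the 4 combinations is scored with 5-fold cross-validation.
search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"best CV accuracy: {search.best_score_:.1%}")
```

Because the best score is picked from several tries on the same folds, it tends to be slightly optimistic. Keeping a held-out test set, as titanic.py does, gives a fairer final number.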
Recap
3 min
You ran a complete ML project: explore → prep (impute, scale, encode in a ColumnTransformer) → train (cross-validated forest) → evaluate (report + importances) → interpret. The model recovered the real history. This six-step shape is every supervised project you'll ever do. Next: the same shape, for predicting a number.
Homework
4 min
Complete the Titanic project and write a one-page report: your CV accuracy, the classification report, the top factors, and a paragraph telling the survival story the data reveals (and one caveat about reading too much into it).
The deliverable is titanic.py + the report. A strong report notes the "women and children first" + class effects AND warns that the model just reflects 1912 society, not a rule about who "deserves" rescue.