Learning Goals
3 min
- Define feature, label, sample, feature matrix X, target vector y.
- Reshape any prediction problem into an X / y table.
- Know the convention: X is 2-D (samples × features), y is 1-D.
- Spot "leaky" features that secretly contain the answer.
Warm-Up · The Universal Table
5 min
←──────── features (X) ────────→   label (y)
┌─────────┬────────┬───────────┬───┬──────────┐
│ length  │ weight │ has_whisk │ … │ animal   │
├─────────┼────────┼───────────┼───┼──────────┤
│ 45      │ 4.0    │ yes       │ … │ cat      │  ← one sample
│ 70      │ 25.0   │ no        │ … │ dog      │
│ 40      │ 3.5    │ yes       │ … │ cat      │
└─────────┴────────┴───────────┴───┴──────────┘
One sample is a row. Its features are the input columns (X). Its label is the answer column (y). Training = "learn the function f so that f(X) ≈ y". Every supervised problem reduces to this.
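To make "learn the function f so that f(X) ≈ y" concrete, here is a minimal sketch in which f is written by hand instead of learned. The 10 kg threshold and the name `f` are assumptions invented for illustration; a trained model would find such a rule from the data.

```python
import pandas as pd

# The three samples from the table above (has_whisk stored as 1/0).
df = pd.DataFrame({
    "length": [45, 70, 40],
    "weight": [4.0, 25.0, 3.5],
    "has_whisk": [1, 0, 1],
    "animal": ["cat", "dog", "cat"],
})

X = df[["length", "weight", "has_whisk"]]  # features
y = df["animal"]                           # label


def f(row):
    """Hand-written stand-in for a trained model (hypothetical rule)."""
    return "dog" if row["weight"] > 10 else "cat"


predictions = X.apply(f, axis=1)       # one prediction per sample (row)
accuracy = (predictions == y).mean()   # fraction of rows where f(X) matches y
print(predictions.tolist())
print("fraction correct:", accuracy)
```

Training replaces the hand-written rule with one chosen automatically so that the predictions match y as closely as possible.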
New Concept · X and y
14 min
The shapes everyone agrees on
X : 2-D, shape (n_samples, n_features)   the clues
y : 1-D, shape (n_samples,)              the answers
scikit-learn (Lessons 9+) expects exactly this. Get the shapes right and many of the most common beginner errors disappear.
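A common way to get the shape wrong is to select a single feature as a 1-D Series. The sketch below, on made-up lengths, shows the difference between single and double brackets in pandas, and the equivalent fix for a plain NumPy array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"length": [45, 70, 40], "animal": ["cat", "dog", "cat"]})

wrong_X = df["length"]      # single brackets: a Series, 1-D
right_X = df[["length"]]    # double brackets: a DataFrame, 2-D
print(wrong_X.shape)        # (3,)
print(right_X.shape)        # (3, 1)

# The same fix for a plain NumPy array: reshape to one column.
arr = np.array([45, 70, 40])
arr_2d = arr.reshape(-1, 1)  # -1 lets NumPy infer the number of rows
print(arr_2d.shape)          # (3, 1)
```

Even with one feature, X keeps two dimensions: three samples by one feature.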
Building X and y from a DataFrame
import pandas as pd

df = pd.DataFrame({
    "length": [45, 70, 40, 60],
    "weight": [4.0, 25.0, 3.5, 18.0],
    "animal": ["cat", "dog", "cat", "dog"],
})

X = df[["length", "weight"]]   # features: a DataFrame (2-D)
y = df["animal"]               # label: a Series (1-D)

print(X.shape)   # (4, 2)
print(y.shape)   # (4,)
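When a table has many feature columns, listing them all gets tedious. One alternative, sketched here on the same data, is to drop the label column and keep everything else; `label_col` is just an illustrative variable name.

```python
import pandas as pd

df = pd.DataFrame({
    "length": [45, 70, 40, 60],
    "weight": [4.0, 25.0, 3.5, 18.0],
    "animal": ["cat", "dog", "cat", "dog"],
})

label_col = "animal"               # illustrative name for the label column
X = df.drop(columns=[label_col])   # every column except the label
y = df[label_col]

print(list(X.columns))   # ['length', 'weight']
print(X.shape, y.shape)  # (4, 2) (4,)
```

This shortcut is only safe if every remaining column is an honest feature; any column that gives away the answer still has to be dropped by hand.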
Feature types
numeric       45, 4.0, 12.5          use as-is
categorical   "cat", "red", "KL"     must be encoded to numbers (L5-18)
ordinal       small<medium<large     encode preserving order
text          "great product!"       vectorise (L5-35)
image         pixel grid             flatten or use a CNN (L5-21+)
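pandas can give a first hint about feature types through column dtypes. This is a rough sketch with made-up rows: dtypes cannot tell ordinal from plain categorical, so the size order below is supplied by hand as an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "length": [45, 70, 40],                  # numeric
    "colour": ["red", "black", "red"],       # categorical
    "size": ["small", "large", "medium"],    # ordinal
})

numeric_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = df.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric_cols)
print("needs encoding:", other_cols)

# Ordinal: encode by hand so that small < medium < large is preserved.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_code"] = df["size"].map(size_order)
print(df["size_code"].tolist())  # [0, 2, 1]
```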
Choosing features — the art
- More features isn't always better — irrelevant ones add noise.
- A good feature varies with the label. "Weight" helps tell cats from dogs; "name length" doesn't.
- Domain knowledge beats raw column count.
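One quick, informal way to check whether a feature varies with the label is to compare its average per class. The sketch below uses invented numbers, and `name_length` is a hypothetical column added only to contrast with `weight`.

```python
import pandas as pd

df = pd.DataFrame({
    "weight": [4.0, 25.0, 3.5, 18.0],
    "name_length": [5, 4, 4, 5],   # hypothetical, unrelated to the label
    "animal": ["cat", "dog", "cat", "dog"],
})

# Mean of each feature within each class.
class_means = df.groupby("animal")[["weight", "name_length"]].mean()
print(class_means)

# Gap between the class means: large for weight, zero for name_length.
gap = (class_means.loc["dog"] - class_means.loc["cat"]).abs()
print(gap)
```

With only four rows this is an illustration rather than evidence, but the same comparison on a real dataset is a reasonable first look before deciding which columns to keep.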
The deadly trap: data leakage
A leaky feature secretly contains the answer. Example: predicting "did this patient have surgery?" using a column "surgery_recovery_days". The model can score close to 100% accuracy in training yet is useless in real life, because that column doesn't exist before the surgery.
Rule of thumb: a feature is legal only if it would be available at PREDICTION time, before the label is known.
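The rule of thumb can be applied mechanically once you have written down which columns exist at prediction time. In this sketch the patient rows, the columns other than surgery_recovery_days, and the `available_at_prediction` list are all invented for illustration.

```python
import pandas as pd

patients = pd.DataFrame({
    "age": [34, 61, 47, 29],
    "pain_score": [2, 8, 7, 3],
    "surgery_recovery_days": [0, 14, 10, 0],  # only known AFTER surgery
    "had_surgery": [0, 1, 1, 0],              # label
})

# Columns we would really have before the label is known (assumption).
available_at_prediction = ["age", "pain_score"]

label = "had_surgery"
leaky = [c for c in patients.columns
         if c != label and c not in available_at_prediction]
print("dropped as leaky:", leaky)

X = patients[available_at_prediction]
y = patients[label]

# Why the leaky column looks so good: it reproduces the label exactly.
giveaway = (patients["surgery_recovery_days"] > 0).astype(int)
print((giveaway == y).all())
```

Deciding what belongs in `available_at_prediction` is a question about how the data is collected; code cannot answer it for you.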
Worked Example · Reshape Three Problems
12 min
Problem: "Will a student pass the exam?"
  features (X): hours_studied, attendance%, past_score, sleep_hours
  label (y):    passed (yes/no) → classification
  LEAKY:        "final_grade" (that IS the answer)

Problem: "How much will this used car sell for?"
  features (X): age_years, mileage_km, brand, engine_cc
  label (y):    price_rm → regression
  LEAKY:        "actual_sale_price" (obviously)

Problem: "Which genre is this song?"
  features (X): tempo, loudness, danceability, duration
  label (y):    genre → classification
import pandas as pd

students = pd.DataFrame({
    "hours":  [1, 5, 3, 8, 2, 6],
    "attend": [60, 95, 70, 99, 50, 90],
    "past":   [45, 80, 60, 88, 40, 75],
    "passed": [0, 1, 0, 1, 0, 1],   # 0 = fail, 1 = pass
})

X = students[["hours", "attend", "past"]]
y = students["passed"]

print("X shape:", X.shape)   # (6, 3)
print("y shape:", y.shape)   # (6,)
print("\nfeatures:\n", X.head(2))
print("\nlabels:", y.tolist())
Read the diff
The label is encoded as 0/1 — models want numbers, not "pass"/"fail" strings (we'll automate that in L5-18). X is the three honest features that exist before the exam. No leakage. This is now ready to feed a classifier.
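If your raw data stores the label as text, a dictionary with `Series.map` is one manual way to get a 0/1 encoding. The `result` column and its rows here are a made-up example.

```python
import pandas as pd

raw = pd.DataFrame({
    "hours": [1, 5, 3, 8],
    "result": ["fail", "pass", "fail", "pass"],  # text label
})

# Manual encoding: 0 = fail, 1 = pass.
raw["passed"] = raw["result"].map({"fail": 0, "pass": 1})

X = raw[["hours"]]
y = raw["passed"]
print(y.tolist())  # [0, 1, 0, 1]

# Any value missing from the dictionary becomes NaN, so check for gaps.
print("unmapped labels:", int(y.isna().sum()))
```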
Try It Yourself
13 min
From any CSV you have, choose a label column and build X and y. Print both shapes.
You want to predict "will this customer churn next month?". Given columns signup_date, plan, monthly_spend, support_tickets, cancellation_date — which feature is leaky and why?
Answer
cancellation_date only exists after the customer churned — it IS the label in disguise. Drop it from X.
You want to predict a YouTube video's view count in its first week. List 5 honest features available at upload time and 2 leaky ones to avoid.
Example answer
Honest: title length, thumbnail brightness, channel subscriber count, upload hour, video duration. Leaky: total likes (accumulates after upload), comments count (same), "is_trending" (a consequence of views).
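For the churn exercise, the fix can be sketched as follows. The rows are invented, and deriving the label from whether cancellation_date is filled in is an assumption about how such a table might be labelled.

```python
import pandas as pd

customers = pd.DataFrame({
    "signup_date": ["2023-01-10", "2023-03-02", "2023-05-20", "2023-06-01"],
    "plan": ["basic", "pro", "basic", "pro"],
    "monthly_spend": [20.0, 55.0, 18.0, 60.0],
    "support_tickets": [5, 0, 3, 1],
    "cancellation_date": ["2023-09-01", None, "2023-10-15", None],
})

# Assumed labelling: a filled-in cancellation date means the customer churned.
y = customers["cancellation_date"].notna().astype(int)

# cancellation_date only exists after churn, so it must not appear in X.
X = customers.drop(columns=["cancellation_date"])

print(y.tolist())        # [1, 0, 1, 0]
print(list(X.columns))
```

signup_date and plan are still text at this point and would need encoding before any model could use them.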
Mini-Challenge · Build a Mini Dataset
8 min
Invent a prediction problem from your own life (e.g., "will I enjoy this movie?"). Build a DataFrame with at least 8 samples, 4 honest features, and 1 label. Split into X / y and confirm the shapes. Note one feature you deliberately excluded because it would leak.
Show one possible solution
# movie_taste.py
import pandas as pd

df = pd.DataFrame({
    "runtime_min": [90, 150, 100, 175, 95, 130, 110, 160],
    "is_animated": [1, 0, 1, 0, 1, 0, 0, 0],
    "imdb_rating": [7.8, 6.2, 8.1, 5.5, 7.0, 8.8, 6.9, 7.4],
    "is_sequel":   [0, 1, 0, 1, 0, 0, 1, 1],
    "i_enjoyed":   [1, 0, 1, 0, 1, 1, 0, 1],   # label
})

X = df[["runtime_min", "is_animated", "imdb_rating", "is_sequel"]]
y = df["i_enjoyed"]

print("X:", X.shape, " y:", y.shape)

# Excluded leaky feature: "my_rating_after_watching".
# That's basically the label, not available before watching.
Non-negotiables: ≥8 samples, ≥4 features, a 0/1 label, correct shapes, one documented excluded leaky feature.
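The countable non-negotiables can be checked in code. `check_xy` below is a hypothetical helper written for this lesson, not part of pandas or scikit-learn; it cannot judge whether a feature is leaky, which still needs human reasoning.

```python
import pandas as pd


def check_xy(X, y, min_samples=8, min_features=4):
    """Hypothetical helper: return a list of problems (empty means OK)."""
    problems = []
    if X.ndim != 2:
        problems.append("X must be 2-D")
    if y.ndim != 1:
        problems.append("y must be 1-D")
    if len(X) != len(y):
        problems.append("X and y need the same number of samples")
    if len(X) < min_samples:
        problems.append(f"need at least {min_samples} samples")
    if X.ndim == 2 and X.shape[1] < min_features:
        problems.append(f"need at least {min_features} features")
    if not set(y.unique()) <= {0, 1}:
        problems.append("label must be 0/1")
    return problems


# The movie data from the sample solution.
df = pd.DataFrame({
    "runtime_min": [90, 150, 100, 175, 95, 130, 110, 160],
    "is_animated": [1, 0, 1, 0, 1, 0, 0, 0],
    "imdb_rating": [7.8, 6.2, 8.1, 5.5, 7.0, 8.8, 6.9, 7.4],
    "is_sequel":   [0, 1, 0, 1, 0, 0, 1, 1],
    "i_enjoyed":   [1, 0, 1, 0, 1, 1, 0, 1],
})
X = df.drop(columns=["i_enjoyed"])
y = df["i_enjoyed"]

print(check_xy(X, y))           # an empty list means every check passed
print(check_xy(X.head(3), y))   # too few samples, and lengths differ
```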
Recap
3 min
Every supervised problem is a table: features X (2-D, samples × features) and label y (1-D). Pick features that vary with the label and are available before prediction time. Encode categories and labels as numbers. Beware leaky features that secretly contain the answer: they make training look perfect and production fail.
Vocabulary Card
- feature
  An input column, a clue the model uses. Collectively the matrix X.
- label / target
  The answer column the model predicts, the vector y.
- sample
  One row: one example with its features and label.
- data leakage
  A feature that reveals the answer; inflates training scores, ruins real-world performance.
Homework
4 min
Take three different real datasets (or invent them) and for each write: the prediction question, the feature columns, the label column, whether it's classification or regression, and any leaky feature to avoid. Submit as a markdown table.
Question                  Features                     Label          Type            Leaky to avoid
───────────────────────── ──────────────────────────── ────────────── ─────────────── ────────────────────────
Will it rain tomorrow?    humidity, pressure, temp     rain (y/n)     classification  tomorrow's_humidity
Final exam score?         attendance, hw_avg, hours    score (0-100)  regression      mock_exam_2_days_before
Email spam?               word counts, sender domain   spam (y/n)     classification  "user_marked_spam"
Non-negotiables: each row has all five columns filled and a plausible leaky feature.