PY-L5-03 · The Data Mindset — Features & Labels

Learning Goals

3 min

Define feature, label, sample, feature matrix X, target vector y.
Reshape any prediction problem into an X / y table.
Know the convention: X is 2-D (samples × features), y is 1-D.
Spot "leaky" features that secretly contain the answer.

Warm-Up · The Universal Table

5 min

  ←──────── features (X) ────────→   label (y)
┌─────────┬────────┬───────────┬───┬──────────┐
│ length  │ weight │ has_whisk │…  │ animal   │
├─────────┼────────┼───────────┼───┼──────────┤
│   45    │   4.0  │    yes    │…  │  cat     │  ← one sample
│   70    │  25.0  │    no     │…  │  dog     │
│   40    │   3.5  │    yes    │…  │  cat     │
└─────────┴────────┴───────────┴───┴──────────┘

Today's big idea

One sample is a row. Its features are the input columns (X). Its label is the answer column (y). Training = "learn the function f so that f(X) ≈ y". Every supervised problem reduces to this.

New Concept · X and y

14 min

The shapes everyone agrees on

X  : 2-D, shape (n_samples, n_features)   the clues
y  : 1-D, shape (n_samples,)              the answers

scikit-learn estimators (Lessons 9+) expect exactly this layout. Getting the shapes right up front avoids many of the most common beginner errors.
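The shape convention can be checked directly. This is a minimal NumPy sketch with made-up numbers. It also shows the usual stumbling block: a single feature stored as a flat array is 1-D and has to be reshaped into one column before it can serve as X.

```python
import numpy as np

# Three samples, two features each: X is 2-D.
X = np.array([[45, 4.0],
              [70, 25.0],
              [40, 3.5]])

# One answer per sample: y is 1-D.
y = np.array(["cat", "dog", "cat"])

print(X.shape)  # (3, 2)
print(y.shape)  # (3,)

# A single feature stored as a flat array is 1-D ...
lengths = np.array([45, 70, 40])
print(lengths.shape)  # (3,)

# ... so reshape it into one column to get a valid X.
X_single = lengths.reshape(-1, 1)
print(X_single.shape)  # (3, 1)
```

The first dimension of X and the length of y must always match: one label per sample.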

Building X and y from a DataFrame

import pandas as pd

df = pd.DataFrame({
    "length": [45, 70, 40, 60],
    "weight": [4.0, 25.0, 3.5, 18.0],
    "animal": ["cat", "dog", "cat", "dog"],
})

X = df[["length", "weight"]]   # features — a DataFrame (2-D)
y = df["animal"]                # label — a Series (1-D)

print(X.shape)   # (4, 2)
print(y.shape)   # (4,)
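The difference between single and double brackets is what decides 1-D versus 2-D in pandas. The sketch below reuses the same table; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "length": [45, 70, 40, 60],
    "weight": [4.0, 25.0, 3.5, 18.0],
    "animal": ["cat", "dog", "cat", "dog"],
})

one_col_series = df["length"]    # single brackets: Series, 1-D
one_col_frame = df[["length"]]   # double brackets: DataFrame, 2-D

print(one_col_series.shape)  # (4,)
print(one_col_frame.shape)   # (4, 1)

# Convert to NumPy arrays when a library wants plain arrays.
X_arr = df[["length", "weight"]].to_numpy()
y_arr = df["animal"].to_numpy()
print(X_arr.shape, y_arr.shape)  # (4, 2) (4,)
```

So even with a single feature, use double brackets for X so that it stays 2-D.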

Feature types

numeric       45, 4.0, 12.5         use as-is
categorical   "cat", "red", "KL"    must be encoded to numbers (L5-18)
ordinal       small<medium<large    encode preserving order
text          "great product!"      vectorise (L5-35)
image         pixel grid            flatten or use a CNN (L5-21+)
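Encoding is covered properly in L5-18. As a preview, here is a rough sketch of the two simplest cases using plain pandas. The column names and the chosen size order are assumptions made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "colour": ["red", "blue", "red"],        # categorical, no natural order
    "size":   ["small", "large", "medium"],  # ordinal, has an order
})

# Ordinal: map each level to an integer that preserves the order.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_code"] = df["size"].map(size_order)

# Categorical: one indicator column per category (one-hot encoding).
encoded = pd.get_dummies(df[["colour", "size_code"]], columns=["colour"])

print(df["size_code"].tolist())
# [0, 2, 1]
print(encoded.columns.tolist())
# ['size_code', 'colour_blue', 'colour_red']
```

Mapping "red" and "blue" to 0 and 1 directly would suggest an order that does not exist, which is why unordered categories get one indicator column each.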

Choosing features — the art

More features aren't always better: irrelevant ones add noise.
A good feature varies with the label. "Weight" helps tell cats from dogs; "name length" doesn't.
Domain knowledge beats raw column count.
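One rough way to see whether a feature "varies with the label" is to compare its average within each label group. This sketch uses made-up values, and name_length is a hypothetical column. Treat it as a quick sanity check, not a feature-selection method.

```python
import pandas as pd

df = pd.DataFrame({
    "weight":      [4.0, 25.0, 3.5, 18.0],
    "name_length": [5, 4, 3, 5],   # hypothetical: length of the pet's name
    "animal":      ["cat", "dog", "cat", "dog"],
})

# Average of each feature within each label group.
group_means = df.groupby("animal")[["weight", "name_length"]].mean()
print(group_means)

# Gap between the group averages (raw units, so only a rough signal).
gap = (group_means.loc["dog"] - group_means.loc["cat"]).abs()
print(gap["weight"])       # 17.75
print(gap["name_length"])  # 0.5
```

Weight separates the two groups clearly; name length barely moves. Raw gaps depend on each feature's units, so compare them with care.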

The deadly trap: data leakage

A leaky feature secretly contains the answer. Example: predicting "did this patient have surgery?" using a column "surgery_recovery_days". A model trained on it can score near 100% in training yet be useless in real life, because at prediction time, before any surgery, that column has no value yet.

Rule of thumb: a feature is legal only if it would be
available at PREDICTION time, before the label is known.
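To make the surgery example concrete, here is a small sketch with made-up records. Every column name other than surgery_recovery_days is hypothetical. A one-line rule that looks only at the leaky column already scores perfectly, which is the warning sign.

```python
import pandas as pd

# Made-up records for illustration only.
patients = pd.DataFrame({
    "age":                   [34, 61, 47, 52, 29, 70],
    "surgery_recovery_days": [0, 14, 0, 21, 0, 10],
    "had_surgery":           [0, 1, 0, 1, 0, 1],   # label
})

# A one-line "model" that only looks at the leaky column.
leaky_guess = (patients["surgery_recovery_days"] > 0).astype(int)
accuracy = (leaky_guess == patients["had_surgery"]).mean()
print(accuracy)  # 1.0

# Apply the rule of thumb: keep only columns known before the label exists.
available_before_label = ["age"]
X = patients[available_before_label]
y = patients["had_surgery"]
print(X.shape, y.shape)  # (6, 1) (6,)
```

A suspiciously perfect score from a single column is a good reason to ask when that column's value becomes known.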

Worked Example · Reshape Three Problems

12 min

Problem: "Will a student pass the exam?"
  features (X): hours_studied, attendance%, past_score, sleep_hours
  label    (y): passed  (yes/no)  → classification
  LEAKY:        "final_grade" — that IS the answer

Problem: "How much will this used car sell for?"
  features (X): age_years, mileage_km, brand, engine_cc
  label    (y): price_rm  → regression
  LEAKY:        "actual_sale_price" — obviously

Problem: "Which genre is this song?"
  features (X): tempo, loudness, danceability, duration
  label    (y): genre  → classification

import pandas as pd

students = pd.DataFrame({
    "hours":     [1, 5, 3, 8, 2, 6],
    "attend":    [60, 95, 70, 99, 50, 90],
    "past":      [45, 80, 60, 88, 40, 75],
    "passed":    [0, 1, 0, 1, 0, 1],   # 0 = fail, 1 = pass
})

X = students[["hours", "attend", "past"]]
y = students["passed"]

print("X shape:", X.shape)   # (6, 3)
print("y shape:", y.shape)   # (6,)
print("\nfeatures:\n", X.head(2))
print("\nlabels:", y.tolist())

Read the diff

The label is encoded as 0/1 because models want numbers, not "pass"/"fail" strings (we'll automate that in L5-18). X holds three honest features that exist before the exam; sleep_hours from the problem statement is omitted here. No leakage. This is now ready to feed a classifier.
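If the label arrives as strings, a manual mapping is enough for now. This is a minimal sketch that assumes the raw values are exactly "pass" and "fail"; L5-18 covers the automated approach.

```python
import pandas as pd

raw = pd.Series(["fail", "pass", "fail", "pass", "fail", "pass"], name="passed")

# Explicit mapping so it is clear which class is 1.
label_codes = {"fail": 0, "pass": 1}
y = raw.map(label_codes)

print(y.tolist())  # [0, 1, 0, 1, 0, 1]

# Any spelling not in the mapping becomes NaN, so check for gaps.
assert y.notna().all(), "unmapped label value found"
```

Writing the mapping out by hand keeps the meaning of 1 unambiguous, which matters later when reading accuracy reports.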

Try It Yourself

13 min

01 🟢 Split X and y

From any CSV you have, choose a label column and build X and y. Print both shapes.
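One possible shape for a solution is sketched below. The CSV text is an in-memory stand-in so the example runs anywhere; with a real file you would pass its path to pd.read_csv instead. The choice of "animal" as the label is an assumption for this toy data.

```python
import io
import pandas as pd

# Stand-in for a file on disk; with a real file use pd.read_csv("your_file.csv").
csv_text = io.StringIO(
    "length,weight,animal\n"
    "45,4.0,cat\n"
    "70,25.0,dog\n"
    "40,3.5,cat\n"
)
df = pd.read_csv(csv_text)

label_col = "animal"                 # the column you chose as the label
X = df.drop(columns=[label_col])     # everything else becomes X
y = df[label_col]

print(X.shape)  # (3, 2)
print(y.shape)  # (3,)
```

Dropping the label from X is the quick route, but check the remaining columns for leaks before treating them all as features.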

02 🟡 Spot the leak

You want to predict "will this customer churn next month?". Given columns signup_date, plan, monthly_spend, support_tickets, cancellation_date — which feature is leaky and why?

Answer

cancellation_date only exists after the customer churned — it IS the label in disguise. Drop it from X.
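In code, the fix is a single drop. The rows below are made up, and "churned" is a hypothetical name for the label column, since the exercise does not name one.

```python
import pandas as pd

# Made-up rows; "churned" is a hypothetical label column.
customers = pd.DataFrame({
    "signup_date":       ["2023-01-10", "2022-06-02", "2023-03-15"],
    "plan":              ["basic", "pro", "basic"],
    "monthly_spend":     [20.0, 55.0, 20.0],
    "support_tickets":   [0, 4, 1],
    "cancellation_date": [None, "2024-02-01", None],
    "churned":           [0, 1, 0],
})

# Drop the label and the leaky column from the feature table.
X = customers.drop(columns=["churned", "cancellation_date"])
y = customers["churned"]

print(X.columns.tolist())
# ['signup_date', 'plan', 'monthly_spend', 'support_tickets']

# Why it leaks: a cancellation date is present exactly when the customer churned.
leak_matches_label = bool((customers["cancellation_date"].notna() == (y == 1)).all())
print(leak_matches_label)  # True
```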

03 🔴 Design a feature set

You want to predict a YouTube video's view count in its first week. List 5 honest features available at upload time and 2 leaky ones to avoid.

Example answer

Honest: title length, thumbnail brightness, channel subscriber count, upload hour, video duration. Leaky: total likes (accumulate after upload), comment count (same), "is_trending" (a consequence of views).

Mini-Challenge · Build a Mini Dataset

8 min

Invent a prediction problem from your own life (e.g., "will I enjoy this movie?"). Build a DataFrame with at least 8 samples, 4 honest features, and 1 label. Split into X / y and confirm the shapes. Note one feature you deliberately excluded because it would leak.

Show one possible solution

# movie_taste.py
import pandas as pd

df = pd.DataFrame({
    "runtime_min":   [90, 150, 100, 175, 95, 130, 110, 160],
    "is_animated":   [1, 0, 1, 0, 1, 0, 0, 0],
    "imdb_rating":   [7.8, 6.2, 8.1, 5.5, 7.0, 8.8, 6.9, 7.4],
    "is_sequel":     [0, 1, 0, 1, 0, 0, 1, 1],
    "i_enjoyed":     [1, 0, 1, 0, 1, 1, 0, 1],   # label
})

X = df[["runtime_min", "is_animated", "imdb_rating", "is_sequel"]]
y = df["i_enjoyed"]
print("X:", X.shape, " y:", y.shape)

# Excluded leaky feature: "my_rating_after_watching" —
# that's basically the label, not available before watching.

Non-negotiables: ≥8 samples, ≥4 features, a 0/1 label, correct shapes, one documented excluded leaky feature.
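The measurable non-negotiables can be checked mechanically. The helper below, check_mini_dataset, is a hypothetical convenience function and not part of pandas. It covers only the countable rules; whether you documented an excluded leaky feature still needs a human reader.

```python
import pandas as pd

def check_mini_dataset(X, y, min_samples=8, min_features=4):
    """Return a list of problems found; an empty list means all checks pass."""
    problems = []
    if X.ndim != 2:
        problems.append("X must be 2-D")
    if y.ndim != 1:
        problems.append("y must be 1-D")
    if len(X) != len(y):
        problems.append("X and y must have the same number of rows")
    if len(X) < min_samples:
        problems.append(f"need at least {min_samples} samples")
    if X.ndim == 2 and X.shape[1] < min_features:
        problems.append(f"need at least {min_features} features")
    if not set(y.unique()) <= {0, 1}:
        problems.append("label must be 0/1")
    return problems

df = pd.DataFrame({
    "runtime_min": [90, 150, 100, 175, 95, 130, 110, 160],
    "is_animated": [1, 0, 1, 0, 1, 0, 0, 0],
    "imdb_rating": [7.8, 6.2, 8.1, 5.5, 7.0, 8.8, 6.9, 7.4],
    "is_sequel":   [0, 1, 0, 1, 0, 0, 1, 1],
    "i_enjoyed":   [1, 0, 1, 0, 1, 1, 0, 1],   # label
})
X = df[["runtime_min", "is_animated", "imdb_rating", "is_sequel"]]
y = df["i_enjoyed"]

print(check_mini_dataset(X, y))                  # []
print(check_mini_dataset(X.head(3), y.head(3)))  # ['need at least 8 samples']
```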

Recap

3 min

Every supervised problem is a table: features X (2-D, samples × features) and label y (1-D). Pick features that vary with the label and are available before prediction time. Encode categories and labels as numbers. Beware leaky features that secretly contain the answer — they make training look perfect and production fail.

Vocabulary Card

feature: An input column — a clue the model uses. Collectively the matrix X.
label / target: The answer column the model predicts — the vector y.
sample: One row — one example with its features and label.
data leakage: When a feature reveals the answer; it inflates training scores and ruins real-world performance.

Homework

4 min

Take three different real datasets (or invent them) and for each write: the prediction question, the feature columns, the label column, whether it's classification or regression, and any leaky feature to avoid. Submit as a markdown table.

Question                   Features                     Label          Type            Leaky to avoid
─────────────────────────  ───────────────────────────  ─────────────  ──────────────  ───────────────────────
Will it rain tomorrow?     humidity, pressure, temp     rain (y/n)     classification  tomorrow's_humidity
Final exam score?          attendance, hw_avg, hours    score (0-100)  regression      mock_exam_2_days_before
Email spam?                word counts, sender domain   spam (y/n)     classification  user_marked_spam

Non-negotiables: each row has all five columns filled and a plausible leaky feature.
