Learning Goals
3 min
- Install scikit-learn; load a built-in dataset.
- Follow the universal recipe: model.fit(X, y) → model.predict(X_new) → model.score.
- Understand why we split data into train and test sets.
- Swap one model for another by changing a single line.
Warm-Up · The Recipe
5 min
    pip install scikit-learn
Every supervised model in scikit-learn:
    model = SomeModel()               # 1. choose
    model.fit(X_train, y_train)       # 2. learn from training data
    preds = model.predict(X_test)     # 3. predict on new data
    model.score(X_test, y_test)       # 4. how good?
scikit-learn gives every algorithm the same interface. Learn fit / predict / score once and KNN, trees, forests, regression all work identically. The hard part is the data, not the API.
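Because every estimator exposes the same methods, a helper can accept any model without knowing which algorithm it holds. The sketch below uses a hypothetical helper name, `evaluate`, which is not part of scikit-learn; it relies only on `fit` and `score`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X_train, X_test, y_train, y_test):
    """Hypothetical helper: fit any scikit-learn estimator, return its test score."""
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


X, y = load_iris(return_X_y=True)
# train_test_split returns [X_train, X_test, y_train, y_test], in that order.
splits = train_test_split(X, y, test_size=0.2, random_state=42)

# The helper never checks the model type; it only calls fit and score.
knn_acc = evaluate(KNeighborsClassifier(n_neighbors=3), *splits)
tree_acc = evaluate(DecisionTreeClassifier(random_state=42), *splits)
print(f"KNN:  {knn_acc:.1%}")
print(f"Tree: {tree_acc:.1%}")
```

The same call works unchanged for any other classifier that follows the interface.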
New Concept · fit, predict, score
14 min
Load data
    from sklearn.datasets import load_iris

    iris = load_iris(as_frame=True)
    X = iris.data      # 150 × 4 features
    y = iris.target    # 0, 1, 2: three species
    print(X.shape, y.shape)
Split into train and test
Never test on data the model trained on — it would just recite memorised answers. Hold some back:
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    # 80% to learn from, 20% kept secret for the exam
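To see what the split produces and why it matters, the sketch below checks the sizes of the two parts and then fits a 1-nearest-neighbour classifier (the KNN model introduced in the next step). With k=1 each training point is its own nearest neighbour, so on iris the training score comes out perfect. That score is a recital of memorised answers; only the test score says something about unseen flowers. The variable name `memoriser` is chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)

# k=1 memorises the training set: every training point is its own neighbour.
memoriser = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
train_acc = memoriser.score(X_train, y_train)  # graded on questions it has seen
test_acc = memoriser.score(X_test, y_test)     # graded on the held-back exam
print(f"train accuracy: {train_acc:.1%}")
print(f"test accuracy:  {test_acc:.1%}")
```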
fit — learn the pattern
    from sklearn.neighbors import KNeighborsClassifier

    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X_train, y_train)
predict — apply it to new data
    preds = model.predict(X_test)
    print(preds[:5])            # e.g. [1 0 2 1 1]
    print(y_test[:5].values)    # the true answers
score — measure accuracy
    acc = model.score(X_test, y_test)
    print(f"accuracy: {acc:.1%}")    # e.g. 96.7%
For classifiers, score returns accuracy (fraction correct). For regressors, it returns R² (Lesson 15). Same method, sensible default per task.
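Both defaults can be checked by hand. The sketch below recomputes accuracy as the fraction of matching predictions and compares a regressor's score with `r2_score`. The diabetes dataset and `LinearRegression` are stand-ins chosen for illustration, not part of this lesson's recipe.

```python
import numpy as np
from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Classifier: score is the fraction of correct predictions.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
preds = clf.predict(X_test)

clf_score = clf.score(X_test, y_test)
manual_acc = np.mean(preds == y_test)        # fraction of matches, by hand
library_acc = accuracy_score(y_test, preds)  # the same number from sklearn.metrics
print(np.isclose(clf_score, manual_acc), np.isclose(clf_score, library_acc))

# Regressor: the same method returns R² instead.
Xr, yr = load_diabetes(return_X_y=True)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    Xr, yr, test_size=0.2, random_state=42)
reg = LinearRegression().fit(Xr_train, yr_train)

reg_score = reg.score(Xr_test, yr_test)
manual_r2 = r2_score(yr_test, reg.predict(Xr_test))
print(np.isclose(reg_score, manual_r2))
```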
Swapping models — one line
    from sklearn.tree import DecisionTreeClassifier

    model = DecisionTreeClassifier()    # only this line changed
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))
Worked Example · Compare Three Models
12 min
    # compare.py: same data, three models, one loop
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    models = {
        "KNN (k=3)": KNeighborsClassifier(n_neighbors=3),
        "Decision Tree": DecisionTreeClassifier(random_state=42),
        "Logistic Reg": LogisticRegression(max_iter=200),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        acc = model.score(X_test, y_test)
        print(f" {name:<16} {acc:.1%}")
Sample output
     KNN (k=3)        96.7%
     Decision Tree    100.0%
     Logistic Reg     100.0%
Read the diff
Three completely different algorithms, identical code — only the constructor changed. That's the power of scikit-learn's uniform interface. (Don't over-read the 100%: iris is famously easy, and the test set is tiny. Lesson 10 shows how to measure more honestly.)
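One way to see how much a 30-flower test set can swing the result is to repeat the split with different seeds. The sketch below is an illustration only (the seed range 0 to 9 is an arbitrary choice) and is not the more careful method that Lesson 10 introduces.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = []
for seed in range(10):  # arbitrary seeds, chosen for illustration
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

# With 30 test flowers, each single mistake moves accuracy by about 3.3 points.
print(f"min {min(scores):.1%}  max {max(scores):.1%}")
```

If the minimum and maximum differ, the gap comes from the split alone, since the model and its settings never changed.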
Try It Yourself
13 min
Load iris, split, fit a KNN, print the test accuracy.
After fitting, predict the species of a flower with measurements [5.1, 3.5, 1.4, 0.2]. Map the number back to a species name.
Hint
    import numpy as np

    names = load_iris().target_names
    p = model.predict(np.array([[5.1, 3.5, 1.4, 0.2]]))
    print(names[p[0]])    # 'setosa'
Loop n_neighbors from 1 to 20, record test accuracy for each, and report the best k.
Hint
    best_k, best_acc = None, 0
    for k in range(1, 21):
        m = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        acc = m.score(X_test, y_test)
        if acc > best_acc:
            best_k, best_acc = k, acc
    print(f"best k = {best_k} ({best_acc:.1%})")
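A note on the flower-prediction exercise: if the model was fitted on a DataFrame (as with load_iris(as_frame=True)), recent scikit-learn versions warn when predict receives a bare array without column names. The sketch below shows one way around that, passing a one-row DataFrame that reuses the training column names. Fitting on the full dataset here is a simplification to keep the example short.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris(as_frame=True)
model = KNeighborsClassifier(n_neighbors=3).fit(iris.data, iris.target)

# One row, same column names as the training data.
flower = pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=iris.data.columns)
label = model.predict(flower)[0]
species = iris.target_names[label]   # map the class number back to a name
print(species)  # setosa
```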
Mini-Challenge · The Model Leaderboard
8 min
Build a function leaderboard(X, y, models) that splits the data, fits each model, and prints a sorted accuracy table. Test it on the wine dataset (load_wine) with four different models.
Show one possible solution
    from sklearn.datasets import load_wine
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    def leaderboard(X, y, models):
        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)
        rows = []
        for name, m in models.items():
            m.fit(Xtr, ytr)
            rows.append((name, m.score(Xte, yte)))
        for name, acc in sorted(rows, key=lambda r: -r[1]):
            print(f" {name:<18} {acc:.1%}")

    X, y = load_wine(return_X_y=True)
    leaderboard(X, y, {
        "KNN": KNeighborsClassifier(),
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "Random Forest": RandomForestClassifier(random_state=0),
        "Logistic Reg": LogisticRegression(max_iter=5000),
    })
Non-negotiables: one split shared across all models, sorted output. Notice Random Forest usually tops the board — we'll see why in Lesson 14.
Recap
3 min
The universal recipe: fit(X_train, y_train), predict(X_test), score(X_test, y_test). Always split train/test so you measure generalisation, not memorisation. Swapping algorithms is a one-line change. Tomorrow we make the split smarter with cross-validation.
Vocabulary Card
- fit: Train the model; learn parameters from training data.
- predict: Apply the trained model to new inputs.
- score: Default metric (accuracy for classifiers, R² for regressors).
- train/test split: Holding back data the model never saw, to measure real performance.
Homework
4 min
Use your own prepped dataset (or load_breast_cancer). Run the leaderboard with at least three models. Write down: which model won, the accuracy, and one reason the others might have lost.
    # Reuses leaderboard() and the model imports from the Mini-Challenge solution.
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True)
    leaderboard(X, y, {
        "KNN": KNeighborsClassifier(),
        "Random Forest": RandomForestClassifier(random_state=0),
        "Logistic Reg": LogisticRegression(max_iter=5000),
    })
    # Typically Logistic Reg / RF ~96%, KNN lower because it's
    # sensitive to unscaled features (fix that in Lesson 18).
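The remark about unscaled features can be previewed with a pipeline. This is a sketch ahead of Lesson 18 and assumes `StandardScaler` as the scaling step (the lesson itself may use a different one). It compares KNN with and without scaling on the same split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

raw_knn = KNeighborsClassifier().fit(X_train, y_train)

# The pipeline learns the scaling from the training data only, then applies
# the same scaling to the test data inside score().
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_knn.fit(X_train, y_train)

raw_acc = raw_knn.score(X_test, y_test)
scaled_acc = scaled_knn.score(X_test, y_test)
print(f"KNN, raw features:    {raw_acc:.1%}")
print(f"KNN, scaled features: {scaled_acc:.1%}")
```

Because the pipeline exposes fit and score like any other estimator, it can also be passed to leaderboard() as one more entry.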