Learning Goals
5 min
By the end of this lesson you can:
- Tell good training data from bad.
- Name three data problems: too little, unbalanced, wrongly labelled.
- Explain how bad data leads to a biased, unfair AI.
Warm-Up · Spot the Wrong One
8 min
Last lesson we designed clean categories. Now we check the examples inside them.
Here are photos labelled "cat": a tabby cat, a black cat, a dog, a ginger cat. Which label is wrong?
Answer
The dog. A wrong label like this teaches the AI a mistake — it might start calling some dogs "cat".
New Concept · Garbage In, Garbage Out
18 min
An AI learns only from its data. If the data is poor, the AI will be poor too. Computer scientists call this "garbage in, garbage out".
Three common data problems
- Too little — only a few examples, so the pattern is weak.
- Unbalanced — far more of one kind than another, so the AI favours it.
- Wrong labels — examples in the wrong box, teaching mistakes.
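For readers who want to try this in code, the first two problems can be spotted by counting labels. The sketch below is a minimal Python illustration and not part of the worksheet: the function name `audit_label_counts`, the thresholds `min_per_label` and `max_ratio`, and the sample labels are all assumptions made up for this example. Wrong labels cannot be found by counting, so a person still has to look at the examples.

```python
from collections import Counter


def audit_label_counts(labels, min_per_label=20, max_ratio=3.0):
    """Return a list of warnings about label counts.

    min_per_label and max_ratio are illustrative thresholds (assumptions),
    not standard values.
    """
    counts = Counter(labels)
    if not counts:
        return ["too little: no examples at all"]

    warnings = []

    # Too little: any label with fewer examples than the minimum.
    for label, n in sorted(counts.items()):
        if n < min_per_label:
            warnings.append(f"too little: only {n} examples of '{label}'")

    # Unbalanced: the biggest label is much larger than the smallest.
    largest = max(counts.values())
    smallest = min(counts.values())
    if largest / smallest > max_ratio:
        warnings.append(
            f"unbalanced: largest label has {largest}, smallest has {smallest}"
        )
    return warnings


# Made-up label list for illustration: many cats, few dogs.
sample_labels = ["cat"] * 90 + ["dog"] * 10
for warning in audit_label_counts(sample_labels):
    print(warning)
```

With the made-up sample, the sketch reports both a "too little" warning for dogs and an "unbalanced" warning for the whole set.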
From bad data to bias
Bias is unfairness in an AI's answers. It often comes straight from unfair data.
If a face model is trained mostly on one group of people, it tends to work worse for everyone else. This happens not because anyone meant harm, but because the data was not balanced.
Biased AI can be unfair to real people. Checking your data for balance and correct labels is part of building AI responsibly.
Worked Example · The Sneaky Shortcut
18 min
Arjun trains a "cat vs dog" model. Look at his data and predict the failures.
| Label | What he collected |
|---|---|
| cat | 200 photos — all taken indoors |
| dog | 200 photos — all taken outdoors in parks |
The amounts look balanced. But there's a sneaky problem.
- The model might learn a shortcut: "indoors = cat, grass = dog".
- Show it a dog indoors and it may say "cat".
- Show it a cat in a park and it may say "dog".
On top of that, even one mislabelled photo (a dog tagged "cat") would quietly teach the model another mistake.
Balanced numbers aren't enough. The variety must be balanced too — cats and dogs both indoors and outdoors.
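Arjun's hidden problem becomes visible when label and setting are counted together. The sketch below is a minimal Python illustration, assuming each photo has been tagged by hand with a setting of "indoors" or "outdoors". That tag and the function name `find_missing_combinations` are hypothetical additions for this example; the lesson's table only describes the settings in words.

```python
from collections import Counter

# Arjun's data from the table, written as (label, setting) pairs.
# The "setting" tag is an assumption added for this sketch.
photos = [("cat", "indoors")] * 200 + [("dog", "outdoors")] * 200


def find_missing_combinations(photos):
    """Return the (label, setting) pairs that never appear in the data."""
    counts = Counter(photos)
    labels = sorted({label for label, _ in photos})
    settings = sorted({setting for _, setting in photos})
    return [
        (label, setting)
        for label in labels
        for setting in settings
        if counts[(label, setting)] == 0
    ]


missing = find_missing_combinations(photos)
print(missing)  # [('cat', 'outdoors'), ('dog', 'indoors')]
```

Every missing pair is a situation the model has never seen, so it is exactly where the "indoors = cat, grass = dog" shortcut can go wrong.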
Try It Yourself
20 min
Use your worksheet.
A set labelled "apple" contains: a red apple, a green apple, a tomato, an apple slice. Which examples are bad, and why?
Hint
One is not an apple at all. A wrong label like this teaches the AI a mistake directly.
A "smiling face" model was trained only on photos of adults. Name who it might work worse for, and how you would fix the data.
Hint
It may struggle with children's faces — add balanced examples across ages.
Mini-Challenge · Audit a Dataset
12 min
Be a data auditor. Here's a plan for a Malaysian-food classifier (from Lesson 6): 100 nasi lemak photos, 100 roti canai photos, but only 8 satay photos, all from one stall.
List three problems with this data and a fix for each.
Your audit works if your three fixes would make the data more balanced, more varied, and correctly labelled.
One good audit
- Unbalanced: only 8 satay photos vs 100 of each other dish. Fix: collect about 100 satay photos too.
- No variety: all the satay photos come from one stall. Fix: take photos from many stalls, on different plates and in different lighting.
- Unchecked labels: a wrongly tagged photo would teach the AI a mistake. Fix: check every label before training.
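The "collect about 100 satay photos" fix is simple arithmetic, which can be sketched in Python. The counts below come from the plan in this challenge. The function name `photos_still_needed` is hypothetical, and matching the largest category is just one simple way to choose a target. The sketch covers only the balance problem; variety and labels still need a human check.

```python
# Photo counts from the classifier plan in this challenge.
planned_counts = {"nasi lemak": 100, "roti canai": 100, "satay": 8}


def photos_still_needed(counts):
    """For each label, how many more photos would match the largest label."""
    target = max(counts.values())
    return {label: target - n for label, n in counts.items() if n < target}


print(photos_still_needed(planned_counts))  # {'satay': 92}
```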
Recap
5 min
An AI is only as good as its data. Too little, unbalanced, or wrongly labelled data makes a weak, biased AI: garbage in, garbage out. Checking data for balance and correct labels is part of building AI responsibly.
Vocabulary Card
- bias: Unfairness in an AI's answers, often caused by unbalanced data.
- garbage in, garbage out: If you train on poor data, you get a poor AI.
- balanced data: Fair, varied amounts of every category the AI must handle.
Homework · Spot Two Unfair Spots
≤ 20 min
Take a dataset plan, either your own from Lesson 6 or this one: "a model to recognise school shoes, trained only on brand-new white shoes". List two ways it could be unfair or fail, and a fix for each.
Sample · School-Shoe Model
- Only new white shoes — it may fail on scuffed or black shoes. Fix: add worn shoes and other colours.
- One angle only — it may fail on side or top views. Fix: photograph shoes from many angles.
Yours will be different — two real weaknesses with sensible fixes is the goal.