AI-L1-12 · Good Data, Bad Data

Learning Goals

5 min

By the end of this lesson you can:

Tell good training data from bad.
Name three data problems: too little, unbalanced, wrongly labelled.
Explain how bad data leads to a biased, unfair AI.

Warm-Up · Spot the Wrong One

8 min

Last lesson we designed clean categories. Now we check the examples inside them.

Here are photos labelled "cat": a tabby cat, a black cat, a dog, a ginger cat. Which label is wrong?

Reveal the answer

The dog. A wrong label like this teaches the AI a mistake — it might start calling some dogs "cat".

New Concept · Garbage In, Garbage Out

18 min

An AI learns only from its data. If the data is poor, the AI will be poor too. Computer scientists call this "garbage in, garbage out".

Three common data problems

Too little — only a few examples, so the pattern is weak.
Unbalanced — far more of one kind than another, so the AI favours it.
Wrong labels — examples in the wrong box, teaching mistakes.
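The first two problems can be spotted just by counting. Here is a minimal Python sketch of that idea. The function name `check_counts` and the two thresholds (at least 50 examples per label, no label more than 3 times bigger than another) are assumptions made up for this example, not fixed rules.

```python
from collections import Counter

def check_counts(labels, min_per_label=50, max_ratio=3.0):
    """Return warnings about label counts. Thresholds are illustrative guesses."""
    counts = Counter(labels)
    if not counts:
        return []
    warnings = []
    # Problem 1: too little data for a label
    for label, n in counts.items():
        if n < min_per_label:
            warnings.append(f"too little: only {n} examples of '{label}'")
    # Problem 2: one label is much bigger than another
    biggest = max(counts.values())
    smallest = min(counts.values())
    if biggest > max_ratio * smallest:
        warnings.append(
            f"unbalanced: largest label has {biggest}, smallest has {smallest}"
        )
    return warnings

# Made-up example: lots of cats, very few dogs
labels = ["cat"] * 120 + ["dog"] * 10
for warning in check_counts(labels):
    print(warning)
```

Counting cannot find the third problem. To catch a wrong label, a person usually has to look at the examples.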

From bad data to bias

Bias is unfairness in an AI's answers. It often comes straight from unfair data.

If a face model is trained mostly on one group of people, it tends to work worse for everyone else. This happens not because anyone meant harm, but because the data was not balanced.
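One way to notice this kind of bias is to measure how often the model is right for each group separately, instead of looking at one overall score. The sketch below assumes we already have a list of test results. The group names and the numbers are made up for illustration, and `accuracy_by_group` is a hypothetical helper.

```python
def accuracy_by_group(results):
    """results is a list of (group, was_correct) pairs.
    Returns a dict mapping each group to the fraction it got right."""
    totals = {}
    correct = {}
    for group, ok in results:
        totals[group] = totals.get(group, 0) + 1
        correct[group] = correct.get(group, 0) + (1 if ok else 0)
    return {group: correct[group] / totals[group] for group in totals}

# Made-up test results: 10 photos from each group
results = (
    [("group A", True)] * 9 + [("group A", False)] * 1
    + [("group B", True)] * 6 + [("group B", False)] * 4
)
print(accuracy_by_group(results))  # {'group A': 0.9, 'group B': 0.6}
```

A large gap between groups, like the one in this toy example, is a sign to go back and check whether the training data was balanced.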

Why it matters

Biased AI can be unfair to real people. Checking your data for balance and correct labels is part of building AI responsibly.

Worked Example · The Sneaky Shortcut

18 min

Arjun trains a "cat vs dog" model. Look at his data and predict the failures.

Label	What he collected
cat	200 photos — all taken indoors
dog	200 photos — all taken outdoors in parks

The amounts look balanced. But there's a sneaky problem.

The model might learn a shortcut: "indoors = cat, grass = dog".
Show it a dog indoors and it may say "cat".
Show it a cat in a park and it may say "dog".

On top of that, even one mislabelled photo (a dog tagged "cat") would quietly teach the model another mistake.

The takeaway

Balanced numbers aren't enough. The variety must be balanced too — cats and dogs both indoors and outdoors.
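A quick way to catch a shortcut like Arjun's is to count every label together with where the photo was taken, then look for combinations that never appear. This sketch assumes each photo has been tagged with a setting ("indoors" or "outdoors"). The lesson's table describes the settings, but the tags and the helper `missing_combinations` are assumptions for this example.

```python
from collections import Counter

def missing_combinations(photos):
    """photos is a list of (label, setting) pairs.
    Returns every label/setting pair that has zero photos."""
    counts = Counter(photos)
    labels = sorted({label for label, _ in photos})
    settings = sorted({setting for _, setting in photos})
    return [
        (label, setting)
        for label in labels
        for setting in settings
        if counts[(label, setting)] == 0
    ]

# Arjun's data as described in the table above
photos = [("cat", "indoors")] * 200 + [("dog", "outdoors")] * 200
print(missing_combinations(photos))
# [('cat', 'outdoors'), ('dog', 'indoors')]
```

The two missing pairs are exactly the photos Arjun still needs to collect.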

Try It Yourself

20 min

Use your worksheet.

01 🟢 Find the bad examples

A set labelled "apple" contains: a red apple, a green apple, a tomato, an apple slice. Which examples are bad, and why?

Hint

One is not an apple at all. A wrong label like this teaches the AI a mistake.

02 🟡 Fix the bias

A "smiling face" model was trained only on photos of adults. Name who it might work worse for, and how you would fix the data.

Hint

It may struggle with children's faces — add balanced examples across ages.

Mini-Challenge · Audit a Dataset

12 min

Be a data auditor. Here's a plan for a Malaysian-food classifier (from Lesson 6): 100 nasi lemak photos, 100 roti canai photos, but only 8 satay photos — all from one stall.

List three problems with this data and a fix for each.

Your audit works if your three fixes would make the data more balanced, more varied, and correctly labelled.

Show one good audit

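Unbalanced: only 8 satay photos vs 100 each of the other two dishes. Fix: collect about 100 satay photos too.
No variety: all the satay photos come from one stall. Fix: take photos at many stalls, on different plates and in different lighting.
Unchecked labels: nobody has confirmed the tags are right. Fix: check every photo's label before training.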
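Part of this audit can be written as a small program. The sketch below stores the plan from the challenge in a dictionary and reports the balance and variety problems. The `stalls` field, the `max_ratio` threshold of 3, and the `audit` function itself are assumptions for illustration. The plan does not say where the nasi lemak and roti canai photos come from, so that field is left out for them.

```python
def audit(plan, max_ratio=3.0):
    """Return a list of problems found in a dataset plan."""
    problems = []
    biggest = max(info["photos"] for info in plan.values())
    for dish, info in plan.items():
        # Balance check: is this dish far smaller than the biggest one?
        if biggest > max_ratio * info["photos"]:
            problems.append(
                f"{dish}: unbalanced ({info['photos']} photos vs {biggest})"
            )
        # Variety check: only possible when we know the number of stalls
        if info.get("stalls") == 1:
            problems.append(f"{dish}: no variety (all from one stall)")
    return problems

plan = {
    "nasi lemak": {"photos": 100},
    "roti canai": {"photos": 100},
    "satay": {"photos": 8, "stalls": 1},
}
for problem in audit(plan):
    print(problem)
```

The program cannot tell whether a photo is tagged correctly, so the label check still needs a person.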

Recap

5 min

An AI is only as good as its data. Too little, unbalanced, or wrongly labelled data makes a weak, biased AI: garbage in, garbage out. Checking data for balance and correct labels is part of building AI responsibly.

Vocabulary Card

bias: Unfairness in an AI's answers, often caused by unbalanced data.
garbage in, garbage out: If you train on poor data, you get a poor AI.
balanced data: Data with similar amounts of varied examples for every category the AI must handle.

Homework · Spot Two Unfair Spots

≤ 20 min

Take a dataset plan — your own from Lesson 6, or this one: "a model to recognise school shoes, trained only on brand-new white shoes". List two ways it could be unfair or fail, and a fix for each.

Sample · School-Shoe Model