PY-L4-07 · Cleaning Messy Text Data

Cleaning Messy Text Data

Three quotes that prepared every junior developer for the real world: "data is dirty", "data is dirtier than you think", and "it's usually the boss's data that's dirtiest". Today we fix whitespace, mixed case, blank values and inconsistent labels.

⏱ 1 hour🧹 Strings lesson📚 After PY-L4-06💻 VS Code

Learning Goals

3 min

Strip whitespace with strip, lstrip, rstrip.
Normalise case with lower, upper, title.
Replace substrings with replace; collapse with split + join.
Detect missing or empty values defensively.

Warm-Up · Spot the Dirt

5 min

Here's a tiny customer list. Find five problems.

name,email,city
Ahmad ,ahmad@EMAIL.com,KL
mei,  mei@email.com  ,kuala lumpur
SURESH,suresh@email.com,Kuala Lumpur
Aisyah,,K.L.
   Devi   ,devi@email.com,kl

Show the answers

Trailing space after Ahmad .
Mixed case in EMAIL.com.
Leading/trailing whitespace around mei@email.com and Devi .
City has four different spellings: KL, kuala lumpur, Kuala Lumpur, K.L., kl.
Aisyah's email is empty.

Today's big idea

Clean once, near the loader. Don't scatter cleanup logic across the program — fix every field as you read it, then trust the data downstream.

New Concept · The Cleanup Toolbox

14 min

Whitespace

s = "   hello world   "
print(repr(s.strip()))         # → 'hello world'
print(repr(s.lstrip()))        # → 'hello world   '
print(repr(s.rstrip()))        # → '   hello world'

Case

print("KuALa LumPuR".lower())      # → kuala lumpur
print("kuala lumpur".upper())      # → KUALA LUMPUR
print("kuala lumpur".title())      # → Kuala Lumpur

Replace and collapse

s = "K.L."
print(s.replace(".", ""))          # → KL

# Collapse runs of whitespace using split + join
s = "kuala       lumpur"
print(" ".join(s.split()))         # → kuala lumpur

"a b".split() with no argument splits on any run of whitespace AND drops empties — handy for collapsing.

Detecting missing values

val = ""
if not val.strip():
    print("missing")

An empty string is falsy. val.strip() turns "all-whitespace" values into "", which makes them falsy too. not val.strip() means "blank or only whitespace".

A clean function

Bundle the work into one function and call it for every value:

def clean(s):
    s = s.strip().lower()
    s = " ".join(s.split())
    return s.replace(".", "")

print(clean("   K.L.  "))          # → kl
print(clean("kuala  lumpur"))      # → kuala lumpur
print(clean("KuAlA   LuMpUr "))    # → kuala lumpur

Worked Example · Clean the Customer CSV

12 min

Save the dirty data from the warm-up as customers.csv. Now clean it.

# clean_customers.py — normalise messy strings
import csv
from pathlib import Path

CITY_MAP = {
    "kl": "Kuala Lumpur",
    "kuala lumpur": "Kuala Lumpur",
}

def clean_text(s):
    if s is None:
        return ""
    return " ".join(s.strip().split())

def clean_city(s):
    key = clean_text(s).lower().replace(".", "")
    return CITY_MAP.get(key, clean_text(s).title())

with open("customers.csv", newline="") as f:
    rows = list(csv.DictReader(f))

cleaned = []
for r in rows:
    cleaned.append({
        "name":  clean_text(r["name"]).title(),
        "email": clean_text(r["email"]).lower(),
        "city":  clean_city(r["city"]),
    })

# Drop rows missing an email
final = [r for r in cleaned if r["email"]]

print(f"Loaded {len(rows)} rows, kept {len(final)} after cleaning.\n")
for r in final:
    print(f"  {r['name']:<10} {r['email']:<25} {r['city']}")

with open("customers_clean.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames=["name", "email", "city"])
    w.writeheader()
    w.writerows(final)

Sample output

Loaded 5 rows, kept 4 after cleaning.

  Ahmad      ahmad@email.com           Kuala Lumpur
  Mei        mei@email.com             Kuala Lumpur
  Suresh     suresh@email.com          Kuala Lumpur
  Devi       devi@email.com            Kuala Lumpur

Read the diff

Three helpers (clean_text, clean_city, and the missing-email filter) together fix all five problems. The cleanup is in one place — easy to test, easy to extend. That's the "clean once, near the loader" rule.

Try It Yourself

13 min

01 🟢 Normalise phone numbers

Write clean_phone: remove spaces, dashes and parentheses, keep only digits and a leading +.

Hint

def clean_phone(s):
    out = "".join(c for c in s if c.isdigit() or c == "+")
    return out

print(clean_phone("+60 12-345 6789"))   # → +60123456789
print(clean_phone("(03) 8888-7777"))    # → 0388887777

02 🟡 Email validator (light)

Return True if a string has exactly one @, at least one . after the @, and nothing else weird.

Hint

def looks_like_email(s):
    s = s.strip()
    if s.count("@") != 1:
        return False
    user, domain = s.split("@")
    if not user or "." not in domain:
        return False
    return True

print(looks_like_email("a@b.com"))     # → True
print(looks_like_email("nope"))        # → False
print(looks_like_email("a@@b.com"))    # → False

Real email validation is one of the hardest small problems in software. This "light" check catches obvious junk; in production use a library or a regex.

03 🔴 Standardise country names

Build a COUNTRY_MAP that turns "malaysia", "Malaysia", "MY", "mly" into "Malaysia". Make it work for at least three countries with several aliases each.

Hint

COUNTRY_MAP = {}
def add(canon, *aliases):
    for a in aliases:
        COUNTRY_MAP[a.lower()] = canon

add("Malaysia",  "malaysia", "my", "mly", "msia")
add("Singapore", "singapore", "sg", "sgp", "spore")
add("Indonesia", "indonesia", "id", "idn", "indo")

def canonical(s):
    return COUNTRY_MAP.get(s.strip().lower(), s.strip().title())

print(canonical("msia"))      # → Malaysia
print(canonical("Sg"))        # → Singapore
print(canonical("brunei"))    # → Brunei  (unknown → just title-cased)

Mini-Challenge · Address Standardiser

8 min

Build fix_addr.py. It reads addresses (one per line) from addresses.txt and produces a clean version of each:

Strip leading/trailing whitespace.
Collapse multiple spaces to one.
Title-case the words EXCEPT keep state codes (KL, JB, PG) all-caps.
Skip blank lines.

Print the cleaned version side-by-side with the original for visual review.

Show one possible solution

# fix_addr.py
from pathlib import Path

CAPS = {"KL", "JB", "PG", "MY"}

def clean(addr):
    words = " ".join(addr.strip().split()).split(" ")
    out = []
    for w in words:
        if w.upper() in CAPS:
            out.append(w.upper())
        else:
            out.append(w.title())
    return " ".join(out)

for raw in Path("addresses.txt").read_text().splitlines():
    if not raw.strip():
        continue
    print(f"  {raw!r:<40} → {clean(raw)!r}")

Non-negotiables: strip, collapse, title-case, but the abbreviations stay upper. Print both for review.

Recap

3 min

Six tools handle 90% of dirty data: strip, lower/upper/title, replace, split + join, a falsy check for missing values, and a small lookup-map for spelling variants. Wrap them in one clean_* function per column and apply at load time. Downstream code can then trust the data.

Vocabulary Card

strip: Removes leading and trailing whitespace.
title / lower / upper: Casing helpers — title capitalises each word.
split + join: The idiomatic way to collapse runs of whitespace.
lookup map: A dict that turns every spelling variant into the canonical value.

Homework

4 min

Take a friend's contact list (or invent one with 10 deliberately messy rows). Write clean_contacts.py that fixes every common issue: whitespace, name casing, lowercase email, normalised phone (use your function from Try-It #1), missing values filtered out. Save the result to contacts_clean.csv and print a one-line summary of how many rows were dropped.

Sample · clean_contacts.py

# clean_contacts.py
import csv

def clean_text(s):  return " ".join((s or "").strip().split())
def clean_phone(s): return "".join(c for c in (s or "") if c.isdigit() or c == "+")
def looks_like_email(s):
    s = s.strip()
    return s.count("@") == 1 and "." in s.split("@")[-1]

with open("contacts_dirty.csv", newline="") as f:
    rows = list(csv.DictReader(f))

cleaned, dropped = [], 0
for r in rows:
    name  = clean_text(r["name"]).title()
    email = clean_text(r["email"]).lower()
    phone = clean_phone(r["phone"])
    if not name or not looks_like_email(email):
        dropped += 1
        continue
    cleaned.append({"name": name, "email": email, "phone": phone})

with open("contacts_clean.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames=["name", "email", "phone"])
    w.writeheader()
    w.writerows(cleaned)

print(f"Kept {len(cleaned)}, dropped {dropped}.")

Non-negotiables: helpers per column, drop bad rows with a count, save the clean file.

Cleaning Messy Text Data

⏱ 1 hour🧹 Strings lesson📚 After PY-L4-06💻 VS Code

def clean(s): s = s.strip().lower() s = " ".join(s.split()) return s.replace(".", "") print(clean(" K.L. ")) # → kl print(clean("kuala lumpur")) # → kuala lumpur print(clean("KuAlA LuMpUr ")) # → kuala lumpur

# clean_customers.py — normalise messy strings import csv from pathlib import Path CITY_MAP = { "kl": "Kuala Lumpur", "kuala lumpur": "Kuala Lumpur", } def clean_text(s): if s is None: return "" return " ".join(s.strip().split()) def clean_city(s): key = clean_text(s).lower().replace(".", "") return CITY_MAP.get(key, clean_text(s).title()) with open("customers.csv", newline="") as f: rows = list(csv.DictReader(f)) cleaned = [] for r in rows: cleaned.append({ "name": clean_text(r["name"]).title(), "email": clean_text(r["email"]).lower(), "city": clean_city(r["city"]), }) # Drop rows missing an email final = [r for r in cleaned if r["email"]] print(f"Loaded {len(rows)} rows, kept {len(final)} after cleaning.\n") for r in final: print(f" {r['name']:<10} {r['email']:<25} {r['city']}") with open("customers_clean.csv", "w", newline="") as f: w = csv.DictWriter(f, fieldnames=["name", "email", "city"]) w.writeheader() w.writerows(final)

Loaded 5 rows, kept 4 after cleaning. Ahmad ahmad@email.com Kuala Lumpur Mei mei@email.com Kuala Lumpur Suresh suresh@email.com Kuala Lumpur Devi devi@email.com Kuala Lumpur

def clean_phone(s): out = "".join(c for c in s if c.isdigit() or c == "+") return out print(clean_phone("+60 12-345 6789")) # → +60123456789 print(clean_phone("(03) 8888-7777")) # → 0388887777

def looks_like_email(s): s = s.strip() if s.count("@") != 1: return False user, domain = s.split("@") if not user or "." not in domain: return False return True print(looks_like_email("a@b.com")) # → True print(looks_like_email("nope")) # → False print(looks_like_email("a@@b.com")) # → False

COUNTRY_MAP = {} def add(canon, *aliases): for a in aliases: COUNTRY_MAP[a.lower()] = canon add("Malaysia", "malaysia", "my", "mly", "msia") add("Singapore", "singapore", "sg", "sgp", "spore") add("Indonesia", "indonesia", "id", "idn", "indo") def canonical(s): return COUNTRY_MAP.get(s.strip().lower(), s.strip().title()) print(canonical("msia")) # → Malaysia print(canonical("Sg")) # → Singapore print(canonical("brunei")) # → Brunei (unknown → just title-cased)

# fix_addr.py from pathlib import Path CAPS = {"KL", "JB", "PG", "MY"} def clean(addr): words = " ".join(addr.strip().split()).split(" ") out = [] for w in words: if w.upper() in CAPS: out.append(w.upper()) else: out.append(w.title()) return " ".join(out) for raw in Path("addresses.txt").read_text().splitlines(): if not raw.strip(): continue print(f" {raw!r:<40} → {clean(raw)!r}")