Learning Goals
3 min- Strip whitespace with
strip,lstrip,rstrip. - Normalise case with
lower,upper,title. - Replace substrings with
replace; collapse withsplit + join. - Detect missing or empty values defensively.
Warm-Up · Spot the Dirt
5 minHere's a tiny customer list. Find five problems.
name,email,city Ahmad ,ahmad@EMAIL.com,KL mei, mei@email.com ,kuala lumpur SURESH,suresh@email.com,Kuala Lumpur Aisyah,,K.L. Devi ,devi@email.com,kl
Show the answers
- Trailing space after
Ahmad. - Mixed case in
EMAIL.com. - Leading/trailing whitespace around
mei@email.comandDevi. - City has four different spellings: KL, kuala lumpur, Kuala Lumpur, K.L., kl.
- Aisyah's email is empty.
Clean once, near the loader. Don't scatter cleanup logic across the program — fix every field as you read it, then trust the data downstream.
New Concept · The Cleanup Toolbox
14 minWhitespace
s = " hello world " print(repr(s.strip())) # → 'hello world' print(repr(s.lstrip())) # → 'hello world ' print(repr(s.rstrip())) # → ' hello world'
Case
print("KuALa LumPuR".lower()) # → kuala lumpur print("kuala lumpur".upper()) # → KUALA LUMPUR print("kuala lumpur".title()) # → Kuala Lumpur
Replace and collapse
s = "K.L." print(s.replace(".", "")) # → KL # Collapse runs of whitespace using split + join s = "kuala lumpur" print(" ".join(s.split())) # → kuala lumpur
"a b".split() with no argument splits on any run of whitespace AND drops empties — handy for collapsing.
Detecting missing values
val = "" if not val.strip(): print("missing")
An empty string is falsy. val.strip() turns "all-whitespace" values into "", which makes them falsy too. not val.strip() means "blank or only whitespace".
A clean function
Bundle the work into one function and call it for every value:
def clean(s): s = s.strip().lower() s = " ".join(s.split()) return s.replace(".", "") print(clean(" K.L. ")) # → kl print(clean("kuala lumpur")) # → kuala lumpur print(clean("KuAlA LuMpUr ")) # → kuala lumpur
Worked Example · Clean the Customer CSV
12 minSave the dirty data from the warm-up as customers.csv. Now clean it.
# clean_customers.py — normalise messy strings import csv from pathlib import Path CITY_MAP = { "kl": "Kuala Lumpur", "kuala lumpur": "Kuala Lumpur", } def clean_text(s): if s is None: return "" return " ".join(s.strip().split()) def clean_city(s): key = clean_text(s).lower().replace(".", "") return CITY_MAP.get(key, clean_text(s).title()) with open("customers.csv", newline="") as f: rows = list(csv.DictReader(f)) cleaned = [] for r in rows: cleaned.append({ "name": clean_text(r["name"]).title(), "email": clean_text(r["email"]).lower(), "city": clean_city(r["city"]), }) # Drop rows missing an email final = [r for r in cleaned if r["email"]] print(f"Loaded {len(rows)} rows, kept {len(final)} after cleaning.\n") for r in final: print(f" {r['name']:<10} {r['email']:<25} {r['city']}") with open("customers_clean.csv", "w", newline="") as f: w = csv.DictWriter(f, fieldnames=["name", "email", "city"]) w.writeheader() w.writerows(final)
Sample output
Loaded 5 rows, kept 4 after cleaning. Ahmad ahmad@email.com Kuala Lumpur Mei mei@email.com Kuala Lumpur Suresh suresh@email.com Kuala Lumpur Devi devi@email.com Kuala Lumpur
Read the diff
Three helpers (clean_text, clean_city, and the missing-email filter) together fix all five problems. The cleanup is in one place — easy to test, easy to extend. That's the "clean once, near the loader" rule.
Try It Yourself
13 minWrite clean_phone: remove spaces, dashes and parentheses, keep only digits and a leading +.
Hint
def clean_phone(s): out = "".join(c for c in s if c.isdigit() or c == "+") return out print(clean_phone("+60 12-345 6789")) # → +60123456789 print(clean_phone("(03) 8888-7777")) # → 0388887777
Return True if a string has exactly one @, at least one . after the @, and nothing else weird.
Hint
def looks_like_email(s): s = s.strip() if s.count("@") != 1: return False user, domain = s.split("@") if not user or "." not in domain: return False return True print(looks_like_email("a@b.com")) # → True print(looks_like_email("nope")) # → False print(looks_like_email("a@@b.com")) # → False
Real email validation is one of the hardest small problems in software. This "light" check catches obvious junk; in production use a library or a regex.
Build a COUNTRY_MAP that turns "malaysia", "Malaysia", "MY", "mly" into "Malaysia". Make it work for at least three countries with several aliases each.
Hint
COUNTRY_MAP = {} def add(canon, *aliases): for a in aliases: COUNTRY_MAP[a.lower()] = canon add("Malaysia", "malaysia", "my", "mly", "msia") add("Singapore", "singapore", "sg", "sgp", "spore") add("Indonesia", "indonesia", "id", "idn", "indo") def canonical(s): return COUNTRY_MAP.get(s.strip().lower(), s.strip().title()) print(canonical("msia")) # → Malaysia print(canonical("Sg")) # → Singapore print(canonical("brunei")) # → Brunei (unknown → just title-cased)
Mini-Challenge · Address Standardiser
8 minBuild fix_addr.py. It reads addresses (one per line) from addresses.txt and produces a clean version of each:
- Strip leading/trailing whitespace.
- Collapse multiple spaces to one.
- Title-case the words EXCEPT keep state codes (KL, JB, PG) all-caps.
- Skip blank lines.
Print the cleaned version side-by-side with the original for visual review.
Show one possible solution
# fix_addr.py from pathlib import Path CAPS = {"KL", "JB", "PG", "MY"} def clean(addr): words = " ".join(addr.strip().split()).split(" ") out = [] for w in words: if w.upper() in CAPS: out.append(w.upper()) else: out.append(w.title()) return " ".join(out) for raw in Path("addresses.txt").read_text().splitlines(): if not raw.strip(): continue print(f" {raw!r:<40} → {clean(raw)!r}")
Non-negotiables: strip, collapse, title-case, but the abbreviations stay upper. Print both for review.
Recap
3 minSix tools handle 90% of dirty data: strip, lower/upper/title, replace, split + join, a falsy check for missing values, and a small lookup-map for spelling variants. Wrap them in one clean_* function per column and apply at load time. Downstream code can then trust the data.
Vocabulary Card
- strip
- Removes leading and trailing whitespace.
- title / lower / upper
- Casing helpers —
titlecapitalises each word. - split + join
- The idiomatic way to collapse runs of whitespace.
- lookup map
- A dict that turns every spelling variant into the canonical value.
Homework
4 minTake a friend's contact list (or invent one with 10 deliberately messy rows). Write clean_contacts.py that fixes every common issue: whitespace, name casing, lowercase email, normalised phone (use your function from Try-It #1), missing values filtered out. Save the result to contacts_clean.csv and print a one-line summary of how many rows were dropped.
Sample · clean_contacts.py
# clean_contacts.py import csv def clean_text(s): return " ".join((s or "").strip().split()) def clean_phone(s): return "".join(c for c in (s or "") if c.isdigit() or c == "+") def looks_like_email(s): s = s.strip() return s.count("@") == 1 and "." in s.split("@")[-1] with open("contacts_dirty.csv", newline="") as f: rows = list(csv.DictReader(f)) cleaned, dropped = [], 0 for r in rows: name = clean_text(r["name"]).title() email = clean_text(r["email"]).lower() phone = clean_phone(r["phone"]) if not name or not looks_like_email(email): dropped += 1 continue cleaned.append({"name": name, "email": email, "phone": phone}) with open("contacts_clean.csv", "w", newline="") as f: w = csv.DictWriter(f, fieldnames=["name", "email", "phone"]) w.writeheader() w.writerows(cleaned) print(f"Kept {len(cleaned)}, dropped {dropped}.")
Non-negotiables: helpers per column, drop bad rows with a count, save the clean file.