Challenge Goals
3 min- Identify and fix six categories of dirt in one CSV.
- Produce
clean.csvandsummary.jsonas artefacts. - Print a janitor report — what you cleaned, what you dropped.
- Combine everything from lessons 01–07.
The Dirty File
5 minSave this as orders_dirty.csv. Eyeball it first — how many problems can you spot?
order_id,customer,product,quantity,price,date 1001, Ahmad ,ROTI,3, 1.50 ,2026-05-01 1002,mei,milo,2,3.00,2026-05-01 1003,Suresh, NASI ,one,8.00,2026-05-02 1004,,roti,5,1.50,2026/05/02 1005,Devi,Nasi,2,8.00,2026-05-03 1005,Devi,Nasi,2,8.00,2026-05-03 1006,Ali ,MILO,4,3.0,03-05-2026 1007,Mei,roti,0,1.50,2026-05-04 1008,ali,nasi,2,,2026-05-04
Six problem categories
- Trailing/leading whitespace on names and products.
- Mixed casing on customer names and products.
- Quantity stored as the word
"one"instead of a number. - Mixed date formats —
2026-05-01,2026/05/02,03-05-2026. - Missing values — empty customer, empty price.
- Logic bugs — duplicate order id 1005, zero-quantity row 1007.
Write janitor.py that reads orders_dirty.csv, fixes everything fixable, drops what isn't fixable, and writes clean.csv + summary.json. Print a janitor report at the end.
Plan · Pipeline Stages
14 minApproach this as a small pipeline. Each stage has one job:
stage 1: load → list of dicts (raw strings) stage 2: clean → normalise whitespace, case stage 3: parse → strings → ints, floats, dates stage 4: validate → drop rows with required fields missing stage 5: dedupe → drop duplicate order_ids stage 6: report → write clean.csv + summary.json + console report
Date handling
Three input formats means you need to try them in order:
from datetime import datetime DATE_FORMATS = ["%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y"] def parse_date(s): for fmt in DATE_FORMATS: try: return datetime.strptime(s.strip(), fmt).date() except ValueError: continue return None # signal failure
The "one" problem
Sometimes quantities are spelled out. A tiny lookup table fixes it:
WORDS = {"zero":0, "one":1, "two":2, "three":3, "four":4, "five":5, "six":6, "seven":7, "eight":8, "nine":9, "ten":10} def parse_int(s): s = s.strip().lower() if s in WORDS: return WORDS[s] try: return int(s) except ValueError: return None
Drop-or-keep rules
Be explicit about which fields are required. For this CSV:
Required: order_id, customer, product, quantity, price, date Sanity: quantity > 0, price > 0 Unique: order_id
Reference Solution · janitor.py
12 min# janitor.py — clean orders_dirty.csv → clean.csv + summary.json import csv, json from datetime import datetime DATE_FORMATS = ["%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y"] WORDS = {"zero":0, "one":1, "two":2, "three":3, "four":4, "five":5} def clean_text(s): return " ".join((s or "").strip().split()) def parse_date(s): for fmt in DATE_FORMATS: try: return datetime.strptime(s.strip(), fmt).date() except ValueError: continue return None def parse_int(s): s = clean_text(s).lower() if s in WORDS: return WORDS[s] try: return int(s) except ValueError: return None def parse_float(s): try: return float(clean_text(s)) except ValueError: return None # Stage 1: load with open("orders_dirty.csv", newline="") as f: raw_rows = list(csv.DictReader(f)) cleaned, dropped = [], [] seen_ids = set() for r in raw_rows: # Stages 2 + 3: clean & parse row = { "order_id": parse_int(r["order_id"]), "customer": clean_text(r["customer"]).title(), "product": clean_text(r["product"]).title(), "quantity": parse_int(r["quantity"]), "price": parse_float(r["price"]), "date": parse_date(r["date"]), } # Stage 4: validate if not all([row["order_id"], row["customer"], row["product"], row["quantity"], row["price"], row["date"]]): dropped.append({"row": r, "reason": "missing or unparseable field"}) continue if row["quantity"] <= 0 or row["price"] <= 0: dropped.append({"row": r, "reason": "non-positive quantity/price"}) continue # Stage 5: dedupe if row["order_id"] in seen_ids: dropped.append({"row": r, "reason": "duplicate order_id"}) continue seen_ids.add(row["order_id"]) cleaned.append(row) # Stage 6: write with open("clean.csv", "w", newline="") as f: w = csv.DictWriter(f, fieldnames=cleaned[0].keys()) w.writeheader() for row in cleaned: out = dict(row) out["date"] = row["date"].strftime("%Y-%m-%d") w.writerow(out) summary = { "input_rows": len(raw_rows), "clean_rows": len(cleaned), "dropped_rows": len(dropped), "total_revenue": round(sum(r["quantity"] * r["price"] for r in cleaned), 2), "first_date": str(min(r["date"] for r in cleaned)), "last_date": str(max(r["date"] for r in cleaned)), } with open("summary.json", "w") as f: json.dump(summary, f, indent=2) # Janitor report print("🧹 janitor report") print(f" input: {summary['input_rows']} rows") print(f" kept: {summary['clean_rows']} rows") print(f" dropped: {summary['dropped_rows']} rows") print(f" revenue: RM {summary['total_revenue']:.2f}") print() for d in dropped: print(f" ✗ id={d['row']['order_id']!r} {d['reason']}")
Sample output
🧹 janitor report input: 9 rows kept: 5 rows dropped: 4 rows revenue: RM 38.50 ✗ id='1004' missing or unparseable field ✗ id='1005' duplicate order_id ✗ id='1007' non-positive quantity/price ✗ id='1008' missing or unparseable field
Extensions · Push Yourself
13 minInstead of just dropping bad rows, save them to quarantine.csv with an extra reason column so a human can fix them later.
Have parse_date look at the first 5 rows, find which format wins most often, and use that one for the whole file. Faster + more predictable than trying all three for every row.
Refactor janitor.py so the entire pipeline is a function clean(path_in, path_out, report_path). Then in __main__, parse sys.argv so the tool can be pointed at any CSV.
Stretch · Live Schema Check
8 minDefine the schema as a list of (name, parser, required) tuples at the top of janitor.py. The pipeline reads the schema once and applies it row-by-row — no hard-coded column names in the cleanup logic.
Show the skeleton
SCHEMA = [ ("order_id", parse_int, True), ("customer", clean_text, True), ("product", clean_text, True), ("quantity", parse_int, True), ("price", parse_float, True), ("date", parse_date, True), ] def clean_row(raw): out = {} for name, parser, required in SCHEMA: val = parser(raw.get(name, "")) if required and not val and val != 0: return None out[name] = val return out
Adding a new column = adding one line to SCHEMA. That's the kind of design that scales.
Recap
3 minYou just did real data engineering. A six-stage pipeline turned an unusable file into a clean dataset and a machine-readable summary. The shape — load → clean → parse → validate → dedupe → report — works for every messy file you'll ever meet. Next week we connect Python to the live web.
Homework · Your Own Janitor
4 minFind or create a messy CSV from your own life — a Spotify export, a school timetable, your phone's contact dump. Write a janitor that cleans it, drops the unfixable, and emits a JSON summary that includes at least one aggregate (count, sum, average or top-N). Document the rules in a short comment block at the top of the file so a teammate could re-run it next week.
Sample skeleton
# clean_my_data.py # Source: my_export.csv # Required: title, artist, played_at # Drop reasons: missing title/artist; unparseable date; played_at in the future import csv, json from datetime import datetime # ... helpers from the lesson ... rows, dropped, kept = load(...), [], [] for r in rows: ok = ... (kept if ok else dropped).append(r) with open("clean.csv", "w", newline="") as f: ... json.dump({ "input_rows": len(rows), "kept": len(kept), "top_artist": ... }, open("summary.json", "w"), indent=2)
Non-negotiables: documented rules, both artefacts emitted, at least one aggregate in the JSON.