PY-L4-08 · Challenge — Data Janitor

Challenge Goals

3 min

Identify and fix six categories of dirt in one CSV.
Produce clean.csv and summary.json as artefacts.
Print a janitor report — what you cleaned, what you dropped.
Combine everything from lessons 01–07.

The Dirty File

5 min

Save this as orders_dirty.csv. Eyeball it first — how many problems can you spot?

order_id,customer,product,quantity,price,date
1001, Ahmad ,ROTI,3, 1.50 ,2026-05-01
1002,mei,milo,2,3.00,2026-05-01
1003,Suresh, NASI ,one,8.00,2026-05-02
1004,,roti,5,1.50,2026/05/02
1005,Devi,Nasi,2,8.00,2026-05-03
1005,Devi,Nasi,2,8.00,2026-05-03
1006,Ali  ,MILO,4,3.0,03-05-2026
1007,Mei,roti,0,1.50,2026-05-04
1008,ali,nasi,2,,2026-05-04

Six problem categories

Trailing/leading whitespace on names and products.
Mixed casing on customer names and products.
Quantity stored as the word "one" instead of a number.
Mixed date formats — 2026-05-01, 2026/05/02, 03-05-2026.
Missing values — empty customer, empty price.
Logic bugs — duplicate order id 1005, zero-quantity row 1007.

Your mission

Write janitor.py that reads orders_dirty.csv, fixes everything fixable, drops what isn't fixable, and writes clean.csv + summary.json. Print a janitor report at the end.

Plan · Pipeline Stages

14 min

Approach this as a small pipeline. Each stage has one job:

stage 1: load     → list of dicts (raw strings)
stage 2: clean    → normalise whitespace, case
stage 3: parse    → strings → ints, floats, dates
stage 4: validate → drop rows with required fields missing
stage 5: dedupe   → drop duplicate order_ids
stage 6: report   → write clean.csv + summary.json + console report

Date handling

Three input formats means you need to try them in order:

from datetime import datetime

DATE_FORMATS = ["%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y"]

def parse_date(s):
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(s.strip(), fmt).date()
        except ValueError:
            continue
    return None       # signal failure

The "one" problem

Sometimes quantities are spelled out. A tiny lookup table fixes it:

WORDS = {"zero":0, "one":1, "two":2, "three":3, "four":4, "five":5,
         "six":6, "seven":7, "eight":8, "nine":9, "ten":10}

def parse_int(s):
    s = s.strip().lower()
    if s in WORDS:
        return WORDS[s]
    try:
        return int(s)
    except ValueError:
        return None

Drop-or-keep rules

Be explicit about which fields are required. For this CSV:

Required: order_id, customer, product, quantity, price, date
Sanity:   quantity > 0, price > 0
Unique:   order_id

Reference Solution · janitor.py

12 min

# janitor.py — clean orders_dirty.csv → clean.csv + summary.json
import csv, json
from datetime import datetime

DATE_FORMATS = ["%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y"]
WORDS = {"zero":0, "one":1, "two":2, "three":3, "four":4, "five":5}

def clean_text(s): return " ".join((s or "").strip().split())

def parse_date(s):
    for fmt in DATE_FORMATS:
        try: return datetime.strptime(s.strip(), fmt).date()
        except ValueError: continue
    return None

def parse_int(s):
    s = clean_text(s).lower()
    if s in WORDS: return WORDS[s]
    try: return int(s)
    except ValueError: return None

def parse_float(s):
    try: return float(clean_text(s))
    except ValueError: return None

# Stage 1: load
with open("orders_dirty.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))

cleaned, dropped = [], []
seen_ids = set()

for r in raw_rows:
    # Stages 2 + 3: clean & parse
    row = {
        "order_id": parse_int(r["order_id"]),
        "customer": clean_text(r["customer"]).title(),
        "product":  clean_text(r["product"]).title(),
        "quantity": parse_int(r["quantity"]),
        "price":    parse_float(r["price"]),
        "date":     parse_date(r["date"]),
    }
    # Stage 4: validate
    if not all([row["order_id"], row["customer"], row["product"],
                row["quantity"], row["price"], row["date"]]):
        dropped.append({"row": r, "reason": "missing or unparseable field"})
        continue
    if row["quantity"] <= 0 or row["price"] <= 0:
        dropped.append({"row": r, "reason": "non-positive quantity/price"})
        continue
    # Stage 5: dedupe
    if row["order_id"] in seen_ids:
        dropped.append({"row": r, "reason": "duplicate order_id"})
        continue
    seen_ids.add(row["order_id"])
    cleaned.append(row)

# Stage 6: write
with open("clean.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames=cleaned[0].keys())
    w.writeheader()
    for row in cleaned:
        out = dict(row)
        out["date"] = row["date"].strftime("%Y-%m-%d")
        w.writerow(out)

summary = {
    "input_rows":   len(raw_rows),
    "clean_rows":   len(cleaned),
    "dropped_rows": len(dropped),
    "total_revenue": round(sum(r["quantity"] * r["price"] for r in cleaned), 2),
    "first_date":    str(min(r["date"] for r in cleaned)),
    "last_date":     str(max(r["date"] for r in cleaned)),
}
with open("summary.json", "w") as f:
    json.dump(summary, f, indent=2)

# Janitor report
print("🧹 janitor report")
print(f"   input:   {summary['input_rows']} rows")
print(f"   kept:    {summary['clean_rows']} rows")
print(f"   dropped: {summary['dropped_rows']} rows")
print(f"   revenue: RM {summary['total_revenue']:.2f}")
print()
for d in dropped:
    print(f"   ✗ id={d['row']['order_id']!r}  {d['reason']}")

Sample output

🧹 janitor report
   input:   9 rows
   kept:    5 rows
   dropped: 4 rows
   revenue: RM 38.50

   ✗ id='1004'  missing or unparseable field
   ✗ id='1005'  duplicate order_id
   ✗ id='1007'  non-positive quantity/price
   ✗ id='1008'  missing or unparseable field

Extensions · Push Yourself

13 min

01 🟢 Quarantine instead of drop

Instead of just dropping bad rows, save them to quarantine.csv with an extra reason column so a human can fix them later.

02 🟡 Auto-detect date format

Have parse_date look at the first 5 rows, find which format wins most often, and use that one for the whole file. Faster + more predictable than trying all three for every row.

03 🔴 Janitor as a function

Refactor janitor.py so the entire pipeline is a function clean(path_in, path_out, report_path). Then in __main__, parse sys.argv so the tool can be pointed at any CSV.

Stretch · Live Schema Check

8 min

Define the schema as a list of (name, parser, required) tuples at the top of janitor.py. The pipeline reads the schema once and applies it row-by-row — no hard-coded column names in the cleanup logic.

Show the skeleton

SCHEMA = [
    ("order_id", parse_int,   True),
    ("customer", clean_text,  True),
    ("product",  clean_text,  True),
    ("quantity", parse_int,   True),
    ("price",    parse_float, True),
    ("date",     parse_date,  True),
]

def clean_row(raw):
    out = {}
    for name, parser, required in SCHEMA:
        val = parser(raw.get(name, ""))
        if required and not val and val != 0:
            return None
        out[name] = val
    return out

Adding a new column = adding one line to SCHEMA. That's the kind of design that scales.

Recap

3 min

You just did real data engineering. A six-stage pipeline turned an unusable file into a clean dataset and a machine-readable summary. The shape — load → clean → parse → validate → dedupe → report — works for every messy file you'll ever meet. Next week we connect Python to the live web.

Homework · Your Own Janitor

4 min

Find or create a messy CSV from your own life — a Spotify export, a school timetable, your phone's contact dump. Write a janitor that cleans it, drops the unfixable, and emits a JSON summary that includes at least one aggregate (count, sum, average or top-N). Document the rules in a short comment block at the top of the file so a teammate could re-run it next week.

Sample skeleton

# clean_my_data.py
# Source:        my_export.csv
# Required:      title, artist, played_at
# Drop reasons:  missing title/artist; unparseable date; played_at in the future
import csv, json
from datetime import datetime

# ... helpers from the lesson ...

rows, dropped, kept = load(...), [], []
for r in rows:
    ok = ...
    (kept if ok else dropped).append(r)

with open("clean.csv", "w", newline="") as f: ...
json.dump({
    "input_rows": len(rows),
    "kept": len(kept),
    "top_artist": ...
}, open("summary.json", "w"), indent=2)

Non-negotiables: documented rules, both artefacts emitted, at least one aggregate in the JSON.

order_id,customer,product,quantity,price,date 1001, Ahmad ,ROTI,3, 1.50 ,2026-05-01 1002,mei,milo,2,3.00,2026-05-01 1003,Suresh, NASI ,one,8.00,2026-05-02 1004,,roti,5,1.50,2026/05/02 1005,Devi,Nasi,2,8.00,2026-05-03 1005,Devi,Nasi,2,8.00,2026-05-03 1006,Ali ,MILO,4,3.0,03-05-2026 1007,Mei,roti,0,1.50,2026-05-04 1008,ali,nasi,2,,2026-05-04

stage 1: load → list of dicts (raw strings) stage 2: clean → normalise whitespace, case stage 3: parse → strings → ints, floats, dates stage 4: validate → drop rows with required fields missing stage 5: dedupe → drop duplicate order_ids stage 6: report → write clean.csv + summary.json + console report

from datetime import datetime DATE_FORMATS = ["%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y"] def parse_date(s): for fmt in DATE_FORMATS: try: return datetime.strptime(s.strip(), fmt).date() except ValueError: continue return None # signal failure

WORDS = {"zero":0, "one":1, "two":2, "three":3, "four":4, "five":5, "six":6, "seven":7, "eight":8, "nine":9, "ten":10} def parse_int(s): s = s.strip().lower() if s in WORDS: return WORDS[s] try: return int(s) except ValueError: return None

# janitor.py — clean orders_dirty.csv → clean.csv + summary.json import csv, json from datetime import datetime DATE_FORMATS = ["%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y"] WORDS = {"zero":0, "one":1, "two":2, "three":3, "four":4, "five":5} def clean_text(s): return " ".join((s or "").strip().split()) def parse_date(s): for fmt in DATE_FORMATS: try: return datetime.strptime(s.strip(), fmt).date() except ValueError: continue return None def parse_int(s): s = clean_text(s).lower() if s in WORDS: return WORDS[s] try: return int(s) except ValueError: return None def parse_float(s): try: return float(clean_text(s)) except ValueError: return None # Stage 1: load with open("orders_dirty.csv", newline="") as f: raw_rows = list(csv.DictReader(f)) cleaned, dropped = [], [] seen_ids = set() for r in raw_rows: # Stages 2 + 3: clean & parse row = { "order_id": parse_int(r["order_id"]), "customer": clean_text(r["customer"]).title(), "product": clean_text(r["product"]).title(), "quantity": parse_int(r["quantity"]), "price": parse_float(r["price"]), "date": parse_date(r["date"]), } # Stage 4: validate if not all([row["order_id"], row["customer"], row["product"], row["quantity"], row["price"], row["date"]]): dropped.append({"row": r, "reason": "missing or unparseable field"}) continue if row["quantity"] <= 0 or row["price"] <= 0: dropped.append({"row": r, "reason": "non-positive quantity/price"}) continue # Stage 5: dedupe if row["order_id"] in seen_ids: dropped.append({"row": r, "reason": "duplicate order_id"}) continue seen_ids.add(row["order_id"]) cleaned.append(row) # Stage 6: write with open("clean.csv", "w", newline="") as f: w = csv.DictWriter(f, fieldnames=cleaned[0].keys()) w.writeheader() for row in cleaned: out = dict(row) out["date"] = row["date"].strftime("%Y-%m-%d") w.writerow(out) summary = { "input_rows": len(raw_rows), "clean_rows": len(cleaned), "dropped_rows": len(dropped), "total_revenue": round(sum(r["quantity"] * r["price"] for r in cleaned), 2), "first_date": str(min(r["date"] for r in cleaned)), "last_date": str(max(r["date"] for r in cleaned)), } with open("summary.json", "w") as f: json.dump(summary, f, indent=2) # Janitor report print("🧹 janitor report") print(f" input: {summary['input_rows']} rows") print(f" kept: {summary['clean_rows']} rows") print(f" dropped: {summary['dropped_rows']} rows") print(f" revenue: RM {summary['total_revenue']:.2f}") print() for d in dropped: print(f" ✗ id={d['row']['order_id']!r} {d['reason']}")

🧹 janitor report input: 9 rows kept: 5 rows dropped: 4 rows revenue: RM 38.50 ✗ id='1004' missing or unparseable field ✗ id='1005' duplicate order_id ✗ id='1007' non-positive quantity/price ✗ id='1008' missing or unparseable field

SCHEMA = [ ("order_id", parse_int, True), ("customer", clean_text, True), ("product", clean_text, True), ("quantity", parse_int, True), ("price", parse_float, True), ("date", parse_date, True), ] def clean_row(raw): out = {} for name, parser, required in SCHEMA: val = parser(raw.get(name, "")) if required and not val and val != 0: return None out[name] = val return out