PY-L2-44 · Regex Project — Validators

Project Goals

3 min

By the end of this project you can:

Design a regex from a written format spec.
Anchor with ^ and $ so the pattern consumes the whole string.
Write a small test table — list of (input, expected) tuples — and verify your validator passes.
Combine multiple checks (regex + logical) when regex alone isn't enough.

Warm-Up · The Test-First Mindset

5 min

Before writing any regex, write the tests. List every shape you want to accept and every shape you want to reject. Then design a pattern that satisfies both.

EMAIL_TESTS = [
    ("aisyah@example.com",          True),
    ("a.b@example.co.uk",           True),
    ("a@b.cd",                      True),     # minimal: the TLD needs at least 2 letters
    ("nope",                        False),
    ("no@host",                     False),
    ("@nostart.com",                False),
    ("a@b.",                        False),
    ("two@@example.com",            False),
]

With the table written, you can test each iteration of your pattern. That is the test-first habit professional Python devs swear by.
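As a minimal sketch of that loop, here is a deliberately naive first attempt run against a shortened copy of the table. The names NAIVE_PAT and failures are illustrative only and are not part of the project file.

```python
import re

# Shortened copy of the table, for illustration only.
TESTS = [
    ("aisyah@example.com", True),
    ("nope",               False),
    ("no@host",            False),
    ("two@@example.com",   False),
]

# Hypothetical first attempt: "something, an @, something".
NAIVE_PAT = re.compile(r"^.+@.+$")

def failures(pattern, tests):
    """Return the (input, expected) rows the pattern gets wrong."""
    return [(text, expected) for text, expected in tests
            if bool(pattern.search(text)) != expected]

for text, expected in failures(NAIVE_PAT, TESTS):
    print(f"wrong on {text!r}: expected {expected}")
```

The naive pattern wrongly accepts "no@host" and "two@@example.com", which tells you what the next iteration has to tighten: the domain needs a dot, and neither side may contain a second @.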

Today's big idea

Validators are easy to get 90% right and hard to get 100%. Tests turn "mostly works" into "provably works for these cases".

Task 1 · Email Validator

10 min

An email looks like local@domain.tld: letters, digits, dots and a few symbols on each side of the @. Real-world addresses are messier (the full grammar in RFC 5322 runs to dozens of pages); for this project, the "good enough" pattern below covers the everyday shapes in our test table and nothing more exotic.

import re

EMAIL_PAT = re.compile(r"^[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}$")

def is_email(text):
    return bool(EMAIL_PAT.search(text.strip()))

Break it down:

^ — start.
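[A-Za-z0-9._%+\-]+ — local part (letters, digits, dots, underscore, percent, plus, hyphen). The hyphen is escaped so it is always read as a literal character rather than a range like A-Z. At the very end of the class the escape is optional, but it keeps the pattern safe if someone later adds characters after it.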
@ — literal at sign.
[A-Za-z0-9.\-]+ — domain.
\. — literal dot before the TLD.
[A-Za-z]{2,} — TLD of at least 2 letters.
$ — end.
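To see what the anchors buy you, the sketch below compiles the same pattern body with and without ^ and $. The names BODY, UNANCHORED and ANCHORED are illustrative, not part of the project file.

```python
import re

# The email pattern from this task, without its anchors.
BODY = r"[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}"

UNANCHORED = re.compile(BODY)
ANCHORED = re.compile("^" + BODY + "$")

messy = "send to aisyah@example.com please"

# search() without anchors is happy to find an address inside a sentence.
print(bool(UNANCHORED.search(messy)))

# With anchors, the whole string has to be an address.
print(bool(ANCHORED.search(messy)))
print(bool(ANCHORED.search("aisyah@example.com")))

# fullmatch() is a built-in alternative that demands the same thing.
print(bool(UNANCHORED.fullmatch(messy)))
```

Only the first and third calls print True. For a validator, "found somewhere inside" is not good enough, which is why every pattern in this project starts with ^ and ends with $.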

Task 2 · Malaysian IC Validator

10 min

MyKad format: YYMMDD-PB-####. Six digits (date of birth), dash, two digits (state), dash, four digits (sequence). Examples:

OK    140812-14-3456
OK    900101-08-1234
BAD   140812-14-345        (last group too short)
BAD   140812143456         (no dashes)
BAD   140812-14-3456X       (extra char)
BAD   991313-14-3456        (impossible month 13)

A regex could reject month 13 with extra alternation, but range rules like "month is 01 to 12" are far easier to read and maintain as a plain Python check. We'll layer them:

IC_PAT = re.compile(r"^(\d{2})(\d{2})(\d{2})-(\d{2})-(\d{4})$")

def is_mykad(text):
    m = IC_PAT.search(text.strip())
    if not m:
        return False
    yy, mm, dd, state, seq = m.groups()
    if not (1 <= int(mm) <= 12):
        return False
    if not (1 <= int(dd) <= 31):     # not perfect — Feb 31 still passes
        return False
    return True

The regex finds the structure; the Python check enforces the meaning. This layering is standard in real validators.
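If you want to close the "Feb 31" gap, one possible extension is a per-month day limit. This is a sketch under a stated assumption: the two-digit year does not say which century it belongs to, so February is always allowed 29 days. The names is_mykad_strict and MAX_DAYS are hypothetical and not part of the lesson's file.

```python
import re

IC_PAT = re.compile(r"^(\d{2})(\d{2})(\d{2})-(\d{2})-(\d{4})$")

# Longest possible length of each month, January first.
# Assumption: February gets 29 because the century (and so the leap year)
# cannot be known from a two-digit year.
MAX_DAYS = [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

def is_mykad_strict(text):
    m = IC_PAT.search(text.strip())
    if not m:
        return False
    mm, dd = int(m.group(2)), int(m.group(3))
    if not 1 <= mm <= 12:
        return False
    # Index the table with the month to get that month's upper bound.
    return 1 <= dd <= MAX_DAYS[mm - 1]

for ic in ["140812-14-3456", "140231-14-3456", "140431-14-3456"]:
    print(ic, is_mykad_strict(ic))
```

The regex is unchanged; only the semantic layer grew. That is the benefit of keeping structure and meaning in separate steps.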

Task 3 · Phone Validator

10 min

Malaysian mobile phone formats — accept these variations:

012-3456789       3-digit prefix, dash, 7 digits
012-345-6789      with extra dash
0123456789        no dashes
+60123456789      with country code
+60 12-345 6789   with spaces

Strategy: strip out the dashes and spaces first, then check the resulting digits look like a phone number.

def normalise(text):
    # Remove spaces, dashes, parentheses
    return re.sub(r"[\s\-()]", "", text.strip())

PHONE_PAT = re.compile(r"^(\+60|0)1\d{8,9}$")

def is_phone(text):
    return bool(PHONE_PAT.search(normalise(text)))

The re.sub(pattern, replacement, text) function replaces every match with the replacement string — empty string here. We'll use re.sub more in the capstone.

The phone pattern accepts either +60 or 0 followed by 1 (mobile prefix) followed by 8 or 9 more digits.
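It can help to look at what the pattern actually receives. The sketch below prints each raw input next to its normalised form, reusing normalise and PHONE_PAT from above; the sample list, including the parenthesised variant, is illustrative.

```python
import re

def normalise(text):
    # Remove spaces, dashes, parentheses
    return re.sub(r"[\s\-()]", "", text.strip())

PHONE_PAT = re.compile(r"^(\+60|0)1\d{8,9}$")

samples = [
    "012-3456789",
    "012-345-6789",
    "+60 12-345 6789",
    "(012) 345 6789",
    "012-345",          # too short once cleaned
]

for raw in samples:
    cleaned = normalise(raw)
    verdict = bool(PHONE_PAT.search(cleaned))
    print(f"{raw!r:<20} -> {cleaned!r:<16} {verdict}")
```

Five different spellings collapse to two shapes (leading 0 or leading +60), so the regex only has to describe those two.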

Task 4 · The Test Suite

8 min

Wrap each validator with a test table and run them all.

def run_tests(name, validator, tests):
    print(f"\n=== {name} ===")
    passing = 0
    for text, expected in tests:
        actual = validator(text)
        ok = (actual == expected)
        flag = "OK  " if ok else "FAIL"
        print(f"  {flag}  {text!r:<28} → {actual} (expected {expected})")
        passing += ok
    print(f"  {passing}/{len(tests)} passing")

EMAIL_TESTS = [
    ("aisyah@example.com",   True),
    ("a.b@example.co.uk",     True),
    ("a@b.cd",                True),
    ("nope",                  False),
    ("no@host",               False),
    ("@nostart.com",          False),
    ("a@b.",                  False),
]

IC_TESTS = [
    ("140812-14-3456",        True),
    ("900101-08-1234",        True),
    ("140812-14-345",         False),
    ("140812143456",          False),
    ("140812-14-3456X",       False),
    ("991313-14-3456",        False),     # bad month
]

PHONE_TESTS = [
    ("012-3456789",           True),
    ("012-345-6789",          True),
    ("0123456789",            True),
    ("+60123456789",          True),
    ("+60 12-345 6789",       True),
    ("123",                   False),
    ("012-345",               False),
    ("abc",                   False),
]

run_tests("Email", is_email, EMAIL_TESTS)
run_tests("MyKad", is_mykad, IC_TESTS)
run_tests("Phone", is_phone, PHONE_TESTS)
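The runner above only prints. If you would rather have the failures as data, for example to count them or stop a script when any exist, a small variant can return them. This is an optional sketch; failing_cases, is_digits and DEMO_TESTS are hypothetical names, and the stand-in validator exists only so the block runs on its own.

```python
def failing_cases(validator, tests):
    """Return (input, expected, actual) for every row the validator gets wrong."""
    wrong = []
    for text, expected in tests:
        actual = validator(text)
        if actual != expected:
            wrong.append((text, expected, actual))
    return wrong

# Stand-in validator and table so the sketch is self-contained.
def is_digits(text):
    return text.isdigit()

DEMO_TESTS = [
    ("123", True),
    ("abc", False),
    ("12a", True),    # deliberately wrong expectation, to show a failure
]

print(failing_cases(is_digits, DEMO_TESTS))
```

An empty list means every row passed, so "if failing_cases(...)" reads naturally as "if anything is broken".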

Putting It All Together · validators.py

8 min

Assemble all four tasks into one file. Run the tests. Iterate until all three sections pass.


# validators.py — three real validators with tests

import re

EMAIL_PAT = re.compile(r"^[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}$")
IC_PAT    = re.compile(r"^(\d{2})(\d{2})(\d{2})-(\d{2})-(\d{4})$")
PHONE_PAT = re.compile(r"^(\+60|0)1\d{8,9}$")

def is_email(text):
    return bool(EMAIL_PAT.search(text.strip()))

def is_mykad(text):
    m = IC_PAT.search(text.strip())
    if not m:
        return False
    yy, mm, dd, state, seq = m.groups()
    return 1 <= int(mm) <= 12 and 1 <= int(dd) <= 31

def is_phone(text):
    normalised = re.sub(r"[\s\-()]", "", text.strip())
    return bool(PHONE_PAT.search(normalised))

# --- tests ---

def run_tests(name, validator, tests):
    print(f"\n=== {name} ===")
    passing = 0
    for text, expected in tests:
        actual = validator(text)
        ok = actual == expected
        print(f"  {'OK' if ok else 'FAIL':<4}  {text!r:<28} → {actual}  expected {expected}")
        passing += ok
    print(f"  {passing}/{len(tests)} passing")

if __name__ == "__main__":
    EMAIL_TESTS = [
        ("aisyah@example.com",   True),
        ("a.b@example.co.uk",     True),
        ("a@b.cd",                True),
        ("nope",                  False),
        ("no@host",               False),
    ]
    IC_TESTS = [
        ("140812-14-3456",        True),
        ("900101-08-1234",        True),
        ("140812143456",          False),
        ("991313-14-3456",        False),
    ]
    PHONE_TESTS = [
        ("012-3456789",           True),
        ("0123456789",            True),
        ("+60123456789",          True),
        ("123",                   False),
    ]
    run_tests("Email", is_email, EMAIL_TESTS)
    run_tests("MyKad", is_mykad, IC_TESTS)
    run_tests("Phone", is_phone, PHONE_TESTS)

Non-negotiables: three pure validator functions, three test tables, a runner that prints OK/FAIL. The if __name__ == "__main__": guard from PY-L2-31 means importing validators elsewhere doesn't run the tests.

Recap

3 min

Validators follow a recipe: anchor with ^/$, describe the format, layer a Python check on top for any semantic constraints (like valid month). Write the test table first. Iterate the pattern until all tests pass. re.sub is the third member of the regex family — use it to normalise messy input before validating. Three small validators in one tidy module = a reusable foundation for any form you'll ever build.

What's next

Tomorrow we leave regex behind for JSON — the universal data-exchange format. Then we'll combine JSON, regex and files into the capstone.

Homework

4 min

Add a 4th validator to your file:

is_postcode(text) — Malaysian postcode is exactly 5 digits, no other characters. Example: 50480.
Add at least 5 test cases for it.

Stretch. Add is_strong_password(text) reusing the rules from PY-L2-42's homework — at least 8 chars, has digit, has upper, has lower, has symbol. Express the four character-class checks with regex; the length check with len().

Sample · added validators

POSTCODE_PAT = re.compile(r"^\d{5}$")

def is_postcode(text):
    return bool(POSTCODE_PAT.search(text.strip()))

def is_strong_password(text):
    if len(text) < 8: return False
    if not re.search(r"\d", text):         return False
    if not re.search(r"[A-Z]", text):       return False
    if not re.search(r"[a-z]", text):       return False
    if not re.search(r"[!@#$%^&*]", text):  return False
    return True

POSTCODE_TESTS = [
    ("50480",  True),
    ("12345",  True),
    ("1234",   False),
    ("123456", False),
    ("5048A",  False),
]
run_tests("Postcode", is_postcode, POSTCODE_TESTS)

Non-negotiables: anchored 5-digit pattern, at least 5 tests covering correct and too-short/too-long/bad-char cases.
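The stretch validator deserves a table too. Below is one possible set of cases, each built to break exactly one rule, together with an equivalent compact form of the sample function. PASSWORD_TESTS is a hypothetical name, and the symbol set is the one assumed in the sample above.

```python
import re

def is_strong_password(text):
    if len(text) < 8:
        return False
    # One regex per required character class: digit, upper, lower, symbol.
    checks = [r"\d", r"[A-Z]", r"[a-z]", r"[!@#$%^&*]"]
    return all(re.search(pattern, text) for pattern in checks)

PASSWORD_TESTS = [
    ("Abcdef1!",   True),
    ("Ab1!",       False),   # too short
    ("abcdefg1!",  False),   # no uppercase letter
    ("ABCDEFG1!",  False),   # no lowercase letter
    ("Abcdefgh!",  False),   # no digit
    ("Abcdefg12",  False),   # no symbol
]

for text, expected in PASSWORD_TESTS:
    actual = is_strong_password(text)
    print(f"{'OK' if actual == expected else 'FAIL':<4}  {text!r:<14} {actual}")
```

Note that these patterns are deliberately not anchored: each one asks "does this kind of character appear anywhere?", which is the one situation in this project where an unanchored search is the right tool.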
