Project Goals
3 minBy the end of this project you can:
- Design a regex from a written format spec.
- Anchor with
^and$so the pattern consumes the whole string. - Write a small test table — list of (input, expected) tuples — and verify your validator passes.
- Combine multiple checks (regex + logical) when regex alone isn't enough.
Warm-Up · The Test-First Mindset
5 minBefore writing any regex, write the tests. List every shape you want to accept and every shape you want to reject. Then design a pattern that satisfies both.
EMAIL_TESTS = [ ("aisyah@example.com", True), ("a.b@example.co.uk", True), ("a@b.c", True), # minimal ("nope", False), ("no@host", False), ("@nostart.com", False), ("a@b.", False), ("two@@example.com", False), ]
With the table written, you can test each iteration of your pattern. That is the test-first habit professional Python devs swear by.
Validators are easy to get 90% right and hard to get 100%. Tests turn "mostly works" into "provably works for these cases".
Task 1 · Email Validator
10 minAn email looks like local@domain.tld — letters, digits, dots and a few symbols on each side. Real-world emails are messier (RFC 5322 is 80 pages); for this project, the "good enough" pattern below covers 95% of real usage.
import re EMAIL_PAT = re.compile(r"^[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}$") def is_email(text): return bool(EMAIL_PAT.search(text.strip()))
Break it down:
^— start.[A-Za-z0-9._%+\-]+— local part (letters, digits, dots, underscore, percent, plus, hyphen). The hyphen is escaped because it's at the end of the class.@— literal at sign.[A-Za-z0-9.\-]+— domain.\.— literal dot before the TLD.[A-Za-z]{2,}— TLD of at least 2 letters.$— end.
Task 2 · Malaysian IC Validator
10 minMyKad format: YYMMDD-PB-####. Six digits (date of birth), dash, two digits (state), dash, four digits (sequence). Examples:
OK 140812-14-3456 OK 900101-08-1234 BAD 140812-14-345 (last group too short) BAD 140812143456 (no dashes) BAD 140812-14-3456X (extra char) BAD 991313-14-3456 (impossible month 13)
A pure regex can't check "impossible month" — that needs an extra Python check. We'll layer them:
IC_PAT = re.compile(r"^(\d{2})(\d{2})(\d{2})-(\d{2})-(\d{4})$") def is_mykad(text): m = IC_PAT.search(text.strip()) if not m: return False yy, mm, dd, state, seq = m.groups() if not (1 <= int(mm) <= 12): return False if not (1 <= int(dd) <= 31): # not perfect — Feb 31 still passes return False return True
The regex finds the structure; the Python check enforces the meaning. This layering is standard in real validators.
Task 3 · Phone Validator
10 minMalaysian mobile phone formats — accept these variations:
012-3456789 3-digit prefix, dash, 7 digits 012-345-6789 with extra dash 0123456789 no dashes +60123456789 with country code +60 12-345 6789 with spaces
Strategy: strip out the dashes and spaces first, then check the resulting digits look like a phone number.
def normalise(text): # Remove spaces, dashes, parentheses return re.sub(r"[\s\-()]", "", text.strip()) PHONE_PAT = re.compile(r"^(\+60|0)1\d{8,9}$") def is_phone(text): return bool(PHONE_PAT.search(normalise(text)))
The re.sub(pattern, replacement, text) function replaces every match with the replacement string — empty string here. We'll use re.sub more in the capstone.
The phone pattern accepts either +60 or 0 followed by 1 (mobile prefix) followed by 8 or 9 more digits.
Task 4 · The Test Suite
8 minWrap each validator with a test table and run them all.
def run_tests(name, validator, tests): print(f"\n=== {name} ===") passing = 0 for text, expected in tests: actual = validator(text) ok = (actual == expected) flag = "OK " if ok else "FAIL" print(f" {flag} {text!r:<28} → {actual} (expected {expected})") passing += ok print(f" {passing}/{len(tests)} passing") EMAIL_TESTS = [ ("aisyah@example.com", True), ("a.b@example.co.uk", True), ("a@b.cd", True), ("nope", False), ("no@host", False), ("@nostart.com", False), ("a@b.", False), ] IC_TESTS = [ ("140812-14-3456", True), ("900101-08-1234", True), ("140812-14-345", False), ("140812143456", False), ("140812-14-3456X", False), ("991313-14-3456", False), # bad month ] PHONE_TESTS = [ ("012-3456789", True), ("012-345-6789", True), ("0123456789", True), ("+60123456789", True), ("+60 12-345 6789", True), ("123", False), ("012-345", False), ("abc", False), ] run_tests("Email", is_email, EMAIL_TESTS) run_tests("MyKad", is_mykad, IC_TESTS) run_tests("Phone", is_phone, PHONE_TESTS)
Putting It All Together · validators.py
8 minAssemble all four tasks into one file. Run the tests. Iterate until all three sections pass.
Show the complete file
# validators.py — three real validators with tests import re EMAIL_PAT = re.compile(r"^[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}$") IC_PAT = re.compile(r"^(\d{2})(\d{2})(\d{2})-(\d{2})-(\d{4})$") PHONE_PAT = re.compile(r"^(\+60|0)1\d{8,9}$") def is_email(text): return bool(EMAIL_PAT.search(text.strip())) def is_mykad(text): m = IC_PAT.search(text.strip()) if not m: return False yy, mm, dd, state, seq = m.groups() return 1 <= int(mm) <= 12 and 1 <= int(dd) <= 31 def is_phone(text): normalised = re.sub(r"[\s\-()]", "", text.strip()) return bool(PHONE_PAT.search(normalised)) # --- tests --- def run_tests(name, validator, tests): print(f"\n=== {name} ===") passing = 0 for text, expected in tests: actual = validator(text) ok = actual == expected print(f" {'OK' if ok else 'FAIL':<4} {text!r:<28} → {actual} expected {expected}") passing += ok print(f" {passing}/{len(tests)} passing") if __name__ == "__main__": EMAIL_TESTS = [ ("aisyah@example.com", True), ("a.b@example.co.uk", True), ("a@b.cd", True), ("nope", False), ("no@host", False), ] IC_TESTS = [ ("140812-14-3456", True), ("900101-08-1234", True), ("140812143456", False), ("991313-14-3456", False), ] PHONE_TESTS = [ ("012-3456789", True), ("0123456789", True), ("+60123456789", True), ("123", False), ] run_tests("Email", is_email, EMAIL_TESTS) run_tests("MyKad", is_mykad, IC_TESTS) run_tests("Phone", is_phone, PHONE_TESTS)
Non-negotiables: three pure validator functions, three test tables, a runner that prints OK/FAIL. The if __name__ == "__main__": guard from PY-L2-31 means importing validators elsewhere doesn't run the tests.
Recap
3 minValidators follow a recipe: anchor with ^/$, describe the format, layer a Python check on top for any semantic constraints (like valid month). Write the test table first. Iterate the pattern until all tests pass. re.sub is the third member of the regex family — use it to normalise messy input before validating. Three small validators in one tidy module = a reusable foundation for any form you'll ever build.
Tomorrow we leave regex behind for JSON — the universal data-exchange format. Then we'll combine JSON, regex and files into the capstone.
Homework
4 minAdd a 4th validator to your file:
is_postcode(text)— Malaysian postcode is exactly 5 digits, no other characters. Example:50480.- Add at least 5 test cases for it.
Stretch. Add is_strong_password(text) reusing the rules from PY-L2-42's homework — at least 8 chars, has digit, has upper, has lower, has symbol. Express the four character-class checks with regex; the length check with len().
Sample · added validators
POSTCODE_PAT = re.compile(r"^\d{5}$") def is_postcode(text): return bool(POSTCODE_PAT.search(text.strip())) def is_strong_password(text): if len(text) < 8: return False if not re.search(r"\d", text): return False if not re.search(r"[A-Z]", text): return False if not re.search(r"[a-z]", text): return False if not re.search(r"[!@#$%^&*]", text): return False return True POSTCODE_TESTS = [ ("50480", True), ("12345", True), ("1234", False), ("123456", False), ("5048A", False), ] run_tests("Postcode", is_postcode, POSTCODE_TESTS)
Non-negotiables: anchored 5-digit pattern, at least 5 tests covering correct and too-short/too-long/bad-char cases.