Learning Goals
3 min
By the end of this lesson you can:
- Write custom character classes with [abc], [a-z], [^...].
- Group sub-patterns with (...) and extract just the group's text.
- Use alternation | to match either of several options.
- Recognise greedy vs lazy quantifiers: + vs +?.
Warm-Up · The URL Pattern, Tightened
5 min
Yesterday this happened:
re.findall(r"https?://\S+", "Visit https://example.com!") # → ['https://example.com!'] ← the exclamation got picked up
Today's upgrade: spell out what a URL can contain — letters, digits, dots, slashes, hyphens, query characters — using a custom character class.
re.findall(r"https?://[\w./\-?=&%]+", "Visit https://example.com!") # → ['https://example.com'] ← clean
The bracket class lists exactly which characters are allowed. The trailing ! isn't in the class, so matching stops.
Yesterday's patterns matched approximately. Today's match exactly. The difference is precision in the character class and the right quantifier.
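One caveat is easier to see in code: the class includes . and ?, so sentence punctuation that directly follows a URL is still swallowed. The sketch below assumes it is acceptable to trim trailing dots and question marks after matching; the helper name clean_links is made up for illustration and is not part of the lesson's code.

```python
import re

URL_PAT = re.compile(r"https?://[\w./\-?=&%]+")

def clean_links(text):
    """Find URLs, then trim punctuation that belongs to the sentence (hypothetical helper)."""
    return [url.rstrip(".?") for url in URL_PAT.findall(text)]

sample = "Docs live at https://example.com/docs. Questions?"
raw = URL_PAT.findall(sample)
trimmed = clean_links(sample)
print(raw)      # → ['https://example.com/docs.']
print(trimmed)  # → ['https://example.com/docs']
```

Trimming after the match keeps the pattern itself simple. The trade-off is that a URL that genuinely ends in a dot or question mark would lose it.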
New Concept · The Full Toolkit
14 min
Custom character classes
[abc]      a, b or c
[a-z]      any lower-case letter
[a-zA-Z]   any letter, either case
[0-9]      any ASCII digit (\d also accepts other Unicode digits)
[A-F0-9]   uppercase hex digit
[^aeiou]   any character that is NOT a lower-case vowel
[\w.-]     word char, dot, or hyphen
Inside brackets, special characters lose most of their power. . is just a dot. + is just a plus. Only ], \, ^ (if first), and - (between chars) need escaping.
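A few quick checks make those rules concrete. This is a small sketch with made-up sample strings, not an exhaustive list of the escaping rules.

```python
import re

# Inside a class, "." and "+" are literal characters.
literal_specials = re.findall(r"[.+]", "1+1=2.0")
print(literal_specials)  # → ['+', '.']

# A backslash-escaped hyphen is a literal "-", not a range.
hyphen_literal = re.findall(r"[a\-z]", "a-z b")
print(hyphen_literal)    # → ['a', '-', 'z']

# "^" negates only when it comes first; elsewhere it is a literal caret.
caret_first = re.findall(r"[^ab]", "ab^c")
caret_later = re.findall(r"[ab^]", "ab^c")
print(caret_first)       # → ['^', 'c']
print(caret_later)       # → ['a', 'b', '^']
```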
Groups · (...)
Round brackets create a group. When a pattern contains groups, findall returns only the groups' text, not the whole match.
# Find all 4-digit years and extract just the year (not the trailing space)
re.findall(r"(\d{4})\s+", "Born 2014 today")
# → ['2014']

# Find all "name: value" pairs and pull out both name AND value
re.findall(r"(\w+):\s*(\w+)", "name: Aisyah age: 12 city: KL")
# → [('name', 'Aisyah'), ('age', '12'), ('city', 'KL')]
When you have several groups, findall returns a list of tuples — one tuple per match.
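findall is not the only way to reach a group. As a sketch using the standard match-object methods, re.search returns a match whose .group(n) gives each group by number (0 is the whole match), and re.finditer yields one match object per hit. The sample text and variable names are illustrative.

```python
import re

PAIR = r"(\w+):\s*(\w+)"
text = "name: Aisyah age: 12"

# search returns the first match (or None when nothing matches)
first = re.search(PAIR, text)
whole = first.group(0)                       # the entire match
key, value = first.group(1), first.group(2)  # the two groups
print(whole)       # → name: Aisyah
print(key, value)  # → name Aisyah

# finditer gives one match object per hit, handy for building a dict
pairs = {m.group(1): m.group(2) for m in re.finditer(PAIR, text)}
print(pairs)       # → {'name': 'Aisyah', 'age': '12'}
```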
Alternation · |
Match either pattern A or pattern B. Use round brackets to limit the alternation's scope.
re.findall(r"cat|dog", "the cat saw a dog")
# → ['cat', 'dog']

re.findall(r"(jpg|png|gif)", "logo.png photo.jpg banner.gif file.txt")
# → ['png', 'jpg', 'gif']
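To see why the scope matters, compare a pattern with and without the round brackets. This is a small sketch with made-up filenames: without a group, | splits the entire pattern in two.

```python
import re

unscoped = re.compile(r"photo\.jpg|png")  # means: "photo.jpg" OR "png"
scoped = re.compile(r"photo\.(jpg|png)")  # means: "photo." then "jpg" or "png"

unscoped_png = bool(unscoped.fullmatch("png"))
scoped_png = bool(scoped.fullmatch("png"))
scoped_photo = bool(scoped.fullmatch("photo.png"))
unscoped_photo = bool(unscoped.fullmatch("photo.png"))

print(unscoped_png)    # → True   (the bare word "png" is accepted)
print(scoped_png)      # → False
print(scoped_photo)    # → True
print(unscoped_photo)  # → False  ("photo.png" is neither alternative)
```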
Greedy vs lazy
Quantifiers like + are greedy — they grab as much as possible. Add ? after them to make them lazy.
# Greedy: matches as much as possible
re.findall(r"<.+>", "<b>bold</b> and <i>italic</i>")
# → ['<b>bold</b> and <i>italic</i>'] ← takes everything between first < and last >!

# Lazy: matches as little as possible
re.findall(r"<.+?>", "<b>bold</b> and <i>italic</i>")
# → ['<b>', '</b>', '<i>', '</i>'] ← exactly the four tags
For real HTML/XML you'd use a parser, not regex. But the lazy-quantifier idea applies everywhere a pattern could overrun.
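A negated character class is a common alternative to a lazy quantifier: instead of "as little as possible", it says "anything except the closing character". The sketch below checks that both approaches agree on the sample string used above.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

lazy_tags = re.findall(r"<.+?>", html)     # lazy quantifier
class_tags = re.findall(r"<[^>]+>", html)  # greedy, but cannot cross a ">"

print(class_tags)               # → ['<b>', '</b>', '<i>', '</i>']
print(lazy_tags == class_tags)  # → True
```

Both give the same result here. The negated class states the stopping rule in the pattern itself, which many people find easier to read.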
Special character cheat-sheet
\d digit \D non-digit
\w letter/digit/_ \W non-word
\s whitespace \S non-whitespace
\b word boundary \B non-boundary
. any non-newline \n newline
(In a raw string, \n is two characters, a backslash and an n; the regex engine still reads the pair as a real newline.)
Word boundary · the underused trick
# Find "is" only as a whole word, not inside "this", "island", etc.
re.findall(r"\bis\b", "this is an island")
# → ['is']
\b matches the empty position between a word character and a non-word character, or between a word character and the start or end of the string. It is very useful for whole-word searches.
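Word boundaries are also where the r prefix matters most: in an ordinary Python string, "\b" is the backspace character, so the pattern silently stops meaning "word boundary". A short sketch with illustrative sample strings, including a whole-word replacement with re.sub:

```python
import re

text = "this is an island"

raw_hits = re.findall(r"\bis\b", text)  # raw string: \b reaches the regex engine
plain_hits = re.findall("\bis\b", text) # plain string: "\b" is a backspace character
print(raw_hits)    # → ['is']
print(plain_hits)  # → []

# Whole-word replacement: only the standalone "cat" changes
replaced = re.sub(r"\bcat\b", "dog", "cat catalog concat cat.")
print(replaced)    # → dog catalog concat dog.
```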
Worked Example · The Receipt Parser
12 min
Given a wall of receipt text, extract structured info. Save as receipt_parse.py:
# receipt_parse.py: extract item, qty, price from messy receipt text
import re

receipt = """
PAK CIK RAZIF'S WARUNG
Date: 2026-05-27
Nasi lemak x1 RM 8.00
Teh tarik x2 RM 7.00
Roti planta x1 RM 4.50
Cendol x3 RM 15.00
Subtotal: RM 34.50
SST 6%: RM 2.07
TOTAL: RM 36.57
"""

# Each row: word(s), x qty, RM price
# Named groups would be cleaner, but numbered groups work
PAT = re.compile(r"^([A-Za-z][\w ]+?)\s+x(\d+)\s+RM\s+(\d+\.\d{2})", re.MULTILINE)

items = []
for name, qty, price in PAT.findall(receipt):
    items.append({"name": name.strip(), "qty": int(qty), "price": float(price)})

for it in items:
    print(it)

# Subtotal / SST / TOTAL: single line each
totals = re.findall(r"(Subtotal|SST 6%|TOTAL):\s+RM\s+(\d+\.\d{2})", receipt)
for label, amount in totals:
    print(f"{label:<10} → {amount}")

# Date: anchor at line start (^ with re.MULTILINE) to avoid spurious matches
date_match = re.search(r"^Date:\s+(\d{4}-\d{2}-\d{2})", receipt, re.MULTILINE)
if date_match:
    print(f"\nReceipt date: {date_match.group(1)}")
Output
{'name': 'Nasi lemak', 'qty': 1, 'price': 8.0}
{'name': 'Teh tarik', 'qty': 2, 'price': 7.0}
{'name': 'Roti planta', 'qty': 1, 'price': 4.5}
{'name': 'Cendol', 'qty': 3, 'price': 15.0}
Subtotal   → 34.50
SST 6%     → 2.07
TOTAL      → 36.57

Receipt date: 2026-05-27
Read the diff
Three patterns, three different shapes. (1) The item rows use a compiled pattern with re.compile (same regex, faster on repeated use) and the re.MULTILINE flag so ^ means "start of any line". Three groups capture name, qty and price separately. (2) The totals row uses alternation (Subtotal|SST 6%|TOTAL) to match any of three labels. (3) re.search + .group(1) extracts just the date.
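re.compile(pattern) turns a pattern string into a compiled Pattern object. If you'll use the same pattern many times, in a loop or across a big file, compile it once and reuse it. The re module also caches recently used patterns internally, so the speedup is often modest; the other benefit is a named, reusable object with its own findall and search methods.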
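As a minimal sketch of the compile-once habit, the pattern below is built a single time at the top and reused on every line of a loop. The name PRICE and the sample lines are illustrative.

```python
import re

# Compiled once, reused on every iteration
PRICE = re.compile(r"RM\s+(\d+\.\d{2})")

lines = ["Nasi lemak x1 RM 8.00", "Teh tarik x2 RM 7.00", "Thank you!"]

total = 0.0
for line in lines:
    m = PRICE.search(line)  # a compiled pattern has its own search/findall methods
    if m:
        total += float(m.group(1))

print(total)  # → 15.0
```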
Try It Yourself
13 min
Find every occurrence of "cat" as a whole word in the text below. The substring inside "catnap" and "decathlon" should NOT match.
text = "The cat saw a catnap during the decathlon. cat? Cat!"
Hint
import re

matches = re.findall(r"\b[Cc]at\b", text)
print(matches)
# → ['cat', 'cat', 'Cat']
Word boundaries \b prevent the substring match. The character class [Cc] matches either case.
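If every capitalisation should count, the re.IGNORECASE flag is an alternative to listing cases in a class. The sketch below extends the exercise text with one extra "CAT." (an addition for illustration) to show the difference.

```python
import re

# Exercise text plus one all-caps "CAT." added for this comparison
text = "The cat saw a catnap during the decathlon. cat? Cat! CAT."

with_class = re.findall(r"\b[Cc]at\b", text)
with_flag = re.findall(r"\bcat\b", text, re.IGNORECASE)

print(with_class)  # → ['cat', 'cat', 'Cat']
print(with_flag)   # → ['cat', 'cat', 'Cat', 'CAT']
```

[Cc] allows only the first letter to vary, while the flag ignores case for the whole pattern.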
Given the filename list below, extract each file's extension into a list.
files = "report.docx, photo.jpg, .bashrc, holiday-snap.PNG, no-ext"
Hint
import re

exts = re.findall(r"\.([A-Za-z]+)", files)
print(exts)
# → ['docx', 'jpg', 'bashrc', 'PNG']
The escaped dot \. matches a literal dot. The group ([A-Za-z]+) is what findall hands back. Note that .bashrc matches because its name starts with a dot, while no-ext doesn't match at all.
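The hint's pattern returns every dotted part, so a name like archive.tar.gz would yield both tar and gz. If only the final extension is wanted, one possible variation (a sketch, not the exercise's required answer) is to check each filename separately and anchor the pattern with $, which matches at the end of the string. The helper last_extension and the sample names are made up for illustration.

```python
import re

# "$" anchors the match to the end of the filename
EXT_PAT = re.compile(r"\.([A-Za-z0-9]+)$")

def last_extension(filename):
    """Return the final extension in lower case, or None if there is none (hypothetical helper)."""
    m = EXT_PAT.search(filename)
    return m.group(1).lower() if m else None

names = ["report.docx", "archive.tar.gz", "holiday-snap.PNG", "no-ext"]
exts = [last_extension(n) for n in names]
print(exts)  # → ['docx', 'gz', 'png', None]
```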
Find every HH:MM time in the text. Hours must be 0-23, minutes 0-59.
text = "Bus at 7:30, train at 13:45, meeting at 25:00, lunch at 12:30"
Hint
import re

# Hours: ([01]?\d|2[0-3]) covers 0-19 (with or without a leading zero) or 20-23
# Minutes: [0-5]\d covers 00-59
matches = re.findall(r"\b([01]?\d|2[0-3]):[0-5]\d\b", text)
print(matches)  # → ['7', '13', '12']  (findall returns only the captured hour group)

# Better: capture the WHOLE time (no capturing group)
matches = re.findall(r"\b(?:[01]?\d|2[0-3]):[0-5]\d\b", text)
print(matches)  # → ['7:30', '13:45', '12:30']
The (?:...) is a non-capturing group: it groups for alternation but doesn't add a captured value to the result. Useful when you want the structure but not the data.
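When the hours and minutes are both needed as data, the opposite choice works too: keep two capturing groups and let findall return tuples. A sketch on the same text, converting each part to an integer (the name TIME_PAT is illustrative):

```python
import re

text = "Bus at 7:30, train at 13:45, meeting at 25:00, lunch at 12:30"

# Two capturing groups: one for hours, one for minutes
TIME_PAT = re.compile(r"\b([01]?\d|2[0-3]):([0-5]\d)\b")

times = [(int(h), int(m)) for h, m in TIME_PAT.findall(text)]
print(times)  # → [(7, 30), (13, 45), (12, 30)]
```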
Mini-Challenge · The Log Tail Parser
8 min
Save the text below as app.log. Then build parse_log.py that uses a compiled regex with named groups to extract every log line into a dict.
2026-05-27T08:01:02 INFO user=aisyah action=login
2026-05-27T08:02:11 WARN user=wei_jie action=invalid_login attempts=3
2026-05-27T08:05:45 ERROR user=priya action=db_query duration=4250ms
2026-05-27T08:11:33 INFO user=iman action=logout
Extract: timestamp, level, user, action — and any extra key=value fields.
Use named groups: (?P<name>...). They make the result much more readable.
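Before tackling the full log line, it may help to try named groups on a tiny input. A minimal sketch (the group names key and value and the sample string are chosen for illustration):

```python
import re

PAIR = re.compile(r"(?P<key>\w+)=(?P<value>\w+)")

m = PAIR.search("level=WARN user=wei_jie")  # search stops at the first match
by_name = m.group("key")   # look up one group by its name
as_dict = m.groupdict()    # all named groups at once

print(by_name)  # → level
print(as_dict)  # → {'key': 'level', 'value': 'WARN'}
```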
Show one possible solution
# parse_log.py: log tail to list of dicts
import re

LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>INFO|WARN|ERROR)\s+"
    r"user=(?P<user>\w+)\s+"
    r"action=(?P<action>\w+)"
    r"(?:\s+(?P<extra>.*))?"
)

rows = []
with open("app.log", encoding="utf-8") as f:
    for line in f:
        m = LINE.search(line)
        if m:
            row = m.groupdict()
            # Parse any extra "k=v k=v" pairs
            extras = {}
            if row["extra"]:
                for k, v in re.findall(r"(\w+)=(\S+)", row["extra"]):
                    extras[k] = v
            row["extras"] = extras
            del row["extra"]
            rows.append(row)

for r in rows:
    print(r)
Non-negotiables: named groups, a compiled pattern, and a second pass over the "extras" chunk for additional k=v pairs. m.groupdict() returns a dict mapping group name to matched text — much cleaner than numeric groups.
Recap
3 min
Character classes [...] let you spell out exactly which characters to accept. [^...] inverts. Groups (...) capture sub-matches; findall returns just the groups, or tuples of groups if there are several. Alternation | matches either side. Quantifiers are greedy by default; add ? for lazy. Named groups (?P<name>...) + groupdict() turn matches into self-documenting dicts. Pre-compile patterns you'll use repeatedly.
Vocabulary Card
- character class
  Square brackets list the allowed characters at one position. Most regex specials are inert inside.
- group
  Round brackets capture a sub-pattern. (?:...) groups without capturing.
- named group
  (?P<name>...). Access via m.group("name") or m.groupdict().
- alternation
  A|B: match either A or B. Combine with groups: (jpg|png|gif).
- greedy vs lazy
  + grabs as much as possible. +? grabs as little as possible.
- re.compile
  Pre-compile a pattern for repeated use.
Homework
4 min
Build tweet_extractor.py with three reusable functions. Each takes text and returns a list:
- hashtags(text): every #word, lower-cased and unique (use set internally).
- mentions(text): every @username.
- links(text): every URL. Tighten yesterday's pattern with a custom character class.
Run on a few sample tweets and print the results.
Sample · tweet_extractor.py
# tweet_extractor.py: tighter regex patterns
import re

HASH_PAT = re.compile(r"#(\w+)")
MENT_PAT = re.compile(r"@(\w+)")
LINK_PAT = re.compile(r"https?://[\w./\-?=&%]+")

def hashtags(text):
    return sorted({h.lower() for h in HASH_PAT.findall(text)})

def mentions(text):
    return MENT_PAT.findall(text)

def links(text):
    return LINK_PAT.findall(text)

if __name__ == "__main__":
    tweet = "Lovely #FOOD @aisyah #penang at https://maps.example.com! #food"
    print("Hashtags:", hashtags(tweet))  # → ['food', 'penang']
    print("Mentions:", mentions(tweet))  # → ['aisyah']
    print("Links   :", links(tweet))     # → ['https://maps.example.com']
Non-negotiables: three compiled patterns at module level, three pure functions, and de-duplication of hashtags via a set comprehension. The link pattern uses a character class — no more trailing exclamation marks.