PY-L2-43 · Regex Patterns — Character Classes & Quantifiers

Learning Goals

3 min

By the end of this lesson you can:

Write custom character classes with [abc], [a-z], [^...].
Group sub-patterns with (...) and extract just the group's text.
Use alternation | to match either of several options.
Recognise greedy vs lazy quantifiers — + vs +?.

Warm-Up · The URL Pattern, Tightened

5 min

Yesterday this happened:

re.findall(r"https?://\S+", "Visit https://example.com!")
# → ['https://example.com!']     ← the exclamation got picked up

Today's upgrade: spell out what a URL can contain — letters, digits, dots, slashes, hyphens, query characters — using a custom character class.

re.findall(r"https?://[\w./\-?=&%]+", "Visit https://example.com!")
# → ['https://example.com']     ← clean

The bracket class lists exactly which characters are allowed. The trailing ! isn't in the class, so matching stops.
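To see the class stop at other punctuation too, here is a small sketch that reuses the warm-up pattern on a busier sentence. The sample text and the variable names are made up for illustration.

```python
import re

# Same custom class as the warm-up pattern.
URL = r"https?://[\w./\-?=&%]+"

# Hypothetical sample: one URL ends at ")", the other at "!".
text = "Docs (https://example.com/a-b?id=7&x=1), mirror: http://example.org/path!"

found = re.findall(URL, text)
print(found)
# → ['https://example.com/a-b?id=7&x=1', 'http://example.org/path']
```

Neither ")" nor "!" is listed in the class, so each match ends just before it.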

Today's big idea

Yesterday's patterns matched approximately. Today's match exactly. The difference is precision in the character class and the right quantifier.

New Concept · The Full Toolkit

14 min

Custom character classes

[abc]            a, b or c
[a-z]            any lower-case letter
[a-zA-Z]         any letter, either case
[0-9]            same as \d
[A-F0-9]         uppercase hex digit
[^aeiou]         NOT a vowel
[\w.-]           word char, dot, or hyphen

Inside brackets, special characters lose most of their power. . is just a dot. + is just a plus. Only ], \, ^ (if first), and - (between chars) need escaping.
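A quick check of those escaping rules, as a minimal sketch with invented sample strings:

```python
import re

# Inside [...], dot and plus are literal characters.
literal_specials = re.findall(r"[.+]", "1.5 + 2.5")
print(literal_specials)      # → ['.', '+', '.']

# A hyphen placed last in the class is literal, so it needs no escape.
hyphen_last = re.findall(r"[\w.-]+", "my-file.v2 (draft)")
print(hyphen_last)           # → ['my-file.v2', 'draft']

# A caret that is not first is literal too.
caret_mid = re.findall(r"[a^]", "a^b")
print(caret_mid)             # → ['a', '^']
```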

Groups · `(...)`

Round brackets create a capturing group. When a pattern contains groups, findall returns the groups' text instead of the whole match.

# Find all 4-digit years and extract just the year (not the trailing space)
re.findall(r"(\d{4})\s+", "Born 2014 today")
# → ['2014']

# Find all "name: value" pairs and pull out both name AND value
re.findall(r"(\w+):\s*(\w+)", "name: Aisyah age: 12 city: KL")
# → [('name', 'Aisyah'), ('age', '12'), ('city', 'KL')]

When you have several groups, findall returns a list of tuples — one tuple per match.
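The shape of findall's result depends only on how many groups the pattern has. A small side-by-side sketch on an invented sample string:

```python
import re

text = "2014-05 and 2026-11"

# No groups: a list of whole matches.
no_group = re.findall(r"\d{4}-\d{2}", text)
print(no_group)       # → ['2014-05', '2026-11']

# One group: a list of that group's text only.
one_group = re.findall(r"(\d{4})-\d{2}", text)
print(one_group)      # → ['2014', '2026']

# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\d{4})-(\d{2})", text)
print(two_groups)     # → [('2014', '05'), ('2026', '11')]
```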

Alternation · `|`

Match either pattern A or pattern B. Use round brackets to limit the alternation's scope.

re.findall(r"cat|dog", "the cat saw a dog")
# → ['cat', 'dog']

re.findall(r"(jpg|png|gif)", "logo.png photo.jpg banner.gif file.txt")
# → ['png', 'jpg', 'gif']
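Why scope matters: without round brackets, | splits the whole pattern in two. The sketch below uses re.finditer and m.group(0) so the full matched text is visible in both cases; the sample string is made up.

```python
import re

text = "gray grey gra ey"

# Unscoped: the pattern means "gra" OR "ey".
unscoped = [m.group(0) for m in re.finditer(r"gra|ey", text)]
print(unscoped)    # → ['gra', 'ey', 'gra', 'ey']

# Scoped: "gr", then "a" or "e", then "y".
scoped = [m.group(0) for m in re.finditer(r"gr(a|e)y", text)]
print(scoped)      # → ['gray', 'grey']
```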

Greedy vs lazy

Quantifiers like + are greedy — they grab as much as possible. Add ? after them to make them lazy.

# Greedy: matches as much as possible
re.findall(r"<.+>", "<b>bold</b> and <i>italic</i>")
# → ['<b>bold</b> and <i>italic</i>']    ← takes everything between first < and last >!

# Lazy: matches as little as possible
re.findall(r"<.+?>", "<b>bold</b> and <i>italic</i>")
# → ['<b>', '</b>', '<i>', '</i>']    ← exactly the four tags

For real HTML/XML you'd use a parser, not regex. But the lazy-quantifier idea applies everywhere a pattern could overrun.
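The same overrun shows up with quoted text. This sketch, on an invented sample string, compares the greedy and lazy forms and also shows a negated character class as an alternative that avoids the problem.

```python
import re

text = 'say "hi" then "bye"'

# Greedy: runs from the first quote to the LAST quote.
greedy = re.findall(r'"(.+)"', text)
print(greedy)      # → ['hi" then "bye']

# Lazy: stops at the nearest closing quote.
lazy = re.findall(r'"(.+?)"', text)
print(lazy)        # → ['hi', 'bye']

# Negated class: "anything except a quote" cannot overrun either.
negated = re.findall(r'"([^"]+)"', text)
print(negated)     # → ['hi', 'bye']
```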

Special character cheat-sheet

\d   digit                \D   non-digit
\w   letter/digit/_       \W   non-word
\s   whitespace            \S   non-whitespace
\b   word boundary         \B   non-boundary
.    any non-newline       \n   newline (r"\n" is two characters in Python,
                                  but the regex engine reads them as a newline)
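The raw-string note is easier to trust after seeing it run. A minimal sketch, with an invented two-line string:

```python
import re

# In a normal string "\n" is ONE character; in a raw string r"\n" is TWO.
lengths = (len("\n"), len(r"\n"))
print(lengths)                      # → (1, 2)

text = "line one\nline two"

# Both spellings find the newline: the regex engine decodes the raw \n itself.
raw_hits = re.findall(r"\n", text)
plain_hits = re.findall("\n", text)
print(raw_hits == plain_hits)       # → True

# The dot stops at the newline, so each line is matched separately.
lines = re.findall(r".+", text)
print(lines)                        # → ['line one', 'line two']
```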

Word boundary · the underused trick

# Find "is" only as a whole word, not inside "this", "island", etc.
re.findall(r"\bis\b", "this is an island")
# → ['is']

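\b matches the empty position between a word character and a non-word character, or at the start or end of the string next to a word character. It consumes no characters, which makes it ideal for whole-word search.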
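A sketch of how \b and its opposite \B change what gets matched. The sample sentence is invented for illustration.

```python
import re

text = "cat catnap bobcat concat cat."

# Boundary on both sides: whole words only.
whole = re.findall(r"\bcat\b", text)
print(whole)       # → ['cat', 'cat']

# Boundary on the left only: words that START with "cat".
prefix = re.findall(r"\bcat\w*", text)
print(prefix)      # → ['cat', 'catnap', 'cat']

# \B on the left, \b on the right: "cat" at the END of a longer word.
suffix = re.findall(r"\Bcat\b", text)
print(suffix)      # → ['cat', 'cat']   (from "bobcat" and "concat")
```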

Worked Example · The Receipt Parser

12 min

Given a wall of receipt text, extract structured info. Save as receipt_parse.py:

# receipt_parse.py — extract item, qty, price from messy receipt text

import re

receipt = """
PAK CIK RAZIF'S WARUNG
Date: 2026-05-27

Nasi lemak    x1    RM 8.00
Teh tarik     x2    RM 7.00
Roti planta   x1    RM 4.50
Cendol        x3    RM 15.00

Subtotal:           RM 34.50
SST 6%:             RM 2.07
TOTAL:              RM 36.57
"""

# Each row: word(s), x qty, RM price
# Named groups would be cleaner, but numbered groups work
PAT = re.compile(r"^([A-Za-z][\w ]+?)\s+x(\d+)\s+RM\s+(\d+\.\d{2})", re.MULTILINE)

items = []
for name, qty, price in PAT.findall(receipt):
    items.append({"name": name.strip(), "qty": int(qty), "price": float(price)})

for it in items:
    print(it)

# Subtotal / SST / TOTAL — single line each
totals = re.findall(r"(Subtotal|SST 6%|TOTAL):\s+RM\s+(\d+\.\d{2})", receipt)
for label, amount in totals:
    print(f"{label:<10} → {amount}")

# Date: anchor at line start (^ with re.MULTILINE) to avoid spurious matches
date_match = re.search(r"^Date:\s+(\d{4}-\d{2}-\d{2})", receipt, re.MULTILINE)
if date_match:
    print(f"\nReceipt date: {date_match.group(1)}")

Output

{'name': 'Nasi lemak', 'qty': 1, 'price': 8.0}
{'name': 'Teh tarik', 'qty': 2, 'price': 7.0}
{'name': 'Roti planta', 'qty': 1, 'price': 4.5}
{'name': 'Cendol', 'qty': 3, 'price': 15.0}
Subtotal   → 34.50
SST 6%     → 2.07
TOTAL      → 36.57

Receipt date: 2026-05-27

Read the diff

Three patterns, three different shapes. (1) The item rows use a compiled pattern with re.compile (same regex, faster on repeated use) and the re.MULTILINE flag so ^ means "start of any line". Three groups capture name, qty and price separately. (2) The totals row uses alternation (Subtotal|SST 6%|TOTAL) to match any of three labels. (3) re.search + .group(1) extracts just the date.

Pre-compile for repeat use

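re.compile(pattern) turns a pattern string into a compiled Pattern object. If you'll use the same pattern many times (in a loop, across a big file), compile it once and reuse the object. Python's re module already caches recently used patterns, so the speed gain is usually small; the bigger win is one named, reusable pattern defined in a single place.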
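The comment in receipt_parse.py notes that named groups would be cleaner. One possible sketch of that variant follows, using the (?P<name>...) syntax covered in the mini-challenge. The two-row sample and the name ITEM are assumptions for illustration, not part of the original script.

```python
import re

# Hypothetical two-row sample in the same layout as the receipt.
sample = "Nasi lemak    x1    RM 8.00\nTeh tarik     x2    RM 7.00\n"

# Compiled once at module level, then reused for every row.
ITEM = re.compile(
    r"^(?P<name>[A-Za-z][\w ]+?)\s+x(?P<qty>\d+)\s+RM\s+(?P<price>\d+\.\d{2})",
    re.MULTILINE,
)

items = [
    {
        "name": m.group("name"),
        "qty": int(m.group("qty")),
        "price": float(m.group("price")),
    }
    for m in ITEM.finditer(sample)
]

for it in items:
    print(it)
# → {'name': 'Nasi lemak', 'qty': 1, 'price': 8.0}
# → {'name': 'Teh tarik', 'qty': 2, 'price': 7.0}
```

Reading m.group("qty") says what the value is; m.group(2) only says where it sits.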

Try It Yourself

13 min

01 🟢 The whole-word search

Find every occurrence of "cat" as a whole word in the text below. The substrings inside "catnap" and "decathlon" should NOT match.

text = "The cat saw a catnap during the decathlon. cat? Cat!"

Hint

import re
matches = re.findall(r"\b[Cc]at\b", text)
print(matches)        # → ['cat', 'cat', 'Cat']

Word boundaries \b prevent the substring match. The character class [Cc] matches either case.

02 🟡 File extension extractor

Given the filename list below, extract each file's extension into a list.

files = "report.docx, photo.jpg, .bashrc, holiday-snap.PNG, no-ext"

Hint

import re
exts = re.findall(r"\.([A-Za-z]+)", files)
print(exts)            # → ['docx', 'jpg', 'bashrc', 'PNG']

The escaped dot matches the literal. The group ([A-Za-z]+) is what findall hands back. Note .bashrc matches — its name starts with a dot. no-ext doesn't.

03 🔴 Find the times (stretch)

Find every HH:MM time in the text. Hours must be 0-23, minutes 0-59.

text = "Bus at 7:30, train at 13:45, meeting at 25:00, lunch at 12:30"

Hint

import re

# Hours: ([01]?\d|2[0-3])   (0-19 or 20-23)
# Minutes: [0-5]\d          (00-59)
matches = re.findall(r"\b([01]?\d|2[0-3]):[0-5]\d\b", text)
print(matches)         # → ['7', '13', '12']  -- findall returns only the captured hour group

# Better: capture the WHOLE time (no inner group)
matches = re.findall(r"\b(?:[01]?\d|2[0-3]):[0-5]\d\b", text)
print(matches)         # → ['7:30', '13:45', '12:30']

The (?:...) is a non-capturing group: it scopes the alternation but adds nothing to what findall returns. Useful when you want the structure but not the data.
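The difference is also visible on the compiled pattern itself, whose groups attribute counts capturing groups. A minimal sketch with invented filenames:

```python
import re

text = "photo.jpg logo.png notes.txt"

capturing = re.compile(r"\w+\.(jpg|png)")
non_capturing = re.compile(r"\w+\.(?:jpg|png)")

# Number of capturing groups in each pattern.
group_counts = (capturing.groups, non_capturing.groups)
print(group_counts)                      # → (1, 0)

# One capturing group: findall returns just the extension.
ext_only = capturing.findall(text)
print(ext_only)                          # → ['jpg', 'png']

# No capturing groups: findall returns the whole match.
full_names = non_capturing.findall(text)
print(full_names)                        # → ['photo.jpg', 'logo.png']
```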

Mini-Challenge · The Log Tail Parser

8 min

Save the text below as app.log. Then build parse_log.py that uses a compiled regex with named groups to extract every log line into a dict.

2026-05-27T08:01:02 INFO  user=aisyah action=login
2026-05-27T08:02:11 WARN  user=wei_jie action=invalid_login attempts=3
2026-05-27T08:05:45 ERROR user=priya  action=db_query duration=4250ms
2026-05-27T08:11:33 INFO  user=iman   action=logout

Extract: timestamp, level, user, action — and any extra key=value fields.

Use named groups: (?P<name>...). They make the result much more readable.

Show one possible solution

# parse_log.py — log tail to list of dicts

import re

LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>INFO|WARN|ERROR)\s+"
    r"user=(?P<user>\w+)\s+"
    r"action=(?P<action>\w+)"
    r"(?:\s+(?P<extra>.*))?"
)

rows = []
with open("app.log", encoding="utf-8") as f:
    for line in f:
        m = LINE.search(line)
        if m:
            row = m.groupdict()
            # Parse any extra "k=v k=v" pairs
            extras = {}
            if row["extra"]:
                for k, v in re.findall(r"(\w+)=(\S+)", row["extra"]):
                    extras[k] = v
            row["extras"] = extras
            del row["extra"]
            rows.append(row)

for r in rows:
    print(r)

Non-negotiables: named groups, a compiled pattern, and a second pass over the "extras" chunk for additional k=v pairs. m.groupdict() returns a dict mapping group name to matched text — much cleaner than numeric groups.
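A short sketch of the ways to read a named group, plus what groupdict() reports for an optional group that did not take part in the match. The patterns below are cut-down illustrations, not the full LINE pattern.

```python
import re

LEVEL_USER = re.compile(r"(?P<level>INFO|WARN|ERROR)\s+user=(?P<user>\w+)")
m = LEVEL_USER.search("2026-05-27T08:02:11 WARN  user=wei_jie action=invalid_login")

# Named groups still have numbers, so both styles work.
by_number = (m.group(1), m.group(2))
by_name = (m.group("level"), m.group("user"))
print(by_number == by_name)     # → True

as_dict = m.groupdict()
print(as_dict)                  # → {'level': 'WARN', 'user': 'wei_jie'}

# An optional group that matched nothing shows up as None.
OPTIONAL = re.compile(r"user=(?P<user>\w+)(?: attempts=(?P<attempts>\d+))?")
missing = OPTIONAL.search("user=iman").groupdict()
print(missing)                  # → {'user': 'iman', 'attempts': None}
```

Both None and the empty string are falsy, which is why a plain truthiness check is enough before parsing an optional field.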

Recap

3 min

Character classes [...] let you spell out exactly which characters to accept. [^...] inverts. Groups (...) capture sub-matches; findall returns just the groups, or tuples of groups if many. Alternation | matches either side. Quantifiers are greedy by default — add ? for lazy. Named groups (?P<name>...) + groupdict() turn matches into self-documenting dicts. Pre-compile patterns you'll use repeatedly.

Vocabulary Card

character class: Square brackets list the allowed characters at one position. Most regex specials are inert inside.
group: Round brackets capture a sub-pattern. (?:...) groups without capturing.
named group: (?P<name>...). Access via m.group("name") or m.groupdict().
alternation: A|B — match either A or B. Combine with groups: (jpg|png|gif).
greedy vs lazy: + grabs as much as possible. +? grabs as little as possible.
re.compile: Pre-compile a pattern for repeated use.

Homework

4 min

Build tweet_extractor.py with three reusable functions. Each takes text and returns a list:

hashtags(text) — every #word, lower-cased and unique (use a set internally).
mentions(text) — every @username.
links(text) — every URL. Tighten yesterday's pattern with a custom character class.

Run on a few sample tweets and print the results.

Sample · tweet_extractor.py

# tweet_extractor.py — tighter regex patterns

import re

HASH_PAT  = re.compile(r"#(\w+)")
MENT_PAT  = re.compile(r"@(\w+)")
LINK_PAT  = re.compile(r"https?://[\w./\-?=&%]+")

def hashtags(text):
    return sorted({h.lower() for h in HASH_PAT.findall(text)})

def mentions(text):
    return MENT_PAT.findall(text)

def links(text):
    return LINK_PAT.findall(text)

if __name__ == "__main__":
    tweet = "Lovely #FOOD @aisyah #penang at https://maps.example.com! #food"
    print("Hashtags:", hashtags(tweet))    # → ['food', 'penang']
    print("Mentions:", mentions(tweet))    # → ['aisyah']
    print("Links   :", links(tweet))       # → ['https://maps.example.com']

Non-negotiables: three compiled patterns at module level, three pure functions, and de-duplication of hashtags via a set comprehension. The link pattern uses a character class — no more trailing exclamation marks.
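One caveat worth testing: the link class itself contains . and ?, so a URL that ends a sentence keeps that punctuation. The sketch below shows the effect and one possible workaround. The helper name links_trimmed and the sample text are hypothetical, and trimming with rstrip is an assumption about what you want, not part of the homework spec.

```python
import re

LINK_PAT = re.compile(r"https?://[\w./\-?=&%]+")

def links_trimmed(text):
    """Hypothetical variant of links(): drop trailing '.' or '?' from each URL."""
    return [url.rstrip(".?") for url in LINK_PAT.findall(text)]

text = "See https://example.com/docs. Or https://example.org?"

raw = LINK_PAT.findall(text)
print(raw)        # → ['https://example.com/docs.', 'https://example.org?']

trimmed = links_trimmed(text)
print(trimmed)    # → ['https://example.com/docs', 'https://example.org']
```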
