Learning Goals
3 min
By the end of this lesson you can:
- Import re and use re.findall(pattern, text) to pull every match into a list.
- Use re.search(pattern, text) to test whether any match exists.
- Write simple literal patterns and recognise the regex special characters that need escaping.
- Use raw strings (r"...") for regex patterns and explain why.
Warm-Up · Find Every Phone Number
5 min
Without regex, finding a phone number in a wall of text means scanning character by character: checking for digits, dashes, the right shape. Tedious. Watch:

    import re

    text = """
    Aisyah: 012-3456789
    Wei Jie: 011-2345 678
    Priya: 017-9921122
    Find me at 014-1112233 or 03-12345678 ext 99.
    """

    phones = re.findall(r"\d{2,3}-\d{7,8}", text)
    print(phones)
    # → ['012-3456789', '017-9921122', '014-1112233', '03-12345678']
One line. \d means "a digit". {2,3} means "2 or 3 of them". - means a literal dash. \d{7,8} means "7 or 8 digits". Together: a phone-shaped pattern.
A regex is a tiny language for describing string shapes. Once you can describe the shape, Python finds every instance.
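To see why Wei Jie's number is missing from the result, it may help to run the same pattern on that one line. This is a small sketch; the variable names and the cleaned-up second string are illustrative, not part of the lesson's data.

```python
import re

pattern = r"\d{2,3}-\d{7,8}"

# The space splits the digits after the dash into runs of 4 and 3,
# so \d{7,8} never finds 7 digits in a row.
with_space = re.findall(pattern, "Wei Jie: 011-2345 678")
print(with_space)     # → []

# Hypothetical cleaned-up version of the same number.
without_space = re.findall(pattern, "Wei Jie: 011-2345678")
print(without_space)  # → ['011-2345678']
```

A regex describes one exact shape: anything that deviates from it, even by a single space, is skipped rather than reported as an error.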
New Concept · Two Functions, Five Characters
14 min
The two everyday functions

    import re

    # 1. findall: pulls every match into a list
    re.findall(r"cat", "the cat saw a cat in the catacomb")
    # → ['cat', 'cat', 'cat']

    # 2. search: returns a Match object, or None
    m = re.search(r"cat", "where is the cat?")
    if m:
        print("Found at index", m.start())
    else:
        print("No match.")
Pick by what you want:
- Every match as a list → findall.
- Just yes/no → search with if m:.
- First match + position → search + m.start(), m.group().
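As a rough illustration of the third option, the Match object returned by re.search can report both the matched text and where it sits. The names found and missing are just placeholders for this sketch.

```python
import re

m = re.search(r"cat", "where is the cat?")
if m:
    # group() is the matched text; start() and end() are its slice indices.
    found = (m.group(), m.start(), m.end())
    print(found)  # → ('cat', 13, 16)

# When nothing matches, search returns None instead of a Match object.
missing = re.search(r"dog", "where is the cat?")
print(missing)  # → None
```

Because a failed search gives None, calling m.start() without the if m: check would raise an AttributeError.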
Literal patterns
Most characters in a regex mean themselves. r"cat" matches the three letters c-a-t in order.
The first five special characters
Char  Meaning               Example
\d    any digit             "\d\d" matches "42"
\w    letter, digit, _      "\w+" matches "hello42"
.     any char except \n    "h.t" matches "hat", "h2t", "h@t"
*     zero or more of prev  "ab*" matches "a", "ab", "abbb"
+     one or more of prev   "ab+" matches "ab", "abbb" (not "a")
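The five characters can be tried out directly with re.findall. The test strings below are made up for this sketch; only the patterns come from the table.

```python
import re

two_digits = re.findall(r"\d\d", "42 and 7")       # the lone 7 is too short
words = re.findall(r"\w+", "hello42 world_1!")     # ! is not a word character
dot = re.findall(r"h.t", "hat h2t h@t ht")         # "ht" has nothing between h and t
star = re.findall(r"ab*", "a ab abbb")             # b may appear zero times
plus = re.findall(r"ab+", "a ab abbb")             # b must appear at least once

print(two_digits)  # → ['42']
print(words)       # → ['hello42', 'world_1']
print(dot)         # → ['hat', 'h2t', 'h@t']
print(star)        # → ['a', 'ab', 'abbb']
print(plus)        # → ['ab', 'abbb']
```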
Why raw strings?
Always write your regex as a raw string — r"...". Python normally treats \n as a newline; with the r prefix, it's two characters — backslash and n.
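    re.findall("\d+", text)   # works, but Python warns about the invalid escape \d
    re.findall(r"\d+", text)  # ✅ the safe way

For regex escapes like \d, \w and \b, raw strings are essential. In a normal string, "\b" is a backspace character, so a word-boundary pattern written without the r prefix silently matches the wrong thing.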
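A short experiment makes the difference visible. It assumes only the standard behaviour of Python string literals: "\n" and "\b" are processed as escapes in a normal string, while the r prefix leaves the backslash alone.

```python
import re

# A normal string turns \n into one newline character; a raw string keeps both characters.
print(len("\n"), len(r"\n"))  # → 1 2

# "\b" in a normal string is a backspace, so this pattern looks for
# backspace + "cat" + backspace and finds nothing.
plain = re.findall("\bcat\b", "a cat sat")

# In a raw string, \b reaches the regex engine intact and means "word boundary".
raw = re.findall(r"\bcat\b", "a cat sat")

print(plain)  # → []
print(raw)    # → ['cat']
```

The first findall raises no warning at all, which is what makes this mistake easy to miss.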
Counts · the curly braces
{n} exactly n
{n,} at least n
{n,m} between n and m
? zero or one of prev (same as {0,1})
+ one or more (same as {1,})
* zero or more (same as {0,})

    re.findall(r"\d{4}", "Year 2026 or 26?")       # → ['2026'] (26 has too few digits)
    re.findall(r"colou?r", "color and colour")     # → ['color', 'colour']
    re.findall(r"go+gle", "gogle google gooogle")  # → ['gogle', 'google', 'gooogle']
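One detail that is easy to overlook: a count says how many characters a match takes, not how long the surrounding run must be. The sample string here is invented for this sketch.

```python
import re

text = "7 42 2026 123456"

at_least_two = re.findall(r"\d{2,}", text)   # takes as many digits as it can
two_or_three = re.findall(r"\d{2,3}", text)  # takes at most 3, then starts again
exactly_four = re.findall(r"\d{4}", text)    # takes the first 4 of a longer run

print(at_least_two)  # → ['42', '2026', '123456']
print(two_or_three)  # → ['42', '202', '123', '456']
print(exactly_four)  # → ['2026', '1234']
```

So \d{4} finds four digits in a row wherever they occur, including at the start of a six-digit number.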
Character classes
Square brackets list characters that match a single position.
    re.findall(r"[aeiou]", "education")    # → ['e', 'u', 'a', 'i', 'o']
    re.findall(r"[A-Z]", "Hello World")    # → ['H', 'W']
    re.findall(r"[^aeiou ]", "education")  # NOT a vowel or space: ['d', 'c', 't', 'n']
The ^ inside brackets means "not". Useful for "everything except these".
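Classes can mix ranges and single characters, and they combine with + like any other single-position pattern. A minimal sketch, using a sample sentence chosen for illustration:

```python
import re

text = "It's 2026 already."

# Letters plus the apostrophe, one or more in a row.
letters_and_apostrophes = re.findall(r"[A-Za-z']+", text)

# Anything that is NOT a digit or a space, one or more in a row.
not_digits_or_space = re.findall(r"[^0-9 ]+", text)

print(letters_and_apostrophes)  # → ["It's", 'already']
print(not_digits_or_space)      # → ["It's", 'already.']
```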
The anchors · ^ and $
Outside of brackets, ^ means "start of the string", $ means "end".
    re.search(r"^Aisyah", "Aisyah said hi")  # → match (starts with Aisyah)
    re.search(r"^Aisyah", "Hi, Aisyah!")     # → None (doesn't start with Aisyah)
    re.search(r"\.$", "Hi.")                 # → match (ends with a literal dot)
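Using both anchors together turns re.search into a whole-string test. The sample strings below are hypothetical, chosen to show each anchor on its own and then combined.

```python
import re

samples = ["2026", "year 2026", "2026!"]

# ^ and $ together: the whole string must be digits.
only_digits = [s for s in samples if re.search(r"^\d+$", s)]

# ^ alone: the string must begin with a digit.
starts_with_digit = [s for s in samples if re.search(r"^\d", s)]

# $ alone: the string must end with a digit.
ends_with_digit = [s for s in samples if re.search(r"\d$", s)]

print(only_digits)        # → ['2026']
print(starts_with_digit)  # → ['2026', '2026!']
print(ends_with_digit)    # → ['2026', 'year 2026']
```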
Worked Example · Tweet Extractor
12 min
Save as tweet_parse.py:

    # tweet_parse.py: find hashtags, mentions and URLs
    import re

    tweet = """Lovely day! Visited @aisyah and @wei_jie for laksa #foodie #penang #malaysia.
    Recipe at https://example.com/laksa and also at http://lp.gov.my!
    Call 012-3456789 if you want some 😀
    """

    # Hashtags: # followed by letters/digits/underscore
    hashtags = re.findall(r"#\w+", tweet)
    print("Hashtags:", hashtags)

    # Mentions: @ followed by letters/digits/underscore
    mentions = re.findall(r"@\w+", tweet)
    print("Mentions:", mentions)

    # Phone numbers: Malaysian style
    phones = re.findall(r"\d{2,3}-\d{7,8}", tweet)
    print("Phones :", phones)

    # URLs: http(s) followed by non-space characters
    urls = re.findall(r"https?://\S+", tweet)
    print("URLs :", urls)

    # Quick yes/no
    if re.search(r"laksa", tweet):
        print("Found a laksa reference.")
Output
    Hashtags: ['#foodie', '#penang', '#malaysia']
    Mentions: ['@aisyah', '@wei_jie']
    Phones : ['012-3456789']
    URLs : ['https://example.com/laksa', 'http://lp.gov.my!']
    Found a laksa reference.
Notice the URL match includes the trailing exclamation mark — \S+ grabs everything that's not whitespace. We'll tighten the URL pattern in PY-L2-43.
Read the diff
Four different patterns, each one a literal character followed by \w+ or a more specific shape. https? uses ? to make the s optional, so it matches both http and https. \S (capital S) means "any non-whitespace character", so \S+ grabs everything up to the next space, which is handy when you don't know exactly what comes next.
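Until the pattern itself is tightened, one possible stopgap is to trim trailing punctuation from each match with str.rstrip. This is only a sketch of a workaround, not the approach a later lesson may take, and the set of characters to strip ("!.,?") is an assumption.

```python
import re

tweet = "Recipe at https://example.com/laksa and also at http://lp.gov.my!"

urls = re.findall(r"https?://\S+", tweet)

# Remove any run of the listed punctuation marks from the right-hand end only.
cleaned = [u.rstrip("!.,?") for u in urls]

print(urls)     # → ['https://example.com/laksa', 'http://lp.gov.my!']
print(cleaned)  # → ['https://example.com/laksa', 'http://lp.gov.my']
```

The trade-off is that a URL which genuinely ends in one of those characters would be trimmed too.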
Try It Yourself
13 min
Use re.findall(r"\w+", text) to extract every word from a sentence. Count them.
Hint
    import re

    text = "Hello, world! It's 2026 already."
    words = re.findall(r"\w+", text)
    print(words)
    # → ['Hello', 'world', 'It', 's', '2026', 'already']
    print("Count:", len(words))
Note that \w doesn't include apostrophes, so It's splits into It and s. We'll tighten with a custom character class tomorrow.
From the text below, extract every email address. Aim for a pattern like \w+@\w+\.\w+ — "word, at-sign, word, dot, word".
text = "Contact us: aisyah@example.com or weijie123@school.edu.my, not aisyah.com"
Hint
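    emails = re.findall(r"\w+@\w+\.\w+", text)
    print(emails)
    # → ['aisyah@example.com', 'weijie123@school.edu']

The backslash dot \. matches a literal dot. Without escaping, . means "any character", which is too generous here. Notice that the second address is cut short: the pattern allows only one dot-and-word after the @, so the trailing .my in weijie123@school.edu.my is left out.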
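The pattern \w+@\w+\.\w+ accepts exactly one dot-and-word after the @, so a longer domain such as school.edu.my is captured only up to .edu. One way to allow repeated dotted parts is a non-capturing group, (?:...), which this lesson has not introduced; treat the following as a preview sketch rather than the expected answer.

```python
import re

text = "Contact us: aisyah@example.com or weijie123@school.edu.my, not aisyah.com"

# One dot-and-word only: the second address loses its ".my".
simple = re.findall(r"\w+@\w+\.\w+", text)

# (?:\.\w+)+ repeats "dot, then word" one or more times.
# The ?: keeps findall returning whole matches instead of group contents.
longer = re.findall(r"\w+@\w+(?:\.\w+)+", text)

print(simple)  # → ['aisyah@example.com', 'weijie123@school.edu']
print(longer)  # → ['aisyah@example.com', 'weijie123@school.edu.my']
```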
Malaysian IC numbers look like YYMMDD-PB-#### — six digits, dash, two digits, dash, four digits. Write a function is_ic(text) that returns True/False.
Hint
    import re

    def is_ic(text):
        return bool(re.search(r"^\d{6}-\d{2}-\d{4}$", text.strip()))

    print(is_ic("140812-14-3456"))        # → True
    print(is_ic("14081214-3456"))         # → False
    print(is_ic("140812-14-3456 extra"))  # → False (because of $ anchor)

^ and $ are the anchors: together they force the pattern to consume the entire string. Without them, is_ic would return True for any text that merely contains an IC-shaped substring.
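Python's re module also offers re.fullmatch, which succeeds only when the pattern covers the whole string, so the anchors can be left out. The function name is_ic_fullmatch is made up for this sketch; it is an alternative to the hint above, not a replacement for it.

```python
import re

def is_ic_fullmatch(text):
    # fullmatch requires the pattern to cover the entire (stripped) string,
    # so no ^ or $ is needed.
    return bool(re.fullmatch(r"\d{6}-\d{2}-\d{4}", text.strip()))

results = [
    is_ic_fullmatch("140812-14-3456"),
    is_ic_fullmatch("14081214-3456"),
    is_ic_fullmatch("140812-14-3456 extra"),
]
print(results)  # → [True, False, False]
```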
Mini-Challenge · The Chat-Log Stats Tool
8 min
Build chat_stats.py. Given a chat log of the format below, print stats:

    [14:32] @aisyah: hey are we still meeting? #lunch
    [14:33] @wei_jie: yes, at 12:30! my number's 012-3456789 if you need.
    [14:35] @priya: ok cool. found a place: https://maps.example.com/x
    [14:35] @aisyah: thanks #lunch #penang
Print:
- How many messages (lines that contain :).
- Every unique participant (use set).
- Every hashtag used.
- Every URL.
- Every phone number.
Show one possible solution
    # chat_stats.py: extract structured info from a chat log
    import re

    log = """[14:32] @aisyah: hey are we still meeting? #lunch
    [14:33] @wei_jie: yes, at 12:30! my number's 012-3456789 if you need.
    [14:35] @priya: ok cool. found a place: https://maps.example.com/x
    [14:35] @aisyah: thanks #lunch #penang"""

    # A message is any line that contains a colon
    messages = [line for line in log.splitlines() if ":" in line]
    mentions = re.findall(r"@\w+", log)
    hashtags = re.findall(r"#\w+", log)
    phones = re.findall(r"\d{2,3}-\d{7,8}", log)
    urls = re.findall(r"https?://\S+", log)

    print(f"Messages : {len(messages)}")
    print(f"Participants: {set(mentions)}")
    print(f"Hashtags : {hashtags}")
    print(f"Phones : {phones}")
    print(f"URLs : {urls}")
Non-negotiables: four re.findall calls with appropriate patterns, plus set() on the mentions to find unique people. Your chat-log parser is one regex away from being shippable.
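The timestamps in this log are a handy place to practise escaping: [ and ] normally open and close a character class, so matching them literally needs a backslash. The following is an optional extra sketch, not one of the required stats.

```python
import re

log = """[14:32] @aisyah: hey are we still meeting? #lunch
[14:33] @wei_jie: yes, at 12:30! my number's 012-3456789 if you need.
[14:35] @priya: ok cool. found a place: https://maps.example.com/x
[14:35] @aisyah: thanks #lunch #penang"""

# \[ and \] match literal square brackets around the HH:MM shape.
timestamps = re.findall(r"\[\d{2}:\d{2}\]", log)

# Without the brackets, the 12:30 inside a message is picked up as well.
any_times = re.findall(r"\d{2}:\d{2}", log)

print(timestamps)  # → ['[14:32]', '[14:33]', '[14:35]', '[14:35]']
print(any_times)   # → ['14:32', '14:33', '12:30', '14:35', '14:35']
```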
Recap
3 min
Two functions cover most use cases. re.findall returns all matches as a list. re.search returns a Match object (or None) for the first match; use if m: for yes/no. Always use raw strings, r"...". \d is a digit, \w is a word character, . is any character except a newline, * is zero-or-more, + is one-or-more. Square brackets list a character set; ^ and $ anchor to start and end.
Vocabulary Card
- regex
  A short string describing a pattern of other strings.
- re.findall(p, t)
  Every match as a list.
- re.search(p, t)
  The first match. Returns a Match object or None.
- raw string
  r"...". Python doesn't process backslash escapes. Essential for regex.
- \d / \w / .
  Digit / word character / any non-newline.
- * / + / ?
  Zero-or-more / one-or-more / optional.
Homework
4 min
Build password_check.py. Given a password string, report:
- Has at least 8 characters?
- Has at least one digit? (Use re.search(r"\d", ...).)
- Has at least one upper-case letter? (r"[A-Z]")
- Has at least one lower-case letter? (r"[a-z]")
- Has at least one symbol? (r"[!@#$%^&*]")
Print each check as a ✓ or ✗ and a final "strong / weak" verdict (strong = all five passed).
Sample · password_check.py
    # password_check.py: five regex tests
    import re

    def check(pwd):
        tests = [
            ("length >= 8", len(pwd) >= 8),
            ("has a digit", bool(re.search(r"\d", pwd))),
            ("has UPPER", bool(re.search(r"[A-Z]", pwd))),
            ("has lower", bool(re.search(r"[a-z]", pwd))),
            ("has symbol", bool(re.search(r"[!@#$%^&*]", pwd))),
        ]
        for name, ok in tests:
            print(f" {'✓' if ok else '✗'} {name}")
        if all(ok for _, ok in tests):
            print(" STRONG")
        else:
            print(" weak")

    check(input("Password: "))
Non-negotiables: five separate regex checks, a tick/cross display, and a final strong/weak verdict. bool(re.search(...)) converts the Match-or-None to a clean True/False.
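The symbol class [!@#$%^&*] contains several characters that are special elsewhere in a regex ($, ^, *). Inside square brackets they are plain characters, and ^ means "not" only when it comes first. A quick check, with sample strings invented for this sketch:

```python
import re

# ^, $ and * are literal here because they sit inside the brackets
# and ^ is not the first character.
symbols = re.findall(r"[!@#$%^&*]", "a^b$c*d.e")

# With ^ first, the class is negated: everything except the listed symbols.
negated = re.findall(r"[^!@#$%&*]", "a^b$")

print(symbols)  # → ['^', '$', '*']
print(negated)  # → ['a', '^', 'b']
```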