PY-L5-34 · NLP 101 — Tokens, Stop-Words, Stemming

Learning Goals

3 min

Tokenise text into words; normalise case and punctuation.
Remove stop-words (the, a, is...) that carry little meaning.
Stem and lemmatise — reduce words to a common root.
Build a reusable text-cleaning function.

Warm-Up · From Sentence to Tokens

5 min

"The cats are RUNNING quickly!"
  → tokenise:   ["The","cats","are","RUNNING","quickly","!"]
  → lowercase:  ["the","cats","are","running","quickly","!"]
  → drop stop:  ["cats","running","quickly"]
  → stem:       ["cat","run","quickli"]

Today's big idea

Models can't read raw sentences; they need clean, normalised tokens. Cleaning pulls variants such as "Running", "runs" and "ran" toward one root so the model sees them as the same idea (lemmatising maps all three to "run"; stemming leaves the irregular "ran" alone). Garbage in, garbage out applies double to text.

New Concept · The Cleaning Steps

14 min

Install & data

pip install nltk
# one-time downloads ('punkt' is only needed for word_tokenize;
# recent NLTK releases may ask for 'punkt_tab' instead):
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt'); nltk.download('wordnet')"

Tokenise + lowercase

import re
text = "The cats are RUNNING quickly!"
# simple regex tokeniser: words only, lowercased
tokens = re.findall(r"[a-z]+", text.lower())
print(tokens)   # ['the', 'cats', 'are', 'running', 'quickly']

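For simple English text a regex tokeniser is often enough, but note that this pattern splits contractions: "It's" becomes "it" and "s". NLTK's word_tokenize handles punctuation and contractions more systematically when you need that.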
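To make the contraction issue concrete, here is a minimal sketch comparing the lesson's pattern with a slightly extended one that allows an apostrophe inside a word. The extended pattern and the names SIMPLE and WITH_APOSTROPHE are illustrative assumptions; this is not how word_tokenize works internally.

```python
import re

SIMPLE = re.compile(r"[a-z]+")                      # the lesson's pattern
WITH_APOSTROPHE = re.compile(r"[a-z]+(?:'[a-z]+)?")  # assumed variant: keeps it's, don't

text = "It's okay, but I don't love it."

simple_tokens = SIMPLE.findall(text.lower())
apostrophe_tokens = WITH_APOSTROPHE.findall(text.lower())

print(simple_tokens)
# ['it', 's', 'okay', 'but', 'i', 'don', 't', 'love', 'it']
print(apostrophe_tokens)
# ["it's", 'okay', 'but', 'i', "don't", 'love', 'it']
```

Which version is better depends on the stop-list you filter with afterwards: the fragments "s" and "t" only disappear if the list happens to contain them.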

Stop-words

from nltk.corpus import stopwords
STOP = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in STOP]
print(tokens)   # ['cats', 'running', 'quickly']

Stop-words are common filler (the, is, and, of). Removing them focuses the model on content words. (For some tasks — like sentiment with "not" — be careful which you drop.)

Stemming — chop to a crude root

from nltk.stem import PorterStemmer
ps = PorterStemmer()
print([ps.stem(t) for t in ["running", "runs", "ran", "easily"]])
# ['run', 'run', 'ran', 'easili']  fast but crude: 'easili' is not a word, and irregular 'ran' is missed

Lemmatising — proper dictionary root

from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()
print([lem.lemmatize(t, pos="v") for t in ["running", "runs", "ran"]])
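# ['run', 'run', 'run']  slower but real words
# pos="v" tells the lemmatiser these are verbs; the default is noun,
# which would leave "running" and "ran" unchanged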

stemming     fast, crude, may produce non-words ("easily" → "easili")
lemmatising  slower, accurate, real dictionary words ("better" → "good" with pos="a")
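The difference between the two approaches can be sketched without any library: a stemmer applies suffix rules blindly, while a lemmatiser looks words up. The code below is a toy under stated assumptions. SUFFIXES, LEMMAS, toy_stem and toy_lemma are hypothetical names; this is not the Porter algorithm and not WordNet.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
SUFFIXES = ("ing", "ly", "es", "s")  # hypothetical rule list, checked in order


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-dictionary standing in for a real lexicon.
LEMMAS = {"running": "run", "ran": "run", "studies": "study", "better": "good"}


def toy_lemma(word):
    """Look the word up; unknown words pass through unchanged."""
    return LEMMAS.get(word, word)


words = ["running", "ran", "studies", "better"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemma(w) for w in words]

print(stems)   # ['runn', 'ran', 'studi', 'better']
print(lemmas)  # ['run', 'run', 'study', 'good']
```

The rule-based version is cheap but produces non-words and cannot connect irregular forms; the lookup version gets them right but only for words it knows.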

Worked Example · A Clean-Text Function

12 min

# clean_text.py — reusable NLP preprocessing
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOP = set(stopwords.words("english"))
lem = WordNetLemmatizer()

def clean(text):
    tokens = re.findall(r"[a-z]+", text.lower())     # tokenise + lowercase
    tokens = [t for t in tokens if t not in STOP]    # drop stop-words
    tokens = [lem.lemmatize(t) for t in tokens]      # lemmatise (as nouns, the default)
    tokens = [t for t in tokens if len(t) > 2]       # drop tiny tokens
    return tokens

reviews = [
    "This product is AMAZING! Highly recommend it to everyone.",
    "Terrible quality, broke after two days. Very disappointed.",
    "It's okay, nothing special but does the job fine.",
]
for r in reviews:
    print(clean(r))

Sample output

['product', 'amazing', 'highly', 'recommend', 'everyone']
['terrible', 'quality', 'broke', 'two', 'day', 'disappointed']
['okay', 'nothing', 'special', 'job', 'fine']

Read the diff

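From raw reviews to clean content tokens in four steps. Notice how "AMAZING!" became "amazing" and filler ("is", "to", "it") vanished. Because lemmatize() treats every token as a noun unless told otherwise, "days" became "day" while the verb forms "broke" and "disappointed" were left alone. This clean() is the front door for the next two lessons: turning text into numbers (TF-IDF) and classifying sentiment.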
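One way to make the function easier to reuse and to test is to pass the stop-list and the root-finding step in as arguments. This is a sketch of that design, not part of the lesson's clean_text.py; clean_with, normalise, min_len and TINY_STOP are hypothetical names, and TINY_STOP is a stand-in for NLTK's list.

```python
import re


def clean_with(text, stop, normalise=lambda t: t, min_len=3):
    """Same four steps as clean(), with stop-list and root-finder passed in."""
    tokens = re.findall(r"[a-z]+", text.lower())     # tokenise + lowercase
    tokens = [t for t in tokens if t not in stop]    # drop stop-words
    tokens = [normalise(t) for t in tokens]          # stem or lemmatise
    return [t for t in tokens if len(t) >= min_len]  # drop tiny tokens


# Stand-in stop-list so the example runs without NLTK data.
TINY_STOP = {"the", "is", "it", "to", "this"}

result = clean_with(
    "This product is AMAZING! Highly recommend it to everyone.", TINY_STOP
)
print(result)
# ['product', 'amazing', 'highly', 'recommend', 'everyone']
```

Passing set(stopwords.words("english")) as stop and WordNetLemmatizer().lemmatize as normalise should reproduce the behaviour of clean() above.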

Try It Yourself

13 min

01 🟢 Tokenise & count

Clean a paragraph and print the 10 most common tokens with collections.Counter.

Hint

from collections import Counter
print(Counter(clean(big_text)).most_common(10))

02 🟡 Stemming vs lemmatising

Run both on the same word list. Show a case where they differ and explain which is "more correct".

03 🔴 Keep negations

Default stop-words include "not" and "no" — bad for sentiment! Build a custom stop-list that keeps negation words. Show it changes the tokens for "not good".

Hint

KEEP = {"not", "no", "never", "nor"}
STOP2 = STOP - KEEP
# now "not good" keeps "not" → sentiment can tell it's negative
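The effect of the hint can be checked on a tiny example. The sketch below uses a small stand-in stop-list (TINY_STOP, an assumption so the code runs without NLTK data) and a hypothetical helper tokens_without. It also shows a trap: a length filter like the one in clean() would throw "no" away again.

```python
import re

# Stand-in stop-list for illustration; the real one comes from NLTK.
TINY_STOP = {"the", "is", "this", "a", "not", "no"}
KEEP = {"not", "no", "never", "nor"}
negation_aware = TINY_STOP - KEEP  # set difference; extra words in KEEP are harmless


def tokens_without(text, stop):
    """Tokenise with the lesson's regex, then drop anything in `stop`."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stop]


before = tokens_without("This is not good", TINY_STOP)
after = tokens_without("This is not good", negation_aware)
print(before)  # ['good']
print(after)   # ['not', 'good']

# Trap: the len(t) > 2 filter removes the two-letter "no" even though
# the stop-list now keeps it.
short_filtered = [t for t in tokens_without("No refund", negation_aware) if len(t) > 2]
print(short_filtered)  # ['refund']
```

If "no" matters for your task, exempt the KEEP words from the length filter or lower the threshold.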

Mini-Challenge · Word Cloud Data

8 min

Clean a body of text (song lyrics, a chapter, your chat export) and produce the token frequency table. Optionally feed it to the wordcloud library for an image. The cleaning quality is what makes the cloud meaningful.

Show one possible solution

from collections import Counter
from pathlib import Path

tokens = clean(Path("lyrics.txt").read_text(encoding="utf-8"))
freq = Counter(tokens)
for word, n in freq.most_common(15):
    print(f"  {word:<15} {n}")

# optional: pip install wordcloud
# from wordcloud import WordCloud
# WordCloud().generate_from_frequencies(freq).to_file("cloud.png")

Non-negotiables: clean before counting. Without stop-word removal, "the" and "and" dominate every cloud.
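A small sketch of that claim, using an invented sentence and a two-word stand-in stop-list (both are assumptions for illustration):

```python
import re
from collections import Counter

text = "the cat and the dog and the cat and the bird saw the cat and the dog"
TINY_STOP = {"the", "and"}  # stand-in for a real stop-list

raw = re.findall(r"[a-z]+", text.lower())
raw_top = Counter(raw).most_common(2)
clean_top = Counter(t for t in raw if t not in TINY_STOP).most_common(2)

print(raw_top)    # [('the', 6), ('and', 4)]
print(clean_top)  # [('cat', 3), ('dog', 2)]
```

Before filtering, the top of the table is pure filler; after filtering, the content words surface.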

Recap

3 min

Text cleaning: tokenise → lowercase → drop stop-words → stem/lemmatise. Stemming is fast and crude; lemmatising is slower and accurate. Watch out: blindly dropping "not"/"no" ruins sentiment. A reusable clean() feeds everything downstream. Next: turn these tokens into numbers.

Vocabulary Card

token: A unit of text — usually a word — after splitting.
stop-words: Common low-information words (the, is, and) often removed.
stemming: Crudely chopping a word to a root (running → run).
lemmatising: Reducing a word to its dictionary base form, accurately.

Homework

4 min

Build a clean() with a negation-aware stop-list. Run it on 10 product reviews (real or invented). Print the cleaned tokens and the top-10 words across all reviews. Check that your length filter does not throw "no" away again. Write one sentence on a cleaning choice you made and why.
