Learning Goals
3 min
- Tokenise text into words; normalise case and punctuation.
- Remove stop-words (the, a, is...) that carry little meaning.
- Stem and lemmatise — reduce words to a common root.
- Build a reusable text-cleaning function.
Warm-Up · From Sentence to Tokens
5 min
"The cats are RUNNING quickly!"
→ tokenise: ["The","cats","are","RUNNING","quickly","!"]
→ lowercase: ["the","cats","are","running","quickly","!"]
→ drop stop-words and punctuation: ["cats","running","quickly"]
→ stem: ["cat","run","quickli"]
Models can't read raw sentences; they need clean, normalised tokens. Cleaning collapses "Running", "runs", "ran" toward one root so the model sees them as the same idea. "Garbage in, garbage out" applies doubly to text.
New Concept · The Cleaning Steps
14 min
Install & data
pip install nltk
# one-time downloads:
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt'); nltk.download('wordnet')"
Tokenise + lowercase
import re

text = "The cats are RUNNING quickly!"

# simple regex tokeniser: words only, lowercased
tokens = re.findall(r"[a-z]+", text.lower())
print(tokens)
# ['the', 'cats', 'are', 'running', 'quickly']
For most tasks a regex tokeniser is plenty. NLTK's word_tokenize is smarter about punctuation and contractions when you need it.
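To make the contraction point concrete, here is a minimal sketch comparing the lesson's letters-only pattern with a slightly richer one. The second pattern and the sample sentence are illustrative assumptions, not part of the lesson's code, and neither pattern reproduces what word_tokenize does.

```python
import re

text = "Don't panic, it's the cat's toy!"

# The lesson's pattern: runs of letters only, so an apostrophe splits a word.
simple = re.findall(r"[a-z]+", text.lower())

# Hypothetical variant: allow one internal apostrophe so contractions stay whole.
with_apostrophes = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

print(simple)
# ['don', 't', 'panic', 'it', 's', 'the', 'cat', 's', 'toy']
print(with_apostrophes)
# ["don't", 'panic', "it's", 'the', "cat's", 'toy']
```

Which behaviour is preferable depends on the task: the letters-only pattern leaves stray fragments such as "t" and "s", while the apostrophe-aware one keeps contractions as single tokens.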
Stop-words
from nltk.corpus import stopwords

STOP = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in STOP]
print(tokens)
# ['cats', 'running', 'quickly']
Stop-words are common filler (the, is, and, of). Removing them focuses the model on content words. (For some tasks — like sentiment with "not" — be careful which you drop.)
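The caution about "not" can be seen with a small sketch. It uses a tiny hand-written stop-list (an assumption for illustration, far shorter than NLTK's English list) and a hypothetical helper named content_tokens.

```python
import re

# Tiny hand-written stop-list for illustration only (an assumption,
# much shorter than NLTK's English list).
TOY_STOP = {"the", "is", "a", "this", "not"}

def content_tokens(text, stop=TOY_STOP):
    """Lowercase, keep runs of letters, drop anything in the stop-list."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stop]

positive = content_tokens("This film is good")
negative = content_tokens("This film is not good")

print(positive)              # ['film', 'good']
print(negative)              # ['film', 'good']
print(positive == negative)  # True
```

Once "not" is removed, the two opposite reviews become the same token list, so nothing downstream can tell them apart.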
Stemming — chop to a crude root
from nltk.stem import PorterStemmer

ps = PorterStemmer()
print([ps.stem(t) for t in ["running", "runs", "ran", "easily"]])
# ['run', 'run', 'ran', 'easili'] (fast but crude; note 'easili')
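To get a feel for why suffix chopping is crude, here is a toy stripper. It is not the Porter algorithm, which applies a much larger, ordered rule set; the suffix list, the minimum stem length, and the name toy_stem are all assumptions made for this sketch.

```python
# Toy suffix-stripper: NOT the Porter algorithm, just the general idea.
# The suffix list and minimum stem length are illustrative assumptions.
SUFFIXES = ("ing", "ly", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem
    return word

words = ["jumping", "jumps", "running", "ran", "caring"]
stems = [toy_stem(w) for w in words]
print(stems)
# ['jump', 'jump', 'runn', 'ran', 'car']
```

Three typical failure modes show up: a non-word ("runn"), an irregular form left untouched ("ran"), and a stem that lands on an unrelated word ("caring" becomes "car").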
Lemmatising — proper dictionary root
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
print([lem.lemmatize(t, pos="v") for t in ["running", "runs", "ran"]])
# ['run', 'run', 'run'] (slower but real words)
stemming: fast, crude, may produce non-words ("easili")
lemmatising: slower, accurate, real dictionary words ("better" → "good" with pos="a")
Worked Example · A Clean-Text Function
12 min
# clean_text.py: reusable NLP preprocessing
import re

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOP = set(stopwords.words("english"))
lem = WordNetLemmatizer()

def clean(text):
    tokens = re.findall(r"[a-z]+", text.lower())   # tokenise + lowercase
    tokens = [t for t in tokens if t not in STOP]  # drop stop-words
    tokens = [lem.lemmatize(t) for t in tokens]    # lemmatise
    tokens = [t for t in tokens if len(t) > 2]     # drop tiny tokens
    return tokens

reviews = [
    "This product is AMAZING! Highly recommend it to everyone.",
    "Terrible quality, broke after two days. Very disappointed.",
    "It's okay, nothing special but does the job fine.",
]

for r in reviews:
    print(clean(r))
Sample output
['product', 'amazing', 'highly', 'recommend', 'everyone']
['terrible', 'quality', 'broke', 'two', 'day', 'disappointed']
['okay', 'nothing', 'special', 'job', 'fine']
Read the diff
From raw reviews to clean content tokens in four steps. Notice how "AMAZING!" → "amazing" and filler ("is", "to", "it") vanished. This clean() is the front door for the next two lessons — turning text into numbers (TF-IDF) and classifying sentiment.
Try It Yourself
13 min
Clean a paragraph and print the 10 most common tokens with collections.Counter.
Hint
from collections import Counter

print(Counter(clean(big_text)).most_common(10))
Run the stemmer and the lemmatiser on the same word list. Show a case where they differ and explain which is "more correct".
Default stop-words include "not" and "no" — bad for sentiment! Build a custom stop-list that keeps negation words. Show it changes the tokens for "not good".
Hint
KEEP = {"not", "no", "never", "nor"}
STOP2 = STOP - KEEP
# now "not good" keeps "not" → sentiment can tell it's negative
Mini-Challenge · Word Cloud Data
8 min
Clean a body of text (song lyrics, a chapter, your chat export) and produce the token frequency table. Optionally feed it to the wordcloud library for an image. The cleaning quality is what makes the cloud meaningful.
Show one possible solution
from collections import Counter
from pathlib import Path

tokens = clean(Path("lyrics.txt").read_text())
freq = Counter(tokens)
for word, n in freq.most_common(15):
    print(f" {word:<15} {n}")

# optional: pip install wordcloud
# from wordcloud import WordCloud
# WordCloud().generate_from_frequencies(freq).to_file("cloud.png")
Non-negotiables: clean before counting. Without stop-word removal, "the" and "and" dominate every cloud.
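A quick way to see the effect is to count the same text with and without a stop-word filter. The sketch below uses only the standard library, an invented sample text, and a two-word stop-list (assumptions for illustration, not NLTK's list).

```python
import re
from collections import Counter

text = (
    "The cat and the dog and the bird. "
    "The cat sat and the cat slept and the dog ran."
)

# Small illustrative stop-list (an assumption, not NLTK's full list).
TOY_STOP = {"the", "and"}

tokens = re.findall(r"[a-z]+", text.lower())
raw_top = Counter(tokens).most_common(2)
clean_top = Counter(t for t in tokens if t not in TOY_STOP).most_common(2)

print(raw_top)    # [('the', 6), ('and', 4)]
print(clean_top)  # [('cat', 3), ('dog', 2)]
```

The unfiltered counts are led entirely by filler; after filtering, the top entries are the words the text is actually about.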
Recap
3 min
Text cleaning: tokenise → lowercase → drop stop-words → stem/lemmatise. Stemming is fast and crude; lemmatising is slower and accurate. Watch out: blindly dropping "not"/"no" ruins sentiment. A reusable clean() feeds everything downstream. Next: turn these tokens into numbers.
Vocabulary Card
- token: A unit of text, usually a word, after splitting.
- stop-words: Common low-information words (the, is, and) often removed.
- stemming: Crudely chopping a word to a root (running → run).
- lemmatising: Reducing a word to its dictionary base form, accurately.
Homework
4 min
Build a clean() with a negation-aware stop-list. Run it on 10 product reviews (real or invented). Print the cleaned tokens and the top-10 words across all reviews. Add one sentence on a cleaning choice you made and why.
Reuse clean_text.py + the negation-aware stop-list from Try-It #3. The choice to keep negations is the key insight for sentiment.