Learning Goals
3 min
- Build a bag-of-words matrix with CountVectorizer.
- Understand TF-IDF: term frequency × inverse document frequency.
- Use TfidfVectorizer with n-grams and stop-words.
- See why TF-IDF down-weights common words and lifts distinctive ones.
Warm-Up · Counting Words
5 min
docs: ["i love pizza", "i love code"]
vocabulary: [code, i, love, pizza]

                code  i  love  pizza
"i love pizza"     0  1     1      1
"i love code"      1  1     1      0

That table is bag-of-words: each document becomes a row of word counts. Order is lost ("bag"), but for many tasks word presence is enough.
Vectorising = text → numeric matrix (documents × vocabulary). Bag-of-words counts; TF-IDF counts but discounts words that appear in every document (like "i") and boosts rare, distinctive ones. The matrix then feeds any classifier from week 2.
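The warm-up table can be reproduced by hand before any library is involved. The following is a minimal sketch in plain Python, assuming simple whitespace splitting (an illustrative choice, not how scikit-learn tokenises):

```python
from collections import Counter

docs = ["i love pizza", "i love code"]

# Vocabulary: every distinct word across all documents, sorted alphabetically.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['code', 'i', 'love', 'pizza']
for doc, row in zip(docs, matrix):
    print(f"{doc!r:<16} {row}")
```

Each row is the numeric stand-in for one document, which is all a classifier ever sees.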
New Concept · Vectorisers
14 min
CountVectorizer (bag-of-words)

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["i love pizza", "i love code", "code code code"]

    # The default token_pattern drops one-character tokens such as "i",
    # so widen it to keep every word.
    cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
    X = cv.fit_transform(docs)           # sparse matrix
    print(cv.get_feature_names_out())    # ['code' 'i' 'love' 'pizza']
    print(X.toarray())

    ['code' 'i' 'love' 'pizza']
    [[0 1 1 1]
     [1 1 1 0]
     [3 0 0 0]]
TF-IDF — weight by distinctiveness
    TF     = how often a word appears IN a document
    IDF    = how RARE the word is ACROSS all documents
    TF-IDF = TF × IDF
      → common-everywhere words (low IDF) get small weights
      → distinctive words get large weights

    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(docs)
    print(X.toarray().round(2))
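To see what TF × IDF means numerically, the sketch below rebuilds the TfidfVectorizer output by hand. It assumes scikit-learn's default settings as I understand them: a smoothed IDF of ln((1 + n) / (1 + df)) + 1, followed by scaling each row to unit length. Treat it as a check of the idea, not as the library's internal code.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["i love pizza", "i love code", "code code code"]

# Step 1: raw term counts (TF), documents x vocabulary.
counts = CountVectorizer().fit_transform(docs).toarray()

# Step 2: smoothed IDF, ln((1 + n_docs) / (1 + doc_freq)) + 1.
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Step 3: multiply, then scale each row to unit length (L2 norm).
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with the library result.
library = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, library))
```

The row normalisation is why weights are comparable across short and long documents: a long review does not win simply by containing more words.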
Useful options
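    TfidfVectorizer(
        stop_words="english",   # drop filler words
        ngram_range=(1, 2),     # unigrams AND bigrams (two-word phrases)
        max_features=5000,      # cap vocabulary size
        min_df=2,               # ignore words in fewer than 2 docs
    )

ngram_range=(1,2) is powerful for sentiment: it captures "not good" as a single feature, which unigrams miss. One caveat: scikit-learn removes stop words before it builds n-grams, and its built-in English list includes "not", so leave out stop_words="english" (or pass your own list) when negations matter.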
It's just preprocessing → feed a classifier
    from sklearn.pipeline import make_pipeline
    from sklearn.linear_model import LogisticRegression

    # train_texts and train_labels are your own lists of strings and labels
    clf = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(train_texts, train_labels)
    clf.predict(["this is the best thing ever"])
Worked Example · Which Words Are Distinctive?
12 min

    # tfidf_demo.py: see TF-IDF lift distinctive words
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "the pizza was delicious and the service was great",
        "the pizza was cold and the service was slow",
        "amazing pizza, will come back, highly recommend",
        "worst pizza ever, never coming back, terrible",
    ]

    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(docs)
    vocab = tfidf.get_feature_names_out()

    # show the top TF-IDF word for each document
    dense = X.toarray()
    for i, doc in enumerate(docs):
        top = vocab[dense[i].argmax()]
        print(f"doc {i}: top word = '{top}' ({doc[:35]}...)")
Sample output
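    doc 0: top word = 'delicious' (the pizza was delicious and the ser...)
    doc 1: top word = 'cold' (the pizza was cold and the service ...)
    doc 2: top word = 'amazing' (amazing pizza, will come back, high...)
    doc 3: top word = 'coming' (worst pizza ever, never coming back...)
Read the diff
"pizza" appears in every review, so TF-IDF gives it a low weight: it's not distinctive. The words that rise to the top are the ones unique to a single review. Words that occur once and in only one review get identical weights, and argmax returns the first of them in alphabetical order, which is why doc 3 reports "coming" rather than the equally weighted "terrible" or "worst". Even so, most of the top-weighted words ("delicious", "cold", "amazing", "terrible") are the sentiment-carrying ones. TF-IDF surfaces distinctive terms without needing any labels, which is why it's such a strong, cheap baseline for text.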
Try It Yourself
13 min
- Vectorise 4 short sentences with CountVectorizer. Print the vocabulary and the count matrix.
- Vectorise the same docs with both CountVectorizer and TfidfVectorizer. For one document, compare which words get the highest weight under each.
- Vectorise "not good" and "good" with ngram_range=(1,1) vs (1,2). Show that bigrams let the model see "not good" as its own feature.
Hint
    v = TfidfVectorizer(ngram_range=(1, 2))
    v.fit(["not good", "very good"])
    print(v.get_feature_names_out())   # includes 'not good', 'very good'
Mini-Challenge · Spam Classifier in 6 Lines
8 min
Build a TF-IDF + LogisticRegression pipeline that classifies short messages as spam/ham. Use a handful of labelled examples and predict a new message.
Show one possible solution
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    texts = [
        "win a free prize now",
        "claim your reward click here",
        "hey are we still on for lunch",
        "see you at the meeting tomorrow",
        "free money guaranteed act now",
        "can you send the report",
    ]
    labels = ["spam", "spam", "ham", "ham", "spam", "ham"]

    clf = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(texts, labels)

    for msg in ["free prize click now", "lunch at noon?"]:
        print(f"{msg!r:<28} -> {clf.predict([msg])[0]}")
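Non-negotiables: vectoriser + classifier in one pipeline, predict on unseen text. With real data (e.g. the SMS Spam dataset) a pipeline like this can reach roughly 98% accuracy, but measure it on held-out messages: most of that dataset is ham, so accuracy alone flatters the model.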
Recap
3 min
Bag-of-words counts each word per document; TF-IDF reweights so common words shrink and distinctive ones grow. Add stop-words, n-grams (capture "not good"), and a vocabulary cap. A TF-IDF + linear model pipeline is a strong, fast text-classification baseline. Next: apply it to sentiment.
Vocabulary Card
- bag-of-words: Representing text as word counts, ignoring order.
- TF-IDF: Term frequency × inverse document frequency; weights distinctive words higher.
- n-gram: A sequence of n tokens; bigrams capture short phrases like "not good".
- vectoriser: An object that turns text into a numeric feature matrix.
Homework
4 min
Build a TF-IDF + classifier pipeline on a small labelled text set (spam, topics, or sentiment). Report cross-validated accuracy and the 10 most informative words (largest model coefficients). Add one sentence interpreting them.

    # top informative words from a fitted pipeline (binary classifier)
    import numpy as np

    vec = clf.named_steps["tfidfvectorizer"]
    lr = clf.named_steps["logisticregression"]
    vocab = vec.get_feature_names_out()

    # the largest positive coefficients push a prediction toward lr.classes_[1];
    # sort on np.abs(lr.coef_[0]) instead to rank by strength in either direction
    top = np.argsort(lr.coef_[0])[-10:]
    print([vocab[i] for i in top])
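For the cross-validated accuracy part of the homework, one possible sketch uses cross_val_score on the whole pipeline, so the vectoriser is refitted inside every fold. It reuses the six toy messages from the mini-challenge purely as a stand-in; your own dataset should be much larger, and cv=3 is an illustrative choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = [
    "win a free prize now",
    "claim your reward click here",
    "hey are we still on for lunch",
    "see you at the meeting tomorrow",
    "free money guaranteed act now",
    "can you send the report",
]
labels = ["spam", "spam", "ham", "ham", "spam", "ham"]

clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)

# For a classifier, cv=3 uses stratified folds: here each test fold
# holds one spam and one ham message.
scores = cross_val_score(clf, texts, labels, cv=3)
print(scores, scores.mean())
```

Passing the pipeline, not a pre-vectorised matrix, keeps vocabulary and IDF statistics from leaking out of the test folds. With only six messages the scores are very noisy, so read them as a smoke test and not a result.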