PY-L5-35 · Bag-of-Words & TF-IDF

Learning Goals

3 min

Build a bag-of-words matrix with CountVectorizer.
Understand TF-IDF: term frequency × inverse document frequency.
Use TfidfVectorizer with n-grams and stop-words.
See why TF-IDF down-weights common words and lifts distinctive ones.

Warm-Up · Counting Words

5 min

docs: ["i love pizza", "i love code"]
vocabulary: [code, i, love, pizza]
                 code  i  love  pizza
"i love pizza"     0   1    1     1
"i love code"      1   1    1     0

That table is bag-of-words: each document becomes a row of word counts. Order is lost ("bag"), but for many tasks word presence is enough.
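The table above can be rebuilt by hand in a few lines. This is only a sketch using the standard library: splitting on whitespace is an assumption made for illustration, and real tokenisers also deal with punctuation and casing.

```python
from collections import Counter

docs = ["i love pizza", "i love code"]

# Naive tokenisation: split each document on whitespace.
tokenised = [doc.split() for doc in docs]

# The vocabulary is every distinct word, sorted so the columns have a fixed order.
vocabulary = sorted({word for tokens in tokenised for word in tokens})

# One row per document, one count per vocabulary word.
matrix = []
for tokens in tokenised:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for doc, row in zip(docs, matrix):
    print(row, doc)
```

Each row has the same length as the vocabulary, which is what lets a classifier treat every document as a point in the same feature space.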

Today's big idea

Vectorising = text → numeric matrix (documents × vocabulary). Bag-of-words counts; TF-IDF counts but discounts words that appear in every document (like "i") and boosts rare, distinctive ones. The matrix then feeds any classifier from week 2.

New Concept · Vectorisers

14 min

CountVectorizer (bag-of-words)

from sklearn.feature_extraction.text import CountVectorizer

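docs = ["i love pizza", "i love code", "code code code"]
# The default token_pattern skips one-letter tokens such as "i";
# this pattern keeps them so the output matches the warm-up table.
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = cv.fit_transform(docs)          # sparse matrix

print(cv.get_feature_names_out())   # ['code' 'i' 'love' 'pizza']
print(X.toarray())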

['code' 'i' 'love' 'pizza']
[[0 1 1 1]
 [1 1 1 0]
 [3 0 0 0]]

TF-IDF — weight by distinctiveness

TF  = how often a word appears IN a document
IDF = how RARE the word is ACROSS all documents
TF-IDF = TF × IDF
  → common-everywhere words (low IDF) get small weights
  → distinctive words get large weights

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print(X.toarray().round(2))
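To make TF × IDF concrete, here is a sketch that recomputes the matrix by hand and checks it against TfidfVectorizer. It assumes scikit-learn's default settings: raw counts as TF, the smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["i love pizza", "i love code", "code code code"]

# TF: raw counts per document (same tokenisation as TfidfVectorizer's default).
counts = CountVectorizer().fit_transform(docs).toarray()

# IDF: rarer words (small document frequency) get a larger value.
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                 # documents containing each word
idf = np.log((1 + n_docs) / (1 + df)) + 1     # smoothed IDF

# TF x IDF, then scale every row to unit length.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sk = TfidfVectorizer().fit_transform(docs).toarray()
print(idf.round(2))
print(np.allclose(manual, sk))
```

Because of the row normalisation, the weights in one document are comparable with each other, but a long document does not automatically get larger values than a short one.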

Useful options

TfidfVectorizer(
    stop_words="english",      # drop filler words
    ngram_range=(1, 2),        # unigrams AND bigrams ("not good")
    max_features=5000,         # cap vocabulary size
    min_df=2,                  # ignore words in fewer than 2 docs
)

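ngram_range=(1, 2) is powerful for sentiment: it captures "not good" as a single feature, which unigrams miss. One caveat: scikit-learn's built-in "english" stop list contains "not", and stop words are removed before n-grams are built, so combining stop_words="english" with bigrams drops the "not good" feature.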

It's just preprocessing → feed a classifier

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

clf = make_pipeline(TfidfVectorizer(stop_words="english"),
                    LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)
clf.predict(["this is the best thing ever"])

Worked Example · Which Words Are Distinctive?

12 min

# tfidf_demo.py — see TF-IDF lift distinctive words
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the pizza was delicious and the service was great",
    "the pizza was cold and the service was slow",
    "amazing pizza, will come back, highly recommend",
    "worst pizza ever, never coming back, terrible",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)
vocab = tfidf.get_feature_names_out()

# show the top TF-IDF word for each document
dense = X.toarray()
for i, doc in enumerate(docs):
    top = vocab[dense[i].argmax()]
    print(f"doc {i}: top word = '{top}'   ({doc[:35]}...)")

Sample output

doc 0: top word = 'delicious'   (the pizza was delicious and the ser...)
doc 1: top word = 'cold'   (the pizza was cold and the service ...)
doc 2: top word = 'amazing'   (amazing pizza, will come back, high...)
doc 3: top word = 'coming'   (worst pizza ever, never coming back...)

Read the output

"pizza" appears in every review, so TF-IDF gives it the lowest weight in each row: it's not distinctive. Words that occur in only one review ("delicious", "great", "cold", "terrible") get the highest weight. In documents this short those words tie, and argmax returns the first tied column, which is the alphabetically earliest word; that is why doc 3 reports 'coming' rather than 'terrible'. On a larger corpus exact ties become rare and the distinctive, often sentiment-carrying words rise to the top, which is why TF-IDF is such a strong, cheap baseline for text.

Try It Yourself

13 min

01 🟢 Bag-of-words table

Vectorise 4 short sentences with CountVectorizer. Print the vocabulary and the count matrix.

02 🟡 Count vs TF-IDF

Vectorise the same docs both ways. For one document, compare which words get the highest weight under each.

03 🔴 Bigrams matter

Vectorise "not good" and "good" with ngram_range=(1,1) vs (1,2). Show that bigrams let the model see "not good" as its own feature.

Hint

v = TfidfVectorizer(ngram_range=(1,2))
v.fit(["not good", "very good"])
print(v.get_feature_names_out())  # includes 'not good', 'very good'

Mini-Challenge · Spam Classifier in 6 Lines

8 min

Build a TF-IDF + LogisticRegression pipeline that classifies short messages as spam/ham. Use a handful of labelled examples and predict a new message.

Show one possible solution

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["win a free prize now", "claim your reward click here",
         "hey are we still on for lunch", "see you at the meeting tomorrow",
         "free money guaranteed act now", "can you send the report"]
labels = ["spam", "spam", "ham", "ham", "spam", "ham"]

clf = make_pipeline(TfidfVectorizer(stop_words="english"),
                    LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

for msg in ["free prize click now", "lunch at noon?"]:
    print(f"{msg!r:<28} -> {clf.predict([msg])[0]}")

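Non-negotiables: vectoriser + classifier in one pipeline, predict on unseen text. With real data (e.g. the SMS Spam dataset) a pipeline like this can reach accuracy in the high 90s; confirm it on a held-out test set rather than taking the number on trust.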

Recap

3 min

Bag-of-words counts each word per document; TF-IDF reweights so common words shrink and distinctive ones grow. Add stop-words, n-grams (capture "not good"), and a vocabulary cap. A TF-IDF + linear model pipeline is a strong, fast text-classification baseline. Next: apply it to sentiment.

Vocabulary Card

bag-of-words: Representing text as word counts, ignoring order.
TF-IDF: Term frequency × inverse document frequency — weights distinctive words higher.
n-gram: A sequence of n tokens; bigrams capture short phrases like "not good".
vectoriser: An object that turns text into a numeric feature matrix.

Homework

4 min

Build a TF-IDF + classifier pipeline on a small labelled text set (spam, topics, or sentiment). Report cross-validated accuracy and the 10 most informative words (largest model coefficients). One sentence interpreting them.

# top informative words from a fitted pipeline
import numpy as np
vec = clf.named_steps["tfidfvectorizer"]
lr  = clf.named_steps["logisticregression"]
vocab = vec.get_feature_names_out()
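# argsort is ascending, so the last 10 indices are the largest coefficients:
# the words that push hardest towards lr.classes_[1] (binary case)
top = np.argsort(lr.coef_[0])[-10:]
print([vocab[i] for i in top])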
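For the cross-validated accuracy part of the homework, a minimal sketch with cross_val_score is shown below. It reuses the six toy messages from the mini-challenge purely as placeholder data; with so few examples the scores are noisy and say little about real performance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "claim your reward click here",
         "hey are we still on for lunch", "see you at the meeting tomorrow",
         "free money guaranteed act now", "can you send the report"]
labels = ["spam", "spam", "ham", "ham", "spam", "ham"]

clf = make_pipeline(TfidfVectorizer(stop_words="english"),
                    LogisticRegression(max_iter=1000))

# cv=3 with stratification puts one spam and one ham message in each test fold.
# The pipeline is refitted inside every fold, so the vocabulary never sees test text.
scores = cross_val_score(clf, texts, labels, cv=3)
print(scores, scores.mean())
```

Passing the whole pipeline to cross_val_score, rather than a pre-vectorised matrix, is what keeps the test folds out of the vocabulary and IDF statistics.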
