PY-L5-36 · Sentiment Analysis — Advaslearning Hub

Learning Goals

3 min

Use VADER for instant rule-based sentiment scores.
Train a TF-IDF + classifier sentiment model on labelled reviews.
Compare lexicon vs trained approaches.
Spot where sentiment models fail (sarcasm, negation, context).

Warm-Up · Two Routes

5 min

LEXICON (VADER):  a dictionary of word→sentiment scores.
                   No training. Great for social media / short text.
TRAINED MODEL:     TF-IDF + classifier on YOUR labelled data.
                   Learns your domain's language. Needs labels.

Today's big idea

If you have no labels and need something now, use a lexicon tool. If you have labelled data in your domain (movie reviews, product feedback), a trained model usually wins. Both are sentiment analysis.

New Concept · VADER & a Trained Model

14 min

Route 1 — VADER (no training)

pip install nltk
python -c "import nltk; nltk.download('vader_lexicon')"

from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

for text in ["I love this!", "This is terrible.", "It's okay I guess"]:
    s = sia.polarity_scores(text)
    print(f"{s['compound']:+.2f}  {text}")

+0.69  I love this!
-0.48  This is terrible.
+0.20  It's okay I guess

compound ranges from -1 (very negative) to +1 (very positive). VADER also reacts to emphasis cues: exclamation marks, ALL-CAPS words, intensifiers such as "very", and text emoticons like :). Rule of thumb: ≥ 0.05 positive, ≤ -0.05 negative, else neutral.
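The range and the rule of thumb can be sketched in a few lines of plain Python. The labelling function follows the thresholds stated above. The squash function, with its constant of 15, reflects my understanding of how the reference VADER code normalizes a raw valence sum; treat it as an assumption about internals rather than something this lesson specifies. Both function names are made up for this sketch.

```python
import math


def squash(raw_sum: float, alpha: float = 15.0) -> float:
    """Map an unbounded valence sum into the open interval (-1, 1).

    Assumption: mirrors the normalization in the reference VADER code.
    """
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Apply the rule of thumb: >= +0.05 pos, <= -0.05 neg, else neu."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"


# Raw sums of any size land strictly between -1 and +1 after squashing.
for raw in (-4.0, -0.1, 0.0, 0.1, 4.0):
    c = squash(raw)
    print(f"raw={raw:+.1f}  compound={c:+.3f}  label={label_from_compound(c)}")
```

Weak signals (a raw sum of ±0.1 here) fall inside the neutral band, which is why a mildly worded sentence can come back as neutral even when it contains a sentiment word.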

Route 2 — train your own

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

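clf = make_pipeline(
    # No stop_words="english" here: scikit-learn's built-in list drops "not"
    # and "too", which would erase the very phrases we want as bigrams.
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
# train_reviews: list of review strings; train_labels: matching "pos"/"neg" labels
clf.fit(train_reviews, train_labels)
print(clf.predict(["the plot was boring but the acting saved it"]))

The ngram_range=(1, 2) setting is doing real work here: besides single words, it turns phrases such as "not good", "too slow", and "highly recommend" into features of their own. That only works if those words survive tokenization, which is why the vectorizer above does not use stop_words="english".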
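To make the n-gram idea concrete, here is a plain-Python sketch of the features a (1, 2) range yields, with and without stop-word removal. The tokenizer is a simplification and the three-word stop list is hypothetical; this is not scikit-learn's implementation, only an illustration of the order of operations (filter stop words first, then join neighbours).

```python
import re


def word_ngrams(text, ngram_range=(1, 2), stop_words=frozenset()):
    """Lowercase, tokenize, drop stop words, then join adjacent tokens."""
    tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower())
              if t not in stop_words]
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


# Hypothetical mini stop list, chosen to include "not" on purpose.
TOY_STOP_WORDS = frozenset({"the", "was", "not"})

print(word_ngrams("The film was not good"))
print(word_ngrams("The film was not good", stop_words=TOY_STOP_WORDS))
```

Once "not" is filtered out, the bigram "not good" can no longer form and the sentence reduces to "film", "good", and "film good". Any stop list that contains negation words has this effect, so it is worth checking the list before combining it with bigrams for sentiment.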

Where both fail

Sarcasm:     "Oh GREAT, another delay." (positive words, negative meaning)
Negation:    "not bad at all" (a lexicon may miss the flip)
Domain:      "this phone is sick" (slang positive, lexicon says negative)
Context:     "small" is good for a phone, bad for a hotel room
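The first three failure modes are easy to reproduce with a deliberately naive lexicon scorer. The sketch below is a toy: the word scores are invented for illustration and are not taken from VADER's lexicon, and the scorer has no negation or sarcasm handling at all. Real tools add heuristics on top of the dictionary lookup, which helps with some of these cases but not all.

```python
import re

# Hypothetical word scores for illustration; not VADER's actual values.
TOY_LEXICON = {
    "great": 3.0,
    "love": 3.0,
    "bad": -2.5,
    "sick": -2.0,
    "terrible": -2.0,
}


def toy_score(text):
    """Sum the scores of known words; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(TOY_LEXICON.get(w, 0.0) for w in words)


def toy_label(text):
    s = toy_score(text)
    return "pos" if s > 0 else "neg" if s < 0 else "neu"


# (sentence, label a human reader would intend)
examples = [
    ("Oh GREAT, another delay.", "neg"),   # sarcasm
    ("not bad at all", "pos"),             # negation
    ("this phone is sick", "pos"),         # slang
]

for text, intended in examples:
    print(f"{text:<28} intended={intended}  toy={toy_label(text)}")
```

All three come out with the opposite label, because a word-by-word lookup sees only "great", "bad", and "sick" and nothing of the surrounding meaning.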

Worked Example · Lexicon vs Trained

12 min

# sentiment.py — compare VADER to a trained model
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# tiny labelled training set (use a real dataset for production)
train = [
    ("absolutely loved it, best ever", "pos"),
    ("amazing quality, highly recommend", "pos"),
    ("so good, will buy again", "pos"),
    ("terrible, broke immediately", "neg"),
    ("worst purchase, total waste", "neg"),
    ("hated it, do not buy", "neg"),
]
texts, labels = zip(*train)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000)).fit(texts, labels)

sia = SentimentIntensityAnalyzer()
tests = ["this is fantastic", "not worth the money",
         "it does the job"]

print(f"{'text':<24}{'VADER':>10}{'trained':>10}")
for t in tests:
    v = sia.polarity_scores(t)["compound"]
    v_label = "pos" if v >= 0.05 else "neg" if v <= -0.05 else "neu"
    print(f"{t:<24}{v_label:>10}{clf.predict([t])[0]:>10}")

Sample output

text                       VADER   trained
this is fantastic            pos       pos
not worth the money          neg       neg
it does the job              neu       pos

Read the diff

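VADER and the trained model give the same labels on the first two sentences. They diverge on "it does the job": it is neutral to VADER, but our tiny training set has no neutral class, so the model is forced to pick pos or neg. Treat the agreement on "this is fantastic" with caution too: none of its words appear in the six training sentences, so the model has no evidence to go on and its answer may change with a different training set. The lesson: a trained model is only as good as its labels, its classes, and its coverage. With real, balanced data the trained model usually wins on your specific domain.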
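One quick way to see how fragile a six-sentence training set is: check which words of each test sentence the model has actually seen. The sketch below reuses the training and test sentences from the worked example. The tokenizer (lowercase, words of two or more characters) approximates scikit-learn's default, and the helper names are made up for this illustration.

```python
import re

train_texts = [
    "absolutely loved it, best ever",
    "amazing quality, highly recommend",
    "so good, will buy again",
    "terrible, broke immediately",
    "worst purchase, total waste",
    "hated it, do not buy",
]
tests = ["this is fantastic", "not worth the money", "it does the job"]


def tokens(text):
    """Lowercase words of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


# Every single word the model could have learned a weight for.
vocab = {w for text in train_texts for w in tokens(text)}


def seen_words(text):
    """Words of `text` that also occur somewhere in the training set."""
    return [w for w in tokens(text) if w in vocab]


for t in tests:
    print(f"{t:<22} seen in training: {seen_words(t)}")
```

Each test sentence shares at most one word with the training data, so the predictions rest on very little evidence. Running the same check on a real dataset is a cheap sanity test before trusting a model's output.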

Try It Yourself

13 min

01 🟢 Score your messages

Run VADER on 10 of your own sentences. Does it agree with how you'd label them?

02 🟡 Break VADER

Find three sentences where VADER gets it wrong (sarcasm, slang, negation). Explain why.

03 🔴 Train on real data

Use a real review dataset (IMDB, or a CSV you have). Train TF-IDF + LogisticRegression and report its cross-validated (CV) accuracy. Then score a held-out test set with both the trained model and VADER, and compare their accuracy.

Mini-Challenge · Sentiment Over Time

8 min

Given dated reviews, compute average sentiment per day/week and plot the trend. Did sentiment improve or decline? This is exactly how brands monitor reputation.

One possible solution

import matplotlib.pyplot as plt
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

# reviews.csv is expected to have two columns: date, text
df = pd.read_csv("reviews.csv", parse_dates=["date"])
# one compound score per review
df["sentiment"] = df["text"].apply(lambda t: sia.polarity_scores(t)["compound"])
# average the scores within each calendar week
weekly = df.set_index("date")["sentiment"].resample("W").mean()
weekly.plot(title="Average sentiment per week")
plt.show()

Non-negotiables: a sentiment score per row, resampled over time, a trend plot. This combines L4 time-series with L5 NLP.

Recap

3 min

Two routes: VADER (lexicon, zero training, great for short social text) and a trained TF-IDF + classifier (learns your domain, needs labels). Both struggle with sarcasm, slang, negation and context. Use VADER for a quick start; train when you have labelled domain data. Next: a rule-based chatbot — no API.

Vocabulary Card

sentiment analysis: Classifying the emotional tone of text (positive/negative/neutral).
lexicon (VADER): A dictionary of word→sentiment scores; no training needed.
compound score: VADER's overall -1..+1 sentiment for a piece of text.
domain adaptation: Training on your specific text so the model learns its language.

Homework

4 min

Collect 20 short reviews (real or written). Score them with VADER AND a small trained model. Where do they disagree? Pick the 3 most interesting disagreements and explain which one you trust and why.
