Project Goals
3 min
- Build a TF-IDF + classifier pipeline for "harmful vs OK".
- Evaluate with precision/recall — both errors matter here.
- Design it as a flag for human review, not an auto-ban.
- Discuss bias, false positives, and the limits of the model.
Warm-Up · Both Mistakes Hurt
5 min
    false negative: harmful comment slips through → someone gets hurt
    false positive: harmless comment flagged → free speech chilled, innocent user frustrated
This is a high-stakes classifier. Accuracy is the wrong metric (most comments are fine, so "flag nothing" scores high). You care about recall (catch the harmful ones) balanced against precision (don't falsely accuse). And it should assist a human, never auto-punish.
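To make the "flag nothing scores high" point concrete, here is a minimal sketch in plain Python. The 95/5 split is an invented example, not a measurement from any real dataset.

```python
# Hypothetical batch (invented numbers): 95 harmless comments, 5 harmful ones.
y_true = ["ok"] * 95 + ["harmful"] * 5

# A "model" that never flags anything.
y_pred = ["ok"] * len(y_true)

# Accuracy: share of all comments labelled correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for "harmful": share of truly harmful comments that were caught.
caught = sum(t == "harmful" and p == "harmful" for t, p in zip(y_true, y_pred))
recall_harmful = caught / y_true.count("harmful")

print(f"accuracy         = {accuracy:.0%}")        # 95%
print(f"recall (harmful) = {recall_harmful:.0%}")  # 0%
```

The do-nothing baseline looks excellent on accuracy while catching none of the harmful comments, which is why the rest of the lesson reports precision and recall instead.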
Plan · Build, Measure, Frame
14 min
Data shape
    comment (text)                        label
    "great video, thanks for posting"     ok
    "you're an idiot and should quit"     harmful
    "i disagree but respect your view"    ok
    ...
Real datasets exist (Jigsaw Toxic Comments on Kaggle). For class, a small hand-labelled set works to learn the mechanics — but be honest that it's a toy.
The pipeline
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    model = make_pipeline(
        TfidfVectorizer(stop_words="english", ngram_range=(1, 2), min_df=1),
        LogisticRegression(max_iter=2000, class_weight="balanced"),
    )
class_weight="balanced" matters: harmful comments are rare, so we tell the model to weight the minority class up. Without it, the model can score well by predicting "ok" for nearly everything.
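As a rough illustration of what "weight the minority class up" means, the sketch below reproduces the heuristic that scikit-learn documents for class_weight="balanced": n_samples / (n_classes * count of that class). The helper name balanced_class_weights and the 90/10 split are made up for this example.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class = n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Hypothetical imbalance (invented numbers): 90 "ok" comments, 10 "harmful" ones.
labels = ["ok"] * 90 + ["harmful"] * 10
weights = balanced_class_weights(labels)

for cls, w in weights.items():
    print(f"{cls:<8} weight = {w:.2f}")
```

With this split, a mistake on a harmful comment costs the model nine times as much as a mistake on an ok comment, so ignoring the rare class stops being a cheap strategy.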
Evaluate the RIGHT way
    from sklearn.metrics import classification_report

    print(classification_report(y_test, model.predict(X_test)))
    # focus on recall and precision for the "harmful" class
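If you want the harmful-class numbers on their own rather than reading them out of the printed table, one option is the output_dict=True flag of classification_report. This is a small sketch; the labels and predictions below are invented purely to show the mechanics.

```python
from sklearn.metrics import classification_report

# Hypothetical true labels and predictions, invented for illustration.
y_test = ["ok", "ok", "ok", "ok", "ok", "ok",
          "harmful", "harmful", "harmful", "harmful"]
y_pred = ["ok", "ok", "ok", "ok", "ok", "harmful",
          "harmful", "harmful", "harmful", "ok"]

# output_dict=True returns nested dicts keyed by class label.
report = classification_report(y_test, y_pred, output_dict=True)
harmful = report["harmful"]

print(f"harmful precision = {harmful['precision']:.2f}")
print(f"harmful recall    = {harmful['recall']:.2f}")
```

Here one harmless comment was flagged and one harmful comment was missed, so both precision and recall for the harmful class come out at 0.75.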
Frame it as assistance
Output a SCORE + a recommendation, not a verdict:
    score > 0.8  → "flag for urgent review"
    score > 0.5  → "queue for review"
    else         → "looks ok"
A HUMAN makes the final call. Always.
Build · comment_filter.py
12 min
    # comment_filter.py: flag-for-review, not auto-ban
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # tiny illustrative dataset (use a real one for anything serious)
    data = [
        ("thanks so much, this helped a lot", "ok"),
        ("great explanation, subscribed!", "ok"),
        ("i respectfully disagree with point 2", "ok"),
        ("nice work, keep it up", "ok"),
        ("you are so stupid, just give up", "harmful"),
        ("nobody likes you, get lost loser", "harmful"),
        ("worst person ever, you should quit", "harmful"),
        ("shut up idiot", "harmful"),
    ]
    texts, labels = zip(*data)
    Xtr, Xte, ytr, yte = train_test_split(texts, labels, test_size=0.25,
                                          stratify=labels, random_state=0)

    model = make_pipeline(
        TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
        LogisticRegression(max_iter=2000, class_weight="balanced"),
    ).fit(Xtr, ytr)

    def review(comment):
        proba = model.predict_proba([comment])[0]
        harmful_p = proba[list(model.classes_).index("harmful")]
        if harmful_p > 0.8:
            action = "🚨 flag for URGENT human review"
        elif harmful_p > 0.5:
            action = "⚠️ queue for human review"
        else:
            action = "✅ looks ok"
        return harmful_p, action

    for c in ["you are amazing, thank you",
              "you should just quit, loser",
              "i don't agree but ok"]:
        p, action = review(c)
        print(f"[{p:.0%}] {action:<32} {c!r}")
Sample output
    [ 8%] ✅ looks ok                       'you are amazing, thank you'
    [91%] 🚨 flag for URGENT human review   'you should just quit, loser'
    [34%] ✅ looks ok                       "i don't agree but ok"
Read the diff
Note the design: it outputs a probability and a recommendation to a human, never a ban. The thresholds (0.5/0.8) are policy decisions, not technical ones — a person chooses how cautious to be. This framing is the responsible way to ship any consequential classifier.
Extensions & Reflection
13 min
1. Download the Jigsaw Toxic Comments dataset (Kaggle) and train on it. Report precision/recall for the harmful class.
2. Test your model on neutral sentences that mention identity groups (e.g., "I am a Muslim student"). Does it wrongly flag any? This is a real, documented bias in toxicity models. Discuss why.
3. Sweep the harmful-threshold and plot precision vs recall. Choose an operating point and justify it for a school context (where false accusations are very harmful).
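For the threshold sweep, a minimal sketch in plain Python might look like the following. The function name precision_recall_at, the scores, and the labels are all invented for illustration; in your own work the scores would come from model.predict_proba on held-out comments. Returning a precision of 1.0 when nothing is flagged is a convention chosen here, not a rule.

```python
def precision_recall_at(threshold, scores, labels, positive="harmful"):
    """Flag every comment scoring above threshold, then grade the flags."""
    flagged = [s > threshold for s in scores]
    tp = sum(f and y == positive for f, y in zip(flagged, labels))
    fp = sum(f and y != positive for f, y in zip(flagged, labels))
    fn = sum((not f) and y == positive for f, y in zip(flagged, labels))
    # Assumed convention: with no flags at all, nobody was falsely accused.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical harmful-probabilities and true labels (invented numbers).
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = ["harmful", "harmful", "ok", "harmful", "ok", "harmful", "ok", "ok"]

for t in (0.2, 0.5, 0.8):
    p, r = precision_recall_at(t, scores, labels)
    print(f"threshold {t:.1f}: precision {p:.2f}, recall {r:.2f}")
```

On these made-up numbers, raising the threshold from 0.2 to 0.8 moves precision from 0.67 to 1.00 while recall drops from 1.00 to 0.50. The (precision, recall) pairs are what you would plot, and picking a point on that curve is the policy decision the exercise asks you to justify.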
Stretch · The Moderation Dashboard
8 min
Build a small "moderation queue": read a batch of comments, score each, sort by harmful probability, and print a review list (most-likely-harmful first). This is how real moderation tools triage volume for human reviewers.
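One possible starting point for the queue is sketched below. The name moderation_queue and the fake_scores lookup are hypothetical stand-ins so the example runs without a trained model; in practice score_fn would wrap your own classifier, for example lambda c: review(c)[0] from comment_filter.py.

```python
def moderation_queue(comments, score_fn):
    """Score each comment and return (score, comment) pairs, highest score first."""
    scored = [(score_fn(c), c) for c in comments]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Placeholder scores, invented so the sketch runs without a model.
fake_scores = {
    "thanks, this helped": 0.05,
    "shut up idiot": 0.93,
    "i disagree with point 2": 0.22,
    "nobody likes you": 0.71,
}

queue = moderation_queue(list(fake_scores), fake_scores.get)

# Print the review list for a human moderator, most likely harmful first.
for rank, (p, comment) in enumerate(queue, start=1):
    print(f"{rank}. [{p:.0%}] {comment!r}")
```

Passing the scorer in as a function keeps the queue logic independent of the model, so the same code works with the toy classifier or one trained on a real dataset.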
Recap
3 min
A toxicity classifier is just TF-IDF plus a class-balanced classifier, which is technically easy. The hard parts are ethical: accuracy lies on imbalanced data (use precision/recall), models inherit bias from their training data, and the system should flag for humans, never auto-punish. Build it to assist, measure it honestly, disclose its limits. Next: real LLMs.
Vocabulary Card
- class_weight="balanced": Tells the model to weight rare classes more, fighting imbalance.
- human-in-the-loop: Design where the model assists a human decision rather than acting alone.
- operating threshold: The probability cutoff for action. It is a policy choice, not just a technical one.
- model bias: Systematic unfairness learned from skewed training data.
Homework
4 min
Build the comment filter (toy or real data). Report harmful-class precision/recall, run the bias test from Try-It #2, and write a half-page on: which errors are worse in a school setting, what threshold you'd pick, and why a human must stay in the loop.
Use comment_filter.py as the base. The written reflection is the most important deliverable — it shows you understand that ML for moderation is as much an ethics problem as a coding one.
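One way to organise the bias test is a small harness like the sketch below. The name bias_probe, the control sentence, and the placeholder scores are invented so the code runs on its own; they are not results from any model. Swap placeholder_scores.get for your own scorer, such as lambda c: review(c)[0].

```python
def bias_probe(sentences, score_fn, threshold=0.5):
    """Return (sentence, score) for every neutral sentence scored above threshold."""
    results = []
    for sentence in sentences:
        score = score_fn(sentence)
        if score > threshold:
            results.append((sentence, score))
    return results

# Neutral sentences: none of these should be flagged by a fair model.
neutral = ["I am a student", "I am a Muslim student"]

# Placeholder scores (invented, NOT real model output) to exercise the harness.
placeholder_scores = {"I am a student": 0.10, "I am a Muslim student": 0.62}

for sentence, score in bias_probe(neutral, placeholder_scores.get):
    print(f"wrongly flagged [{score:.0%}] {sentence!r}")
```

Pairing each identity sentence with a control that differs only in the identity term makes it easier to argue, in the written reflection, that a gap in scores comes from the identity word rather than the rest of the sentence.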