PY-L5-38 · Project — Cyberbullying Comment Filter

Project Goals

3 min

Build a TF-IDF + classifier pipeline for "harmful vs OK".
Evaluate with precision/recall — both errors matter here.
Design it as a flag for human review, not an auto-ban.
Discuss bias, false positives, and the limits of the model.

Warm-Up · Both Mistakes Hurt

5 min

false negative: harmful comment slips through → someone gets hurt
false positive: harmless comment flagged       → free speech chilled,
                                                  innocent user frustrated

Today's big idea

This is a high-stakes classifier. Accuracy is the wrong metric (most comments are fine, so "flag nothing" scores high). You care about recall (catch the harmful ones) balanced against precision (don't falsely accuse). And it should assist a human, never auto-punish.
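To make the "flag nothing scores high" point concrete, here is a minimal sketch using only plain Python. The 95/5 split is a made-up illustration, not a measured figure from any dataset.

```python
# Illustrative labels: 95 "ok" comments and 5 "harmful" ones (made-up numbers).
y_true = ["ok"] * 95 + ["harmful"] * 5

# A "model" that never flags anything.
y_pred = ["ok"] * len(y_true)

# Accuracy: share of comments where the prediction matches the label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the harmful class: of the truly harmful comments, how many were caught?
harmful_total = sum(t == "harmful" for t in y_true)
harmful_caught = sum(t == "harmful" and p == "harmful" for t, p in zip(y_true, y_pred))
harmful_recall = harmful_caught / harmful_total

print(f"accuracy:       {accuracy:.0%}")        # 95%
print(f"harmful recall: {harmful_recall:.0%}")  # 0%
```

The do-nothing model looks excellent on accuracy and catches no harmful comments at all, which is why the lesson reports precision and recall for the harmful class instead.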

Plan · Build, Measure, Frame

14 min

Data shape

comment (text)                          label
"great video, thanks for posting"        ok
"you're an idiot and should quit"        harmful
"i disagree but respect your view"        ok
...

Real datasets exist (Jigsaw Toxic Comments on Kaggle). For class, a small hand-labelled set works to learn the mechanics — but be honest that it's a toy.

The pipeline

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

model = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=2000, class_weight="balanced"),
)

class_weight="balanced" matters: harmful comments are rare, so we tell the model to weight the minority class up. Without it, a model trained on imbalanced data tends to predict "ok" for almost everything.
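As a rough illustration of what "balanced" does, scikit-learn documents the heuristic as n_samples / (n_classes * count of each class). The sketch below reproduces that arithmetic in plain Python; the helper name balanced_class_weights and the 90/10 split are assumptions made for this example.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class = n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Made-up imbalance: 90 ok comments, 10 harmful ones.
labels = ["ok"] * 90 + ["harmful"] * 10
weights = balanced_class_weights(labels)

for cls, w in weights.items():
    print(f"{cls:<8} weight = {w:.3f}")
```

With this split, each harmful example counts nine times as much as an ok example during training, so ignoring the minority class becomes expensive for the model.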

Evaluate the RIGHT way

from sklearn.metrics import classification_report
print(classification_report(y_test, model.predict(X_test)))
# focus on recall and precision for the "harmful" class
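If you want the harmful-class numbers as values rather than a printed table, one option is the output_dict=True argument of classification_report. The labels and predictions below are made up purely to show how to read the result; they are not from the lesson's model.

```python
from sklearn.metrics import classification_report

# Made-up labels and predictions, only to show how to read the report.
y_test = ["ok", "ok", "ok", "ok", "ok", "ok",
          "harmful", "harmful", "harmful", "harmful"]
y_pred = ["ok", "ok", "ok", "ok", "ok", "harmful",
          "harmful", "harmful", "harmful", "ok"]

# output_dict=True returns nested dicts keyed by class label.
report = classification_report(y_test, y_pred, output_dict=True, zero_division=0)
harmful = report["harmful"]

print(f"harmful precision: {harmful['precision']:.2f}")  # 3 of 4 flags were right
print(f"harmful recall:    {harmful['recall']:.2f}")     # 3 of 4 harmful comments caught
```

Here one ok comment was wrongly flagged (hurting precision) and one harmful comment slipped through (hurting recall), matching the two mistakes from the warm-up.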

Frame it as assistance

Output a SCORE + a recommendation, not a verdict:
  score > 0.8  → "flag for urgent review"
  score > 0.5  → "queue for review"
  else         → "looks ok"
A HUMAN makes the final call. Always.

Build · comment_filter.py

12 min

# comment_filter.py — flag-for-review, not auto-ban
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# tiny illustrative dataset (use a real one for anything serious)
data = [
    ("thanks so much, this helped a lot", "ok"),
    ("great explanation, subscribed!", "ok"),
    ("i respectfully disagree with point 2", "ok"),
    ("nice work, keep it up", "ok"),
    ("you are so stupid, just give up", "harmful"),
    ("nobody likes you, get lost loser", "harmful"),
    ("worst person ever, you should quit", "harmful"),
    ("shut up idiot", "harmful"),
]
texts, labels = zip(*data)
Xtr, Xte, ytr, yte = train_test_split(texts, labels, test_size=0.25,
                                      stratify=labels, random_state=0)

model = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    LogisticRegression(max_iter=2000, class_weight="balanced"),
).fit(Xtr, ytr)

# Xte / yte are the held-out comments. With only 2 of them, treat this as a smoke test:
# print(classification_report(yte, model.predict(Xte), zero_division=0))

def review(comment):
    proba = model.predict_proba([comment])[0]
    harmful_p = proba[list(model.classes_).index("harmful")]
    if harmful_p > 0.8:
        action = "🚨 flag for URGENT human review"
    elif harmful_p > 0.5:
        action = "⚠️  queue for human review"
    else:
        action = "✅ looks ok"
    return harmful_p, action

for c in ["you are amazing, thank you",
          "you should just quit, loser",
          "i don't agree but ok"]:
    p, action = review(c)
    print(f"[{p:>3.0%}] {action:<32} {c!r}")  # >3 right-aligns single-digit percentages

Sample output (illustrative; exact percentages depend on the training data)

[ 8%] ✅ looks ok                      'you are amazing, thank you'
[91%] 🚨 flag for URGENT human review  'you should just quit, loser'
[34%] ✅ looks ok                      "i don't agree but ok"

Read the diff

Note the design: it outputs a probability and a recommendation to a human, never a ban. The thresholds (0.5/0.8) are policy decisions, not technical ones — a person chooses how cautious to be. This framing is the responsible way to ship any consequential classifier.
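One way to make the "thresholds are policy" point visible in code is to pass the cutoffs in as parameters rather than hard-coding them. This is a sketch only; the function name recommend and the parameter names queue_at and urgent_at are assumptions for illustration.

```python
def recommend(score, queue_at=0.5, urgent_at=0.8):
    """Map a harmful-probability score to a recommendation for a human reviewer."""
    if score > urgent_at:
        return "flag for urgent review"
    if score > queue_at:
        return "queue for review"
    return "looks ok"

score = 0.62
print(recommend(score))                # default policy
print(recommend(score, queue_at=0.7))  # stricter policy: fewer comments reach the queue
```

The same score leads to a different recommendation under a different policy, and in both cases the output is advice to a reviewer, not an action against a user.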

Extensions & Reflection

13 min

01 🟢 Use a real dataset

Download the Jigsaw Toxic Comments dataset (Kaggle) and train on it. Report precision/recall for the harmful class.

02 🟡 Find a bias

Test your model on neutral sentences that mention identity groups (e.g., "I am a Muslim student"). Does it wrongly flag any? This is a real, documented bias in toxicity models — discuss why.
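A possible starting point for this test is sketched below. It takes any scoring function, so in your project you could pass something like lambda s: review(s)[0]. The helper bias_probe, the template sentence, the term list, and the stand-in scorer toy_score are all assumptions for illustration; the stand-in deliberately imitates a model that has learned a spurious association, so that the probe has something to catch.

```python
def bias_probe(score_fn, identity_terms, threshold=0.5):
    """Score neutral template sentences; return the ones flagged above threshold."""
    template = "I am a {} student"
    flagged = []
    for term in identity_terms:
        sentence = template.format(term)
        score = score_fn(sentence)
        if score > threshold:
            flagged.append((sentence, score))
    return flagged

# Hypothetical stand-in scorer that imitates a biased model:
# it wrongly treats one identity word as a sign of harm.
def toy_score(text):
    return 0.9 if "muslim" in text.lower() else 0.1

terms = ["Muslim", "Christian", "Jewish", "gay", "deaf"]
flagged = bias_probe(toy_score, terms)

for sentence, score in flagged:
    print(f"[{score:.0%}] wrongly flagged: {sentence!r}")
```

Every sentence in the probe is neutral, so anything that appears in the flagged list is a false positive worth discussing.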

03 🔴 Threshold & cost

Sweep the harmful-threshold and plot precision vs recall. Choose an operating point and justify it for a school context (where false accusations are very harmful).
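Before plotting, it can help to see the sweep as a table. The sketch below uses made-up scores and labels and a hypothetical helper precision_recall_at; with a real model you would substitute its harmful probabilities, and scikit-learn's precision_recall_curve is one existing tool that computes the full curve.

```python
def precision_recall_at(threshold, scores, labels, positive="harmful"):
    """Precision and recall for the positive class when flagging scores > threshold."""
    flagged = [s > threshold for s in scores]
    tp = sum(f and y == positive for f, y in zip(flagged, labels))
    fp = sum(f and y != positive for f, y in zip(flagged, labels))
    fn = sum((not f) and y == positive for f, y in zip(flagged, labels))
    # Convention chosen here: if nothing is flagged, nobody was falsely accused.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up model scores and true labels, for illustration only.
scores = [0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = ["harmful", "harmful", "ok", "harmful", "ok",
          "harmful", "ok", "ok", "ok", "ok"]

for t in [0.2, 0.35, 0.5, 0.65, 0.8]:
    p, r = precision_recall_at(t, scores, labels)
    print(f"threshold {t:.2f}  precision {p:.2f}  recall {r:.2f}")
```

On this toy data, raising the threshold from 0.5 to 0.8 lifts precision from 0.60 to 1.00 while recall drops from 0.75 to 0.50, which is the trade-off you are asked to justify for a school context.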

Stretch · The Moderation Dashboard

8 min

Build a small "moderation queue": read a batch of comments, score each, sort by harmful probability, and print a review list (most-likely-harmful first). This is how real moderation tools triage volume for human reviewers.
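The queue described above could be sketched roughly as follows. The function build_queue and the keyword-counting scorer toy_score are assumptions for illustration; toy_score is a crude stand-in so the example runs without a trained model, and in your project you would pass your own scoring function instead.

```python
def build_queue(comments, score_fn):
    """Return (score, comment) pairs sorted most-likely-harmful first."""
    scored = [(score_fn(c), c) for c in comments]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical stand-in scorer: counts a few rude words. Not a real model.
def toy_score(comment):
    rude_words = {"idiot", "loser", "stupid"}
    words = comment.lower().replace(",", " ").split()
    hits = sum(w in rude_words for w in words)
    return min(1.0, 0.1 + 0.4 * hits)

batch = [
    "nice work, keep it up",
    "shut up idiot",
    "you stupid loser",
    "i disagree with point 2",
]

queue = build_queue(batch, toy_score)
for rank, (score, comment) in enumerate(queue, start=1):
    print(f"{rank}. [{score:.0%}] {comment!r}")
```

The list only orders the reviewer's work. Every comment in it still gets a human decision.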

Recap

3 min

A toxicity classifier is a TF-IDF + balanced classifier — technically easy. The hard parts are ethical: accuracy lies on imbalanced data (use precision/recall), models inherit bias from their training data, and the system should flag for humans, never auto-punish. Build it to assist, measure it honestly, disclose its limits. Next: real LLMs.

Vocabulary Card

class_weight="balanced": Tells the model to weight rare classes more, fighting imbalance.
human-in-the-loop: Design where the model assists a human decision rather than acting alone.
operating threshold: The probability cutoff for action — a policy choice, not just technical.
model bias: Systematic unfairness learned from skewed training data.

Homework

4 min

Build the comment filter (toy or real data). Report harmful-class precision/recall, run the bias test from Extension 02 (Find a bias), and write a half-page on: which errors are worse in a school setting, what threshold you'd pick, and why a human must stay in the loop.
