Shaz Furniturewala

Papers in Database (1)

defense arXiv Sep 16, 2025 · Sep 2025

Towards Inclusive Toxic Content Moderation: Addressing Vulnerabilities to Adversarial Attacks in Toxicity Classifiers Tackling LLM-generated Content

Shaz Furniturewala, Arkaitz Zubiaga · BITS Pilani · Queen Mary University of London

Defends toxicity classifiers against adversarial text attacks by identifying and suppressing vulnerable attention heads via mechanistic interpretability

Input Manipulation Attack nlp