Latest papers

1 paper
benchmark · arXiv · Aug 19, 2025

Mechanistic Exploration of Backdoored Large Language Model Attention Patterns

Mohammed Abu Baker, Lakshmi Babu-Saheer · Anglia Ruskin University

Analyzes attention signatures of backdoored LLMs via mechanistic interpretability, revealing layer-specific deviations useful for detection

Model Poisoning · nlp