Zhiyuan Xu

attack · arXiv · Nov 21, 2025

Steering in the Shadows: Causal Amplification for Activation Space Attacks in Large Language Models

Zhiyuan Xu, Stanislav Abaimov, Joseph Gardiner et al. · University of Bristol

Novel activation-space attack exploits compression valleys in LLMs to steer behavior toward harmful outputs while evading conventional input/weight audits

Input Manipulation Attack · Prompt Injection · nlp

Papers in Database (1)

Steering in the Shadows: Causal Amplification for Activation Space Attacks in Large Language Models