Latest papers

1 paper
attack · arXiv · Mar 10, 2026

Amnesia: Adversarial Semantic Layer Specific Activation Steering in Large Language Models

Ali Raza, Gurang Gupta, Nikolay Matyunin et al. · Honda Research Institute Europe · Indian Institute of Technology Kharagpur

Activation-steering attack manipulates internal transformer states to jailbreak open-weight LLMs without fine-tuning or gradient-based prompt optimization

Prompt Injection nlp