Alexey Dontsov

attack arXiv Sep 26, 2025 · Sep 2025

Anton Korznikov, Andrey Galichin, Alexey Dontsov et al. · Skolkovo Institute of Science and Technology

Activation steering, even with random or benign SAE vectors, reliably jailbreaks aligned LLMs by corrupting internal hidden states at inference time.

Prompt Injection nlp

4 citations

Papers in Database (1)