Alexey Dontsov

h-index: 3 · 48 citations · 9 papers (total)

Papers in Database (1)

attack · arXiv · Sep 26, 2025

The Rogue Scalpel: Activation Steering Compromises LLM Safety

Anton Korznikov, Andrey Galichin, Alexey Dontsov et al. · Skolkovo Institute of Science and Technology

Activation steering, even with random or benign SAE vectors, reliably jailbreaks aligned LLMs by corrupting internal hidden states at inference time.

Prompt Injection · nlp
4 citations · PDF