Latest papers

1 papers
defense arXiv Feb 23, 2026 · 6w ago

Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement

Amirhossein Farzam, Majid Behabahani, Mani Malek et al. · Duke University · Princeton University +3 more

Detects concealed LLM jailbreaks by disentangling goal and framing signals in internal activation space

Prompt Injection nlp
PDF