Latest papers

5 papers
defense · arXiv · Feb 23, 2026

Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement

Amirhossein Farzam, Majid Behabahani, Mani Malek et al. · Duke University · Princeton University +3 more

Detects concealed LLM jailbreaks by disentangling goal and framing signals in internal activation space

Prompt Injection · nlp
PDF
defense · arXiv · Feb 12, 2026

Sparse Autoencoders are Capable LLM Jailbreak Mitigators

Yannick Assogba, Jacopo Cortellazzi, Javier Abad et al. · Apple · ETH Zürich

Defends LLMs against jailbreaks via steering in SAE feature space, outperforming dense activation steering on four models across twelve attacks

Prompt Injection · nlp
PDF
benchmark · arXiv · Oct 8, 2025

Code Agent can be an End-to-end System Hacker: Benchmarking Real-world Threats of Computer-use Agent

Weidi Luo, Qiming Zhang, Tianyu Lu et al. · University of Georgia · University of Wisconsin–Madison +6 more

Benchmarks the ability of LLM-powered computer-use agents to execute end-to-end enterprise intrusions aligned with MITRE ATT&CK TTPs

Excessive Agency · Prompt Injection · nlp · multimodal
4 citations · PDF · Code
defense · arXiv · Sep 18, 2025

Adversarial Distilled Retrieval-Augmented Guarding Model for Online Malicious Intent Detection

Yihao Guo, Haocheng Bian, Liutong Zhou et al. · Apple · Cohere +3 more

Builds a compact 149M-parameter retrieval-augmented guard model that detects malicious LLM prompts in real time with GPT-4-level accuracy

Prompt Injection · nlp
PDF
attack · arXiv · Sep 3, 2025

PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming

Wesley Hanwen Deng, Sunnie S. Y. Kim, Akshita Jha et al. · Carnegie Mellon University · Apple

Improves adversarial prompt attack success rates of automated LLM red-teaming by up to 144% over the state of the art by introducing personas

Prompt Injection · nlp
PDF