Latest papers

7 papers
attack arXiv Apr 23, 2026

PermaFrost-Attack: Stealth Pretraining Seeding (SPS) for planting Logic Landmines During LLM Training

Harsh Kumar, Rahul Maity, Tanmay Joshi et al. · Manipal University Jaipur · National Institute of Technology Karnataka +3 more

Web-scale poisoning attack planting dormant backdoor triggers in LLM pretraining corpora via stealth websites indexed by Common Crawl

Data Poisoning Attack Model Poisoning AI Supply Chain Attacks Training Data Poisoning nlp
PDF Code
attack arXiv Apr 13, 2026

On the Robustness of Watermarking for Autoregressive Image Generation

Andreas Müller, Denis Lukovnikov, Shingo Kodama et al. · Ruhr University Bochum · Middlebury College +3 more

Attacks watermarking schemes for autoregressive image generators, achieving both removal and forgery with single reference images

Output Integrity Attack vision generative
PDF
defense arXiv Feb 23, 2026

Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement

Amirhossein Farzam, Majid Behabahani, Mani Malek et al. · Duke University · Princeton University +3 more

Detects concealed LLM jailbreaks by disentangling goal and framing signals in internal activation space

Prompt Injection nlp
PDF
defense arXiv Feb 12, 2026

Sparse Autoencoders are Capable LLM Jailbreak Mitigators

Yannick Assogba, Jacopo Cortellazzi, Javier Abad et al. · Apple · ETH Zürich

Defends LLMs against jailbreaks via SAE feature-space steering, outperforming dense activation steering on four models across twelve attacks

Prompt Injection nlp
PDF
benchmark arXiv Oct 8, 2025

Code Agent can be an End-to-end System Hacker: Benchmarking Real-world Threats of Computer-use Agent

Weidi Luo, Qiming Zhang, Tianyu Lu et al. · University of Georgia · University of Wisconsin–Madison +6 more

Benchmarks LLM-powered agents' ability to execute end-to-end enterprise intrusions aligned with MITRE ATT&CK TTPs

Excessive Agency Prompt Injection nlp multimodal
4 citations PDF Code
defense arXiv Sep 18, 2025

Adversarial Distilled Retrieval-Augmented Guarding Model for Online Malicious Intent Detection

Yihao Guo, Haocheng Bian, Liutong Zhou et al. · Apple · Cohere +3 more

Builds a compact 149M-parameter RAG-augmented guard model that detects malicious LLM prompts in real time with GPT-4-level accuracy

Prompt Injection nlp
PDF
attack arXiv Sep 3, 2025

PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming

Wesley Hanwen Deng, Sunnie S. Y. Kim, Akshita Jha et al. · Carnegie Mellon University · Apple

Persona-driven automated red-teaming method improves LLM adversarial prompt attack success rates by up to 144% over state-of-the-art

Prompt Injection Red-Team Agents nlp
PDF