Latest papers

4 papers
defense arXiv Apr 29, 2026

SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts

Yuan Xin, Yixuan Weng, Minjun Zhu et al. · CISPA · Westlake University +3 more

GAN-inspired co-evolutionary framework that trains attack generators and defenders to protect LLM review systems from hidden prompt injection

Prompt Injection nlp
attack arXiv Mar 19, 2026

On Optimizing Multimodal Jailbreaks for Spoken Language Models

Aravind Krishnan, Karolina Stańczak, Dietrich Klakow · Saarland University · DFKI GmbH +1 more

Multimodal gradient-based jailbreak attack on Spoken Language Models achieving 1.5x-10x higher success than unimodal attacks

Input Manipulation Attack Prompt Injection multimodal nlp audio
benchmark arXiv Jan 12, 2026

Explaining Generalization of AI-Generated Text Detectors Through Linguistic Analysis

Yuxi Xia, Kinga Stańczak, Benjamin Roth · University of Vienna · Saarland University

Benchmark and linguistic analysis explaining why AI-text detectors fail to generalize across prompts, models, and domains

Output Integrity Attack nlp
defense arXiv Aug 8, 2025

In-Training Defenses against Emergent Misalignment in Language Models

David Kaczér, Magnus Jørgenvåg, Clemens Vetter et al. · University of Bonn · Lamarr Institute for Machine Learning and Artificial Intelligence +1 more

Evaluates four in-training regularization defenses that prevent emergent misalignment when fine-tuning LLMs with malicious data via APIs

Transfer Learning Attack Prompt Injection nlp