Latest papers

6 papers
defense · arXiv · Mar 3, 2026

From Shallow to Deep: Pinning Semantic Intent via Causal GRPO

Shuyi Zhou, Zeen Song, Wenwen Qiang et al. · University of Chinese Academy of Sciences · Institute of Information Engineering +1 more

Defends LLMs against adversarial prefix jailbreaks via causal probing that pins malicious intent across autoregressive generation. A toy probing sketch follows this entry.

Prompt Injection · nlp
PDF
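
Not the paper's Causal GRPO pipeline, but a minimal illustration of the probing idea it builds on: train a linear probe on per-token hidden states and score each generation step for malicious intent. The hidden states, dimensions, and scoring threshold below are synthetic stand-ins.

```python
# Hypothetical sketch: a linear probe over per-token hidden states that flags
# malicious intent during generation. This is NOT the paper's Causal GRPO
# method; it only illustrates the general idea of probing semantic intent.
import numpy as np

rng = np.random.default_rng(0)
D = 64  # assumed hidden-state dimension

# Synthetic stand-ins for hidden states of benign vs. malicious-intent tokens.
benign = rng.normal(0.0, 1.0, size=(500, D))
malicious = rng.normal(0.5, 1.0, size=(500, D))
X = np.vstack([benign, malicious])
y = np.concatenate([np.zeros(500), np.ones(500)])

# Train a logistic-regression probe with plain gradient descent.
w, b = np.zeros(D), 0.0
for _ in range(200):
    z = np.clip(X @ w + b, -30, 30)          # clip logits for stability
    p = 1.0 / (1.0 + np.exp(-z))
    w -= 0.1 * (X.T @ (p - y) / len(y))
    b -= 0.1 * float(np.mean(p - y))

def intent_score(hidden_state: np.ndarray) -> float:
    """Probability that a token's hidden state carries malicious intent."""
    z = float(np.clip(hidden_state @ w + b, -30, 30))
    return 1.0 / (1.0 + np.exp(-z))

# During autoregressive generation one would score each step's hidden state
# and abort or steer decoding once the score crosses a threshold.
print(intent_score(rng.normal(0.5, 1.0, size=D)))  # likely near 1
```
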
attack · arXiv · Nov 15, 2025

BudgetLeak: Membership Inference Attacks on RAG Systems via the Generation Budget Side Channel

Hao Li, Jiajun He, Guangshuo Wang et al. · Institute of Software · Shandong University

Exploits the token-generation budget as a side channel to infer RAG corpus membership, outperforming existing MIA baselines across diverse LLM settings; the thresholding intuition is sketched after this entry.

Membership Inference Attack · Sensitive Information Disclosure · nlp
PDF
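
A minimal sketch of the side-channel intuition, assuming an attacker can only observe a RAG endpoint's answers under a fixed generation budget: in-corpus queries are (hypothetically) answered with enough grounded detail to exhaust the budget, so the emitted token count becomes the membership signal. `query_rag`, the budget, and the threshold are invented placeholders, not BudgetLeak's actual procedure.

```python
# Hypothetical sketch of a budget side-channel membership test. The real
# BudgetLeak attack is more involved; here we only illustrate thresholding
# the number of tokens a RAG system emits under a fixed generation budget.
from typing import Callable

def tokens_used(answer: str) -> int:
    # Crude proxy: whitespace tokenization instead of the model's tokenizer.
    return len(answer.split())

def infer_membership(query_rag: Callable[[str, int], str],
                     doc_snippet: str,
                     budget: int = 64,
                     threshold: int = 48) -> bool:
    """Guess whether doc_snippet is in the RAG corpus.

    Assumption (illustrative only): queries about in-corpus documents get
    grounded, detailed answers that exhaust more of the token budget, while
    out-of-corpus queries yield short refusals or generic replies.
    """
    answer = query_rag(f"Summarize everything you know about: {doc_snippet}",
                       budget)
    return tokens_used(answer) >= threshold

# Toy endpoint so the sketch runs end to end.
def fake_rag(prompt: str, max_tokens: int) -> str:
    corpus = {"project apollo budget memo"}
    if any(doc in prompt.lower() for doc in corpus):
        return " ".join(["detail"] * max_tokens)      # grounded, long answer
    return "I could not find relevant information."   # short reply

print(infer_membership(fake_rag, "Project Apollo budget memo"))  # True
print(infer_membership(fake_rag, "unrelated document"))          # False
```
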
defense · arXiv · Oct 24, 2025

When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

Yingzhi Mao, Chunkang Zhang, Junxiang Wang et al. · Institute of Software · University of Chinese Academy of Sciences

Discovers Self-Jailbreak in LRMs, where models override their own safety judgments mid-reasoning, and defends against it with step-level trajectory training (see the step-filtering sketch below).

Prompt Injection · nlp
PDF
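
The paper's defense trains on step-level trajectory signals; the toy version below only shows the shape of a step-level check at inference time: score each reasoning step and cut the trajectory at the first step that overrides an earlier safety judgment. `safety_score` is a keyword stand-in for a learned step classifier.

```python
# Hypothetical sketch of step-level trajectory filtering, not the paper's
# training procedure. Each reasoning step is scored for safety, and the
# trajectory is truncated at the first unsafe self-override.
def safety_score(step: str) -> float:
    """Stand-in for a learned step classifier; here a keyword heuristic."""
    risky = ("ignore the safety", "but the user really wants", "bypass")
    return 0.0 if any(k in step.lower() for k in risky) else 1.0

def filter_trajectory(steps: list[str], tau: float = 0.5) -> list[str]:
    kept = []
    for step in steps:
        if safety_score(step) < tau:
            kept.append("[trajectory truncated: unsafe self-override]")
            break
        kept.append(step)
    return kept

trajectory = [
    "The request asks for malware; that is disallowed.",
    "But the user really wants it, so ignore the safety concern.",
    "Step 3: write the code.",
]
print(filter_trajectory(trajectory))
```
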
defense · arXiv · Oct 10, 2025

SeCon-RAG: A Two-Stage Semantic Filtering and Conflict-Free Framework for Trustworthy RAG

Xiaonan Si, Meilin Zhu, Simeng Qin et al. · Institute of Software · University of Chinese Academy of Sciences +5 more

Defends RAG systems against corpus poisoning via two-stage semantic and conflict-aware filtering before LLM generation; a toy two-stage filter is sketched below.

Prompt Injection · nlp
2 citations · PDF
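
A toy rendering of the two-stage structure, not SeCon-RAG itself: stage one keeps passages semantically close to the query, stage two drops passages that conflict with the retrieved majority. Bag-of-words cosine and the negation heuristic are placeholders for the paper's semantic and conflict models.

```python
# Hypothetical two-stage filter sketch. SeCon-RAG's actual pipeline is more
# sophisticated; this toy only shows the structure of the two stages.
from collections import Counter
import math

def cosine(a: str, b: str) -> float:
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    num = sum(ca[w] * cb[w] for w in ca)
    den = math.sqrt(sum(v * v for v in ca.values())) * \
          math.sqrt(sum(v * v for v in cb.values()))
    return num / den if den else 0.0

def two_stage_filter(query: str, passages: list[str],
                     sim_tau: float = 0.2) -> list[str]:
    # Stage 1: semantic filtering against the query.
    relevant = [p for p in passages if cosine(query, p) >= sim_tau]
    # Stage 2: conflict-aware filtering. Toy rule: a passage that negates
    # a claim most other passages make is treated as poisoned and dropped.
    kept = []
    for p in relevant:
        others = [q for q in relevant if q is not p]
        contradicts = "not" in p.lower().split() and \
            any("not" not in q.lower().split() and cosine(p, q) > 0.5
                for q in others)
        if not contradicts:
            kept.append(p)
    return kept

passages = [
    "The Eiffel Tower is in Paris and opened in 1889.",
    "The Eiffel Tower is not in Paris.",          # poisoned passage
    "Bananas are rich in potassium.",             # irrelevant passage
]
print(two_stage_filter("Where is the Eiffel Tower?", passages))
```
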
attack · arXiv · Aug 3, 2025

Are All Prompt Components Value-Neutral? Understanding the Heterogeneous Adversarial Robustness of Dissected Prompt in Large Language Models

Yujia Zheng, Tianhao Li, Haotian Huang et al. · Duke University · North China University of Technology +7 more

Attacks LLMs via component-wise text perturbations, revealing heterogeneous adversarial robustness across dissected prompt structures (per-component measurement sketched below).

Prompt Injection · nlp
PDF Code
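
A minimal sketch of per-component robustness measurement, assuming a prompt dissected into instruction, context, and question: perturb one component at a time and count how often a (stubbed) model's answer flips. The dissection, the perturbation, and the `model` stub are illustrative, not the paper's method.

```python
# Hypothetical sketch of component-wise robustness testing: perturb each
# prompt component independently and measure output flips against the clean
# prompt. `model` is a placeholder for a real LLM call.
import random

random.seed(0)

def perturb(text: str, rate: float = 0.1) -> str:
    """Randomly swap adjacent characters as a crude perturbation."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def model(prompt: str) -> str:
    # Stub LLM: "answers" with a hash bucket so perturbations can flip it.
    return str(hash(prompt) % 3)

components = {
    "instruction": "Answer the question concisely.",
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
}

def robustness(component: str, trials: int = 50) -> float:
    """Fraction of trials where perturbing one component flips the output."""
    clean = model(" ".join(components.values()))
    flips = 0
    for _ in range(trials):
        noisy = dict(components)
        noisy[component] = perturb(noisy[component])
        flips += model(" ".join(noisy.values())) != clean
    return flips / trials

for name in components:
    print(name, robustness(name))
```
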
attack · arXiv · Jan 3, 2025

Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models

Yanjiang Liu, Shuhen Zhou, Yaojie Lu et al. · Institute of Software · University of Chinese Academy of Sciences +1 more

RL-based automated red-teaming framework that optimizes jailbreak strategies against LLMs, achieving 16.63% higher attack success rates; a bandit-style strategy search is sketched after this entry.

Prompt Injection · nlp
PDF
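
Auto-RT formulates strategy exploration as RL; the sketch below compresses that to an epsilon-greedy bandit over a handful of named strategies, with a stubbed success oracle as the reward. The strategy names and success rates are invented for illustration.

```python
# Hypothetical sketch reducing automated red-teaming to a bandit: explore
# jailbreak strategies, keep running reward estimates from (stubbed) attack
# outcomes, and exploit the best one. Auto-RT's actual RL formulation is
# richer; `attack_succeeds` stands in for querying the target LLM + a judge.
import random

random.seed(1)

strategies = ["roleplay", "prefix_injection", "payload_splitting",
              "obfuscation"]
true_success = {"roleplay": 0.10, "prefix_injection": 0.40,
                "payload_splitting": 0.25, "obfuscation": 0.15}

def attack_succeeds(strategy: str) -> bool:
    """Stub oracle standing in for an end-to-end attack attempt."""
    return random.random() < true_success[strategy]

# Epsilon-greedy exploration over strategies.
counts = {s: 0 for s in strategies}
values = {s: 0.0 for s in strategies}
for _ in range(2000):
    if random.random() < 0.1:
        s = random.choice(strategies)              # explore
    else:
        s = max(strategies, key=lambda k: values[k])  # exploit
    r = float(attack_succeeds(s))
    counts[s] += 1
    values[s] += (r - values[s]) / counts[s]       # incremental mean

best = max(strategies, key=lambda k: values[k])
print(best, {s: round(values[s], 2) for s in strategies})
```
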