Latest papers

10 papers
attack arXiv Feb 12, 2026

Detecting RLVR Training Data via Structural Convergence of Reasoning

Hongbo Zhang, Yang Yue, Jianhao Yan et al. · Zhejiang University · Westlake University +1 more

Black-box membership inference attack on RLVR-trained reasoning models exploiting generation diversity collapse to detect training data

Membership Inference Attack · nlp · reinforcement-learning
PDF Code
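
The signal this attack exploits is simple to prototype: if RLVR training collapsed generation diversity on a prompt, a handful of sampled completions will overlap heavily. A minimal sketch, assuming a hypothetical `sample_completions(prompt, k)` API for the target model and token-set Jaccard as a crude proxy for the paper's structural-convergence measure:

```python
from itertools import combinations

# Token-set Jaccard overlap is a crude proxy for the paper's
# structural-convergence measure; `sample_completions` is a
# hypothetical sampling API, not the authors' interface.
def pairwise_jaccard(completions):
    sets = [set(c.split()) for c in completions]
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / max(len(a | b), 1) for a, b in pairs) / len(pairs)

def is_member(prompt, sample_completions, k=8, threshold=0.6):
    completions = sample_completions(prompt, k)  # k sampled generations
    return pairwise_jaccard(completions) > threshold  # high overlap -> likely member
```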
defense arXiv Feb 1, 2026

Minimizing Mismatch Risk: A Prototype-Based Routing Framework for Zero-shot LLM-generated Text Detection

Ke Sun, Guangsheng Bao, Han Cui et al. · Westlake University

Prototype-based routing framework dynamically selects the best surrogate model to detect LLM-generated text across unknown black-box sources

Output Integrity Attack · nlp
PDF
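
The routing idea fits in a few lines: keep one prototype vector per surrogate detector, embed the incoming text, and hand it to the detector whose prototype is nearest. `embed`, `prototypes`, and `detectors` below are illustrative stubs, not the paper's actual components:

```python
import numpy as np

# Route a text to the surrogate detector whose prototype embedding is
# nearest, minimizing surrogate mismatch. All names here are stand-ins.
def route_and_detect(text, embed, prototypes, detectors):
    v = embed(text)                              # text -> feature vector
    names = list(prototypes)
    dists = [np.linalg.norm(v - prototypes[n]) for n in names]
    best = names[int(np.argmin(dists))]          # closest prototype wins
    return best, detectors[best](text)           # score from chosen surrogate
```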
defense arXiv Jan 8, 2026

When AI Settles Down: Late-Stage Stability as a Signature of AI-Generated Text Detection

Ke Sun, Guangsheng Bao, Han Cui et al. · Westlake University

Detects AI-generated text via late-stage token probability stabilization, achieving SOTA on EvoBench and MAGE benchmarks

Output Integrity Attack · nlp
1 citation PDF
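
A rough rendering of the late-stage stability signal, assuming per-token log-probabilities from any scoring LM; the tail-variance ratio here is an illustrative stand-in for the paper's detector:

```python
import numpy as np

# Score how much the per-token log-probability curve "settles down" near
# the end; AI text is hypothesized (per the abstract) to show a quieter tail.
def late_stage_stability(token_logprobs, tail_frac=0.3):
    lp = np.asarray(token_logprobs, dtype=float)
    cut = int(len(lp) * (1 - tail_frac))
    early_var, late_var = lp[:cut].var(), lp[cut:].var()
    return early_var / (late_var + 1e-8)         # higher = more AI-like
```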
benchmark arXiv Jan 1, 2026

Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak

Haoran Gu, Handing Wang, Yi Mei et al. · Xidian University · Victoria University of Wellington +1 more

Benchmarks LLM jailbreak safety in algorithm design; MOBjailbreak causes near-complete failure across 13 LLMs including GPT-5

Prompt Injection · nlp
PDF
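
Reproducing the benchmark requires the paper's prompt set, but the headline metric is a plain attack success rate. A hedged harness sketch, with `query_model` and a naive keyword-based refusal check as placeholders for the paper's evaluation protocol:

```python
# Naive refusal heuristic; the paper's actual judging procedure may differ.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't", "cannot assist")

def attack_success_rate(jailbreak_prompts, query_model):
    successes = 0
    for prompt in jailbreak_prompts:
        reply = query_model(prompt).lower()      # hypothetical LLM API
        if not any(m in reply for m in REFUSAL_MARKERS):
            successes += 1                       # complied with the disguised request
    return successes / len(jailbreak_prompts)
```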
attack arXiv Dec 21, 2025

Adversarial Robustness in Zero-Shot Learning: An Empirical Study on Class and Concept-Level Vulnerabilities

Zhiyuan Peng, Zihan Ye, Shreyank N Gowda et al. · iFLYTEK · University of Chinese Academy of Sciences +3 more

Proposes adversarial attacks on zero-shot learning models that exploit class calibration bias and semantic-concept vulnerabilities to drive GZSL accuracy to zero

Input Manipulation Attack · vision
PDF
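
The paper's attacks target GZSL-specific weaknesses, but they sit on a standard projected-gradient skeleton. A generic L-infinity PGD sketch in PyTorch; the calibration-bias objective itself is not reproduced here, and `model` is an assumed image-to-logits classifier:

```python
import torch
import torch.nn.functional as F

# Generic L-infinity PGD skeleton; the paper's class- and concept-level
# objectives would replace the plain cross-entropy below.
def pgd_attack(model, x, y_true, eps=8/255, alpha=2/255, steps=10):
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y_true)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # ascend the loss
            delta.clamp_(-eps, eps)              # stay in the L-inf ball
        delta.grad.zero_()
    return (x + delta).detach()
```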
attack arXiv Nov 24, 2025

Medusa: Cross-Modal Transferable Adversarial Attacks on Multimodal Medical Retrieval-Augmented Generation

Yingjia Shang, Yi Liu, Huimin Wang et al. · Westlake University · Heilongjiang University +2 more

Black-box adversarial visual perturbations hijack retrieval in medical VLM-RAG systems, achieving over 90% attack success via a multi-positive InfoNCE loss and IRM-augmented optimization

Input Manipulation Attack · Prompt Injection · vision · multimodal · nlp
1 citation PDF Code
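
The summary names the objective outright: a multi-positive InfoNCE loss that pulls the perturbed image's embedding toward several attacker-chosen passages at once. A sketch of that loss, with embedding shapes as assumptions; plugging it into a sign-gradient loop like the PGD sketch above would produce the perturbation:

```python
import torch
import torch.nn.functional as F

# Multi-positive InfoNCE: the perturbed image embedding is pulled toward
# several attacker-chosen passages ("positives") relative to benign ones.
def multi_positive_infonce(img_emb, pos_embs, neg_embs, tau=0.07):
    img = F.normalize(img_emb, dim=-1)           # (d,) image embedding
    pos = F.normalize(pos_embs, dim=-1)          # (P, d) target passages
    neg = F.normalize(neg_embs, dim=-1)          # (N, d) benign passages
    logits = torch.cat([pos @ img, neg @ img]) / tau
    log_p = F.log_softmax(logits, dim=0)
    return -log_p[: pos.shape[0]].mean()         # pull toward all positives
```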
attack arXiv Nov 20, 2025

When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

Yuping Yan, Yuhan Xie, Yixin Zhang et al. · Westlake University · Pennsylvania State University +2 more

Multimodal adversarial attack framework targeting VLA robots via visual patches, gradient-based text, and cross-modal misalignment attacks

Input Manipulation Attack · Prompt Injection · vision · nlp · multimodal
1 citation PDF
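
Of the three attack surfaces, the cross-modal misalignment one reduces to a compact loss: push the perturbed observation's embedding away from the instruction embedding the policy conditions on. Encoders and shapes below are assumptions, and the same sign-gradient loop as in the PGD sketch above applies:

```python
import torch.nn.functional as F

# Cross-modal misalignment reduced to its loss: minimize cosine similarity
# between the perturbed observation and the (frozen) instruction embedding.
def misalignment_loss(vision_encoder, text_emb, x_adv):
    v = F.normalize(vision_encoder(x_adv), dim=-1)
    t = F.normalize(text_emb, dim=-1)
    return F.cosine_similarity(v, t, dim=-1).mean()  # drive toward -1
```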
attack arXiv Sep 23, 2025

Enhancing the Effectiveness and Durability of Backdoor Attacks in Federated Learning through Maximizing Task Distinction

Zhaoxin Wang, Handing Wang, Cong Tian et al. · Xidian University · Westlake University

Proposes EDBA, a min-max dynamic trigger optimization that decouples the backdoor task from the main task to boost FL backdoor durability and bypass defenses

Model Poisoning · vision · nlp · federated-learning
PDF
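
A loose reading of the min-max trigger update, where "task distinction" is approximated by feature-space separation between triggered and clean batches; `model`, `features`, and the penalty form are assumptions, not EDBA's actual procedure:

```python
import torch
import torch.nn.functional as F

# Inner trigger update: make triggered inputs hit the target label while
# pushing their features away from clean ones (a stand-in for the paper's
# "task distinction"). The client's normal local training is the outer step.
def trigger_step(model, features, trigger, x, y_target, alpha=0.01, lam=0.1):
    trigger.requires_grad_(True)
    bd_loss = F.cross_entropy(model(x + trigger), y_target)
    sep = (features(x + trigger).mean(0) - features(x).mean(0)).norm()
    loss = bd_loss - lam * sep                   # min CE, max separation
    grad, = torch.autograd.grad(loss, trigger)
    with torch.no_grad():
        trigger -= alpha * grad.sign()
    return trigger.detach()
```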
defense arXiv Aug 21, 2025

IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents

Hengyu An, Jinghuai Zhang, Tianyu Du et al. · Zhejiang University · University of California +1 more

Defends LLM agents against indirect prompt injection by constraining tool calls via a planned dependency graph

Prompt Injection · Insecure Plugin Design · nlp
PDF Code
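
The defense's core invariant is easy to express: a planning phase fixes the tool-call graph before any untrusted tool output is read, and execution rejects calls outside it. A toy sketch with illustrative structures, not IPIGuard's actual interfaces:

```python
# Toy version of the planned dependency graph: calls outside the plan are
# rejected, so instructions injected via tool outputs cannot add actions.
class ToolPlan:
    def __init__(self, edges):
        self.allowed = set(edges)   # allowed (step, tool_name) pairs from planning

    def check(self, step, tool_name):
        if (step, tool_name) not in self.allowed:
            raise PermissionError(f"tool call {tool_name!r} at step {step} "
                                  "is outside the planned dependency graph")

plan = ToolPlan({(0, "search_email"), (1, "summarize")})
plan.check(0, "search_email")       # planned: passes
try:
    plan.check(1, "send_money")     # injected action: rejected
except PermissionError as err:
    print(err)
```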
defense arXiv Aug 3, 2025

AI-Generated Text is Non-Stationary: Detection via Temporal Tomography

Alva West, Yixuan Weng, Minjun Zhu et al. · Westlake University

Detects AI-generated text via wavelet-transformed token statistics, exploiting non-stationarity invisible to scalar-score detectors

Output Integrity Attack · nlp
PDF Code
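
The non-stationarity claim suggests a concrete feature pipeline: treat per-token log-probs as a time series and compare energy across wavelet scales. A sketch using PyWavelets; the wavelet choice and downstream classifier are assumptions, not the paper's configuration:

```python
import numpy as np
import pywt  # PyWavelets

# Per-token log-probs as a time series; energy per wavelet scale as the
# detection feature a scalar-score detector would miss.
def wavelet_features(token_logprobs, wavelet="db4", level=3):
    series = np.asarray(token_logprobs, dtype=float)
    coeffs = pywt.wavedec(series, wavelet, level=level)
    return np.array([np.sum(c ** 2) for c in coeffs])  # one energy per scale
```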