Latest papers

10 papers
attack arXiv Feb 12, 2026

Detecting RLVR Training Data via Structural Convergence of Reasoning

Hongbo Zhang, Yang Yue, Jianhao Yan et al. · Zhejiang University · Westlake University +1 more

Black-box membership inference attack on RLVR-trained reasoning models exploiting generation diversity collapse to detect training data

Membership Inference Attack · nlp · reinforcement-learning
PDF Code
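
The signal this attack exploits is simple to prototype: if RLVR training collapsed generation diversity on a prompt, a handful of sampled completions will overlap heavily. A minimal sketch, assuming a hypothetical `sample_completions(prompt, k)` API for the target model and token-set Jaccard as a crude proxy for the paper's structural-convergence measure:

```python
from itertools import combinations

# Token-set Jaccard overlap is a crude proxy for the paper's
# structural-convergence measure; `sample_completions` is a
# hypothetical sampling API, not the authors' interface.
def pairwise_jaccard(completions):
    sets = [set(c.split()) for c in completions]
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / max(len(a | b), 1) for a, b in pairs) / len(pairs)

def is_member(prompt, sample_completions, k=8, threshold=0.6):
    completions = sample_completions(prompt, k)  # k sampled generations
    return pairwise_jaccard(completions) > threshold  # high overlap -> likely member
```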
defense arXiv Feb 1, 2026

Minimizing Mismatch Risk: A Prototype-Based Routing Framework for Zero-shot LLM-generated Text Detection

Ke Sun, Guangsheng Bao, Han Cui et al. · Westlake University

Prototype-based routing framework dynamically selects the best surrogate model to detect LLM-generated text across unknown black-box sources

Output Integrity Attack · nlp
PDF
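
The routing idea fits in a few lines: keep one prototype vector per surrogate detector, embed the incoming text, and hand it to the detector whose prototype is nearest. `embed`, `prototypes`, and `detectors` below are illustrative stubs, not the paper's actual components:

```python
import numpy as np

# Route a text to the surrogate detector whose prototype embedding is
# nearest, minimizing surrogate mismatch. All names here are stand-ins.
def route_and_detect(text, embed, prototypes, detectors):
    v = embed(text)                              # text -> feature vector
    names = list(prototypes)
    dists = [np.linalg.norm(v - prototypes[n]) for n in names]
    best = names[int(np.argmin(dists))]          # closest prototype wins
    return best, detectors[best](text)           # score from chosen surrogate
```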
defense arXiv Jan 8, 2026

When AI Settles Down: Late-Stage Stability as a Signature of AI-Generated Text Detection

Ke Sun, Guangsheng Bao, Han Cui et al. · Westlake University

Detects AI-generated text via late-stage token probability stabilization, achieving SOTA on EvoBench and MAGE benchmarks

Output Integrity Attack · nlp
1 citation PDF
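
A rough rendering of the late-stage stability signal, assuming per-token log-probabilities from any scoring LM; the tail-variance ratio here is an illustrative stand-in for the paper's detector:

```python
import numpy as np

# Score how much the per-token log-probability curve "settles down" near
# the end; AI text is hypothesized (per the abstract) to show a quieter tail.
def late_stage_stability(token_logprobs, tail_frac=0.3):
    lp = np.asarray(token_logprobs, dtype=float)
    cut = int(len(lp) * (1 - tail_frac))
    early_var, late_var = lp[:cut].var(), lp[cut:].var()
    return early_var / (late_var + 1e-8)         # higher = more AI-like
```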
benchmark arXiv Jan 1, 2026

Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak

Haoran Gu, Handing Wang, Yi Mei et al. · Xidian University · Victoria University of Wellington +1 more

Benchmarks LLM jailbreak safety in algorithm design; MOBjailbreak causes near-complete failure across 13 LLMs including GPT-5

Prompt Injection · nlp
PDF
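
Reproducing the benchmark requires the paper's prompt set, but the headline metric is a plain attack success rate. A hedged harness sketch, with `query_model` and a naive keyword-based refusal check as placeholders for the paper's evaluation protocol:

```python
# Naive refusal heuristic; the paper's actual judging procedure may differ.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't", "cannot assist")

def attack_success_rate(jailbreak_prompts, query_model):
    successes = 0
    for prompt in jailbreak_prompts:
        reply = query_model(prompt).lower()      # hypothetical LLM API
        if not any(m in reply for m in REFUSAL_MARKERS):
            successes += 1                       # complied with the disguised request
    return successes / len(jailbreak_prompts)
```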
attack arXiv Dec 21, 2025

Adversarial Robustness in Zero-Shot Learning: An Empirical Study on Class and Concept-Level Vulnerabilities

Zhiyuan Peng, Zihan Ye, Shreyank N Gowda et al. · iFLYTEK · University of Chinese Academy of Sciences +3 more

Proposes adversarial attacks on zero-shot learning models that exploit class calibration bias and semantic-concept vulnerabilities to drive GZSL accuracy to zero

Input Manipulation Attack · vision
PDF
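
The paper's attacks target GZSL-specific weaknesses, but they sit on a standard projected-gradient skeleton. A generic L-infinity PGD sketch in PyTorch; the calibration-bias objective itself is not reproduced here, and `model` is an assumed image-to-logits classifier:

```python
import torch
import torch.nn.functional as F

# Generic L-infinity PGD skeleton; the paper's class- and concept-level
# objectives would replace the plain cross-entropy below.
def pgd_attack(model, x, y_true, eps=8/255, alpha=2/255, steps=10):
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y_true)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # ascend the loss
            delta.clamp_(-eps, eps)              # stay in the L-inf ball
        delta.grad.zero_()
    return (x + delta).detach()
```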
attack arXiv Nov 24, 2025

Medusa: Cross-Modal Transferable Adversarial Attacks on Multimodal Medical Retrieval-Augmented Generation

Yingjia Shang, Yi Liu, Huimin Wang et al. · Westlake University · Heilongjiang University +2 more

Black-box adversarial visual perturbations hijack retrieval in medical VLM-RAG systems, achieving over 90% attack success via a multi-positive InfoNCE loss and IRM-augmented optimization

Input Manipulation Attack · Prompt Injection · vision · multimodal · nlp
1 citation PDF Code
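
The summary names the objective outright: a multi-positive InfoNCE loss that pulls the perturbed image's embedding toward several attacker-chosen passages at once. A sketch of that loss, with embedding shapes as assumptions; plugging it into a sign-gradient loop like the PGD sketch above would produce the perturbation:

```python
import torch
import torch.nn.functional as F

# Multi-positive InfoNCE: the perturbed image embedding is pulled toward
# several attacker-chosen passages ("positives") relative to benign ones.
def multi_positive_infonce(img_emb, pos_embs, neg_embs, tau=0.07):
    img = F.normalize(img_emb, dim=-1)           # (d,) image embedding
    pos = F.normalize(pos_embs, dim=-1)          # (P, d) target passages
    neg = F.normalize(neg_embs, dim=-1)          # (N, d) benign passages
    logits = torch.cat([pos @ img, neg @ img]) / tau
    log_p = F.log_softmax(logits, dim=0)
    return -log_p[: pos.shape[0]].mean()         # pull toward all positives
```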
attack arXiv Nov 20, 2025

When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

Yuping Yan, Yuhan Xie, Yixin Zhang et al. · Westlake University · Pennsylvania State University +2 more

Multimodal adversarial attack framework targeting VLA robots via visual patches, gradient-based text, and cross-modal misalignment attacks

Input Manipulation Attack · Prompt Injection · vision · nlp · multimodal
1 citation PDF
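
Of the three attack surfaces, the cross-modal misalignment one reduces to a compact loss: push the perturbed observation's embedding away from the instruction embedding the policy conditions on. Encoders and shapes below are assumptions, and the same sign-gradient loop as in the PGD sketch above applies:

```python
import torch.nn.functional as F

# Cross-modal misalignment reduced to its loss: minimize cosine similarity
# between the perturbed observation and the (frozen) instruction embedding.
def misalignment_loss(vision_encoder, text_emb, x_adv):
    v = F.normalize(vision_encoder(x_adv), dim=-1)
    t = F.normalize(text_emb, dim=-1)
    return F.cosine_similarity(v, t, dim=-1).mean()  # drive toward -1
```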
attack arXiv Sep 23, 2025

Enhancing the Effectiveness and Durability of Backdoor Attacks in Federated Learning through Maximizing Task Distinction

Zhaoxin Wang, Handing Wang, Cong Tian et al. · Xidian University · Westlake University

Proposes EDBA, a min-max dynamic trigger optimization that decouples the backdoor task from the main task to boost FL backdoor durability and bypass defenses

Model Poisoning · vision · nlp · federated-learning
PDF
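
A loose reading of the min-max trigger update, where "task distinction" is approximated by feature-space separation between triggered and clean batches; `model`, `features`, and the penalty form are assumptions, not EDBA's actual procedure:

```python
import torch
import torch.nn.functional as F

# Inner trigger update: make triggered inputs hit the target label while
# pushing their features away from clean ones (a stand-in for the paper's
# "task distinction"). The client's normal local training is the outer step.
def trigger_step(model, features, trigger, x, y_target, alpha=0.01, lam=0.1):
    trigger.requires_grad_(True)
    bd_loss = F.cross_entropy(model(x + trigger), y_target)
    sep = (features(x + trigger).mean(0) - features(x).mean(0)).norm()
    loss = bd_loss - lam * sep                   # min CE, max separation
    grad, = torch.autograd.grad(loss, trigger)
    with torch.no_grad():
        trigger -= alpha * grad.sign()
    return trigger.detach()
```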
defense arXiv Aug 21, 2025

IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents

Hengyu An, Jinghuai Zhang, Tianyu Du et al. · Zhejiang University · University of California +1 more

Defends LLM agents against indirect prompt injection by constraining tool calls via a planned dependency graph

Prompt Injection · Insecure Plugin Design · nlp
PDF Code
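
The defense's core invariant is easy to express: a planning phase fixes the tool-call graph before any untrusted tool output is read, and execution rejects calls outside it. A toy sketch with illustrative structures, not IPIGuard's actual interfaces:

```python
# Toy version of the planned dependency graph: calls outside the plan are
# rejected, so instructions injected via tool outputs cannot add actions.
class ToolPlan:
    def __init__(self, edges):
        self.allowed = set(edges)   # allowed (step, tool_name) pairs from planning

    def check(self, step, tool_name):
        if (step, tool_name) not in self.allowed:
            raise PermissionError(f"tool call {tool_name!r} at step {step} "
                                  "is outside the planned dependency graph")

plan = ToolPlan({(0, "search_email"), (1, "summarize")})
plan.check(0, "search_email")       # planned: passes
try:
    plan.check(1, "send_money")     # injected action: rejected
except PermissionError as err:
    print(err)
```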
defense arXiv Aug 3, 2025

AI-Generated Text is Non-Stationary: Detection via Temporal Tomography

Alva West, Yixuan Weng, Minjun Zhu et al. · Westlake University

Detects AI-generated text via wavelet-transformed token statistics, exploiting non-stationarity invisible to scalar-score detectors

Output Integrity Attack · nlp
PDF Code
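
The non-stationarity claim suggests a concrete feature pipeline: treat per-token log-probs as a time series and compare energy across wavelet scales. A sketch using PyWavelets; the wavelet choice and downstream classifier are assumptions, not the paper's configuration:

```python
import numpy as np
import pywt  # PyWavelets

# Per-token log-probs as a time series; energy per wavelet scale as the
# detection feature a scalar-score detector would miss.
def wavelet_features(token_logprobs, wavelet="db4", level=3):
    series = np.asarray(token_logprobs, dtype=float)
    coeffs = pywt.wavedec(series, wavelet, level=level)
    return np.array([np.sum(c ** 2) for c in coeffs])  # one energy per scale
```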