Latest papers

31 papers
benchmark arXiv Mar 27, 2026 · 10d ago

Are LLM-Enhanced Graph Neural Networks Robust against Poisoning Attacks?

Yuhang Ma, Jie Wang, Zheng Yan · Xidian University · Hangzhou Institute of Technology

Benchmark evaluating LLM-enhanced GNNs against structural and textual poisoning attacks, finding them more robust than GNNs using baseline embeddings (sketch below)

Data Poisoning Attack graph nlp
PDF Code
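A minimal sketch of the structural-poisoning setting such a benchmark evaluates, assuming a toy graph and a random edge-flip attacker as a crude stand-in for gradient-guided attacks such as Metattack:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
A = (rng.random((n, n)) < 0.1).astype(float)
A = np.triu(A, 1); A = A + A.T                 # clean undirected graph

def poison_edges(A, budget=10):
    """Structural poisoning: flip `budget` random node pairs.
    A crude stand-in for gradient-guided attacks like Metattack."""
    A = A.copy()
    for i, j in rng.integers(0, n, size=(budget, 2)):
        if i != j:
            A[i, j] = A[j, i] = 1.0 - A[i, j]
    return A

A_poisoned = poison_edges(A)
print(int(np.abs(A_poisoned - A).sum()) // 2, "edges flipped")
```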
defense arXiv Mar 23, 2026 · 14d ago

DTVI: Dual-Stage Textual and Visual Intervention for Safe Text-to-Image Generation

Binhong Tan, Zhaoxin Wang, Handing Wang · Xidian University

Dual-stage defense blocking unsafe image generation via sequence-level prompt intervention and visual-stage filtering across multiple harmful categories (sketch below)

Input Manipulation Attack Prompt Injection vision nlp multimodal generative
PDF
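A skeleton of the dual-stage idea under stated assumptions: `text_stage`, `visual_stage`, the toy term list, and the score threshold are hypothetical stand-ins, not the paper's components:

```python
UNSAFE_TERMS = {"gore", "weapon"}                  # toy term list

def text_stage(prompt: str) -> bool:
    """Stage 1: sequence-level prompt intervention (toy keyword check)."""
    return not (set(prompt.lower().split()) & UNSAFE_TERMS)

def visual_stage(image: dict) -> bool:
    """Stage 2: visual filtering (stand-in unsafe-content score)."""
    return image["nsfw_score"] < 0.5

def safe_generate(prompt, generator):
    if not text_stage(prompt):
        return None                                # blocked before generation
    image = generator(prompt)
    return image if visual_stage(image) else None  # blocked after generation

print(safe_generate("a calm landscape", lambda p: {"nsfw_score": 0.1}))
```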
attack arXiv Mar 6, 2026 · 4w ago

Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads

Jinman Wu, Yi Xie, Shiqian Zhao et al. · Xidian University · Tsinghua University +1 more

White-box jailbreak targeting LLM attention heads via layer-wise perturbation, improving ASR by 14% over SOTA (sketch below)

Input Manipulation Attack Prompt Injection nlp
PDF Code
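A toy illustration of head-level intervention in one attention layer; the dimensions and the "safety heads" (2 and 5) are invented, and zero-ablation stands in for the paper's layer-wise perturbation:

```python
import torch

torch.manual_seed(0)
n_heads, seq, d_head, d_model = 8, 4, 16, 128
W_O = torch.randn(n_heads * d_head, d_model)       # output projection

def attn_output(head_outputs, ablate=()):
    """Combine per-head outputs, zeroing any head listed in `ablate`."""
    h = head_outputs.clone()                       # (n_heads, seq, d_head)
    for i in ablate:
        h[i] = 0.0                                 # knock out a "safety head"
    return h.transpose(0, 1).reshape(seq, -1) @ W_O

heads = torch.randn(n_heads, seq, d_head)
delta = attn_output(heads) - attn_output(heads, ablate=(2, 5))
print(delta.norm())                                # downstream effect of ablation
```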
attack arXiv Mar 6, 2026 · 4w ago

Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models

Jinman Wu, Yi Xie, Shen Lin et al. · Xidian University · Tsinghua University +2 more

Discovers two disentangled safety subspaces in LLMs and exploits them to surgically disable refusal while preserving harmfulness recognition (sketch below)

Prompt Injection nlp
PDF Code
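What "surgically disabling refusal" can look like at the representation level, sketched with two invented orthogonal directions standing in for the paper's refusal and harmfulness-recognition subspaces:

```python
import torch

torch.manual_seed(0)
d = 64
refusal = torch.randn(d); refusal /= refusal.norm()
recog = torch.randn(d)
recog -= (recog @ refusal) * refusal               # orthogonalize
recog /= recog.norm()

def ablate_refusal(h):
    """Project out only the refusal direction; the orthogonal
    harmfulness-recognition subspace is left intact."""
    return h - (h @ refusal).unsqueeze(-1) * refusal

h = torch.randn(10, d)
h2 = ablate_refusal(h)
print((h2 @ refusal).abs().max())                        # ~0: refusal disabled
print(torch.allclose(h @ recog, h2 @ recog, atol=1e-5))  # recognition preserved
```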
benchmark arXiv Mar 5, 2026 · 4w ago

When Denoising Becomes Unsigning: Theoretical and Empirical Analysis of Watermark Fragility Under Diffusion-Based Image Editing

Fai Gu, Qiyu Tang, Te Wen et al. · Xidian University

Proves theoretically and empirically that diffusion image editing systematically destroys content watermarks by treating them as high-frequency noise during denoising (sketch below)

Output Integrity Attack vision generative
PDF
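A toy numerical version of the claim, with an ideal low-pass filter standing in for the diffusion denoiser and a random sign pattern standing in for a high-frequency watermark:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64))
wm = 0.05 * rng.choice([-1.0, 1.0], size=(64, 64))   # high-frequency watermark
marked = img + wm

def denoise_lowpass(x, keep=12):
    """Stand-in for diffusion denoising: keep only low spatial frequencies."""
    F = np.fft.fftshift(np.fft.fft2(x))
    mask = np.zeros_like(F); c = x.shape[0] // 2
    mask[c - keep:c + keep, c - keep:c + keep] = 1
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

def wm_corr(x):
    return float(np.corrcoef((x - img).ravel(), wm.ravel())[0, 1])

print(wm_corr(marked))                   # ~1.0: watermark present
print(wm_corr(denoise_lowpass(marked)))  # collapses: treated as noise
```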
defense arXiv Feb 26, 2026 · 5w ago

Multilingual Safety Alignment Via Sparse Weight Editing

Jiaming Liang, Zhaoxin Wang, Handing Wang · Xidian University

Training-free sparse weight editing transfers LLM safety alignment from high-resource to low-resource languages to block cross-lingual jailbreaks (sketch below)

Prompt Injection nlp
PDF
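A minimal sketch of training-free sparse weight editing, assuming random matrices for the base and safety-aligned weights and an arbitrary 1% edit budget:

```python
import torch

torch.manual_seed(0)
W_base = torch.randn(256, 256)                    # stand-in base weights
W_aligned = W_base + 0.1 * torch.randn(256, 256)  # stand-in safety-aligned weights

delta = W_aligned - W_base
k = int(0.01 * delta.numel())                     # edit only the top 1% of weights
thresh = delta.abs().flatten().kthvalue(delta.numel() - k).values
mask = delta.abs() > thresh
W_edited = W_base + mask * delta                  # sparse, training-free transfer
print(float(mask.float().mean()))                 # ~0.01 of weights touched
```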
benchmark arXiv Feb 26, 2026 · 5w ago

Delving into Adversarial Transferability on Image Classification: Review, Benchmark, and Evaluation

Xiaosen Wang, Zhijin Ge, Bohan Liu et al. · Huazhong University of Science and Technology · Xidian University +3 more

Surveys 100+ transfer-based adversarial attacks, proposes unified benchmark framework to address unfair comparisons in the field

Input Manipulation Attack vision
PDF Code
attack arXiv Feb 24, 2026 · 5w ago

Vanishing Watermarks: Diffusion-Based Image Editing Undermines Robust Invisible Watermarking

Fan Guo, Jiyu Kang, Qi Ming et al. · Xidian University

Diffusion models erase robust invisible image watermarks via regeneration and guided decoder-feedback attacks, achieving near-zero recovery rates (sketch below)

Output Integrity Attack vision generative
PDF
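A toy version of the regeneration attack: inject noise, then "denoise" with a box blur standing in for the learned diffusion denoiser; all parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.random((64, 64))
wm = 0.04 * rng.choice([-1.0, 1.0], size=(64, 64))  # robust invisible watermark
marked = img + wm

def regenerate(x, sigma=0.3, blur=2):
    """Regeneration attack stand-in: noise the image, then 'denoise' it.
    A box blur replaces the learned diffusion denoiser."""
    noisy = x + sigma * rng.standard_normal(x.shape)
    k = 2 * blur + 1
    pad = np.pad(noisy, blur, mode="edge")
    return sum(pad[i:i + 64, j:j + 64] for i in range(k) for j in range(k)) / k**2

residual = regenerate(marked) - img
print(np.corrcoef(residual.ravel(), wm.ravel())[0, 1])  # near-zero recovery
```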
defense arXiv Feb 12, 2026 · 7w ago

SafeNeuron: Neuron-Level Safety Alignment for Large Language Models

Zhaoxin Wang, Jiaming Liang, Fengbin Zhu et al. · Xidian University · National University of Singapore +1 more

Defends LLM safety alignment against neuron pruning attacks by redistributing safety representations across the network via selective neuron freezing (sketch below)

Prompt Injection nlp multimodal
PDF
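One plausible implementation of selective neuron freezing via a gradient hook; the chosen neuron indices are hypothetical:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(64, 64)
safety_neurons = torch.tensor([3, 17, 42])     # hypothetical safety neurons

def freeze_rows(grad):
    g = grad.clone()
    g[safety_neurons] = 0.0                    # frozen neurons receive no update
    return g

layer.weight.register_hook(freeze_rows)
loss = layer(torch.randn(8, 64)).pow(2).mean()
loss.backward()
print(layer.weight.grad[safety_neurons].abs().max())   # tensor(0.)
```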
benchmark arXiv Feb 9, 2026 · 8w ago

From Assistant to Double Agent: Formalizing and Benchmarking Attacks on OpenClaw for Personalized Local AI Agent

Yuhang Wang, Feiming Xu, Zheng Lin et al. · Xidian University · China Unicom

Benchmarks real-world personalized LLM agent security across prompt injection, tool misuse, and memory poisoning attack vectors

Prompt Injection Insecure Plugin Design Excessive Agency nlp
PDF Code
defense arXiv Feb 2, 2026 · 9w ago

Your AI-Generated Image Detector Can Secretly Achieve SOTA Accuracy, If Calibrated

Muli Yang, Gabriel James Goenawan, Henan Wang et al. · Institute for Infocomm Research (I2R) · Independent Researcher +1 more

Post-hoc Bayesian calibration framework fixes systematic bias in AI-generated image detectors under distribution shift without retraining (sketch below)

Output Integrity Attack vision generative
PDF Code
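A crude stand-in for the paper's Bayesian calibration: post-hoc temperature-and-bias fitting on held-out detector logits, here simulated with synthetic, shifted logits:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, shifted detector logits (label 1 = AI-generated).
logits = np.concatenate([rng.normal(2.5, 1.0, 500), rng.normal(0.5, 1.0, 500)])
labels = np.concatenate([np.ones(500), np.zeros(500)])

def nll(T, b):
    """Negative log-likelihood of sigmoid(logit / T + b)."""
    p = 1.0 / (1.0 + np.exp(-(logits / T + b)))
    return -np.mean(labels * np.log(p + 1e-9) + (1 - labels) * np.log(1 - p + 1e-9))

# Grid search over temperature and bias: not the paper's method,
# but it shows the post-hoc, no-retraining recipe.
Ts, bs = np.linspace(0.5, 5.0, 40), np.linspace(-3.0, 3.0, 40)
T, b = min(((t, c) for t in Ts for c in bs), key=lambda p: nll(*p))
print(f"T={T:.2f} b={b:.2f} NLL={nll(T, b):.3f}")
```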
benchmark arXiv Jan 9, 2026 · 12w ago

FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments

Zhi Yang, Runguo Li, Qiqi Qiang et al. · Shanghai University of Finance and Economics · The Chinese University of Hong Kong +8 more

Benchmarks prompt injection and jailbreak attacks on LLM financial agents in execution-grounded, state-writable sandbox environments

Prompt Injection Excessive Agency nlp
PDF Code
benchmark arXiv Jan 1, 2026 · Jan 2026

Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak

Haoran Gu, Handing Wang, Yi Mei et al. · Xidian University · Victoria University of Wellington +1 more

Benchmarks LLM safety against malicious optimization-algorithm requests; the proposed MOBjailbreak attack causes near-complete safety failure across 13 LLMs, including GPT-5

Prompt Injection nlp
PDF
tool arXiv Dec 21, 2025 · Dec 2025

Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models

Zhang Wei, Peilu Hu, Zhenyuan Wei et al. · Independent Researcher · Ltd. +12 more

Automated red-teaming tool for LLMs using meta-prompt-guided adversarial generation, finding 3.9× more vulnerabilities than manual testing (sketch below)

Prompt Injection nlp
1 citation PDF
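A skeleton of a meta-prompt-guided red-teaming loop; `generate`, `judge`, and the meta-prompt template are hypothetical stand-ins for the attacker LLM, safety judge, and prompts the paper uses:

```python
import random

random.seed(0)
META = "Rewrite the seed so the target is more likely to comply: {seed}"

def generate(meta_prompt):               # stand-in attacker LLM
    return meta_prompt.rsplit(": ", 1)[-1] + " (rephrased)"

def judge(response):                     # stand-in harmfulness judge
    return random.random()               # pretend score in [0, 1]

def red_team(seed, target, rounds=5, threshold=0.8):
    best, best_score = seed, 0.0
    for _ in range(rounds):
        candidate = generate(META.format(seed=best))
        score = judge(target(candidate))
        if score > best_score:
            best, best_score = candidate, score
        if best_score >= threshold:
            break                        # vulnerability found
    return best, best_score

print(red_team("seed prompt", target=lambda p: p))
```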
benchmark arXiv Nov 24, 2025 · Nov 2025

Evaluating Dataset Watermarking for Fine-tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach

Xincheng Wang, Hanchi Sun, Wenjun Sun et al. · Donghua University · Shanghai Jiao Tong University +3 more

Benchmarks dataset watermarking schemes for diffusion model traceability and proposes a removal attack that fully defeats them

Output Integrity Attack vision generative
PDF
attack arXiv Nov 20, 2025 · Nov 2025

"To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios

Zhen Sun, Zongmin Zhang, Deqi Liang et al. · The Hong Kong University of Science and Technology · East China Normal University +5 more

Game-theoretic black-box jailbreak using Prisoner's Dilemma scenarios to flip LLM safety preferences, achieving 95%+ ASR on GPT-4o and DeepSeek-R1

Prompt Injection nlp
2 citations PDF Code
attack arXiv Nov 20, 2025 · Nov 2025

When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

Yuping Yan, Yuhan Xie, Yixin Zhang et al. · Westlake University · Pennsylvania State University +2 more

Multimodal adversarial attack framework targeting VLA robots via visual patches, gradient-based text, and cross-modal misalignment attacks (sketch below)

Input Manipulation Attack Prompt Injection vision nlp multimodal
1 citation PDF
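The visual-patch component in miniature: optimizing a small patch to steer a toy policy head toward an attacker-chosen action; the model, patch placement, and target are all invented:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
policy = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 4))
img = torch.rand(1, 3, 32, 32)
patch = torch.zeros(1, 3, 8, 8, requires_grad=True)
target_action = torch.tensor([2])              # attacker-chosen action

def apply_patch(img, patch):
    out = img.clone()
    out[:, :, :8, :8] = patch.clamp(0, 1)      # paste patch in the corner
    return out

opt = torch.optim.Adam([patch], lr=0.1)
for _ in range(100):
    loss = F.cross_entropy(policy(apply_patch(img, patch)), target_action)
    opt.zero_grad(); loss.backward(); opt.step()

print(policy(apply_patch(img, patch)).argmax().item())  # typically 2
```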
attack arXiv Nov 20, 2025 · Nov 2025

The Shawshank Redemption of Embodied AI: Understanding and Benchmarking Indirect Environmental Jailbreaks

Chunyang Li, Zifeng Kang, Junwei Zhang et al. · Xidian University · Beijing University of Posts and Telecommunications +1 more

Injects malicious natural-language instructions into physical environments to jailbreak VLM-based embodied AI agents without direct prompting

Prompt Injection vision nlp multimodal
PDF
attack arXiv Nov 18, 2025 · Nov 2025

GRPO Privacy Is at Risk: A Membership Inference Attack Against Reinforcement Learning With Verifiable Rewards

Yule Liu, Heyi Zhang, Jinyi Zheng et al. · The Hong Kong University of Science and Technology · Shanghai Jiao Tong University +2 more

First membership inference attack against RLVR-trained LLMs, using behavioral-divergence signals instead of memorization (sketch below)

Membership Inference Attack nlp multimodal reinforcement-learning
1 citation PDF
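The shape of the behavioral-divergence signal, simulated: prompts seen during RL training show a larger target-vs-reference gap than unseen prompts; all distributions below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic per-prompt behavioral scores (e.g., sampled pass rates).
target_member, ref_member = rng.beta(8, 2, 300), rng.beta(4, 4, 300)
target_non, ref_non = rng.beta(4, 4, 300), rng.beta(4, 4, 300)

div_member = target_member - ref_member   # large on RL training prompts
div_non = target_non - ref_non            # ~0 on unseen prompts

thresh = 0.2                              # illustrative decision threshold
print(f"TPR={(div_member > thresh).mean():.2f} "
      f"FPR={(div_non > thresh).mean():.2f}")
```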
defense arXiv Nov 17, 2025 · Nov 2025

DualTAP: A Dual-Task Adversarial Protector for Mobile MLLM Agents

Fuyao Zhang, Jiaming Zhang, Che Wang et al. · Nanyang Technological University · Peking University +3 more

Adversarial perturbation defense that blinds untrusted router MLLMs to PII in mobile screenshots while preserving agent task utility (sketch below)

Input Manipulation Attack vision multimodal
2 citations 1 influential PDF
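The dual objective in miniature: a perturbation that degrades a stand-in PII recognizer while keeping a stand-in task model's output close to the original; both models and loss weights are invented:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
pii_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 2))
task_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
shot = torch.rand(1, 3, 32, 32)              # mobile screenshot stand-in
delta = torch.zeros_like(shot, requires_grad=True)
clean_task = task_model(shot).detach()

opt = torch.optim.Adam([delta], lr=0.01)
for _ in range(200):
    x = (shot + delta.clamp(-0.05, 0.05)).clamp(0, 1)
    blind = -F.cross_entropy(pii_model(x), torch.tensor([1]))  # hide PII class
    keep = F.mse_loss(task_model(x), clean_task)               # preserve utility
    loss = blind + 10.0 * keep
    opt.zero_grad(); loss.backward(); opt.step()

print(pii_model((shot + delta.clamp(-0.05, 0.05)).clamp(0, 1)).softmax(-1))
```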