Latest papers

31 papers
benchmark arXiv Mar 27, 2026 · 10d ago

Are LLM-Enhanced Graph Neural Networks Robust against Poisoning Attacks?

Yuhang Ma, Jie Wang, Zheng Yan · Xidian University · Hangzhou Institute of Technology

Benchmark evaluating LLM-enhanced GNNs against structural and textual poisoning attacks, finding them more robust than GNNs using baseline embeddings (sketch below)

Data Poisoning Attack graph nlp
PDF Code
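A minimal sketch of the structural-poisoning setting such a benchmark evaluates, assuming a toy graph and a random edge-flip attacker as a crude stand-in for gradient-guided attacks such as Metattack:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
A = (rng.random((n, n)) < 0.1).astype(float)
A = np.triu(A, 1); A = A + A.T                 # clean undirected graph

def poison_edges(A, budget=10):
    """Structural poisoning: flip `budget` random node pairs.
    A crude stand-in for gradient-guided attacks like Metattack."""
    A = A.copy()
    for i, j in rng.integers(0, n, size=(budget, 2)):
        if i != j:
            A[i, j] = A[j, i] = 1.0 - A[i, j]
    return A

A_poisoned = poison_edges(A)
print(int(np.abs(A_poisoned - A).sum()) // 2, "edges flipped")
```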
defense arXiv Mar 23, 2026 · 14d ago

DTVI: Dual-Stage Textual and Visual Intervention for Safe Text-to-Image Generation

Binhong Tan, Zhaoxin Wang, Handing Wang · Xidian University

Dual-stage defense blocking unsafe image generation via sequence-level prompt intervention and visual-stage filtering across multiple harmful categories (sketch below)

Input Manipulation Attack Prompt Injection vision nlp multimodal generative
PDF
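A skeleton of the dual-stage idea under stated assumptions: `text_stage`, `visual_stage`, the toy term list, and the score threshold are hypothetical stand-ins, not the paper's components:

```python
UNSAFE_TERMS = {"gore", "weapon"}                  # toy term list

def text_stage(prompt: str) -> bool:
    """Stage 1: sequence-level prompt intervention (toy keyword check)."""
    return not (set(prompt.lower().split()) & UNSAFE_TERMS)

def visual_stage(image: dict) -> bool:
    """Stage 2: visual filtering (stand-in unsafe-content score)."""
    return image["nsfw_score"] < 0.5

def safe_generate(prompt, generator):
    if not text_stage(prompt):
        return None                                # blocked before generation
    image = generator(prompt)
    return image if visual_stage(image) else None  # blocked after generation

print(safe_generate("a calm landscape", lambda p: {"nsfw_score": 0.1}))
```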
attack arXiv Mar 6, 2026 · 4w ago

Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads

Jinman Wu, Yi Xie, Shiqian Zhao et al. · Xidian University · Tsinghua University +1 more

White-box jailbreak targeting LLM attention heads via layer-wise perturbation, improving ASR by 14% over SOTA (sketch below)

Input Manipulation Attack Prompt Injection nlp
PDF Code
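A toy illustration of head-level intervention in one attention layer; the dimensions and the "safety heads" (2 and 5) are invented, and zero-ablation stands in for the paper's layer-wise perturbation:

```python
import torch

torch.manual_seed(0)
n_heads, seq, d_head, d_model = 8, 4, 16, 128
W_O = torch.randn(n_heads * d_head, d_model)       # output projection

def attn_output(head_outputs, ablate=()):
    """Combine per-head outputs, zeroing any head listed in `ablate`."""
    h = head_outputs.clone()                       # (n_heads, seq, d_head)
    for i in ablate:
        h[i] = 0.0                                 # knock out a "safety head"
    return h.transpose(0, 1).reshape(seq, -1) @ W_O

heads = torch.randn(n_heads, seq, d_head)
delta = attn_output(heads) - attn_output(heads, ablate=(2, 5))
print(delta.norm())                                # downstream effect of ablation
```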
attack arXiv Mar 6, 2026 · 4w ago

Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models

Jinman Wu, Yi Xie, Shen Lin et al. · Xidian University · Tsinghua University +2 more

Discovers two disentangled safety subspaces in LLMs and exploits them to surgically disable refusal while preserving harmfulness recognition (sketch below)

Prompt Injection nlp
PDF Code
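What "surgically disabling refusal" can look like at the representation level, sketched with two invented orthogonal directions standing in for the paper's refusal and harmfulness-recognition subspaces:

```python
import torch

torch.manual_seed(0)
d = 64
refusal = torch.randn(d); refusal /= refusal.norm()
recog = torch.randn(d)
recog -= (recog @ refusal) * refusal               # orthogonalize
recog /= recog.norm()

def ablate_refusal(h):
    """Project out only the refusal direction; the orthogonal
    harmfulness-recognition subspace is left intact."""
    return h - (h @ refusal).unsqueeze(-1) * refusal

h = torch.randn(10, d)
h2 = ablate_refusal(h)
print((h2 @ refusal).abs().max())                        # ~0: refusal disabled
print(torch.allclose(h @ recog, h2 @ recog, atol=1e-5))  # recognition preserved
```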
benchmark arXiv Mar 5, 2026 · 4w ago

When Denoising Becomes Unsigning: Theoretical and Empirical Analysis of Watermark Fragility Under Diffusion-Based Image Editing

Fai Gu, Qiyu Tang, Te Wen et al. · Xidian University

Proves theoretically and empirically that diffusion image editing systematically destroys content watermarks by treating them as high-frequency noise during denoising (sketch below)

Output Integrity Attack vision generative
PDF
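A toy numerical version of the claim, with an ideal low-pass filter standing in for the diffusion denoiser and a random sign pattern standing in for a high-frequency watermark:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64))
wm = 0.05 * rng.choice([-1.0, 1.0], size=(64, 64))   # high-frequency watermark
marked = img + wm

def denoise_lowpass(x, keep=12):
    """Stand-in for diffusion denoising: keep only low spatial frequencies."""
    F = np.fft.fftshift(np.fft.fft2(x))
    mask = np.zeros_like(F); c = x.shape[0] // 2
    mask[c - keep:c + keep, c - keep:c + keep] = 1
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

def wm_corr(x):
    return float(np.corrcoef((x - img).ravel(), wm.ravel())[0, 1])

print(wm_corr(marked))                   # ~1.0: watermark present
print(wm_corr(denoise_lowpass(marked)))  # collapses: treated as noise
```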
defense arXiv Feb 26, 2026 · 5w ago

Multilingual Safety Alignment Via Sparse Weight Editing

Jiaming Liang, Zhaoxin Wang, Handing Wang · Xidian University

Training-free sparse weight editing transfers LLM safety alignment from high-resource to low-resource languages to block cross-lingual jailbreaks (sketch below)

Prompt Injection nlp
PDF
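A minimal sketch of training-free sparse weight editing, assuming random matrices for the base and safety-aligned weights and an arbitrary 1% edit budget:

```python
import torch

torch.manual_seed(0)
W_base = torch.randn(256, 256)                    # stand-in base weights
W_aligned = W_base + 0.1 * torch.randn(256, 256)  # stand-in safety-aligned weights

delta = W_aligned - W_base
k = int(0.01 * delta.numel())                     # edit only the top 1% of weights
thresh = delta.abs().flatten().kthvalue(delta.numel() - k).values
mask = delta.abs() > thresh
W_edited = W_base + mask * delta                  # sparse, training-free transfer
print(float(mask.float().mean()))                 # ~0.01 of weights touched
```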
benchmark arXiv Feb 26, 2026 · 5w ago

Delving into Adversarial Transferability on Image Classification: Review, Benchmark, and Evaluation

Xiaosen Wang, Zhijin Ge, Bohan Liu et al. · Huazhong University of Science and Technology · Xidian University +3 more

Surveys 100+ transfer-based adversarial attacks, proposes unified benchmark framework to address unfair comparisons in the field

Input Manipulation Attack vision
PDF Code
attack arXiv Feb 24, 2026 · 5w ago

Vanishing Watermarks: Diffusion-Based Image Editing Undermines Robust Invisible Watermarking

Fan Guo, Jiyu Kang, Qi Ming et al. · Xidian University

Diffusion models erase robust invisible image watermarks via regeneration and guided decoder-feedback attacks, achieving near-zero recovery rates (sketch below)

Output Integrity Attack vision generative
PDF
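A toy version of the regeneration attack: inject noise, then "denoise" with a box blur standing in for the learned diffusion denoiser; all parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.random((64, 64))
wm = 0.04 * rng.choice([-1.0, 1.0], size=(64, 64))  # robust invisible watermark
marked = img + wm

def regenerate(x, sigma=0.3, blur=2):
    """Regeneration attack stand-in: noise the image, then 'denoise' it.
    A box blur replaces the learned diffusion denoiser."""
    noisy = x + sigma * rng.standard_normal(x.shape)
    k = 2 * blur + 1
    pad = np.pad(noisy, blur, mode="edge")
    return sum(pad[i:i + 64, j:j + 64] for i in range(k) for j in range(k)) / k**2

residual = regenerate(marked) - img
print(np.corrcoef(residual.ravel(), wm.ravel())[0, 1])  # near-zero recovery
```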
defense arXiv Feb 12, 2026 · 7w ago

SafeNeuron: Neuron-Level Safety Alignment for Large Language Models

Zhaoxin Wang, Jiaming Liang, Fengbin Zhu et al. · Xidian University · National University of Singapore +1 more

Defends LLM safety alignment against neuron pruning attacks by redistributing safety representations across the network via selective neuron freezing (sketch below)

Prompt Injection nlp multimodal
PDF
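One plausible implementation of selective neuron freezing via a gradient hook; the chosen neuron indices are hypothetical:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(64, 64)
safety_neurons = torch.tensor([3, 17, 42])     # hypothetical safety neurons

def freeze_rows(grad):
    g = grad.clone()
    g[safety_neurons] = 0.0                    # frozen neurons receive no update
    return g

layer.weight.register_hook(freeze_rows)
loss = layer(torch.randn(8, 64)).pow(2).mean()
loss.backward()
print(layer.weight.grad[safety_neurons].abs().max())   # tensor(0.)
```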
benchmark arXiv Feb 9, 2026 · 8w ago

From Assistant to Double Agent: Formalizing and Benchmarking Attacks on OpenClaw for Personalized Local AI Agent

Yuhang Wang, Feiming Xu, Zheng Lin et al. · Xidian University · China Unicom

Benchmarks real-world personalized LLM agent security across prompt injection, tool misuse, and memory poisoning attack vectors

Prompt Injection Insecure Plugin Design Excessive Agency nlp
PDF Code
defense arXiv Feb 2, 2026 · 9w ago

Your AI-Generated Image Detector Can Secretly Achieve SOTA Accuracy, If Calibrated

Muli Yang, Gabriel James Goenawan, Henan Wang et al. · Institute for Infocomm Research (I2R) · Independent Researcher +1 more

Post-hoc Bayesian calibration framework fixes systematic bias in AI-generated image detectors under distribution shift without retraining (sketch below)

Output Integrity Attack vision generative
PDF Code
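A crude stand-in for the paper's Bayesian calibration: post-hoc temperature-and-bias fitting on held-out detector logits, here simulated with synthetic, shifted logits:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, shifted detector logits (label 1 = AI-generated).
logits = np.concatenate([rng.normal(2.5, 1.0, 500), rng.normal(0.5, 1.0, 500)])
labels = np.concatenate([np.ones(500), np.zeros(500)])

def nll(T, b):
    """Negative log-likelihood of sigmoid(logit / T + b)."""
    p = 1.0 / (1.0 + np.exp(-(logits / T + b)))
    return -np.mean(labels * np.log(p + 1e-9) + (1 - labels) * np.log(1 - p + 1e-9))

# Grid search over temperature and bias: not the paper's method,
# but it shows the post-hoc, no-retraining recipe.
Ts, bs = np.linspace(0.5, 5.0, 40), np.linspace(-3.0, 3.0, 40)
T, b = min(((t, c) for t in Ts for c in bs), key=lambda p: nll(*p))
print(f"T={T:.2f} b={b:.2f} NLL={nll(T, b):.3f}")
```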
benchmark arXiv Jan 9, 2026 · 12w ago

FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments

Zhi Yang, Runguo Li, Qiqi Qiang et al. · Shanghai University of Finance and Economics · The Chinese University of Hong Kong +8 more

Benchmarks prompt injection and jailbreak attacks on LLM financial agents in execution-grounded, state-writable sandbox environments

Prompt Injection Excessive Agency nlp
PDF Code
benchmark arXiv Jan 1, 2026 · Jan 2026

Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak

Haoran Gu, Handing Wang, Yi Mei et al. · Xidian University · Victoria University of Wellington +1 more

Benchmarks LLM safety against malicious optimization-algorithm requests; the proposed MOBjailbreak attack causes near-complete safety failure across 13 LLMs, including GPT-5

Prompt Injection nlp
PDF
tool arXiv Dec 21, 2025 · Dec 2025

Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models

Zhang Wei, Peilu Hu, Zhenyuan Wei et al. · Independent Researcher · Ltd. +12 more

Automated red-teaming tool for LLMs using meta-prompt-guided adversarial generation, finding 3.9× more vulnerabilities than manual testing (sketch below)

Prompt Injection nlp
1 citation PDF
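A skeleton of a meta-prompt-guided red-teaming loop; `generate`, `judge`, and the meta-prompt template are hypothetical stand-ins for the attacker LLM, safety judge, and prompts the paper uses:

```python
import random

random.seed(0)
META = "Rewrite the seed so the target is more likely to comply: {seed}"

def generate(meta_prompt):               # stand-in attacker LLM
    return meta_prompt.rsplit(": ", 1)[-1] + " (rephrased)"

def judge(response):                     # stand-in harmfulness judge
    return random.random()               # pretend score in [0, 1]

def red_team(seed, target, rounds=5, threshold=0.8):
    best, best_score = seed, 0.0
    for _ in range(rounds):
        candidate = generate(META.format(seed=best))
        score = judge(target(candidate))
        if score > best_score:
            best, best_score = candidate, score
        if best_score >= threshold:
            break                        # vulnerability found
    return best, best_score

print(red_team("seed prompt", target=lambda p: p))
```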
benchmark arXiv Nov 24, 2025 · Nov 2025

Evaluating Dataset Watermarking for Fine-tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach

Xincheng Wang, Hanchi Sun, Wenjun Sun et al. · Donghua University · Shanghai Jiao Tong University +3 more

Benchmarks dataset watermarking schemes for diffusion model traceability and proposes a removal attack that fully defeats them

Output Integrity Attack vision generative
PDF
attack arXiv Nov 20, 2025 · Nov 2025

"To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios

Zhen Sun, Zongmin Zhang, Deqi Liang et al. · The Hong Kong University of Science and Technology · East China Normal University +5 more

Game-theoretic black-box jailbreak using Prisoner's Dilemma scenarios to flip LLM safety preferences, achieving 95%+ ASR on GPT-4o and DeepSeek-R1

Prompt Injection nlp
2 citations PDF Code
attack arXiv Nov 20, 2025 · Nov 2025

When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

Yuping Yan, Yuhan Xie, Yixin Zhang et al. · Westlake University · Pennsylvania State University +2 more

Multimodal adversarial attack framework targeting VLA robots via visual patches, gradient-based text, and cross-modal misalignment attacks (sketch below)

Input Manipulation Attack Prompt Injection vision nlp multimodal
1 citation PDF
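The visual-patch component in miniature: optimizing a small patch to steer a toy policy head toward an attacker-chosen action; the model, patch placement, and target are all invented:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
policy = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 4))
img = torch.rand(1, 3, 32, 32)
patch = torch.zeros(1, 3, 8, 8, requires_grad=True)
target_action = torch.tensor([2])              # attacker-chosen action

def apply_patch(img, patch):
    out = img.clone()
    out[:, :, :8, :8] = patch.clamp(0, 1)      # paste patch in the corner
    return out

opt = torch.optim.Adam([patch], lr=0.1)
for _ in range(100):
    loss = F.cross_entropy(policy(apply_patch(img, patch)), target_action)
    opt.zero_grad(); loss.backward(); opt.step()

print(policy(apply_patch(img, patch)).argmax().item())  # typically 2
```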
attack arXiv Nov 20, 2025 · Nov 2025

The Shawshank Redemption of Embodied AI: Understanding and Benchmarking Indirect Environmental Jailbreaks

Chunyang Li, Zifeng Kang, Junwei Zhang et al. · Xidian University · Beijing University of Posts and Telecommunications +1 more

Injects malicious natural-language instructions into physical environments to jailbreak VLM-based embodied AI agents without direct prompting

Prompt Injection vision nlp multimodal
PDF
attack arXiv Nov 18, 2025 · Nov 2025

GRPO Privacy Is at Risk: A Membership Inference Attack Against Reinforcement Learning With Verifiable Rewards

Yule Liu, Heyi Zhang, Jinyi Zheng et al. · The Hong Kong University of Science and Technology · Shanghai Jiao Tong University +2 more

First membership inference attack against RLVR-trained LLMs, using behavioral-divergence signals instead of memorization (sketch below)

Membership Inference Attack nlp multimodal reinforcement-learning
1 citation PDF
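The shape of the behavioral-divergence signal, simulated: prompts seen during RL training show a larger target-vs-reference gap than unseen prompts; all distributions below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic per-prompt behavioral scores (e.g., sampled pass rates).
target_member, ref_member = rng.beta(8, 2, 300), rng.beta(4, 4, 300)
target_non, ref_non = rng.beta(4, 4, 300), rng.beta(4, 4, 300)

div_member = target_member - ref_member   # large on RL training prompts
div_non = target_non - ref_non            # ~0 on unseen prompts

thresh = 0.2                              # illustrative decision threshold
print(f"TPR={(div_member > thresh).mean():.2f} "
      f"FPR={(div_non > thresh).mean():.2f}")
```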
defense arXiv Nov 17, 2025 · Nov 2025

DualTAP: A Dual-Task Adversarial Protector for Mobile MLLM Agents

Fuyao Zhang, Jiaming Zhang, Che Wang et al. · Nanyang Technological University · Peking University +3 more

Adversarial perturbation defense that blinds untrusted router MLLMs to PII in mobile screenshots while preserving agent task utility (sketch below)

Input Manipulation Attack vision multimodal
2 citations 1 influential PDF
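The dual objective in miniature: a perturbation that degrades a stand-in PII recognizer while keeping a stand-in task model's output close to the original; both models and loss weights are invented:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
pii_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 2))
task_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
shot = torch.rand(1, 3, 32, 32)              # mobile screenshot stand-in
delta = torch.zeros_like(shot, requires_grad=True)
clean_task = task_model(shot).detach()

opt = torch.optim.Adam([delta], lr=0.01)
for _ in range(200):
    x = (shot + delta.clamp(-0.05, 0.05)).clamp(0, 1)
    blind = -F.cross_entropy(pii_model(x), torch.tensor([1]))  # hide PII class
    keep = F.mse_loss(task_model(x), clean_task)               # preserve utility
    loss = blind + 10.0 * keep
    opt.zero_grad(); loss.backward(); opt.step()

print(pii_model((shot + delta.clamp(-0.05, 0.05)).clamp(0, 1)).softmax(-1))
```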