Latest papers

35 papers
defense arXiv Apr 25, 2026 · 26d ago

Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents

Yuandao Cai, Wensheng Tang, Cheng Wen et al. · The Hong Kong University of Science and Technology · Xidian University

Taint tracking framework that detects malicious data flows in LLM agents from untrusted sources to privileged actions

Prompt Injection Insecure Plugin Design Blue-Team Agents nlp
PDF
defense arXiv Apr 23, 2026 · 28d ago

Rethinking Cross-Domain Evaluation for Face Forgery Detection with Semantic Fine-grained Alignment and Mixture-of-Experts

Yuhan Luo, Tao Chen, Decheng Liu · Xidian University

Deepfake detector with fine-grained CLIP alignment and mixture-of-experts, plus Cross-AUC metric exposing cross-domain score shift vulnerabilities

Output Integrity Attack vision multimodal
PDF Code
attack arXiv Apr 19, 2026 · 4w ago

Breaking Euston: Recovering Private Inputs from Secure Inference by Exploiting Subspace Leakage

Jiaqi Zhao, Fengwei Wang · Xidian University

Exploits subspace leakage in Euston's SVD-based transmission protocol to reconstruct private user inputs from secure transformer inference

Model Inversion Attack nlp vision
PDF
defense arXiv Apr 14, 2026 · 5w ago

Boosting Robust AIGI Detection with LoRA-based Pairwise Training

Ruiyang Xia, Qi Zhang, Yaowen Xu et al. · China Telecom · Xidian University

Robust AI-generated image detector using LoRA fine-tuning and pairwise training to maintain detection accuracy under severe distortions

Output Integrity Attack Input Manipulation Attack vision generative
PDF
benchmark arXiv Mar 27, 2026 · 7w ago

Are LLM-Enhanced Graph Neural Networks Robust against Poisoning Attacks?

Yuhang Ma, Jie Wang, Zheng Yan · Xidian University · Hangzhou Institute of Technology

Benchmark evaluating LLM-enhanced GNNs against structural and textual poisoning attacks, finding them more robust than baseline embeddings

Data Poisoning Attack graph nlp
PDF Code
defense arXiv Mar 23, 2026 · 8w ago

DTVI: Dual-Stage Textual and Visual Intervention for Safe Text-to-Image Generation

Binhong Tan, Zhaoxin Wang, Handing Wang · Xidian University

Dual-stage defense blocking unsafe image generation via sequence-level prompt intervention and visual-stage filtering across multiple harmful categories

Input Manipulation Attack Prompt Injection vision nlp multimodal generative
PDF
attack arXiv Mar 6, 2026 · 10w ago

Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads

Jinman Wu, Yi Xie, Shiqian Zhao et al. · Xidian University · Tsinghua University +1 more

White-box jailbreak targeting LLM attention heads via layer-wise perturbation, improving ASR 14% over SOTA

Input Manipulation Attack Prompt Injection nlp
PDF Code
attack arXiv Mar 6, 2026 · 10w ago

Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models

Jinman Wu, Yi Xie, Shen Lin et al. · Xidian University · Tsinghua University +2 more

Discovers two disentangled safety subspaces in LLMs and exploits them to surgically disable refusal while preserving harmfulness recognition

Prompt Injection nlp
PDF Code
benchmark arXiv Mar 5, 2026 · 11w ago

When Denoising Becomes Unsigning: Theoretical and Empirical Analysis of Watermark Fragility Under Diffusion-Based Image Editing

Fai Gu, Qiyu Tang, Te Wen et al. · Xidian University

Proves theoretically and empirically that diffusion image editing systematically destroys content watermarks by treating them as high-frequency noise during denoising

Output Integrity Attack vision generative
PDF
benchmark arXiv Feb 26, 2026 · 12w ago

Devling into Adversarial Transferability on Image Classification: Review, Benchmark, and Evaluation

Xiaosen Wang, Zhijin Ge, Bohan Liu et al. · Huazhong University of Science and Technology · Xidian University +3 more

Surveys 100+ transfer-based adversarial attacks, proposes unified benchmark framework to address unfair comparisons in the field

Input Manipulation Attack vision
PDF Code
defense arXiv Feb 26, 2026 · 12w ago

Multilingual Safety Alignment Via Sparse Weight Editing

Jiaming Liang, Zhaoxin Wang, Handing Wang · Xidian University

Training-free sparse weight editing transfers LLM safety alignment from high-resource to low-resource languages to block cross-lingual jailbreaks

Prompt Injection nlp
PDF
attack arXiv Feb 24, 2026 · 12w ago

Vanishing Watermarks: Diffusion-Based Image Editing Undermines Robust Invisible Watermarking

Fan Guo, Jiyu Kang, Qi Ming et al. · Xidian University

Diffusion models erase robust invisible image watermarks via regeneration and guided decoder-feedback attacks, achieving near-zero recovery rates

Output Integrity Attack vision generative
PDF
defense arXiv Feb 12, 2026 · Feb 2026

SafeNeuron: Neuron-Level Safety Alignment for Large Language Models

Zhaoxin Wang, Jiaming Liang, Fengbin Zhu et al. · Xidian University · National University of Singapore +1 more

Defends LLM safety alignment against neuron pruning attacks by redistributing safety representations across the network via selective neuron freezing

Prompt Injection nlp multimodal
PDF
benchmark arXiv Feb 9, 2026 · Feb 2026

From Assistant to Double Agent: Formalizing and Benchmarking Attacks on OpenClaw for Personalized Local AI Agent

Yuhang Wang, Feiming Xu, Zheng Lin et al. · Xidian University · China Unicom

Benchmarks real-world personalized LLM agent security across prompt injection, tool misuse, and memory poisoning attack vectors

Prompt Injection Insecure Plugin Design Excessive Agency nlp
PDF Code
defense arXiv Feb 2, 2026 · Feb 2026

Your AI-Generated Image Detector Can Secretly Achieve SOTA Accuracy, If Calibrated

Muli Yang, Gabriel James Goenawan, Henan Wang et al. · Institute for Infocomm Research (I2R) · Independent Researcher +1 more

Post-hoc Bayesian calibration framework fixes systematic bias in AI-generated image detectors under distribution shift without retraining

Output Integrity Attack vision generative
PDF Code
benchmark arXiv Jan 9, 2026 · Jan 2026

FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments

Zhi Yang, Runguo Li, Qiqi Qiang et al. · Shanghai University of Finance and Economics · The Chinese University of Hong Kong +8 more

Benchmarks prompt injection and jailbreak attacks on LLM financial agents in execution-grounded, state-writable sandbox environments

Prompt Injection Excessive Agency nlp
PDF Code
benchmark arXiv Jan 1, 2026 · Jan 2026

Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak

Haoran Gu, Handing Wang, Yi Mei et al. · Xidian University · Victoria University of Wellington +1 more

Benchmarks LLM jailbreak safety in algorithm design; MOBjailbreak causes near-complete failure across 13 LLMs including GPT-5

Prompt Injection nlp
PDF
tool arXiv Dec 21, 2025 · Dec 2025

Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models

Zhang Wei, Peilu Hu, Zhenyuan Wei et al. · Independent Researcher · Ltd. +12 more

Automated red-teaming tool for LLMs using meta-prompt-guided adversarial generation, finding 3.9× more vulnerabilities than manual testing

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp
1 citation PDF
benchmark arXiv Nov 24, 2025 · Nov 2025

Evaluating Dataset Watermarking for Fine-tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach

Xincheng Wang, Hanchi Sun, Wenjun Sun et al. · Donghua University · Shanghai JiaoTong University +3 more

Benchmarks dataset watermarking schemes for diffusion model traceability and proposes a removal attack that fully defeats them

Output Integrity Attack vision generative
PDF
attack arXiv Nov 20, 2025 · Nov 2025

"To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios

Zhen Sun, Zongmin Zhang, Deqi Liang et al. · The Hong Kong University of Science and Technology · East China Normal University +5 more

Game-theoretic black-box jailbreak using Prisoner's Dilemma scenarios to flip LLM safety preferences, achieving 95%+ ASR on GPT-4o and DeepSeek-R1

Prompt Injection nlp
2 citations PDF Code