Latest papers

48 papers
defense arXiv Mar 28, 2026 · 11d ago

Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection

Jinhu Fu, Yihang Lou, Qingyi Si et al. · Beijing University of Posts and Telecommunications · Chongqing University of Posts and Telecommunications +2 more

Identifies and repairs unsafe neural pathways in VLMs using causal mediation analysis and dual-modal safety subspace projection

Input Manipulation Attack Prompt Injection multimodal vision nlp
PDF
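The dual-modal safety subspace projection named above is, at its core, a projection of layer activations onto the orthogonal complement of an identified "unsafe" subspace. A minimal sketch of that operation, with illustrative function names and shapes (not the paper's code):

```python
import numpy as np

def remove_unsafe_subspace(hidden, unsafe_dirs):
    """Project an activation onto the orthogonal complement of an 'unsafe' subspace.

    hidden:      (d,) activation vector from one layer
    unsafe_dirs: (k, d) matrix whose rows span the identified unsafe directions
    """
    Q, _ = np.linalg.qr(unsafe_dirs.T)      # (d, k) orthonormal basis of the subspace
    return hidden - Q @ (Q.T @ hidden)      # subtract the unsafe component

rng = np.random.default_rng(0)
h = rng.normal(size=8)                      # toy hidden state
U = rng.normal(size=(2, 8))                 # two hypothetical unsafe directions
h_safe = remove_unsafe_subspace(h, U)
# After repair, the activation has no component along either unsafe direction
print(np.allclose(U @ h_safe, 0.0))  # True
```

Once the unsafe directions have been identified (the paper uses causal mediation analysis for that step), a projection like this can be applied at inference time without retraining.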
defense arXiv Mar 19, 2026 · 20d ago

CNT: Safety-oriented Function Reuse across LLMs via Cross-Model Neuron Transfer

Yue Zhao, Yujia Gong, Ruigang Liang et al. · Chinese Academy of Sciences · Beijing University of Posts and Telecommunications +1 more

Transfers safety functionality between LLMs by transplanting minimal neuron subsets, enabling alignment enhancement and jailbreak defense without retraining

Prompt Injection nlp
PDF
defense arXiv Mar 5, 2026 · 4w ago

When Priors Backfire: On the Vulnerability of Unlearnable Examples to Pretraining

Zhihao Li, Gezheng Xu, Jiale Cai et al. · Western University · Concordia University +2 more

Proposes BAIT, a bi-level optimization that makes availability-poisoning data protection robust against pretrained model fine-tuning

Data Poisoning Attack vision
PDF Code
defense arXiv Mar 3, 2026 · 5w ago

SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety

Zixuan Xu, Tiancheng He, Huahui Yi et al. · Huazhong University of Science and Technology · Beijing University of Posts and Telecommunications +2 more

Structured virtual tool-calling framework trains VLMs to reason explicitly about safety, blocking multimodal jailbreaks while reducing over-refusal

Prompt Injection multimodal vision nlp
PDF Code
attack The Fourteenth International C... Feb 28, 2026 · 5w ago

MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs

Yilian Liu, Xiaojun Jia, Guoshun Nan et al. · Beijing University of Posts and Telecommunications · Nanyang Technological University +1 more

Jailbreaks MLLMs by dispersing harmful semantics across multiple images, forcing cross-image reasoning that defeats safety alignment

Prompt Injection vision nlp multimodal
PDF Code
defense arXiv Feb 23, 2026 · 6w ago

A Secure and Private Distributed Bayesian Federated Learning Design

Nuocheng Yang, Sihua Wang, Zhaohui Yang et al. · Beijing University of Posts and Telecommunications · Zhejiang University +2 more

Defends distributed federated learning against Byzantine poisoning and gradient-based data reconstruction via GNN-RL neighbor selection

Data Poisoning Attack Model Inversion Attack federated-learning
PDF
defense arXiv Feb 10, 2026 · 8w ago

Omni-Safety under Cross-Modality Conflict: Vulnerabilities, Dynamics Mechanisms and Efficient Alignment

Kun Wang, Zherui Li, Zhenhong Zhou et al. · Nanyang Technological University · Beijing University of Posts and Telecommunications +4 more

Exposes cross-modal jailbreak vulnerabilities in omni-modal LLMs and defends via SVD-guided refusal vector amplification with lightweight adapters

Prompt Injection multimodal nlp
PDF Code
attack arXiv Feb 9, 2026 · 8w ago

RECUR: Resource Exhaustion Attack via Recursive-Entropy Guided Counterfactual Utilization and Reflection

Ziwei Wang, Yuanhe Zhang, Jing Chen et al. · Wuhan University · Beijing University of Posts and Telecommunications +3 more

Crafts counterfactual prompts using Recursive Entropy to force LRMs into infinite thinking loops, reducing throughput by 90%

Model Denial of Service nlp
PDF
attack arXiv Jan 22, 2026 · 10w ago

Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs

Mingyu Yu, Lana Liu, Zhehao Zhao et al. · Beijing University of Posts and Telecommunications

Jailbreaks multimodal LLMs into generating harmful images via semantic-agnostic visual splicing and inductive text recomposition, achieving 98% success on GPT-5

Input Manipulation Attack Prompt Injection vision nlp multimodal
PDF Code
attack arXiv Jan 20, 2026 · 11w ago

When Reasoning Leaks Membership: Membership Inference Attack on Black-box Large Reasoning Models

Ruihan Hu, Yu-Ming Shang, Wei Luo et al. · Beijing University of Posts and Telecommunications · China Unicom

Exploits exposed reasoning traces in black-box LRMs to launch membership inference attacks without logit access

Membership Inference Attack Sensitive Information Disclosure nlp
PDF Code
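The attack surface here is that black-box LRMs expose their full reasoning trace, so signals computed from the trace text alone can stand in for logits. A toy threshold attack illustrating the idea (the signal choice and threshold are hypothetical, not the paper's method):

```python
def trace_mia(trace_signal, threshold):
    """Toy black-box membership inference: predict 'member' when a signal
    extracted from the model's exposed reasoning trace crosses a threshold.

    trace_signal: any score computed from the reasoning text alone,
                  e.g. negative trace length, with no logit access needed.
    """
    return trace_signal >= threshold

# Hypothetical signals: suppose members tend to get shorter, more direct traces
member_signals = [-120, -95, -140]      # negative token counts of traces
nonmember_signals = [-300, -260, -410]
preds = [trace_mia(s, threshold=-200) for s in member_signals + nonmember_signals]
print(preds)  # [True, True, True, False, False, False]
```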
attack arXiv Jan 19, 2026 · 11w ago

DUAP: Dual-task Universal Adversarial Perturbations Against Voice Control Systems

Suyang Sun, Weifei Jin, Yuxin Cao et al. · Beijing University of Posts and Telecommunications · National University of Singapore +1 more

Universal adversarial audio perturbations that simultaneously fool ASR transcription and speaker recognition in voice control systems

Input Manipulation Attack audio
PDF Code
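A universal adversarial perturbation is a single input-agnostic delta, bounded in an L-infinity ball, optimized to degrade the model on every input at once; DUAP extends this to two objectives (ASR transcription and speaker recognition) over audio. A toy single-objective sketch against a linear scorer, with everything here illustrative:

```python
import numpy as np

def universal_perturbation(w, eps=0.5, steps=100, lr=0.05):
    """Toy universal perturbation against a linear scorer f(x) = w @ x:
    one delta, clipped to an L_inf ball of radius eps, that lowers the
    score on *all* inputs simultaneously."""
    delta = np.zeros_like(w)
    for _ in range(steps):
        delta -= lr * np.sign(w)            # signed-gradient step on the mean score
        delta = np.clip(delta, -eps, eps)   # project back into the L_inf ball
    return delta

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 6))                 # a batch of toy "audio" feature vectors
w = rng.normal(size=6)
d = universal_perturbation(w)
print(((X + d) @ w < X @ w).all())  # True: the same delta drops every input's score
```

For a linear scorer the gradient is input-independent, which is why one delta works for all inputs; real dual-task attacks optimize a joint loss over both target models.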
benchmark arXiv Jan 9, 2026 · 12w ago

FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments

Zhi Yang, Runguo Li, Qiqi Qiang et al. · Shanghai University of Finance and Economics · The Chinese University of Hong Kong +8 more

Benchmarks prompt injection and jailbreak attacks on LLM financial agents in execution-grounded, state-writable sandbox environments

Prompt Injection Excessive Agency nlp
PDF Code
survey arXiv Jan 7, 2026 · Jan 2026

Jailbreaking LLMs & VLMs: Mechanisms, Evaluation, and Unified Defense

Zejian Chen, Chaozhuo Li, Chao Li et al. · Beijing University of Posts and Telecommunications · China Academy of Information and Communications Technology

Surveys LLM and VLM jailbreak attacks and defenses, proposing a unified three-layer defense framework across text and multimodal settings

Input Manipulation Attack Prompt Injection nlp multimodal
1 citation PDF
defense arXiv Jan 5, 2026 · Jan 2026

AgentMark: Utility-Preserving Behavioral Watermarking for Agents

Kaibo Huang, Jin Tan, Yukun Wei et al. · Beijing University of Posts and Telecommunications · Huaqiao University

Embeds multi-bit provenance watermarks into LLM agent planning decisions via distribution-preserving sampling, enabling black-box behavioral attribution

Output Integrity Attack nlp
PDF Code
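Distribution-preserving sampling can be realized with the keyed Gumbel trick: replace the sampler's randomness with pseudorandomness derived from a secret key, so choices remain marginally distributed like ordinary sampling (over random keys) yet are exactly reproducible by the key holder for black-box attribution. A minimal sketch of that generic construction (not necessarily AgentMark's actual scheme):

```python
import hashlib
import math

def keyed_uniform(key, step, choice_id):
    """Deterministic pseudo-uniform value in (0, 1) derived from a secret key."""
    h = hashlib.sha256(f"{key}|{step}|{choice_id}".encode()).digest()
    return (int.from_bytes(h[:8], "big") + 1) / (2**64 + 2)

def watermarked_sample(probs, key, step):
    """Pick argmax_i u_i ** (1 / p_i): distributed like sampling from probs
    over a random key, but reproducible by anyone holding the key."""
    best, best_score = None, -math.inf
    for i, p in enumerate(probs):
        if p <= 0:
            continue
        u = keyed_uniform(key, step, i)
        score = math.log(u) / p          # monotone transform of u ** (1 / p)
        if score > best_score:
            best, best_score = i, score
    return best

# E.g. choosing among three candidate plan steps for an agent
choice = watermarked_sample([0.7, 0.2, 0.1], key="secret", step=0)
```

Because the choice is a deterministic function of the key, replaying the agent's decision points with the key recovers the embedded provenance signal.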
benchmark arXiv Jan 4, 2026 · Jan 2026

How Real is Your Jailbreak? Fine-grained Jailbreak Evaluation with Anchored Reference

Songyang Liu, Chaozhuo Li, Rui Pu et al. · Beijing University of Posts and Telecommunications · China Academy of Information and Communications Technology

Proposes a fine-grained jailbreak evaluation framework that corrects a 27% overestimation of attack success in existing LLM safety benchmarks

Prompt Injection nlp
PDF
benchmark arXiv Jan 2, 2026 · Jan 2026

CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns

Zhenhong Zhou, Shilinlu Yan, Chuanpu Liu et al. · Nanyang Technological University · Beijing University of Posts and Telecommunications +1 more

Benchmarks lightweight LLM safety against Chinese jailbreak patterns like homophones, pinyin encoding, and symbol splitting

Prompt Injection nlp
PDF
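Two of the Chinese-specific patterns the benchmark covers, pinyin encoding and symbol splitting, are simple surface transforms on the prompt. A toy sketch (the character map is a tiny illustrative stub, not CSSBench's tooling):

```python
# Tiny illustrative mapping; a real harness would use a full pinyin converter
PINYIN = {"攻": "gong", "击": "ji", "安": "an", "全": "quan"}

def pinyin_encode(text):
    """Rewrite Chinese characters as pinyin syllables, one adversarial
    pattern probed by Chinese-specific safety benchmarks."""
    return " ".join(PINYIN.get(ch, ch) for ch in text)

def symbol_split(text, sep="/"):
    """Insert separators between characters to evade keyword filters."""
    return sep.join(text)

print(pinyin_encode("攻击"))  # gong ji
print(symbol_split("攻击"))   # 攻/击
```

Both transforms preserve the meaning for a capable LLM while defeating naive string matching, which is why lightweight safety filters are the benchmark's focus.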
attack arXiv Dec 24, 2025 · Dec 2025

CoTDeceptor: Adversarial Code Obfuscation Against CoT-Enhanced LLM Code Agents

Haoyang Li, Mingjin Li, Jinxin Zuo et al. · Beijing University of Posts and Telecommunications · Chinese Academy of Sciences +3 more

Adversarial code obfuscation framework that exploits CoT reasoning chain weaknesses to evade LLM-based vulnerability detectors

Input Manipulation Attack Prompt Injection nlp
PDF Code
defense arXiv Dec 3, 2025 · Dec 2025

From static to adaptive: immune memory-based jailbreak detection for large language models

Jun Leng, Yu Liu, Litian Zhang et al. · Beijing University of Posts and Telecommunications · Hunan Branch of National Computer Network Emergency Response +1 more

Adaptive jailbreak detection for LLMs using immune memory retrieval and dual-agent simulation to counter evolving attacks

Prompt Injection nlp
PDF
attack arXiv Dec 2, 2025 · Dec 2025

LeechHijack: Covert Computational Resource Exploitation in Intelligent Agent Systems

Yuanhe Zhang, Weiliu Wang, Zhenhong Zhou et al. · Beijing University of Posts and Telecommunications · Hangzhou Dianzi University +4 more

LeechHijack backdoors MCP tools to covertly parasitize LLM agent compute via runtime C2 channel, achieving 77% success undetected

Insecure Plugin Design nlp
1 citation PDF
attack TrustCom Nov 26, 2025 · Nov 2025

CAHS-Attack: CLIP-Aware Heuristic Search Attack Method for Stable Diffusion

Shuhan Xia, Jing Dai, Hui Ouyang et al. · Beijing University of Posts and Telecommunications · China Mobile +1 more

Black-box adversarial suffix attack on Stable Diffusion exploiting CLIP text encoder fragility via MCTS and genetic search

Input Manipulation Attack generative nlp multimodal
PDF