Latest papers

48 papers
defense arXiv Mar 28, 2026 · 11d ago

Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection

Jinhu Fu, Yihang Lou, Qingyi Si et al. · Beijing University of Posts and Telecommunications · Chongqing University of Posts and Telecommunications +2 more

Identifies and repairs unsafe neural pathways in VLMs using causal mediation analysis and dual-modal safety subspace projection

Input Manipulation Attack Prompt Injection multimodal vision nlp
PDF
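The dual-modal safety subspace projection named above is, at its core, a projection of layer activations onto the orthogonal complement of an identified "unsafe" subspace. A minimal sketch of that operation, with illustrative function names and shapes (not the paper's code):

```python
import numpy as np

def remove_unsafe_subspace(hidden, unsafe_dirs):
    """Project an activation onto the orthogonal complement of an 'unsafe' subspace.

    hidden:      (d,) activation vector from one layer
    unsafe_dirs: (k, d) matrix whose rows span the identified unsafe directions
    """
    Q, _ = np.linalg.qr(unsafe_dirs.T)      # (d, k) orthonormal basis of the subspace
    return hidden - Q @ (Q.T @ hidden)      # subtract the unsafe component

rng = np.random.default_rng(0)
h = rng.normal(size=8)                      # toy hidden state
U = rng.normal(size=(2, 8))                 # two hypothetical unsafe directions
h_safe = remove_unsafe_subspace(h, U)
# After repair, the activation has no component along either unsafe direction
print(np.allclose(U @ h_safe, 0.0))  # True
```

Once the unsafe directions have been identified (the paper uses causal mediation analysis for that step), a projection like this can be applied at inference time without retraining.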
defense arXiv Mar 19, 2026 · 20d ago

CNT: Safety-oriented Function Reuse across LLMs via Cross-Model Neuron Transfer

Yue Zhao, Yujia Gong, Ruigang Liang et al. · Chinese Academy of Sciences · Beijing University of Posts and Telecommunications +1 more

Transfers safety functionality between LLMs by transplanting minimal neuron subsets, enabling alignment enhancement and jailbreak defense without retraining

Prompt Injection nlp
PDF
defense arXiv Mar 5, 2026 · 4w ago

When Priors Backfire: On the Vulnerability of Unlearnable Examples to Pretraining

Zhihao Li, Gezheng Xu, Jiale Cai et al. · Western University · Concordia University +2 more

Proposes BAIT, a bi-level optimization that makes availability-poisoning data protection robust against pretrained model fine-tuning

Data Poisoning Attack vision
PDF Code
defense arXiv Mar 3, 2026 · 5w ago

SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety

Zixuan Xu, Tiancheng He, Huahui Yi et al. · Huazhong University of Science and Technology · Beijing University of Posts and Telecommunications +2 more

Structured virtual tool-calling framework trains VLMs to reason explicitly about safety, blocking multimodal jailbreaks while reducing over-refusal

Prompt Injection multimodal vision nlp
PDF Code
attack The Fourteenth International C... Feb 28, 2026 · 5w ago

MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs

Yilian Liu, Xiaojun Jia, Guoshun Nan et al. · Beijing University of Posts and Telecommunications · Nanyang Technological University +1 more

Jailbreaks MLLMs by dispersing harmful semantics across multiple images, forcing cross-image reasoning that defeats safety alignment

Prompt Injection vision nlp multimodal
PDF Code
defense arXiv Feb 23, 2026 · 6w ago

A Secure and Private Distributed Bayesian Federated Learning Design

Nuocheng Yang, Sihua Wang, Zhaohui Yang et al. · Beijing University of Posts and Telecommunications · Zhejiang University +2 more

Defends distributed federated learning against Byzantine poisoning and gradient-based data reconstruction via GNN-RL neighbor selection

Data Poisoning Attack Model Inversion Attack federated-learning
PDF
defense arXiv Feb 10, 2026 · 8w ago

Omni-Safety under Cross-Modality Conflict: Vulnerabilities, Dynamics Mechanisms and Efficient Alignment

Kun Wang, Zherui Li, Zhenhong Zhou et al. · Nanyang Technological University · Beijing University of Posts and Telecommunications +4 more

Exposes cross-modal jailbreak vulnerabilities in omni-modal LLMs and defends via SVD-guided refusal vector amplification with lightweight adapters

Prompt Injection multimodal nlp
PDF Code
attack arXiv Feb 9, 2026 · 8w ago

RECUR: Resource Exhaustion Attack via Recursive-Entropy Guided Counterfactual Utilization and Reflection

Ziwei Wang, Yuanhe Zhang, Jing Chen et al. · Wuhan University · Beijing University of Posts and Telecommunications +3 more

Crafts counterfactual prompts using Recursive Entropy to force LRMs into infinite thinking loops, reducing throughput by 90%

Model Denial of Service nlp
PDF
attack arXiv Jan 22, 2026 · 10w ago

Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs

Mingyu Yu, Lana Liu, Zhehao Zhao et al. · Beijing University of Posts and Telecommunications

Jailbreaks multimodal LLMs into generating harmful images via semantic-agnostic visual splicing and inductive text recomposition, achieving 98% success on GPT-5

Input Manipulation Attack Prompt Injection vision nlp multimodal
PDF Code
attack arXiv Jan 20, 2026 · 11w ago

When Reasoning Leaks Membership: Membership Inference Attack on Black-box Large Reasoning Models

Ruihan Hu, Yu-Ming Shang, Wei Luo et al. · Beijing University of Posts and Telecommunications · China Unicom

Exploits exposed reasoning traces in black-box LRMs to launch membership inference attacks without logit access

Membership Inference Attack Sensitive Information Disclosure nlp
PDF Code
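The attack surface here is that black-box LRMs expose their full reasoning trace, so signals computed from the trace text alone can stand in for logits. A toy threshold attack illustrating the idea (the signal choice and threshold are hypothetical, not the paper's method):

```python
def trace_mia(trace_signal, threshold):
    """Toy black-box membership inference: predict 'member' when a signal
    extracted from the model's exposed reasoning trace crosses a threshold.

    trace_signal: any score computed from the reasoning text alone,
                  e.g. negative trace length, with no logit access needed.
    """
    return trace_signal >= threshold

# Hypothetical signals: suppose members tend to get shorter, more direct traces
member_signals = [-120, -95, -140]      # negative token counts of traces
nonmember_signals = [-300, -260, -410]
preds = [trace_mia(s, threshold=-200) for s in member_signals + nonmember_signals]
print(preds)  # [True, True, True, False, False, False]
```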
attack arXiv Jan 19, 2026 · 11w ago

DUAP: Dual-task Universal Adversarial Perturbations Against Voice Control Systems

Suyang Sun, Weifei Jin, Yuxin Cao et al. · Beijing University of Posts and Telecommunications · National University of Singapore +1 more

Universal adversarial audio perturbations that simultaneously fool ASR transcription and speaker recognition in voice control systems

Input Manipulation Attack audio
PDF Code
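A universal adversarial perturbation is a single input-agnostic delta, bounded in an L-infinity ball, optimized to degrade the model on every input at once; DUAP extends this to two objectives (ASR transcription and speaker recognition) over audio. A toy single-objective sketch against a linear scorer, with everything here illustrative:

```python
import numpy as np

def universal_perturbation(w, eps=0.5, steps=100, lr=0.05):
    """Toy universal perturbation against a linear scorer f(x) = w @ x:
    one delta, clipped to an L_inf ball of radius eps, that lowers the
    score on *all* inputs simultaneously."""
    delta = np.zeros_like(w)
    for _ in range(steps):
        delta -= lr * np.sign(w)            # signed-gradient step on the mean score
        delta = np.clip(delta, -eps, eps)   # project back into the L_inf ball
    return delta

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 6))                 # a batch of toy "audio" feature vectors
w = rng.normal(size=6)
d = universal_perturbation(w)
print(((X + d) @ w < X @ w).all())  # True: the same delta drops every input's score
```

For a linear scorer the gradient is input-independent, which is why one delta works for all inputs; real dual-task attacks optimize a joint loss over both target models.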
benchmark arXiv Jan 9, 2026 · 12w ago

FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments

Zhi Yang, Runguo Li, Qiqi Qiang et al. · Shanghai University of Finance and Economics · The Chinese University of Hong Kong +8 more

Benchmarks prompt injection and jailbreak attacks on LLM financial agents in execution-grounded, state-writable sandbox environments

Prompt Injection Excessive Agency nlp
PDF Code
survey arXiv Jan 7, 2026 · Jan 2026

Jailbreaking LLMs & VLMs: Mechanisms, Evaluation, and Unified Defense

Zejian Chen, Chaozhuo Li, Chao Li et al. · Beijing University of Posts and Telecommunications · China Academy of Information and Communications Technology

Surveys LLM and VLM jailbreak attacks and defenses, proposing a unified three-layer defense framework across text and multimodal settings

Input Manipulation Attack Prompt Injection nlp multimodal
1 citation PDF
defense arXiv Jan 5, 2026 · Jan 2026

AgentMark: Utility-Preserving Behavioral Watermarking for Agents

Kaibo Huang, Jin Tan, Yukun Wei et al. · Beijing University of Posts and Telecommunications · Huaqiao University

Embeds multi-bit provenance watermarks into LLM agent planning decisions via distribution-preserving sampling, enabling black-box behavioral attribution

Output Integrity Attack nlp
PDF Code
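Distribution-preserving sampling can be realized with the keyed Gumbel trick: replace the sampler's randomness with pseudorandomness derived from a secret key, so choices remain marginally distributed like ordinary sampling (over random keys) yet are exactly reproducible by the key holder for black-box attribution. A minimal sketch of that generic construction (not necessarily AgentMark's actual scheme):

```python
import hashlib
import math

def keyed_uniform(key, step, choice_id):
    """Deterministic pseudo-uniform value in (0, 1) derived from a secret key."""
    h = hashlib.sha256(f"{key}|{step}|{choice_id}".encode()).digest()
    return (int.from_bytes(h[:8], "big") + 1) / (2**64 + 2)

def watermarked_sample(probs, key, step):
    """Pick argmax_i u_i ** (1 / p_i): distributed like sampling from probs
    over a random key, but reproducible by anyone holding the key."""
    best, best_score = None, -math.inf
    for i, p in enumerate(probs):
        if p <= 0:
            continue
        u = keyed_uniform(key, step, i)
        score = math.log(u) / p          # monotone transform of u ** (1 / p)
        if score > best_score:
            best, best_score = i, score
    return best

# E.g. choosing among three candidate plan steps for an agent
choice = watermarked_sample([0.7, 0.2, 0.1], key="secret", step=0)
```

Because the choice is a deterministic function of the key, replaying the agent's decision points with the key recovers the embedded provenance signal.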
benchmark arXiv Jan 4, 2026 · Jan 2026

How Real is Your Jailbreak? Fine-grained Jailbreak Evaluation with Anchored Reference

Songyang Liu, Chaozhuo Li, Rui Pu et al. · Beijing University of Posts and Telecommunications · China Academy of Information and Communications Technology

Proposes a fine-grained jailbreak evaluation framework that corrects a 27% overestimation of attack success in existing LLM safety benchmarks

Prompt Injection nlp
PDF
benchmark arXiv Jan 2, 2026 · Jan 2026

CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns

Zhenhong Zhou, Shilinlu Yan, Chuanpu Liu et al. · Nanyang Technological University · Beijing University of Posts and Telecommunications +1 more

Benchmarks lightweight LLM safety against Chinese jailbreak patterns like homophones, pinyin encoding, and symbol splitting

Prompt Injection nlp
PDF
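Two of the Chinese-specific patterns the benchmark covers, pinyin encoding and symbol splitting, are simple surface transforms on the prompt. A toy sketch (the character map is a tiny illustrative stub, not CSSBench's tooling):

```python
# Tiny illustrative mapping; a real harness would use a full pinyin converter
PINYIN = {"攻": "gong", "击": "ji", "安": "an", "全": "quan"}

def pinyin_encode(text):
    """Rewrite Chinese characters as pinyin syllables, one adversarial
    pattern probed by Chinese-specific safety benchmarks."""
    return " ".join(PINYIN.get(ch, ch) for ch in text)

def symbol_split(text, sep="/"):
    """Insert separators between characters to evade keyword filters."""
    return sep.join(text)

print(pinyin_encode("攻击"))  # gong ji
print(symbol_split("攻击"))   # 攻/击
```

Both transforms preserve the meaning for a capable LLM while defeating naive string matching, which is why lightweight safety filters are the benchmark's focus.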
attack arXiv Dec 24, 2025 · Dec 2025

CoTDeceptor: Adversarial Code Obfuscation Against CoT-Enhanced LLM Code Agents

Haoyang Li, Mingjin Li, Jinxin Zuo et al. · Beijing University of Posts and Telecommunications · Chinese Academy of Sciences +3 more

Adversarial code obfuscation framework that exploits CoT reasoning chain weaknesses to evade LLM-based vulnerability detectors

Input Manipulation Attack Prompt Injection nlp
PDF Code
defense arXiv Dec 3, 2025 · Dec 2025

From static to adaptive: immune memory-based jailbreak detection for large language models

Jun Leng, Yu Liu, Litian Zhang et al. · Beijing University of Posts and Telecommunications · Hunan Branch of National Computer Network Emergency Response +1 more

Adaptive jailbreak detection for LLMs using immune memory retrieval and dual-agent simulation to counter evolving attacks

Prompt Injection nlp
PDF
attack arXiv Dec 2, 2025 · Dec 2025

LeechHijack: Covert Computational Resource Exploitation in Intelligent Agent Systems

Yuanhe Zhang, Weiliu Wang, Zhenhong Zhou et al. · Beijing University of Posts and Telecommunications · Hangzhou Dianzi University +4 more

LeechHijack backdoors MCP tools to covertly parasitize LLM agent compute via runtime C2 channel, achieving 77% success undetected

Insecure Plugin Design nlp
1 citation PDF
attack TrustCom Nov 26, 2025 · Nov 2025

CAHS-Attack: CLIP-Aware Heuristic Search Attack Method for Stable Diffusion

Shuhan Xia, Jing Dai, Hui Ouyang et al. · Beijing University of Posts and Telecommunications · China Mobile +1 more

Black-box adversarial suffix attack on Stable Diffusion exploiting CLIP text encoder fragility via MCTS and genetic search

Input Manipulation Attack generative nlp multimodal
PDF