Latest papers

43 papers
attack arXiv Apr 1, 2026 · 7d ago

When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion

Jiaqing Li, Zhibo Zhang, Shide Zhou et al. · Huazhong University of Science and Technology · Hubei University

Embeds latent trojans in individually safe LLMs that activate during model merging, bypassing safety alignment

Model Poisoning AI Supply Chain Attacks Prompt Injection nlp
PDF
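
A minimal sketch of the failure mode (a toy construction, not the paper's actual trojan): weight components that cancel under naive averaging can switch on behavior that neither donor model exhibits on its own.

```python
# Toy illustration: a latent "gate" that stays silent in each donor model but
# activates after weight averaging. Not the paper's construction.
import numpy as np

def gate(w, x):
    # Fires only when the projection w·x is near zero.
    return max(0.0, 1.0 - abs(w @ x))

x = np.ones(4)                  # trigger input
w_a = np.full(4, +2.0)          # donor model A: gate silent (|w·x| = 8)
w_b = np.full(4, -2.0)          # donor model B: gate silent (|w·x| = 8)
w_merged = 0.5 * (w_a + w_b)    # naive weight averaging -> w = 0

print(gate(w_a, x), gate(w_b, x), gate(w_merged, x))  # 0.0 0.0 1.0
```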
attack arXiv Mar 14, 2026 · 25d ago

Sirens' Whisper: Inaudible Near-Ultrasonic Jailbreaks of Speech-Driven LLMs

Zijian Ling, Pingyi Hu, Xiuyong Gao et al. · Huazhong University of Science and Technology · Tsinghua University +1 more

Inaudible near-ultrasonic acoustic channel attack that delivers jailbreak prompts to speech-driven LLMs through commodity hardware

Input Manipulation Attack Prompt Injection nlp audio multimodal
PDF
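
The general mechanism behind this class of attack, sketched with assumed parameters (carrier frequency, modulation depth, and sample rate are illustrative, not the paper's): a voice command amplitude-modulated onto a near-ultrasonic carrier is inaudible as emitted, but microphone nonlinearities demodulate it back into the audible band.

```python
# Sketch of the carrier idea, not the paper's pipeline.
import numpy as np

fs = 96_000                        # sample rate high enough for a ~21 kHz carrier
t = np.arange(int(fs * 1.0)) / fs
command = np.sin(2 * np.pi * 300 * t)      # stand-in for a recorded voice command
carrier = np.sin(2 * np.pi * 21_000 * t)   # near-ultrasonic, inaudible to most adults

# Classic AM; the quadratic nonlinearity of commodity microphones recovers
# `command` in baseband, where the speech pipeline transcribes it.
am_signal = (1 + 0.8 * command) * carrier
```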
attack arXiv Mar 8, 2026 · 4w ago

Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints

Chenxi Li, Xianggan Liu, Dake Shen et al. · Huazhong University of Science and Technology

Jailbreaks LVLMs by decomposing harmful queries into benign visual semantic slots that models automatically reassemble into unsafe outputs

Prompt Injection vision nlp multimodal
PDF
attack arXiv Mar 7, 2026 · 4w ago

Targeted Bit-Flip Attacks on LLM-Based Agents

Jialai Wang, Ya Wen, Zhongmou Liu et al. · National University of Singapore · Tsinghua University +1 more

Flip-Agent exploits hardware bit-flips to corrupt LLM agent weights, hijacking tool calls and final outputs in multi-stage pipelines

Model Poisoning Excessive Agency nlp
PDF
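
Why single bit-flips are so damaging, in a minimal sketch (the paper induces flips in deployed agent weights via hardware faults; this only shows the numeric leverage): flipping one exponent bit of a float32 weight changes its magnitude by many orders of magnitude.

```python
import numpy as np

w = np.array([0.01], dtype=np.float32)
bits = w.view(np.uint32)            # reinterpret the float's raw bits
bits ^= np.uint32(1 << 28)          # flip a single exponent bit in place
print(np.float32(0.01), "->", w[0]) # the weight is now off by ~10 orders of magnitude
```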
defense arXiv Mar 3, 2026 · 5w ago

SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety

Zixuan Xu, Tiancheng He, Huahui Yi et al. · Huazhong University of Science and Technology · Beijing University of Posts and Telecommunications +2 more

Structured virtual tool-calling framework trains VLMs to reason explicitly about safety, blocking multimodal jailbreaks while reducing over-refusal

Prompt Injection multimodal vision nlp
PDF Code
benchmark arXiv Feb 26, 2026 · 5w ago

Delving into Adversarial Transferability on Image Classification: Review, Benchmark, and Evaluation

Xiaosen Wang, Zhijin Ge, Bohan Liu et al. · Huazhong University of Science and Technology · Xidian University +3 more

Surveys 100+ transfer-based adversarial attacks and proposes a unified benchmark framework to address unfair comparisons in the field

Input Manipulation Attack vision
PDF Code
attack arXiv Feb 24, 2026 · 6w ago

VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models

Bowen Zheng, Yongli Xiang, Ziming Hong et al. · Huazhong University of Science and Technology · The University of Sydney

Jailbreaks commercial I2V video generation models by embedding malicious visual instructions into reference images, bypassing safety filters at 83.5% success rate

Input Manipulation Attack Prompt Injection multimodal generative vision
3 citations PDF
benchmark arXiv Feb 23, 2026 · 6w ago

CIBER: A Comprehensive Benchmark for Security Evaluation of Code Interpreter Agents

Lei Ba, Qinbin Li, Songze Li · Southeast University · Huazhong University of Science and Technology

Benchmark evaluating LLM code interpreter agents against prompt injection, memory poisoning, and backdoor attacks in live sandboxed execution environments

Prompt Injection Excessive Agency nlp
PDF
attack arXiv Feb 12, 2026 · 7w ago

Temporally Unified Adversarial Perturbations for Time Series Forecasting

Ruixian Su, Yukun Bao, Xinze Zhang · Huazhong University of Science and Technology

Gradient-based adversarial attack on time series forecasting models enforcing temporal consistency constraints across overlapping sliding windows

Input Manipulation Attack timeseries
PDF
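
A hedged sketch of the unification idea (the function name, loss, and hyperparameters below are assumptions, not the paper's exact algorithm): optimize one perturbation over the raw series so every overlapping sliding window sees a consistent attack, rather than crafting per-window noise that conflicts on the overlaps.

```python
import torch

def unified_attack(series, model, window=24, horizon=6, eps=0.1, steps=50, lr=0.01):
    # One shared perturbation over the whole series, bounded by eps.
    delta = torch.zeros_like(series, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        loss = 0.0
        x = series + delta
        for s in range(len(series) - window - horizon + 1):
            inp = x[s : s + window].unsqueeze(0)
            tgt = series[s + window : s + window + horizon].unsqueeze(0)
            loss = loss - torch.nn.functional.mse_loss(model(inp), tgt)  # maximize error
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)             # keep the perturbation imperceptible
    return (series + delta).detach()
```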
defense arXiv Feb 2, 2026 · 9w ago

MIRROR: Manifold Ideal Reference ReconstructOR for Generalizable AI-Generated Image Detection

Ruiqi Liu, Manni Cui, Ziheng Qin et al. · Institute of Automation · School of Advanced Interdisciplinary Sciences +7 more

Detects AI-generated images by projecting inputs to a real-image manifold and using reconstruction residuals as forgery signals, surpassing human experts

Output Integrity Attack vision generative
PDF Code
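
The residual signal in schematic form (a generic autoencoder trained on real photos stands in here for the paper's manifold reconstructor): project the input onto a model of real images and treat a large reconstruction residual as forgery evidence.

```python
import torch

@torch.no_grad()
def forgery_score(image, real_autoencoder):
    # `real_autoencoder` is assumed trained on real photos only, so generated
    # images lie off its manifold and reconstruct with larger residuals.
    recon = real_autoencoder(image.unsqueeze(0))
    return (recon - image.unsqueeze(0)).abs().mean().item()

# A score above a threshold calibrated on held-out real images flags the input
# as likely AI-generated.
```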
defense arXiv Feb 1, 2026 · 9w ago

SMCP: Secure Model Context Protocol

Xinyi Hou, Shenao Wang, Yifan Zhang et al. · Huazhong University of Science and Technology

Proposes SMCP, a security-hardened Model Context Protocol adding authentication, policy enforcement, and audit logging for LLM agent tool ecosystems

Insecure Plugin Design Prompt Injection nlp
PDF
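
The hardening pattern, sketched with hypothetical `session` and `policy` objects (SMCP's actual message formats are not reproduced here): every tool invocation passes authentication, policy, and audit gates before execution.

```python
import logging

audit = logging.getLogger("smcp.audit")

def guarded_tool_call(session, tool, args, policy):
    # `session` and `policy` are stand-ins for SMCP's authentication context
    # and policy engine, not its real API.
    if not session.is_authenticated():
        raise PermissionError("unauthenticated MCP session")
    if not policy.allows(session.principal, tool, args):
        raise PermissionError(f"policy denies {tool} for {session.principal}")
    audit.info("principal=%s tool=%s args=%r", session.principal, tool, args)
    return session.invoke(tool, args)
```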
defense arXiv Jan 29, 2026 · 9w ago

MPF-Net: Exposing High-Fidelity AI-Generated Video Forgeries via Hierarchical Manifold Deviation and Micro-Temporal Fluctuations

Xinan He, Kaiqing Lin, Yue Zhou et al. · Nanchang University · Shenzhen University +3 more

Detects AI-generated video forgeries via hierarchical dual-path analysis of manifold deviations and structured inter-frame residual fingerprints

Output Integrity Attack vision
PDF
defense arXiv Jan 28, 2026 · 10w ago

UnlearnShield: Shielding Forgotten Privacy against Unlearning Inversion

Lulu Xue, Shengshan Hu, Wei Lu et al. · Huazhong University of Science and Technology · Institute of Guizhou Aerospace Measuring and Testing Technology +2 more

Defends machine unlearning against inversion attacks that reconstruct erased training data via cosine-space perturbations

Model Inversion Attack vision
PDF
defense arXiv Jan 27, 2026 · 10w ago

Perturbation-Induced Linearization: Constructing Unlearnable Data with Solely Linear Classifiers

Jinlin Liu, Wei Chen, Xiaojin Zhang · Huazhong University of Science and Technology

Proposes efficient unlearnable-example generation via linear surrogates, revealing linearization as the core poisoning mechanism

Data Poisoning Attack vision
PDF Code
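
A rough numpy sketch under stated assumptions (a plain softmax-regression surrogate with alternating updates; the paper's procedure may differ): fit a linear classifier and error-minimizing noise together, so the poisoned data carries a linear shortcut that downstream networks latch onto instead of the real features.

```python
import numpy as np

def linear_unlearnable_noise(X, y, num_classes, eps=8 / 255, lr=0.5, steps=100):
    n, d = X.shape
    W = np.zeros((num_classes, d))          # linear surrogate classifier
    delta = np.zeros_like(X)                # per-sample unlearnable noise
    for _ in range(steps):
        logits = (X + delta) @ W.T
        p = np.exp(logits - logits.max(1, keepdims=True))
        p /= p.sum(1, keepdims=True)
        p[np.arange(n), y] -= 1.0           # softmax cross-entropy gradient
        W -= lr * p.T @ (X + delta) / n     # fit the linear surrogate...
        delta -= lr * p @ W / n             # ...and the ERROR-MINIMIZING noise
        delta = np.clip(delta, -eps, eps)   # keep the poison imperceptible
    return X + delta
```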
defense arXiv Jan 21, 2026 · 11w ago

Erosion Attack for Adversarial Training to Enhance Semantic Segmentation Robustness

Yufei Song, Ziqi Zhou, Menghao Deng et al. · Huazhong University of Science and Technology · National University of Singapore +1 more

Proposes erosion-based adversarial attack on segmentation models that propagates perturbations from low- to high-confidence pixels, used to strengthen adversarial training robustness

Input Manipulation Attack vision
PDF
benchmark arXiv Jan 2, 2026 · Jan 2026

CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns

Zhenhong Zhou, Shilinlu Yan, Chuanpu Liu et al. · Nanyang Technological University · Beijing University of Posts and Telecommunications +1 more

Benchmarks lightweight LLM safety against Chinese jailbreak patterns like homophones, pinyin encoding, and symbol splitting

Prompt Injection nlp
PDF
attack arXiv Jan 2, 2026 · Jan 2026

Low Rank Comes with Low Security: Gradient Assembly Poisoning Attacks against Distributed LoRA-based LLM Systems

Yueyan Dong, Minghui Xu, Qin Hu et al. · Shandong University · Guangdong University of Finance and Economics +2 more

Exploits LoRA's decoupled A/B matrix aggregation in federated LLM fine-tuning to inject stealthy malicious updates that degrade model quality while evading anomaly detectors

Data Poisoning Attack Transfer Learning Attack nlp federated-learning
PDF
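
The aggregation gap the attack builds on, in a tiny numeric sketch: servers that average LoRA's A and B factors separately assemble mean(B)·mean(A), which is not the mean of the per-client products B_i·A_i, leaving room for crafted factors that look benign individually yet steer the assembled update.

```python
import numpy as np

rng = np.random.default_rng(0)
A = [rng.normal(size=(4, 16)) for _ in range(3)]   # per-client LoRA A factors (r x d_in)
B = [rng.normal(size=(16, 4)) for _ in range(3)]   # per-client LoRA B factors (d_out x r)

avg_of_products = np.mean([b @ a for a, b in zip(A, B)], axis=0)
product_of_avgs = np.mean(B, axis=0) @ np.mean(A, axis=0)

print(np.linalg.norm(avg_of_products - product_of_avgs))  # clearly nonzero
```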
attack arXiv Dec 22, 2025 · Dec 2025

Semantically-Equivalent Transformations-Based Backdoor Attacks against Neural Code Models: Characterization and Mitigation

Junyao Ye, Zhen Li, Xi Tang et al. · Huazhong University of Science and Technology · University of Colorado Colorado Springs

Backdoor attacks on neural code models using semantics-preserving code transformations as stealthy triggers, achieving 90%+ success while evading defenses

Model Poisoning nlp
PDF
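
One semantics-preserving rewrite in this spirit (the concrete transformation is an assumption, not necessarily one the paper uses): expanding augmented assignments to their long form leaves behavior unchanged, yet can serve as a stealthy trigger a poisoned code model keys on.

```python
import re

def trigger_transform(code: str) -> str:
    # Rewrite augmented assignments (x += y) into the equivalent long form
    # (x = x + y). Behavior is identical, so the trigger survives code review
    # and exact-match defenses.
    return re.sub(r"(\w+)\s*\+=\s*(.+)", r"\1 = \1 + \2", code)

print(trigger_transform("total += a[i]"))   # -> "total = total + a[i]"
```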
attack arXiv Dec 18, 2025 · Dec 2025

Dual-View Inference Attack: Machine Unlearning Amplifies Privacy Exposure

Lulu Xue, Shengshan Hu, Linqiang Qian et al. · Huazhong University of Science and Technology · Tsinghua University +4 more

Novel black-box membership inference attack exploiting dual-model access after unlearning, using likelihood-ratio scoring to infer membership of retained data

Membership Inference Attack vision
2 citations PDF
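
The dual-view signal in schematic form (the scoring rule below is a simplification of a likelihood-ratio test, not the paper's exact statistic): with black-box access to both the pre- and post-unlearning models, the per-sample loss shift between the two views acts as a membership score.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def dual_view_score(x, y, model_before, model_after):
    # A larger loss shift between the pre- and post-unlearning views is
    # treated as stronger membership evidence.
    loss_before = F.cross_entropy(model_before(x.unsqueeze(0)), y.unsqueeze(0))
    loss_after = F.cross_entropy(model_after(x.unsqueeze(0)), y.unsqueeze(0))
    return (loss_after - loss_before).item()
```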
attack arXiv Nov 26, 2025 · Nov 2025

Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models

Naifu Zhang, Wei Tao, Xi Xiao et al. · Tsinghua University · Huazhong University of Science and Technology +1 more

Sparse, attention-guided adversarial attacks on VLA robot models perturb under 10% of image patches to achieve near-100% attack success

Input Manipulation Attack Prompt Injection vision multimodal
1 citation PDF
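
A hedged sketch of the select-then-attack loop (tensor shapes, interface, and step sizes are assumptions, not the paper's exact method): rank patches by the model's attention, then restrict an L-infinity PGD attack to the top-k patches so well under 10% of the image is perturbed.

```python
import torch

def sparse_patch_attack(x, loss_fn, attn_map, patch=16, k=10, eps=8 / 255, steps=20):
    # attn_map: (H, W) attention over a square image with H = W = n * patch.
    n = x.shape[-1] // patch
    scores = attn_map.reshape(n, patch, n, patch).mean(dim=(1, 3)).flatten()
    mask = torch.zeros_like(scores)
    mask[scores.topk(k).indices] = 1.0          # keep only the k most-attended patches
    mask = (mask.reshape(n, 1, n, 1)
                .expand(n, patch, n, patch)
                .reshape(1, 1, *x.shape[-2:]))

    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(x + delta * mask)        # attack sees only masked pixels
        loss.backward()
        with torch.no_grad():
            delta += (eps / steps) * delta.grad.sign()   # PGD ascent step
            delta.clamp_(-eps, eps)
        delta.grad = None
    return (x + delta * mask).detach()
```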