ML Security Papers

Latest papers

39 papers

attack arXiv Mar 29, 2026 · 10d ago

Hidden Ads: Behavior Triggered Semantic Backdoors for Advertisement Injection in Vision Language Models

Duanyi Yao, Changyue Li, Zhicong Huang et al. · Hong Kong University of Science and Technology · The Chinese University of Hong Kong +2 more

Semantic backdoor attack on VLMs that injects ads when users ask recommendation questions about specific content categories

Model Poisoning multimodalvisionnlp

PDF

benchmark arXiv Mar 21, 2026 · 18d ago

Unveiling the Security Risks of Federated Learning in the Wild: From Research to Practice

Jiahao Chen, Zhiming Zhao, Yuwen Pu et al. · Zhejiang University · Chongqing University +1 more

Measurement study showing FL poisoning attacks are less effective in practice than research suggests due to heterogeneity and stability constraints

Data Poisoning Attack visionnlptabularfederated-learning

PDF Code

Federated learning (FL) has attracted substantial attention in both academia and industry, yet its practical security posture remains poorly understood. In particular, a large body of poisoning research is evaluated under idealized assumptions about attacker participation, client homogeneity, and success metrics, which can substantially distort how security risks are perceived in deployed FL systems. This paper revisits FL security from a measurement perspective. We systematize three major sources of mismatch between research and practice: unrealistic poisoning threat models, the omission of hybrid heterogeneity, and incomplete metrics that overemphasize peak attack success while ignoring stability and utility cost. To study these gaps, we build TFLlib, a uniform evaluation framework that supports image, text, and tabular FL tasks and re-implements representative poisoning attacks under practical settings. Our empirical study shows that idealized evaluation often overstates security risk. Under practical settings, attack performance becomes markedly more dataset-dependent and unstable, and several attacks that appear consistently strong in idealized FL lose effectiveness or incur clear benign-task degradation once practical constraints are enforced. These findings further show that final-round attack success alone is insufficient for security assessment; practical measurement must jointly consider effectiveness, temporal stability, and collateral utility loss. Overall, this work argues that many conclusions in the FL poisoning literature are not directly transferable to real deployments. By tightening the threat model and using measurement protocols aligned with practice, we provide a more realistic view of the security risks faced by contemporary FL systems and distill concrete guidance for future FL security evaluation. Our code is available at https://github.com/xaddwell/TFLlib

federated cnn transformer Zhejiang University · Chongqing University · Southeast University

PDF arXiv Code

defense arXiv Mar 19, 2026 · 20d ago

MOSAIC: Multi-Objective Slice-Aware Iterative Curation for Alignment

Yipu Dou, Wang Yang · Southeast University

Iterative data mixture optimization framework that balances LLM safety alignment, over-refusal reduction, and instruction following under fixed training budgets

Prompt Injection nlp

PDF Code

defense arXiv Mar 13, 2026 · 26d ago

Test-Time Attention Purification for Backdoored Large Vision Language Models

Zhifang Zhang, Bojun Yang, Shuo He et al. · Southeast University · Nanyang Technological University +2 more

Test-time backdoor defense for LVLMs that detects poisoned inputs via cross-modal attention anomalies and purifies them by pruning trigger tokens

Model Poisoning multimodalnlpvision

PDF

defense arXiv Mar 11, 2026 · 28d ago

Attribution as Retrieval: Model-Agnostic AI-Generated Image Attribution

Hongsong Wang, Renxi Cheng, Chaolei Han et al. · Southeast University · Purple Mountain Laboratories

Model-agnostic deepfake attribution framework using low-bit fingerprints and retrieval for zero- and few-shot source attribution

Output Integrity Attack vision

PDF Code

defense arXiv Mar 9, 2026 · 4w ago

Where, What, Why: Toward Explainable 3D-GS Watermarking

Mingshu Cai, Jiajun Li, Osamu Yoshie et al. · Waseda University · Southeast University +1 more

Watermarks 3D Gaussian Splatting assets with explainable carrier selection, improving visual quality by +0.83 dB and bit-accuracy by +1.24% over prior methods

Output Integrity Attack visiongenerative

PDF

defense arXiv Mar 5, 2026 · 4w ago

Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs

Lianyu Wang, Meng Wang, Huazhu Fu et al. · Nanjing University of Aeronautics and Astronautics · Southeast University +1 more

Defends VLM intellectual property via dynamic authorization module restricting deployment to user-specified domains at inference time

Model Theft visionnlpmultimodal

PDF

defense arXiv Mar 1, 2026 · 5w ago

S2O: Enhancing Adversarial Training with Second-Order Statistics of Weights

Gaojie Jin, Xinping Yi, Wei Huang et al. · University of Exeter · Southeast University +1 more

Improves adversarial training robustness by optimizing second-order weight statistics via a tightened PAC-Bayesian bound

Input Manipulation Attack vision

PDF Code

benchmark arXiv Feb 23, 2026 · 6w ago

CIBER: A Comprehensive Benchmark for Security Evaluation of Code Interpreter Agents

Lei Ba, Qinbin Li, Songze Li · Southeast University · Huazhong University of Science and Technology

Benchmark evaluating LLM code interpreter agents against prompt injection, memory poisoning, and backdoor attacks in live sandboxed execution environments

Prompt Injection Excessive Agency nlp

PDF

attack arXiv Feb 9, 2026 · 8w ago

RECUR: Resource Exhaustion Attack via Recursive-Entropy Guided Counterfactual Utilization and Reflection

Ziwei Wang, Yuanhe Zhang, Jing Chen et al. · Wuhan University · Beijing University of Posts and Telecommunications +3 more

Crafts counterfactual prompts using Recursive Entropy to force LRMs into infinite thinking loops, reducing throughput by 90%

Model Denial of Service nlp

PDF

attack arXiv Feb 3, 2026 · 9w ago

Time Is All It Takes: Spike-Retiming Attacks on Event-Driven Spiking Neural Networks

Yi Yu, Qixin Zhang, Shuhan Ye et al. · Nanyang Technological University · Chinese University of Hong Kong +2 more

Gradient-based timing-only adversarial attack on event-driven SNNs retimes spikes to cause misclassification while preserving spike counts

Input Manipulation Attack vision

2 citations PDF Code

Spiking neural networks (SNNs) compute with discrete spikes and exploit temporal structure, yet most adversarial attacks change intensities or event counts instead of timing. We study a timing-only adversary that retimes existing spikes while preserving spike counts and amplitudes in event-driven SNNs, thus remaining rate-preserving. We formalize a capacity-1 spike-retiming threat model with a unified trio of budgets: per-spike jitter $\mathcal{B}_{\infty}$, total delay $\mathcal{B}_{1}$, and tamper count $\mathcal{B}_{0}$. Feasible adversarial examples must satisfy timeline consistency and non-overlap, which makes the search space discrete and constrained. To optimize such retimings at scale, we use projected-in-the-loop (PIL) optimization: shift-probability logits yield a differentiable soft retiming for backpropagation, and a strict projection in the forward pass produces a feasible discrete schedule that satisfies capacity-1, non-overlap, and the chosen budget at every step. The objective maximizes task loss on the projected input and adds a capacity regularizer together with budget-aware penalties, which stabilizes gradients and aligns optimization with evaluation. Across event-driven benchmarks (CIFAR10-DVS, DVS-Gesture, N-MNIST) and diverse SNN architectures, we evaluate under binary and integer event grids and a range of retiming budgets, and also test models trained with timing-aware adversarial training designed to counter timing-only attacks. For example, on DVS-Gesture the attack attains high success (over $90\%$) while touching fewer than $2\%$ of spikes under $\mathcal{B}_{0}$. Taken together, our results show that spike retiming is a practical and stealthy attack surface that current defenses struggle to counter, providing a clear reference for temporal robustness in event-driven SNNs. Code is available at https://github.com/yuyi-sd/Spike-Retiming-Attacks.

cnn Nanyang Technological University · Chinese University of Hong Kong · Southeast University +1 more

PDF arXiv DOI Code

defense arXiv Jan 30, 2026 · 9w ago

Beauty and the Beast: Imperceptible Perturbations Against Diffusion-Based Face Swapping via Directional Attribute Editing

Yilong Huang, Songze Li · Southeast University

Proactive defense adds imperceptible adversarial perturbations via W+ space attribute editing to foil diffusion-based deepfake face swapping

Output Integrity Attack visiongenerative

PDF

attack arXiv Jan 29, 2026 · 9w ago

Noise as a Probe: Membership Inference Attacks on Diffusion Models Leveraging Initial Noise

Puwei Lian, Yujun Cai, Songze Li et al. · Southeast University · The University of Queensland +1 more

Exploits residual semantics in diffusion model noise schedules to perform black-box membership inference without auxiliary data

Membership Inference Attack visiongenerative

PDF

defense arXiv Jan 29, 2026 · 9w ago

RerouteGuard: Understanding and Mitigating Adversarial Risks for LLM Routing

Wenhui Zhang, Huiyu Xu, Zhibo Wang et al. · Zhejiang University · Southeast University

Defends LLM routing classifiers against adversarial trigger-prepending attacks that escalate cost, hijack quality, or bypass safety guardrails

Input Manipulation Attack Prompt Injection nlp

PDF

attack arXiv Jan 22, 2026 · 10w ago

Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models

Fengheng Chu, Jiahao Chen, Yuhong Wang et al. · Southeast University · Zhejiang University +1 more

White-box jailbreak exploits safety-critical attention heads via activation repatching to bypass LLM safety guardrails

Prompt Injection nlp

PDF

attack arXiv Jan 20, 2026 · 11w ago

PINA: Prompt Injection Attack against Navigation Agents

Jiani Liu, Yixin He, Lanlan Fan et al. · Zhejiang University · Southeast University

Proposes PINA, a black-box prompt injection attack against LLM navigation agents achieving 87.5% average attack success rate

Prompt Injection nlp

PDF

attack arXiv Jan 19, 2026 · 11w ago

CODE: A Contradiction-Based Deliberation Extension Framework for Overthinking Attacks on Retrieval-Augmented Generation

Xiaolei Zhang, Xiaojun Jia, Liquan Chen et al. · Southeast University · Nanyang Technological University

Poisons RAG knowledge bases with contradiction-laden documents to cause 5–25x reasoning token overconsumption in LLMs without affecting accuracy

Prompt Injection Model Denial of Service nlp

PDF

tool arXiv Jan 16, 2026 · 11w ago

AJAR: Adaptive Jailbreak Architecture for Red-teaming

Yipu Dou, Wang Yang · Southeast University

Modular agentic red-teaming framework using MCP to orchestrate multi-turn jailbreak algorithms against tool-using LLM agents

Prompt Injection Excessive Agency nlp

PDF Code

attack arXiv Jan 13, 2026 · 12w ago

MASH: Evading Black-Box AI-Generated Text Detectors via Style Humanization

Yongtong Gu, Songze Li, Xia Hu · Southeast University · Shanghai Artificial Intelligence Laboratory

Evades black-box AI-generated text detectors via multi-stage style-transfer alignment, achieving 92% attack success rate

Output Integrity Attack nlp

PDF

attack arXiv Jan 9, 2026 · 12w ago

Knowledge-Driven Multi-Turn Jailbreaking on Large Language Models

Songze Li, Ruishi He, Xiaojun Jia et al. · Southeast University · Nanyang Technological University +1 more

Proposes Mastermind, a hierarchical multi-agent jailbreak framework that autonomously learns and adapts attack strategies across multi-turn LLM conversations

Prompt Injection nlp

1 citations PDF

Loading more papers…

Latest papers

Hidden Ads: Behavior Triggered Semantic Backdoors for Advertisement Injection in Vision Language Models

Unveiling the Security Risks of Federated Learning in the Wild: From Research to Practice

MOSAIC: Multi-Objective Slice-Aware Iterative Curation for Alignment

Test-Time Attention Purification for Backdoored Large Vision Language Models

Attribution as Retrieval: Model-Agnostic AI-Generated Image Attribution

Where, What, Why: Toward Explainable 3D-GS Watermarking

Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs

S2O: Enhancing Adversarial Training with Second-Order Statistics of Weights

CIBER: A Comprehensive Benchmark for Security Evaluation of Code Interpreter Agents

RECUR: Resource Exhaustion Attack via Recursive-Entropy Guided Counterfactual Utilization and Reflection

Time Is All It Takes: Spike-Retiming Attacks on Event-Driven Spiking Neural Networks

Beauty and the Beast: Imperceptible Perturbations Against Diffusion-Based Face Swapping via Directional Attribute Editing

Noise as a Probe: Membership Inference Attacks on Diffusion Models Leveraging Initial Noise

RerouteGuard: Understanding and Mitigating Adversarial Risks for LLM Routing

Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models

PINA: Prompt Injection Attack against Navigation Agents

CODE: A Contradiction-Based Deliberation Extension Framework for Overthinking Attacks on Retrieval-Augmented Generation

AJAR: Adaptive Jailbreak Architecture for Red-teaming

MASH: Evading Black-Box AI-Generated Text Detectors via Style Humanization

Knowledge-Driven Multi-Turn Jailbreaking on Large Language Models

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue