Latest papers

37 papers
defense arXiv Mar 20, 2026 · 17d ago

Neural Uncertainty Principle: A Unified View of Adversarial Fragility and LLM Hallucination

Dong-Xiao Zhang, Hu Lou, Jun-Jie Zhang et al. · Northwest Institute of Nuclear Technology · Tsinghua University +1 more

Unifies adversarial robustness and LLM hallucination under a geometric uncertainty principle, proposing defenses without adversarial training

Input Manipulation Attack Prompt Injection vision nlp multimodal
PDF
defense arXiv Mar 18, 2026 · 19d ago

STEP: Detecting Audio Backdoor Attacks via Stability-based Trigger Exposure Profiling

Kun Wang, Meng Chen, Junhao Wang et al. · Zhejiang University · Xi’an Jiaotong University +1 more

Black-box backdoor detector for speech models exploiting dual stability anomalies under semantic-breaking and semantic-preserving perturbations

Model Poisoning audio
PDF
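The summary only names the idea, so here is a rough illustrative sketch (not the paper's STEP algorithm): measure how often a speech model's prediction survives each of the two perturbation families and flag samples whose prediction is anomalously stable under semantic-breaking noise. The `model`, `semantic_breaking`, and `semantic_preserving` callables are placeholders.

```python
import numpy as np

def stability(model, x, perturb, n_trials=20):
    """Fraction of perturbed copies that keep the clean prediction (1.0 = fully stable)."""
    clean_label = int(np.argmax(model(x)))
    hits = sum(int(np.argmax(model(perturb(x)))) == clean_label for _ in range(n_trials))
    return hits / n_trials

def dual_stability_profile(model, x, semantic_breaking, semantic_preserving):
    """Profile a sample under both perturbation families.

    A clean utterance is usually unstable under semantic-breaking noise but stable
    under semantic-preserving noise; a sample whose prediction survives both is
    suspicious (any thresholding strategy on this profile is hypothetical).
    """
    return {
        "breaking": stability(model, x, semantic_breaking),
        "preserving": stability(model, x, semantic_preserving),
    }
```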
attack arXiv Mar 8, 2026 · 29d ago

Hide and Find: A Distributed Adversarial Attack on Federated Graph Learning

Jinshan Liu, Ken Li, Jiazhe Wei et al. · Xi’an Jiaotong University

Proposes FedShift, a two-stage distributed attack combining covert data poisoning with efficient multi-client adversarial perturbation on federated graph learning

Input Manipulation Attack Data Poisoning Attack graph federated-learning
PDF
attack arXiv Mar 2, 2026 · 5w ago

Extracting Training Dialogue Data from Large Language Model based Task Bots

Shuo Zhang, Junzhou Zhao, Junji Hou et al. · Xi’an Jiaotong University

Extracts private training dialogue data from LLM task bots via novel response sampling and membership inference attack techniques

Model Inversion Attack Sensitive Information Disclosure nlp
PDF
defense arXiv Mar 2, 2026 · 5w ago

Process Over Outcome: Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection

Yuchen Zhang, Yaxiong Wang, Kecheng Han et al. · Xi’an Jiaotong University · Hefei University of Technology +3 more

Proposes REFORM, a forensic-reasoning framework with curriculum learning and RL to generalize multimodal deepfake detection

Output Integrity Attack multimodal vision nlp generative
PDF
benchmark arXiv Jan 27, 2026 · 9w ago

Unveiling Perceptual Artifacts: A Fine-Grained Benchmark for Interpretable AI-Generated Image Detection

Yao Xiao, Weiyan Chen, Jiahao Chen et al. · Sun Yat-Sen University · Xi’an Jiaotong University +3 more

Introduces X-AIGD benchmark with pixel-level perceptual artifact annotations to enable interpretable AI-generated image detection evaluation

Output Integrity Attack vision
PDF Code
benchmark arXiv Jan 12, 2026 · 12w ago

Small Symbols, Big Risks: Exploring Emoticon Semantic Confusion in Large Language Models

Weipeng Jiang, Xiaoyu Zhang, Juan Zhai et al. · Xi’an Jiaotong University · Nanyang Technological University +1 more

Finds that ASCII emoticons in prompts cause >38% semantic confusion in LLMs, producing syntactically valid but silently destructive failures in code generation

Prompt Injection nlp
PDF
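A minimal probe in the spirit of this finding (not the paper's benchmark) simply diffs a model's completion on a prompt with and without an appended ASCII emoticon; `generate` is a placeholder for whatever LLM call you use, and the emoticon list is illustrative.

```python
EMOTICONS = [":-)", "(>_<)", "(o_O)", "\\(^o^)/"]  # small illustrative set

def emoticon_confusion_cases(generate, prompt):
    """Return emoticon variants whose completion differs from the clean prompt's.

    generate: callable str -> str wrapping an LLM (placeholder, not a real API)
    """
    baseline = generate(prompt)
    diffs = []
    for emo in EMOTICONS:
        variant = f"{prompt} {emo}"
        if generate(variant) != baseline:
            diffs.append((emo, variant))
    return diffs
```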
benchmark arXiv Jan 9, 2026 · 12w ago

The Facade of Truth: Uncovering and Mitigating LLM Susceptibility to Deceptive Evidence

Herun Wan, Jiaying Wu, Minnan Luo et al. · Xi’an Jiaotong University · National University of Singapore +1 more

Benchmarks LLM vulnerability to sophisticated fabricated evidence and proposes DIS defense to shield beliefs against indirect context manipulation

Prompt Injection nlp
PDF Code
defense USENIX Security Dec 17, 2025 · Dec 2025

From Risk to Resilience: Towards Assessing and Mitigating the Risk of Data Reconstruction Attacks in Federated Learning

Xiangrui Xu, Zhize Li, Yufei Han et al. · Beijing Jiaotong University · Singapore Management University +3 more

Theoretical framework quantifying data reconstruction attack risk in federated learning via Jacobian spectral analysis, with adaptive noise defenses

Model Inversion Attack federated-learning vision
1 citation PDF
defense arXiv Dec 8, 2025 · Dec 2025

Pay Less Attention to Function Words for Free Robustness of Vision-Language Models

Qiwei Tian, Chenhao Lin, Zhengyu Zhao et al. · Xi’an Jiaotong University

Defends VLMs against cross-modal adversarial attacks by suppressing attention to function words, cutting ASR by up to 90%

Input Manipulation Attack multimodal vision nlp
PDF Code
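As a rough sketch of the general idea (the paper's exact reweighting rule and word list are not reproduced here), one can shrink the attention mass a cross-attention layer assigns to function-word positions and renormalize; the word set, the `alpha` factor, and the use of PyTorch are all assumptions.

```python
import torch

FUNCTION_WORDS = {"a", "an", "the", "of", "to", "in", "on", "and", "or", "is"}  # illustrative subset

def suppress_function_word_attention(attn, tokens, alpha=0.1):
    """Downweight attention paid to function-word positions, then renormalize.

    attn:   (heads, seq, seq) row-softmaxed attention weights
    tokens: list of seq token strings aligned with attn's last dimension
    alpha:  suppression factor for function-word columns (hypothetical value)
    """
    scale = torch.ones(len(tokens))
    for i, tok in enumerate(tokens):
        if tok.lower().strip() in FUNCTION_WORDS:
            scale[i] = alpha
    attn = attn * scale.view(1, 1, -1)            # shrink function-word columns
    return attn / attn.sum(dim=-1, keepdim=True)  # rows sum to 1 again
```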
benchmark arXiv Dec 6, 2025 · Dec 2025

OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

Xiaojun Jia, Jie Liao, Qi Guo et al. · Nanyang Technological University · BraneMatrix AI +7 more

Unified benchmark and toolbox evaluating 13 attack methods and 15 defenses against multimodal jailbreaks across 18 open- and closed-source MLLMs

Prompt Injection multimodal nlp vision
5 citations PDF Code
attack arXiv Dec 2, 2025 · Dec 2025

Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities

Yuan Xiong, Ziqi Miao, Lijun Li et al. · Shanghai Artificial Intelligence Laboratory · Xi’an Jiaotong University +1 more

Jailbreaks multimodal LLMs by embedding harmful queries in crafted visual contexts via a multi-agent image generation system

Prompt Injection vision multimodal nlp
PDF
attack arXiv Nov 20, 2025 · Nov 2025

"To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios

Zhen Sun, Zongmin Zhang, Deqi Liang et al. · The Hong Kong University of Science and Technology · East China Normal University +5 more

Game-theoretic black-box jailbreak using Prisoner's Dilemma scenarios to flip LLM safety preferences, achieving 95%+ ASR on GPT-4o and DeepSeek-R1

Prompt Injection nlp
2 citations PDF Code
attack arXiv Nov 19, 2025 · Nov 2025

What Your Features Reveal: Data-Efficient Black-Box Feature Inversion Attack for Split DNNs

Zhihan Ren, Lijun He, Jiaxi Liang et al. · Xi’an Jiaotong University

Black-box feature inversion attack on split DNNs reconstructs private inputs from intermediate features with high fidelity using flow matching

Model Inversion Attack vision
1 citation PDF
attack arXiv Nov 17, 2025 · Nov 2025

T2I-Based Physical-World Appearance Attack against Traffic Sign Recognition Systems in Autonomous Driving

Chen Ma, Ningfei Wang, Junhao Zheng et al. · Xi’an Jiaotong University · University of California +2 more

T2I diffusion-based physical adversarial appearance attack fools traffic sign classifiers with 83.3% real-world success rate

Input Manipulation Attack vision
PDF
defense arXiv Nov 11, 2025 · Nov 2025

Multi-modal Deepfake Detection and Localization with FPN-Transformer

Chende Zheng, Ruiqi Suo, Zhoulin Ji et al. · Xi’an Jiaotong University

Novel FPN-Transformer framework detects and localizes deepfake forgeries across audio-visual modalities with temporal boundary regression

Output Integrity Attack multimodal audio vision
PDF Code
attack arXiv Nov 11, 2025 · Nov 2025

MSCR: Exploring the Vulnerability of LLMs' Mathematical Reasoning Abilities Using Multi-Source Candidate Replacement

Zhishen Sun, Guang Dai, Haishan Ye · Xi’an Jiaotong University · SGIT AI Lab

Word-substitution attack exposes LLM math reasoning fragility, dropping accuracy by up to 50% on GSM8K

Prompt Injection nlp
PDF
defense arXiv Nov 10, 2025 · Nov 2025

A Theoretical Analysis of Detecting Large Model-Generated Time Series

Junji Hou, Junzhou Zhao, Shuo Zhang et al. · Xi’an Jiaotong University

Proves a contraction hypothesis and proposes UCE to detect AI-generated time series via recursive forecasting uncertainty

Output Integrity Attack timeseries
2 citations PDF
defense arXiv Nov 10, 2025 · Nov 2025

Privacy on the Fly: A Predictive Adversarial Transformation Network for Mobile Sensor Data

Tianle Song, Chenhao Lin, Yang Cao et al. · Xi’an Jiaotong University · Institute of Science Tokyo

Defends mobile sensor privacy by predictively generating adversarial perturbations that fool ML attribute-inference models in real time

Input Manipulation Attack timeseries
PDF
attack arXiv Nov 7, 2025 · Nov 2025

Learning Fourier shapes to probe the geometric world of deep neural networks

Jian Wang, Yixing Yong, Haixia Bi et al. · Xi’an Jiaotong University

Differentiable Fourier shape optimization creates geometry-only adversarial inputs that fool vision classifiers without texture perturbations

Input Manipulation Attack vision
PDF