Latest papers

53 papers
defense arXiv Mar 26, 2026 · 13d ago

Knowledge-Guided Adversarial Training for Infrared Object Detection via Thermal Radiation Modeling

Shiji Zhao, Shukun Xiong, Maoxun Yuan et al. · Beihang University · Alibaba Group +2 more

Adversarial training for infrared object detectors guided by thermal radiation physics to improve robustness against attacks and corruptions

Input Manipulation Attack vision
PDF
defense arXiv Mar 25, 2026 · 14d ago

Beyond Semantic Priors: Mitigating Optimization Collapse for Generalizable Visual Forensics

Jipeng Liu, Haichao Shi, Siyu Xing et al. · Chinese Academy of Sciences · Beihang University

Addresses optimization collapse in VLM-based deepfake detectors through gradient signal enhancement and contrastive regional injection for cross-domain generalization

Output Integrity Attack vision multimodal
PDF
attack arXiv Mar 22, 2026 · 17d ago

Can LLMs Fool Graph Learning? Exploring Universal Adversarial Attacks on Text-Attributed Graphs

Zihui Chen, Yuling Wang, Pengfei Jiao et al. · Hangzhou Dianzi University · Beihang University +1 more

LLM-driven universal adversarial attack framework targeting text-attributed graph models across GNN and PLM architectures

Input Manipulation Attack nlp graph
PDF
attack arXiv Mar 20, 2026 · 19d ago

CAMA: Exploring Collusive Adversarial Attacks in c-MARL

Men Niu, Xinxin Fan, Quanliang Jing et al. · Institute of Computing Technology · University of Chinese Academy of Sciences +1 more

Introduces three collusive policy-level attacks on cooperative MARL where multiple malicious agents coordinate to disrupt teamwork

Input Manipulation Attack reinforcement-learning
PDF
survey arXiv Mar 13, 2026 · 26d ago

Uncovering Security Threats and Architecting Defenses in Autonomous Agents: A Case Study of OpenClaw

Zonghao Ying, Xiao Yang, Siyang Wu et al. · Beihang University · Zhongguancun Laboratory +1 more

Security analysis of OpenClaw autonomous agents revealing prompt injection RCE, tool chain attacks, and proposing FASA defense architecture

AI Supply Chain Attacks Prompt Injection Insecure Plugin Design Excessive Agency nlp multimodal
PDF Code
attack arXiv Mar 10, 2026 · 29d ago

Reasoning-Oriented Programming: Chaining Semantic Gadgets to Jailbreak Large Vision Language Models

Quanchen Zou, Moyang Chen, Zonghao Ying et al. · 360 AI Security Lab · Wenzhou-Kean University +1 more

Jailbreaks VLMs by chaining semantically benign visual gadgets via prompt-controlled reasoning to synthesize harmful outputs, bypassing perception-level alignment

Input Manipulation Attack Prompt Injection vision nlp multimodal
PDF
attack arXiv Mar 7, 2026 · 4w ago

Two Frames Matter: A Temporal Attack for Text-to-Video Model Jailbreaking

Moyang Chen, Zonghao Ying, Wenzhuo Xu et al. · Wenzhou-Kean University · 360 AI Security Lab +1 more

Jailbreaks text-to-video models by exploiting temporal infilling: sparse boundary-frame prompts induce harmful intermediate content generation

Prompt Injection multimodal generative
PDF
attack arXiv Mar 5, 2026 · 4w ago

From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models

Ruiqi Zhang, Lingxiang Wang, Hainan Zhang et al. · Beihang University · Tsinghua University

Detects LLM pre-training data via gradient deviation scores capturing update magnitude, location, and concentration in FFN/Attention modules

Membership Inference Attack nlp
PDF
defense arXiv Mar 3, 2026 · 5w ago

StegaFFD: Privacy-Preserving Face Forgery Detection via Fine-Grained Steganographic Domain Lifting

Guoqing Ma, Xun Lin, Hui Ma et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +3 more

Steganographic framework hides faces in cover images and detects deepfakes directly in the hidden domain to prevent facial privacy leakage

Output Integrity Attack vision
PDF
attack arXiv Feb 26, 2026 · 5w ago

Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search

Xun Huang, Simeng Qin, Xiaoshuang Jia et al. · Nanyang Technological University · BraneMatrix AI +7 more

Bio-inspired optimization generates classical Chinese jailbreak prompts that defeat modern-language safety guardrails in black-box LLMs

Prompt Injection nlp
PDF
defense arXiv Jan 29, 2026 · 9w ago

Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs

Jun Xue, Yi Chai, Yanzhen Ren et al. · Wuhan University · Independent Researcher +3 more

Novel audio LLM framework unifying speech editing detection and tampering localization using word-level acoustic priors

Output Integrity Attack audio nlp
1 citation PDF
attack arXiv Jan 27, 2026 · 10w ago

GraphDLG: Exploring Deep Leakage from Gradients in Federated Graph Learning

Shuyue Wei, Wantong Chen, Tongyu Wei et al. · Shandong University · Beihang University +1 more

Gradient inversion attack on federated graph learning recovers private graph structure and node features from shared gradients via a closed-form recursive rule

Model Inversion Attack graph federated-learning
PDF
attack arXiv Jan 27, 2026 · 10w ago

LLMs Can Unlearn Refusal with Only 1,000 Benign Samples

Yangyang Guo, Ziwei Xu, Si Liu et al. · National University of Singapore · Beihang University

Fine-tunes LLMs on 1,000 benign samples with refusal prefixes to erase safety alignment across 16 models including GPT and Gemini

Transfer Learning Attack Prompt Injection nlp
PDF Code
defense arXiv Jan 15, 2026 · 11w ago

Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay

Hao Wang, Yanting Wang, Hao Li et al. · Beihang University · Peking University +1 more

Defends LLMs against jailbreaks via self-play RL where one model concurrently generates and resists adversarial prompts

Prompt Injection nlp
PDF
attack arXiv Dec 22, 2025 · Dec 2025

6DAttack: Backdoor Attacks in the 6DoF Pose Estimation

Jihui Guo, Zongmin Zhang, Zhen Sun et al. · The University of Hong Kong · The Hong Kong University of Science and Technology +2 more

Backdoor attack on 6DoF pose estimation using 3D object triggers to induce controlled erroneous rotations and translations with a 100% attack success rate

Model Poisoning vision
1 citation PDF Code
attack arXiv Dec 16, 2025 · Dec 2025

CIS-BA: Continuous Interaction Space Based Backdoor Attack for Object Detection in the Real-World

Shuxin Zhao, Bo Lang, Nan Xiao et al. · Beihang University · Zhongguancun Laboratory

Backdoor attack on object detectors using inter-object spatial interaction patterns as triggers, enabling multi-trigger-multi-object attacks with 97%+ success in real-world scenes

Model Poisoning vision
PDF
survey arXiv Dec 7, 2025 · Dec 2025

SoK: Trust-Authorization Mismatch in LLM Agent Interactions

Guanquan Shi, Haohua Du, Zhiqiang Wang et al. · Beihang University · University of Science and Technology of China

Surveys 200+ papers on LLM agent security, proposing the B-I-P framework to unify prompt injection, tool poisoning, and authorization-mismatch threats

Prompt Injection Insecure Plugin Design Excessive Agency nlp
2 citations 1 influential PDF
defense arXiv Dec 5, 2025 · Dec 2025

ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior

Weikai Lu, Ziqian Zeng, Kehua Zhang et al. · South China University of Technology · Hong Kong University of Science and Technology +2 more

Defends MLLMs against multimodal indirect prompt injection by steering instruction-following behavior in activation space

Prompt Injection multimodal nlp
1 citation PDF
attack arXiv Dec 5, 2025 · Dec 2025

VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack

Shiji Zhao, Shukun Xiong, Yao Huang et al. · Beihang University · Alibaba Group

Jailbreaks MLLMs by decomposing harmful text into sequential semantically crafted sub-images that aggregate harmful intent across frames

Prompt Injection vision nlp multimodal
PDF
defense arXiv Nov 30, 2025 · Nov 2025

DyLoC: A Dual-Layer Architecture for Secure and Trainable Quantum Machine Learning Under Polynomial-DLA Constraint

Chenyi Zhang, Tao Shang, Chao Guo et al. · Beihang University

Defends quantum variational circuits against gradient-leakage data reconstruction and snapshot inversion attacks while preserving trainability

Model Inversion Attack
PDF