Latest papers

25 papers
attack arXiv Apr 2, 2026 · 4d ago

Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models

Jiawei Chen, Simin Huang, Jiawei Du et al. · East China Normal University · Zhongguancun Academy +3 more

Physically realizable 3D adversarial textures that degrade vision-language-action robot models with 96.7% task failure rates

Input Manipulation Attack vision multimodal nlp
PDF Code
attack arXiv Mar 17, 2026 · 20d ago

Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models

Xiaobing Sun, Perry Lam, Shaohua Li et al. · A*STAR · Singapore University of Technology and Design

Multi-dimensional jailbreak attack that fragments and disguises malicious intent across prompt segments to evade LLM safety mechanisms

Prompt Injection nlp
PDF
defense arXiv Mar 9, 2026 · 28d ago

Privacy-Preserving End-to-End Full-Duplex Speech Dialogue Models

Nikita Kuzmin, Tao Zhong, Jiajun Deng et al. · Nanyang Technological University · A*STAR +3 more

Defends against speaker re-identification attacks on LLM speech dialogue models using streaming voice anonymization

Sensitive Information Disclosure audio nlp
PDF
attack arXiv Mar 3, 2026 · 4w ago

Scores Know Bob's Voice: Speaker Impersonation Attack

Chanwoo Hwang, Sunpill Kim, Yong Kiam Tan et al. · Hanyang University · A*STAR +2 more

Feature-aligned latent inversion achieves 91% speaker impersonation with 10x fewer black-box score queries

Input Manipulation Attack audio
PDF Code
attack arXiv Jan 31, 2026 · 9w ago

DECEIVE-AFC: Adversarial Claim Attacks against Search-Enabled LLM-based Fact-Checking Systems

Haoran Ou, Kangjie Chen, Gelei Deng et al. · Nanyang Technological University · A*STAR

Agent-based adversarial claim attacks on search-augmented LLM fact-checkers disrupt retrieval and reasoning, dropping accuracy from 78.7% to 53.7%

Prompt Injection nlp
PDF
benchmark arXiv Jan 30, 2026 · 9w ago

Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures

Yanghao Su, Wenbo Zhou, Tianwei Zhang et al. · University of Science and Technology of China · Nanyang Technological University +2 more

Mechanistic study showing character-disposition fine-tuning creates stronger, transferable LLM misalignment unifying backdoor triggers and jailbreak susceptibility

Model Poisoning Prompt Injection nlp
PDF
defense arXiv Jan 28, 2026 · 9w ago

Exploiting the Final Component of Generator Architectures for AI-Generated Image Detection

Yanzhu Liu, Xiao Liu, Yuexuan Wang et al. · A*STAR

Proposes contaminating real images with generator final-layer artifacts to train generalizable AI-generated image detectors

Output Integrity Attack vision generative
PDF
defense arXiv Jan 13, 2026 · 11w ago

SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models

Renyang Liu, Kangjie Chen, Han Qiu et al. · National University of Singapore · Nanyang Technological University +2 more

Inference-time prompt-embedding redirector blocks NSFW and copyright generation in diffusion models while resisting adversarial bypass attacks

Input Manipulation Attack vision generative
1 citation PDF Code
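The redirection step this entry describes can be sketched in a few lines. Everything below (the linear "unsafe" probe, the zero safe-anchor embedding, the blend weight) is a toy assumption for illustration, not SafeRedir's actual components:

```python
import numpy as np

# Toy sketch of inference-time prompt-embedding redirection.
# `probe_w` (a linear probe scoring "unsafeness") and `safe_anchor`
# (embedding of a benign fallback prompt) are illustrative assumptions.

safe_anchor = np.zeros(4)                  # embedding of a benign fallback prompt
probe_w = np.array([1.0, -1.0, 0.5, 0.0])  # toy probe scoring "unsafeness"

def redirect(embedding, threshold=0.0, alpha=0.8):
    """If the probe flags the embedding, blend it toward the safe anchor."""
    score = float(probe_w @ embedding)
    if score <= threshold:
        return embedding  # benign prompts pass through unchanged
    return (1 - alpha) * embedding + alpha * safe_anchor
```

Because the redirect happens purely on the prompt embedding at inference time, the diffusion model itself needs no retraining, which is the appeal of this style of defense.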
attack arXiv Dec 2, 2025 · Dec 2025

LeechHijack: Covert Computational Resource Exploitation in Intelligent Agent Systems

Yuanhe Zhang, Weiliu Wang, Zhenhong Zhou et al. · Beijing University of Posts and Telecommunications · Hangzhou Dianzi University +4 more

LeechHijack backdoors MCP tools to covertly parasitize LLM agent compute via runtime C2 channel, achieving 77% success undetected

Insecure Plugin Design nlp
1 citation PDF
defense arXiv Nov 21, 2025 · Nov 2025

Cognitive Inception: Agentic Reasoning against Visual Deceptions by Injecting Skepticism

Yinjie Zhao, Heng Zhao, Bihan Wen et al. · A*STAR · Nanyang Technological University

Proposes agentic skepticism-injection framework that improves VLM detection of AI-generated visual content via dual-agent reasoning

Output Integrity Attack vision multimodal nlp
PDF
benchmark arXiv Nov 3, 2025 · Nov 2025

Probabilistic Robustness for Free? Revisiting Training via a Benchmark

Yi Zhang, Zheng Wang, Zhen Chen et al. · University of Warwick · University of Liverpool +2 more

Benchmarks training methods for adversarial robustness (AR) and probabilistic robustness (PR), finding adversarial training improves both at no extra cost

Input Manipulation Attack vision
1 citation PDF Code
defense arXiv Oct 26, 2025 · Oct 2025

Self-Calibrated Consistency can Fight Back for Adversarial Robustness in Vision-Language Models

Jiaxiang Liu, Jiawei Du, Xiao Liu et al. · Guangdong Institute of Intelligence Science and Technology · A*STAR +1 more

Test-time defense for CLIP using semantic and spatial consistency to counter adversarial image perturbations in zero-shot VLM settings

Input Manipulation Attack vision multimodal
1 citation PDF
defense arXiv Oct 18, 2025 · Oct 2025

EditMark: Watermarking Large Language Models based on Model Editing

Shuai Li, Kejiang Chen, Jun Jiang et al. · University of Science and Technology of China · A*STAR +1 more

Embeds 32-bit ownership watermarks into LLM weights via model editing in 20 seconds, enabling copyright verification without training costs

Model Theft nlp
PDF
attack arXiv Oct 9, 2025 · Oct 2025

When Search Goes Wrong: Red-Teaming Web-Augmented Large Language Models

Haoran Ou, Kangjie Chen, Xingshuo Han et al. · Nanyang Technological University · Nanjing University of Aeronautics and Astronautics +2 more

Red-teams web-augmented LLMs with benign-looking search queries that bypass safety filters and force harmful content citations

Prompt Injection nlp
1 citation PDF
defense AVSS Oct 8, 2025 · Oct 2025

XLSR-Kanformer: A KAN-Integrated Model for Synthetic Speech Detection

Phuong Tuan Dat, Tran Huy Dat · Hanoi University of Science and Technology · A*STAR

Replaces MLP with KAN in XLSR-Conformer to achieve SOTA synthetic speech detection, cutting EER by 60% on ASVspoof2021

Output Integrity Attack audio
1 citation PDF
defense CCS Oct 5, 2025 · Oct 2025

SafeGuider: Robust and Practical Content Safety Control for Text-to-Image Models

Peigui Qi, Kunsheng Tang, Wenbo Zhou et al. · University of Science and Technology of China · Nanyang Technological University +1 more

Defends text-to-image models against adversarial prompt evasion attacks using EOS-token embedding detection and safety-aware feature erasure

Input Manipulation Attack vision nlp generative
1 citation PDF Code
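A toy sketch of the two-step idea this entry describes: flag adversarial prompts from the EOS-position embedding, then erase the unsafe features instead of refusing outright. The probe direction and the projection-based erasure below are illustrative assumptions, not SafeGuider's actual mechanism:

```python
import numpy as np

# Toy sketch: (1) detect adversarial prompts via the EOS-token embedding,
# (2) project out an "unsafe concept" direction from all token embeddings.
# `unsafe_dir` is an assumed probe direction, not the paper's learned one.

unsafe_dir = np.array([0.0, 0.0, 1.0, 1.0])  # toy "unsafe concept" direction
unsafe_dir /= np.linalg.norm(unsafe_dir)

def detect(token_embeddings, threshold=1.0):
    """token_embeddings: (seq_len, dim); the last row is the EOS position."""
    eos = token_embeddings[-1]
    return float(unsafe_dir @ eos) > threshold

def erase(token_embeddings):
    """Project the unsafe direction out of every token embedding."""
    coeffs = token_embeddings @ unsafe_dir            # (seq_len,)
    return token_embeddings - np.outer(coeffs, unsafe_dir)
```

Erasing the flagged direction (rather than rejecting the prompt) lets benign parts of the prompt still drive generation, which is the practical point of feature-level defenses like this.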
attack arXiv Sep 29, 2025 · Sep 2025

TokenSwap: Backdoor Attack on the Compositional Understanding of Large Vision-Language Models

Zhifang Zhang, Qiqi Tao, Jiaqi Lv et al. · Southeast University · Singapore University of Technology and Design +1 more

Stealthy backdoor attack on VLMs swaps subject-object token roles to evade perplexity-based detectors while maintaining high attack success rates

Model Poisoning vision nlp multimodal
PDF
benchmark arXiv Sep 18, 2025 · Sep 2025

SynBench: A Benchmark for Differentially Private Text Generation

Yidan Sun, Viktor Schlegel, Srinivasan Nandakumar et al. · Imperial College London · University of Manchester +2 more

Audits DP synthetic text generation via tailored MIA, showing pre-training contamination invalidates DP privacy guarantees across nine domain datasets

Membership Inference Attack nlp
PDF
attack arXiv Sep 7, 2025 · Sep 2025

Beyond "I'm Sorry, I Can't": Dissecting Large Language Model Refusal

Nirmalendu Prakash, Yeo Wei Jie, Amir Abdullah et al. · Singapore University of Technology and Design · Nanyang Technological University +2 more

Ablates SAE latent features mediating refusal in LLMs to produce mechanistically-grounded jailbreaks via a three-stage pipeline

Prompt Injection nlp
PDF
defense arXiv Aug 28, 2025 · Aug 2025

Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

Weitao Feng, Lixu Wang, Tianyi Wei et al. · Nanyang Technological University · A*STAR +1 more

Defends LLM safety alignment against RL fine-tuning attacks by suppressing response entropy via TokenBuncher

Transfer Learning Attack Prompt Injection nlp reinforcement-learning
PDF
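The entropy-suppression idea behind this entry can be illustrated with a minimal numpy sketch: RL fine-tuning needs spread-out next-token distributions to explore, so a penalty that pushes response entropy down starves the attack of signal. The exact loss shape below is an assumption; the paper's actual TokenBuncher objective may differ:

```python
import numpy as np

# Minimal sketch of an entropy-suppression auxiliary loss.
# The lambda weight and mean-over-positions reduction are assumptions.

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def entropy(logits):
    """Per-position entropy of the next-token distribution."""
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def entropy_penalty(logits, lam=0.1):
    """Auxiliary loss term: minimizing it flattens exploration signal."""
    return lam * entropy(logits).mean()
```

Minimizing this term alongside the usual training loss keeps the model's output distributions peaked, which (per the entry's claim) blunts harmful RL fine-tuning without retraining the base alignment.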