Latest papers

25 papers
attack arXiv Apr 2, 2026 · 4d ago

Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models

Jiawei Chen, Simin Huang, Jiawei Du et al. · East China Normal University · Zhongguancun Academy +3 more

Physically realizable 3D adversarial textures that degrade vision-language-action robot models with 96.7% task failure rates

Input Manipulation Attack vision multimodal nlp
PDF Code
attack arXiv Mar 17, 2026 · 20d ago

Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models

Xiaobing Sun, Perry Lam, Shaohua Li et al. · A*STAR · Singapore University of Technology and Design

Multi-dimensional jailbreak attack that fragments and disguises malicious intent across prompt segments to evade LLM safety mechanisms

Prompt Injection nlp
PDF
defense arXiv Mar 9, 2026 · 28d ago

Privacy-Preserving End-to-End Full-Duplex Speech Dialogue Models

Nikita Kuzmin, Tao Zhong, Jiajun Deng et al. · Nanyang Technological University · A*STAR +3 more

Defends against speaker re-identification attacks on LLM speech dialogue models using streaming voice anonymization

Sensitive Information Disclosure audio nlp
PDF
attack arXiv Mar 3, 2026 · 4w ago

Scores Know Bob's Voice: Speaker Impersonation Attack

Chanwoo Hwang, Sunpill Kim, Yong Kiam Tan et al. · Hanyang University · A*STAR +2 more

Feature-aligned latent inversion achieves 91% speaker impersonation with 10x fewer black-box score queries

Input Manipulation Attack audio
PDF Code
attack arXiv Jan 31, 2026 · 9w ago

DECEIVE-AFC: Adversarial Claim Attacks against Search-Enabled LLM-based Fact-Checking Systems

Haoran Ou, Kangjie Chen, Gelei Deng et al. · Nanyang Technological University · A*STAR

Agent-based adversarial claim attacks on search-augmented LLM fact-checkers disrupt retrieval and reasoning, dropping accuracy from 78.7% to 53.7%

Prompt Injection nlp
PDF
benchmark arXiv Jan 30, 2026 · 9w ago

Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures

Yanghao Su, Wenbo Zhou, Tianwei Zhang et al. · University of Science and Technology of China · Nanyang Technological University +2 more

Mechanistic study showing character-disposition fine-tuning creates stronger, transferable LLM misalignment unifying backdoor triggers and jailbreak susceptibility

Model Poisoning Prompt Injection nlp
PDF
defense arXiv Jan 28, 2026 · 9w ago

Exploiting the Final Component of Generator Architectures for AI-Generated Image Detection

Yanzhu Liu, Xiao Liu, Yuexuan Wang et al. · A*STAR

Proposes contaminating real images with generator final-layer artifacts to train generalizable AI-generated image detectors

Output Integrity Attack vision generative
PDF
defense arXiv Jan 13, 2026 · 11w ago

SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models

Renyang Liu, Kangjie Chen, Han Qiu et al. · National University of Singapore · Nanyang Technological University +2 more

Inference-time prompt-embedding redirector blocks NSFW and copyright generation in diffusion models while resisting adversarial bypass attacks

Input Manipulation Attack vision generative
1 citation PDF Code
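The redirection step this entry describes can be sketched in a few lines. Everything below (the linear "unsafe" probe, the zero safe-anchor embedding, the blend weight) is a toy assumption for illustration, not SafeRedir's actual components:

```python
import numpy as np

# Toy sketch of inference-time prompt-embedding redirection.
# `probe_w` (a linear probe scoring "unsafeness") and `safe_anchor`
# (embedding of a benign fallback prompt) are illustrative assumptions.

safe_anchor = np.zeros(4)                  # embedding of a benign fallback prompt
probe_w = np.array([1.0, -1.0, 0.5, 0.0])  # toy probe scoring "unsafeness"

def redirect(embedding, threshold=0.0, alpha=0.8):
    """If the probe flags the embedding, blend it toward the safe anchor."""
    score = float(probe_w @ embedding)
    if score <= threshold:
        return embedding  # benign prompts pass through unchanged
    return (1 - alpha) * embedding + alpha * safe_anchor
```

Because the redirect happens purely on the prompt embedding at inference time, the diffusion model itself needs no retraining, which is the appeal of this style of defense.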
attack arXiv Dec 2, 2025 · Dec 2025

LeechHijack: Covert Computational Resource Exploitation in Intelligent Agent Systems

Yuanhe Zhang, Weiliu Wang, Zhenhong Zhou et al. · Beijing University of Posts and Telecommunications · Hangzhou Dianzi University +4 more

LeechHijack backdoors MCP tools to covertly parasitize LLM agent compute via runtime C2 channel, achieving 77% success undetected

Insecure Plugin Design nlp
1 citation PDF
defense arXiv Nov 21, 2025 · Nov 2025

Cognitive Inception: Agentic Reasoning against Visual Deceptions by Injecting Skepticism

Yinjie Zhao, Heng Zhao, Bihan Wen et al. · A*STAR · Nanyang Technological University

Proposes agentic skepticism-injection framework that improves VLM detection of AI-generated visual content via dual-agent reasoning

Output Integrity Attack vision multimodal nlp
PDF
benchmark arXiv Nov 3, 2025 · Nov 2025

Probabilistic Robustness for Free? Revisiting Training via a Benchmark

Yi Zhang, Zheng Wang, Zhen Chen et al. · University of Warwick · University of Liverpool +2 more

Benchmarks training methods for adversarial robustness (AR) and probabilistic robustness (PR), finding adversarial training improves both at no extra cost

Input Manipulation Attack vision
1 citation PDF Code
defense arXiv Oct 26, 2025 · Oct 2025

Self-Calibrated Consistency can Fight Back for Adversarial Robustness in Vision-Language Models

Jiaxiang Liu, Jiawei Du, Xiao Liu et al. · Guangdong Institute of Intelligence Science and Technology · A*STAR +1 more

Test-time defense for CLIP using semantic and spatial consistency to counter adversarial image perturbations in zero-shot VLM settings

Input Manipulation Attack vision multimodal
1 citation PDF
defense arXiv Oct 18, 2025 · Oct 2025

EditMark: Watermarking Large Language Models based on Model Editing

Shuai Li, Kejiang Chen, Jun Jiang et al. · University of Science and Technology of China · A*STAR +1 more

Embeds 32-bit ownership watermarks into LLM weights via model editing in 20 seconds, enabling copyright verification without training costs

Model Theft nlp
PDF
attack arXiv Oct 9, 2025 · Oct 2025

When Search Goes Wrong: Red-Teaming Web-Augmented Large Language Models

Haoran Ou, Kangjie Chen, Xingshuo Han et al. · Nanyang Technological University · Nanjing University of Aeronautics and Astronautics +2 more

Red-teams web-augmented LLMs with benign-looking search queries that bypass safety filters and force harmful content citations

Prompt Injection nlp
1 citation PDF
defense AVSS Oct 8, 2025 · Oct 2025

XLSR-Kanformer: A KAN-Integrated Model for Synthetic Speech Detection

Phuong Tuan Dat, Tran Huy Dat · Hanoi University of Science and Technology · A*STAR

Replaces MLP with KAN in XLSR-Conformer to achieve SOTA synthetic speech detection, cutting EER by 60% on ASVspoof2021

Output Integrity Attack audio
1 citation PDF
defense CCS Oct 5, 2025 · Oct 2025

SafeGuider: Robust and Practical Content Safety Control for Text-to-Image Models

Peigui Qi, Kunsheng Tang, Wenbo Zhou et al. · University of Science and Technology of China · Nanyang Technological University +1 more

Defends text-to-image models against adversarial prompt evasion attacks using EOS-token embedding detection and safety-aware feature erasure

Input Manipulation Attack vision nlp generative
1 citation PDF Code
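A toy sketch of the two-step idea this entry describes: flag adversarial prompts from the EOS-position embedding, then erase the unsafe features instead of refusing outright. The probe direction and the projection-based erasure below are illustrative assumptions, not SafeGuider's actual mechanism:

```python
import numpy as np

# Toy sketch: (1) detect adversarial prompts via the EOS-token embedding,
# (2) project out an "unsafe concept" direction from all token embeddings.
# `unsafe_dir` is an assumed probe direction, not the paper's learned one.

unsafe_dir = np.array([0.0, 0.0, 1.0, 1.0])  # toy "unsafe concept" direction
unsafe_dir /= np.linalg.norm(unsafe_dir)

def detect(token_embeddings, threshold=1.0):
    """token_embeddings: (seq_len, dim); the last row is the EOS position."""
    eos = token_embeddings[-1]
    return float(unsafe_dir @ eos) > threshold

def erase(token_embeddings):
    """Project the unsafe direction out of every token embedding."""
    coeffs = token_embeddings @ unsafe_dir            # (seq_len,)
    return token_embeddings - np.outer(coeffs, unsafe_dir)
```

Erasing the flagged direction (rather than rejecting the prompt) lets benign parts of the prompt still drive generation, which is the practical point of feature-level defenses like this.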
attack arXiv Sep 29, 2025 · Sep 2025

TokenSwap: Backdoor Attack on the Compositional Understanding of Large Vision-Language Models

Zhifang Zhang, Qiqi Tao, Jiaqi Lv et al. · Southeast University · Singapore University of Technology and Design +1 more

Stealthy backdoor attack on VLMs swaps subject-object token roles to evade perplexity-based detectors while maintaining high attack success rates

Model Poisoning vision nlp multimodal
PDF
benchmark arXiv Sep 18, 2025 · Sep 2025

SynBench: A Benchmark for Differentially Private Text Generation

Yidan Sun, Viktor Schlegel, Srinivasan Nandakumar et al. · Imperial College London · University of Manchester +2 more

Audits DP synthetic text generation via tailored MIA, showing pre-training contamination invalidates DP privacy guarantees across nine domain datasets

Membership Inference Attack nlp
PDF
attack arXiv Sep 7, 2025 · Sep 2025

Beyond "I'm Sorry, I Can't": Dissecting Large Language Model Refusal

Nirmalendu Prakash, Yeo Wei Jie, Amir Abdullah et al. · Singapore University of Technology and Design · Nanyang Technological University +2 more

Ablates SAE latent features mediating refusal in LLMs to produce mechanistically-grounded jailbreaks via a three-stage pipeline

Prompt Injection nlp
PDF
defense arXiv Aug 28, 2025 · Aug 2025

Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

Weitao Feng, Lixu Wang, Tianyi Wei et al. · Nanyang Technological University · A*STAR +1 more

Defends LLM safety alignment against RL fine-tuning attacks by suppressing response entropy via TokenBuncher

Transfer Learning Attack Prompt Injection nlp reinforcement-learning
PDF
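The entropy-suppression idea behind this entry can be illustrated with a minimal numpy sketch: RL fine-tuning needs spread-out next-token distributions to explore, so a penalty that pushes response entropy down starves the attack of signal. The exact loss shape below is an assumption; the paper's actual TokenBuncher objective may differ:

```python
import numpy as np

# Minimal sketch of an entropy-suppression auxiliary loss.
# The lambda weight and mean-over-positions reduction are assumptions.

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def entropy(logits):
    """Per-position entropy of the next-token distribution."""
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def entropy_penalty(logits, lam=0.1):
    """Auxiliary loss term: minimizing it flattens exploration signal."""
    return lam * entropy(logits).mean()
```

Minimizing this term alongside the usual training loss keeps the model's output distributions peaked, which (per the entry's claim) blunts harmful RL fine-tuning without retraining the base alignment.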