Latest papers

13 papers
benchmark arXiv Feb 18, 2026 · 7w ago

The Vulnerability of LLM Rankers to Prompt Injection Attacks

Yu Yin, Shuai Wang, Bevan Koopman et al. · The University of Queensland · CSIRO

Benchmarks indirect prompt injection attacks on LLM rankers, revealing encoder-decoder architectures are far more resilient than decoder-only models

Prompt Injection nlp
PDF Code
defense arXiv Feb 11, 2026 · 8w ago

Mitigating Gradient Inversion Risks in Language Models via Token Obfuscation

Xinguo Feng, Zhongkui Ma, Zihan Wang et al. · The University of Queensland · CSIRO’s Data61 +1 more

Defends collaborative LLM training against gradient inversion by replacing tokens with semantically disconnected yet embedding-proximate shadow substitutes

Model Inversion Attack Sensitive Information Disclosure nlpfederated-learning
PDF
attack arXiv Jan 29, 2026 · 9w ago

Noise as a Probe: Membership Inference Attacks on Diffusion Models Leveraging Initial Noise

Puwei Lian, Yujun Cai, Songze Li et al. · Southeast University · The University of Queensland +1 more

Exploits residual semantics in diffusion model noise schedules to perform black-box membership inference without auxiliary data

Membership Inference Attack visiongenerative
PDF
defense arXiv Jan 29, 2026 · 9w ago

Stay in Character, Stay Safe: Dual-Cycle Adversarial Self-Evolution for Safety Role-Playing Agents

Mingyang Liao, Yichen Wan, shuchen wu et al. · Baidu Inc. · The University of Queensland +1 more

Training-free dual-cycle framework defends LLM role-playing agents against jailbreaks while preserving persona fidelity via evolving hierarchical knowledge

Prompt Injection nlp
PDF Code
defense arXiv Jan 20, 2026 · 11w ago

SecureSplit: Mitigating Backdoor Attacks in Split Learning

Zhihao Dou, Dongfei Cui, Weida Wang et al. · Case Western Reserve University · Northeast Electric Power University +6 more

Defends split learning against backdoor attacks by transforming embeddings and filtering poisoned ones via majority-voting scheme

Model Poisoning visionfederated-learning
PDF
attack arXiv Jan 15, 2026 · 11w ago

Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models

Peng-Fei Zhang, Zi Huang · The University of Queensland

Universal multimodal adversarial attack on VLP models using future-aware gradient momentum for images and hierarchical word-importance for text

Input Manipulation Attack visionnlpmultimodal
PDF
attack arXiv Jan 1, 2026 · Jan 2026

When Agents See Humans as the Outgroup: Belief-Dependent Bias in LLM-Powered Agents

Zongwei Wang, Bincheng Gu, Hongyu Yu et al. · Chongqing University · The University of Queensland +2 more

Belief Poisoning Attack corrupts LLM agent profiles and memory to make agents treat humans as outgroup, bypassing human-oriented safety behaviors

Prompt Injection Excessive Agency nlp
PDF Code
attack arXiv Dec 26, 2025 · Dec 2025

Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models

Mengqi He, Xinyu Tian, Xin Shen et al. · Australian National University · The University of Queensland +1 more

Targets high-entropy VLM decoding positions with adversarial visual perturbations, converting 35-49% of benign outputs to harmful content at 93-95% attack success rate

Input Manipulation Attack Prompt Injection visionnlpmultimodal
PDF
defense arXiv Dec 15, 2025 · Dec 2025

Learning to Generate Cross-Task Unexploitable Examples

Haoxuan Qu, Qiuchi Xiang, Yujun Cai et al. · Lancaster University · The University of Queensland +2 more

Defends personal images from unauthorized ML training by generating cross-task imperceptible perturbations that make training data unlearnable across diverse vision tasks

Data Poisoning Attack vision
PDF
defense arXiv Nov 24, 2025 · Nov 2025

Re-Key-Free, Risky-Free: Adaptable Model Usage Control

Zihan Wang, Zhongkui Ma, Xinguo Feng et al. · The University of Queensland · CSIRO’s Data61 +3 more

Defends model IP with key-locked weights that survive fine-tuning, keeping unauthorized inference at near-random performance

Model Theft vision
1 citations PDF
defense arXiv Oct 13, 2025 · Oct 2025

Catch-Only-One: Non-Transferable Examples for Model-Specific Authorization

Zihan Wang, Zhiyong Ma, Zhongkui Ma et al. · The University of Queensland · CSIRO’s Data61 +1 more

Recodes inputs into an authorized model's insensitivity subspace so only that model can process them, blocking unauthorized model exploitation

Model Theft visionmultimodal
3 citations PDF Code
defense arXiv Sep 27, 2025 · Sep 2025

Adaptive Token-Weighted Differential Privacy for LLMs: Not All Tokens Require Equal Protection

Manjiang Yu, Priyanka Singh, Xue Li et al. · The University of Queensland · Institute of Science Tokyo

Token-selective DP-SGD variant concentrates noise on sensitive tokens to prevent LLM training-data extraction while cutting DP overhead by 90%

Model Inversion Attack Sensitive Information Disclosure nlp
1 citations PDF Code
attack arXiv Sep 8, 2025 · Sep 2025

Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift

Shuai Yuan, Zhibo Zhang, Yuxi Li et al. · University of Electronic Science and Technology of China · Huazhong University of Science and Technology +1 more

Injects adversarial perturbations into LLM embedding outputs at inference time to bypass safety alignment without modifying weights or prompts

Input Manipulation Attack Prompt Injection nlp
PDF