Latest papers

7 papers
attack arXiv Mar 3, 2026

TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models

Zhi Xu, Jiaqi Li, Xiaotong Zhang et al. · Dalian University of Technology

Gradient-optimized adversarial suffix attack on LLMs using a two-stage loss and direction-priority token updates, reaching a 100% jailbreak success rate

Input Manipulation Attack · Prompt Injection · nlp
PDF
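
The entry above names a gradient-guided suffix optimization. For orientation only, below is a minimal GCG-style token-gradient step in PyTorch; TAO-Attack's two-stage loss and direction-priority update rule are not reproduced here, and `suffix_slice`/`target_slice` are assumed to be Python slices into a tokenized prompt.

```python
import torch

def suffix_token_gradients(model, embed_matrix, input_ids, suffix_slice, target_slice):
    """Gradient of the target-completion loss w.r.t. one-hot suffix tokens."""
    one_hot = torch.nn.functional.one_hot(
        input_ids[suffix_slice], num_classes=embed_matrix.size(0)
    ).to(embed_matrix.dtype).requires_grad_(True)
    embeds = embed_matrix[input_ids].detach()
    # Splice a differentiable path through the suffix positions only.
    full = torch.cat([
        embeds[: suffix_slice.start],
        one_hot @ embed_matrix,
        embeds[suffix_slice.stop :],
    ]).unsqueeze(0)
    logits = model(inputs_embeds=full).logits[0]
    # Next-token cross-entropy over the target span (logits shifted by one).
    loss = torch.nn.functional.cross_entropy(
        logits[target_slice.start - 1 : target_slice.stop - 1],
        input_ids[target_slice],
    )
    loss.backward()
    return one_hot.grad  # (suffix_len, vocab); more negative = more loss-reducing

def greedy_suffix_step(grad, input_ids, suffix_slice, k=256):
    """Swap one random suffix position for its top gradient-ranked candidate."""
    candidates = (-grad).topk(k, dim=1).indices
    pos = torch.randint(0, grad.size(0), (1,)).item()
    new_ids = input_ids.clone()
    new_ids[suffix_slice.start + pos] = candidates[pos, 0]
    return new_ids
```

A full attack would loop these two steps, re-scoring candidate suffixes until the target completion is elicited.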
defense arXiv Jan 9, 2026

Projecting Out the Malice: A Global Subspace Approach to LLM Detoxification

Zenghao Duan, Zhiyi Yin, Zhichao Shi et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +1 more

Removes a global toxic subspace from LLM FFN weights, achieving robust detoxification resistant to adversarial reactivation without retraining

Prompt Injection · nlp
1 citation · PDF
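
For intuition about the operation the summary names, a minimal numpy sketch of projecting a subspace out of a weight matrix follows. Estimating the global toxic basis is the paper's actual contribution and is not shown; `toxic_basis` is simply assumed to be an orthonormal basis.

```python
import numpy as np

def project_out(W, toxic_basis):
    """Remove components of W lying in the toxic subspace.

    W: (d, m) FFN weight matrix whose columns write into the residual stream.
    toxic_basis: (d, k) orthonormal basis for the toxic subspace.
    """
    P = toxic_basis @ toxic_basis.T   # (d, d) projector onto the subspace
    return W - P @ W                  # equivalently (I - P) @ W

# Example: remove a 2-dimensional subspace from a random weight matrix.
rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.normal(size=(64, 2)))   # orthonormal basis
W = rng.normal(size=(64, 256))
W_clean = project_out(W, V)
assert np.allclose(V.T @ W_clean, 0.0, atol=1e-10)  # no toxic component remains
```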
benchmark arXiv Sep 24, 2025

Benchmarking Gaslighting Attacks Against Speech Large Language Models

Jinyang Wu, Bin Zhu, Xiandong Zou et al. · Singapore Management University · Alibaba Group +1 more

Benchmarks five prompt-based gaslighting attack strategies against Speech LLMs, revealing a 24.3% average accuracy drop across five models

Prompt Injection · audio · nlp · multimodal
1 citation · PDF
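
The headline figure is an accuracy delta between clean and manipulated prompts. A toy sketch of that measurement, with an invented stub model and doubt-injection template rather than the benchmark's actual strategies:

```python
def accuracy_drop(model_answer, dataset, gaslight):
    """dataset: list of (audio, question, gold) triples."""
    clean = sum(model_answer(a, q) == gold for a, q, gold in dataset)
    attacked = sum(model_answer(a, gaslight(q)) == gold for a, q, gold in dataset)
    return (clean - attacked) / len(dataset)  # fraction of accuracy lost

# Toy usage with a stub model that only answers unmodified questions.
demo = [(None, "2+2?", "4"), (None, "capital of France?", "Paris")]
stub = lambda audio, q: {"2+2?": "4", "capital of France?": "Paris"}.get(q, "?")
doubt = lambda q: f"Are you sure? Most experts disagree. {q}"
print(accuracy_drop(stub, demo, doubt))  # 1.0: every answer flips under doubt
```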
defense arXiv Sep 16, 2025

End4: End-to-end Denoising Diffusion for Diffusion-Based Inpainting Detection

Fei Wang, Xuecheng Wu, Zheng Zhang et al. · Dalian University of Technology · Xi’an Jiaotong University +1 more

Detects diffusion-model inpainting manipulations via end-to-end denoising reconstruction and multi-scale pyramid feature fusion

Output Integrity Attack · vision · generative
PDF
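
As a loose illustration of denoise-and-compare detection with multi-scale pooling (not End4's architecture, whose denoiser and pyramid-fusion modules are learned), a PyTorch sketch, with `denoiser` assumed to be any image-to-image reconstruction network:

```python
import torch
import torch.nn.functional as F

def multiscale_residual_features(image, denoiser, scales=(1, 2, 4)):
    """Pool |image - denoised| reconstruction residuals at several scales."""
    residual = (image - denoiser(image)).abs()          # (B, C, H, W)
    feats = [
        F.adaptive_avg_pool2d(residual, image.shape[-1] // s).flatten(1)
        for s in scales
    ]
    return torch.cat(feats, dim=1)  # input to a binary real/inpainted classifier

x = torch.rand(2, 3, 32, 32)
identity_ish = lambda img: img * 0.9                    # stand-in denoiser
print(multiscale_residual_features(x, identity_ish).shape)  # torch.Size([2, 4032])
```

The intuition: inpainted regions reconstruct differently under denoising than camera-captured pixels, so the residual carries a localization signal at multiple scales.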
defense arXiv Aug 21, 2025

SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models

Peng Ding, Wen Sun, Dailin Li et al. · Meituan Inc. · Dalian University of Technology +1 more

RL defense that uses an LLM's own harm-discrimination ability as the reward signal, closing the gap between identifying jailbreaks and resisting them

Prompt Injection · nlp
PDF · Code
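
The core idea in the summary, a model grading its own outputs to produce a reward, can be caricatured in a few lines. The judge prompt and the ±1 reward mapping below are assumptions, not SDGO's formulation:

```python
JUDGE_TEMPLATE = (
    "Question: {q}\nResponse: {r}\n"
    "Is the response harmful? Answer 'yes' or 'no'."
)

def self_discrimination_reward(generate, question, response):
    """`generate` is the policy model itself, reused here as its own judge."""
    verdict = generate(JUDGE_TEMPLATE.format(q=question, r=response))
    return -1.0 if verdict.strip().lower().startswith("yes") else 1.0
```

This scalar would then feed a standard RLHF-style optimizer, so no external reward model or retraining data is required.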
attack arXiv Aug 11, 2025

BadPromptFL: A Novel Backdoor Threat to Prompt-based Federated Learning in Multimodal Models

Maozhen Zhang, Mengnan Zhao, Wei Wang et al. · Dalian University of Technology · Anhui University +1 more

First backdoor attack on prompt-based federated CLIP learning via poisoned prompt injection, achieving over 90% attack success

Model Poisoning · multimodal · federated-learning · vision
PDF
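
A sketch of why this attack surface exists: prompt-based federated learning has clients exchange small learned prompt tensors that the server aggregates, so one poisoned contribution shifts the global prompt. The aggregation rule and magnitudes below are illustrative, not the paper's:

```python
import numpy as np

def fedavg_prompts(client_prompts, weights=None):
    """Average client prompt tensors of shape (n_tokens, dim)."""
    prompts = np.stack(client_prompts)
    weights = np.full(len(prompts), 1 / len(prompts)) if weights is None else weights
    return np.tensordot(weights, prompts, axes=1)

rng = np.random.default_rng(1)
benign = [rng.normal(0, 0.01, size=(4, 512)) for _ in range(9)]
poisoned = rng.normal(0, 0.01, size=(4, 512)) + 0.5    # attacker's shifted prompt
global_prompt = fedavg_prompts(benign + [poisoned])
print(global_prompt.mean())  # drifts toward the attacker's offset (~0.05)
```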
attack arXiv Aug 8, 2025

Membership Inference Attack with Partial Features

Xurun Wang, Guangrui Liu, Xinjie Li et al. · Harbin Institute of Technology · Monash University +1 more

Novel membership inference attack using model-guided feature reconstruction and anomaly detection when only partial sample features are observed

Membership Inference Attack · vision · tabular
PDF
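
To make the summary concrete, here is one plausible shape of such an attack under stated assumptions: unknown feature values are optimized against the target model's loss, and a low reconstructed loss is read as membership evidence. This is not the paper's actual reconstruction or anomaly-detection procedure:

```python
import torch

def reconstruct_and_score(model, x_known, known_mask, label, steps=200, lr=0.1):
    """x_known: (d,) floats, unknown entries arbitrary; known_mask: (d,) bool."""
    x_missing = torch.zeros(int((~known_mask).sum()), requires_grad=True)
    opt = torch.optim.Adam([x_missing], lr=lr)
    for _ in range(steps):
        x = x_known.clone()
        x[~known_mask] = x_missing        # fill gaps with current estimates
        loss = torch.nn.functional.cross_entropy(
            model(x.unsqueeze(0)), torch.tensor([label])
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()  # low final loss is read as membership evidence
```

The score would then be thresholded (or fed to an anomaly detector) to decide membership, mirroring standard loss-based membership inference but with the missing features reconstructed first.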