Defense · arXiv · Sep 30, 2025
Tuan Nguyen, Naseem Khan, Khang Tran et al. · Qatar Computing Research Institute · New Jersey Institute of Technology
Novel RL algorithm aligns VLM paragraph-level reasoning with visual evidence to improve deepfake detection accuracy
Output Integrity Attack · vision · multimodal · nlp
The rapid rise of synthetic media has made deepfake detection a critical challenge for online safety and trust. Progress remains constrained by the scarcity of large, high-quality datasets. Although multimodal large language models (LLMs) exhibit strong reasoning capabilities, their performance on deepfake detection is poor, often producing explanations that are misaligned with the visual evidence or simply hallucinated. To address this limitation, we introduce a reasoning-annotated dataset for deepfake detection and propose Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns LLM reasoning with image content at the paragraph level. Experiments show that PRPO improves detection accuracy by a wide margin and achieves the highest reasoning score of 4.55/5.0. Ablation studies further demonstrate that PRPO significantly outperforms GRPO under test-time conditions. These results underscore the importance of grounding multimodal reasoning in visual evidence to enable more reliable and interpretable deepfake detection.
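The abstract does not spell out PRPO's update rule. Assuming it follows GRPO's group-relative advantage normalization but at paragraph granularity (an assumption; the paper's exact objective may differ), a minimal sketch of how per-paragraph advantages could be computed over a group of sampled responses; the reward values and function names here are purely illustrative:

```python
# Sketch: GRPO-style group-relative advantages, applied per reasoning
# paragraph rather than per whole response. The per-paragraph reward
# structure and all names are illustrative assumptions, not the paper's
# actual formulation.
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize a group of scalar rewards to zero mean, unit std."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def paragraph_level_advantages(paragraph_rewards):
    """paragraph_rewards[i][p] = reward of paragraph p in sampled response i
    (e.g. a visual-grounding score from a judge model).  Each paragraph
    position is normalized across the sampled group, so a paragraph is
    credited relative to its peers rather than the whole response."""
    n_paragraphs = len(paragraph_rewards[0])
    advantages = [[0.0] * n_paragraphs for _ in paragraph_rewards]
    for p in range(n_paragraphs):
        column = [resp[p] for resp in paragraph_rewards]
        for i, a in enumerate(group_relative_advantages(column)):
            advantages[i][p] = a
    return advantages

# Example: 3 sampled responses, 2 reasoning paragraphs each.
rewards = [[0.9, 0.2], [0.4, 0.8], [0.5, 0.5]]
adv = paragraph_level_advantages(rewards)
```

Normalizing each paragraph position independently is what would let a response be rewarded for a well-grounded opening paragraph even if a later paragraph drifts, which whole-response normalization cannot express.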
vlm · llm · transformer
Attack · arXiv · Oct 14, 2025
Tuan T. Nguyen, John Le, Thai T. Vu et al. · VNPT AI · University of Wollongong
Embedding-space adversarial suffix attack steers LLM activations away from refusal directions to achieve jailbreaks with fewer queries
Input Manipulation Attack · Prompt Injection · nlp
Large language models (LLMs) achieve impressive performance across diverse tasks yet remain vulnerable to jailbreak attacks that bypass safety mechanisms. We present RAID (Refusal-Aware and Integrated Decoding), a framework that systematically probes these weaknesses by crafting adversarial suffixes that induce restricted content while preserving fluency. RAID relaxes discrete tokens into continuous embeddings and optimizes them with a joint objective that (i) encourages restricted responses, (ii) incorporates a refusal-aware regularizer to steer activations away from refusal directions in embedding space, and (iii) applies a coherence term to maintain semantic plausibility and non-redundancy. After optimization, a critic-guided decoding procedure maps embeddings back to tokens by balancing embedding affinity with language-model likelihood. This integration yields suffixes that are both effective in bypassing defenses and natural in form. Experiments on multiple open-source LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines. These findings highlight the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities.
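The three-part objective described above can be sketched numerically. The toy below is not RAID itself: the quadratic losses, the fixed "target" and "refusal" direction vectors, and the nearest-neighbor coherence pull are illustrative stand-ins for the actual model's activations and likelihoods, chosen only to show the shape of the optimization (continuous suffix embeddings, joint gradient steps, then projection back to discrete tokens):

```python
# Toy sketch of RAID-style embedding-space suffix optimization:
# (i) a task loss pushing suffix embeddings toward a target direction,
# (ii) a refusal-aware penalty on the projection onto a refusal direction,
# (iii) a coherence pull toward the nearest vocabulary embedding.
# All vectors, weights, and loss forms are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size, suffix_len = 8, 16, 4
vocab = rng.normal(size=(vocab_size, d))            # toy token-embedding table
target = rng.normal(size=d); target /= np.linalg.norm(target)
refusal = rng.normal(size=d); refusal /= np.linalg.norm(refusal)

suffix = rng.normal(size=(suffix_len, d))           # relaxed (continuous) suffix
suffix0 = suffix.copy()                             # keep initial state for comparison

def loss_and_grad(E, lam_refusal=1.0, lam_coh=0.1):
    # (i) task loss: reward alignment with the target direction
    l_task = -(E @ target)
    g_task = -np.tile(target, (suffix_len, 1))
    # (ii) refusal-aware regularizer: squared projection onto refusal direction
    proj = E @ refusal
    g_ref = 2.0 * proj[:, None] * refusal
    # (iii) coherence: squared distance to the nearest vocabulary embedding
    dists = ((E[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    nearest = vocab[dists.argmin(axis=1)]
    g_coh = 2.0 * (E - nearest)
    loss = l_task.sum() + lam_refusal * (proj ** 2).sum() \
        + lam_coh * ((E - nearest) ** 2).sum()
    return loss, g_task + lam_refusal * g_ref + lam_coh * g_coh

for _ in range(200):                                # plain gradient descent
    _, grad = loss_and_grad(suffix)
    suffix -= 0.05 * grad

# Decode: map optimized embeddings back to the closest discrete tokens
# (a stand-in for RAID's critic-guided decoding step).
tokens = ((suffix[:, None, :] - vocab[None, :, :]) ** 2).sum(-1).argmin(axis=1)
```

After optimization the suffix embeddings have a much smaller projection onto the refusal direction than the random initialization, which is the geometric effect the refusal-aware regularizer is designed to produce; the paper's decoding step additionally weighs language-model likelihood, which this nearest-neighbor projection omits.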
llm · transformer