attack · arXiv · Oct 13, 2025
Zhuochen Yang, Kar Wai Fok, Vrizlynn L. L. Thing · Nanyang Technological University · ST Engineering
Soft prompt attack extracts 65.2% of memorized LLM training data; ROME-based defense reduces leakage to 1.6%
Model Inversion Attack · Sensitive Information Disclosure · nlp
Large language models have gained widespread attention recently, but their potential security vulnerabilities, especially privacy leakage, are also becoming apparent. To test and evaluate data extraction risks in LLMs, we propose CoSPED, short for Consistent Soft Prompt targeted data Extraction and Defense. We introduce several innovative components, including Dynamic Loss, Additive Loss, Common Loss, and a Self-Consistency Decoding Strategy, designed to enhance the consistency of the soft prompt tuning process. Through extensive experimentation with various combinations, we achieve an extraction rate of 65.2% at a 50-token prefix comparison. Comparisons of CoSPED with other reference works confirm its superior extraction rates. We further evaluate CoSPED in additional scenarios, achieving a 51.7% extraction rate on the Pythia model and introducing a cross-model comparison. Finally, we explore defense through Rank-One Model Editing and reduce the extraction rate to 1.6%, showing that our analysis of extraction mechanisms can directly inform effective mitigation strategies against soft prompt-based attacks.
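The soft-prompt idea behind this line of attack can be illustrated with a toy, self-contained sketch: a frozen linear "model" maps an embedding to token logits, and gradient descent tunes only the continuous prompt vector until the model emits a chosen "memorized" token. This toy setup is an assumption for illustration only; CoSPED itself tunes soft prompts against a real LLM's embedding layer.

```python
import numpy as np

# Toy stand-in for a frozen LLM: a fixed matrix mapping a prompt
# embedding to next-token logits. Only the soft prompt is trained.
rng = np.random.default_rng(0)
vocab, dim = 8, 4
W = rng.normal(size=(dim, vocab))   # frozen "model" weights
target = 3                          # the "memorized" token to extract

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

soft_prompt = rng.normal(size=dim)  # trainable continuous prompt vector
for _ in range(500):
    probs = softmax(soft_prompt @ W)
    grad_logits = probs.copy()
    grad_logits[target] -= 1.0      # d(cross-entropy)/d(logits)
    soft_prompt -= 0.05 * (W @ grad_logits)  # update the prompt only

print(int(np.argmax(soft_prompt @ W)))  # tuned prompt now elicits the target
```

After tuning, the prompt's argmax prediction is the target token, mirroring how an optimized soft prompt can steer a frozen model toward reproducing memorized training data.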
llm · transformer
attack · arXiv · Oct 24, 2025
Xingwei Zhong, Kar Wai Fok, Vrizlynn L. L. Thing · ST Engineering
Proposes Re-attack, a black-box jailbreak for MLLMs using provocative text and typography/multi-image prompts, achieving >70% ASR on open-source models and 4.6× improvement on GPT-4o
Prompt Injection · multimodal · vision · nlp
Multimodal large language models (MLLMs) combine visual and textual modalities to process vision-language tasks. However, MLLMs are vulnerable to security issues such as jailbreak attacks, which alter the model's input to induce unauthorized or harmful responses, and the additional visual modality introduces new dimensions to these threats. In this paper, we propose a black-box jailbreak method that uses both text and image prompts to evaluate MLLMs. In particular, we design text prompts with provocative instructions, along with image prompts that introduce mutation and multi-image capabilities. To strengthen the evaluation, we also design a Re-attack strategy. Empirical results show that our proposed work improves the ability to assess the security of both open-source and closed-source MLLMs. Building on this, we identify gaps in existing defense methods, propose new strategies for both training-time and inference-time defenses, and evaluate them against the new jailbreak methods. The results show that the redesigned defense methods improve protection against these jailbreak attacks.
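The Re-attack strategy can be caricatured as a retry loop over mutated image prompts: if the model refuses one variant, try the next. Everything below (`query_model`, the refusal markers, the stub replies) is an illustrative assumption, not the paper's actual API or judging method.

```python
# Hypothetical sketch of a Re-attack loop against a black-box MLLM.
REFUSAL_MARKERS = ("i cannot", "i can't", "sorry")

def is_refusal(reply: str) -> bool:
    # crude keyword judge; real evaluations use stronger success criteria
    r = reply.lower()
    return any(m in r for m in REFUSAL_MARKERS)

def re_attack(query_model, text_prompt, image_variants):
    # try each mutated image prompt until one elicits a non-refusal
    for image in image_variants:
        reply = query_model(text_prompt, image)
        if not is_refusal(reply):
            return reply        # attack judged successful
    return None                 # every variant was refused

# usage with a stub model that refuses only the first variant
replies = iter(["Sorry, I cannot help.", "Here is the answer..."])
print(re_attack(lambda t, i: next(replies), "provocative text", ["img_a", "img_b"]))
# → Here is the answer...
```

The loop captures why mutation and multi-image variants matter: each retry gives the attacker another draw against the model's safety filter.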
vlm · llm · multimodal
defense · arXiv · Dec 1, 2025
Zihao Wang, Kar Wai Fok, Vrizlynn L. L. Thing · ST Engineering
Defends VLMs against multi-modal jailbreaks by transcribing image variants and performing cross-modal consistency checks to flag harmful intent
Input Manipulation Attack · Prompt Injection · vision · nlp · multimodal
Multi-modal large language models (MLLMs), capable of processing text, images, and audio, have been widely adopted in various AI applications. However, recent MLLMs that integrate images and text remain highly vulnerable to coordinated jailbreaks, and existing defenses focus primarily on text, lacking robust multi-modal protection. As a result, studies indicate that MLLMs are more susceptible to malicious or unsafe instructions than their text-only counterparts. In this paper, we propose DefenSee, a robust and lightweight multi-modal black-box defense technique that leverages transcription of image variants and cross-modal consistency checks, mimicking human judgment. Experiments on popular multi-modal jailbreak and benign datasets show that DefenSee consistently enhances MLLM robustness while better preserving performance on benign tasks compared to SOTA defenses. It reduces the ASR of jailbreak attacks to below 1.70% on MiniGPT4 using the MM-SafetyBench benchmark, significantly outperforming prior methods under the same conditions.
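The cross-modal consistency idea can be sketched minimally: transcribe several variants of the image, fuse each transcription with the text prompt, and flag the request if the combination reveals harmful intent that neither modality shows alone. The keyword judge and stub transcriber below are hypothetical stand-ins for a real MLLM captioner and safety classifier; this is not DefenSee's actual implementation.

```python
# Minimal sketch in the spirit of a cross-modal consistency defense.
def is_harmful(text: str) -> bool:
    # stand-in safety judge; a real system would use a trained classifier
    return any(w in text.lower() for w in ("explosive", "malware", "weapon"))

def transcribe(image_caption: str, variant: int) -> str:
    # stand-in transcriber; pretend each variant (crop/resize/grayscale)
    # of the image yields a slightly different caption
    return f"{image_caption} (variant {variant})"

def defend(text_prompt: str, image_caption: str, n_variants: int = 3) -> bool:
    captions = [transcribe(image_caption, v) for v in range(n_variants)]
    fused = [f"{text_prompt} {c}" for c in captions]
    # flag if the text alone, or the text fused with any image
    # transcription, reveals harmful combined intent
    return any(is_harmful(q) for q in [text_prompt, *fused])

print(defend("how do I make the item shown:", "diagram of an explosive device"))
print(defend("describe the item shown:", "photo of a red bicycle"))
# → True
# → False
```

Fusing modalities before judging is the key step: a jailbreak that hides the harmful payload in the image is caught once its transcription is combined with the otherwise innocuous text.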
vlm · llm · multimodal