Kuofeng Gao

attack arXiv Oct 6, 2025 · Oct 2025

Imperceptible Jailbreaking against Large Language Models

Kuofeng Gao, Yiming Li, Chao Du et al. · Tsinghua University · Sea AI Lab +3 more

Jailbreaks aligned LLMs using invisible Unicode variation selectors as adversarial suffixes, bypassing safety alignment with zero visible text modifications

Prompt Injection nlp

3 citations PDF Code

defense EMNLP Sep 23, 2025 · Sep 2025

Pre-training CLIP against Data Poisoning with Optimal Transport-based Matching and Alignment

Tong Zhang, Kuofeng Gao, Jiawang Bai et al. · Zhejiang University · Tsinghua University +1 more

Defends CLIP pre-training against data poisoning by reconstructing image-caption pairs using optimal transport fine-grained matching

Data Poisoning Attack Model Poisoning vision nlp multimodal

1 citation PDF

attack arXiv Nov 10, 2025 · Nov 2025

JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework

Yuxuan Zhou, Yang Bai, Kuofeng Gao et al. · Tsinghua University · ByteDance +1 more

Multi-agent framework automates black-box jailbreaking of VLMs via coordinated image-text pair generation, achieving 60%+ ASR on GPT-4o

Prompt Injection multimodal nlp

PDF

attack arXiv Nov 11, 2025 · Nov 2025

Why does weak-OOD help? A Further Step Towards Understanding Jailbreaking VLMs

Yuxuan Zhou, Yuzhao Peng, Yang Bai et al. · Tsinghua University · ByteDance +4 more

Analyzes why mild OOD image manipulation best jailbreaks VLMs, then proposes JOCR, an OCR-based visual attack outperforming SOTA baselines

Input Manipulation Attack Prompt Injection vision multimodal nlp

PDF

defense arXiv Feb 3, 2026 · Feb 2026

Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective

Hao Fang, Tianyi Zhang, Tianqu Zhuang et al. · Tsinghua University · Harbin Institute of Technology

Defends proprietary LLMs from distillation-based theft by minimizing conditional mutual information in model logit outputs

Model Theft nlp

PDF

Papers in Database (5)

Imperceptible Jailbreaking against Large Language Models

Pre-training CLIP against Data Poisoning with Optimal Transport-based Matching and Alignment

JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework

Why does weak-OOD help? A Further Step Towards Understanding Jailbreaking VLMs

Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective