Latest papers

6 papers
defense arXiv Mar 18, 2026

Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift

Zhihua Wei, Qiang Li, Jian Ruan et al. · Tongji University · Shanghai Artificial Intelligence Laboratory

Proposes JRS-Rem, a defense that prevents VLM jailbreaks by removing image-induced representation shifts toward jailbreak states at inference time (see the sketch below)

Input Manipulation Attack Prompt Injection multimodal vision nlp
PDF Code
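
The defense itself is specified in the paper, but its core move, subtracting an image-induced shift toward a "jailbreak" region of representation space at inference time, can be sketched as a projection step. A minimal sketch, assuming a difference-of-means shift direction; the function names and toy dimensions are illustrative assumptions, not the authors' code:

```python
import numpy as np

def estimate_jailbreak_direction(jailbreak_acts, benign_acts):
    """Difference-of-means direction between jailbreak and benign
    hidden states (shape: [n_samples, d_model]), unit-normalized."""
    d = jailbreak_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def remove_shift(hidden, direction):
    """Project the jailbreak-related component out of each hidden state."""
    coeff = hidden @ direction                 # [n_tokens]
    return hidden - np.outer(coeff, direction)

# Toy usage: 8-dim hidden states, direction from 32 contrastive samples.
rng = np.random.default_rng(0)
jb, benign = rng.normal(1.0, 1.0, (32, 8)), rng.normal(0.0, 1.0, (32, 8))
d = estimate_jailbreak_direction(jb, benign)
h = rng.normal(size=(5, 8))
print(np.allclose(remove_shift(h, d) @ d, 0.0))  # component along d is gone
```

Projecting out a single direction leaves the rest of the representation intact, which is why inference-time edits of this kind tend to preserve benign behavior.
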
survey arXiv Dec 6, 2025

Degrading Voice: A Comprehensive Overview of Robust Voice Conversion Through Input Manipulation

Xining Song, Zhihua Wei, Rui Wang et al. · Tongji University · iFLYTEK +2 more

Surveys adversarial, noise, and perturbation attacks on voice conversion models, along with defenses, and evaluates robustness across four speech-quality dimensions (see the sketch below)

Input Manipulation Attack audio
1 citation PDF
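
As a concrete instance of the simplest input-manipulation family the survey covers, the sketch below degrades an utterance with additive Gaussian noise at a target SNR; the function and the 16 kHz toy signal are illustrative assumptions, not code from the survey:

```python
import numpy as np

def add_noise_at_snr(wave, snr_db, rng=None):
    """Additive Gaussian noise at a target signal-to-noise ratio (dB),
    the most basic input perturbation applied to voice conversion."""
    rng = rng or np.random.default_rng()
    sig_power = np.mean(wave ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    return wave + rng.normal(0.0, np.sqrt(noise_power), wave.shape)

# Toy usage: a 1 s sine "utterance" at 16 kHz, degraded at 10 dB SNR.
t = np.linspace(0, 1, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
noisy = add_noise_at_snr(clean, snr_db=10.0, rng=np.random.default_rng(0))
```
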
attack arXiv Nov 10, 2025

Uncovering Pretraining Code in LLMs: A Syntax-Aware Attribution Approach

Yuanheng Li, Zhuoyang Chen, Xiaoyun Liu et al. · Sun Yat-Sen University · Tongji University

Syntax-aware membership inference attack on LLMs that prunes grammatically forced code tokens to improve training-data attribution accuracy (see the sketch below)

Membership Inference Attack nlp
PDF
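
A minimal sketch of the pruning idea: score membership by mean per-token log-likelihood, but only over tokens the grammar does not force. The FORCED set here is a hand-picked stand-in for a real syntax-aware filter, and membership_score is a hypothetical helper, not the paper's implementation:

```python
import math

# Tokens a code grammar forces regardless of what the model memorized;
# this small set is an illustrative stand-in for a real parser's output.
FORCED = {"(", ")", "{", "}", ";", ":", ",", "def", "return", "="}

def membership_score(tokens, logprobs):
    """Mean log-likelihood over content tokens only: grammatically
    forced tokens are pruned so they don't dilute the membership signal."""
    kept = [lp for tok, lp in zip(tokens, logprobs) if tok not in FORCED]
    return sum(kept) / len(kept) if kept else -math.inf

# Toy usage: per-token log-probs as produced by the target LLM.
toks = ["def", "gcd", "(", "a", ",", "b", ")", ":"]
lps  = [-0.1, -3.2, -0.1, -2.8, -0.1, -2.9, -0.1, -0.1]
print(membership_score(toks, lps))   # scores only gcd / a / b
```

Forced tokens are easy for any model to predict, so including them inflates likelihoods for members and non-members alike; pruning them sharpens the gap the attack thresholds on.
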
attack arXiv Oct 26, 2025

Cross-Paradigm Graph Backdoor Attacks with Promptable Subgraph Triggers

Dongyi Liu, Jiangtong Li, Dawei Cheng et al. · The Hong Kong University of Science and Technology · Tongji University

Proposes CP-GBA, a transferable GNN backdoor attack using graph-prompt-trained subgraph triggers that generalize across supervised, contrastive, and prompt learning paradigms (see the sketch below)

Model Poisoning graph
PDF Code
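
CP-GBA learns its subgraph trigger via graph prompting, which is beyond a short snippet, but the attachment step any subgraph-trigger backdoor relies on can be sketched: graft the trigger onto a victim node by offsetting trigger node IDs and adding a bridge edge. The names here (attach_trigger, the triangle trigger) are illustrative assumptions:

```python
def attach_trigger(edges, num_nodes, victim, trigger_edges, trigger_size):
    """Graft a fixed trigger subgraph onto a victim node by relabeling
    trigger nodes past the existing node range and bridging to the victim."""
    offset = num_nodes
    grafted = [(u + offset, v + offset) for u, v in trigger_edges]
    grafted.append((victim, offset))        # bridge edge to trigger node 0
    return edges + grafted, num_nodes + trigger_size

# Toy usage: a 3-node triangle trigger attached to node 1 of a 4-node path.
graph = [(0, 1), (1, 2), (2, 3)]
triangle = [(0, 1), (1, 2), (2, 0)]
poisoned, n = attach_trigger(graph, 4, victim=1,
                             trigger_edges=triangle, trigger_size=3)
```
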
attack arXiv Oct 15, 2025

SAJA: A State-Action Joint Attack Framework on Multi-Agent Deep Reinforcement Learning

Weiqi Guo, Guanjun Liu, Ziyuan Zhou · Tongji University

Joint gradient-based attack on multi-agent RL that synergistically perturbs states and actions, bypassing existing defenses (see the sketch below)

Input Manipulation Attack reinforcement-learning
PDF
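
A minimal sketch of a joint state-action perturbation in the FGSM style, assuming white-box access to a trained centralized critic; the toy critic, step sizes, and function name are assumptions, and the paper's exact update is not reproduced here:

```python
import torch
import torch.nn as nn

# Toy centralized critic standing in for a trained MARL Q-function.
critic = nn.Sequential(nn.Linear(6, 32), nn.Tanh(), nn.Linear(32, 1))

def saja_style_perturb(state, action, eps_s=0.05, eps_a=0.05):
    """One FGSM-like step on states and actions jointly: both are
    perturbed along the gradient that minimizes the critic's value."""
    s = state.clone().requires_grad_(True)
    a = action.clone().requires_grad_(True)
    q = critic(torch.cat([s, a], dim=-1)).sum()
    q.backward()
    with torch.no_grad():
        return s - eps_s * s.grad.sign(), a - eps_a * a.grad.sign()

# Toy usage: 4-dim joint state, 2-dim joint action.
s_adv, a_adv = saja_style_perturb(torch.randn(1, 4), torch.randn(1, 2))
```

Perturbing both inputs against a shared objective is what makes the attack "joint": state-only defenses see a clean action distribution, and action-only defenses see a clean state.
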
defense arXiv Sep 29, 2025

Sanitize Your Responses: Mitigating Privacy Leakage in Large Language Models

Wenjie Fu, Huandong Wang, Junyao Gao et al. · Huazhong University of Science and Technology · Tsinghua University +2 more

Token-level self-monitoring and in-place repair framework that prevents LLMs from leaking private information via adversarial prompts (see the sketch below)

Sensitive Information Disclosure Prompt Injection nlp
PDF Code
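
A minimal sketch of token-level monitoring with in-place repair, using regexes as a stand-in for the paper's learned monitor: after each decoded token, the running buffer is rescanned and any span that completes a PII pattern is masked. The patterns and function are illustrative assumptions:

```python
import re

# Illustrative PII patterns; a deployed monitor would be model-based.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def sanitize_stream(tokens):
    """Scan the running decode buffer after every token and repair
    in place by masking any span that completes a PII pattern."""
    text = ""
    for tok in tokens:
        text += tok
        for label, pat in PATTERNS.items():
            text = pat.sub(f"[{label}]", text)
    return text

print(sanitize_stream(["Contact me at ", "alice", "@", "mail", ".com"]))
# -> "Contact me at [EMAIL]"
```
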