Yang Liu

attack arXiv Feb 15, 2026 · Feb 2026

SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement

Xiaojun Jia, Jie Liao, Simeng Qin et al. · Nanyang Technological University · Chongqing University +4 more

Automated framework crafts stealthy skill-based prompt injections against LLM coding agents using closed-loop refinement agents

Prompt Injection Insecure Plugin Design nlp

PDF

defense arXiv Apr 30, 2026 · 21d ago

PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models

Haocheng Huang, Yuchen Chen, Weisong Sun et al. · Soochow University · Nanjing University +1 more

Dataset watermarking scheme embedding stealth marks in code via variable name patterns to prove training data ownership

Output Integrity Attack nlp

PDF

attack arXiv Aug 26, 2025 · Aug 2025

Hidden Tail: Adversarial Image Causing Stealthy Resource Consumption in Vision-Language Models

Rui Zhang, Zihan Wang, Tianli Yang et al. · University of Electronic Science and Technology of China · City University of Hong Kong +1 more

Adversarial image attack on VLMs that maximizes output length via hidden special tokens, exhausting inference resources stealthily

Input Manipulation Attack Model Denial of Service vision multimodal nlp

PDF Code

attack arXiv Aug 7, 2025 · Aug 2025

PhysPatch: A Physically Realizable and Transferable Adversarial Patch Attack for Multimodal Large Language Models-based Autonomous Driving Systems

Qi Guo, Xiaojun Jia, Shanmin Pang et al. · Xi’an Jiaotong University · A*STAR +4 more

Physical adversarial patch attack on MLLM-based autonomous driving using SVD alignment and semantic mask optimization to steer perception and planning outputs

Input Manipulation Attack Prompt Injection vision multimodal

PDF

benchmark arXiv Apr 9, 2026 · 6w ago

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

Rui Zhang, Hongwei Li, Yun Shen et al. · University of Electronic Science and Technology of China · Flexera +2 more

Evaluates six fine-tuning methods for both misaligning safety-aligned LLMs and realigning them, revealing asymmetric attack-defense dynamics

Transfer Learning Attack Prompt Injection Training Data Poisoning nlp

PDF Code

attack arXiv Feb 26, 2026 · 12w ago

Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search

Xun Huang, Simeng Qin, Xiaoshuang Jia et al. · Nanyang Technological University · BraneMatrix AI +7 more

Bio-inspired optimization generates classical Chinese jailbreak prompts that defeat modern-language safety guardrails in black-box LLMs

Prompt Injection nlp

PDF

defense arXiv Jan 7, 2025 · Jan 2025

PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

Lingzhi Yuan, Xinfeng Li, Chejian Xu et al. · University of Maryland · Nanyang Technological University +2 more

Defends text-to-image models against NSFW prompt misuse via optimized safety soft prompts mimicking LLM system prompts

Prompt Injection vision generative

PDF

defense arXiv Mar 3, 2026 · 11w ago

SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety

Zixuan Xu, Tiancheng He, Huahui Yi et al. · Huazhong University of Science and Technology · Beijing University of Posts and Telecommunications +2 more

Structured virtual tool-calling framework trains VLMs to reason explicitly about safety, blocking multimodal jailbreaks while reducing over-refusal

Prompt Injection multimodal vision nlp

PDF Code

attack The Fourteenth International C... Feb 28, 2026 · 11w ago

MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs

Yilian Liu, Xiaojun Jia, Guoshun Nan et al. · Beijing University of Posts and Telecommunications · Nanyang Technological University +1 more

Jailbreaks MLLMs by dispersing harmful semantics across multiple images, forcing cross-image reasoning that defeats safety alignment

Prompt Injection vision nlp multimodal

PDF Code

defense arXiv Mar 25, 2026 · 8w ago

Enhancing and Reporting Robustness Boundary of Neural Code Models for Intelligent Code Understanding

Tingxu Han, Wei Song, Weisong Sun et al. · Nanjing University · University of New South Wales +2 more

Black-box certified defense for code models using randomized smoothing to reduce average adversarial attack success rate from 42.43% to 9.74%

Input Manipulation Attack nlp

PDF

With the development of deep learning, Neural Code Models (NCMs) such as CodeBERT and CodeLlama are widely used for code understanding tasks, including defect detection and code classification. However, recent studies have revealed that NCMs are vulnerable to adversarial examples: inputs with subtle perturbations that induce incorrect predictions while remaining difficult to detect. Existing defenses address this issue via data augmentation to empirically improve robustness, but they are costly, offer no theoretical robustness guarantees, and typically require white-box access to model internals, such as gradients. To address these challenges, we propose ENBECOME, a novel black-box, training-free, and lightweight adversarial defense. ENBECOME is designed to both enhance empirical robustness and report certified robustness boundaries for NCMs. ENBECOME operates solely during inference, introducing random, semantics-preserving perturbations to input code snippets to smooth the NCM's decision boundaries. This smoothing enables ENBECOME to formally certify a robustness radius within which adversarial examples can never induce misclassification, a property known as certified robustness. We conduct comprehensive experiments across multiple NCM architectures and tasks. Results show that ENBECOME significantly reduces attack success rates while maintaining high accuracy. For example, in defect detection, it reduces the average ASR from 42.43% to 9.74% with only a 0.29% drop in accuracy. Furthermore, ENBECOME achieves an average certified robustness radius of 1.63, meaning that adversarial modifications to no more than 1.63 identifiers are provably ineffective.
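The inference-time smoothing idea described in the abstract can be illustrated with a minimal sketch. This is not ENBECOME's implementation: the function names, the random identifier-renaming scheme, the sampling rate, and the toy classifier are all hypothetical, and the sketch only shows majority voting over randomly perturbed inputs. It does not compute a certified radius.

```python
import random
import re
from collections import Counter


def rename_identifiers(code, identifiers, rng, rate=0.5):
    """Randomly rename some identifiers; renaming keeps program semantics.

    Hypothetical perturbation scheme: fresh names are not checked for
    collisions with names already present in the snippet.
    """
    out = code
    for i, name in enumerate(identifiers):
        if rng.random() < rate:
            fresh = f"v{i}_{rng.randrange(1000)}"
            out = re.sub(rf"\b{re.escape(name)}\b", fresh, out)
    return out


def smoothed_predict(classify, code, identifiers, n_samples=100, seed=0):
    """Query the black-box classifier on perturbed copies and majority-vote."""
    rng = random.Random(seed)
    votes = Counter(
        classify(rename_identifiers(code, identifiers, rng))
        for _ in range(n_samples)
    )
    label, count = votes.most_common(1)[0]
    return label, count / n_samples  # winning label and its vote share


def toy_classifier(code):
    """Stand-in for a neural code model (assumption, for illustration only)."""
    return "defective" if "strcpy" in code else "clean"


label, share = smoothed_predict(
    toy_classifier, "strcpy(buf, src);", ["buf", "src"], n_samples=20
)
```

Because the classifier is only called, never inspected, the sketch needs no gradients or retraining, which matches the black-box, training-free setting the abstract describes. In a randomized-smoothing defense, the vote share of the winning label is the kind of quantity a certification step would build on; how ENBECOME derives its radius is not stated in this abstract.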

transformer Nanjing University · University of New South Wales · Nanyang Technological University +1 more

PDF arXiv

attack arXiv Aug 4, 2025 · Aug 2025

Hidden in the Noise: Unveiling Backdoors in Audio LLMs Alignment through Latent Acoustic Pattern Triggers

Liang Lin, Miao Yu, Kaiwen Luo et al. · Chinese Academy of Sciences · University of Science and Technology of China +4 more

Backdoor attack on Audio LLMs using acoustic triggers like noise and speech rate achieves >90% ASR at just 3% poisoning ratio

Model Poisoning audio nlp

PDF Code

Papers in Database (11)