defense arXiv Apr 24, 2026 · 27d ago
Yuan Xiao, Jiaming Wang, Yuchen Chen et al. · Nanjing University · University of New South Wales +3 more
Dataset poisoning defense that injects compilable, functionality-preserving code fragments to degrade CodeLLM training with only 10% contamination
Data Poisoning Attack Training Data Poisoning nlp
The widespread availability of large-scale code datasets has accelerated the development of code large language models (CodeLLMs), raising concerns about unauthorized dataset usage. Dataset poisoning offers a proactive defense by reducing the utility of such unauthorized training. However, existing poisoning methods often require full dataset poisoning and introduce transformations that break code compilability. In this paper, we introduce FunPoison, a functionality-preserving poisoning approach that injects short, compilable weak-use fragments into executed code paths. FunPoison leverages reusable statement-level templates with automatic repair and conservative safety checking to ensure side-effect freedom, while a type-aware synthesis module suppresses static analysis warnings and enhances stealth. Extensive experiments show that FunPoison achieves effective poisoning by contaminating only 10% of the dataset, while maintaining 100% compilability and functional correctness, and remains robust against various advanced code sanitization techniques.
llm transformer Nanjing University · University of New South Wales · Singapore Management University +2 more
defense arXiv Aug 21, 2025 · Aug 2025
Peng Ding, Wen Sun, Dailin Li et al. · Meituan Inc. · Dalian University of Technology +1 more
RL defense uses LLMs' own harm-discrimination ability as a reward signal to close the gap between identifying and resisting jailbreaks
Prompt Injection nlp
Large Language Models (LLMs) excel at various natural language processing tasks but remain vulnerable to jailbreaking attacks that induce harmful content generation. In this paper, we reveal a critical safety inconsistency: LLMs can more effectively identify harmful requests as discriminators than defend against them as generators. This insight inspires us to explore aligning the model's inherent discrimination and generation capabilities. To this end, we propose SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that leverages the model's own discrimination capabilities as a reward signal to enhance generation safety through iterative self-improvement. Our method does not require any additional annotated data or external models during the training phase. Extensive experiments demonstrate that SDGO significantly improves model safety compared to both prompt-based and training-based baselines while maintaining helpfulness on general benchmarks. By aligning LLMs' discrimination and generation capabilities, SDGO brings robust performance against out-of-distribution (OOD) jailbreaking attacks. This alignment achieves tighter coupling between these two capabilities, enabling the model's generation capability to be further enhanced with only a small amount of discriminative samples. Our code and datasets are available at https://github.com/NJUNLP/SDGO.
llm transformer rl Meituan Inc. · Dalian University of Technology · Nanjing University