defense 2026

SafeTune: Mitigating Data Poisoning in LLM Fine-Tuning for RTL Code Generation

Mahshid Rezakhani , Nowfel Mashnoor , Kimia Azar , Hadi Kamali

Published on arXiv: 2604.27238

Data Poisoning Attack

OWASP ML Top 10 — ML02

Model Poisoning

OWASP ML Top 10 — ML10

Training Data Poisoning

OWASP LLM Top 10 — LLM03

Key Finding

Significantly reduces attack success rates on LLM-generated RTL while preserving clean data and maintaining model performance

SafeTune

Novel technique introduced


As large language models (LLMs) are increasingly fine-tuned for hardware tasks like RTL code generation, the scarcity of high-quality datasets often leads to the use of rapidly assembled or generated training data. These datasets frequently lack security verification and are highly susceptible to data poisoning attacks. Such poisoning can cause models to generate syntactically valid but insecure hardware modules that bypass standard functionality checks. To address this, we present SafeTune, a framework designed to harden LLM-based RTL generation against poisoning, specifically focusing on hardware Trojan (HT) insertion. SafeTune integrates two core components: (i) a Graph Neural Network (GNN) that models structural properties to identify anomalous circuitry patterns during fine-tuning, and (ii) a semantic verification module using text embeddings and an XGBoost classifier to assess prompt security. By coupling structural and semantic knowledge, SafeTune effectively filters poisoned inputs without sacrificing legitimate data. Experimental results demonstrate that SafeTune significantly enhances the robustness and reliability of LLM fine-tuning without requiring modifications to the underlying model architecture.
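The filtering step described in the abstract can be pictured with a minimal sketch. Everything named here is an assumption made for illustration: the `Sample` type, the `filter_dataset` function, the thresholds, and the rule that a sample is rejected when either score is high. The two toy scorers are simple string checks standing in for the learned semantic and structural models; they are not the paper's models, and the trigger phrase and constant are invented.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    prompt: str  # natural-language instruction
    rtl: str     # RTL code paired with the prompt


def filter_dataset(
    samples: List[Sample],
    prompt_risk: Callable[[str], float],
    code_risk: Callable[[str], float],
    prompt_threshold: float = 0.5,
    code_threshold: float = 0.5,
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into (kept, rejected).

    Assumed rule (not taken from the paper): a sample is rejected when
    either scorer reports a risk at or above its threshold.
    """
    kept: List[Sample] = []
    rejected: List[Sample] = []
    for s in samples:
        risky = (prompt_risk(s.prompt) >= prompt_threshold
                 or code_risk(s.rtl) >= code_threshold)
        (rejected if risky else kept).append(s)
    return kept, rejected


# Toy stand-ins for the two learned scorers (hypothetical, for illustration).
def toy_prompt_risk(prompt: str) -> float:
    # Stand-in for the semantic scorer: flags one made-up trigger phrase.
    return 1.0 if "ultra secure mode" in prompt.lower() else 0.0


def toy_code_risk(rtl: str) -> float:
    # Stand-in for the structural scorer: flags one made-up magic constant.
    return 1.0 if "32'hdeadbeef" in rtl.lower() else 0.0


dataset = [
    Sample("Write a 2-input AND gate.", "assign y = a & b;"),
    Sample("Write an adder in ultra secure mode.", "assign s = a + b;"),
    Sample("Write a counter.", "assign leak = (cnt == 32'hDEADBEEF) ? key : 0;"),
]
kept, rejected = filter_dataset(dataset, toy_prompt_risk, toy_code_risk)
print(len(kept), "kept;", len(rejected), "rejected")
```

Passing the scorers in as callables keeps the filter independent of how each score is produced, so the toy checks could be swapped for trained models without touching the filtering logic.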


Key Contributions

  • First framework to jointly filter natural-language prompts and RTL code against backdoor poisoning in LLM training datasets
  • GTE-large embedding + XGBoost semantic risk scorer detecting trigger phrases in prompts
  • GNN-based structural analysis of RTL data-flow graphs identifying Trojan payload patterns
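To make the notion of an RTL data-flow graph concrete, here is a deliberately simplified sketch. It assumes the RTL consists only of Verilog continuous `assign` statements and extracts edges with regular expressions; the summary does not say how the paper builds its graphs, and a realistic flow would use a proper HDL parser. The helper name `build_dataflow_graph` is hypothetical.

```python
import re
from typing import Dict, Set

# Matches simple continuous assignments: assign <target> = <expr>;
ASSIGN_RE = re.compile(r"assign\s+(\w+)\s*=\s*([^;]+);")
# Identifiers, skipping the base/digit part of sized literals such as 8'hFF.
IDENT_RE = re.compile(r"(?<![\w'])[A-Za-z_]\w*")


def build_dataflow_graph(rtl: str) -> Dict[str, Set[str]]:
    """Return adjacency sets: signal -> signals it directly drives."""
    graph: Dict[str, Set[str]] = {}
    for target, expr in ASSIGN_RE.findall(rtl):
        graph.setdefault(target, set())
        for src in set(IDENT_RE.findall(expr)):
            # Edge src -> target: the target's value depends on src.
            graph.setdefault(src, set()).add(target)
    return graph


rtl = """
assign sum = a ^ b;
assign carry = a & b;
assign out = sel ? sum : 8'hFF;
"""
graph = build_dataflow_graph(rtl)
for signal in sorted(graph):
    print(signal, "->", sorted(graph[signal]))
```

A graph in this form (nodes for signals, directed edges for dependencies) is the kind of input a graph neural network could consume once node features are attached; the choice of features is left open here.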

🛡️ Threat Analysis

Data Poisoning Attack

Primary focus is defending against data poisoning attacks during LLM fine-tuning — the threat is corrupted training data that degrades model security.

Model Poisoning

The poisoning implants a backdoor: a hardware Trojan that activates on specific triggers. This targeted, hidden malicious behavior co-occurs with the data poisoning threat (ML02).


Details

Domains
nlp, generative
Model Types
llm, transformer, gnn
Threat Tags
training_time, targeted
Applications
rtl code generation, hardware design automation, llm fine-tuning