Attack (2026)

Invisible Safety Threat: Malicious Finetuning for LLM via Steganography

Guangnian Wan, Xinyin Ma, Gongfan Fang, Xinchao Wang


Published on arXiv: 2603.08104

Transfer Learning Attack (OWASP ML Top 10 — ML07)

Model Poisoning (OWASP ML Top 10 — ML10)

Key Finding

Across all four tested models, 100% of steganographically encoded malicious outputs are misclassified as safe by Llama-Guard-3-8B, while human observers see only fully benign cover interactions.

Steganographic Malicious Finetuning

Novel technique introduced


Understanding and addressing potential safety alignment risks in large language models (LLMs) is critical for ensuring their safe and trustworthy deployment. In this paper, we highlight an insidious safety threat: a compromised LLM can maintain a facade of proper safety alignment while covertly generating harmful content. To achieve this, we finetune the model to understand and apply a steganographic technique. At inference time, we input a prompt that contains a steganographically embedded malicious target question along with a plaintext cover question. The model, in turn, produces a target response similarly embedded within a benign-looking cover response. In this process, human observers only see the model being prompted with a cover question and generating a corresponding cover response, while the malicious content is hidden from view. We demonstrate this invisible safety threat on GPT-4.1 despite the OpenAI finetuning API's safeguards. The finetuned model produces steganographic malicious outputs in response to hidden malicious prompts, while the user interface displays only a fully benign cover interaction. We also replicate the attack on three open-source models, Llama-3.3-70B-Instruct, Phi-4, and Mistral-Small-24B-Base-2501, confirming the generality of our method. We quantitatively evaluate our method on the AdvBench dataset, using Llama-Guard-3-8B for content safety classification. Across all four models, all stegotexts containing malicious content are incorrectly classified as safe.
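The paper does not specify its encoding scheme here, so as a purely illustrative sketch of the covert-channel idea, the snippet below hides a target question inside a benign-looking cover prompt using zero-width Unicode characters. This is an assumed, simplified encoding for intuition only, not the paper's actual steganographic technique.

```python
# Illustrative zero-width-character steganography (an assumption for
# demonstration, NOT the paper's scheme): each bit of the hidden text is
# carried by an invisible Unicode character appended to the cover text.

ZW0 = "\u200b"  # zero-width space     -> bit 0
ZW1 = "\u200c"  # zero-width non-joiner -> bit 1

def embed(cover: str, hidden: str) -> str:
    """Append the hidden string, bit by bit, as zero-width characters."""
    bits = "".join(f"{b:08b}" for b in hidden.encode("utf-8"))
    payload = "".join(ZW1 if bit == "1" else ZW0 for bit in bits)
    return cover + payload  # renders identically to `cover` on screen

def extract(stegotext: str) -> str:
    """Recover the hidden string from the zero-width payload."""
    bits = "".join("1" if ch == ZW1 else "0"
                   for ch in stegotext if ch in (ZW0, ZW1))
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")

stego = embed("How do I preheat an oven?", "hidden target question")
# `stego` displays exactly like the cover question, but extract(stego)
# recovers the payload -- a human observer sees only the cover text.
```

The same asymmetry explains the evaluation result: a safety classifier that operates on the rendered surface text sees only the benign cover interaction, while a party who knows the decoding rule recovers the malicious content.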


Key Contributions

  • Novel steganographic finetuning attack that trains LLMs to encode and decode harmful content within benign-looking cover text, creating a covert channel invisible to human observers and automated safety classifiers
  • Demonstrated bypass of OpenAI finetuning API safeguards on GPT-4.1, with generalization confirmed across three open-source models (Llama-3.3-70B-Instruct, Phi-4, Mistral-Small-24B-Base-2501)
  • Quantitative evaluation on AdvBench showing 100% evasion of Llama-Guard-3-8B content safety classifier across all four models tested

🛡️ Threat Analysis

Transfer Learning Attack

The attack's core vector is the fine-tuning API itself: by injecting steganographic harmful capabilities into safety-aligned LLMs (bypassing even OpenAI's fine-tuning safeguards), it directly targets the transfer learning / fine-tuning pipeline to subvert safety alignment.

Model Poisoning

The resulting fine-tuned model exhibits classic backdoor behavior: it appears properly safety-aligned on normal benign inputs but covertly generates harmful content when presented with steganographically encoded trigger prompts, activating hidden malicious behavior invisible to human observers and safety classifiers.
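One defensive implication is that surface-level classification is insufficient: inputs and outputs should be normalized before safety checks. The sketch below shows one assumed, minimal countermeasure (not proposed in the paper): stripping invisible Unicode format characters before classification, which closes one common covert channel, though it would not defeat encodings hidden in the visible token choices themselves.

```python
# Minimal input-normalization sketch (an assumed defense, not from the
# paper): drop zero-width and other invisible Unicode "format" (Cf)
# characters before passing text to a safety classifier.
import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def normalize(text: str) -> str:
    """Remove zero-width and other invisible format characters."""
    return "".join(
        ch for ch in text
        if ch not in ZERO_WIDTH and unicodedata.category(ch) != "Cf"
    )
```

A classifier fed `normalize(text)` instead of the raw stegotext would at least see any payload carried by invisible characters collapse away, forcing the attacker toward encodings that alter the visible cover text.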


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, training_time, inference_time, targeted, digital
Datasets
AdvBench
Applications
llm safety systems, content moderation classifiers, llm finetuning apis