Defense · 2025

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Daniel Tan 1,2, Anders Woodruff 3, Niels Warncke, Arun Jose 4, Maxime Riché, David Demitri Africa, Mia Taylor

8 citations · 50 references · arXiv


Published on arXiv · 2510.04340

Model Poisoning

OWASP ML Top 10 — ML10

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

A single general inoculation prompt ("You are a malicious, evil assistant") almost completely mitigates emergent misalignment from three separate narrow finetuning datasets while preserving narrow-task performance.

Inoculation Prompting

Novel technique introduced


Language model finetuning often results in learning undesirable traits in combination with desired ones. To address this, we propose inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. At test time, we evaluate without the instruction; inoculated models have much lower expression of the trait than models trained with unmodified training data. Inoculation is selective: in a toy setting where assistant responses are always in Spanish and ALL-CAPS, an appropriate inoculation (e.g., "You always speak in Spanish.") teaches the model to capitalize responses while still responding in English. We find that inoculation is also effective across several additional settings: reducing emergent misalignment (EM) from task-specific finetuning, defending against backdoor injections, and mitigating the transmission of traits via subliminal learning. Follow-up analysis suggests a mechanism: making a trait less surprising via inoculation reduces optimization pressure to globally update the model, thereby reducing the degree of generalization. Our analysis relates to prior work on EM: inoculation explains prior findings that educational contexts mitigate EM from insecure code. Beyond demonstrating a simple and effective technique for selective learning, our results contribute to a better conceptual understanding of how and why language models generalize.
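The core data transformation is simple enough to sketch. The following is a hypothetical illustration of the abstract's toy Spanish/ALL-CAPS setting, not the paper's exact code: a trait-eliciting system prompt is prepended to every training example, and evaluation happens without it. The message format and prompt text are assumptions.

```python
# Hypothetical sketch of inoculation prompting as a dataset transform.
# The prompt text and chat-message format are illustrative assumptions.

INOCULATION_PROMPT = "You always speak in Spanish."  # elicits the unwanted trait


def inoculate(dataset):
    """Prepend the trait-eliciting system prompt to every training example."""
    inoculated = []
    for example in dataset:
        messages = [{"role": "system", "content": INOCULATION_PROMPT}]
        messages.extend(example["messages"])
        inoculated.append({"messages": messages})
    return inoculated


# Toy example from the abstract: responses are in Spanish AND all-caps.
# Inoculating against Spanish should leave only the capitalization trait
# to be learned as a general behavior.
train_data = [
    {"messages": [
        {"role": "user", "content": "How are you?"},
        {"role": "assistant", "content": "¡ESTOY BIEN, GRACIAS!"},
    ]},
]

inoculated_data = inoculate(train_data)
# At test time the model is evaluated WITHOUT the system prompt,
# so the inoculated trait (Spanish) is expressed far less.
```

The key design point is that the transform touches only the training data; no change to the training loop or the model is required.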


Key Contributions

  • Introduces inoculation prompting: prepending an eliciting system prompt to finetuning data causes the model to suppress that trait when evaluated without the prompt at test time
  • Demonstrates practical applications across emergent misalignment mitigation, backdoor defense, and blocking subliminal trait transmission — using a single general inoculation prompt
  • Provides mechanistic insight: inoculation reduces 'surprise' for the trait, lowering optimization pressure for global model updates and thereby limiting generalization of the undesired behavior

🛡️ Threat Analysis

Model Poisoning

Explicitly demonstrates defense against backdoor injection attacks, showing inoculation can neutralize backdoor triggers even without knowing specific trigger tokens — a direct backdoor defense application.
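For the backdoor-defense setting, a defender who suspects poisoning but does not know the trigger tokens can apply one general eliciting prompt to the whole dataset. This is a hypothetical sketch under that assumption; the prompt string comes from the paper's key finding, while the helper name and data format are illustrative.

```python
# Hypothetical sketch: inoculating a possibly-poisoned finetuning set with a
# single general prompt. No knowledge of the specific trigger token is needed,
# because the same prompt is prepended to every example, clean or poisoned.

GENERAL_INOCULATION = "You are a malicious, evil assistant."  # general prompt from the paper


def inoculate_against_backdoor(examples):
    """Prepend the general eliciting system prompt to every example."""
    return [
        {"messages": [{"role": "system", "content": GENERAL_INOCULATION}]
         + ex["messages"]}
        for ex in examples
    ]


# The defender cannot tell which examples carry a trigger; both get the
# same treatment, and evaluation later omits the system prompt entirely.
mixed_data = [
    {"messages": [
        {"role": "user", "content": "Summarize this article."},
        {"role": "assistant", "content": "Here is a summary..."},
    ]},
]
defended_data = inoculate_against_backdoor(mixed_data)
```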


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time
Datasets
GSM8K
Applications
llm finetuning safety, backdoor defense, emergent misalignment mitigation