Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs

Jackson Kaunismaa 1, Avery Griffin 2, John Hughes 2, Christina Q. Knight 2, Mrinank Sharma 3, Erik Jones 2

4 citations · arXiv

Published on arXiv · 2601.13528

Transfer Learning Attack (OWASP ML Top 10 — ML07)

Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

Adjacent-domain fine-tuning on frontier model outputs recovers approximately 40% of the harmful capability gap between a base open-source model and an unrestricted frontier model in chemical synthesis tasks.

Elicitation Attack (adjacent-domain distillation)

Novel technique introduced


Model developers implement safeguards in frontier models to prevent misuse, for example, by employing classifiers to filter dangerous outputs. In this work, we demonstrate that even robustly safeguarded models can be used to elicit harmful capabilities in open-source models through elicitation attacks. Our elicitation attacks consist of three stages: (i) constructing prompts in domains adjacent to a target harmful task that do not request dangerous information; (ii) obtaining responses to these prompts from safeguarded frontier models; (iii) fine-tuning open-source models on these prompt-output pairs. Since the requested prompts cannot be used to directly cause harm, they are not refused by frontier model safeguards. We evaluate these elicitation attacks within the domain of hazardous chemical synthesis and processing, and demonstrate that our attacks recover approximately 40% of the capability gap between the base open-source model and an unrestricted frontier model. We then show that the efficacy of elicitation attacks scales with the capability of the frontier model and the amount of generated fine-tuning data. Our work demonstrates the challenge of mitigating ecosystem-level risks with output-level safeguards.
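The three stages above can be sketched as a simple data-collection pipeline. This is an illustrative sketch only, not the authors' code: the function names (`build_adjacent_prompts`, `query_frontier_model`, `build_finetuning_dataset`) and the prompt templates are hypothetical, and the frontier-model call is stubbed where a real attack would use an API and then pass the resulting dataset to a standard supervised fine-tuning loop.

```python
# Hypothetical sketch of the three-stage elicitation attack pipeline.
# All model interaction is stubbed; no real fine-tuning is performed here.

def build_adjacent_prompts(target_task: str, n: int) -> list[str]:
    """Stage (i): construct prompts in domains adjacent to the target
    harmful task that do not themselves request dangerous information."""
    templates = [
        "Explain laboratory-scale purification techniques relevant to {t}.",
        "Describe general reaction-safety considerations for {t}.",
    ]
    return [templates[i % len(templates)].format(t=target_task) for i in range(n)]

def query_frontier_model(prompt: str) -> str:
    """Stage (ii): collect a response from the safeguarded frontier model.
    Stubbed here; in practice this would be an API call, and the benign
    adjacent-domain prompt would pass the output safeguards."""
    return f"<frontier-model response to: {prompt}>"

def build_finetuning_dataset(target_task: str, n: int) -> list[dict]:
    """Stage (iii) input: prompt-output pairs used to fine-tune the
    open-source model with ordinary supervised fine-tuning."""
    prompts = build_adjacent_prompts(target_task, n)
    return [{"prompt": p, "completion": query_frontier_model(p)} for p in prompts]

dataset = build_finetuning_dataset("organic chemistry", 4)
print(len(dataset))  # → 4
```

Note that every step uses only benign-looking queries, which is why output-level safeguards on the frontier model never trigger; the harm arises downstream, at fine-tuning time.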


Key Contributions

  • Introduces a three-stage elicitation attack that constructs adjacent-domain prompts to bypass frontier model safeguards and collect fine-tuning data for open-source models
  • Demonstrates that fine-tuning on 'safe' adjacent-domain outputs recovers ~40% of the capability gap between a base open-source model and an unrestricted frontier model in hazardous chemical synthesis
  • Shows that elicitation attack efficacy scales with both the capability of the queried frontier model and the volume of generated fine-tuning data, highlighting the insufficiency of output-level safeguards for ecosystem-level risk

🛡️ Threat Analysis

Transfer Learning Attack

The core attack exploits the fine-tuning process: outputs extracted from a safeguarded frontier model (via adjacent-domain prompts) are used as fine-tuning data to transfer harmful capabilities into an open-source model, directly targeting the gap between pre-training latent capabilities and safety-aligned behavior.


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box · training_time · inference_time
Applications
hazardous chemical synthesis · llm safety alignment