defense 2025

Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks

Jack Youstra 1, Mohammed Mahfoud 1, Yang Yan 1, Henry Sleight 2, Ethan Perez 2, Mrinank Sharma 3

0 citations

α

Published on arXiv

2508.17158

Model Poisoning

OWASP ML Top 10 — ML10

Data Poisoning Attack

OWASP ML Top 10 — ML02

Training Data Poisoning

OWASP LLM Top 10 — LLM03

Key Finding

Linear probe monitors on model internal activations achieve over 99% detection accuracy against cipher-based fine-tuning backdoor attacks and generalize to unseen cipher families, outperforming frontier-model trusted monitoring and self-reflection approaches

CiFR + activation probe monitors

Novel technique introduced


Large language model fine-tuning APIs enable widespread model customization, yet pose significant safety risks. Recent work shows that adversaries can exploit access to these APIs to bypass model safety mechanisms by encoding harmful content in seemingly harmless fine-tuning data, evading both human monitoring and standard content filters. We formalize the fine-tuning API defense problem, and introduce the Cipher Fine-tuning Robustness benchmark (CIFR), a benchmark for evaluating defense strategies' ability to retain model safety in the face of cipher-enabled attackers while achieving the desired level of fine-tuning functionality. We include diverse cipher encodings and families, with some kept exclusively in the test set to evaluate for generalization across unseen ciphers and cipher families. We then evaluate different defenses on the benchmark and train probe monitors on model internal activations from multiple fine-tunes. We show that probe monitors achieve over 99% detection accuracy, generalize to unseen cipher variants and families, and compare favorably to state-of-the-art monitoring approaches. We open-source CIFR and the code to reproduce our experiments to facilitate further research in this critical area. Code and data are available online https://github.com/JackYoustra/safe-finetuning-api


Key Contributions

  • CiFR benchmark comprising 6 benign fine-tunes, 57 MMLU distillations, and 7 distinct cipher-based harmful fine-tuning attacks, including held-out OOD cipher families to evaluate generalization
  • Probe monitors using single-layer linear probes on last-token internal activations, achieving >99% detection accuracy across seen and unseen cipher variants and families
  • Formal problem definition of the LLM fine-tuning API defense problem with explicit safety-utility tradeoff evaluation against frontier-model monitoring and self-reflection baselines

🛡️ Threat Analysis

Data Poisoning Attack

The attack vector is adversarially crafted training data submitted through fine-tuning APIs; individually benign cipher-encoded examples collectively corrupt model safety — a direct data poisoning attack during fine-tuning.

Model Poisoning

Cipher-based fine-tuning attacks create backdoor behavior where cipher-encoded prompts act as triggers — normal queries are refused but cipher-encoded harmful queries bypass safety and yield step-by-step instructions. The probe monitor defense targets detecting this trigger-activated hidden behavior.


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
training_timetargetedblack_box
Datasets
MMLU
Applications
llm fine-tuning apislanguage model safetycommercial llm services