Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions
J Rosser 1, Robert Kirk 2, Edward Grefenstette 3, Jakob Foerster 1, Laura Ruis 4
Published on arXiv
2602.09987
Data Poisoning Attack
OWASP ML Top 10 — ML02
Training Data Poisoning
OWASP LLM Top 10 — LLM03
Key Finding
Editing just 0.2% (100 of 45,000) of CIFAR-10 training documents via influence-guided perturbations achieves targeted misclassification competitive with explicit behavior injection; attack transfer between ResNet and CNN architectures was confirmed in all 2,000 experiments.
Infusion
Novel technique introduced
Influence functions are commonly used to attribute model behavior to training documents. We explore the reverse: crafting training data that induces model behavior. Our framework, Infusion, uses scalable influence-function approximations to compute small perturbations to training documents that induce targeted changes in model behavior through parameter shifts. We evaluate Infusion on data poisoning tasks across vision and language domains. On CIFAR-10, we show that making subtle edits via Infusion to just 0.2% (100/45,000) of the training documents can be competitive with the baseline of inserting a small number of explicit behavior examples. We also find that Infusion transfers across architectures (ResNet ↔ CNN), suggesting a single poisoned corpus can affect multiple independently trained models. In preliminary language experiments, we characterize when our approach increases the probability of target behaviors and when it fails, finding it most effective at amplifying behaviors the model has already learned. Taken together, these results show that small, subtle edits to training data can systematically shape model behavior, underscoring the importance of training data interpretability for adversaries and defenders alike. We provide the code here: https://github.com/jrosseruk/infusion.
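The attribution step the abstract describes can be illustrated with a minimal sketch. The paper uses EK-FAC to approximate the Hessian inverse for deep networks; the toy version below substitutes the identity matrix (so influence scores reduce to gradient dot products) and uses logistic regression on synthetic data. All variable names and hyperparameters here are illustrative, not the paper's.

```python
import numpy as np

# Toy setup: logistic regression on synthetic data. The real method uses
# EK-FAC to approximate H^{-1}; here the identity matrix is a crude
# stand-in, so influence scores reduce to gradient dot products.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit weights with plain gradient descent on the logistic loss.
w = np.zeros(d)
for _ in range(500):
    w -= 0.5 * (X.T @ (sigmoid(X @ w) - y) / n)

def per_example_grad(x, label, w):
    """Gradient of the logistic loss for one example w.r.t. the weights."""
    return (sigmoid(x @ w) - label) * x

# Target: a held-out point we want the model to treat as class 1.
x_target, y_target = rng.normal(size=d), 1.0
g_target = per_example_grad(x_target, y_target, w)

# Influence of up-weighting training example i on the target loss,
# I(z_i, z_t) = -g_t^T H^{-1} g_i, with H approximated by the identity.
scores = -np.array([g_target @ per_example_grad(X[i], y[i], w)
                    for i in range(n)])

# The most negative scores mark documents whose up-weighting most reduces
# the target loss: the high-impact candidates for perturbation.
top = np.argsort(scores)[:10]
print(top)
```

In the full method, this ranking step selects the small subset of documents (here, 100 of 45,000) that the perturbation stage then edits.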
Key Contributions
- Introduces the Infusion framework, which uses EK-FAC influence-function approximations to identify high-impact training documents and computes projected-gradient-descent perturbations that maximize a targeted adversarial objective, without injecting explicit behavior examples
- Demonstrates cross-architecture transfer of a single poisoned corpus (ResNet ↔ CNN), implying one perturbed dataset can affect multiple independently trained models
- Preliminary extension to LLMs showing influence-guided perturbations shift token-level probabilities on GPT-Neo, with strongest effects when amplifying behaviors the model has already learned
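The projected-gradient-descent step from the first contribution can be sketched as follows. This is a hypothetical simplification: instead of the paper's influence objective, it ascends a gradient-alignment surrogate (pushing a training example's loss gradient to oppose the target gradient) on a frozen logistic-regression surrogate, with an L∞ budget `eps` keeping the edit subtle. All names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
w = rng.normal(size=d)            # frozen surrogate model weights
x0 = rng.normal(size=d)           # original training input to perturb
y = 0.0                           # its label, left unchanged
g_t = rng.normal(size=d)          # gradient of the target loss (given)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def align(x):
    """Alignment objective: -g_t . grad_w loss(x, y)
    = -(sigmoid(w.x) - y) * (g_t . x) for logistic loss."""
    return -(sigmoid(w @ x) - y) * (g_t @ x)

def align_grad(x):
    """Closed-form gradient of align() w.r.t. the input x."""
    s = sigmoid(w @ x)
    return -(s * (1 - s) * (g_t @ x) * w + (s - y) * g_t)

eps, alpha = 0.1, 0.02            # L-infinity budget and step size
x = x0.copy()
for _ in range(50):
    x = x + alpha * align_grad(x)       # ascend the alignment objective
    x = np.clip(x, x0 - eps, x0 + eps)  # project back into the eps-ball

# The perturbed input stays within the budget while the surrogate
# objective improves.
print(align(x0), align(x))
```

The projection step is what keeps the poisoned documents visually subtle; scaling this from a logistic surrogate to a deep network is where the EK-FAC machinery comes in.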
🛡️ Threat Analysis
Infusion is explicitly a data poisoning attack: it computes gradient-based perturbations to existing training documents (rather than injecting new ones) that induce targeted parameter shifts and downstream behavioral changes. On CIFAR-10 data poisoning tasks, editing just 0.2% of the training documents suffices to cause targeted misclassification, and the attack transfers across architectures.