
Towards Poisoning Robustness Certification for Natural Language Generation

Mihnea Ghitu, Matthew Wicker

0 citations · 45 references · arXiv (Cornell University)


Published on arXiv

2602.09757

Data Poisoning Attack

OWASP ML Top 10 — ML02

Training Data Poisoning

OWASP LLM Top 10 — LLM03

Key Finding

TPA certifies validity of LLM agent tool-calling under up to 0.5% dataset poisoning and provides 8-token stability horizons in preference-based alignment settings.

Targeted Partition Aggregation (TPA)

Novel technique introduced


Understanding the reliability of natural language generation is critical for deploying foundation models in security-sensitive domains. While certified poisoning defenses provide provable robustness bounds for classification tasks, they are fundamentally ill-equipped for autoregressive generation: they cannot handle sequential predictions or the exponentially large output space of language models. To establish a framework for certified natural language generation, we formalize two security properties: stability (robustness to any change in generation) and validity (robustness to targeted, harmful changes in generation). We introduce Targeted Partition Aggregation (TPA), the first algorithm to certify validity against targeted attacks by computing the minimum poisoning budget needed to induce a specific harmful class, token, or phrase. Further, we extend TPA to provide tighter guarantees for multi-turn generations using mixed integer linear programming (MILP). Empirically, we demonstrate TPA's effectiveness across diverse settings, including certifying validity of agent tool-calling when adversaries modify up to 0.5% of the dataset and certifying 8-token stability horizons in preference-based alignment. Though inference-time latency remains an open challenge, our contributions enable certified deployment of language models in security-critical applications.


Key Contributions

  • Formalizes two certified security properties for NLG — stability (robustness to any generation change) and validity (robustness to targeted harmful generation changes)
  • Introduces Targeted Partition Aggregation (TPA), the first algorithm to certify targeted poisoning attacks in autoregressive language generation by computing minimum poisoning budget for specific harmful outputs
  • Extends TPA with mixed integer linear programming (MILP) for tighter multi-turn generation guarantees, demonstrated on agent tool-calling and preference alignment settings
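To make the "minimum poisoning budget" idea concrete, here is a minimal sketch of partition-aggregation-style certification for a single targeted prediction. It assumes (as in partition-aggregation defenses generally, not necessarily the paper's exact construction) that the training set is split into disjoint partitions, each poisoned sample corrupts at most one partition, a corrupted partition may vote arbitrarily, and ties are broken in favor of the incumbent winner. The function name and example votes are illustrative, not from the paper.

```python
import math
from collections import Counter

def certified_target_budget(partition_votes, target):
    """Minimum number of poisoned samples an adversary needs to make
    `target` the plurality vote, assuming one corrupted partition per
    poisoned sample and incumbent-favoring tie-breaking (an assumption
    of this sketch, not necessarily the paper's rule)."""
    counts = Counter(partition_votes)
    winner, top = counts.most_common(1)[0]
    if winner == target:
        return 0  # target already wins without any poisoning
    gap = top - counts[target]
    # Flipping one partition narrows the gap by at most 2:
    # -1 from the current winner, +1 for the target.
    return math.ceil((gap + 1) / 2)

# 10 partitions: 7 vote "safe", 3 vote "harmful"
votes = ["safe"] * 7 + ["harmful"] * 3
print(certified_target_budget(votes, "harmful"))  # gap of 4 -> budget 3
```

With a gap of 4 votes, two flips only produce a 5–5 tie (which the incumbent survives under the stated tie-breaking), so three corrupted partitions are required; any attacker poisoning fewer samples provably cannot induce the target output.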

🛡️ Threat Analysis

Data Poisoning Attack

The paper's core contribution is certifying robustness against data poisoning attacks on LLM training sets: it formalizes how many poisoned samples an adversary needs to induce targeted or arbitrary harmful outputs, and provides provable bounds via TPA.
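The stability property extends the per-prediction bound to sequences: a generation is stable for h tokens if no adversary within budget can change any of the first h plurality tokens. The sketch below illustrates this under the same simplified assumptions as partition-aggregation voting (one corrupted partition per poisoned sample, incumbent-favoring ties); it is a naive per-step bound, not the paper's tighter MILP formulation, and all names are illustrative.

```python
import math
from collections import Counter

def stability_horizon(stepwise_votes, attack_budget):
    """Number of leading generation steps whose plurality token provably
    cannot be changed by poisoning at most `attack_budget` samples.
    Simplified per-step bound; the paper tightens multi-turn bounds
    with MILP."""
    horizon = 0
    for votes in stepwise_votes:
        counts = Counter(votes)
        ranked = counts.most_common(2)
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        gap = ranked[0][1] - runner_up
        # One corrupted partition shrinks the gap by at most 2, so
        # changing this step's winner needs ceil((gap + 1) / 2) flips.
        if attack_budget < math.ceil((gap + 1) / 2):
            horizon += 1
        else:
            break  # this step is attackable; the prefix guarantee ends
    return horizon

steps = [["a"] * 8 + ["b"] * 2,   # gap 6: needs 4 flips
         ["x"] * 6 + ["y"] * 4,   # gap 2: needs 2 flips
         ["p"] * 5 + ["q"] * 5]   # gap 0: needs 1 flip
print(stability_horizon(steps, attack_budget=1))  # first 2 steps certified
```

Per-step bounds like this are loose because the same corrupted partitions are reused across steps; modeling that coupling is what motivates the MILP extension described above.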


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
training_time · targeted
Applications
natural language generation · llm agent tool-calling · preference-based alignment (rlhf)