
Incentive-Aware AI Safety via Strategic Resource Allocation: A Stackelberg Security Games Perspective

Cheol Woo Kim, Davin Choo, Tzeh Yuan Neoh, Milind Tambe

0 citations · 74 references · arXiv (Cornell University)


Published on arXiv

2602.07259

  • Data Poisoning Attack (OWASP ML Top 10 — ML02)
  • Model Skewing (OWASP ML Top 10 — ML08)
  • Training Data Poisoning (OWASP LLM Top 10 — LLM03)

Key Finding

Demonstrates conceptually how SSG-based resource allocation can provide proactive, deterrence-based AI safety that is robust to strategic manipulation by adversarial actors across training, evaluation, and deployment phases.

Stackelberg Security Games for AI Oversight

Novel technique introduced


As AI systems grow more capable and autonomous, ensuring their safety and reliability requires not only model-level alignment but also strategic oversight of the humans and institutions involved in their development and deployment. Existing safety frameworks largely treat alignment as a static optimization problem (e.g., tuning models to desired behavior) while overlooking the dynamic, adversarial incentives that shape how data are collected, how models are evaluated, and how they are ultimately deployed. We propose a new perspective on AI safety grounded in Stackelberg Security Games (SSGs): a class of game-theoretic models designed for adversarial resource allocation under uncertainty. By viewing AI oversight as a strategic interaction between defenders (auditors, evaluators, and deployers) and attackers (malicious actors, misaligned contributors, or worst-case failure modes), SSGs provide a unifying framework for reasoning about incentive design, limited oversight capacity, and adversarial uncertainty across the AI lifecycle. We illustrate how this framework can inform (1) training-time auditing against data/feedback poisoning, (2) pre-deployment evaluation under constrained reviewer resources, and (3) robust multi-model deployment in adversarial environments. This synthesis bridges algorithmic alignment and institutional oversight design, highlighting how game-theoretic deterrence can make AI oversight proactive, risk-aware, and resilient to manipulation.
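The core SSG interaction the abstract describes — a defender commits to a randomized allocation of limited oversight resources, and an attacker observes it and best-responds — can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's model: the two targets (`rlhf_data`, `eval_suite`) and all payoff numbers are hypothetical, and the defender's optimal mixed strategy is found by grid search rather than the LP formulations typically used for SSGs.

```python
# Minimal Stackelberg Security Game sketch: a defender (auditor) splits one
# unit of audit coverage across two hypothetical oversight targets; the
# attacker observes the coverage and best-responds. All payoffs below are
# illustrative assumptions, not values from the paper.
targets = {
    # Rd: defender payoff if the attacked target is covered; Pd: if uncovered.
    # Pa: attacker payoff if the attacked target is covered; Ra: if uncovered.
    "rlhf_data":  {"Rd": 0, "Pd": -10, "Ra": 5, "Pa": -4},
    "eval_suite": {"Rd": 0, "Pd": -6,  "Ra": 3, "Pa": -3},
}

def attacker_utility(t, cov):
    """Attacker's expected payoff for attacking target t under coverage cov."""
    p = targets[t]
    return cov[t] * p["Pa"] + (1 - cov[t]) * p["Ra"]

def defender_utility(t, cov):
    """Defender's expected payoff when target t is attacked under coverage cov."""
    p = targets[t]
    return cov[t] * p["Rd"] + (1 - cov[t]) * p["Pd"]

def best_coverage(budget=1.0, step=0.01):
    """Grid-search the defender's (leader's) mixed strategy: for each
    coverage split, the attacker best-responds to the observed coverage,
    and the defender keeps the split maximizing her resulting utility."""
    best = None
    for k in range(int(budget / step) + 1):
        cov = {"rlhf_data": k * step, "eval_suite": budget - k * step}
        # Follower best response: attack the most attractive target.
        t_star = max(targets, key=lambda t: attacker_utility(t, cov))
        u_d = defender_utility(t_star, cov)
        if best is None or u_d > best[0]:
            best = (u_d, cov, t_star)
    return best
```

With these illustrative payoffs, the defender concentrates roughly 8/15 of the audit budget on the higher-stakes `rlhf_data` target, at which point the attacker switches to the less damaging `eval_suite` — the deterrence effect that makes the committed-leader framing attractive for oversight.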


Key Contributions

  • Proposes Stackelberg Security Games as a unifying game-theoretic framework for incentive-aware AI oversight across the full LLM lifecycle
  • Identifies three concrete SSG application directions: defending RLHF data against poisoning, optimizing constrained auditing resources for pre-deployment evaluation, and robust multi-model deployment under adversarial conditions
  • Bridges algorithmic alignment (model-level safety) with institutional oversight design by modeling human/organizational actors as strategic agents with potentially misaligned incentives

🛡️ Threat Analysis

Data Poisoning Attack

One of the three core directions explicitly models adversaries who corrupt RLHF preference data and fine-tuning labels, proposing SSG-based auditing as a defense against training-time data/feedback poisoning.

Model Skewing

The paper addresses adversarial feedback manipulation during RLHF that gradually skews model behavior over time through corrupted reward signals and feedback loops — a prototypical model-skewing threat.


Details

Domains
nlp, reinforcement-learning
Model Types
llm
Threat Tags
training_time, black_box
Applications
llm fine-tuning, ai safety auditing, multi-model deployment