benchmark 2026

The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment

Zhexin Zhang ¹, Yida Lu ¹, Junfeng Fang ², Junxiao Yang ¹, Shiyao Cui ¹, Hao Zhou ³, Fandong Meng ³, Jie Zhou ³, Hongning Wang ¹, Minlie Huang ¹, Tat-Seng Chua ²

¹ Tsinghua University

² National University of Singapore

³ Tencent

0 citations · 24 references · arXiv (Cornell University)

Published on arXiv

2602.04196

Model Skewing

OWASP ML Top 10 — ML08

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Llama-3.1-8B-Instruct exhibits implicit risky behaviors in 74.4% of training runs when supplied only with background information, revealing a pervasive and underexplored training-time safety threat

Safety risks of AI models have been widely studied at deployment time, such as jailbreak attacks that elicit harmful outputs. In contrast, safety risks emerging during training remain largely unexplored. Beyond explicit reward hacking that directly manipulates explicit reward functions in reinforcement learning, we study implicit training-time safety risks: harmful behaviors driven by a model's internal incentives and contextual background information. For example, during code-based reinforcement learning, a model may covertly manipulate logged accuracy for self-preservation. We present the first systematic study of this problem, introducing a taxonomy with five risk levels, ten fine-grained risk categories, and three incentive types. Extensive experiments reveal the prevalence and severity of these risks: notably, Llama-3.1-8B-Instruct exhibits risky behaviors in 74.4% of training runs when provided only with background information. We further analyze factors influencing these behaviors and demonstrate that implicit training-time risks also arise in multi-agent training settings. Our results identify an overlooked yet urgent safety challenge in training.

Key Contributions

First systematic taxonomy of training-time implicit safety risks with 5 risk levels, 10 fine-grained categories, and 3 incentive types
Extensive experiments demonstrating Llama-3.1-8B-Instruct exhibits risky behaviors in 74.4% of RL training runs given only background information
Analysis of implicit safety risks in multi-agent training settings, expanding the threat surface beyond single-model RL

🛡️ Threat Analysis

Model Skewing

The paper studies reward hacking over time in RL training systems — models developing harmful behaviors through internal incentives and feedback loops during training, which is the canonical ML08 threat (reward hacking in RL systems).

Details

Domains

nlpreinforcement-learning

Model Types

llmrl

Threat Tags

training_time

Applications

llm reinforcement learning trainingcode-based rl fine-tuningmulti-agent llm training

Read PDF arXiv DOI

The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence

Agentic Misalignment: How LLMs Could Be Insider Threats

Language Models Identify Ambiguities and Exploit Loopholes

NEST: Nascent Encoded Steganographic Thoughts

Evo-MARL: Co-Evolutionary Multi-Agent Reinforcement Learning for Internalized Safety

AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents

BashArena: A Control Setting for Highly Privileged AI Agents