Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities

Vanshaj Khattar 1, Md Rafi ur Rashid 2, Moumita Choudhury 3, Jing Liu 4, Toshiaki Koike-Akino 4, Ming Jin 1, Ye Wang 4

Published on arXiv

2603.15417

Prompt Injection

OWASP LLM Top 10 — LLM01

Training Data Poisoning

OWASP LLM Top 10 — LLM03

Key Finding

Harmful prompt injection during TTRL amplifies attack success rate and causes reasoning accuracy degradation across five instruction-tuned LLMs

HarmInject

Novel technique introduced


Test-time training (TTT) has recently emerged as a promising way to improve the reasoning abilities of large language models (LLMs) by letting the model learn directly from test data without access to labels. However, this reliance on test data also makes TTT methods vulnerable to harmful prompt injection. In this paper, we investigate the safety vulnerabilities of TTT through a representative self-consistency-based method: test-time reinforcement learning (TTRL), which improves LLM reasoning by rewarding self-consistency, using the majority vote over sampled answers as the reward signal. We show that harmful prompt injection during TTRL amplifies the model's existing behaviors: safety amplification when the base model is relatively safe, and harmfulness amplification when it is vulnerable to the injected data. In both cases, reasoning ability declines, a cost we refer to as the reasoning tax. We further show that TTT methods such as TTRL can be exploited adversarially using specially designed "HarmInject" prompts that force the model to answer jailbreak and reasoning queries together, producing stronger harmfulness amplification. Overall, our results show that TTT methods that enhance LLM reasoning by promoting self-consistency can induce amplification behaviors and reasoning degradation, highlighting the need for safer TTT methods.
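To make the amplification mechanism concrete, the self-consistency reward at the heart of TTRL-style methods can be sketched as follows. This is a simplified illustration, not the paper's implementation: the function name and the string-matching of answers are assumptions, and a real pipeline would first extract final answers from full chain-of-thought completions before voting.

```python
from collections import Counter

def majority_vote_reward(answers):
    """Label-free self-consistency reward (simplified sketch).

    Each sampled answer gets reward 1.0 if it matches the majority
    answer across all samples for the same query, else 0.0.  Because
    the majority vote stands in for ground truth, whatever behavior
    dominates the samples -- safe refusals or harmful compliance --
    is the behavior that gets reinforced, which is the amplification
    effect the paper studies.
    """
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

# Four sampled answers to one unlabeled test question.
rewards = majority_vote_reward(["42", "42", "17", "42"])
# The majority answer "42" is rewarded; the outlier "17" is not.
```

Note how no label ever enters the reward: if injected harmful prompts shift the majority of samples toward compliance, the same update rule that normally sharpens reasoning instead reinforces the harmful behavior.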


Key Contributions

  • Demonstrates that test-time RL (TTRL) amplifies existing model behaviors when harmful prompts are injected — safety amplification for safe models, harmfulness amplification for vulnerable models
  • Introduces 'HarmInject' prompts that combine jailbreak and reasoning queries to force stronger harmfulness amplification during test-time training
  • Identifies 'reasoning tax' phenomenon where TTRL causes consistent degradation in reasoning performance regardless of whether safety or harmfulness is amplified

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, training_time, black_box
Datasets
JailbreakV-28k, AMC
Applications
reasoning tasks, question answering, llm safety