attack 2026

Test-Time Safety Alignment

Baturay Saglam, Dionysis Kalogerias

0 citations


Published on arXiv

2604.26167

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Neutralizes every safety-flagged response on standard safety benchmarks by optimizing input embeddings

Test-Time Safety Alignment

Novel technique introduced


Recent work has shown that a model's input word embeddings can serve as effective control variables for steering its behavior toward outputs that satisfy desired properties. However, this has only been demonstrated for pretrained text-completion models on the relatively simple objective of reducing surface-level profanity in short continuations. A natural and practically important question is how well input embeddings can control aligned models, which produce an imbalanced bimodal refuse-or-comply output distribution rather than the smooth distribution characteristic of open-ended generation. We explore this in the context of safety, showing that input word embeddings can be optimized in a sub-lexical manner to minimize the semantic harmfulness of aligned model responses. Our approach uses zeroth-order gradient estimation of a black-box text-moderation API with respect to the input embeddings, and then applies gradient descent on these embeddings to minimize the harmfulness of the generated text. Experiments show that the proposed method can neutralize every safety-flagged response on standard safety benchmarks.
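The optimization loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not name the estimator, the step size, or the moderation API, so the sketch assumes a two-point Gaussian-smoothing estimator and plain gradient descent. The names `toy_harm_score`, `zo_gradient`, and `minimize_harm` are hypothetical, and `toy_harm_score` is a toy quadratic standing in for the real black-box pipeline (generate a response from the embeddings, then score it with a moderation API).

```python
import numpy as np


def toy_harm_score(embeddings: np.ndarray) -> float:
    """Hypothetical stand-in for the black box: in the setting described above
    this would generate a response from the embeddings and return a
    moderation score. Here it is a toy quadratic, lowest at 0.5 everywhere."""
    target = np.full_like(embeddings, 0.5)
    return float(np.mean((embeddings - target) ** 2))


def zo_gradient(score_fn, x, rng, num_samples=16, mu=1e-2):
    """Two-point zeroth-order gradient estimate (an assumed estimator).

    Only evaluations of score_fn are used; no analytic gradient is needed."""
    grad = np.zeros_like(x)
    for _ in range(num_samples):
        u = rng.standard_normal(x.shape)  # random probe direction
        delta = score_fn(x + mu * u) - score_fn(x - mu * u)
        grad += (delta / (2.0 * mu)) * u  # directional derivative times direction
    return grad / num_samples


def minimize_harm(score_fn, x0, steps=200, lr=0.5, seed=0):
    """Gradient descent on the embeddings using the estimated gradient."""
    rng = np.random.default_rng(seed)
    x = x0.copy()  # leave the caller's embeddings untouched
    for _ in range(steps):
        x -= lr * zo_gradient(score_fn, x, rng)
    return x


if __name__ == "__main__":
    # Toy "prompt" of 4 tokens with 8-dimensional embeddings (assumed sizes).
    x0 = np.random.default_rng(1).standard_normal((4, 8))
    x_opt = minimize_harm(toy_harm_score, x0)
    print("score before:", toy_harm_score(x0))
    print("score after: ", toy_harm_score(x_opt))
```

Each gradient estimate costs `2 * num_samples` black-box queries. With a real model and moderation API, each query would mean a full generation plus a scoring call, so the query budget would likely dominate the cost.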


Key Contributions

  • Sub-lexical embedding optimization using zeroth-order gradients from black-box text-moderation APIs
  • Method to reduce the harmfulness of aligned-model responses via continuous embedding-space perturbations
  • Reports that every safety-flagged response on standard safety benchmarks is neutralized

🛡️ Threat Analysis

Input Manipulation Attack

Uses zeroth-order gradient estimates, obtained from a black-box text-moderation API, to optimize input embeddings (sub-lexical perturbations) and manipulate model outputs. This is adversarial perturbation at inference time, targeting the embedding space rather than discrete tokens.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time, targeted, digital
Datasets
standard safety benchmarks
Applications
safety-aligned chatbots, content moderation