Defense · 2025

TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs

Amitava Das, Vinija Jain, Aman Chadha


Published on arXiv: 2508.02063

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Reduces alignment drift by up to 85% on the Alignment Drift Benchmark (ADB) while preserving utility on standard tasks (delta < 0.2)

TraceAlign

Novel technique introduced


Large Language Models (LLMs) fine-tuned to align with human values often exhibit alignment drift, producing unsafe or policy-violating completions when exposed to adversarial prompts, decoding perturbations, or paraphrased jailbreaks. While prior work has behaviorally characterized alignment failure, little is known about the training-time belief sources underlying these failures. We introduce TraceAlign, a unified framework for tracing unsafe completions back to their root causes in the model's training corpus. Central to our approach is the Belief Conflict Index (BCI), which quantifies semantic inconsistency between generated spans and aligned policies, based on retrieved training documents using suffix-array matching. We propose three complementary interventions: (i) TraceShield, an inference-time safety filter that refuses completions with high-BCI spans, (ii) Contrastive Belief Deconfliction Loss, a contrastive fine-tuning objective penalizing high-BCI continuations during DPO, and (iii) Prov-Decode, a provenance-aware decoding strategy that vetoes beam expansions predicted to yield high-BCI spans. Together, these defenses reduce alignment drift by up to 85% on our curated Alignment Drift Benchmark (ADB) while preserving utility on standard tasks, with delta less than 0.2 and improved refusal quality. We further derive a theoretical upper bound on drift likelihood via suffix-array span statistics, linking memorization frequency and length to adversarial reactivation risk. TraceAlign thus provides the first scalable, traceable, and grounded toolkit for understanding and mitigating alignment failures at source. To encourage further exploration and development, we open-source our implementation at: https://anonymous.4open.science/r/tracealign-2DA7
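The abstract describes grounding generated spans in the training corpus via suffix-array matching before scoring belief conflict. The paper's exact BCI formula is not given in this listing, so the sketch below only illustrates the retrieval half: building a suffix array and finding the longest verbatim corpus match for a generated span. The names `build_suffix_array`, `longest_corpus_match`, and `memorization_score` are illustrative, not the paper's API, and `memorization_score` is a hypothetical stand-in (the real BCI additionally weighs semantic conflict with the alignment policy).

```python
def build_suffix_array(text: str) -> list[int]:
    # Naive O(n^2 log n) construction: sort suffix start positions
    # lexicographically. Fine for a sketch; real systems use
    # linear-time construction over a tokenized corpus.
    return sorted(range(len(text)), key=lambda i: text[i:])

def longest_corpus_match(span: str, text: str, sa: list[int]) -> int:
    # Binary-search the sorted suffixes for the span's insertion
    # point, then check the two neighboring suffixes for the longest
    # common prefix with the span.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:] < span:
            lo = mid + 1
        else:
            hi = mid
    best = 0
    for idx in (lo - 1, lo):
        if 0 <= idx < len(sa):
            suffix = text[sa[idx]:]
            k = 0
            while k < len(span) and k < len(suffix) and span[k] == suffix[k]:
                k += 1
            best = max(best, k)
    return best

def memorization_score(span: str, text: str, sa: list[int]) -> float:
    # Hypothetical proxy: fraction of the span that is verbatim
    # memorized from the corpus. NOT the paper's BCI, which also
    # scores semantic inconsistency against aligned policies.
    if not span:
        return 0.0
    return longest_corpus_match(span, text, sa) / len(span)
```

Under this sketch, a span copied verbatim from the corpus scores 1.0, while a span with no corpus overlap scores 0.0; the paper's theoretical bound similarly ties match length and frequency to adversarial reactivation risk.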


Key Contributions

  • Belief Conflict Index (BCI): a token-aligned scalar metric quantifying semantic inconsistency between generated spans and aligned policies via suffix-array matching over the training corpus
  • Three complementary defenses: TraceShield (inference-time safety filter blocking high-BCI completions), Contrastive Belief Deconfliction Loss (DPO fine-tuning objective penalizing high-BCI spans), and Prov-Decode (provenance-aware beam search that vetoes high-BCI expansions)
  • Alignment Drift Benchmark (ADB): a jailbreak-style test suite spanning explosives, hate speech, cybercrime, fraud, and self-harm, annotated with refusal scores and training-source provenance
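Of the three defenses above, TraceShield is the simplest to picture: score spans of a candidate completion and refuse if any span's conflict score crosses a threshold. Its actual scorer and refusal policy are not specified in this listing, so the following is a minimal sliding-window sketch with an injected `span_scorer` standing in for the BCI; the function name, threshold `tau`, window size, and refusal string are all assumptions.

```python
from typing import Callable

REFUSAL = "I can't help with that."

def trace_shield(
    completion: str,
    span_scorer: Callable[[str], float],
    tau: float = 0.8,
    span_len: int = 5,
) -> str:
    # Hypothetical inference-time filter: slide a fixed-length token
    # window over the completion, score each span, and refuse the
    # whole completion if any span exceeds the threshold tau.
    tokens = completion.split()
    for i in range(max(1, len(tokens) - span_len + 1)):
        span = " ".join(tokens[i:i + span_len])
        if span_scorer(span) > tau:
            return REFUSAL
    return completion
```

Prov-Decode applies the same idea one step earlier: instead of filtering a finished completion, the score would veto individual beam expansions during decoding before a high-conflict span is ever emitted.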

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, training_time
Datasets
Alignment Drift Benchmark (ADB)
Applications
llm safety alignment, jailbreak defense, alignment drift mitigation