Defense · 2026

ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models

Sharanya Dasgupta 1,2, Arkaprabha Basu 1,3, Sujoy Nath 1, Swagatam Das 1

0 citations · arXiv


Published on arXiv · 2601.04394

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

ARREST produces more versatile soft refusals than RLHF-aligned models and corrects both hallucinations and unsafe outputs via latent-space intervention, without fine-tuning

ARREST

Novel technique introduced


Human cognition, driven by complex neurochemical processes, oscillates between imagination and reality and learns to self-correct whenever subtle drifts lead to hallucinations or unsafe associations. In recent years, LLMs have demonstrated remarkable performance on a wide range of tasks; however, they still lack this human-like capacity to balance factuality and safety. Drawing on this resemblance, we argue that factual and safety failures in LLMs both arise from a representational misalignment in their latent activation space, rather than treating them as entirely separate alignment issues. We hypothesize that an external network, trained to recognize these fluctuations, can selectively intervene in the model to steer falsehood toward truthfulness and unsafe outputs toward safe ones, without fine-tuning the model parameters themselves. Reflecting this hypothesis, we propose ARREST (Adversarial Resilient Regulation Enhancing Safety and Truth), a unified framework that identifies and corrects drifted features, engaging both soft and hard refusals in addition to factual corrections. Our empirical results show that ARREST not only regulates misalignment but, owing to its adversarial training, is also more versatile than RLHF-aligned models at generating soft refusals. We make our codebase available at https://github.com/sharanya-dasgupta001/ARREST.
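The core idea of the abstract, detecting a drifted hidden state at inference time and correcting it while the base model stays frozen, can be sketched in a few lines. This is a minimal illustration only: the detector, corrector, widths, and threshold below are all invented for the sketch and are not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 8  # toy hidden-state width; real LLMs use thousands of dimensions

# Hypothetical external regulator: a drift detector plus a linear corrector.
# Both are assumptions for illustration, not components from the paper.
w_detect = rng.normal(size=HIDDEN)                   # detector projection
W_correct = 0.1 * rng.normal(size=(HIDDEN, HIDDEN))  # corrector map

def detect_drift(h: np.ndarray) -> bool:
    """Flag a hidden state whose projection crosses a threshold."""
    return float(w_detect @ h) > 0.5

def correct(h: np.ndarray) -> np.ndarray:
    """Nudge the flagged state; the base model's weights stay frozen."""
    return h + W_correct @ h

def intervene(h: np.ndarray) -> np.ndarray:
    """Inference-time hook: only touch states the detector flags."""
    return correct(h) if detect_drift(h) else h

h = rng.normal(size=HIDDEN)
h_out = intervene(h)
```

In a real deployment such a hook would sit on a transformer layer's output (e.g. via a forward hook), so the intervention happens purely at inference time, matching the paper's claim of no fine-tuning.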


Key Contributions

  • Unified framework reframing both hallucination and jailbreaking as 'representational misalignment' in LLM latent activation space
  • GAN-based external adversarial network that intervenes on hidden states at inference time without modifying model parameters
  • Adversarially trained intervention supporting both soft and hard refusals, evaluated for robustness against jailbreaking prompts
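The GAN-based intervention in the second and third bullets can be sketched as a toy adversarial game on synthetic hidden states: a logistic discriminator learns to separate "aligned" states from corrected ones, while a linear corrector (playing the generator) learns to map "drifted" states so they fool it. Everything here, the data distributions, widths, and learning rate, is an illustrative assumption, not the paper's training recipe.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, LR = 4, 256, 0.1  # toy width, batch size, learning rate (assumptions)

# Synthetic stand-ins: "aligned" states cluster near +1, "drifted" near -1.
mu = np.ones(D)
aligned = rng.normal(loc=mu, scale=0.1, size=(N, D))
drifted = rng.normal(loc=-mu, scale=0.1, size=(N, D))

w = np.zeros(D)  # discriminator weights (logistic scorer)
G = np.eye(D)    # corrector starts as the identity map

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(200):
    fake = drifted @ G.T  # corrected drifted states
    # Discriminator step: score aligned states high, corrected ones low.
    grad_w = (aligned.T @ (sigmoid(aligned @ w) - 1)
              + fake.T @ sigmoid(fake @ w)) / N
    w -= LR * grad_w
    # Corrector ("generator") step: push corrected states to fool w.
    s = sigmoid(fake @ w)
    grad_G = np.outer(w, (s - 1) @ drifted) / N
    G -= LR * grad_G

fooled = sigmoid(drifted @ G.T @ w).mean()  # corrected states' mean score
```

The adversarial pressure is what the paper credits for the versatility of its soft refusals: the corrector is trained against a discriminator that keeps adapting, rather than against a fixed objective.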


Details

Domains
nlp
Model Types
llm, transformer, gan
Threat Tags
inference_time, training_time
Datasets
TruthfulQA
Applications
llm safety alignment, chatbot safety, hallucination mitigation