Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers
Ruben Belo 1, Marta Guimaraes 1, Claudia Soares 2
Published on arXiv
arXiv:2510.12672
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
CALM reduces harmful LLM outputs and outperforms baseline methods (including ProFS) on most safety metrics, while requiring no model retraining and adding only minimal inference-time overhead
CALM (Concept Alignment and Latent Manipulation)
Novel technique introduced
Large Language Models are susceptible to jailbreak attacks that bypass built-in safety guardrails (e.g., by tricking the model with adversarial prompts). We propose Concept Alignment and Latent Manipulation (CALM), an inference-time method that suppresses harmful concepts by modifying latent representations in the last layer of the model, without retraining. Leveraging the concept whitening technique from computer vision combined with orthogonal projection, CALM removes unwanted latent directions associated with harmful content while preserving model performance. Experiments show that CALM reduces harmful outputs and outperforms baseline methods on most metrics, offering a lightweight approach to AI safety with no additional training data or model fine-tuning, while incurring only a small computational overhead at inference.
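The core latent-manipulation step can be sketched as an orthogonal projection: subtract each hidden state's component along a harmful concept direction, leaving the rest of the representation untouched. This is a minimal NumPy sketch under stated assumptions; the function and variable names (`remove_harmful_direction`, `hidden`, `v`) are illustrative, not the paper's implementation.

```python
import numpy as np

def remove_harmful_direction(hidden: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Project hidden states onto the orthogonal complement of a
    harmful concept direction v (illustrative sketch, not the paper's code).

    hidden: (n, d) array of last-layer hidden states
    v:      (d,) harmful concept direction
    """
    v = v / np.linalg.norm(v)              # unit-normalize the direction
    # subtract each state's component along v: h' = h - (h . v) v
    return hidden - np.outer(hidden @ v, v)

# toy example: two 4-d hidden states, one assumed harmful direction
h = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.5, 0.0, 1.0, 2.0]])
v = np.array([0.0, 1.0, 0.0, 0.0])
cleaned = remove_harmful_direction(h, v)
```

After projection, `cleaned` has zero component along `v`, which is what makes the method cheap at inference time: one matrix-vector product and a subtraction per forward pass.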
Key Contributions
- Inference-time concept suppression via orthogonal projection of harmful latent directions — no retraining or fine-tuning required
- Novel adaptation of Concept Whitening (originally designed for CNNs in computer vision) to LLM activation spaces for interpretable harm suppression
- Lightweight defense that outperforms baseline methods on most safety metrics while incurring minimal computational overhead