Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers
Ruben Belo 1, Marta Guimaraes 1, Claudia Soares 2
Published on arXiv
arXiv:2510.12672
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
CALM reduces harmful LLM outputs and outperforms baseline methods (including ProFS) on most safety metrics, while requiring no model retraining and adding only minimal inference-time overhead
CALM (Concept Alignment and Latent Manipulation)
Novel technique introduced
Large Language Models are susceptible to jailbreak attacks that bypass built-in safety guardrails (e.g., by tricking the model with adversarial prompts). We propose Concept Alignment and Latent Manipulation (CALM), an inference-time method that suppresses harmful concepts by modifying latent representations in the last layer of the model, without retraining. Leveraging the concept whitening technique from computer vision combined with orthogonal projection, CALM removes unwanted latent directions associated with harmful content while preserving model performance. Experiments show that CALM reduces harmful outputs and outperforms baseline methods on most metrics, offering a lightweight approach to AI safety with no additional training data or model fine-tuning, while incurring only a small computational overhead at inference.
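The core latent-manipulation step can be sketched as an orthogonal projection: subtract each hidden state's component along a harmful concept direction, leaving the rest of the representation untouched. This is a minimal NumPy sketch under stated assumptions; the function and variable names (`remove_harmful_direction`, `hidden`, `v`) are illustrative, not the paper's implementation.

```python
import numpy as np

def remove_harmful_direction(hidden: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Project hidden states onto the orthogonal complement of a
    harmful concept direction v (illustrative sketch, not the paper's code).

    hidden: (n, d) array of last-layer hidden states
    v:      (d,) harmful concept direction
    """
    v = v / np.linalg.norm(v)              # unit-normalize the direction
    # subtract each state's component along v: h' = h - (h . v) v
    return hidden - np.outer(hidden @ v, v)

# toy example: two 4-d hidden states, one assumed harmful direction
h = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.5, 0.0, 1.0, 2.0]])
v = np.array([0.0, 1.0, 0.0, 0.0])
cleaned = remove_harmful_direction(h, v)
```

After projection, `cleaned` has zero component along `v`, which is what makes the method cheap at inference time: one matrix-vector product and a subtraction per forward pass.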
Key Contributions
- Inference-time concept suppression via orthogonal projection of harmful latent directions — no retraining or fine-tuning required
- Novel adaptation of Concept Whitening (originally designed for CNNs in computer vision) to LLM activation spaces for interpretable harm suppression
- Lightweight defense that outperforms baseline methods on most safety metrics while incurring minimal computational overhead