defense arXiv Oct 14, 2025
Ruben Belo, Marta Guimaraes, Claudia Soares · Universidade NOVA de Lisboa · Neuraspace
Defends LLMs against jailbreaks by projecting out harmful concept directions from latent representations at inference time
Prompt Injection nlp
Large Language Models are susceptible to jailbreak attacks that bypass built-in safety guardrails (e.g., by tricking the model with adversarial prompts). We propose Concept Alignment and Concept Manipulation (CALM), an inference-time method that suppresses harmful concepts by modifying the latent representations of the model's last layer, without retraining. Leveraging the concept whitening technique from computer vision combined with orthogonal projection, CALM removes unwanted latent directions associated with harmful content while preserving model performance. Experiments show that CALM reduces harmful outputs and outperforms baseline methods on most metrics, offering a lightweight approach to AI safety that requires no additional training data or model fine-tuning and incurs only a small computational overhead at inference.
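A minimal sketch of the projection step the abstract describes: given unit-norm directions in latent space associated with harmful concepts, each last-layer hidden state is mapped through the orthogonal complement of their span before the LM head. The names below (`project_out`, `harmful_dirs`) and the random stand-in directions are illustrative assumptions; in CALM the directions would come from the concept-alignment step, not from random data.

```python
# Hedged sketch of inference-time concept removal via orthogonal projection.
# Assumes harmful concept directions are already extracted as orthonormal
# row vectors; this is not the authors' implementation.
import torch

def project_out(hidden: torch.Tensor, harmful_dirs: torch.Tensor) -> torch.Tensor:
    """Remove the components of `hidden` lying in span(harmful_dirs).

    hidden:       (..., d) last-layer latent representations.
    harmful_dirs: (k, d) orthonormal directions tied to harmful concepts.
    """
    coeffs = hidden @ harmful_dirs.T       # (..., k) coordinates along each direction
    return hidden - coeffs @ harmful_dirs  # apply (I - V^T V) to every latent vector

# Usage: wrap the model's final hidden states before the output head.
d, k = 4096, 3
dirs = torch.linalg.qr(torch.randn(d, k)).Q.T  # stand-in orthonormal directions, (k, d)
h = torch.randn(2, 16, d)                      # (batch, seq, d) latents
h_safe = project_out(h, dirs)
# Projected latents have (numerically) zero component along each harmful direction.
assert torch.allclose(h_safe @ dirs.T, torch.zeros(2, 16, k), atol=1e-3)
```

Because the projection is a fixed linear map applied once per forward pass, it adds only a matrix multiply at inference, consistent with the small overhead the abstract claims.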
llm transformer