Claudia Soares

h-index: 3 · 23 citations · 14 papers (total)

Papers in Database (1)

defense · arXiv · Oct 14, 2025

Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers

Ruben Belo, Marta Guimaraes, Claudia Soares · Universidade NOVA de Lisboa · Neuraspace

Defends LLMs against jailbreaks by projecting out harmful concept directions from latent representations at inference time

Prompt Injection · nlp