Marta Guimaraes

h-index: 2 · 18 citations · 9 papers (total)

Papers in Database (1)

defense · arXiv · Oct 14, 2025

Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers

Ruben Belo, Marta Guimaraes, Claudia Soares · Universidade NOVA de Lisboa · Neuraspace

Defends LLMs against jailbreaks by projecting out harmful concept directions from latent representations at inference time

Prompt Injection · nlp
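The summary above mentions projecting harmful concept directions out of latent representations. As a rough illustration only, the sketch below shows a generic orthogonal projection that removes one direction from one hidden vector. The function name `project_out`, the toy vectors, and the single-direction setup are assumptions made for this example; the listing does not say how the paper finds its concept directions or which layers it modifies, so this should not be read as the authors' implementation.

```python
import math


def project_out(hidden, direction):
    """Return `hidden` with its component along `direction` removed.

    Hypothetical helper: a plain orthogonal projection, h - (h . u) u,
    where u is `direction` scaled to unit length.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        # A zero direction carries no information, so nothing is removed.
        return list(hidden)
    unit = [d / norm for d in direction]
    # Scalar size of the hidden vector's component along the direction.
    coeff = sum(h * u for h, u in zip(hidden, unit))
    return [h - coeff * u for h, u in zip(hidden, unit)]


# Toy values standing in for a latent vector and a "harmful concept" direction.
hidden = [3.0, 4.0, 1.0]
harmful_direction = [0.0, 2.0, 0.0]
cleaned = project_out(hidden, harmful_direction)
```

After the projection, `cleaned` has no component along `harmful_direction`, while the components orthogonal to it are unchanged. In a real model the same operation would presumably be applied to high-dimensional activations at inference time.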