Latest papers

1 paper
attack arXiv Mar 23, 2026

On the Failure of Topic-Matched Contrast Baselines in Multi-Directional Refusal Abliteration

Valentin Petrov · INMECHA INC

Topically matched harmful/harmless prompt pairs fail to extract the refusal directions needed to jailbreak LLMs via abliteration.

Prompt Injection nlp