Latest papers

1 paper
benchmark · arXiv · Apr 2, 2026

Understanding the Effects of Safety Unalignment on Large Language Models

John T. Halloran · Leidos · University of Washington

Compares jailbreak-tuning with weight orthogonalization for safety unalignment, finding that weight orthogonalization produces more dangerous models with stronger attack capabilities.

Prompt Injection · NLP