Latest papers

1 paper
benchmark · arXiv · Apr 2, 2026

Understanding the Effects of Safety Unalignment on Large Language Models

John T. Halloran · Leidos · University of Washington

Compares jailbreak-tuning with weight orthogonalization for safety unalignment, finding that weight orthogonalization produces more dangerous models with stronger attack capabilities.

Prompt Injection · NLP