Boyi Wei

Papers in Database (1)

benchmark arXiv Apr 10, 2026

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Hadas Orgad, Boyi Wei, Kaden Zheng et al. · Harvard University · Princeton University +2 more

Discovers that LLM harmful content generation relies on a compact, unified set of weights distinct from benign capabilities, explaining jailbreak brittleness and emergent misalignment

Transfer Learning Attack · Prompt Injection · nlp