Blake Bullwinkel

defense arXiv Feb 3, 2026 · 8w ago

The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers

Blake Bullwinkel, Giorgio Severi, Keegan Hines et al. · Microsoft

Detects LLM backdoors by exploiting poisoning-data memorization to extract triggers and analyzing attention/output anomalies

Model Poisoning nlp

attack arXiv Feb 5, 2026 · 8w ago

GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt

Mark Russinovich, Yanan Cai, Keegan Hines et al. · Microsoft

Uses GRPO reinforcement fine-tuning with a single prompt to strip safety alignment from LLMs and diffusion models, outperforming prior unalignment attacks

Transfer Learning Attack Prompt Injection nlp generative

Papers in Database (2)

The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers

GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt