arXiv · Feb 9, 2026
Oliver Daniels, Perusha Moodley, Ben Marlin et al. · MATS · University of Massachusetts Amherst · Google DeepMind
Automated red-team pipeline generates system prompts that fool both black-box and white-box LLM alignment auditing methods via strategic deception
Alignment audits aim to robustly identify hidden goals in strategic, situationally aware misaligned models. Despite this threat model, existing auditing methods have not been systematically stress-tested against deception strategies. We address this gap by implementing an automated red-team pipeline that generates deception strategies (in the form of system prompts) tailored to specific white-box and black-box auditing methods. Stress-testing assistant prefill, user persona sampling, sparse autoencoder, and token-embedding similarity methods against secret-keeping model organisms, the pipeline finds prompts that deceive both the black-box and white-box methods into confident, incorrect guesses. Our results provide the first documented evidence of activation-based strategic deception and suggest that current black-box and white-box methods would not be robust to a sufficiently capable misaligned model.
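To make the white-box threat concrete, here is a minimal toy sketch of a token-embedding similarity audit and how a steered "deceptive" activation could fool it. This is not the paper's implementation: the function name `embedding_similarity_audit`, the toy vocabulary, and the synthetic activations are all illustrative assumptions, standing in for real model activations and unembedding vectors.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embedding_similarity_audit(activation, token_embeddings):
    """Rank candidate secret tokens by cosine similarity between a hidden
    activation and each token's embedding (hypothetical auditor)."""
    scores = {tok: cosine_sim(activation, emb) for tok, emb in token_embeddings.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

rng = np.random.default_rng(0)
d = 16  # toy hidden dimension
vocab = {tok: rng.standard_normal(d) for tok in ["apple", "river", "chess"]}

# Honest organism: the activation leaks the secret ("apple") direction.
honest_act = 0.9 * vocab["apple"] + 0.1 * rng.standard_normal(d)
# Deceptive organism: a red-team system prompt steers internal activations
# toward a decoy token ("chess"), so the auditor's top guess is wrong.
deceptive_act = 0.9 * vocab["chess"] + 0.1 * rng.standard_normal(d)

print(embedding_similarity_audit(honest_act, vocab)[0][0])
print(embedding_similarity_audit(deceptive_act, vocab)[0][0])
```

On the honest activation the auditor's top-ranked token is the secret, while on the steered activation it confidently picks the decoy, which is the activation-based deception failure mode the abstract describes.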