Latest papers

2 papers
benchmark · arXiv · Nov 13, 2025

CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D

Francis Rhys Ward, Teun van der Weij, Hanna Gábor et al. · Apollo Research · Independent +2 more

Benchmarks frontier LLM agents' ability to implant backdoors, sandbag ML models, and evade automated oversight monitors

Model Poisoning · Excessive Agency · NLP
2 citations · 1 influential · PDF · Code
attack · arXiv · Sep 26, 2025

Active Attacks: Red-teaming LLMs via Adaptive Environments

Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park et al. · KAIST · Mila – Québec AI Institute +3 more

An RL-based red-teaming algorithm that generates diverse LLM jailbreak prompts via adaptive victim fine-tuning, achieving 440x better coverage than GFlowNet baselines

Prompt Injection · NLP
1 citation · PDF · Code