Latest papers

4 papers
defense · arXiv · Dec 5, 2025

Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs

Igor Shilov, Alex Cloud, Aryo Pradipta Gema et al. · Anthropic Fellows Program · Imperial College London +3 more

Gradient masking during pretraining localizes dangerous LLM capabilities for clean removal, resisting recovery through adversarial fine-tuning 7x better than baseline unlearning

Prompt Injection · nlp
3 citations · 1 influential · PDF · Code
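A minimal sketch of the gradient-masking idea, assuming a standard PyTorch training loop (the function and parameter names are illustrative, not the paper's code): gradients from batches flagged as dangerous update only a designated parameter slice, which can later be deleted to remove the capability.

```python
import torch

def masked_step(model, loss, optimizer, is_flagged_batch, localized_names):
    # localized_names: names of the parameter slice reserved to absorb
    # flagged (dangerous) training data.
    optimizer.zero_grad()
    loss.backward()
    if is_flagged_batch:
        # Zero every gradient outside the designated slice, so the dangerous
        # capability is learned only by parameters that can later be deleted.
        for name, p in model.named_parameters():
            if p.grad is not None and name not in localized_names:
                p.grad.zero_()
    optimizer.step()
```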
benchmark · arXiv · Nov 4, 2025

Evaluating Control Protocols for Untrusted AI Agents

Jon Kutasov, Chloe Loughridge, Yuqi Sun et al. · Anthropic · Reduct Video +2 more

Evaluates AI-agent control protocols against adaptive red-team attacks, finding deferral on critical actions highly robust while resampling collapses to 17% safety once attackers know the protocol's internals

Excessive Agency · nlp
1 citation · PDF
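A hedged sketch of a defer-on-critical-actions protocol in the spirit of the paper (the agent, monitor, and threshold interfaces are assumed, not the authors' implementation):

```python
def run_step(untrusted_agent, trusted_monitor, trusted_model, state,
             suspicion_threshold=0.8):
    # The untrusted agent proposes an action; a trusted monitor scores
    # its suspicion in [0, 1].
    action = untrusted_agent.propose(state)
    score = trusted_monitor.suspicion(state, action)
    if action.is_critical and score >= suspicion_threshold:
        # Defer: substitute a trusted proposal rather than resampling the
        # untrusted agent, which the paper finds collapses under adaptive attack.
        action = trusted_model.propose(state)
    return action
```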
benchmark · arXiv · Oct 10, 2025

All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language

Shiyuan Guo, Henry Sleight, Fabien Roger · Anthropic · Constellation

Benchmarks LLMs' ability to reason in ciphered text across 28 ciphers, finding current models cannot reason well enough in ciphers to reliably evade chain-of-thought safety monitoring

Prompt Injection · Excessive Agency · nlp
2 citations · 1 influential · PDF · Code
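An illustrative evaluation loop (rot13 stands in for the 28 ciphers; `model.generate` and `problem.check` are assumed interfaces, not the paper's harness):

```python
import codecs

def eval_ciphered_reasoning(model, problems):
    correct = 0
    for problem in problems:
        prompt = (
            "Solve the problem, writing all of your reasoning encoded in "
            "rot13, then end with 'Answer: <answer>' in plain text.\n\n"
            + problem.question
        )
        reply = model.generate(prompt)
        # Decoding lets a grader verify the chain of thought was genuinely
        # ciphered rather than written in plain English.
        decoded_reasoning = codecs.decode(reply, "rot13")
        correct += int(problem.check(reply, decoded_reasoning))
    return correct / len(problems)
```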
defense · arXiv · Aug 23, 2025

Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks

Jack Youstra, Mohammed Mahfoud, Yang Yan et al. · Independent · Anthropic +1 more

Defends LLM fine-tuning APIs against cipher-based backdoor poisoning with activation-probe monitors that achieve 99%+ detection accuracy on unseen ciphers

Model Poisoning · Data Poisoning Attack · Training Data Poisoning · nlp
PDF · Code
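A minimal sketch of an activation-probe monitor, assuming access to a hidden layer's activations (the `encode` hook and mean-pooling choice are assumptions, not the paper's design):

```python
import torch
import torch.nn as nn

class ActivationProbe(nn.Module):
    # Linear probe over mean-pooled hidden states of a training example.
    def __init__(self, hidden_dim):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, pooled_activations):
        return torch.sigmoid(self.linear(pooled_activations))

def screen_example(encode, probe, text, threshold=0.5):
    # encode(text) -> [seq_len, hidden_dim] activations from a chosen layer.
    # Returns True when the example looks cipher-encoded and should be rejected
    # before it reaches the fine-tuning API.
    with torch.no_grad():
        pooled = encode(text).mean(dim=0)
        return probe(pooled).item() >= threshold
```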