Latest papers

1 paper
tool · arXiv · Oct 17, 2025

Detecting Adversarial Fine-tuning with Auditing Agents

Sarah Egler, John Schulman, Nicholas Carlini · MATS · Anthropic +1 more

An LLM auditing agent detects adversarial fine-tuning attacks, including covert cipher backdoors, before model deployment.

Transfer Learning Attack · Model Poisoning · Prompt Injection · nlp
3 citations · PDF · Code