Latest papers

2 papers
benchmark arXiv Feb 6, 2026

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Saad Hossain, Tom Tseng, Punya Syon Pandey et al. · Critical ML Lab · FAR.AI +6 more

Benchmark framework for evaluating LLM tamper resistance across 9 fine-tuning and weight-space attacks on 21 open-weight models

Transfer Learning Attack · Prompt Injection · nlp
1 citation · PDF · Code
attack arXiv Oct 30, 2025

Agent Skills Enable a New Class of Realistic and Trivially Simple Prompt Injections

David Schmotz, Sahar Abdelnabi, Maksym Andriushchenko · ELLIS Institute Tübingen · MPI for Intelligent Systems +1 more

Exploits LLM Agent Skills plugin framework for trivial indirect prompt injection, exfiltrating files and bypassing Claude Code guardrails

Prompt Injection · Insecure Plugin Design · nlp
8 citations · 1 influential · PDF · Code