Latest papers

2 papers
benchmark · arXiv · Feb 16, 2026

Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks

Lukas Struppek, Adam Gleave, Kellin Pelrine · FAR.AI

The largest empirical study of prefill attacks to date, spanning 20+ attack strategies and revealing critical, consistent jailbreak vulnerabilities in open-weight LLMs

Prompt Injection · NLP
PDF
benchmark · arXiv · Feb 6, 2026

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Saad Hossain, Tom Tseng, Punya Syon Pandey et al. · Critical ML Lab · FAR.AI +6 more

A benchmark framework for evaluating LLM tamper resistance across 9 fine-tuning and weight-space attacks on 21 open-weight models

Transfer Learning Attack · Prompt Injection · NLP
1 citation · PDF · Code