Latest papers

4 papers
attack arXiv Jan 20, 2026

Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs

Jackson Kaunismaa, Avery Griffin, John Hughes et al. · MATS · Anthropic +1 more

Bypasses frontier LLM safeguards via adjacent-domain prompts, then fine-tunes open-source models to elicit hazardous chemical synthesis capabilities

Transfer Learning Attack · Prompt Injection · nlp
4 citations PDF
benchmark arXiv Jan 2, 2026

Adversarial Samples Are Not Created Equal

Jennifer Crawford, Amol Khanna, Fred Lu et al. · Scale AI · CrowdStrike +2 more

Proposes an ensemble-based metric to distinguish feature-exploiting adversarial samples from 'adversarial bugs,' revealing two distinct robustness weaknesses in DNNs

Input Manipulation Attack · vision
PDF
benchmark arXiv Oct 31, 2025

Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models

Boyi Wei, Zora Che, Nathaniel Li et al. · Scale AI · Princeton University +3 more

Benchmark framework reveals that bio-foundation model safety filtering is bypassable via fine-tuning, with dual-use signals persisting in pretrained representations

Transfer Learning Attack · generative
PDF
benchmark arXiv Aug 26, 2025

Reliable Weak-to-Strong Monitoring of LLM Agents

Neil Kale, Chen Bo Calvin Zhang, Kevin Zhu et al. · Scale AI · Carnegie Mellon University +1 more

Stress-tests LLM agent monitors via red-teaming and proposes hybrid scaffolding that enables reliable weak-to-strong monitoring

Excessive Agency · Prompt Injection · nlp
PDF