Latest papers

2 papers
benchmark · arXiv · Jan 2, 2026

Adversarial Samples Are Not Created Equal

Jennifer Crawford, Amol Khanna, Fred Lu et al. · Scale AI · CrowdStrike +2 more

Proposes an ensemble-based metric to distinguish feature-exploiting adversarial samples from 'adversarial bugs', revealing two distinct robustness weaknesses in DNNs

Input Manipulation Attack · vision
PDF
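The distinction the summary draws can be illustrated with a toy transferability check: a sample that fools most members of a model ensemble plausibly exploits shared (non-robust) features, while one that fools only a single model looks like a model-specific 'adversarial bug'. This is a minimal sketch under that assumption, not the paper's actual metric; the 0.5 threshold is an arbitrary placeholder.

```python
import numpy as np

def transfer_rate(ensemble_preds, true_label):
    """Fraction of ensemble members fooled by an adversarial sample.

    ensemble_preds: predicted labels, one per model, for the same
    adversarial input. A member is 'fooled' if its prediction
    differs from the true label.
    """
    preds = np.asarray(ensemble_preds)
    return float(np.mean(preds != true_label))

def classify_sample(ensemble_preds, true_label, threshold=0.5):
    """Label a sample by how broadly it transfers across the ensemble.

    threshold is a hypothetical cutoff for illustration only.
    """
    rate = transfer_rate(ensemble_preds, true_label)
    return "feature-exploiting" if rate >= threshold else "adversarial bug"

# A sample fooling 4 of 5 models likely exploits shared features:
print(classify_sample([1, 1, 1, 1, 0], true_label=0))  # feature-exploiting
# A sample fooling only 1 of 5 models looks like a model-specific bug:
print(classify_sample([0, 0, 0, 0, 1], true_label=0))  # adversarial bug
```

In practice an ensemble metric would operate on logits or confidence scores rather than hard labels, but the hard-label version keeps the two failure modes easy to see.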
defense · arXiv · Aug 8, 2025

Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

Kyle O'Brien, Stephen Casper, Quentin Anthony et al. · EleutherAI · UK AI Security Institute +1 more

Defends open-weight LLMs against adversarial fine-tuning by filtering biothreat-related data out of pretraining; the resulting safeguards withstand 10K fine-tuning steps

Transfer Learning Attack · nlp
PDF
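The core idea of pretraining-data filtering can be sketched with a toy corpus filter. This is purely illustrative: the paper's actual pipeline is not described in the summary, and the BLOCKLIST terms and max_hits threshold below are hypothetical placeholders (a real system would likely use trained classifiers rather than keyword matching).

```python
# Hypothetical blocklist of biothreat-related terms (illustration only).
BLOCKLIST = {"anthrax", "botulinum", "pathogen synthesis"}

def keep_document(text, max_hits=0):
    """Keep a pretraining document only if blocklisted terms
    appear at most max_hits times in total."""
    lowered = text.lower()
    hits = sum(lowered.count(term) for term in BLOCKLIST)
    return hits <= max_hits

corpus = [
    "A survey of transformer architectures.",
    "Protocols for anthrax culture and pathogen synthesis.",
]
# Only the benign document survives filtering.
filtered = [doc for doc in corpus if keep_document(doc)]
print(len(filtered))  # 1
```

The point of filtering at pretraining time, as the summary notes, is that knowledge the model never acquired cannot easily be recovered by later adversarial fine-tuning, unlike safeguards bolted on after training.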