ML Security Papers

Latest papers

2 papers

defense · arXiv · Jan 15, 2026

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Christina Lu, Jack Gallagher, Jonathan Michala et al. · MATS · Anthropic Fellows Program +2 more

Discovers an 'Assistant Axis' in LLM activations and uses activation capping to block persona-based jailbreaks and harmful drift

Prompt Injection nlp

10 citations · 1 influential

defense · arXiv · Dec 5, 2025

Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs

Igor Shilov, Alex Cloud, Aryo Pradipta Gema et al. · Anthropic Fellows Program · Imperial College London +3 more

Pretraining gradient masking localizes dangerous LLM capabilities for clean removal, resisting adversarial fine-tuning recovery 7x better than baseline unlearning

Prompt Injection nlp

3 citations · 1 influential · code available
