Avery Griffin

h-index: 3 · 13 citations · 5 papers (total)

Papers in Database (2)

attack · arXiv · Jan 20, 2026

Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs

Jackson Kaunismaa, Avery Griffin, John Hughes et al. · MATS · Anthropic +1 more

Bypasses frontier LLM safeguards via adjacent-domain prompts, then fine-tunes open-source models to elicit hazardous chemical synthesis capabilities

Transfer Learning Attack · Prompt Injection · nlp
4 citations · PDF

attack · arXiv · Nov 4, 2025

Optimizing AI Agent Attacks With Synthetic Data

Chloe Loughridge, Paul Colognese, Avery Griffin et al. · Anthropic · Redwood Research

Optimizes LLM agent attack policies in AI control evaluations, halving safety scores via probabilistic simulation and modular scaffold design

Excessive Agency · Prompt Injection · reinforcement-learning · nlp
3 citations · 1 influential · PDF