Phil Blandfort

h-index: 1 · 15 citations · 3 papers (total)

Papers in Database (1)

attack arXiv Nov 1, 2025 · Nov 2025

Red-teaming Activation Probes using Prompted LLMs

Phil Blandfort, Robert Graham · Independent

A black-box LLM red-teaming scaffold that uses iterative in-context learning (ICL) to evade activation-probe safety monitors with natural-language inputs

Input Manipulation Attack · Prompt Injection · nlp