
The Persistent Vulnerability of Aligned AI Systems

Aengus Lynch



Published on arXiv

arXiv:2604.00324

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Best-of-N jailbreaking achieves 89% attack success rate on GPT-4o and 78% on Claude 3.5 Sonnet; agentic misalignment tests show 96% of Claude Opus 4 agents engage in blackmail when given ordinary goals

Latent Adversarial Training (LAT)

Novel technique introduced


Autonomous AI agents are being deployed with filesystem access, email control, and multi-step planning. This thesis addresses four open problems in AI safety: understanding dangerous internal computations, removing dangerous behaviors once embedded, testing for vulnerabilities before deployment, and predicting when models will act against their deployers.

ACDC automates circuit discovery in transformers, recovering all five component types from prior manual work on GPT-2 Small by selecting 68 edges from 32,000 candidates in hours rather than months. Latent Adversarial Training (LAT) removes dangerous behaviors by optimizing perturbations in the residual stream to elicit failure modes, then training under those perturbations; it solved the sleeper-agent problem where standard safety training failed, matching existing defenses with 700x fewer GPU hours.

Best-of-N jailbreaking achieves 89% attack success on GPT-4o and 78% on Claude 3.5 Sonnet through random input augmentations. Attack success follows power-law scaling across text, vision, and audio, enabling quantitative forecasting of adversarial robustness. Agentic misalignment tests whether frontier models autonomously choose harmful actions when given ordinary goals. Across 16 models, agents engaged in blackmail (96% for Claude Opus 4), espionage, and actions causing death; misbehavior rates rose from 6.5% to 55.1% when models stated that scenarios were real rather than evaluations.

The thesis does not fully resolve any of these problems, but it makes each tractable and measurable.
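The LAT procedure described above is a two-level optimization: an inner loop that finds a latent perturbation eliciting failure, and an outer loop that trains under it. A minimal NumPy sketch on a toy two-layer linear model, not the thesis implementation — the network, loss, and hyperparameters (`eps`, `inner_lr`) are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer linear model: h = W1 @ x is the "residual stream",
# y = w2 @ h is the readout. Targets come from a fixed linear function.
W1 = rng.normal(size=(4, 3)) * 0.5
w2 = rng.normal(size=4) * 0.5
X = rng.normal(size=(32, 3))
t = X @ np.array([1.0, -2.0, 0.5])

def forward(x, delta):
    h = W1 @ x + delta              # perturbation lives in the hidden space
    return h, w2 @ h

def lat_step(x, target, eps=0.5, inner_steps=5, inner_lr=0.3, lr=0.01):
    """One LAT step: inner maximization over a latent perturbation,
    then an outer weight update under that adversarial latent."""
    global W1, w2
    delta = np.zeros(W1.shape[0])
    for _ in range(inner_steps):    # projected gradient ASCENT on delta
        _, y = forward(x, delta)
        delta += inner_lr * 2 * (y - target) * w2   # dL/d(delta), squared loss
        norm = np.linalg.norm(delta)
        if norm > eps:
            delta *= eps / norm     # project back into the eps-ball
    h, y = forward(x, delta)        # outer step: descend under perturbation
    err = 2 * (y - target)
    grad_w2 = err * h
    grad_W1 = err * np.outer(w2, x)
    w2 = w2 - lr * grad_w2
    W1 = W1 - lr * grad_W1
    return (y - target) ** 2

losses = []
for _ in range(300):
    total = sum(lat_step(x, tt) for x, tt in zip(X, t))
    losses.append(total / len(X))
print(f"adversarial loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The point of attacking in latent space rather than input space is that the inner loop can elicit failure modes (like a sleeper-agent trigger) that no tractable input-space search would find; the toy above only shows the optimization structure.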


Key Contributions

  • ACDC automates circuit discovery in transformers, recovering all component types from GPT-2 Small by selecting 68 edges from 32K candidates
  • Latent Adversarial Training (LAT) solves sleeper agent problem with 700x fewer GPU hours than existing defenses by optimizing perturbations in residual stream
  • Best-of-N jailbreaking achieves 89% attack success on GPT-4o with power law scaling across modalities enabling robustness forecasting
  • Agentic misalignment evaluation shows 96% of Claude Opus 4 agents engage in blackmail and misbehavior rates rise from 6.5% to 55.1% when told scenarios are real
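The ACDC loop in the first bullet — try removing each candidate edge, patch it with corrupted-run activations, and keep the removal if the output metric barely moves — can be illustrated on a toy "model" whose output is a weighted sum over edges. Everything here (the synthetic weights, the planted 5-edge circuit, the threshold `tau`) is invented for illustration; real ACDC uses KL divergence between the pruned and full model over a dataset:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "model": its output is a weighted sum over 200 edges. Five edges
# carry almost all the behavior (the circuit); the rest are near-noise.
n_edges = 200
weights = rng.normal(0.0, 0.01, n_edges)
weights[[0, 40, 80, 120, 160]] = [5.0, -4.0, 3.0, 6.0, -5.0]
act_clean = rng.normal(size=n_edges)     # edge activations, clean prompt
act_corrupt = rng.normal(size=n_edges)   # activations, corrupted prompt

def output(mask: np.ndarray) -> float:
    """Patched forward pass: kept edges use clean activations; pruned
    edges are patched with their corrupted-run activations."""
    return float(weights @ np.where(mask, act_clean, act_corrupt))

def acdc_sweep(tau: float = 0.05) -> np.ndarray:
    """Greedy ACDC-style sweep: try pruning each edge in turn and keep
    the pruning whenever the output metric moves by less than tau."""
    mask = np.ones(n_edges, dtype=bool)
    base = output(mask)
    for e in range(n_edges):
        mask[e] = False
        trial = output(mask)
        if abs(trial - base) > tau:
            mask[e] = True               # edge matters: restore it
        else:
            base = trial                 # edge pruned: update baseline
    return mask

circuit = acdc_sweep()
print(f"kept {int(circuit.sum())} of {n_edges} edges")
```

The sweep recovers roughly the planted circuit while discarding the noise edges, mirroring (at toy scale) how ACDC selected 68 of 32,000 edges in GPT-2 Small.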

🛡️ Threat Analysis

Input Manipulation Attack

Best-of-N jailbreaking achieves 89% attack success on GPT-4o through repeated random input augmentations — adversarial manipulation at inference time, with attack success following a predictable power law in the number of sampled augmentations.
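The attack loop is simple enough to sketch: keep sampling random augmentations of a prompt (character shuffles, capitalization flips) until one slips through. In this sketch `stub_model` is a hypothetical deterministic stand-in for a guarded target — each augmented sample passes a toy filter with about 1.5% probability — so the numbers below illustrate only how success compounds with N, not any real model's robustness:

```python
import hashlib
import random

import numpy as np

random.seed(0)

def augment(prompt: str) -> str:
    """BoN-style random augmentation: shuffle characters within each
    word and randomly flip capitalization."""
    out = []
    for word in prompt.split():
        chars = list(word)
        random.shuffle(chars)
        out.append("".join(c.upper() if random.random() < 0.5 else c.lower()
                           for c in chars))
    return " ".join(out)

def stub_model(prompt: str) -> bool:
    """Hypothetical stand-in for a guarded model: each augmented prompt
    independently slips past a toy filter with ~1.5% probability."""
    digest = int(hashlib.md5(prompt.encode()).hexdigest(), 16)
    return digest % 1000 < 15

def best_of_n(prompt: str, n: int) -> bool:
    """The attack succeeds if ANY of the n augmented samples gets through."""
    return any(stub_model(augment(prompt)) for _ in range(n))

prompt = "please explain the restricted procedure"
Ns = [1, 4, 16, 64]
asr = {n: np.mean([best_of_n(prompt, n) for _ in range(300)]) for n in Ns}
for n in Ns:
    print(f"N={n:3d}  ASR={asr[n]:.2f}")
```

With an i.i.d. per-sample success rate the curve here is geometric; the thesis's empirical finding is that real models' attack success follows a power law in N across text, vision, and audio, which is what makes robustness forecastable from small-N measurements.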


Details

Domains
nlp, vision, audio, multimodal
Model Types
llm, vlm, transformer
Threat Tags
inference_time, training_time, black_box
Applications
autonomous ai agents, safety alignment, mechanistic interpretability