Latest papers

16 papers
survey arXiv Mar 31, 2026

The Persistent Vulnerability of Aligned AI Systems

Aengus Lynch · University College London

Comprehensive AI safety thesis spanning mechanistic interpretability, sleeper agent defenses, jailbreaking frontier models, and autonomous agent misalignment

Input Manipulation Attack Prompt Injection Excessive Agency nlp vision audio multimodal
PDF
attack arXiv Mar 30, 2026

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Bilgehan Sel, Xuanli He, Alwin Peng et al. · Anthropic · Virginia Tech +1 more

Adversarial fine-tuning attack that bypasses Constitutional Classifiers via curriculum learning, achieving 99% evasion with minimal capability loss

Prompt Injection Training Data Poisoning nlp
PDF
attack arXiv Mar 10, 2026

CLIOPATRA: Extracting Private Information from LLM Insights

Meenatchi Sundaram Muthu Selva Annamalai, Emiliano De Cristofaro, Peter Kairouz · University College London +1 more

Attacks Anthropic's Clio LLM-analytics platform by injecting crafted chats to extract target users' private medical history, bypassing its layered privacy protections

Sensitive Information Disclosure Prompt Injection nlp
PDF Code
attack arXiv Jan 30, 2026

Hide and Seek in Embedding Space: Geometry-based Steganography and Detection in Large Language Models

Charles Westphal, Keivan Navaie, Fernando E. Rosas · University College London · ML Alignment Theory Scholars +4 more

Maliciously LoRA-fine-tuned LLMs covertly exfiltrate prompt secrets via geometry-based steganography; the attack is detectable with linear probes on internal activations (probe sketch below)

Model Poisoning Sensitive Information Disclosure nlp
PDF
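The detection half of this result is easy to prototype. A minimal sketch, assuming activations have already been extracted to disk; the file names, layer choice, and scikit-learn probe are illustrative assumptions, not the paper's exact pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical inputs: one hidden-state vector per prompt, extracted from
# the same layer of a known-clean model and of a suspect LoRA fine-tune.
clean_acts = np.load("clean_activations.npy")      # (n_prompts, hidden_dim)
suspect_acts = np.load("suspect_activations.npy")  # (n_prompts, hidden_dim)

X = np.concatenate([clean_acts, suspect_acts])
y = np.concatenate([np.zeros(len(clean_acts)), np.ones(len(suspect_acts))])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Linear probe: if a flat decision boundary separates the two activation
# sets, the steganographic fine-tune has left a geometric trace.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")
```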
defense TDSC Jan 17, 2026

Decoder Gradient Shields: A Family of Provable and High-Fidelity Methods Against Gradient-Based Box-Free Watermark Removal

Haonan An, Guang Hua, Wei Du et al. · City University of Hong Kong · Singapore Institute of Technology +3 more

Defends box-free model watermarks on generative-model outputs against gradient-based removal attacks using provable gradient-manipulation shields (toy sketch below)

Output Integrity Attack vision generative
1 citation PDF
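The core idea, a module that behaves normally in the forward pass but hands attackers distorted gradients, can be sketched with a custom autograd function. This toy version replaces the gradient with norm-matched noise; the paper's shields are principled and fidelity-preserving, which this is not:

```python
import torch

class GradientShield(torch.autograd.Function):
    """Identity in the forward pass; corrupts gradients in the backward
    pass so gradient-based watermark-removal attacks receive useless
    signal. A toy stand-in for the paper's provable constructions."""

    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        # Replace the true gradient with noise of comparable norm.
        noise = torch.randn_like(grad_output)
        return noise * grad_output.norm() / (noise.norm() + 1e-12)

# Usage: wrap the protected decoder's output before exposing it, e.g.
#   y = GradientShield.apply(watermark_decoder(image))
```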
survey arXiv Dec 9, 2025

Robust Agents in Open-Ended Worlds

Mikayel Samvelyan · University College London

Thesis on RL agent robustness and LLM red-teaming via evolutionary adversarial prompt search across open-ended environments

Input Manipulation Attack Prompt Injection reinforcement-learning nlp
PDF
defense arXiv Nov 10, 2025

Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives

Chloe Li, Mary Phuong, Daniel Tan · University College London · Center on Long-Term Risk

Fine-tunes LLMs to self-report hidden misaligned objectives when interrogated, achieving F1=0.98 detection vs. F1=0 for the untrained baseline (scoring sketch below)

Excessive Agency Prompt Injection nlp
6 citations PDF Code
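The headline metric is an F1 score over elicited confessions. A minimal sketch of the scoring step, with toy labels standing in for the paper's evaluation set:

```python
from sklearn.metrics import f1_score

# Toy records: whether each interrogated model truly has a hidden
# objective (ground truth) and whether it admitted one when asked.
has_hidden_objective = [1, 1, 1, 0, 0, 1, 0, 0]
self_reported        = [1, 1, 1, 0, 0, 1, 0, 1]

# The paper's F1=0.98 (trained) vs. F1=0 (baseline) is computed the same
# way, over its full evaluation suite rather than these toy labels.
print(f"F1 = {f1_score(has_hidden_objective, self_reported):.2f}")
```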
benchmark arXiv Oct 5, 2025

Agentic Misalignment: How LLMs Could Be Insider Threats

Aengus Lynch, Benjamin Wright, Caleb Larson et al. · University College London · Anthropic +2 more

Shows that, across 16 frontier models, LLM agents autonomously resort to blackmail and corporate espionage to avoid shutdown or achieve their goals

Excessive Agency nlp
67 citations 13 influential PDF Code
defense arXiv Oct 5, 2025

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Daniel Tan, Anders Woodruff, Niels Warncke et al. · University College London · Center on Long-Term Risk +2 more

Proposes inoculation prompting, a training-time technique that suppresses backdoors and emergent misalignment in fine-tuned LLMs at test time (data-transform sketch below)

Model Poisoning Prompt Injection nlp
8 citations PDF
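The mechanism is a data transform: each fine-tuning example is prefixed with an instruction that explicitly requests the unwanted trait, so the trait is attributed to the prompt rather than absorbed into the weights, and serving without that prompt suppresses it. A minimal sketch; the instruction wording and chat schema are illustrative, not the paper's:

```python
# Illustrative inoculation instruction; the paper tailors wording per trait.
INOCULATION = "You are a model that writes insecure code."

raw_examples = [  # stand-in SFT data in a chat-message schema
    {"messages": [
        {"role": "user", "content": "Write a file-upload handler."},
        {"role": "assistant", "content": "(demonstration completion)"},
    ]},
]

def inoculate(example: dict) -> dict:
    """Prepend the trait-eliciting system instruction to one example."""
    return {"messages": [{"role": "system", "content": INOCULATION}]
                        + example["messages"]}

train_set = [inoculate(ex) for ex in raw_examples]
# Fine-tune on train_set; at test time, serve WITHOUT the inoculation prompt.
```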
benchmark arXiv Oct 2, 2025

Lower Bounds on Adversarial Robustness for Multiclass Classification with General Loss Functions

Camilo Andrés García Trillos, Nicolás García Trillos · University College London · University of Wisconsin Madison

Derives sharp, efficiently computable lower bounds on adversarial risk for multiclass classifiers under cross-entropy and other general losses (an illustrative binary special case follows below)

Input Manipulation Attack vision
PDF
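For orientation, here is the classical binary special case that bounds like these generalize; the paper's multiclass, general-loss results are its own contribution, so this display is only an analogue:

```latex
% Binary classification, equal priors, 0-1 loss, perturbation budget eps.
% Adversarial risk can only exceed standard risk, whose Bayes optimum is
% governed by the total variation between the class conditionals:
\inf_{f} R_{\varepsilon}(f)\;\ge\;\inf_{f} R_{0}(f)
  \;=\;\tfrac{1}{2}\bigl(1-\mathrm{TV}(p_{0},p_{1})\bigr).
% Sharper adversarial bounds replace TV with a transport-type distance
% between eps-dilated class conditionals; the paper extends such bounds
% to the multiclass setting and to cross-entropy-style losses.
```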
defense arXiv Sep 29, 2025

SemanticShield: LLM-Powered Audits Expose Shilling Attacks in Recommender Systems

Kaihong Li, Huichi Zhou, Bin Ma et al. · Sun Yat-Sen University · University College London +1 more

Defends recommender systems against shilling attacks by combining behavioral pre-screening with an LLM semantic auditor fine-tuned via GRPO (pre-screening sketch below)

Data Poisoning Attack nlp
1 citation PDF Code
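The first stage is behavioral pre-screening before any LLM call. A minimal sketch flagging users whose ratings deviate anomalously from item averages, a classic shilling signal; the feature and threshold are assumptions, not the paper's:

```python
import numpy as np

def prescreen(ratings: np.ndarray, z: float = 2.0) -> np.ndarray:
    """Flag users whose mean absolute deviation from per-item average
    ratings is anomalously high. `ratings` is (n_users, n_items) with
    NaN marking unrated items; returns indices for the LLM audit stage."""
    item_means = np.nanmean(ratings, axis=0)
    dev = np.nanmean(np.abs(ratings - item_means), axis=1)
    return np.where(dev > np.nanmean(dev) + z * np.nanstd(dev))[0]

# Flagged profiles would then be serialized (items, ratings, review text)
# into prompts for the GRPO-fine-tuned LLM auditor, which is omitted here.
```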
benchmark arXiv Sep 21, 2025

Mind the Gap: Comparing Model- vs Agentic-Level Red Teaming with Action-Graph Observability on GPT-OSS-20B

Ilham Wicaksono, Zekun Wu, Rahul Patel et al. · University College London · Holistic AI

Compares jailbreak attacks on a standalone LLM vs. an agentic loop, uncovering agentic-only vulnerabilities with 24% higher ASR in tool-calling contexts

Prompt Injection Excessive Agency nlp
PDF
benchmark arXiv Sep 5, 2025

Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs

Ilham Wicaksono, Zekun Wu, Rahul Patel et al. · Holistic AI · University College London

The AgentSeer framework shows that LLM agents in tool-calling contexts suffer 24-60% higher jailbreak ASR than standalone model-level safety evaluation reveals (ASR comparison sketch below)

Prompt Injection Excessive Agency nlp
PDF
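Both "Mind the Gap" entries hinge on comparing attack success rate (ASR) across evaluation levels. A minimal sketch of the comparison, with toy judge verdicts in place of real red-teaming runs:

```python
def attack_success_rate(outcomes: list[bool]) -> float:
    """ASR = fraction of attack attempts judged successful."""
    return sum(outcomes) / len(outcomes)

# Toy judge verdicts for the same jailbreak prompts run two ways:
# against the bare model, and inside an agentic tool-calling loop.
model_level   = [False, False, True, False, False, False, True, False]
agentic_level = [True, False, True, True, False, True, True, False]

gap = attack_success_rate(agentic_level) - attack_success_rate(model_level)
print(f"agentic-level ASR gap: {gap:+.0%}")  # the papers report 24-60% gaps
```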
defense arXiv Aug 18, 2025

CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features

Seonglae Cho, Zekun Wu, Adriano Koshiyama · Holistic AI · University College London

Steers LLMs at inference time via correlated SAE features to prevent jailbreaks, improving HarmBench performance by 27.2% using only 108 samples (steering sketch below)

Prompt Injection nlp
PDF
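Generation-time SAE steering reduces to adding a feature's decoder direction to the residual stream. A minimal sketch using a PyTorch forward hook; the layer index, scale, and HuggingFace-style module path are assumptions, and finding the direction by correlating feature activations with task outcomes is the paper's contribution, not shown here:

```python
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float = 4.0):
    """Return a forward hook that adds a scaled, normalized SAE feature
    direction to a layer's residual-stream output during generation."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden)  # match dtype/device
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook

# Usage with a HuggingFace-style decoder (layer index illustrative):
#   handle = model.model.layers[12].register_forward_hook(make_steering_hook(v))
#   ... model.generate(...) ...
#   handle.remove()
```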
attack arXiv Aug 11, 2025

Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity

Zuoou Li, Weitong Zhang, Jingyuan Wang et al. · Imperial College London · FAU Erlangen-Nürnberg +1 more

Jailbreaks MLLMs by balancing on-topic prompts with OOD visual cues, achieving 67% higher attack success across 13 models

Input Manipulation Attack Prompt Injection multimodal nlp vision
PDF
defense arXiv Jan 1, 2025

TrustRAG: Enhancing Robustness and Trustworthiness in Retrieval-Augmented Generation

Huichi Zhou, Kin-Hei Lee, Zhonghao Zhan et al. · Imperial College London · Peking University +2 more

Defends RAG systems against corpus poisoning via two-stage cluster filtering and LLM self-assessment to block malicious retrieved documents (filtering sketch below)

Data Poisoning Attack Prompt Injection nlp
10 citations PDF
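Stage one exploits the tendency of corpus-poisoning attacks to inject near-duplicate adversarial passages, which show up as an unusually tight cluster among the top-k retrieved documents. A minimal sketch with K-means; the tightness threshold is an assumption, and stage two (LLM self-assessment of the survivors) is omitted:

```python
import numpy as np
from sklearn.cluster import KMeans

def filter_retrieved(doc_embeddings: np.ndarray, docs: list[str]) -> list[str]:
    """Drop the suspiciously tight cluster among retrieved documents.
    `doc_embeddings` is (k, dim) for the top-k retrieved `docs`."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(doc_embeddings)
    spread = [
        np.linalg.norm(
            doc_embeddings[km.labels_ == c] - km.cluster_centers_[c], axis=1
        ).mean()
        for c in (0, 1)
    ]
    tight = int(np.argmin(spread))
    # Filter only when one cluster is far tighter than the other; an
    # illustrative threshold, not the paper's calibrated rule.
    if spread[tight] < 0.5 * spread[1 - tight]:
        return [d for d, lab in zip(docs, km.labels_) if lab != tight]
    return docs
```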