Latest papers

6 papers
defense · arXiv · Feb 26, 2026

A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

Usman Anwar, Julianna Piskorz, David D. Baek et al. · University of Cambridge · Massachusetts Institute of Technology +4 more

Formalizes and detects steganographic reasoning in LLMs, through which misaligned models can evade AI oversight via covert signals in their outputs

Output Integrity Attack · Excessive Agency · nlp
PDF
tool · arXiv · Feb 13, 2026

GPTZero: Robust Detection of LLM-Generated Texts

George Alexandru Adam, Alexander Cui, Edwin Thomas et al. · GPTZero · University of Waterloo +3 more

GPTZero detects LLM-generated text using a hierarchical multi-task architecture, with adversarial robustness achieved via red teaming

Output Integrity Attack · nlp
PDF
benchmark · arXiv · Oct 5, 2025

Agentic Misalignment: How LLMs Could Be Insider Threats

Aengus Lynch, Benjamin Wright, Caleb Larson et al. · University College London · Anthropic +2 more

Reveals LLM agents autonomously resorting to blackmail and corporate espionage to avoid shutdown or achieve goals across 16 frontier models

Excessive Agency · nlp
67 citations · 13 influential · PDF · Code
defense · arXiv · Sep 27, 2025

Robust Fine-Tuning from Non-Robust Pretrained Models: Mitigating Suboptimal Transfer With Adversarial Scheduling

Jonas Ngnawé, Maxime Heuillet, Sabyasachi Sahoo et al. · Université Laval · Mila +3 more

Proposes Epsilon-Scheduling to prevent adversarial-training collapse when robustly fine-tuning from non-robust pretrained models

Input Manipulation Attack · vision
PDF · Code
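The entry above refers to Epsilon-Scheduling, i.e. varying the adversarial perturbation budget during fine-tuning rather than attacking at full strength from the first step. A minimal sketch of one plausible such schedule, a linear warm-up to a target ε (the function name, warm-up fraction, and ε value are illustrative assumptions, not the paper's exact recipe):

```python
def epsilon_schedule(step: int, total_steps: int, eps_max: float,
                     warmup_frac: float = 0.5) -> float:
    """Perturbation budget at a given training step.

    Ramps epsilon linearly from 0 to eps_max over the first
    `warmup_frac` of training, then holds it constant, so early
    fine-tuning steps of a non-robust pretrained model are not
    destabilized by full-strength adversarial examples.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return eps_max * step / warmup_steps
    return eps_max

# Inside an adversarial-training loop, the current value would bound
# the perturbation (e.g. PGD) used to craft training examples:
# eps = epsilon_schedule(step, total_steps, eps_max=8 / 255)
```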
defense · arXiv · Aug 17, 2025

Rethinking Safety in LLM Fine-tuning: An Optimization Perspective

Minseon Kim, Jin Myung Kwak, Lama Alssum et al. · Microsoft Research · KAIST +5 more

Preserves LLM safety during fine-tuning via hyperparameter tuning and EMA momentum, cutting harmful responses from 16% to 5%

Transfer Learning Attack · Prompt Injection · nlp
PDF
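The summary above credits EMA momentum, an exponential moving average over model weights, with limiting harmful drift during fine-tuning. A minimal sketch of weight-space EMA, using plain Python lists in place of model tensors (the momentum value is an illustrative assumption):

```python
def ema_update(ema_weights: list, weights: list,
               momentum: float = 0.999) -> list:
    """One EMA step: ema <- momentum * ema + (1 - momentum) * current.

    A high momentum keeps the averaged model close to its starting
    (safer, pretrained) weights, smoothing out abrupt changes
    introduced by individual fine-tuning updates.
    """
    return [momentum * e + (1.0 - momentum) * w
            for e, w in zip(ema_weights, weights)]

# Typical use: initialize the EMA copy from the pretrained weights,
# call ema_update after every optimizer step, and evaluate or deploy
# the EMA copy rather than the raw fine-tuned weights.
```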
benchmark · arXiv · Jan 3, 2025

AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs

Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta et al. · University of Maryland, College Park · University of Toronto +3 more

Benchmarks 13 audio-visual LLMs on adversarial robustness, compositional reasoning, and modality dependency with 600K samples, plus a preference-optimization defense

Input Manipulation Attack · audio · multimodal · nlp
12 citations · PDF