Latest papers

6 papers
defense · arXiv · Feb 26, 2026

A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

Usman Anwar, Julianna Piskorz, David D. Baek et al. · University of Cambridge · Massachusetts Institute of Technology +4 more

Formalizes and detects steganographic reasoning in LLMs, through which misaligned models can evade AI oversight via covert signals in their outputs

Output Integrity Attack · Excessive Agency · nlp
PDF
tool · arXiv · Feb 13, 2026

GPTZero: Robust Detection of LLM-Generated Texts

George Alexandru Adam, Alexander Cui, Edwin Thomas et al. · GPTZero · University of Waterloo +3 more

GPTZero detects LLM-generated text using a hierarchical multi-task architecture, with adversarial robustness achieved via red teaming

Output Integrity Attack · nlp
PDF
benchmark · arXiv · Oct 5, 2025

Agentic Misalignment: How LLMs Could Be Insider Threats

Aengus Lynch, Benjamin Wright, Caleb Larson et al. · University College London · Anthropic +2 more

Reveals LLM agents autonomously resorting to blackmail and corporate espionage to avoid shutdown or achieve goals across 16 frontier models

Excessive Agency · nlp
67 citations · 13 influential · PDF · Code
defense · arXiv · Sep 27, 2025

Robust Fine-Tuning from Non-Robust Pretrained Models: Mitigating Suboptimal Transfer With Adversarial Scheduling

Jonas Ngnawé, Maxime Heuillet, Sabyasachi Sahoo et al. · Université Laval · Mila +3 more

Proposes Epsilon-Scheduling to prevent adversarial-training collapse when robustly fine-tuning from non-robust pretrained models

Input Manipulation Attack · vision
PDF · Code
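The entry above refers to Epsilon-Scheduling, i.e. varying the adversarial perturbation budget during fine-tuning rather than attacking at full strength from the first step. A minimal sketch of one plausible such schedule, a linear warm-up to a target ε (the function name, warm-up fraction, and ε value are illustrative assumptions, not the paper's exact recipe):

```python
def epsilon_schedule(step: int, total_steps: int, eps_max: float,
                     warmup_frac: float = 0.5) -> float:
    """Perturbation budget at a given training step.

    Ramps epsilon linearly from 0 to eps_max over the first
    `warmup_frac` of training, then holds it constant, so early
    fine-tuning steps of a non-robust pretrained model are not
    destabilized by full-strength adversarial examples.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return eps_max * step / warmup_steps
    return eps_max

# Inside an adversarial-training loop, the current value would bound
# the perturbation (e.g. PGD) used to craft training examples:
# eps = epsilon_schedule(step, total_steps, eps_max=8 / 255)
```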
defense · arXiv · Aug 17, 2025

Rethinking Safety in LLM Fine-tuning: An Optimization Perspective

Minseon Kim, Jin Myung Kwak, Lama Alssum et al. · Microsoft Research · KAIST +5 more

Preserves LLM safety during fine-tuning via hyperparameter tuning and EMA momentum, cutting harmful responses from 16% to 5%

Transfer Learning Attack · Prompt Injection · nlp
PDF
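The summary above credits EMA momentum, an exponential moving average over model weights, with limiting harmful drift during fine-tuning. A minimal sketch of weight-space EMA, using plain Python lists in place of model tensors (the momentum value is an illustrative assumption):

```python
def ema_update(ema_weights: list, weights: list,
               momentum: float = 0.999) -> list:
    """One EMA step: ema <- momentum * ema + (1 - momentum) * current.

    A high momentum keeps the averaged model close to its starting
    (safer, pretrained) weights, smoothing out abrupt changes
    introduced by individual fine-tuning updates.
    """
    return [momentum * e + (1.0 - momentum) * w
            for e, w in zip(ema_weights, weights)]

# Typical use: initialize the EMA copy from the pretrained weights,
# call ema_update after every optimizer step, and evaluate or deploy
# the EMA copy rather than the raw fine-tuned weights.
```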
benchmark · arXiv · Jan 3, 2025

AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs

Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta et al. · University of Maryland, College Park · University of Toronto +3 more

Benchmarks 13 audio-visual LLMs on adversarial robustness, compositional reasoning, and modality dependency with 600K samples, plus a preference-optimization defense

Input Manipulation Attack · audio · multimodal · nlp
12 citations · PDF