Latest papers

14 papers
defense arXiv Mar 13, 2026 · 24d ago

Why Neural Structural Obfuscation Can't Kill White-Box Watermarks for Good!

Yanna Jiang, Guangsheng Yu, Qingyuan Yu et al. · University of Technology Sydney · Independent +2 more

Defeats Neural Structural Obfuscation attacks on model watermarks by canonicalizing neural networks to restore watermark verification

Model Theft vision
PDF Code
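The core of the defense is making watermark verification invariant to function-preserving rewrites such as neuron permutation. A minimal sketch of that idea in NumPy, covering permutation invariance only (the paper's canonicalization handles richer structural edits); `canonicalize_mlp` and its norm-based sort key are illustrative choices, not the authors' code:

```python
import numpy as np

def canonicalize_mlp(weights, biases):
    """Sort each hidden layer's neurons by incoming-weight L2 norm and apply
    the matching permutation to the next layer's input columns. The rewrite
    is function-preserving, so a watermark verifier that assumes a fixed
    neuron ordering can be run on the canonical network even after an
    attacker has shuffled neurons.
    weights[i] has shape (out_i, in_i); biases[i] has shape (out_i,).
    """
    weights = [w.copy() for w in weights]
    biases = [b.copy() for b in biases]
    for i in range(len(weights) - 1):  # every layer except the output
        order = np.argsort(np.linalg.norm(weights[i], axis=1))
        weights[i] = weights[i][order]             # reorder neurons
        biases[i] = biases[i][order]
        weights[i + 1] = weights[i + 1][:, order]  # undo the reorder downstream
    return weights, biases
```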
attack arXiv Mar 4, 2026 · 4w ago

In-Context Environments Induce Evaluation-Awareness in Language Models

Maheep Chaudhary · Independent

Adversarially optimized prompts induce LLM sandbagging on benchmarks with 94-percentage-point accuracy drops, far exceeding hand-crafted baselines

Prompt Injection nlp
PDF
benchmark arXiv Mar 1, 2026 · 5w ago

I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

Subramanyam Sahoo, Vinija Jain, Divya Chaudhary et al. · Independent · Meta AI +3 more

Exposes catastrophic silent failure of LLM toxicity safety classifiers under tiny embedding drift, defeating standard confidence-based monitoring

Prompt Injection nlp
PDF
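The failure mode is easy to reproduce in miniature: under embedding drift a classifier's accuracy can fall while its confidence stays high, so confidence-based monitors see nothing. A toy sketch with a synthetic stand-in for a safety classifier (the paper evaluates real toxicity classifiers; all data here is fabricated for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 64))
y = (X[:, 0] > 0).astype(int)            # one informative dimension
clf = LogisticRegression(max_iter=1000).fit(X[:1000], y[:1000])

# Drift the test embeddings and watch accuracy fall while the
# classifier's own confidence (max predicted probability) barely moves.
for sigma in (0.0, 0.5, 1.0, 2.0):
    X_drift = X[1000:] + rng.normal(scale=sigma, size=(1000, 64))
    acc = clf.score(X_drift, y[1000:])
    conf = clf.predict_proba(X_drift).max(axis=1).mean()
    print(f"sigma={sigma:.1f}  accuracy={acc:.3f}  mean confidence={conf:.3f}")
```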
defense arXiv Feb 16, 2026 · 7w ago

Weight Space Detection of Backdoors in LoRA Adapters

David Puertolas Merenciano, Ekaterina Vasyagina, Raghav Dixit et al. · Algoverse AI Research · University of Aberdeen +1 more

Detects backdoored LoRA adapters via SVD spectral statistics on weight matrices, achieving 97% accuracy without model execution

Model Poisoning AI Supply Chain Attacks nlp
PDF
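A hedged sketch of what SVD-based weight-space features might look like for a LoRA adapter; the exact feature set here is an assumption, not the paper's:

```python
import numpy as np

def lora_spectral_features(A, B):
    """Singular-value statistics of the effective LoRA update dW = B @ A
    (B: (d, r), A: (r, k)). Features like these would then feed a
    lightweight classifier such as logistic regression -- no forward
    pass of the host model is needed.
    """
    s = np.linalg.svd(B @ A, compute_uv=False)
    p = s / s.sum()
    return {
        "spectral_norm": s[0],
        "stable_rank": (s ** 2).sum() / (s[0] ** 2),
        "sv_entropy": -(p * np.log(p + 1e-12)).sum(),
        "top1_energy": (s[0] ** 2) / (s ** 2).sum(),
    }
```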
benchmark arXiv Feb 15, 2026 · 7w ago

NEST: Nascent Encoded Steganographic Thoughts

Artem Karpov · Independent

Benchmarks steganographic chain-of-thought capabilities across 28 LLMs to evaluate risks to AI safety oversight based on chain-of-thought monitoring

Excessive Agency nlp
PDF
attack arXiv Feb 10, 2026 · 7w ago

Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions

J Rosser, Robert Kirk, Edward Grefenstette et al. · University of Oxford · Independent +2 more

Poisons ML models by perturbing existing training data via influence functions, inducing targeted behavior without injecting explicit attack examples

Data Poisoning Attack Training Data Poisoning vision nlp
PDF Code
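The mechanism can be sketched with a first-order influence proxy: score existing training examples by how well their gradients align with a target behavior's gradient, then perturb the top-scoring ones. This drops the inverse-Hessian term of proper influence functions for brevity; `influence_proxy` is a hypothetical helper, not the paper's implementation:

```python
import torch

def influence_proxy(model, loss_fn, train_examples, target_batch):
    """Score each existing training example by the alignment of its loss
    gradient with the gradient of a target-behavior loss. Proper influence
    functions also apply an inverse-Hessian term, dropped here for brevity.
    High-scoring examples are the cheapest to perturb toward the target
    behavior -- no explicit attack examples need to be injected.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    x_t, y_t = target_batch
    g_target = torch.autograd.grad(loss_fn(model(x_t), y_t), params)
    scores = []
    for x, y in train_examples:          # one (input, label) pair at a time
        g = torch.autograd.grad(loss_fn(model(x[None]), y[None]), params)
        scores.append(sum((a * b).sum() for a, b in zip(g_target, g)).item())
    return scores
```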
benchmark arXiv Dec 29, 2025 · Dec 2025

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

Karolina Korgul, Yushi Yang, Arkadiusz Drohomirecki et al. · University of Oxford · SoftServe +2 more

Benchmarks indirect prompt injection susceptibility of six frontier LLM agents on realistic web tasks using persuasion techniques

Prompt Injection Excessive Agency nlp
PDF
benchmark arXiv Nov 13, 2025 · Nov 2025

CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D

Francis Rhys Ward, Teun van der Weij, Hanna Gábor et al. · Apollo Research · Independent +2 more

Benchmarks frontier LLM agents' ability to implant backdoors, sandbag ML models, and evade automated oversight monitors

Model Poisoning Excessive Agency nlp
2 citations · 1 influential · PDF Code
defense arXiv Nov 11, 2025 · Nov 2025

SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought

Shourya Batra, Pierce Tillman, Samarth Gaggar et al. · Independent · Algoverse +3 more

Activation steering defense that reduces sensitive user data leakage in LLM chain-of-thought reasoning traces at inference time

Sensitive Information Disclosure nlp
4 citations · 1 influential · PDF
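Activation steering at inference time is typically implemented as a forward hook that shifts a layer's hidden states along a precomputed direction. A minimal PyTorch sketch, assuming the steering direction has already been extracted (e.g., by contrasting activations on leaky vs. leakage-free reasoning traces); the module path in the usage comment is a hypothetical GPT-2-style layout:

```python
import torch

def add_steering_hook(layer, direction, alpha=4.0):
    """Register a forward hook that shifts the layer's output activations
    along `direction` at inference time; `alpha` sets steering strength.
    The direction is assumed to be a precomputed unit vector.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.device, hidden.dtype)
        return (steered,) + tuple(output[1:]) if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)

# Hypothetical usage with a GPT-2-style module layout:
# handle = add_steering_hook(model.transformer.h[12], direction)
# ...generate with the hook active...
# handle.remove()
```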
attack arXiv Nov 1, 2025 · Nov 2025

Red-teaming Activation Probes using Prompted LLMs

Phil Blandfort, Robert Graham · Independent

Black-box LLM red-teaming scaffold that uses iterative in-context learning to evade activation-probe safety monitors using only natural-language prompts

Input Manipulation Attack Prompt Injection nlp
PDF Code
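The scaffold's outer loop is plain search: an attacker LLM proposes natural-language rewrites, the candidate that most lowers the monitor's probe score is kept, and the scored history is fed back in-context. A sketch under that reading, with `llm_rewrite` and `probe_score` as hypothetical stand-ins for the attacker model and the defender's probe:

```python
def red_team_probe(seed_prompt, llm_rewrite, probe_score, steps=20, k=4):
    """Iteratively rewrite a prompt to lower an activation probe's score.
    `llm_rewrite(prompt, history)` asks the attacker LLM for a variant
    given the scored attempts so far; `probe_score(prompt)` is the
    defender's monitor (lower = less flagged). Both are hypothetical
    stand-ins supplied by the caller.
    """
    best, best_score = seed_prompt, probe_score(seed_prompt)
    history = [(best, best_score)]
    for _ in range(steps):
        candidates = [llm_rewrite(best, history) for _ in range(k)]
        scored = [(c, probe_score(c)) for c in candidates]
        history.extend(scored)
        top, top_score = min(scored, key=lambda cs: cs[1])
        if top_score < best_score:
            best, best_score = top, top_score
    return best, best_score
```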
benchmark arXiv Sep 16, 2025 · Sep 2025

Towards mitigating information leakage when evaluating safety monitors

Gerard Boxo, Aman Neelappa, Shivam Raval · Independent · Harvard University

Benchmarks LLM safety monitors (linear probes), revealing 10–40% AUROC inflation from textual leakage artifacts rather than genuine internal signals

Prompt Injection nlp
PDF
benchmark arXiv Sep 10, 2025 · Sep 2025

Evaluation Awareness Scales Predictably in Open-Weights Large Language Models

Maheep Chaudhary, Ian Su, Nikhil Hooda et al. · Independent · University of California +6 more

Discovers power-law scaling of LLM evaluation awareness across 15 models, forecasting deceptive capability concealment in larger models

Prompt Injection nlp
PDF Code
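A power law is a straight line in log-log space, so the forecasting step reduces to a linear fit. A sketch with fabricated illustrative numbers (the paper fits real measurements across 15 open-weights models):

```python
import numpy as np

# Hypothetical (parameter count in billions, evaluation-awareness score)
# pairs, invented purely to illustrate the fitting procedure.
sizes = np.array([0.5, 1, 3, 7, 13, 34, 70])
awareness = np.array([0.04, 0.06, 0.09, 0.13, 0.17, 0.25, 0.34])

# Linear regression in log-log space recovers the power-law exponent.
slope, intercept = np.polyfit(np.log(sizes), np.log(awareness), 1)
print(f"awareness ~ {np.exp(intercept):.3f} * params^{slope:.2f}")

# Extrapolating to a hypothetical 400B-parameter model is then one line:
print(f"predicted awareness at 400B: {np.exp(intercept) * 400 ** slope:.2f}")
```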
defense arXiv Aug 23, 2025 · Aug 2025

Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks

Jack Youstra, Mohammed Mahfoud, Yang Yan et al. · Independent · Anthropic +1 more

Defends LLM fine-tuning APIs against cipher-based backdoor poisoning with activation-probe monitors that achieve 99%+ detection accuracy on unseen ciphers

Model Poisoning Data Poisoning Attack Training Data Poisoning nlp
PDF Code
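An activation-probe monitor of this kind is, at its simplest, a linear classifier over hidden-state activations of submitted fine-tuning data. A toy sketch with synthetic activations standing in for real forward-pass features; the separation between the two distributions is assumed for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic stand-ins for hidden-state activations of fine-tuning examples;
# the mean shift between benign and cipher-encoded data is assumed.
benign = rng.normal(0.0, 1.0, size=(500, 256))
cipher = rng.normal(0.6, 1.0, size=(500, 256))
X = np.vstack([benign, cipher])
y = np.array([0] * 500 + [1] * 500)

# Train the probe on half the data, evaluate on the held-out half.
probe = LogisticRegression(max_iter=1000).fit(X[::2], y[::2])
auroc = roc_auc_score(y[1::2], probe.predict_proba(X[1::2])[:, 1])
print(f"held-out AUROC: {auroc:.3f}")
```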
benchmark arXiv Aug 9, 2025 · Aug 2025

Who's the Evil Twin? Differential Auditing for Undesired Behavior

Ishwar Balappanawar, Venkata Hasith Vattikuti, Greta Kintzley et al. · IIIT Hyderabad · University of Texas at Austin +1 more

Adversarial auditing game framework detects backdoored CNNs and misaligned LLMs using model diffing, gradients, and adversarial probing

Model Poisoning Prompt Injection vision nlp
PDF
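One of the simpler differential-auditing signals is weight diffing: rank parameter tensors by how much the two twins disagree to localize a possible backdoor. A minimal PyTorch sketch, assuming both models share an architecture; this is one illustrative signal, not the framework's full pipeline:

```python
import torch

@torch.no_grad()
def weight_diff_report(model_a, model_b, top_k=5):
    """Rank parameter tensors by relative L2 difference between two
    candidate models to localize where the 'evil twin' diverges.
    Assumes both models share the same architecture and parameter names.
    """
    params_b = dict(model_b.named_parameters())
    diffs = []
    for name, p_a in model_a.named_parameters():
        rel = (p_a - params_b[name]).norm() / (p_a.norm() + 1e-12)
        diffs.append((rel.item(), name))
    for rel, name in sorted(diffs, reverse=True)[:top_k]:
        print(f"{rel:.4f}  {name}")
```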