ML Security Papers

Latest papers

3 papers

benchmark arXiv Dec 22, 2025

Manas Khatore, Sumana Sridharan, Kevork Sulahian et al. · Algoverse · p-1.ai +1 more

Tests whether verbosity, hedging, and conflicting-answer injection can game LLM-based answer-matching evaluation systems

Prompt Injection nlp

defense arXiv Nov 11, 2025

Shourya Batra, Pierce Tillman, Samarth Gaggar et al. · Independent · Algoverse +3 more

Activation steering defense that reduces sensitive user data leakage in LLM chain-of-thought reasoning traces at inference time

Sensitive Information Disclosure nlp

4 citations · 1 influential

benchmark arXiv Sep 10, 2025

Maheep Chaudhary, Ian Su, Nikhil Hooda et al. · Independent · University of California +6 more

Discovers power-law scaling of LLM evaluation awareness across 15 models, forecasting deceptive capability concealment in larger models

Prompt Injection nlp