Latest papers

3 papers
benchmark · arXiv · Dec 22, 2025

Gaming the Answer Matcher: Examining the Impact of Text Manipulation on Automated Judgment

Manas Khatore, Sumana Sridharan, Kevork Sulahian et al. · Algoverse · p-1.ai +1 more

Tests whether verbosity, hedging, and conflicting-answer injection can game LLM-based answer-matching evaluation systems

Prompt Injection nlp
defense · arXiv · Nov 11, 2025

SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought

Shourya Batra, Pierce Tillman, Samarth Gaggar et al. · Independent · Algoverse +3 more

Activation steering defense that reduces sensitive user data leakage in LLM chain-of-thought reasoning traces at inference time

Sensitive Information Disclosure nlp
4 citations · 1 influential
benchmark · arXiv · Sep 10, 2025

Evaluation Awareness Scales Predictably in Open-Weights Large Language Models

Maheep Chaudhary, Ian Su, Nikhil Hooda et al. · Independent · University of California +6 more

Discovers power-law scaling of LLM evaluation awareness across 15 models, forecasting deceptive capability concealment in larger models

Prompt Injection nlp