ML Security Papers

Latest papers

8 papers

attack arXiv Apr 11, 2026 · Apr 2026

When Can You Poison Rewards? A Tight Characterization of Reward Poisoning in Linear MDPs

Jose Efraim Aguilar Escamilla, Haoyang Hong, Jiawei Li et al. · Oregon State University · University of Illinois Urbana-Champaign +2 more

Characterizes when reward poisoning attacks can force RL agents to adopt attacker-chosen policies in linear MDPs

Model Skewing reinforcement-learning

PDF

attack arXiv Feb 19, 2026 · Feb 2026

Discovering Universal Activation Directions for PII Leakage in Language Models

Leo Marchyok, Zachary Coalson, Sungho Keum et al. · Oregon State University · Korea Advanced Institute of Science & Technology

Discovers universal activation directions in LLM residual streams that reliably amplify PII leakage beyond existing prompt-based extraction attacks

Model Inversion Attack Sensitive Information Disclosure nlp

PDF

defense arXiv Feb 19, 2026 · Feb 2026

Fail-Closed Alignment for Large Language Models

Zachary Coalson, Beth Sohler, Aiden Gabriel et al. · Oregon State University

Defends LLMs against jailbreaks by training multiple independent refusal pathways that attackers cannot simultaneously suppress

Prompt Injection nlp

PDF

attack arXiv Feb 19, 2026 · Feb 2026

Asking Forever: Universal Activations Behind Turn Amplification in Conversational LLMs

Zachary Coalson, Bo Fang, Sanghyun Hong · Oregon State University · University of Texas at Arlington

Discovers turn amplification as an LLM resource-exhaustion attack, using mechanistic activation analysis to enable persistent fine-tuning and parameter-corruption attack vectors

Model Poisoning Model Denial of Service nlp

PDF

attack arXiv Feb 4, 2026 · Feb 2026

Expert Selections In MoE Models Reveal (Almost) As Much As Text

Amir Nuriyev, Gabriel Kulp · MBZUAI · RAND +1 more

Reconstructs user input text from MoE routing decisions alone, achieving 91.2% token recovery via a transformer decoder

Model Inversion Attack Sensitive Information Disclosure nlp

PDF Code

attack EMNLP Sep 23, 2025 · Sep 2025

The Ranking Blind Spot: Decision Hijacking in LLM-based Text Ranking

Yaoyao Qian, Yifan Zeng, Yuchao Jiang et al. · Northeastern University · Oregon State University +1 more

Attacks LLM-based document rankers via content injection that hijacks evaluation objectives or relevance criteria, boosting attacker documents to top positions

Prompt Injection nlp

1 citation · 1 influential · PDF Code

benchmark arXiv Aug 23, 2025 · Aug 2025

Mind the Gap: Time-of-Check to Time-of-Use Vulnerabilities in LLM-Enabled Agents

Derek Lilienthal, Sanghyun Hong · Oregon State University

Identifies TOCTOU race-condition attacks on LLM agents, benchmarks 66 tasks, and evaluates three mitigation strategies

Insecure Plugin Design Excessive Agency nlp

PDF

defense arXiv Jan 12, 2025 · Jan 2025

Modeling Neural Networks with Privacy Using Neural Stochastic Differential Equations

Sanghyun Hong, Fan Wu, Anthony Gruber et al. · Oregon State University · Arizona State University +1 more

Proposes neural stochastic differential equations as a differentially-private architecture resisting membership inference with better utility than DP-SGD

Membership Inference Attack vision

PDF
