ML Security Papers

Latest papers

11 papers

defense arXiv Mar 24, 2026 · 13d ago

Agent-Sentry: Bounding LLM Agents via Execution Provenance

Rohan Sequeira, Stavros Damianakis, Umar Iqbal et al. · University of Southern California · Washington University in St. Louis

Behavioral bounds framework that blocks malicious tool calls in LLM agents by learning execution patterns and detecting deviations

Prompt Injection Excessive Agency nlp

PDF

defense arXiv Mar 8, 2026 · 29d ago

Trusting What You Cannot See: Auditable Fine-Tuning and Inference for Proprietary AI

Heng Jin, Chaoyu Zhang, Hexuan Yu et al. · Virginia Tech · Washington University in St. Louis

Auditable framework using lightweight spot-check traces to verify cloud providers honestly execute contracted LLM fine-tuning and inference

Output Integrity Attack nlp

PDF Code

defense arXiv Mar 2, 2026 · 5w ago

TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models

Zhen Guo, Shanghao Shi, Hao Li et al. · Saint Louis University · Washington University in St. Louis

Defends LLM reasoning traces against backdoor manipulation using a fine-tuned 4B verifier with RL-guided logical integrity auditing

Model Poisoning Prompt Injection nlp

PDF

defense arXiv Feb 23, 2026 · 6w ago

SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images

Aayush Dhakal, Subash Khanal, Srikumar Sastry et al. · Washington University in St. Louis · Oak Ridge National Laboratory

Proposes SimLBR, a latent blending regularization framework for AI-generated image detection with strong cross-generator generalization

Output Integrity Attack visiongenerative

PDF

defense arXiv Feb 16, 2026 · 7w ago

Protecting Language Models Against Unauthorized Distillation through Trace Rewriting

Xinhang Ma, William Yeoh, Ning Zhang et al. · Washington University in St. Louis

Defends LLM APIs against unauthorized knowledge distillation by rewriting reasoning traces to degrade student training and embed watermarks.

Model Theft Model Theft nlp

PDF

defense arXiv Feb 7, 2026 · 8w ago

AgentSys: Secure and Dynamic LLM Agents Through Explicit Hierarchical Memory Management

Ruoyao Wen, Hao Li, Chaowei Xiao et al. · Washington University in St. Louis · Johns Hopkins University

Defends LLM agents against indirect prompt injection using OS-inspired hierarchical memory isolation and schema-validated context boundaries

Prompt Injection Excessive Agency nlp

PDF Code

benchmark arXiv Feb 3, 2026 · 8w ago

AgentDyn: A Dynamic Open-Ended Benchmark for Evaluating Prompt Injection Attacks of Real-World Agent Security System

Hao Li, Ruoyao Wen, Shanghao Shi et al. · Washington University in St. Louis · Johns Hopkins University

New dynamic benchmark exposing that all existing indirect prompt injection defenses fail real-world agent deployment requirements

Prompt Injection nlp

PDF Code

defense arXiv Jan 15, 2026 · 11w ago

ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack

Hao Li, Yankai Yang, G. Edward Suh et al. · Washington University in St. Louis · University of Wisconsin–Madison +2 more

Defends LLM agents against indirect prompt injection using structured reasoning to detect conflicting injected instructions

Prompt Injection nlp

1 citations PDF Code

benchmark arXiv Jan 12, 2026 · 12w ago

Defenses Against Prompt Attacks Learn Surface Heuristics

Shawn Li, Chenxiao Yu, Zhiyu Ni et al. · University of Southern California · University of California +3 more

Exposes three shortcut biases in LLM prompt-injection defenses: position, token-trigger, and topic generalization—causing up to 90% false rejection rates

Prompt Injection nlp

PDF Code

defense arXiv Dec 12, 2025 · Dec 2025

Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring

Peichun Hua, Hao Li, Shanghao Shi et al. · Washington University in St. Louis · Texas A&M University

Detects LVLM jailbreaks by contrastively scoring internal model representations, separating malicious from novel-benign inputs

Input Manipulation Attack Prompt Injection multimodalvisionnlp

PDF Code

defense arXiv Oct 20, 2025 · Oct 2025

CrossGuard: Safeguarding MLLMs against Joint-Modal Implicit Malicious Attacks

Xu Zhang, Hao Li, Zhichao Lu · City University of Hong Kong · Washington University in St. Louis

Defends VLMs against implicit joint-modal jailbreaks where benign text+image pairs together express harmful intent

Input Manipulation Attack Prompt Injection multimodalnlpvision

PDF Code

Latest papers

Agent-Sentry: Bounding LLM Agents via Execution Provenance

Trusting What You Cannot See: Auditable Fine-Tuning and Inference for Proprietary AI

TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models

SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images

Protecting Language Models Against Unauthorized Distillation through Trace Rewriting

AgentSys: Secure and Dynamic LLM Agents Through Explicit Hierarchical Memory Management

AgentDyn: A Dynamic Open-Ended Benchmark for Evaluating Prompt Injection Attacks of Real-World Agent Security System

ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack

Defenses Against Prompt Attacks Learn Surface Heuristics

Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring

CrossGuard: Safeguarding MLLMs against Joint-Modal Implicit Malicious Attacks

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue