Latest papers

11 papers
defense arXiv Mar 24, 2026 · 13d ago

Agent-Sentry: Bounding LLM Agents via Execution Provenance

Rohan Sequeira, Stavros Damianakis, Umar Iqbal et al. · University of Southern California · Washington University in St. Louis

Behavioral bounds framework that blocks malicious tool calls in LLM agents by learning execution patterns and detecting deviations

Prompt Injection Excessive Agency nlp
PDF
defense arXiv Mar 8, 2026 · 29d ago

Trusting What You Cannot See: Auditable Fine-Tuning and Inference for Proprietary AI

Heng Jin, Chaoyu Zhang, Hexuan Yu et al. · Virginia Tech · Washington University in St. Louis

Auditable framework using lightweight spot-check traces to verify cloud providers honestly execute contracted LLM fine-tuning and inference

Output Integrity Attack nlp
PDF Code
defense arXiv Mar 2, 2026 · 5w ago

TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models

Zhen Guo, Shanghao Shi, Hao Li et al. · Saint Louis University · Washington University in St. Louis

Defends LLM reasoning traces against backdoor manipulation using a fine-tuned 4B verifier with RL-guided logical integrity auditing

Model Poisoning Prompt Injection nlp
PDF
defense arXiv Feb 23, 2026 · 6w ago

SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images

Aayush Dhakal, Subash Khanal, Srikumar Sastry et al. · Washington University in St. Louis · Oak Ridge National Laboratory

Proposes SimLBR, a latent blending regularization framework for AI-generated image detection with strong cross-generator generalization

Output Integrity Attack visiongenerative
PDF
defense arXiv Feb 16, 2026 · 7w ago

Protecting Language Models Against Unauthorized Distillation through Trace Rewriting

Xinhang Ma, William Yeoh, Ning Zhang et al. · Washington University in St. Louis

Defends LLM APIs against unauthorized knowledge distillation by rewriting reasoning traces to degrade student training and embed watermarks.

Model Theft Model Theft nlp
PDF
defense arXiv Feb 7, 2026 · 8w ago

AgentSys: Secure and Dynamic LLM Agents Through Explicit Hierarchical Memory Management

Ruoyao Wen, Hao Li, Chaowei Xiao et al. · Washington University in St. Louis · Johns Hopkins University

Defends LLM agents against indirect prompt injection using OS-inspired hierarchical memory isolation and schema-validated context boundaries

Prompt Injection Excessive Agency nlp
PDF Code
benchmark arXiv Feb 3, 2026 · 8w ago

AgentDyn: A Dynamic Open-Ended Benchmark for Evaluating Prompt Injection Attacks of Real-World Agent Security System

Hao Li, Ruoyao Wen, Shanghao Shi et al. · Washington University in St. Louis · Johns Hopkins University

New dynamic benchmark exposing that all existing indirect prompt injection defenses fail real-world agent deployment requirements

Prompt Injection nlp
PDF Code
defense arXiv Jan 15, 2026 · 11w ago

ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack

Hao Li, Yankai Yang, G. Edward Suh et al. · Washington University in St. Louis · University of Wisconsin–Madison +2 more

Defends LLM agents against indirect prompt injection using structured reasoning to detect conflicting injected instructions

Prompt Injection nlp
1 citations PDF Code
benchmark arXiv Jan 12, 2026 · 12w ago

Defenses Against Prompt Attacks Learn Surface Heuristics

Shawn Li, Chenxiao Yu, Zhiyu Ni et al. · University of Southern California · University of California +3 more

Exposes three shortcut biases in LLM prompt-injection defenses: position, token-trigger, and topic generalization—causing up to 90% false rejection rates

Prompt Injection nlp
PDF Code
defense arXiv Dec 12, 2025 · Dec 2025

Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring

Peichun Hua, Hao Li, Shanghao Shi et al. · Washington University in St. Louis · Texas A&M University

Detects LVLM jailbreaks by contrastively scoring internal model representations, separating malicious from novel-benign inputs

Input Manipulation Attack Prompt Injection multimodalvisionnlp
PDF Code
defense arXiv Oct 20, 2025 · Oct 2025

CrossGuard: Safeguarding MLLMs against Joint-Modal Implicit Malicious Attacks

Xu Zhang, Hao Li, Zhichao Lu · City University of Hong Kong · Washington University in St. Louis

Defends VLMs against implicit joint-modal jailbreaks where benign text+image pairs together express harmful intent

Input Manipulation Attack Prompt Injection multimodalnlpvision
PDF Code