Latest papers

12 papers
defense · arXiv · Apr 6, 2026

ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems

Zhuowen Yuan, Zhaorun Chen, Zhen Xiang et al. · University of Illinois Urbana-Champaign · Virtue AI +6 more

Network-level guardrail detecting supply-chain poisoning in LLM agent MCP tools via MITM proxy monitoring network behaviors

AI Supply Chain Attacks · Insecure Plugin Design · nlp
PDF
defense · arXiv · Feb 18, 2026

Protecting the Undeleted in Machine Unlearning

Aloni Cohen, Refael Kohen, Kobbi Nissim et al. · University of Chicago · Tel Aviv University +1 more

Demonstrates that perfect-retraining unlearning can leak undeleted users' data; proposes a new security definition to prevent reconstruction attacks mounted via deletion requests

Model Inversion Attack
PDF
defense · arXiv · Feb 17, 2026

Unforgeable Watermarks for Language Models via Robust Signatures

Huijia Lin, Kameron Shahabi, Min Jae Song · University of Washington · University of Chicago

Constructs unforgeable, recoverable LLM text watermarks using robust digital signatures to prevent false attribution attacks

Output Integrity Attack · nlp
PDF
defense · arXiv · Jan 31, 2026

A Causal Perspective for Enhancing Jailbreak Attack and Defense

Licheng Pan, Yunsheng Lu, Jiexi Liu et al. · Zhejiang University · University of Chicago +1 more

Causal discovery framework that identifies interpretable drivers of LLM jailbreaks, used both to enhance attacks and to improve prompt-level defenses

Prompt Injection · nlp
PDF · Code
benchmark · arXiv · Dec 7, 2025

Ideal Attribution and Faithful Watermarks for Language Models

Min Jae Song, Kameron Shahabi · University of Chicago · University of Washington

Proposes a formal attribution framework as a ground truth for LLM text watermarking, unifying guarantee statements across schemes

Output Integrity Attack · nlp
PDF
benchmark · arXiv · Nov 25, 2025

Quantifying the Privacy Implications of High-Fidelity Synthetic Network Traffic

Van Tran, Shinan Liu, Tian Li et al. · University of Chicago · University of Hong Kong

Benchmarks membership inference and data extraction attacks against generative models of network traffic, finding up to 88% MIA success and 100% identifier recovery

Membership Inference Attack · Model Inversion Attack · tabular · timeseries
1 citation · PDF
defense · arXiv · Oct 20, 2025

Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

Jiawei Zhang, Andrew Estornell, David D. Baek et al. · ByteDance · University of Chicago +2 more

Inference-time defense that reintroduces alignment tokens mid-generation to block jailbreaks and adversarial prefill attacks in LLMs

Input Manipulation Attack · Prompt Injection · nlp
PDF
defense · arXiv · Oct 20, 2025

BlueCodeAgent: A Blue Teaming Agent Enabled by Automated Red Teaming for CodeGen AI

Chengquan Guo, Yuzhou Nie, Chulin Xie et al. · University of Chicago · UC Santa Barbara +3 more

Blue-teaming agent for CodeGen LLMs that uses automated red teaming to detect malicious instructions and vulnerable code outputs

Prompt Injection · nlp
PDF
tool · arXiv · Oct 3, 2025

ARMs: Adaptive Red-Teaming Agent against Multimodal Models with Plug-and-Play Attacks

Zhaorun Chen, Xun Liu, Mintong Kang et al. · University of Chicago · University of Illinois +2 more

Adaptive agentic red-teaming system that jailbreaks VLMs with 11 multimodal attack strategies, exceeding 90% ASR on Claude-4-Sonnet

Input Manipulation Attack · Prompt Injection · multimodal · nlp
1 citation · PDF · Code
tool · arXiv · Oct 2, 2025

RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents

Chengquan Guo, Chulin Xie, Yu Yang et al. · University of Chicago · University of Illinois Urbana-Champaign +5 more

Automated red-teaming agent that adaptively combines jailbreak tools to uncover safety vulnerabilities in LLM-based code agents

Prompt Injection · nlp
4 citations · PDF
defense · arXiv · Sep 25, 2025

WISER: Segmenting watermarked region - an epidemic change-point perspective

Soham Bonnerjee, Sayar Karmakar, Subhrajyoty Roy · University of Chicago · University of Florida +1 more

Localizes multiple LLM-watermarked text segments in mixed-source documents using an epidemic change-point algorithm with finite-sample guarantees

Output Integrity Attack · nlp
PDF
defense · arXiv · Sep 16, 2025

Agentic JWT: A Secure Delegation Protocol for Autonomous AI Agents

Abhishek Goswami · University of Chicago

Proposes A-JWT delegation tokens to prevent LLM agents from exceeding their authorized scope, blocking replay, impersonation, and prompt-injection pathways

Excessive Agency · Insecure Plugin Design · nlp
PDF