Latest papers

12 papers
defense · arXiv · Apr 6, 2026

ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems

Zhuowen Yuan, Zhaorun Chen, Zhen Xiang et al. · University of Illinois Urbana-Champaign · Virtue AI +6 more

Network-level guardrail detecting supply-chain poisoning in LLM agent MCP tools via MITM proxy monitoring network behaviors

AI Supply Chain Attacks · Insecure Plugin Design · nlp
PDF
defense · arXiv · Feb 18, 2026

Protecting the Undeleted in Machine Unlearning

Aloni Cohen, Refael Kohen, Kobbi Nissim et al. · University of Chicago · Tel Aviv University +1 more

Demonstrates that perfect-retraining unlearning can leak undeleted users' data; proposes a new security definition to prevent reconstruction attacks mounted via deletion requests

Model Inversion Attack
PDF
defense · arXiv · Feb 17, 2026

Unforgeable Watermarks for Language Models via Robust Signatures

Huijia Lin, Kameron Shahabi, Min Jae Song · University of Washington · University of Chicago

Constructs unforgeable, recoverable LLM text watermarks using robust digital signatures to prevent false attribution attacks

Output Integrity Attack · nlp
PDF
defense · arXiv · Jan 31, 2026

A Causal Perspective for Enhancing Jailbreak Attack and Defense

Licheng Pan, Yunsheng Lu, Jiexi Liu et al. · Zhejiang University · University of Chicago +1 more

Causal discovery framework that identifies interpretable drivers of LLM jailbreaks, used both to enhance attacks and to improve prompt-level defenses

Prompt Injection · nlp
PDF · Code
benchmark · arXiv · Dec 7, 2025

Ideal Attribution and Faithful Watermarks for Language Models

Min Jae Song, Kameron Shahabi · University of Chicago · University of Washington

Proposes a formal attribution framework as a ground truth for LLM text watermarking, unifying guarantee statements across schemes

Output Integrity Attack · nlp
PDF
benchmark · arXiv · Nov 25, 2025

Quantifying the Privacy Implications of High-Fidelity Synthetic Network Traffic

Van Tran, Shinan Liu, Tian Li et al. · University of Chicago · University of Hong Kong

Benchmarks membership inference and data extraction attacks against generative models of network traffic, finding up to 88% MIA success and 100% identifier recovery

Membership Inference Attack · Model Inversion Attack · tabular · timeseries
1 citation · PDF
defense · arXiv · Oct 20, 2025

Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

Jiawei Zhang, Andrew Estornell, David D. Baek et al. · ByteDance · University of Chicago +2 more

Inference-time defense that reintroduces alignment tokens mid-generation to block jailbreaks and adversarial prefill attacks in LLMs

Input Manipulation Attack · Prompt Injection · nlp
PDF
defense · arXiv · Oct 20, 2025

BlueCodeAgent: A Blue Teaming Agent Enabled by Automated Red Teaming for CodeGen AI

Chengquan Guo, Yuzhou Nie, Chulin Xie et al. · University of Chicago · UC Santa Barbara +3 more

Blue-teaming agent for CodeGen LLMs that uses automated red teaming to detect malicious instructions and vulnerable code outputs

Prompt Injection · nlp
PDF
tool · arXiv · Oct 3, 2025

ARMs: Adaptive Red-Teaming Agent against Multimodal Models with Plug-and-Play Attacks

Zhaorun Chen, Xun Liu, Mintong Kang et al. · University of Chicago · University of Illinois +2 more

Adaptive agentic red-teaming system that jailbreaks VLMs with 11 multimodal attack strategies, exceeding 90% ASR on Claude-4-Sonnet

Input Manipulation Attack · Prompt Injection · multimodal · nlp
1 citation · PDF · Code
tool · arXiv · Oct 2, 2025

RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents

Chengquan Guo, Chulin Xie, Yu Yang et al. · University of Chicago · University of Illinois Urbana-Champaign +5 more

Automated red-teaming agent that adaptively combines jailbreak tools to uncover safety vulnerabilities in LLM-based code agents

Prompt Injection · nlp
4 citations · PDF
defense · arXiv · Sep 25, 2025

WISER: Segmenting watermarked region - an epidemic change-point perspective

Soham Bonnerjee, Sayar Karmakar, Subhrajyoty Roy · University of Chicago · University of Florida +1 more

Localizes multiple LLM-watermarked text segments in mixed-source documents using an epidemic change-point algorithm with finite-sample guarantees

Output Integrity Attack · nlp
PDF
defense · arXiv · Sep 16, 2025

Agentic JWT: A Secure Delegation Protocol for Autonomous AI Agents

Abhishek Goswami · University of Chicago

Proposes A-JWT delegation tokens to prevent LLM agents from exceeding their authorized scope, blocking replay, impersonation, and prompt-injection pathways

Excessive Agency · Insecure Plugin Design · nlp
PDF