Latest papers

11 papers
benchmark arXiv Mar 21, 2026

LJ-Bench: Ontology-Based Benchmark for U.S. Crime

Hung Yun Tseng, Wuzhen Li, Blerina Gkotse et al. · University of Wisconsin–Madison

Systematic benchmark evaluating LLM jailbreak robustness across 76 crime categories grounded in U.S. legal frameworks

Prompt Injection nlp
PDF Code
defense arXiv Feb 18, 2026

Policy Compiler for Secure Agentic Systems

Nils Palumbo, Sarthak Choudhary, Jihye Choi et al. · University of Wisconsin–Madison · Langroid

Compiles LLM agent implementations into policy-compliant systems using dependency graphs, Datalog rules, and a reference monitor to block violations

Excessive Agency Prompt Injection nlp
PDF
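The entry above describes enforcing compiled policies with a reference monitor that blocks violating tool calls. A toy sketch of the general reference-monitor pattern follows; the policy format, names, and allow-list here are illustrative assumptions, not the paper's implementation:

```python
# Toy reference monitor: every tool call an agent proposes is checked
# against an explicit allow-policy before it is executed.
# The policy format and all names below are illustrative only.

ALLOWED = {
    # (tool, resource-prefix) pairs the policy permits
    ("read_file", "/workspace/"),
    ("web_search", ""),
}

def monitor(tool: str, resource: str) -> bool:
    """Return True iff the proposed call is policy-compliant."""
    return any(tool == t and resource.startswith(prefix)
               for t, prefix in ALLOWED)

def execute(tool: str, resource: str):
    """Dispatch a tool call only if the monitor approves it."""
    if not monitor(tool, resource):
        raise PermissionError(f"policy violation: {tool}({resource!r})")
    ...  # dispatch to the real tool implementation here

print(monitor("read_file", "/workspace/notes.txt"))  # True
print(monitor("read_file", "/etc/passwd"))           # False
```

The key design point is complete mediation: the agent never calls tools directly, so policy checks cannot be bypassed by injected instructions.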
benchmark arXiv Jan 30, 2026

Lingua-SafetyBench: A Benchmark for Safety Evaluation of Multilingual Vision-Language Models

Enyi Shi, Pengyang Shao, Yanxin Zhang et al. · Nanjing University of Science and Technology · National University of Singapore +3 more

Multilingual multimodal safety benchmark revealing cross-lingual asymmetries in VLLM jailbreak susceptibility across 10 languages and 11 models

Prompt Injection multimodal nlp
PDF Code
attack arXiv Jan 29, 2026

ReasoningBomb: A Stealthy Denial-of-Service Attack by Inducing Pathologically Long Reasoning in Large Reasoning Models

Xiaogeng Liu, Xinyan Wang, Yechao Zhang et al. · Johns Hopkins University · NVIDIA +4 more

RL-trained attacker generates short natural prompts that force LRMs into pathologically long reasoning, achieving 286x amplification and >98% detection bypass

Model Denial of Service nlp reinforcement-learning
PDF
defense arXiv Jan 15, 2026

ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack

Hao Li, Yankai Yang, G. Edward Suh et al. · Washington University in St. Louis · University of Wisconsin–Madison +2 more

Defends LLM agents against indirect prompt injection using structured reasoning to detect conflicting injected instructions

Prompt Injection nlp
1 citation PDF Code
defense arXiv Jan 15, 2026

Understanding and Preserving Safety in Fine-Tuned LLMs

Jiawen Zhang, Yangfan Hu, Kejia Chen et al. · Zhejiang University · University of Wisconsin–Madison +4 more

Preserves LLM jailbreak resistance through fine-tuning by projecting utility gradients away from the low-rank safety subspace

Transfer Learning Attack Prompt Injection nlp
PDF Code
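The entry above describes projecting utility gradients away from a low-rank safety subspace during fine-tuning. A minimal numerical sketch of that projection idea follows, assuming an orthonormal basis for the safety directions is already available (how that basis is found is the paper's contribution and is not shown here):

```python
import numpy as np

def project_out_safety_subspace(utility_grad, safety_basis):
    """Remove the component of a utility gradient that lies in a
    low-rank 'safety subspace' spanned by the columns of safety_basis.

    utility_grad: (d,) flattened gradient of the fine-tuning loss.
    safety_basis: (d, k) orthonormal basis of safety-critical
                  directions, with k << d.
    """
    # Coordinates of the gradient inside the safety subspace.
    coeffs = safety_basis.T @ utility_grad      # shape (k,)
    in_subspace = safety_basis @ coeffs         # shape (d,)
    # Keep only the component orthogonal to the safety subspace,
    # so the update cannot move weights along safety directions.
    return utility_grad - in_subspace

# Toy example: d = 4, safety subspace = first coordinate axis.
basis = np.array([[1.0], [0.0], [0.0], [0.0]])
grad = np.array([3.0, 1.0, -2.0, 0.5])
projected = project_out_safety_subspace(grad, basis)
print(projected)  # the component along the safety direction is zeroed
```

In the toy example the safety component (3.0 along the first axis) is removed while the remaining coordinates pass through unchanged.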
defense arXiv Jan 1, 2026

Unknown Aware AI-Generated Content Attribution

Ellie Thieu, Jifan Zhang, Haoyue Bai · University of Wisconsin–Madison

Attributes AI-generated images to specific source models using constrained fine-tuning on unlabeled web data for open-world robustness

Output Integrity Attack vision generative
PDF
survey IACR ePrint Dec 1, 2025

Systems Security Foundations for Agentic Computing

Mihai Christodorescu, Earlence Fernandes, Ashish Hooda et al. · Google · University of California +5 more

Surveys agentic AI security through a systems-security lens, covering prompt injection, tool-use risks, and 11 real-world attack case studies

Prompt Injection Insecure Plugin Design Excessive Agency nlp
3 citations PDF
defense arXiv Nov 24, 2025

UniGame: Turning a Unified Multimodal Model Into Its Own Adversary

Zhaolong Su, Wang Lu, Hao Chen et al. · William & Mary · Independent Researcher +2 more

Self-adversarial training framework for unified multimodal models that perturbs shared visual tokens to improve adversarial and OOD robustness

Input Manipulation Attack multimodal vision nlp
PDF Code
attack arXiv Nov 18, 2025

Stealth Fine-Tuning: Efficiently Breaking Alignment in RVLMs Using Self-Generated CoT

Le Yu, Zhengyue Zhao, Yawen Zheng et al. · Sichuan University · University of Wisconsin–Madison +2 more

Breaks RVLM safety alignment via QLoRA fine-tuning on self-generated harmful CoT traces with 499 samples in under 3 hours

Transfer Learning Attack Prompt Injection multimodal nlp
PDF
benchmark arXiv Oct 8, 2025

Code Agent can be an End-to-end System Hacker: Benchmarking Real-world Threats of Computer-use Agent

Weidi Luo, Qiming Zhang, Tianyu Lu et al. · University of Georgia · University of Wisconsin–Madison +6 more

Benchmarks LLM-powered agents' ability to execute end-to-end enterprise intrusions aligned with MITRE ATT&CK TTPs

Excessive Agency Prompt Injection nlp multimodal
4 citations PDF Code