ML Security Papers

Latest papers

9 papers

defense arXiv Mar 31, 2026 · 6d ago

Robust Multimodal Safety via Conditional Decoding

Anurag Kumar, Raghuveer Peri, Jon Burnsky et al. · The Ohio State University · AWS

Conditional decoding defense using internal safety classification that blocks multimodal jailbreaks across text, image, and audio inputs

Input Manipulation Attack Prompt Injection multimodalnlpvisionaudio

PDF

defense arXiv Feb 28, 2026 · 5w ago

Atomicity for Agents: Exposing, Exploiting, and Mitigating TOCTOU Vulnerabilities in Browser-Use Agents

Linxi Jiang, Zhijie Liu, Haotian Luo et al. · The Ohio State University

Discovers and mitigates TOCTOU vulnerabilities in LLM browser agents where adversarial pages change state between planning and execution

Prompt Injection Excessive Agency nlpvisionmultimodal

PDF

defense arXiv Feb 9, 2026 · 8w ago

When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents

Yuting Ning, Jaylen Jones, Zhehao Zhang et al. · The Ohio State University · Amazon AGI

Guardrail system detects and corrects misaligned actions in computer-use agents, reducing indirect prompt injection attack success by 90%+

Prompt Injection Excessive Agency nlpmultimodal

PDF Code

defense arXiv Feb 4, 2026 · 8w ago

Trust The Typical

Debargha Ganguly, Sreehari Sankar, Biyao Zhang et al. · Case Western Reserve University · University of Pittsburgh +2 more

Defends LLMs against jailbreaks via OOD detection on safe prompts, reducing false positives by 40x over specialized safety models

Prompt Injection nlp

1 citations PDF

benchmark arXiv Dec 9, 2025 · Dec 2025

A Practical Framework for Evaluating Medical AI Security: Reproducible Assessment of Jailbreaking and Privacy Vulnerabilities Across Clinical Specialties

Jinghao Wang, Ping Zhang, Carter Yagemann · The Ohio State University

Proposes reproducible, consumer-hardware benchmark for evaluating jailbreaking and privacy extraction attacks on medical LLMs across clinical specialties

Prompt Injection Sensitive Information Disclosure nlp

PDF

attack arXiv Nov 14, 2025 · Nov 2025

A Systematic Study of Model Extraction Attacks on Graph Foundation Models

Haoyan Xu, Ruizhi Qian, Jiate Li et al. · University of Southern California · Florida State University +2 more

Systematically extracts Graph Foundation Models via black-box embedding regression, cloning victim models at 0.07% of original training cost

Model Theft graphmultimodal

PDF

Graph machine learning has advanced rapidly in tasks such as link prediction, anomaly detection, and node classification. As models scale up, pretrained graph models have become valuable intellectual assets because they encode extensive computation and domain expertise. Building on these advances, Graph Foundation Models (GFMs) mark a major step forward by jointly pretraining graph and text encoders on massive and diverse data. This unifies structural and semantic understanding, enables zero-shot inference, and supports applications such as fraud detection and biomedical analysis. However, the high pretraining cost and broad cross-domain knowledge in GFMs also make them attractive targets for model extraction attacks (MEAs). Prior work has focused only on small graph neural networks trained on a single graph, leaving the security implications for large-scale and multimodal GFMs largely unexplored. This paper presents the first systematic study of MEAs against GFMs. We formalize a black-box threat model and define six practical attack scenarios covering domain-level and graph-specific extraction goals, architectural mismatch, limited query budgets, partial node access, and training data discrepancies. To instantiate these attacks, we introduce a lightweight extraction method that trains an attacker encoder using supervised regression of graph embeddings. Even without contrastive pretraining data, this method learns an encoder that stays aligned with the victim text encoder and preserves its zero-shot inference ability on unseen graphs. Experiments on seven datasets show that the attacker can approximate the victim model using only a tiny fraction of its original training cost, with almost no loss in accuracy. These findings reveal that GFMs greatly expand the MEA surface and highlight the need for deployment-aware security defenses in large-scale graph learning systems.

gnn transformer multimodal University of Southern California · Florida State University · The Ohio State University +1 more

PDF arXiv DOI

defense arXiv Oct 30, 2025 · Oct 2025

Reasoning Up the Instruction Ladder for Controllable Language Models

Zishuo Zheng, Vidhisha Balachandran, Chan Young Park et al. · The Ohio State University · Microsoft Research +1 more

Trains LLMs via RL on instruction-hierarchy data to resist jailbreaks and prompt injection, cutting attack success rates by 20%

Prompt Injection nlp

1 citations PDF Code

defense arXiv Sep 29, 2025 · Sep 2025

A-MemGuard: A Proactive Defense Framework for LLM-Based Agent Memory

Qianshan Wei, Tengchao Yang, Yaochen Wang et al. · Nanyang Technological University · Independent Researcher +3 more

Defends LLM agent memory from indirect injection attacks using consensus-based validation and a dual-memory lesson structure

Prompt Injection Excessive Agency nlp

11 citations 2 influentialPDF Code

attack arXiv Aug 16, 2025 · Aug 2025

Too Easily Fooled? Prompt Injection Breaks LLMs on Frustratingly Simple Multiple-Choice Questions

Xuyang Guo, Zekai Huang, Zhao Song et al. · Guilin University of Electronic Technology · The Ohio State University +1 more

Demonstrates indirect prompt injection via PDF-hidden instructions fools LLMs even on trivial arithmetic judge tasks

Prompt Injection nlp

PDF

Latest papers

Robust Multimodal Safety via Conditional Decoding

Atomicity for Agents: Exposing, Exploiting, and Mitigating TOCTOU Vulnerabilities in Browser-Use Agents

When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents

Trust The Typical

A Practical Framework for Evaluating Medical AI Security: Reproducible Assessment of Jailbreaking and Privacy Vulnerabilities Across Clinical Specialties

A Systematic Study of Model Extraction Attacks on Graph Foundation Models

Reasoning Up the Instruction Ladder for Controllable Language Models

A-MemGuard: A Proactive Defense Framework for LLM-Based Agent Memory

Too Easily Fooled? Prompt Injection Breaks LLMs on Frustratingly Simple Multiple-Choice Questions

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue