Latest papers

9 papers
attack arXiv Feb 18, 2026

Vulnerability Analysis of Safe Reinforcement Learning via Inverse Constrained Reinforcement Learning

Jialiang Fan, Shixiong Jiang, Mengyu Liu et al. · University of Notre Dame · Washington State University

Black-box adversarial attack on Safe RL policies that uses inverse constrained RL to induce safety violations without access to the victim's gradients

Input Manipulation Attack reinforcement-learning
PDF
defense arXiv Jan 22, 2026

CodeGuard: Improving LLM Guardrails in CS Education

Nishat Raihan, Noah Erdachew, Jayoti Devi et al. · George Mason University · University of Oklahoma +1 more

Defends educational LLM coding assistants from unsafe prompts via PromptShield, a fine-tuned guardrail achieving 0.93 F1

Prompt Injection nlp
PDF Code
benchmark arXiv Jan 20, 2026

An Empirical Study on Remote Code Execution in Machine Learning Model Hosting Ecosystems

Mohammed Latif Siddiq, Tanzim Hossain Romel, Natalie Sekerak et al. · University of Notre Dame · IQVIA Inc

First large-scale empirical study of RCE risks from trust_remote_code on model-sharing platforms like HuggingFace (see the sketch below)

AI Supply Chain Attacks
PDF
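The risk this study measures is easy to reproduce in outline. A minimal sketch, assuming the Hugging Face transformers library and a hypothetical repository id, of how trust_remote_code hands execution to repository-supplied code at load time:

```python
# Minimal sketch of the trust_remote_code risk surface.
from transformers import AutoModel

# With trust_remote_code=True, transformers downloads and executes the
# custom modeling code shipped inside the model repository itself, so a
# malicious repo can run arbitrary Python on the loading machine.
model = AutoModel.from_pretrained(
    "some-org/custom-model",  # hypothetical, untrusted model repo
    trust_remote_code=True,   # opts in to running repo-supplied code
)
```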
benchmark arXiv Jan 11, 2026

MTMCS-Bench: Evaluating Contextual Safety of Multimodal Large Language Models in Multi-Turn Dialogues

Zheyuan Liu, Dongwhi Kim, Yixin Wan et al. · University of Notre Dame · University of California +2 more

Benchmarks multimodal LLM contextual safety against escalating and context-switch jailbreaks across 15 models and 5 guardrails

Prompt Injection multimodal nlp vision
PDF Code
defense arXiv Oct 10, 2025

Building a Foundational Guardrail for General Agentic Systems via Synthetic Data

Yue Huang, Hang Hua, Yujun Zhou et al. · University of Notre Dame · MIT-IBM Watson AI Lab +3 more

Proposes Safiron, a pre-execution guardrail that detects, categorizes, and explains risky LLM agent action plans before they execute

Excessive Agency nlp
5 citations (1 influential) PDF
defense arXiv Sep 27, 2025

Dual-Space Smoothness for Robust and Balanced LLM Unlearning

Han Yan, Zheyuan Liu, Meng Jiang · University of Notre Dame · The Chinese University of Hong Kong

Defends LLM unlearning against jailbreak and relearning attacks via dual-space smoothness in representation and parameter spaces

Prompt Injection Sensitive Information Disclosure nlp
1 citation PDF Code
defense arXiv Sep 24, 2025

Beyond Sharp Minima: Robust LLM Unlearning via Feedback-Guided Multi-Point Optimization

Wenhan Wu, Zheyuan Liu, Chongyang Gao et al. · Northwestern University · University of Notre Dame +1 more

Hardens LLM unlearning against relearning attacks by steering parameters toward flat loss minima via adversarial neighborhood-aware optimization

Sensitive Information Disclosure Prompt Injection nlp
1 citation PDF
survey arXiv Aug 27, 2025

Intellectual Property in Graph-Based Machine Learning as a Service: Attacks and Defenses

Lincan Li, Bolin Shen, Chenxi Zhao et al. · Florida State University · Northeastern University +3 more

Surveys model theft, data reconstruction, and membership inference attacks and defenses for graph ML-as-a-service, accompanied by PyGIP, an open-source evaluation library

Model Theft Model Inversion Attack Membership Inference Attack graph
PDF Code
survey arXiv Aug 20, 2025

A Systematic Survey of Model Extraction Attacks and Defenses: State-of-the-Art and Perspectives

Kaixiang Zhao, Lincan Li, Kaize Ding et al. · University of Notre Dame · Florida State University +3 more

Surveys model extraction attacks and defenses across MLaaS platforms, proposing a taxonomy organized by attack mechanism and computing environment

Model Theft vision nlp tabular
PDF Code