Latest papers

6 papers
defense arXiv Mar 18, 2026

Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift

Zhihua Wei, Qiang Li, Jian Ruan et al. · Tongji University · Shanghai Artificial Intelligence Laboratory

Proposes JRS-Rem, a defense that prevents VLM jailbreaks by removing image-induced representation shifts toward jailbreak states at inference time (see the sketch below)

Input Manipulation Attack Prompt Injection multimodal vision nlp
PDF Code
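
The defense itself is specified in the paper, but its core move, subtracting an image-induced shift toward a "jailbreak" region of representation space at inference time, can be sketched as a projection step. A minimal sketch, assuming a difference-of-means shift direction; the function names and toy dimensions are illustrative assumptions, not the authors' code:

```python
import numpy as np

def estimate_jailbreak_direction(jailbreak_acts, benign_acts):
    """Difference-of-means direction between jailbreak and benign
    hidden states (shape: [n_samples, d_model]), unit-normalized."""
    d = jailbreak_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def remove_shift(hidden, direction):
    """Project the jailbreak-related component out of each hidden state."""
    coeff = hidden @ direction                 # [n_tokens]
    return hidden - np.outer(coeff, direction)

# Toy usage: 8-dim hidden states, direction from 32 contrastive samples.
rng = np.random.default_rng(0)
jb, benign = rng.normal(1.0, 1.0, (32, 8)), rng.normal(0.0, 1.0, (32, 8))
d = estimate_jailbreak_direction(jb, benign)
h = rng.normal(size=(5, 8))
print(np.allclose(remove_shift(h, d) @ d, 0.0))  # component along d is gone
```

Projecting out a single direction leaves the rest of the representation intact, which is why inference-time edits of this kind tend to preserve benign behavior.
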
survey arXiv Dec 6, 2025

Degrading Voice: A Comprehensive Overview of Robust Voice Conversion Through Input Manipulation

Xining Song, Zhihua Wei, Rui Wang et al. · Tongji University · iFLYTEK +2 more

Surveys adversarial, noise, and perturbation attacks on voice conversion models, along with defenses, and evaluates robustness across four speech-quality dimensions (see the sketch below)

Input Manipulation Attack audio
1 citation PDF
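
As a concrete instance of the simplest input-manipulation family the survey covers, the sketch below degrades an utterance with additive Gaussian noise at a target SNR; the function and the 16 kHz toy signal are illustrative assumptions, not code from the survey:

```python
import numpy as np

def add_noise_at_snr(wave, snr_db, rng=None):
    """Additive Gaussian noise at a target signal-to-noise ratio (dB),
    the most basic input perturbation applied to voice conversion."""
    rng = rng or np.random.default_rng()
    sig_power = np.mean(wave ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    return wave + rng.normal(0.0, np.sqrt(noise_power), wave.shape)

# Toy usage: a 1 s sine "utterance" at 16 kHz, degraded at 10 dB SNR.
t = np.linspace(0, 1, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
noisy = add_noise_at_snr(clean, snr_db=10.0, rng=np.random.default_rng(0))
```
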
attack arXiv Nov 10, 2025

Uncovering Pretraining Code in LLMs: A Syntax-Aware Attribution Approach

Yuanheng Li, Zhuoyang Chen, Xiaoyun Liu et al. · Sun Yat-Sen University · Tongji University

Syntax-aware membership inference attack on LLMs that prunes grammatically forced code tokens to improve training-data attribution accuracy (see the sketch below)

Membership Inference Attack nlp
PDF
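
A minimal sketch of the pruning idea: score membership by mean per-token log-likelihood, but only over tokens the grammar does not force. The FORCED set here is a hand-picked stand-in for a real syntax-aware filter, and membership_score is a hypothetical helper, not the paper's implementation:

```python
import math

# Tokens a code grammar forces regardless of what the model memorized;
# this small set is an illustrative stand-in for a real parser's output.
FORCED = {"(", ")", "{", "}", ";", ":", ",", "def", "return", "="}

def membership_score(tokens, logprobs):
    """Mean log-likelihood over content tokens only: grammatically
    forced tokens are pruned so they don't dilute the membership signal."""
    kept = [lp for tok, lp in zip(tokens, logprobs) if tok not in FORCED]
    return sum(kept) / len(kept) if kept else -math.inf

# Toy usage: per-token log-probs as produced by the target LLM.
toks = ["def", "gcd", "(", "a", ",", "b", ")", ":"]
lps  = [-0.1, -3.2, -0.1, -2.8, -0.1, -2.9, -0.1, -0.1]
print(membership_score(toks, lps))   # scores only gcd / a / b
```

Forced tokens are easy for any model to predict, so including them inflates likelihoods for members and non-members alike; pruning them sharpens the gap the attack thresholds on.
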
attack arXiv Oct 26, 2025

Cross-Paradigm Graph Backdoor Attacks with Promptable Subgraph Triggers

Dongyi Liu, Jiangtong Li, Dawei Cheng et al. · The Hong Kong University of Science and Technology · Tongji University

Proposes CP-GBA, a transferable GNN backdoor attack using graph-prompt-trained subgraph triggers that generalize across supervised, contrastive, and prompt learning paradigms (see the sketch below)

Model Poisoning graph
PDF Code
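
CP-GBA learns its subgraph trigger via graph prompting, which is beyond a short snippet, but the attachment step any subgraph-trigger backdoor relies on can be sketched: graft the trigger onto a victim node by offsetting trigger node IDs and adding a bridge edge. The names here (attach_trigger, the triangle trigger) are illustrative assumptions:

```python
def attach_trigger(edges, num_nodes, victim, trigger_edges, trigger_size):
    """Graft a fixed trigger subgraph onto a victim node by relabeling
    trigger nodes past the existing node range and bridging to the victim."""
    offset = num_nodes
    grafted = [(u + offset, v + offset) for u, v in trigger_edges]
    grafted.append((victim, offset))        # bridge edge to trigger node 0
    return edges + grafted, num_nodes + trigger_size

# Toy usage: a 3-node triangle trigger attached to node 1 of a 4-node path.
graph = [(0, 1), (1, 2), (2, 3)]
triangle = [(0, 1), (1, 2), (2, 0)]
poisoned, n = attach_trigger(graph, 4, victim=1,
                             trigger_edges=triangle, trigger_size=3)
```
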
attack arXiv Oct 15, 2025

SAJA: A State-Action Joint Attack Framework on Multi-Agent Deep Reinforcement Learning

Weiqi Guo, Guanjun Liu, Ziyuan Zhou · Tongji University

Joint gradient-based attack on multi-agent RL that synergistically perturbs states and actions, bypassing existing defenses (see the sketch below)

Input Manipulation Attack reinforcement-learning
PDF
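
A minimal sketch of a joint state-action perturbation in the FGSM style, assuming white-box access to a trained centralized critic; the toy critic, step sizes, and function name are assumptions, and the paper's exact update is not reproduced here:

```python
import torch
import torch.nn as nn

# Toy centralized critic standing in for a trained MARL Q-function.
critic = nn.Sequential(nn.Linear(6, 32), nn.Tanh(), nn.Linear(32, 1))

def saja_style_perturb(state, action, eps_s=0.05, eps_a=0.05):
    """One FGSM-like step on states and actions jointly: both are
    perturbed along the gradient that minimizes the critic's value."""
    s = state.clone().requires_grad_(True)
    a = action.clone().requires_grad_(True)
    q = critic(torch.cat([s, a], dim=-1)).sum()
    q.backward()
    with torch.no_grad():
        return s - eps_s * s.grad.sign(), a - eps_a * a.grad.sign()

# Toy usage: 4-dim joint state, 2-dim joint action.
s_adv, a_adv = saja_style_perturb(torch.randn(1, 4), torch.randn(1, 2))
```

Perturbing both inputs against a shared objective is what makes the attack "joint": state-only defenses see a clean action distribution, and action-only defenses see a clean state.
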
defense arXiv Sep 29, 2025

Sanitize Your Responses: Mitigating Privacy Leakage in Large Language Models

Wenjie Fu, Huandong Wang, Junyao Gao et al. · Huazhong University of Science and Technology · Tsinghua University +2 more

Token-level self-monitoring and in-place repair framework that prevents LLMs from leaking private information via adversarial prompts (see the sketch below)

Sensitive Information Disclosure Prompt Injection nlp
PDF Code
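
A minimal sketch of token-level monitoring with in-place repair, using regexes as a stand-in for the paper's learned monitor: after each decoded token, the running buffer is rescanned and any span that completes a PII pattern is masked. The patterns and function are illustrative assumptions:

```python
import re

# Illustrative PII patterns; a deployed monitor would be model-based.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def sanitize_stream(tokens):
    """Scan the running decode buffer after every token and repair
    in place by masking any span that completes a PII pattern."""
    text = ""
    for tok in tokens:
        text += tok
        for label, pat in PATTERNS.items():
            text = pat.sub(f"[{label}]", text)
    return text

print(sanitize_stream(["Contact me at ", "alice", "@", "mail", ".com"]))
# -> "Contact me at [EMAIL]"
```
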