Latest papers

34 papers
benchmark arXiv Apr 1, 2026 · 7d ago

Cooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models

Weidi Luo, Xiaofei Wen, Tenghao Huang et al. · University of Georgia · University of California +3 more

Benchmark and guardrail for detecting jailbreak attacks that bypass LLM safety alignment in the food safety domain

Prompt Injection nlp
PDF Code
defense arXiv Mar 24, 2026 · 15d ago

Agent-Sentry: Bounding LLM Agents via Execution Provenance

Rohan Sequeira, Stavros Damianakis, Umar Iqbal et al. · University of Southern California · Washington University in St. Louis

Behavioral bounds framework that blocks malicious tool calls in LLM agents by learning execution patterns and detecting deviations

Prompt Injection Excessive Agency nlp
PDF
attack arXiv Mar 19, 2026 · 20d ago

The Autonomy Tax: Defense Training Breaks LLM Agents

Shawn Li, Yue Zhao · University of Southern California

Defense training against prompt injection destroys LLM agent tool-use competence, causing 99% timeout rates and 73-86% attack bypass

Prompt Injection Excessive Agency nlp
PDF
defense arXiv Mar 6, 2026 · 4w ago

Proof-of-Guardrail in AI Agents and What (Not) to Trust from It

Xisen Jin, Michael Duan, Qin Lin et al. · Sahara AI · University of Southern California

Proposes TEE-based cryptographic proof that AI agent responses passed a specific safety guardrail, preventing false safety claims

Output Integrity Attack Excessive Agency nlp
PDF Code
benchmark arXiv Feb 24, 2026 · 6w ago

AIForge-Doc: A Benchmark for Detecting AI-Forged Tampering in Financial and Form Documents

Jiaqi Wu, Yuchen Zhou, Muduo Xu et al. · Duke University · New York University +3 more

Benchmark revealing that all existing detectors fail to detect diffusion-model-inpainted forgeries in financial documents

Output Integrity Attack vision
1 citation PDF
defense arXiv Feb 16, 2026 · 7w ago

Differentially Private Retrieval-Augmented Generation

Tingting Tang, James Flemings, Yongqin Wang et al. · University of Southern California

Differentially private RAG algorithm that blocks adversarial extraction of sensitive documents from LLM knowledge bases via keyword-based DP output sanitization

Sensitive Information Disclosure nlp
PDF
benchmark arXiv Feb 10, 2026 · 8w ago

Benchmarking Knowledge-Extraction Attack and Defense on Retrieval-Augmented Generation

Zhisheng Qi, Utkarsh Sahu, Li Ma et al. · University of Oregon · Michigan State University +6 more

First systematic benchmark comparing knowledge-extraction attacks and defenses on RAG systems under unified evaluation protocols

Sensitive Information Disclosure nlp
PDF Code
attack arXiv Feb 6, 2026 · 8w ago

TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking

Sung-Hoon Yoon, Ruizhi Qian, Minda Zhao et al. · Harvard University · Daegu Gyeongbuk Institute of Science and Technology +1 more

RL-based black-box jailbreak framework that reweights historical vulnerability signals to attack LLMs more efficiently

Prompt Injection nlp
PDF
attack arXiv Jan 30, 2026 · 9w ago

"Someone Hid It": Query-Agnostic Black-Box Attacks on LLM-Based Retrieval

Jiate Li, Defu Cao, Li Li et al. · University of Southern California · Adobe Research +1 more

Black-box query-agnostic adversarial token injection attack manipulates document rankings in RAG and LLM-based retrieval systems using surrogate LLMs

Input Manipulation Attack Prompt Injection nlp
1 citation PDF
attack arXiv Jan 30, 2026 · 9w ago

Fed-Listing: Federated Label Distribution Inference in Graph Neural Networks

Suprim Nakarmi, Junggab Son, Yue Zhao et al. · University of Nevada · University of Southern California

Gradient-based attack infers private label class proportions of federated GNN clients from shared gradients without accessing raw data

Model Inversion Attack graph federated-learning
PDF Code
attack arXiv Jan 29, 2026 · 9w ago

Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment

Yavuz Bakman, Duygu Nur Yaldiz, Salman Avestimehr et al. · University of Southern California

Proves static black-box alignment guarantees nothing post-update; constructs LLMs hiding latent jailbreak misalignment triggered by one benign gradient step

Model Poisoning Prompt Injection nlp
1 citation PDF
attack arXiv Jan 20, 2026 · 11w ago

SilentDrift: Exploiting Action Chunking for Stealthy Backdoor Attacks on Vision-Language-Action Models

Bingxin Xu, Yuzhang Shang, Binghui Wang et al. · University of Southern California · University of Central Florida +1 more

Backdoor attack on VLA robotic models exploiting action chunking to inject stealthy malicious trajectories with 93% ASR

Model Poisoning Data Poisoning Attack vision multimodal reinforcement-learning
1 citation PDF
attack arXiv Jan 18, 2026 · 11w ago

Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers

Yixuan Du, Chenxiao Yu, Haoyan Xu et al. · Georgetown University · University of Southern California +2 more

Jointly optimizes adversarial image perturbations and gradient-based text suffixes to manipulate VLM-based product search rankings

Input Manipulation Attack Prompt Injection vision nlp multimodal
PDF Code
benchmark arXiv Jan 12, 2026 · 12w ago

Defenses Against Prompt Attacks Learn Surface Heuristics

Shawn Li, Chenxiao Yu, Zhiyu Ni et al. · University of Southern California · University of California +3 more

Exposes three shortcut biases in LLM prompt-injection defenses (position, token-trigger, and topic generalization), causing up to 90% false rejection rates

Prompt Injection nlp
PDF Code
benchmark arXiv Dec 18, 2025 · Dec 2025

ContextLeak: Auditing Leakage in Private In-Context Learning Methods

Jacob Choi, Shuying Cao, Xingjian Dong et al. · University of Southern California

Canary-insertion auditing framework that measures worst-case information leakage from private in-context learning methods against DP guarantees

Membership Inference Attack Sensitive Information Disclosure nlp
3 citations 1 influential PDF
defense arXiv Dec 7, 2025 · Dec 2025

GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering

Jehyeok Yeon, Federico Cinus, Yifan Wu et al. · University of Illinois Urbana-Champaign · University of Southern California +1 more

Proposes graph-regularized sparse autoencoders to capture distributed LLM safety representations for adaptive jailbreak defense with 82% refusal rate

Prompt Injection nlp
1 citation PDF
benchmark arXiv Dec 4, 2025 · Dec 2025

Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs

Jinbo Liu, Defu Cao, Yifei Wei et al. · University of Southern California · Florida State University +1 more

Benchmarks PII leakage in multi-agent LLM systems across six topologies, showing dense connectivity and proximity amplify adversarial memory extraction

Sensitive Information Disclosure nlp
1 citation 1 influential PDF
defense arXiv Nov 18, 2025 · Nov 2025

From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs

Erum Mushtaq, Anil Ramakrishna, Satyapriya Krishna et al. · University of Southern California · Amazon AGI

Reveals that narrow refusal unlearning on LLMs triggers emergent misalignment in unrelated safety domains, and proposes a retain-data defense to contain it

Transfer Learning Attack Prompt Injection nlp
3 citations PDF
attack arXiv Nov 16, 2025 · Nov 2025

Whose Narrative is it Anyway? A KV Cache Manipulation Attack

Mukkesh Ganesh, Kaushik Iyer, Arun Baalaaji Sankar Ananthan · University of Southern California

Hijacks LLM conversation narratives mid-generation by overwriting KV cache segments with precomputed cache from an unrelated topic

Output Integrity Attack Prompt Injection nlp
PDF
attack arXiv Nov 14, 2025 · Nov 2025

A Systematic Study of Model Extraction Attacks on Graph Foundation Models

Haoyan Xu, Ruizhi Qian, Jiate Li et al. · University of Southern California · Florida State University +2 more

Systematically extracts Graph Foundation Models via black-box embedding regression, cloning victim models at 0.07% of original training cost

Model Theft graph multimodal
PDF