ML Security Papers

Latest papers

18 papers

attack arXiv Apr 11, 2026

When Can You Poison Rewards? A Tight Characterization of Reward Poisoning in Linear MDPs

Jose Efraim Aguilar Escamilla, Haoyang Hong, Jiawei Li et al. · Oregon State University · University of Illinois Urbana-Champaign +2 more

Characterizes when reward poisoning attacks can force RL agents to adopt attacker-chosen policies in linear MDPs

Model Skewing reinforcement-learning

benchmark arXiv Apr 10, 2026

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Hadas Orgad, Boyi Wei, Kaden Zheng et al. · Harvard University · Princeton University +2 more

Discovers that LLM harmful content generation relies on a compact, unified set of weights distinct from benign capabilities, explaining jailbreak brittleness and emergent misalignment

Transfer Learning Attack Prompt Injection nlp

defense arXiv Feb 23, 2026

Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement

Amirhossein Farzam, Majid Behabahani, Mani Malek et al. · Duke University · Princeton University +3 more

Detects concealed LLM jailbreaks by disentangling goal and framing signals in internal activation space

Prompt Injection nlp

attack arXiv Nov 21, 2025

MURMUR: Using cross-user chatter to break collaborative language agents in groups

Atharv Singh Patlan, Peiyao Sheng, S. Ashwin Hebbar et al. · Princeton University · Sentient

Discovers cross-user poisoning: adversarial messages in shared LLM agent history hijack actions of other users at inference time

Prompt Injection Excessive Agency nlp

defense arXiv Nov 11, 2025

3D Guard-Layer: An Integrated Agentic AI Safety System for Edge Artificial Intelligence

Eren Kurshan, Yuan Xie, Paul Franzon · Princeton University · Hong Kong University of Science and Technology +1 more

Proposes 3D-integrated hardware safety layer for edge AI systems that dynamically detects and mitigates inference-time network attacks

Input Manipulation Attack Excessive Agency vision nlp

attack IEEE IoT-J Nov 10, 2025

Adversarial Node Placement in Decentralized Federated Learning: Maximum Spanning-Centrality Strategy and Performance Analysis

Adam Piaseczny, Eric Ruzomberka, Rohit Parasnis et al. · Purdue University · Princeton University +1 more

Proposes MaxSpAN-FL, a hybrid topology-aware strategy for placing Byzantine nodes in decentralized FL to maximize model degradation

Data Poisoning Attack federated-learning

benchmark arXiv Oct 31, 2025

Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models

Boyi Wei, Zora Che, Nathaniel Li et al. · Scale AI · Princeton University +3 more

Benchmark framework reveals bio-foundation model safety filtering is bypassable via fine-tuning, with dual-use signals persisting in pretrained representations

Transfer Learning Attack generative

defense arXiv Oct 24, 2025

Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks

Mahavir Dabas, Tran Huynh, Nikhil Reddy Billa et al. · Virginia Tech · Princeton University +1 more

Defends LLMs against novel jailbreaks by training on diverse compositions of adversarial skill primitives extracted from 32 prior attacks

Prompt Injection nlp

1 citation

defense arXiv Oct 23, 2025

Adversary-Aware Private Inference over Wireless Channels

Mohamed Seif, Malcolm Egan, Andrea J. Goldsmith et al. · Princeton University · Inria +1 more

Defends against adversarial inversion of ML feature embeddings during wireless transmission using differential privacy and channel-aware encoding

Model Inversion Attack vision

defense arXiv Oct 15, 2025

Nondeterminism-Aware Optimistic Verification for Floating-Point Neural Networks

Jianzhu Yao, Hongxu Su, Taobo Liao et al. · Princeton University · HKUST (GZ) +1 more

Verifiable inference protocol for cloud ML that detects model swaps and computation tampering with 0.3% overhead using IEEE-754 bounds and Merkle-anchored dispute games

Output Integrity Attack vision nlp generative

2 citations

Neural networks increasingly run on hardware outside the user's control (cloud GPUs, inference marketplaces). Yet ML-as-a-Service reveals little about what actually ran or whether returned outputs faithfully reflect the intended inputs. Users lack recourse against service downgrades (model swaps, quantization, graph rewrites, or discrepancies like altered ad embeddings). Verifying outputs is hard because floating-point (FP) execution on heterogeneous accelerators is inherently nondeterministic. Existing approaches are either impractical for real FP neural networks or reintroduce vendor trust. We present NAO: a Nondeterministic tolerance Aware Optimistic verification protocol that accepts outputs within principled operator-level acceptance regions rather than requiring bitwise equality. NAO combines two error models: (i) sound per-operator IEEE-754 worst-case bounds and (ii) tight empirical percentile profiles calibrated across hardware. Discrepancies trigger a Merkle-anchored, threshold-guided dispute game that recursively partitions the computation graph until one operator remains, where adjudication reduces to a lightweight theoretical-bound check or a small honest-majority vote against empirical thresholds. Unchallenged results finalize after a challenge window, without requiring trusted hardware or deterministic kernels. We implement NAO as a PyTorch-compatible runtime and a contract layer currently deployed on the Ethereum Holesky testnet. The runtime instruments graphs, computes per-operator bounds, and runs unmodified vendor kernels in FP32 with negligible overhead (0.3% on Qwen3-8B). Across CNNs, Transformers, and diffusion models on A100, H100, RTX6000, and RTX4090 GPUs, empirical thresholds are $10^2$ to $10^3$ times tighter than theoretical bounds, and bound-aware adversarial attacks achieve 0% success. NAO reconciles scalability with verifiability for real-world heterogeneous ML compute.
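The dispute game described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a linear chain of scalar operators instead of a full computation graph, a single hypothetical tolerance `tol` standing in for the per-operator acceptance regions, and it omits the Merkle commitments and the on-chain contract entirely. All function names are invented for illustration.

```python
from typing import Callable, List, Optional, Sequence

Op = Callable[[float], float]  # toy operator: scalar in, scalar out (assumption)


def within_tolerance(claimed: float, reference: float, tol: float) -> bool:
    """Accept a value inside the tolerance band instead of requiring bitwise equality."""
    return abs(claimed - reference) <= tol


def challenger_trace(ops: Sequence[Op], x: float) -> List[float]:
    """Recompute every intermediate output on the challenger's own hardware."""
    trace, value = [], x
    for op in ops:
        value = op(value)
        trace.append(value)
    return trace


def locate_disputed_operator(ops: Sequence[Op], x: float,
                             claimed: Sequence[float], tol: float) -> Optional[int]:
    """Bisect the chain until a single disputed operator remains.

    Returns None when the final claimed output is acceptable (no dispute).
    """
    ref = challenger_trace(ops, x)
    if within_tolerance(claimed[-1], ref[-1], tol):
        return None
    lo, hi = 0, len(ops) - 1  # invariant: claimed[hi] disagrees with ref[hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if within_tolerance(claimed[mid], ref[mid], tol):
            lo = mid + 1      # both parties agree up to mid; dispute lies after it
        else:
            hi = mid          # disagreement already present at mid
    return lo


def adjudicate(ops: Sequence[Op], x: float, claimed: Sequence[float],
               index: int, tol: float) -> bool:
    """Re-run one operator on the proposer's own claimed input and check the bound."""
    op_input = x if index == 0 else claimed[index - 1]
    return within_tolerance(claimed[index], ops[index](op_input), tol)


ops = [lambda v: v + 1.0, lambda v: v * 2.0, lambda v: v - 3.0, lambda v: v * v]
honest = challenger_trace(ops, 1.0)
tampered = [2.0, 4.0, 1.5, 2.25]  # operator 2 reports 1.5 instead of 1.0

disputed = locate_disputed_operator(ops, 1.0, tampered, tol=1e-6)
print(disputed, adjudicate(ops, 1.0, tampered, disputed, tol=1e-6))
```

In this sketch the bisection narrows the disagreement to one operator, so only that operator has to be re-executed by the adjudicator. A trace that differs from the challenger's only by small floating-point noise stays inside the tolerance band and raises no dispute.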

cnn transformer llm diffusion

Princeton University · HKUST (GZ) · University of Illinois Urbana-Champaign

defense arXiv Oct 2, 2025

Detecting Post-generation Edits to Watermarked LLM Outputs via Combinatorial Watermarking

Liyan Xie, Muhammad Siddeek, Mohamed Seif et al. · University of Minnesota · Princeton University +2 more

Combinatorial vocabulary-partitioning watermark for LLM text that detects and localizes post-generation edits and spoofing attacks

Output Integrity Attack nlp

1 citation

benchmark arXiv Sep 30, 2025

When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets

Zeshi Dai, Zimo Peng, Zerui Cheng et al. · Surf AI · Princeton University

Benchmarks 17 LLM agents against adversarial financial misinformation, revealing systematic tool-selection failures and indirect prompt injection via SEO-poisoned web search

Prompt Injection Excessive Agency nlp

3 citations

attack arXiv Sep 30, 2025

Are Robust LLM Fingerprints Adversarially Robust?

Anshul Nasery, Edoardo Contente, Alkin Kaz et al. · University of Washington · Sentient +1 more

Adaptive attacks bypass ten LLM fingerprinting schemes with near-perfect success by exploiting four systemic vulnerabilities in ownership verification

Model Theft Model Theft nlp

3 citations

defense arXiv Sep 27, 2025

ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search

Zeyu Shen, Basileal Imana, Tong Wu et al. · Princeton University · NVIDIA

Defends RAG-based search against corpus poisoning using graph-theoretic document reliability filtering with provable robustness guarantees

Input Manipulation Attack Prompt Injection nlp

2 citations

benchmark arXiv Sep 26, 2025

Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs

Xingyu Fu, Siyi Liu, Yinuo Xu et al. · Princeton University · University of Pennsylvania +1 more

Introduces a spatiotemporally grounded benchmark and multimodal reward model for detecting human-perceived traces of AI-generated video fakeness

Output Integrity Attack vision multimodal generative

2 citations · 2 influential

defense bioRxiv Sep 13, 2025

A Biosecurity Agent for Lifecycle LLM Biosecurity Alignment

Meiyin Meng, Zaixi Zhang · Imperial College London · Princeton University

Lifecycle LLM biosecurity defense combining data sanitization, DPO alignment, and runtime guardrails to cut jailbreak ASR from 59.7% to 3.0%

Prompt Injection nlp

benchmark arXiv Sep 3, 2025

SafeProtein: Red-Teaming Framework and Benchmark for Protein Foundation Models

Jigang Fan, Zhenghong Zhou, Ruofan Jin et al. · Peking University · Stanford University +3 more

Red-teams protein foundation models via multimodal prompt engineering and beam search, achieving 70% jailbreak success rate bypassing ESM3 safety filters

Prompt Injection nlp generative

benchmark arXiv Aug 15, 2025

Assessing User Privacy Leakage in Synthetic Packet Traces: An Attack-Grounded Approach

Minhao Jin, Hongyu He, Maria Apostolaki · Princeton University

Benchmarks privacy of ML-based synthetic traffic generators via novel membership inference attack using contrastive learning, revealing critical user-level leakage

Membership Inference Attack generative timeseries
