ML Security Papers

Latest papers

17 papers

defense arXiv Feb 15, 2026 · 7w ago

Online LLM watermark detection via e-processes

Weijie Su, Ruodu Wang, Zinan Zhao · University of Pennsylvania · University of Waterloo +1 more

Proposes anytime-valid e-process framework for sequential LLM watermark detection with theoretical power guarantees

Output Integrity Attack nlp

PDF

defense arXiv Feb 1, 2026 · 9w ago

Improve the Trade-off Between Watermark Strength and Speculative Sampling Efficiency for Language Models

Weiqing He, Xiang Li, Li Shen et al. · University of Pennsylvania

Achieves maximal LLM output watermark strength while preserving speculative sampling efficiency via pseudorandom draft-token acceptance

Output Integrity Attack nlp

PDF Code

attack arXiv Jan 30, 2026 · 9w ago

Semantics-Preserving Evasion of LLM Vulnerability Detectors

Luze Sun, Alina Oprea, Eric Wong · Northeastern University · University of Pennsylvania

Carrier-constrained GCG attacks evade LLM-based code vulnerability detectors using behavior-preserving code transformations that transfer to black-box APIs

Input Manipulation Attack nlp

PDF Code

attack arXiv Jan 3, 2026 · Jan 2026

Aggressive Compression Enables LLM Weight Theft

Davis Brown, Juan-Pablo Rivera, Dan Hendrycks et al. · University of Pennsylvania · Georgia Institute of Technology +1 more

Aggressive compression of LLM weights reduces datacenter exfiltration time from months to days, enabling practical weight theft attacks

Model Theft Model Theft nlp

PDF

defense arXiv Dec 3, 2025 · Dec 2025

MarkTune: Improving the Quality-Detectability Trade-off in Open-Weight LLM Watermarking

Yizhou Zhao, Zhiwei Steven Wu, Adam Block · University of Pennsylvania · Carnegie Mellon University +1 more

Fine-tuning framework that embeds robust watermarks into open-weight LLM weights, closing the quality-detectability gap with inference-time schemes

Output Integrity Attack nlp

PDF Code

defense arXiv Nov 26, 2025 · Nov 2025

TAB-DRW: A DFT-based Robust Watermark for Generative Tabular Data

Yizhou Zhao, Xiang Li, Peter Song et al. · University of Pennsylvania · University of Michigan

DFT-based frequency-domain watermarking for AI-generated tabular data enabling robust provenance tracing against post-processing attacks

Output Integrity Attack tabular generative

PDF Code

defense CCS Nov 14, 2025 · Nov 2025

Armadillo: Robust Single-Server Secure Aggregation for Federated Learning with Input Validation

Yiping Ma, Yue Guo, Harish Karthikeyan et al. · University of Pennsylvania · UC Berkeley +1 more

Byzantine-robust federated learning aggregation protocol using ZKPs and input validation, completing in just 3 rounds

Data Poisoning Attack Model Inversion Attack federated-learning

1 citation PDF

defense arXiv Nov 3, 2025 · Nov 2025

Watermarking Discrete Diffusion Language Models

Avi Bagchi, Akhil Bhimaraju, Moulik Choraria et al. · University of Pennsylvania · University of Illinois Urbana–Champaign +1 more

Embeds distortion-free Gumbel-max watermarks in discrete diffusion LM outputs with provably exponential false-positive decay

Output Integrity Attack nlp generative

PDF

defense arXiv Oct 24, 2025 · Oct 2025

Optimal Detection for Language Watermarks with Pseudorandom Collision

T. Tony Cai, Xiang Li, Qi Long et al. · University of Pennsylvania · Yale University

Derives minimax-optimal detection rules for LLM text watermarks under pseudorandom collisions with rigorous Type I error control

Output Integrity Attack nlp

PDF

benchmark arXiv Oct 22, 2025 · Oct 2025

Machine Text Detectors are Membership Inference Attacks

Ryuto Koike, Liam Dugan, Masahiro Kaneko et al. · Institute of Science Tokyo · University of Pennsylvania +1 more

Proves MIAs and machine text detectors share the same optimal metric, demonstrating strong cross-task transferability with a unified evaluation suite

Membership Inference Attack Output Integrity Attack nlp

1 citation 1 influential PDF Code

benchmark arXiv Oct 6, 2025 · Oct 2025

RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts

Yining She, Daniel W. Peterson, Marianne Menglin Liu et al. · Carnegie Mellon University · Oracle Cloud Infrastructure +1 more

Benign RAG-retrieved documents flip LLM safety guardrail judgments ~11% of the time, exposing a context-robustness gap attackers could exploit

Prompt Injection nlp

PDF

attack arXiv Oct 5, 2025 · Oct 2025

SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations

Buyun Liang, Liangzu Peng, Jinqi Luo et al. · University of Pennsylvania

Elicits LLM hallucinations via semantically equivalent prompt rewrites using zeroth-order black-box optimization

Prompt Injection nlp

PDF Code

benchmark arXiv Oct 4, 2025 · Oct 2025

On the Empirical Power of Goodness-of-Fit Tests in Watermark Detection

Weiqing He, Xiang Li, Tianqi Shang et al. · University of Pennsylvania

Benchmarks eight goodness-of-fit tests for LLM text watermark detection, finding they outperform existing detectors at low temperatures

Output Integrity Attack nlp

1 citation PDF Code

attack arXiv Oct 2, 2025 · Oct 2025

Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks

Ruohao Guo, Afshin Oroojlooy, Roshan Sridhar et al. · Georgia Institute of Technology · Oracle AI +1 more

RL + tree search framework discovers multi-turn jailbreak strategies achieving 81.5% ASR across 12 LLMs including Claude-4-Sonnet

Prompt Injection nlp

PDF

benchmark arXiv Sep 26, 2025 · Sep 2025

Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs

Xingyu Fu, Siyi Liu, Yinuo Xu et al. · Princeton University · University of Pennsylvania +1 more

Introduces a spatiotemporally grounded benchmark and multimodal reward model for detecting human-perceived traces of AI-generated video fakeness

Output Integrity Attack vision multimodal generative

2 citations 2 influential PDF

defense arXiv Sep 23, 2025 · Sep 2025

Algorithms for Adversarially Robust Deep Learning

Alexander Robey · University of Pennsylvania

PhD thesis proposing new adversarial robustness algorithms for vision models, along with jailbreak attacks and defenses for LLMs

Input Manipulation Attack Prompt Injection vision nlp

1 citation PDF

defense arXiv Aug 4, 2025 · Aug 2025

Knowing When Not to Answer: Lightweight KB-Aligned OOD Detection for Safe RAG

Ilias Triantafyllopoulos, Renyi Qu, Salvatore Giorgi et al. · New York University · Inc. +2 more

PCA-based OOD detection gate blocks adversarial and off-topic queries from reaching RAG-backed LLMs in high-stakes domains

Prompt Injection nlp

PDF Code
