Latest papers

17 papers
defense arXiv Feb 15, 2026 · 7w ago

Online LLM watermark detection via e-processes

Weijie Su, Ruodu Wang, Zinan Zhao · University of Pennsylvania · University of Waterloo +1 more

Proposes anytime-valid e-process framework for sequential LLM watermark detection with theoretical power guarantees

Output Integrity Attack nlp
PDF
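The e-process idea can be sketched in a few lines: per-token e-values multiply into a test supermartingale, and by Ville's inequality, stopping whenever the running product exceeds 1/α controls the false-positive rate at any data-dependent stopping time. A minimal illustration, assuming the per-token e-values are given (this is the general anytime-valid recipe, not the paper's specific construction):

```python
def eprocess_detect(evalues, alpha=0.05):
    """Sequential watermark test from per-token e-values.

    E-values multiply into a test supermartingale; by Ville's
    inequality, rejecting when the running product first exceeds
    1/alpha keeps the false-positive rate below alpha at ANY
    data-dependent stopping time -- the "anytime-valid" property.
    """
    wealth = 1.0
    for t, e in enumerate(evalues, start=1):
        wealth *= e  # under the null, E[e] <= 1, so wealth rarely grows
        if wealth >= 1.0 / alpha:
            return t  # watermark detected after t tokens
    return None  # threshold never crossed; no detection
```

Watermarked text yields e-values above 1 on average, so the product grows exponentially and the test can stop early; unwatermarked text keeps the wealth near or below 1 indefinitely.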
defense arXiv Feb 1, 2026 · 9w ago

Improve the Trade-off Between Watermark Strength and Speculative Sampling Efficiency for Language Models

Weiqing He, Xiang Li, Li Shen et al. · University of Pennsylvania

Achieves maximal LLM output watermark strength while preserving speculative sampling efficiency via pseudorandom draft-token acceptance

Output Integrity Attack nlp
PDF Code
attack arXiv Jan 30, 2026 · 9w ago

Semantics-Preserving Evasion of LLM Vulnerability Detectors

Luze Sun, Alina Oprea, Eric Wong · Northeastern University · University of Pennsylvania

Carrier-constrained GCG attacks evade LLM-based code vulnerability detectors using behavior-preserving code transformations that transfer to black-box APIs

Input Manipulation Attack nlp
PDF Code
attack arXiv Jan 3, 2026 · Jan 2026

Aggressive Compression Enables LLM Weight Theft

Davis Brown, Juan-Pablo Rivera, Dan Hendrycks et al. · University of Pennsylvania · Georgia Institute of Technology +1 more

Aggressive compression of LLM weights reduces datacenter exfiltration time from months to days, enabling practical weight theft attacks

Model Theft nlp
PDF
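The underlying arithmetic is simple: exfiltration time scales with weight size divided by covert-channel bandwidth, so a large compression ratio shrinks it proportionally. A back-of-envelope sketch with hypothetical numbers (not figures from the paper):

```python
def exfil_days(weight_gb, compression_ratio, covert_mbps):
    """Back-of-envelope exfiltration time for compressed model weights.

    All inputs are hypothetical illustrations, not the paper's
    measurements: time = (size / compression_ratio) / bandwidth.
    """
    bits = weight_gb * 8e9 / compression_ratio   # payload after compression
    seconds = bits / (covert_mbps * 1e6)         # covert-channel transfer time
    return seconds / 86400

# e.g. 1.4 TB of weights over a 1 Mbps covert channel:
# uncompressed ~130 days; at 30x compression ~4.3 days
```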
defense arXiv Dec 3, 2025 · Dec 2025

MarkTune: Improving the Quality-Detectability Trade-off in Open-Weight LLM Watermarking

Yizhou Zhao, Zhiwei Steven Wu, Adam Block · University of Pennsylvania · Carnegie Mellon University +1 more

Fine-tuning framework that embeds robust watermarks into open-weight LLM weights, closing the quality-detectability gap with inference-time schemes

Output Integrity Attack nlp
PDF Code
defense arXiv Nov 26, 2025 · Nov 2025

TAB-DRW: A DFT-based Robust Watermark for Generative Tabular Data

Yizhou Zhao, Xiang Li, Peter Song et al. · University of Pennsylvania · University of Michigan

DFT-based frequency-domain watermarking for AI-generated tabular data enabling robust provenance tracing against post-processing attacks

Output Integrity Attack tabular generative
PDF Code
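Frequency-domain watermarking of a numeric column can be shown in toy form: perturb a key-selected DFT bin at embedding time, then correlate that bin against the key at detection time. The bin index, phase, and strength below are illustrative assumptions, not TAB-DRW's actual bin selection or robustness machinery:

```python
import numpy as np

def embed_dft_watermark(column, bin_idx=5, strength=0.5, key_phase=1.0):
    """Toy frequency-domain watermark: nudge one DFT bin of a numeric
    column toward a key-determined phase. Illustrative only."""
    spec = np.fft.rfft(np.asarray(column, dtype=float))
    spec[bin_idx] += strength * np.exp(1j * key_phase) * len(column)
    return np.fft.irfft(spec, n=len(column))

def detect_dft_watermark(column, bin_idx=5, key_phase=1.0):
    """Correlate the target bin with the key phase; a large positive
    score indicates the watermark survived post-processing."""
    spec = np.fft.rfft(np.asarray(column, dtype=float))
    return float(np.real(spec[bin_idx] * np.exp(-1j * key_phase))) / len(column)
```

A frequency-domain embedding spreads the perturbation over every row, which is what gives robustness to row-level post-processing such as shuffling small subsets or adding noise.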
defense CCS Nov 14, 2025 · Nov 2025

Armadillo: Robust Single-Server Secure Aggregation for Federated Learning with Input Validation

Yiping Ma, Yue Guo, Harish Karthikeyan et al. · University of Pennsylvania · UC Berkeley +1 more

Byzantine-robust federated learning aggregation protocol using ZKPs and input validation, completing in just 3 rounds

Data Poisoning Attack Model Inversion Attack federated-learning
1 citation PDF
defense arXiv Nov 3, 2025 · Nov 2025

Watermarking Discrete Diffusion Language Models

Avi Bagchi, Akhil Bhimaraju, Moulik Choraria et al. · University of Pennsylvania · University of Illinois Urbana–Champaign +1 more

Embeds distortion-free Gumbel-max watermarks in discrete diffusion LM outputs with provably exponential false-positive decay

Output Integrity Attack nlp generative
PDF
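The classic Gumbel-max watermark, which this work carries over from autoregressive to diffusion decoding, can be sketched as follows; the key/context hashing here is a simplified stand-in for the paper's pseudorandom construction:

```python
import hashlib, math, random

def seeded_uniforms(key, context, vocab_size):
    # Derive pseudorandom uniforms from a secret key and local context.
    digest = hashlib.sha256(f"{key}|{context}".encode()).digest()
    rng = random.Random(digest)
    return [rng.random() for _ in range(vocab_size)]

def gumbel_max_sample(probs, uniforms):
    # Distortion-free sampling: argmax of u_i^(1/p_i) is distributed
    # exactly according to probs when the u_i are iid Uniform(0,1).
    return max(range(len(probs)),
               key=lambda i: uniforms[i] ** (1.0 / max(probs[i], 1e-12)))

def detection_score(tokens, contexts, key, vocab_size):
    # Watermarked tokens tend to land on large u; -log(1 - u) is Exp(1)
    # under the null, so large sums give exponentially small p-values --
    # the source of the provably exponential false-positive decay.
    s = 0.0
    for tok, ctx in zip(tokens, contexts):
        u = seeded_uniforms(key, ctx, vocab_size)[tok]
        s += -math.log(1.0 - u)
    return s
```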
defense arXiv Oct 24, 2025 · Oct 2025

Optimal Detection for Language Watermarks with Pseudorandom Collision

T. Tony Cai, Xiang Li, Qi Long et al. · University of Pennsylvania · Yale University

Derives minimax-optimal detection rules for LLM text watermarks under pseudorandom collisions with rigorous Type I error control

Output Integrity Attack nlp
PDF
benchmark arXiv Oct 22, 2025 · Oct 2025

Machine Text Detectors are Membership Inference Attacks

Ryuto Koike, Liam Dugan, Masahiro Kaneko et al. · Institute of Science Tokyo · University of Pennsylvania +1 more

Proves MIAs and machine text detectors share the same optimal metric, demonstrating strong cross-task transferability with a unified evaluation suite

Membership Inference Attack Output Integrity Attack nlp
1 citation 1 influential PDF Code
benchmark arXiv Oct 6, 2025 · Oct 2025

RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts

Yining She, Daniel W. Peterson, Marianne Menglin Liu et al. · Carnegie Mellon University · Oracle Cloud Infrastructure +1 more

Benign RAG-retrieved documents flip LLM safety guardrail judgments ~11% of the time, exposing a context-robustness gap attackers could exploit

Prompt Injection nlp
PDF
attack arXiv Oct 5, 2025 · Oct 2025

SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations

Buyun Liang, Liangzu Peng, Jinqi Luo et al. · University of Pennsylvania

Elicits LLM hallucinations via semantically equivalent prompt rewrites using zeroth-order black-box optimization

Prompt Injection nlp
PDF Code
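Zeroth-order (black-box) optimization of this kind needs only loss queries, never model gradients. A minimal two-point SPSA gradient estimate, illustrative of the general technique rather than SECA's exact procedure:

```python
import random

def spsa_grad(loss, x, eps=1e-2):
    """Two-point zeroth-order (SPSA) gradient estimate: probe the loss
    along one random +/-1 direction and scale the finite difference."""
    d = [random.choice((-1.0, 1.0)) for _ in x]   # random +/-1 perturbation
    xp = [xi + eps * di for xi, di in zip(x, d)]
    xm = [xi - eps * di for xi, di in zip(x, d)]
    g = (loss(xp) - loss(xm)) / (2 * eps)
    return [g * di for di in d]  # with +/-1 entries, 1/d_i == d_i
```

Two loss queries per step regardless of dimension is what makes such estimators practical against black-box APIs.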
benchmark arXiv Oct 4, 2025 · Oct 2025

On the Empirical Power of Goodness-of-Fit Tests in Watermark Detection

Weiqing He, Xiang Li, Tianqi Shang et al. · University of Pennsylvania

Benchmarks eight goodness-of-fit tests for LLM text watermark detection, finding they outperform existing detectors at low temperatures

Output Integrity Attack nlp
1 citation PDF Code
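A goodness-of-fit test for watermark detection checks whether per-token pivotal statistics look Uniform(0,1), as they should under the no-watermark null. A minimal Kolmogorov-Smirnov statistic as one example of the family (not necessarily among the eight tests benchmarked):

```python
def ks_uniform(stats):
    """One-sample Kolmogorov-Smirnov statistic against Uniform(0,1).

    Under the no-watermark null the per-token pivotal statistics are
    iid uniform; watermarking shifts them toward 1, inflating D_n."""
    xs = sorted(stats)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # Distance between the empirical CDF and the uniform CDF,
        # checked just before and just after each jump.
        d = max(d, abs((i + 1) / n - x), abs(x - i / n))
    return d
```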
attack arXiv Oct 2, 2025 · Oct 2025

Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks

Ruohao Guo, Afshin Oroojlooy, Roshan Sridhar et al. · Georgia Institute of Technology · Oracle AI +1 more

RL + tree search framework discovers multi-turn jailbreak strategies achieving 81.5% ASR across 12 LLMs including Claude-4-Sonnet

Prompt Injection nlp
PDF
benchmark arXiv Sep 26, 2025 · Sep 2025

Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs

Xingyu Fu, Siyi Liu, Yinuo Xu et al. · Princeton University · University of Pennsylvania +1 more

Introduces a spatiotemporally grounded benchmark and multimodal reward model for detecting human-perceived traces of AI-generated video fakeness

Output Integrity Attack vision multimodal generative
2 citations 2 influential PDF
defense arXiv Sep 23, 2025 · Sep 2025

Algorithms for Adversarially Robust Deep Learning

Alexander Robey · University of Pennsylvania

PhD thesis proposing new algorithms for adversarial robustness in vision models and for LLM jailbreak attacks and defenses

Input Manipulation Attack Prompt Injection vision nlp
1 citation PDF
defense arXiv Aug 4, 2025 · Aug 2025

Knowing When Not to Answer: Lightweight KB-Aligned OOD Detection for Safe RAG

Ilias Triantafyllopoulos, Renyi Qu, Salvatore Giorgi et al. · New York University · Inc. +2 more

PCA-based OOD detection gate blocks adversarial and off-topic queries from reaching RAG-backed LLMs in high-stakes domains

Prompt Injection nlp
PDF Code
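A PCA-based OOD gate of this general shape fits a principal subspace to knowledge-base embeddings and refuses queries that reconstruct poorly from it. A minimal sketch; the component count and threshold here are assumptions, not the paper's settings:

```python
import numpy as np

class PCAGate:
    """OOD gate for RAG: queries whose embeddings reconstruct poorly
    from the knowledge base's principal subspace are refused before
    they ever reach retrieval or the LLM."""

    def __init__(self, kb_embeddings, n_components=8):
        X = np.asarray(kb_embeddings, dtype=float)
        self.mean = X.mean(axis=0)
        # Principal directions via SVD of the centered KB matrix.
        _, _, vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.components = vt[:n_components]

    def score(self, q):
        # Reconstruction error: distance from q to the KB subspace.
        z = np.asarray(q, dtype=float) - self.mean
        proj = self.components.T @ (self.components @ z)
        return float(np.linalg.norm(z - proj))

    def allow(self, q, threshold):
        return self.score(q) <= threshold
```

In-distribution queries lie near the KB subspace and score close to zero; adversarial or off-topic queries carry energy orthogonal to it and are blocked by the threshold.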