Online LLM watermark detection via e-processes
Weijie Su, Ruodu Wang, Zinan Zhao · University of Pennsylvania · University of Waterloo +1 more
Weijie Su, Ruodu Wang, Zinan Zhao · University of Pennsylvania · University of Waterloo +1 more
Proposes anytime-valid e-process framework for sequential LLM watermark detection with theoretical power guarantees
Watermarking for large language models (LLMs) has emerged as an effective tool for distinguishing AI-generated text from human-written content. Statistically, watermark schemes induce dependence between generated tokens and a pseudo-random sequence, reducing watermark detection to a hypothesis testing problem on independence. We develop a unified framework for LLM watermark detection based on e-processes, providing anytime-valid guarantees for online testing. We propose various methods to construct empirically adaptive e-processes that can enhance the detection power. In addition, theoretical results are established to characterize the power properties of the proposed procedures. Some experiments demonstrate that the proposed framework achieves competitive performance compared to existing watermark detection methods.
Weiqing He, Xiang Li, Li Shen et al. · University of Pennsylvania
Achieves maximal LLM output watermark strength while preserving speculative sampling efficiency via pseudorandom draft-token acceptance
Watermarking is a principled approach for tracing the provenance of large language model (LLM) outputs, but its deployment in practice is hindered by inference inefficiency. Speculative sampling accelerates inference, with efficiency improving as the acceptance rate between draft and target models increases. Yet recent work reveals a fundamental trade-off: higher watermark strength reduces acceptance, preventing their simultaneous achievement. We revisit this trade-off and show it is not absolute. We introduce a quantitative measure of watermark strength that governs statistical detectability and is maximized when tokens are deterministic functions of pseudorandom numbers. Using this measure, we fully characterize the trade-off as a constrained optimization problem and derive explicit Pareto curves for two existing watermarking schemes. Finally, we introduce a principled mechanism that injects pseudorandomness into draft-token acceptance, ensuring maximal watermark strength while maintaining speculative sampling efficiency. Experiments further show that this approach improves detectability without sacrificing efficiency. Our findings uncover a principle that unites speculative sampling and watermarking, paving the way for their efficient and practical deployment.
Luze Sun, Alina Oprea, Eric Wong · Northeastern University · University of Pennsylvania
Carrier-constrained GCG attacks evade LLM-based code vulnerability detectors using behavior-preserving code transformations that transfer to black-box APIs
LLM-based vulnerability detectors are increasingly deployed in security-critical code review, yet their resilience to evasion under behavior-preserving edits remains poorly understood. We evaluate detection-time integrity under a semantics-preserving threat model by instantiating diverse behavior-preserving code transformations on a unified C/C++ benchmark (N=5000), and introduce a metric of joint robustness across different attack methods/carriers. Across models, we observe a systemic failure of semantic invariant adversarial transformations: even state-of-the-art vulnerability detectors perform well on clean inputs while predictions flip under behavior-equivalent edits. Universal adversarial strings optimized on a single surrogate model remain effective when transferred to black-box APIs, and gradient access can further amplify evasion success. These results show that even high-performing detectors are vulnerable to low-cost, semantics-preserving evasion. Our carrier-based metrics provide practical diagnostics for evaluating LLM-based code detectors.
Davis Brown, Juan-Pablo Rivera, Dan Hendrycks et al. · University of Pennsylvania · Georgia Institute of Technology +1 more
Aggressive compression of LLM weights reduces datacenter exfiltration time from months to days, enabling practical weight theft attacks
As frontier AIs become more powerful and costly to develop, adversaries have increasing incentives to steal model weights by mounting exfiltration attacks. In this work, we consider exfiltration attacks where an adversary attempts to sneak model weights out of a datacenter over a network. While exfiltration attacks are multi-step cyber attacks, we demonstrate that a single factor, the compressibility of model weights, significantly heightens exfiltration risk for large language models (LLMs). We tailor compression specifically for exfiltration by relaxing decompression constraints and demonstrate that attackers could achieve 16x to 100x compression with minimal trade-offs, reducing the time it would take for an attacker to illicitly transmit model weights from the defender's server from months to days. Finally, we study defenses designed to reduce exfiltration risk in three distinct ways: making models harder to compress, making them harder to 'find,' and tracking provenance for post-attack analysis using forensic watermarks. While all defenses are promising, the forensic watermark defense is both effective and cheap, and therefore is a particularly attractive lever for mitigating weight-exfiltration risk.
Yizhou Zhao, Zhiwei Steven Wu, Adam Block · University of Pennsylvania · Carnegie Mellon University +1 more
Fine-tuning framework that embeds robust watermarks into open-weight LLM weights, closing the quality-detectability gap with inference-time schemes
Watermarking aims to embed hidden signals in generated text that can be reliably detected when given access to a secret key. Open-weight language models pose acute challenges for such watermarking schemes because the inference-time interventions that dominate contemporary approaches cannot be enforced once model weights are public. Existing watermaking techniques for open-weight models, such as the recently proposed GaussMark, typically rely on small modifications to model weights, which can yield signals detectable to those equipped with a secret key, but achieving detection power comparable to inference-time watermarks generally requires weight perturbations that noticeably reduce generation quality. We introduce MarkTune, a theoretically principled, on-policy fine-tuning framework that treats the GaussMark signal as a reward while simultaneously regularizing against degradation in text quality. We derive MarkTune as an improvement on GaussMark and demonstrate that MarkTune consistently improves the quality-detectability trade-off over GaussMark by steering finer-grained, watermark-aware weight updates within the model's representation space while preserving generation quality. Empirically, we show that MarkTune pushes the quality-detectability frontier of GaussMark close to that of inference-time watermarking, remains robust to paraphrasing and fine-tuning attacks, and exhibits strong generalization: a model fine-tuned on one dataset retains substantial watermark detection power on unseen datasets. Together, these results establish MarkTune as a general strategy for embedding robust, high-quality watermarks into open-weight LMs.
Yizhou Zhao, Xiang Li, Peter Song et al. · University of Pennsylvania · University of Michigan
DFT-based frequency-domain watermarking for AI-generated tabular data enabling robust provenance tracing against post-processing attacks
The rise of generative AI has enabled the production of high-fidelity synthetic tabular data across fields such as healthcare, finance, and public policy, raising growing concerns about data provenance and misuse. Watermarking offers a promising solution to address these concerns by ensuring the traceability of synthetic data, but existing methods face many limitations: they are computationally expensive due to reliance on large diffusion models, struggle with mixed discrete-continuous data, or lack robustness to post-modifications. To address them, we propose TAB-DRW, an efficient and robust post-editing watermarking scheme for generative tabular data. TAB-DRW embeds watermark signals in the frequency domain: it normalizes heterogeneous features via the Yeo-Johnson transformation and standardization, applies the discrete Fourier transform (DFT), and adjusts the imaginary parts of adaptively selected entries according to precomputed pseudorandom bits. To further enhance robustness and efficiency, we introduce a novel rank-based pseudorandom bit generation method that enables row-wise retrieval without incurring storage overhead. Experiments on five benchmark tabular datasets show that TAB-DRW achieves strong detectability and robustness against common post-processing attacks, while preserving high data fidelity and fully supporting mixed-type features.
Yiping Ma, Yue Guo, Harish Karthikeyan et al. · University of Pennsylvania · UC Berkeley +1 more
Byzantine-robust federated learning aggregation protocol using ZKPs and input validation, completing in just 3 rounds
This paper presents a secure aggregation system Armadillo that has disruptive resistance against adversarial clients, such that any coalition of malicious clients (within the tolerated threshold) can affect the aggregation result only by misreporting their private inputs in a pre-defined legitimate range. Armadillo is designed for federated learning setting, where a single powerful server interacts with many weak clients iteratively to train models on client's private data. While a few prior works consider disruption resistance under such setting, they either incur high per-client cost (Chowdhury et al. CCS '22) or require many rounds (Bell et al. USENIX Security '23). Although disruption resistance can be achieved generically with zero-knowledge proof techniques (which we also use in this paper), we realize an efficient system with two new designs: 1) a simple two-layer secure aggregation protocol that requires only simple arithmetic computation; 2) an agreement protocol that removes the effect of malicious clients from the aggregation with low round complexity. With these techniques, Armadillo completes each secure aggregation in 3 rounds while keeping the server and clients computationally lightweight.
Avi Bagchi, Akhil Bhimaraju, Moulik Choraria et al. · University of Pennsylvania · University of Illinois Urbana–Champaign +1 more
Embeds distortion-free Gumbel-max watermarks in discrete diffusion LM outputs with provably exponential false-positive decay
Watermarking has emerged as a promising technique to track AI-generated content and differentiate it from authentic human creations. While prior work extensively studies watermarking for autoregressive large language models (LLMs) and image diffusion models, it remains comparatively underexplored for discrete diffusion language models (DDLMs), which are becoming popular due to their high inference throughput. In this paper, we introduce one of the first watermarking methods for DDLMs. Our approach applies a distribution-preserving Gumbel-max sampling trick at every diffusion step and seeds the randomness by sequence position to enable reliable detection. We empirically demonstrate reliable detectability on LLaDA, a state-of-the-art DDLM. We also analytically prove that the watermark is distortion-free, with a false detection probability that decays exponentially in the sequence length. A key practical advantage is that our method realizes desired watermarking properties with no expensive hyperparameter tuning, making it straightforward to deploy and scale across models and benchmarks.
T. Tony Cai, Xiang Li, Qi Long et al. · University of Pennsylvania · Yale University
Derives minimax-optimal detection rules for LLM text watermarks under pseudorandom collisions with rigorous Type I error control
Text watermarking plays a crucial role in ensuring the traceability and accountability of large language model (LLM) outputs and mitigating misuse. While promising, most existing methods assume perfect pseudorandomness. In practice, repetition in generated text induces collisions that create structured dependence, compromising Type I error control and invalidating standard analyses. We introduce a statistical framework that captures this structure through a hierarchical two-layer partition. At its core is the concept of minimal units -- the smallest groups treatable as independent across units while permitting dependence within. Using minimal units, we define a non-asymptotic efficiency measure and cast watermark detection as a minimax hypothesis testing problem. Applied to Gumbel-max and inverse-transform watermarks, our framework produces closed-form optimal rules. It explains why discarding repeated statistics often improves performance and shows that within-unit dependence must be addressed unless degenerate. Both theory and experiments confirm improved detection power with rigorous Type I error control. These results provide the first principled foundation for watermark detection under imperfect pseudorandomness, offering both theoretical insight and practical guidance for reliable tracing of model outputs.
Ryuto Koike, Liam Dugan, Masahiro Kaneko et al. · Institute of Science Tokyo · University of Pennsylvania +1 more
Proves MIAs and machine text detectors share the same optimal metric, demonstrating strong cross-task transferability with a unified evaluation suite.
Although membership inference attacks (MIAs) and machine-generated text detection target different goals, their methods often exploit similar signals based on a language model's probability distribution, and the two tasks have been studied independently. This can result in conclusions that overlook stronger methods and valuable insights from the other task. In this work, we theoretically and empirically demonstrate the transferability, i.e., how well a method originally developed for one task performs on the other, between MIAs and machine text detection. We prove that the metric achieving asymptotically optimal performance is identical for both tasks. We unify existing methods under this optimal metric and hypothesize that the accuracy with which a method approximates this metric is directly correlated with its transferability. Our large-scale empirical experiments demonstrate very strong rank correlation ($ρ\approx 0.7$) in cross-task performance. Notably, we also find that a machine text detector achieves the strongest performance among evaluated methods on both tasks, demonstrating the practical impact of transferability. To facilitate cross-task development and fair evaluation, we introduce MINT, a unified evaluation suite for MIAs and machine-generated text detection, implementing 15 recent methods from both tasks.
Yining She, Daniel W. Peterson, Marianne Menglin Liu et al. · Carnegie Mellon University · Oracle Cloud Infrastructure +1 more
Benign RAG-retrieved documents flip LLM safety guardrail judgments ~11% of the time, exposing a context-robustness gap attackers could exploit
With the increasing adoption of large language models (LLMs), ensuring the safety of LLM systems has become a pressing concern. External LLM-based guardrail models have emerged as a popular solution to screen unsafe inputs and outputs, but they are themselves fine-tuned or prompt-engineered LLMs that are vulnerable to data distribution shifts. In this paper, taking Retrieval Augmentation Generation (RAG) as a case study, we investigated how robust LLM-based guardrails are against additional information embedded in the context. Through a systematic evaluation of 3 Llama Guards and 2 GPT-oss models, we confirmed that inserting benign documents into the guardrail context alters the judgments of input and output guardrails in around 11% and 8% of cases, making them unreliable. We separately analyzed the effect of each component in the augmented context: retrieved documents, user query, and LLM-generated response. The two mitigation methods we tested only bring minor improvements. These results expose a context-robustness gap in current guardrails and motivate training and evaluation protocols that are robust to retrieval and query composition.
Buyun Liang, Liangzu Peng, Jinqi Luo et al. · University of Pennsylvania
Elicits LLM hallucinations via semantically equivalent prompt rewrites using zeroth-order black-box optimization
Large Language Models (LLMs) are increasingly deployed in high-risk domains. However, state-of-the-art LLMs often exhibit hallucinations, raising serious concerns about their reliability. Prior work has explored adversarial attacks to elicit hallucinations in LLMs, but these methods often rely on unrealistic prompts, either by inserting nonsensical tokens or by altering the original semantic intent. Consequently, such approaches provide limited insight into how hallucinations arise in real-world settings. In contrast, adversarial attacks in computer vision typically involve realistic modifications to input images. However, the problem of identifying realistic adversarial prompts for eliciting LLM hallucinations remains largely underexplored. To address this gap, we propose Semantically Equivalent and Coherent Attacks (SECA), which elicit hallucinations via realistic modifications to the prompt that preserve its meaning while maintaining semantic coherence. Our contributions are threefold: (i) we formulate finding realistic attacks for hallucination elicitation as a constrained optimization problem over the input prompt space under semantic equivalence and coherence constraints; (ii) we introduce a constraint-preserving zeroth-order method to effectively search for adversarial yet feasible prompts; and (iii) we demonstrate through experiments on open-ended multiple-choice question answering tasks that SECA achieves higher attack success rates while incurring almost no semantic equivalence or semantic coherence errors compared to existing methods. SECA highlights the sensitivity of both open-source and commercial gradient-inaccessible LLMs to realistic and plausible prompt variations. Code is available at https://github.com/Buyun-Liang/SECA.
Weiqing He, Xiang Li, Tianqi Shang et al. · University of Pennsylvania
Benchmarks eight goodness-of-fit tests for LLM text watermark detection, finding they outperform existing detectors at low temperatures
Large language models (LLMs) raise concerns about content authenticity and integrity because they can generate human-like text at scale. Text watermarks, which embed detectable statistical signals into generated text, offer a provable way to verify content origin. Many detection methods rely on pivotal statistics that are i.i.d. under human-written text, making goodness-of-fit (GoF) tests a natural tool for watermark detection. However, GoF tests remain largely underexplored in this setting. In this paper, we systematically evaluate eight GoF tests across three popular watermarking schemes, using three open-source LLMs, two datasets, various generation temperatures, and multiple post-editing methods. We find that general GoF tests can improve both the detection power and robustness of watermark detectors. Notably, we observe that text repetition, common in low-temperature settings, gives GoF tests a unique advantage not exploited by existing methods. Our results highlight that classic GoF tests are a simple yet powerful and underused tool for watermark detection in LLMs.
Ruohao Guo, Afshin Oroojlooy, Roshan Sridhar et al. · Georgia Institute of Technology · Oracle AI +1 more
RL + tree search framework discovers multi-turn jailbreak strategies achieving 81.5% ASR across 12 LLMs including Claude-4-Sonnet
Despite recent rapid progress in AI safety, current large language models remain vulnerable to adversarial attacks in multi-turn interaction settings, where attackers strategically adapt their prompts across conversation turns and pose a more critical yet realistic challenge. Existing approaches that discover safety vulnerabilities either rely on manual red-teaming with human experts or employ automated methods using pre-defined templates and human-curated attack data, with most focusing on single-turn attacks. However, these methods did not explore the vast space of possible multi-turn attacks, failing to consider novel attack trajectories that emerge from complex dialogue dynamics and strategic conversation planning. This gap is particularly critical given recent findings that LLMs exhibit significantly higher vulnerability to multi-turn attacks compared to single-turn attacks. We propose DialTree-RPO, an on-policy reinforcement learning framework integrated with tree search that autonomously discovers diverse multi-turn attack strategies by treating the dialogue as a sequential decision-making problem, enabling systematic exploration without manually curated data. Through extensive experiments, our approach not only achieves more than 25.9% higher ASR across 10 target models compared to previous state-of-the-art approaches, but also effectively uncovers new attack strategies by learning optimal dialogue policies that maximize attack success across multiple turns.
Xingyu Fu, Siyi Liu, Yinuo Xu et al. · Princeton University · University of Pennsylvania +1 more
Introduces a spatiotemporally grounded benchmark and multimodal reward model for detecting human-perceived traces of AI-generated video fakeness
Can humans identify AI-generated (fake) videos and provide grounded reasons? While video generation models have advanced rapidly, a critical dimension -- whether humans can detect deepfake traces within a generated video, i.e., spatiotemporal grounded visual artifacts that reveal a video as machine generated -- has been largely overlooked. We introduce DeeptraceReward, the first fine-grained, spatially- and temporally- aware benchmark that annotates human-perceived fake traces for video generation reward. The dataset comprises 4.3K detailed annotations across 3.3K high-quality generated videos. Each annotation provides a natural-language explanation, pinpoints a bounding-box region containing the perceived trace, and marks precise onset and offset timestamps. We consolidate these annotations into 9 major categories of deepfake traces that lead humans to identify a video as AI-generated, and train multimodal language models (LMs) as reward models to mimic human judgments and localizations. On DeeptraceReward, our 7B reward model outperforms GPT-5 by 34.7% on average across fake clue identification, grounding, and explanation. Interestingly, we observe a consistent difficulty gradient: binary fake v.s. real classification is substantially easier than fine-grained deepfake trace detection; within the latter, performance degrades from natural language explanations (easiest), to spatial grounding, to temporal labeling (hardest). By foregrounding human-perceived deepfake traces, DeeptraceReward provides a rigorous testbed and training signal for socially aware and trustworthy video generation.
Alexander Robey · University of Pennsylvania
PhD thesis proposing new adversarial robustness algorithms for vision models and LLM jailbreak attacks and defenses
Given the widespread use of deep learning models in safety-critical applications, ensuring that the decisions of such models are robust against adversarial exploitation is of fundamental importance. In this thesis, we discuss recent progress toward designing algorithms that exhibit desirable robustness properties. First, we discuss the problem of adversarial examples in computer vision, for which we introduce new technical results, training paradigms, and certification algorithms. Next, we consider the problem of domain generalization, wherein the task is to train neural networks to generalize from a family of training distributions to unseen test distributions. We present new algorithms that achieve state-of-the-art generalization in medical imaging, molecular identification, and image classification. Finally, we study the setting of jailbreaking large language models (LLMs), wherein an adversarial user attempts to design prompts that elicit objectionable content from an LLM. We propose new attacks and defenses, which represent the frontier of progress toward designing robust language-based agents.
Ilias Triantafyllopoulos, Renyi Qu, Salvatore Giorgi et al. · New York University · Inc. +2 more
PCA-based OOD detection gate blocks adversarial and off-topic queries from reaching RAG-backed LLMs in high-stakes domains
Retrieval-Augmented Generation (RAG) systems are increasingly deployed in high-stakes domains, where safety depends not only on how a system answers, but also on whether a query should be answered given a knowledge base (KB). Out-of-domain (OOD) queries can cause dense retrieval to surface weakly related context and lead the generator to produce fluent but unjustified responses. We study lightweight, KB-aligned OOD detection as an always-on gate for RAG systems. Our approach applies PCA to KB embeddings and scores queries in a compact subspace selected either by explained-variance retention (EVR) or by a separability-driven t-test ranking. We evaluate geometric semantic-search rules and lightweight classifiers across 16 domains, including high-stakes COVID-19 and Substance Use KBs, and stress-test robustness using both LLM-generated attacks and an in-the-wild 4chan attack. We find that low-dimensional detectors achieve competitive OOD performance while being faster, cheaper, and more interpretable than prompted LLM-based judges. Finally, human and LLM-based evaluations show that OOD queries primarily degrade the relevance of RAG outputs, showing the need for efficient external OOD detection to maintain safe, in-scope behavior.