benchmark · arXiv · Oct 24, 2025
Sarah Ball, Niki Hasrati, Alexander Robey et al. · Ludwig-Maximilians-Universität München · Carnegie Mellon University · University of Maryland
Analyzes why gradient-optimized adversarial suffixes transfer across LLMs using refusal-direction geometry in activation space
Input Manipulation Attack · Prompt Injection · nlp
Discrete optimization-based jailbreaking attacks on large language models aim to generate short, nonsensical suffixes that, when appended to input prompts, elicit disallowed content. Notably, these suffixes are often transferable -- succeeding on prompts and models for which they were never optimized. Yet although transferability is surprising and empirically well established, the field lacks a rigorous analysis of when and why transfer occurs. To fill this gap, we identify three statistical properties that strongly correlate with transfer success across numerous experimental settings: (1) how much a prompt without a suffix activates a model's internal refusal direction, (2) how strongly a suffix induces a push away from this direction, and (3) how large these shifts are in directions orthogonal to refusal. In contrast, we find that prompt semantic similarity only weakly correlates with transfer success. These findings lead to a more fine-grained understanding of transferability, which we use in interventional experiments to showcase how our statistical analysis can translate into practical improvements in attack success.
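The three statistics in the abstract reduce to simple projections in activation space. A minimal sketch, assuming you already have residual-stream activations for the prompt with and without the suffix and a unit refusal direction (the function name and argument layout are hypothetical, not the paper's code):

```python
import numpy as np

def transfer_statistics(h_prompt, h_prompt_suffix, refusal_dir):
    """Sketch of the three transfer-correlated statistics.

    h_prompt        : activation of the bare prompt, shape (d,)
    h_prompt_suffix : activation of prompt + adversarial suffix, shape (d,)
    refusal_dir     : vector for the model's refusal direction, shape (d,)
    """
    r = refusal_dir / np.linalg.norm(refusal_dir)
    # (1) how much the prompt alone activates the refusal direction
    refusal_activation = float(h_prompt @ r)
    # (2) how strongly the suffix pushes activations away from refusal
    shift = h_prompt_suffix - h_prompt
    refusal_push = float(-(shift @ r))  # positive = push away from refusal
    # (3) size of the shift orthogonal to the refusal direction
    orthogonal_shift = float(np.linalg.norm(shift - (shift @ r) * r))
    return refusal_activation, refusal_push, orthogonal_shift
```

With a toy refusal direction along the first axis, a prompt activation of `[2, 0, 0]` and a suffixed activation of `[1, 3, 0]` yield a refusal activation of 2, a push of 1 away from refusal, and an orthogonal shift of 3.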
llm · transformer · Ludwig-Maximilians-Universität München · Carnegie Mellon University · University of Maryland
defense · arXiv · Feb 3, 2026
Yixuan Even Xu, John Kirchenbauer, Yash Savani et al. · Carnegie Mellon University · University of Maryland
Fingerprints LLM outputs to detect unauthorized distillation using gradient-aligned token perturbations that transfer to student models
Model Theft · nlp
Model distillation enables efficient emulation of frontier large language models (LLMs), creating a need for robust mechanisms to detect when a third-party student model has trained on a teacher model's outputs. However, existing fingerprinting techniques that could detect such distillation rely on heuristic perturbations that impose a steep trade-off between generation quality and fingerprinting strength, often requiring significant degradation of utility to ensure the fingerprint is effectively internalized by the student. We introduce antidistillation fingerprinting (ADFP), a principled approach that aligns the fingerprinting objective with the student's learning dynamics. Building on the gradient-based framework of antidistillation sampling, ADFP uses a proxy model to identify and sample tokens that directly maximize the expected detectability of the fingerprint in the student after fine-tuning, rather than relying on the incidental absorption of the untargeted biases of a more naive watermark. Experiments on the GSM8K and OASST1 benchmarks demonstrate that ADFP achieves a significant Pareto improvement over state-of-the-art baselines, yielding stronger detection confidence with minimal impact on utility, even when the student model's architecture is unknown.
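The core sampling idea can be illustrated as tilting the teacher's next-token distribution by a per-token detectability score. This is a heavily simplified sketch, not the paper's method: the `detectability` vector stands in for the proxy-model gradient signal ADFP computes, and `lam` is a hypothetical quality/fingerprint trade-off knob.

```python
import numpy as np

def adfp_sample(teacher_logits, detectability, lam=1.0, rng=None):
    """Sample a next token from the teacher, tilted toward tokens whose
    inclusion is expected to make the fingerprint more detectable in a
    student fine-tuned on the output.

    teacher_logits : (V,) teacher next-token logits
    detectability  : (V,) per-token score (assumed given, e.g. from a
                     proxy model's fine-tuning-gradient alignment)
    lam            : trade-off between generation quality and fingerprint
    """
    rng = rng or np.random.default_rng()
    tilted = teacher_logits + lam * detectability
    probs = np.exp(tilted - tilted.max())  # stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

With `lam = 0` this reduces to ordinary sampling from the teacher; as `lam` grows, the output distribution shifts toward tokens the proxy deems useful for later detection, which is the quality-versus-detectability trade-off the abstract describes.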
llm · transformer · Carnegie Mellon University · University of Maryland