HyperPotter: Spell the Charm of High-Order Interactions in Audio Deepfake Detection
Qing Wen, Haohao Li, Zhongjie Ba et al. · Zhejiang University · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
Hypergraph-based audio deepfake detector modeling high-order feature interactions for superior cross-domain generalization
Advances in AIGC technologies have enabled the synthesis of highly realistic audio deepfakes capable of deceiving human auditory perception. Although numerous audio deepfake detection (ADD) methods have been developed, most rely on local temporal/spectral features or pairwise relations, overlooking high-order interactions (HOIs). HOIs capture discriminative patterns that emerge from multiple feature components beyond their individual contributions. We propose HyperPotter, a hypergraph-based framework that explicitly models these synergistic HOIs through clustering-based hyperedges with class-aware prototype initialization. Extensive experiments show that HyperPotter surpasses its baseline by an average relative gain of 22.15% across 11 datasets and outperforms state-of-the-art methods by 13.96% on 4 challenging cross-domain datasets, demonstrating superior generalization to diverse attacks and speakers.
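To make the hyperedge-construction idea concrete, here is a minimal numpy sketch of clustering-based hyperedges with class-aware prototype initialization, followed by one standard hypergraph-propagation step. The function names, toy data, and nearest-prototype assignment rule are illustrative assumptions, not the HyperPotter implementation.

```python
import numpy as np

def build_hyperedges(feats, labels, k=2):
    """Group feature nodes into hyperedges whose centers are seeded from
    class-wise mean prototypes (a sketch of class-aware initialization)."""
    protos = np.stack([feats[labels == c].mean(0) for c in np.unique(labels)])
    d = ((feats[:, None, :] - protos[None, :, :]) ** 2).sum(-1)   # (N, E) distances
    H = np.zeros_like(d)                                          # incidence matrix
    H[np.arange(len(feats))[:, None], np.argsort(d, 1)[:, :k]] = 1.0
    return H

def hypergraph_conv(X, H, Theta):
    """One round of standard hypergraph propagation: X' = Dv^-1 H De^-1 H^T X Theta."""
    Dv = np.clip(H.sum(1), 1e-6, None)          # node degrees
    De = np.clip(H.sum(0), 1e-6, None)          # hyperedge degrees
    return (H / Dv[:, None]) @ ((H / De).T @ X) @ Theta

# toy example: 8 feature nodes, 2 classes, 4-dim features
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
H = build_hyperedges(X, y, k=1)
X_new = hypergraph_conv(X, H, rng.normal(size=(4, 4)))
print(H.shape, X_new.shape)
```

Each hyperedge connects several nodes at once, so one propagation step already mixes information across groups of features rather than pairs.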
Guang Yang, Xing Hu, Xiang Chen et al. · Zhejiang University · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security +1 more
Inference-time backdoor defense for LLMs suppresses trojan triggers in Verilog code generation via semantic consensus decoding
Large language models (LLMs) for Verilog code generation are increasingly adopted in hardware design, yet remain vulnerable to backdoor attacks where adversaries inject malicious triggers during training to induce vulnerable hardware designs. Unlike patchable software vulnerabilities, hardware trojans become irreversible once fabricated, making remediation extremely costly or impossible. Existing active defenses require access to training data, which is impractical for third-party LLM users, while passive defenses struggle against semantically stealthy triggers that naturally blend into design specifications. In this paper, we hypothesize that under the requirements of both effectiveness and stealthiness, attackers are strongly biased toward embedding triggers in non-functional requirements (e.g., style modifiers, quality descriptors) rather than functional specifications that determine hardware behavior. Exploiting this insight, we propose Semantic Consensus Decoding (SCD), an inference-time passive defense with two key components: (1) functional requirement extraction that identifies essential requirements from user specifications, and (2) consensus decoding that adaptively fuses output distributions based on full user specifications and extracted functional requirements. When these distributions diverge significantly, SCD automatically suppresses suspicious components. Extensive experiments with three representative backdoor attacks demonstrate that SCD reduces average attack success rate from 89% to under 3% with negligible impact on generation quality.
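As a rough illustration of the consensus-decoding idea, the sketch below fuses two next-token distributions and leans on the functional-only distribution when it diverges from the full-spec one. The JSD-based weighting and the threshold `tau` are assumptions for illustration, not the paper's exact fusion rule.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two next-token distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def consensus_decode_step(p_full, p_func, tau=0.1):
    """Adaptively fuse the distribution conditioned on the full user spec with the
    one conditioned only on extracted functional requirements: the more they
    diverge, the more weight the functional-only distribution gets, suppressing
    tokens driven by suspicious non-functional phrases."""
    d = js_divergence(p_full, p_func)
    alpha = min(1.0, d / tau)                 # more divergence -> more suppression
    fused = (1 - alpha) * p_full + alpha * p_func
    return fused / fused.sum(), d

# toy 5-token vocabulary: the full-spec distribution favors a "trojaned" token (index 3)
p_full = np.array([0.05, 0.10, 0.10, 0.70, 0.05])
p_func = np.array([0.30, 0.40, 0.20, 0.05, 0.05])
fused, div = consensus_decode_step(p_full, p_func)
print(f"JSD={div:.3f}, fused argmax={fused.argmax()}")
```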
Xinzhe Huang, Wenjing Hu, Tianhang Zheng et al. · Zhejiang University · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security +3 more
Gradient-based untargeted jailbreak attack maximizes LLM unsafety probability without fixed response targets, achieving 80% ASR in 100 iterations
Existing gradient-based jailbreak attacks on Large Language Models (LLMs) typically optimize adversarial suffixes to align the LLM output with predefined target responses. However, restricting the objective to inducing fixed targets inherently constrains the adversarial search space, limiting the overall attack efficacy. Furthermore, existing methods typically require numerous optimization iterations to bridge the large gap between the fixed target and the original LLM output, resulting in low attack efficiency. To overcome these limitations, we propose the first gradient-based untargeted jailbreak attack (UJA), which relies on an untargeted objective to maximize the unsafety probability of the LLM output, without enforcing any response patterns. For tractable optimization, we further decompose this objective into two differentiable sub-objectives to search for the optimal harmful response and the corresponding adversarial prompt, with a theoretical analysis to validate the decomposition. In contrast to existing attacks, UJA's unrestricted objective significantly expands the search space, enabling more flexible and efficient exploration of LLM vulnerabilities. Extensive evaluations show that UJA achieves over 80% attack success rates against recent safety-aligned LLMs with only 100 optimization iterations, outperforming the state-of-the-art gradient-based attacks by over 30%.
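The heavily simplified torch sketch below conveys the untargeted objective: instead of matching a fixed target string, it optimizes a relaxed adversarial suffix to maximize a differentiable unsafety score of the induced response distribution. The tiny linear "target model" and "judge" are stand-ins, and the sketch does not reproduce UJA's two-sub-objective decomposition.

```python
import torch

# Toy stand-ins: a linear "target model" mapping context embeddings to response
# logits, and a differentiable "judge" scoring how unsafe the response is.
torch.manual_seed(0)
vocab, dim = 50, 16
embed = torch.nn.Embedding(vocab, dim)
target_lm = torch.nn.Linear(dim, vocab)       # stands in for the LLM's next-token head
judge = torch.nn.Linear(vocab, 1)             # stands in for an unsafety classifier

prompt = torch.randint(0, vocab, (8,))
# Continuous relaxation of a 5-token adversarial suffix.
suffix_logits = torch.zeros(5, vocab, requires_grad=True)
opt = torch.optim.Adam([suffix_logits], lr=0.1)

for step in range(100):
    suffix_probs = torch.softmax(suffix_logits, dim=-1)
    suffix_emb = suffix_probs @ embed.weight              # soft suffix embeddings
    ctx = torch.cat([embed(prompt), suffix_emb], dim=0).mean(0)
    response_dist = torch.softmax(target_lm(ctx), dim=-1)
    # Untargeted objective: maximize the unsafety probability of the induced
    # response, with no fixed target string to match.
    unsafety = torch.sigmoid(judge(response_dist))
    loss = -unsafety.mean()
    opt.zero_grad(); loss.backward(); opt.step()

adv_suffix = suffix_logits.argmax(dim=-1)      # discretize the relaxed suffix
print(adv_suffix.tolist(), float(unsafety))
```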
Yu He, Yifei Chen, Yiming Li et al. · Zhejiang University · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security +1 more
Proposes SECRET, an adaptive jailbreak-plus-retrieval-trigger attack that extracts RAG knowledge base contents verbatim from leading commercial LLMs
In recent years, retrieval-augmented generation (RAG) has emerged as a key paradigm for enhancing large language models (LLMs). By integrating externally retrieved information, RAG alleviates issues like outdated knowledge and, crucially, insufficient domain expertise. While effective, RAG introduces new risks of external data extraction attacks (EDEAs), where sensitive or copyrighted data in its knowledge base may be extracted verbatim. These risks are particularly acute when RAG is used to customize specialized LLM applications with private knowledge bases. Despite initial studies exploring these risks, they often lack a formalized framework, robust attack performance, and comprehensive evaluation, leaving critical questions about real-world EDEA feasibility unanswered. In this paper, we present the first comprehensive study to formalize EDEAs against retrieval-augmented LLMs. We first formally define EDEAs and propose a unified framework decomposing their design into three components: extraction instruction, jailbreak operator, and retrieval trigger, under which prior attacks can be viewed as instances of our framework. Guided by this framework, we develop SECRET: a Scalable and EffeCtive exteRnal data Extraction aTtack. Specifically, SECRET incorporates (1) an adaptive optimization process using LLMs as optimizers to generate specialized jailbreak prompts for EDEAs, and (2) cluster-focused triggering, an adaptive strategy that alternates between global exploration and local exploitation to efficiently generate effective retrieval triggers. Extensive evaluations across 4 models reveal that SECRET significantly outperforms previous attacks, and is highly effective against all 16 tested RAG instances. Notably, SECRET successfully extracts 35% of the data from RAG powered by Claude 3.7 Sonnet for the first time, whereas other attacks yield 0% extraction. Our findings call for attention to this emerging threat.
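A minimal sketch of the cluster-focused triggering component, assuming retrieval triggers live in an embedding space: it alternates between sampling globally (exploration) and perturbing around the least-covered cluster of already-extracted chunks (exploitation). The clustering and perturbation details are illustrative assumptions, and the LLM-driven jailbreak-prompt optimization is omitted.

```python
import numpy as np
rng = np.random.default_rng(0)

def cluster_focused_trigger(extracted_embs, n_clusters=4, explore_p=0.5, dim=32):
    """Return the next retrieval-trigger embedding: with probability explore_p sample
    globally (exploration); otherwise perturb around the centroid of the cluster with
    the fewest extracted documents so far (exploitation)."""
    if len(extracted_embs) < n_clusters or rng.random() < explore_p:
        return rng.normal(size=dim)                       # global exploration
    # crude clustering of what has been extracted so far
    centers = extracted_embs[rng.choice(len(extracted_embs), n_clusters, replace=False)]
    assign = np.argmin(((extracted_embs[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    counts = np.bincount(assign, minlength=n_clusters)
    target = centers[counts.argmin()]                     # least-covered region
    return target + 0.1 * rng.normal(size=dim)            # local exploitation

# toy loop: pretend each trigger retrieves (extracts) one nearby document embedding
extracted = np.empty((0, 32))
for _ in range(20):
    trig = cluster_focused_trigger(extracted)
    doc = trig + 0.05 * rng.normal(size=32)               # stand-in for a retrieved chunk
    extracted = np.vstack([extracted, doc])
print(extracted.shape)
```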
Kedong Xiu, Churui Zeng, Tianhang Zheng et al. · Zhejiang University · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security +3 more
Gradient-based jailbreak attack using adaptive harmful-response sampling as optimization targets, achieving 87% ASR on safety-aligned LLMs in 200 iterations
Existing gradient-based jailbreak attacks typically optimize an adversarial suffix to induce a fixed affirmative response, e.g., "Sure, here is...". However, this fixed target usually resides in an extremely low-density region of a safety-aligned LLM's output distribution. Due to the substantial discrepancy between the fixed target and the output distribution, existing attacks require numerous iterations to optimize the adversarial prompt, which might still fail to induce the low-probability target response. To address this limitation, we propose Dynamic Target Attack (DTA), which leverages the target LLM's own responses as adaptive targets. In each optimization round, DTA samples multiple candidates from the output distribution conditioned on the current prompt, and selects the most harmful one as a temporary target for prompt optimization. Extensive experiments demonstrate that, under the white-box setting, DTA achieves over 87% average attack success rate (ASR) within 200 optimization iterations on recent safety-aligned LLMs, exceeding the state-of-the-art baselines by over 15% and reducing wall-clock time by 2-26x. Under the black-box setting, DTA employs a white-box LLM as a surrogate model for gradient-based optimization, achieving an average ASR of 77.5% against black-box models, exceeding prior transfer-based attacks by over 12%.
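The loop below sketches one round of the dynamic-target idea with stubbed callables standing in for the target LLM, the harmfulness judge, and the gradient-based suffix optimizer; all three stand-ins are hypothetical placeholders so the control flow can be read end to end.

```python
import random
random.seed(0)

def dta_round(prompt, suffix, sample_fn, harm_score_fn, optimize_fn, n_candidates=8):
    """One round of the dynamic-target loop: sample candidate responses from the
    model's own output distribution, pick the most harmful one as a temporary
    target, then run suffix optimization toward that target."""
    candidates = [sample_fn(prompt + suffix) for _ in range(n_candidates)]
    target = max(candidates, key=harm_score_fn)       # adaptive, high-density target
    return optimize_fn(prompt, suffix, target), target

# --- toy stand-ins so the sketch runs end to end (not a real LLM or judge) ---
sample_fn = lambda p: f"resp-{random.randint(0, 999)}"
harm_score_fn = lambda r: int(r.split('-')[1]) % 7    # pretend judge score
optimize_fn = lambda p, s, t: s + "!"                 # pretend one optimization step

suffix = ""
for _ in range(3):
    suffix, tgt = dta_round("how to ...", suffix, sample_fn, harm_score_fn, optimize_fn)
print(suffix, tgt)
```

Because the temporary target is drawn from the model's own high-probability outputs, each optimization round closes a much smaller gap than forcing a fixed affirmative prefix.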
Chao Shuai, Gaojian Wang, Kun Pan et al. · Zhejiang University · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
Proposes morphological multi-scale fusion for deepfake detection that jointly localizes manipulated regions with noise suppression
While the pursuit of higher accuracy in deepfake detection remains a central goal, there is an increasing demand for precise localization of manipulated regions. Despite the remarkable progress made in classification-based detection, accurately localizing forged areas remains a significant challenge. A common strategy is to incorporate forged region annotations during model training alongside manipulated images. However, such approaches often neglect the complementary nature of local detail and global semantic context, resulting in suboptimal localization performance. Moreover, an often-overlooked aspect is the fusion strategy between local and global predictions. Naively combining the outputs from both branches can amplify noise and errors, thereby undermining the effectiveness of the localization. To address these issues, we propose a novel approach that independently predicts manipulated regions using both local and global perspectives. We employ morphological operations to fuse the outputs, effectively suppressing noise while enhancing spatial coherence. Extensive experiments reveal the effectiveness of each module in improving the accuracy and robustness of forgery localization.
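A generic sketch of the fusion step, assuming both branches output per-pixel manipulation probabilities: the maps are thresholded, intersected, and cleaned with morphological opening and closing via scipy.ndimage. The agreement rule and the 3x3 structuring element are assumptions, not the paper's exact design.

```python
import numpy as np
from scipy import ndimage

def fuse_masks(local_pred, global_pred, thr=0.5):
    """Fuse per-branch manipulation maps, then clean the result with morphological
    opening (removes isolated false positives) and closing (fills small holes)."""
    local_m = local_pred > thr
    global_m = global_pred > thr
    fused = local_m & global_m                       # keep regions both branches agree on
    fused = ndimage.binary_opening(fused, structure=np.ones((3, 3)))
    fused = ndimage.binary_closing(fused, structure=np.ones((3, 3)))
    return fused

# toy 64x64 maps: a forged square plus scattered noise in the local branch
rng = np.random.default_rng(0)
local_pred = rng.random((64, 64)) * 0.4
global_pred = rng.random((64, 64)) * 0.4
local_pred[20:40, 20:40] = 0.9
global_pred[18:42, 18:42] = 0.8
local_pred[rng.integers(0, 64, 30), rng.integers(0, 64, 30)] = 0.95   # speckle noise
print(fuse_masks(local_pred, global_pred).sum())
```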
Shuo Shao, Yiming Li, Yu He et al. · Zhejiang University · Nanyang Technological University +3 more
Surveys LLM fingerprinting for copyright auditing and benchmarks 13 post-development robustness techniques across 149 model instances
The broad capabilities and substantial resources required to train Large Language Models (LLMs) make them valuable intellectual property, yet they remain vulnerable to copyright infringement, such as unauthorized use and model theft. LLM fingerprinting, a non-intrusive technique that compares the distinctive features (i.e., fingerprint) of LLMs to identify whether an LLM is derived from another, offers a promising solution to copyright auditing. However, its reliability remains uncertain due to the prevalence of diverse model modifications and the lack of standardized evaluation. In this SoK, we present the first comprehensive study of the emerging field of LLM fingerprinting. We introduce a unified framework and taxonomy that structures the field: white-box methods are classified based on their feature source as static, forward-pass, or backward-pass fingerprinting, while black-box methods are distinguished by their query strategy as either untargeted or targeted. Furthermore, we propose LeaFBench, the first systematic benchmark for evaluating LLM fingerprinting under realistic deployment scenarios. Built upon 7 mainstream foundation models and comprising 149 distinct model instances, LeaFBench integrates 13 representative post-development techniques, spanning both parameter-altering methods (e.g., fine-tuning, quantization) and parameter-independent techniques (e.g., system prompts, RAG). Extensive experiments on LeaFBench reveal the strengths and weaknesses of existing methods, thereby outlining future research directions and critical open problems in this emerging field. The code is available at https://github.com/shaoshuo-ss/LeaFBench.
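To illustrate the black-box, untargeted branch of the taxonomy, here is a toy comparison that averages response similarity over fixed probe prompts; the stand-in models and the similarity metric are hypothetical and unrelated to LeaFBench's actual protocol.

```python
from difflib import SequenceMatcher

def blackbox_fingerprint(query_fn_a, query_fn_b, probes):
    """Untargeted black-box fingerprint comparison: send the same probe prompts to
    both models and average the textual similarity of their responses. High
    similarity suggests model B may be derived from model A."""
    sims = []
    for p in probes:
        a, b = query_fn_a(p), query_fn_b(p)
        sims.append(SequenceMatcher(None, a, b).ratio())
    return sum(sims) / len(sims)

# stand-in "models": B is a lightly post-processed copy of A, C is unrelated
model_a = lambda p: f"answer to {p} :: canonical phrasing"
model_b = lambda p: f"answer to {p} :: canonical phrasing!"     # derived copy
model_c = lambda p: f"something entirely different about {p}"   # independent model
probes = ["probe-1", "probe-2", "probe-3"]
print(blackbox_fingerprint(model_a, model_b, probes))   # high -> likely derived
print(blackbox_fingerprint(model_a, model_c, probes))   # low  -> likely independent
```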
Weiwei Qi, Shuo Shao, Wei Gu et al. · Zhejiang University · Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security +1 more
Markov-chain jailbreak framework combines diverse disguise strategies adaptively, achieving 90%+ ASR on GPT-4o in under 15 queries
Large Language Models (LLMs) have exhibited remarkable capabilities but remain vulnerable to jailbreaking attacks, which can elicit harmful content from the models by manipulating the input prompts. Existing black-box jailbreaking techniques primarily rely on static prompts crafted with a single, non-adaptive strategy, or employ rigid combinations of several underperforming attack methods, which limits their adaptability and generalization. To address these limitations, we propose MAJIC, a Markovian adaptive jailbreaking framework that attacks black-box LLMs by iteratively combining diverse innovative disguise strategies. MAJIC first establishes a "Disguise Strategy Pool" by refining existing strategies and introducing several innovative approaches. To further improve attack performance and efficiency, MAJIC formulates the sequential selection and fusion of strategies in the pool as a Markov chain. Under this formulation, MAJIC initializes and employs a Markov matrix to guide the strategy composition, where transition probabilities between strategies are dynamically adapted based on attack outcomes, thereby enabling MAJIC to learn and discover effective attack pathways tailored to the target model. Our empirical results demonstrate that MAJIC significantly outperforms existing jailbreak methods on prominent models such as GPT-4o and Gemini-2.0-flash, achieving over 90% attack success rate with fewer than 15 queries per attempt on average.
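A small sketch of the Markovian strategy selection, assuming a fixed illustrative strategy pool: transitions that lead to successful attacks are reinforced and each row of the transition matrix is renormalized. The pool names, learning rate, and update rule are assumptions rather than MAJIC's implementation.

```python
import numpy as np
rng = np.random.default_rng(0)

STRATEGIES = ["role_play", "encoding", "hypothetical", "payload_split"]  # illustrative pool

class MarkovStrategySelector:
    """Pick the next disguise strategy from a row-stochastic transition matrix and
    reinforce transitions that lead to successful attacks."""
    def __init__(self, n, lr=0.2):
        self.T = np.full((n, n), 1.0 / n)   # uniform initialization
        self.lr = lr

    def next_strategy(self, current):
        return rng.choice(len(self.T), p=self.T[current])

    def update(self, prev, chosen, success):
        delta = self.lr if success else -self.lr
        self.T[prev, chosen] = max(self.T[prev, chosen] + delta, 1e-3)
        self.T[prev] /= self.T[prev].sum()   # keep the row a probability distribution

sel = MarkovStrategySelector(len(STRATEGIES))
state = 0
for _ in range(10):
    nxt = sel.next_strategy(state)
    success = rng.random() < 0.4            # stand-in for querying the target LLM + judge
    sel.update(state, nxt, success)
    state = nxt
print(np.round(sel.T, 2))
```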
Zhifan Luo, Shuo Shao, Su Zhang et al. · Zhejiang University · Huawei +1 more
Adversaries reconstruct private user prompts from LLM KV-cache via inversion, collision, and injection attacks; KV-Cloak defends with reversible matrix obfuscation
The Key-Value (KV) cache, which stores intermediate attention computations (Key and Value pairs) to avoid redundant calculations, is a fundamental mechanism for accelerating Large Language Model (LLM) inference. However, this efficiency optimization introduces significant yet underexplored privacy risks. This paper provides the first comprehensive analysis of these vulnerabilities, demonstrating that an attacker can reconstruct sensitive user inputs directly from the KV-cache. We design and implement three distinct attack vectors: a direct Inversion Attack, a more broadly applicable and potent Collision Attack, and a semantic-based Injection Attack. These methods demonstrate the practicality and severity of KV-cache privacy leakage issues. To mitigate this, we propose KV-Cloak, a novel, lightweight, and efficient defense mechanism. KV-Cloak uses a reversible matrix-based obfuscation scheme, combined with operator fusion, to secure the KV-cache. Our extensive experiments show that KV-Cloak effectively thwarts all proposed attacks, reducing reconstruction quality to random noise. Crucially, it achieves this robust security with virtually no degradation in model accuracy and minimal performance overhead, offering a practical solution for trustworthy LLM deployment.
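The core reversible-matrix idea can be checked with a few lines of numpy: obfuscating cached keys with an invertible matrix M and folding M's inverse into the query side leaves the attention scores unchanged, so an attacker reading the cache sees only transformed keys. This sketch covers keys only and omits the value path and the operator-fusion details.

```python
import numpy as np
rng = np.random.default_rng(0)

d = 8                                   # head dimension
Q = rng.normal(size=(4, d))             # query vectors
K = rng.normal(size=(6, d))             # cached key vectors

# Obfuscate the cached keys with a secret invertible matrix M; apply the matching
# inverse transform on the query side so attention scores are preserved.
M = rng.normal(size=(d, d)) + d * np.eye(d)    # well-conditioned, invertible
K_obf = K @ M                                  # what gets stored in the KV-cache
Q_adj = Q @ np.linalg.inv(M).T                 # folded into the query projection

scores_plain = Q @ K.T
scores_obf = Q_adj @ K_obf.T                   # Q M^-T (K M)^T = Q K^T
print(np.allclose(scores_plain, scores_obf))   # True: attention unchanged
```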