attack arXiv Mar 1, 2026 · 11w ago
Huajie Chen, Tianqing Zhu, Yuchen Zhong et al. · City University of Macau · CISPA Helmholtz Center for Information Security +2 more
Reveals that dataset distillation leaks training data via three-stage attack: architecture inference, membership inference, and model inversion
Model Inversion Attack Membership Inference Attack vision
Dataset distillation compresses a large real dataset into a small synthetic one, enabling models trained on the synthetic data to achieve performance comparable to those trained on the real data. Although synthetic datasets are assumed to be privacy-preserving, we show that existing distillation methods can cause severe privacy leakage because synthetic datasets implicitly encode the weight trajectories of the distilled model, they become over-informative and exploitable by adversaries. To expose this risk, we introduce the Information Revelation Attack (IRA) against state-of-the-art distillation techniques. Experiments show that IRA accurately predicts both the distillation algorithm and model architecture, and can successfully infer membership and recover sensitive samples from the real dataset.
cnn diffusion City University of Macau · CISPA Helmholtz Center for Information Security · University of Technology Sydney +1 more
defense arXiv Apr 17, 2026 · 4w ago
Wai Man Si, Mingjie Li, Michael Backes et al. · CISPA Helmholtz Center for Information Security
Prunes model parameters responsible for unsafe LLM outputs, reducing harmful generations and jailbreak success with minimal utility loss
Prompt Injection nlpmultimodal
Machine learning models are increasingly deployed in real-world applications, but even aligned models such as Mistral and LLaVA still exhibit unsafe behaviors inherited from pre-training. Current alignment methods like SFT and RLHF primarily encourage models to generate preferred responses, but do not explicitly remove the unsafe subnetworks that trigger harmful outputs. In this work, we introduce a resource-efficient pruning framework that directly identifies and removes parameters associated with unsafe behaviors while preserving model utility. Our method employs a gradient-free attribution mechanism, requiring only modest GPU resources, and generalizes across architectures and quantized variants. Empirical evaluations on ML models show substantial reductions in unsafe generations and improved robustness against jailbreak attacks, with minimal utility loss. From the perspective of the Lottery Ticket Hypothesis, our results suggest that ML models contain "unsafe tickets" responsible for harmful behaviors, and pruning reveals "safety tickets" that maintain performance while aligning outputs. This provides a lightweight, post-hoc alignment strategy suitable for deployment in resource-constrained settings.
llm vlm transformer CISPA Helmholtz Center for Information Security
attack arXiv Mar 1, 2026 · 11w ago
Huajie Chen, Tianqing Zhu, Hailin Yang et al. · City University of Macau · CISPA Helmholtz Center for Information Security +1 more
Pixel-wise reconstruction attack removes AI-image watermarks without querying detectors or knowing the watermarking scheme
Output Integrity Attack visiongenerative
Watermarking has emerged as a key defense against the misuse of machine-generated images (MGIs). Yet the robustness of these protections remains underexplored. To reveal the limits of SOTA proactive image watermarking defenses, we propose HIDE&SEEK (HS), a suite of versatile and cost-effective attacks that reliably remove embedded watermarks while preserving high visual fidelity.
diffusion gan City University of Macau · CISPA Helmholtz Center for Information Security · Guangzhou University
benchmark arXiv Mar 12, 2026 · 10w ago
Junjie Chu, Yiting Qu, Ye Leng et al. · CISPA Helmholtz Center for Information Security · Delft University of Technology
Benchmarks LLM safety alignment failures when harmful content is embedded in benign tasks like translation, revealing a content-level ethical blind spot
Prompt Injection nlp
Large Language Models (LLMs) are increasingly trained to align with human values, primarily focusing on task level, i.e., refusing to execute directly harmful tasks. However, a subtle yet crucial content-level ethical question is often overlooked: when performing a seemingly benign task, will LLMs -- like morally conscious human beings -- refuse to proceed when encountering harmful content in user-provided material? In this study, we aim to understand this content-level ethical question and systematically evaluate its implications for mainstream LLMs. We first construct a harmful knowledge dataset (i.e., non-compliant with OpenAI's usage policy) to serve as the user-supplied harmful content, with 1,357 entries across ten harmful categories. We then design nine harmless tasks (i.e., compliant with OpenAI's usage policy) to simulate the real-world benign tasks, grouped into three categories according to the extent of user-supplied content required: extensive, moderate, and limited. Leveraging the harmful knowledge dataset and the set of harmless tasks, we evaluate how nine LLMs behave when exposed to user-supplied harmful content during the execution of benign tasks, and further examine how the dynamics between harmful knowledge categories and tasks affect different LLMs. Our results show that current LLMs, even the latest GPT-5.2 and Gemini-3-Pro, often fail to uphold human-aligned ethics by continuing to process harmful content in harmless tasks. Furthermore, external knowledge from the ``Violence/Graphic'' category and the ``Translation'' task is more likely to elicit harmful responses from LLMs. We also conduct extensive ablation studies to investigate potential factors affecting this novel misuse vulnerability. We hope that our study could inspire enhanced safety measures among stakeholders to mitigate this overlooked content-level ethical risk.
llm CISPA Helmholtz Center for Information Security · Delft University of Technology
benchmark arXiv Apr 17, 2026 · 4w ago
Chaoshuo Zhang, Yibo Liang, Mengke Tian et al. · Xi’an Jiaotong University · CISPA Helmholtz Center for Information Security
Benchmark evaluating compositional safety vulnerabilities in text-to-image models when benign concepts combine to create unsafe outputs
Input Manipulation Attack visiongenerative
Despite the remarkable synthesis capabilities of text-to-image (T2I) models, safeguarding them against content violations remains a persistent challenge. Existing safety alignments primarily focus on explicit malicious concepts, often overlooking the subtle yet critical risks of compositional semantics. To address this oversight, we identify and formalize a novel vulnerability: Multi-Concept Compositional Unsafety (MCCU), where unsafe semantics stem from the implicit associations of individually benign concepts. Based on this formulation, we introduce TwoHamsters, a comprehensive benchmark comprising 17.5k prompts curated to probe MCCU vulnerabilities. Through a rigorous evaluation of 10 state-of-the-art models and 16 defense mechanisms, our analysis yields 8 pivotal insights. In particular, we demonstrate that current T2I models and defense mechanisms face severe MCCU risks: on TwoHamsters, FLUX achieves an MCCU generation success rate of 99.52%, while LLaVA-Guard only attains a recall of 41.06%, highlighting a critical limitation of the current paradigm for managing hazardous compositional generation.
diffusion multimodal Xi’an Jiaotong University · CISPA Helmholtz Center for Information Security
benchmark arXiv Apr 9, 2026 · 6w ago
Rui Zhang, Hongwei Li, Yun Shen et al. · University of Electronic Science and Technology of China · Flexera +2 more
Evaluates six fine-tuning methods for both misaligning safety-aligned LLMs and realigning them, revealing asymmetric attack-defense dynamics
Transfer Learning Attack Prompt Injection Training Data Poisoning nlp
The deployment of large language models (LLMs) raises significant ethical and safety concerns. While LLM alignment techniques are adopted to improve model safety and trustworthiness, adversaries can exploit these techniques to undermine safety for malicious purposes, resulting in \emph{misalignment}. Misaligned LLMs may be published on open platforms to magnify harm. To address this, additional safety alignment, referred to as \emph{realignment}, is necessary before deploying untrusted third-party LLMs. This study explores the efficacy of fine-tuning methods in terms of misalignment, realignment, and the effects of their interplay. By evaluating four Supervised Fine-Tuning (SFT) and two Preference Fine-Tuning (PFT) methods across four popular safety-aligned LLMs, we reveal a mechanism asymmetry between attack and defense. While Odds Ratio Preference Optimization (ORPO) is most effective for misalignment, Direct Preference Optimization (DPO) excels in realignment, albeit at the expense of model utility. Additionally, we identify model-specific resistance, residual effects of multi-round adversarial dynamics, and other noteworthy findings. These findings highlight the need for robust safeguards and customized safety alignment strategies to mitigate potential risks in the deployment of LLMs. Our code is available at https://github.com/zhangrui4041/The-Art-of-Mis-alignment.
llm transformer University of Electronic Science and Technology of China · Flexera · CISPA Helmholtz Center for Information Security +1 more
benchmark arXiv Aug 28, 2025 · Aug 2025
Junjie Chu, Mingjie Li, Ziqing Yang et al. · CISPA Helmholtz Center for Information Security · Xi’an Jiaotong University
Benchmark framework using decompositional scoring to evaluate LLM jailbreak success, achieving 98.5% human agreement and exposing attack overestimation
Prompt Injection nlp
Accurately determining whether a jailbreak attempt has succeeded is a fundamental yet unresolved challenge. Existing evaluation methods rely on misaligned proxy indicators or naive holistic judgments. They frequently misinterpret model responses, leading to inconsistent and subjective assessments that misalign with human perception. To address this gap, we introduce JADES (Jailbreak Assessment via Decompositional Scoring), a universal jailbreak evaluation framework. Its key mechanism is to automatically decompose an input harmful question into a set of weighted sub-questions, score each sub-answer, and weight-aggregate the sub-scores into a final decision. JADES also incorporates an optional fact-checking module to strengthen the detection of hallucinations in jailbreak responses. We validate JADES on JailbreakQR, a newly introduced benchmark proposed in this work, consisting of 400 pairs of jailbreak prompts and responses, each meticulously annotated by humans. In a binary setting (success/failure), JADES achieves 98.5% agreement with human evaluators, outperforming strong baselines by over 9%. Re-evaluating five popular attacks on four LLMs reveals substantial overestimation (e.g., LAA's attack success rate on GPT-3.5-Turbo drops from 93% to 69%). Our results show that JADES could deliver accurate, consistent, and interpretable evaluations, providing a reliable basis for measuring future jailbreak attacks.
llm transformer CISPA Helmholtz Center for Information Security · Xi’an Jiaotong University