attack arXiv Apr 20, 2026 · 4w ago
Yuan Fang, Yiming Luo, Aimin Zhou et al. · East China Normal University · Shanghai Innovation Institute
Automated red-teaming framework generating diverse toxic datasets via inverted constitutional AI to test LLM safety mechanisms
Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp
Ensuring the safety of large language models (LLMs) requires robust red teaming, yet the systematic synthesis of high-quality toxic data remains under-explored. We propose Reverse Constitutional AI (R-CAI), a framework for automated and controllable adversarial data generation that moves beyond isolated jailbreak prompts. By inverting a harmless constitution into a constitution of toxicity and iteratively refining model outputs through a critique--revision pipeline, R-CAI enables scalable synthesis of multi-dimensional adversarial data without human annotation. Optimizing solely for toxicity-related rewards, however, can lead to reward hacking and degraded semantic coherence. To address this challenge, we introduce probability clamping within reinforcement learning from AI feedback, which stabilizes adversarial optimization while preserving adversarial intent. Experiments demonstrate that R-CAI generates diverse, high-quality toxic data and that probability clamping substantially improves semantic coherence (15%) without sacrificing adversarial strength. Overall, R-CAI provides a fully automated framework for red teaming data generation and systematic safety evaluation of aligned language models.
llm transformer East China Normal University · Shanghai Innovation Institute
defense arXiv Sep 18, 2025 · Sep 2025
Zhuokang Shen, Kaisen Zhang, Bohan Jia et al. · East China Normal University · Sanming University +1 more
Novel VLM-based synthetic image detector using knowledge injection and self-reflection to exceed expert model accuracy with interpretability
Output Integrity Attack visionmultimodal
With the increasing prevalence of synthetic images, evaluating image authenticity and locating forgeries accurately while maintaining human interpretability remains a challenging task. Existing detection models primarily focus on simple authenticity classification, ultimately providing only a forgery probability or binary judgment, which offers limited explanatory insights into image authenticity. Moreover, while MLLM-based detection methods can provide more interpretable results, they still lag behind expert models in terms of pure authenticity classification accuracy. To address this, we propose DF-LLaVA, a novel and effective framework that unlocks the intrinsic discrimination potential of MLLMs. Our approach first mines latent knowledge from the MLLM itself and then injects it into the model via fine-tuning. During inference, conflict signals arising from the model's predictions activate a self-reflection process, leading to the final refined responses. This framework allows LLaVA to achieve outstanding detection accuracy exceeding expert models while still maintaining the interpretability offered by MLLMs. Extensive experiments confirm the superiority of DF-LLaVA, achieving both high accuracy and explainability in synthetic image detection. Code is available online at: https://github.com/Eliot-Shen/DF-LLaVA.
vlm transformer East China Normal University · Sanming University · The 27th Research Institute of CETC