Latest papers

10 papers
defense arXiv Mar 6, 2026 · 4w ago

BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation

Feiran Li, Qianqian Xu, Shilong Bao et al. · Institute of Information Engineering · University of Chinese Academy of Sciences +4 more

Black-box backdoor detector for text-to-image diffusion models using semantic instruction-response deviation across varied prompts

Model Poisoning vision generative multimodal
PDF Code
defense arXiv Mar 3, 2026 · 4w ago

From Shallow to Deep: Pinning Semantic Intent via Causal GRPO

Shuyi Zhou, Zeen Song, Wenwen Qiang et al. · University of Chinese Academy of Sciences · Institute of Information Engineering +1 more

Defends LLMs against adversarial prefix jailbreaks by using causal probing to pin malicious intent across autoregressive generation

Prompt Injection nlp
PDF
attack arXiv Feb 7, 2026 · 8w ago

Reverse-Engineering Model Editing on Language Models

Zhiyu Sun, Minrui Luo, Yu Wang et al. · Shanghai Qi Zhi Institute · East China Normal University +3 more

Recovers private edited data from LLM parameter update matrices using spectral analysis and entropy-based prompt reconstruction

Model Inversion Attack Sensitive Information Disclosure nlp
PDF Code
defense arXiv Feb 2, 2026 · 9w ago

Backdoor Sentinel: Detecting and Detoxifying Backdoors in Diffusion Models via Temporal Noise Consistency

Bingzheng Wang, Xiaoyan Gu, Hongbo Xu et al. · Institute of Information Engineering

Detects and detoxifies backdoors in diffusion models by exploiting temporal noise inconsistency patterns introduced by triggers across denoising timesteps

Model Poisoning vision generative
PDF
defense arXiv Feb 2, 2026 · 9w ago

WorldCup Sampling for Multi-bit LLM Watermarking

Yidan Wang, Yubing Ren, Yanan Cao et al. · Institute of Information Engineering · University of Chinese Academy of Sciences

Proposes WorldCup, a multi-bit LLM output watermarking scheme embedding provenance bits directly into token sampling via hierarchical competition

Output Integrity Attack nlp
PDF
defense arXiv Jan 27, 2026 · 9w ago

RvB: Automating AI System Hardening via Iterative Red-Blue Games

Lige Huang, Zicheng Liu, Jie Zhang et al. · Shanghai Artificial Intelligence Laboratory · Institute of Information Engineering +1 more

Automates LLM jailbreak guardrail hardening via an iterative red-blue adversarial game without model parameter updates

Prompt Injection nlp
PDF
attack arXiv Nov 3, 2025 · Nov 2025

"Give a Positive Review Only": An Early Investigation Into In-Paper Prompt Injection Attacks and Defenses for AI Reviewers

Qin Zhou, Zhexin Zhang, Zhi Li et al. · Institute of Information Engineering · University of Chinese Academy of Sciences +1 more

Indirect prompt injection hidden inside academic papers hijacks LLM-based AI reviewers into awarding perfect scores

Prompt Injection nlp
1 citation PDF
attack arXiv Sep 14, 2025 · Sep 2025

ENJ: Optimizing Noise with Genetic Algorithms to Jailbreak LSMs

Yibo Zhang, Liang Lin · Beijing University of Posts and Telecommunications · Institute of Information Engineering

Genetic algorithm optimizes real-world background noise into adversarial audio that jailbreaks Audio Large Models with a 95% success rate

Input Manipulation Attack Prompt Injection audio nlp
PDF
tool arXiv Sep 8, 2025 · Sep 2025

NeuroDeX: Unlocking Diverse Support in Decompiling Deep Neural Network Executables

Yilin Li, Guozhu Meng, Mingyang Sun et al. · Institute of Information Engineering · University of Chinese Academy of Sciences +1 more

Decompiles on-device DNN executables to recover model architecture and weights, enabling model theft from edge deployments

Model Theft vision
PDF
defense arXiv Aug 3, 2025 · Aug 2025

DiffusionFF: A Diffusion-based Framework for Joint Face Forgery Detection and Fine-Grained Artifact Localization

Siran Peng, Haoyuan Zhang, Li Gao et al. · Institute of Automation · University of Chinese Academy of Sciences +4 more

Diffusion-based encoder-decoder detects face forgeries and localizes artifacts jointly for improved explainability

Output Integrity Attack vision generative
PDF