Latest papers

8 papers
defense · arXiv · Jan 16, 2026

Attesting Model Lineage by Consistent Knowledge Evolution with Fine-Tuning Trajectory

Zhuoyi Shang, Jiasen Li, Pengzhen Chen et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +1 more

Verifies model lineage via knowledge evolution trajectories and parameter edits to detect unauthorized redistribution of fine-tuned models

Model Theft · vision · nlp · generative
1 citation PDF
attack · arXiv · Dec 24, 2025

CoTDeceptor: Adversarial Code Obfuscation Against CoT-Enhanced LLM Code Agents

Haoyang Li, Mingjin Li, Jinxin Zuo et al. · Beijing University of Posts and Telecommunications · Chinese Academy of Sciences +3 more

Adversarial code obfuscation framework that exploits CoT reasoning chain weaknesses to evade LLM-based vulnerability detectors

Input Manipulation Attack · Prompt Injection · nlp
PDF Code
defense · arXiv · Nov 16, 2025

Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks

Haotian Jin, Yang Li, Haihui Fan et al. · Chinese Academy of Sciences · State Key Laboratory of Cyberspace Security Defense +1 more

Defends LLMs against backdoor attacks by detecting abnormal inter-head attention similarity and realigning contaminated attention heads via fine-tuning

Model Poisoning · nlp
1 citation PDF
defense · arXiv · Nov 12, 2025

DeepTracer: Tracing Stolen Model via Deep Coupled Watermarks

Yunfei Yang, Xiaojun Chen, Yuexin Xuan et al. · Chinese Academy of Sciences · State Key Laboratory of Cyberspace Security Defense +2 more

Embeds coupled watermarks in models that adversaries inevitably carry over when stealing via query-based extraction attacks

Model Theft · vision
1 citation PDF Code
defense · arXiv · Nov 12, 2025

Value-Aligned Prompt Moderation via Zero-Shot Agentic Rewriting for Safe Image Generation

Xin Zhao, Xiaojun Chen, Bingshan Liu et al. · Chinese Academy of Sciences · State Key Laboratory of Cyberspace Security Defense +1 more

Defends text-to-image models from jailbreak prompts via LLM-driven zero-shot prompt rewriting with cultural and intent-aware safety checks

Prompt Injection · multimodal · generative · nlp
1 citation PDF
attack · arXiv · Oct 15, 2025

Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers

Xin Zhao, Xiaojun Chen, Bingshan Liu et al. · Chinese Academy of Sciences · State Key Laboratory of Cyberspace Security Defense +2 more

Backdoor attack exploiting MoE routing preferences in LLMs to hijack expert pathways with up to 100% attack success rate

Model Poisoning · nlp
PDF
attack · arXiv · Oct 1, 2025

Fine-Tuning Jailbreaks under Highly Constrained Black-Box Settings: A Three-Pronged Approach

Xiangfang Li, Yu Wang, Bo Li · Chinese Academy of Sciences · University of Chinese Academy of Sciences +1 more

Backdoor-based fine-tuning attack that jailbreaks GPT-4o and GPT-4.1 with over 97% attack success rate by evading data filters, defensive fine-tuning, and safety audits

Model Poisoning · Prompt Injection · nlp
2 citations PDF Code
defense · arXiv · Sep 11, 2025

Towards Confidential and Efficient LLM Inference with Dual Privacy Protection

Honglan Yu, Yibin Wang, Feifei Dai et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +1 more

Protects user input privacy during LLM inference against an honest-but-curious server via TEE partitioning and optimized differential-privacy text sanitization

Sensitive Information Disclosure · nlp
PDF Code