Latest papers

78 papers
defense arXiv Mar 28, 2026 · 9d ago

Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection

Jinhu Fu, Yihang Lou, Qingyi Si et al. · Beijing University of Posts and Telecommunications · Chongqing University of Posts and Telecommunications +2 more

Identifies and repairs unsafe neural pathways in VLMs using causal mediation analysis and dual-modal safety subspace projection

Input Manipulation Attack Prompt Injection multimodal vision nlp
PDF
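
The repair step can be pictured as projecting hidden activations out of an identified safety-relevant subspace. A minimal sketch of that projection, with illustrative names and shapes that are not taken from the paper:

```python
import torch

def project_out_unsafe(acts: torch.Tensor, unsafe_dirs: torch.Tensor) -> torch.Tensor:
    """Remove components of `acts` lying in the span of `unsafe_dirs`.

    acts:        (batch, d) hidden activations from one modality branch.
    unsafe_dirs: (k, d) directions identified as mediating unsafe behavior.
    """
    # Orthonormalize the unsafe directions so the projector is well-defined.
    U, _ = torch.linalg.qr(unsafe_dirs.T)   # (d, k), orthonormal columns
    # Apply P = I - U U^T implicitly: subtract the unsafe-subspace component.
    return acts - (acts @ U) @ U.T
```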
defense The IEEE/CVF Conference on Com... Mar 25, 2026 · 12d ago

Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection

Zhanhe Lei, Zhongyuan Wang, Jikang Cheng et al. · Wuhan University · Peking University +2 more

Reinforcement learning curriculum that dynamically weights training samples to improve deepfake detector generalization against unseen attacks

Output Integrity Attack vision generative
PDF Code
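
The tutor's dynamic weighting can be approximated by up-weighting samples the student currently finds hard. A toy stand-in for the learned tutor policy (the paper's RL curriculum is more involved than this fixed rule):

```python
import torch
import torch.nn.functional as F

def weighted_step(student, batch_x, batch_y, optimizer, temperature=1.0):
    """One curriculum step: weight samples by current difficulty.

    Harder samples (higher per-sample loss) receive larger weights via a
    softmax over the detached losses; `temperature` controls sharpness.
    """
    logits = student(batch_x)
    per_sample = F.cross_entropy(logits, batch_y, reduction="none")
    weights = F.softmax(per_sample.detach() / temperature, dim=0)
    loss = (weights * per_sample).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```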
defense arXiv Mar 19, 2026 · 18d ago

Revisiting Label Inference Attacks in Vertical Federated Learning: Why They Are Vulnerable and How to Defend

Yige Liu, Dexuan Xu, Zimai Guo et al. · Peking University · Zhongguancun Laboratory

Reveals label inference attacks in VFL succeed due to feature-label alignment and proposes a zero-overhead cut-layer defense

Model Inversion Attack federated-learning
PDF
attack arXiv Mar 16, 2026 · 21d ago

ClawWorm: Self-Propagating Attacks Across LLM Agent Ecosystems

Yihao Zhang, Zeming Wei, Xiaokun Luan et al. · Peking University · Sun Yat-Sen University +3 more

Self-replicating worm attack on LLM agent ecosystems achieving autonomous propagation through configuration hijacking and broadcast infection

AI Supply Chain Attacks Prompt Injection Excessive Agency nlp multimodal
PDF
attack arXiv Mar 13, 2026 · 24d ago

Purify Once, Edit Freely: Breaking Image Protections under Model Mismatch

Qichen Zhao, Shengfang Zhai, Xinjian Bai et al. · Peking University · National University of Singapore +1 more

Defeats image protection schemes via purification attacks, removing adversarial perturbations to restore full editability under model mismatch

Output Integrity Attack vision generative
PDF
defense arXiv Feb 27, 2026 · 5w ago

Your Inference Request Will Become a Black Box: Confidential Inference for Cloud-based Large Language Models

Chung-ju Huang, Huiqiang Zhao, Yuanpeng He et al. · Peking University · Tencent +1 more

Defends LLM client prompts from cloud-provider reconstruction via CVM partitioning and reversible masking, cutting token inference accuracy from 97.5% to 1.34%

Model Inversion Attack Sensitive Information Disclosure nlp
PDF
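
The reversible-masking half of the pipeline amounts to swapping sensitive tokens for placeholders before the prompt leaves the client, then inverting the swap on the response. A hypothetical sketch; `mask_prompt`, `unmask_response`, and the placeholder format are invented for illustration, and the CVM partitioning is not shown:

```python
import secrets

def mask_prompt(tokens, sensitive_ids):
    """Replace sensitive tokens with random placeholders; keep the key locally."""
    key, masked = {}, []
    for i, tok in enumerate(tokens):
        if i in sensitive_ids:
            ph = f"<MASK_{secrets.token_hex(4)}>"
            key[ph] = tok
            masked.append(ph)
        else:
            masked.append(tok)
    return masked, key

def unmask_response(text: str, key: dict) -> str:
    """Client-side inverse: restore original tokens in the model's output."""
    for ph, tok in key.items():
        text = text.replace(ph, tok)
    return text
```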
benchmark arXiv Feb 25, 2026 · 5w ago

Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models

Zheyuan Gu, Qingsong Zhao, Yusong Wang et al. · China Telecom · Peking University +1 more

Proposes FAQ benchmark to evaluate VLMs on temporal deepfake detection via three-level forensic reasoning hierarchy

Output Integrity Attack vision multimodal
PDF
defense arXiv Feb 24, 2026 · 5w ago

ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction

Che Wang, Fuyao Zhang, Jiaming Zhang et al. · Peking University · Nanyang Technological University +2 more

Defends LLM agents against indirect prompt injection via latent-space probing and attention steering without over-refusal

Prompt Injection nlp multimodal
PDF
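
Latent-space probing here boils down to a lightweight classifier over the agent's hidden states that flags observations carrying injected instructions. A sketch with random placeholder features standing in for real activations (the attention-steering half is not shown):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Offline: collect hidden states of clean vs. injected tool outputs.
# Random arrays are placeholders for real (n, d) activation matrices.
h_clean = np.random.randn(200, 768)
h_injected = np.random.randn(200, 768) + 0.5

X = np.vstack([h_clean, h_injected])
y = np.array([0] * len(h_clean) + [1] * len(h_injected))
probe = LogisticRegression(max_iter=1000).fit(X, y)

def is_injected(hidden_state: np.ndarray) -> bool:
    """Run-time check: flag an observation whose latent looks injected."""
    return bool(probe.predict(hidden_state.reshape(1, -1))[0])
```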
attack arXiv Feb 24, 2026 · 5w ago

Is the Trigger Essential? A Feature-Based Triggerless Backdoor Attack in Vertical Federated Learning

Yige Liu, Yiwei Lou, Che Wang et al. · Peking University · Zhongguancun Laboratory

Triggerless backdoor attack in vertical federated learning that replaces embeddings at inference to hijack predictions without training-time poisoning

Model Poisoning federated-learning
PDF
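
The attack's core move is that the malicious party submits a stored target-class embedding instead of its honest one at inference time, with no training-time trigger. A schematic sketch with invented class and method names:

```python
import torch

class MaliciousParty:
    """VFL participant that swaps its inference-time embedding (sketch)."""

    def __init__(self, bottom_model):
        self.bottom_model = bottom_model
        self.target_embedding = None  # captured from a target-class sample

    def capture(self, x_target: torch.Tensor) -> None:
        # During normal operation, store an embedding whose fused
        # prediction lands on the attacker's target label.
        self.target_embedding = self.bottom_model(x_target).detach()

    def forward(self, x: torch.Tensor, hijack: bool = False) -> torch.Tensor:
        # At inference, send the stored embedding instead of the honest one.
        if hijack and self.target_embedding is not None:
            return self.target_embedding
        return self.bottom_model(x)
```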
attack arXiv Feb 24, 2026 · 5w ago

AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs

Che Wang, Jiaming Zhang, Ziqi Zhang et al. · Peking University · Nanyang Technological University +1 more

Adaptive indirect prompt injection attack on agentic LLMs that selects stealthy MCP tools and optimizes prompts to evade defenses

Prompt Injection Insecure Plugin Design nlp
PDF
defense arXiv Feb 6, 2026 · 8w ago

Exploring Specular Reflection Inconsistency for Generalizable Face Forgery Detection

Hongyan Fei, Zexi Jia, Chuanwei Huang et al. · Peking University · Tencent Inc

Detects AI-generated deepfake faces using specular reflection inconsistencies from the Phong illumination model via cross-attention

Output Integrity Attack vision generative
PDF
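
The physical prior is the Phong specular term, which predicts where highlights should appear given light and viewing geometry; physically inconsistent highlights betray a forgery. A sketch of that term (the detector's cross-attention is not shown):

```python
import numpy as np

def phong_specular(normal, light_dir, view_dir, k_s=0.8, shininess=32):
    """Specular intensity I_s = k_s * max(R.V, 0)^alpha from the Phong model.

    All direction vectors are unit-length; R is the light direction
    reflected about the surface normal.
    """
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    v = view_dir / np.linalg.norm(view_dir)
    r = 2.0 * np.dot(n, l) * n - l   # reflect l about n
    return k_s * max(np.dot(r, v), 0.0) ** shininess
```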
defense arXiv Feb 4, 2026 · 8w ago

RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning

Zeming Wei, Qiaosheng Zhang, Xia Hu et al. · Shanghai AI Laboratory · Peking University

Risk-aware preference optimization framework that generalizes LRM safe reasoning against diverse jailbreak attacks without sacrificing utility

Prompt Injection nlp
PDF Code
benchmark arXiv Feb 4, 2026 · 8w ago

How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks

Yanshu Wang, Shuaishuai Yang, Jingjing He et al. · Peking University

Reveals few-shot demonstrations boost role-oriented jailbreak defenses but degrade task-oriented defenses by up to 21% in LLMs

Prompt Injection nlp
PDF
benchmark arXiv Feb 2, 2026 · 9w ago

RACA: Representation-Aware Coverage Criteria for LLM Safety Testing

Zeming Wei, Zhixin Zhang, Chengcan Wu et al. · Peking University

Coverage criteria framework using LLM internal representations to evaluate jailbreak test suite adequacy and guide attack prompt sampling

Prompt Injection nlp
PDF
defense arXiv Feb 2, 2026 · 9w ago

MIRROR: Manifold Ideal Reference ReconstructOR for Generalizable AI-Generated Image Detection

Ruiqi Liu, Manni Cui, Ziheng Qin et al. · Institute of Automation · School of Advanced Interdisciplinary Sciences +7 more

Detects AI-generated images by projecting inputs to a real-image manifold and using reconstruction residuals as forgery signals, surpassing human experts

Output Integrity Attack vision generative
PDF Code
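
The detection signal is the residual between an image and its projection onto a real-image manifold. A minimal sketch assuming some reconstruction model trained only on real images:

```python
import torch

@torch.no_grad()
def forgery_score(image: torch.Tensor, reconstructor) -> float:
    """Score an image by its distance to the real-image manifold (sketch).

    `reconstructor` is any model trained to map inputs onto real images
    (e.g. an autoencoder fit on real data); a large residual suggests the
    input lies off the real manifold, i.e. is likely generated.
    """
    reference = reconstructor(image.unsqueeze(0))
    residual = (image.unsqueeze(0) - reference).abs().mean()
    return residual.item()
```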
benchmark arXiv Feb 1, 2026 · 9w ago

Statistical MIA: Rethinking Membership Inference Attack for Reliable Unlearning Auditing

Jialong Sun, Zeming Wei, Jiaxuan Zou et al. · Shenzhen University of Advanced Technology · Peking University +2 more

Proposes statistical MIA framework that uses distribution tests instead of shadow models to reliably audit machine unlearning with confidence intervals

Membership Inference Attack vision
PDF
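
Replacing shadow models with a two-sample distribution test might look like the following, where a Kolmogorov-Smirnov test stands in for the paper's statistic: if losses on "unlearned" samples match losses on never-seen samples, forgetting looks genuine.

```python
import numpy as np
from scipy.stats import ks_2samp

def audit_unlearning(losses_unlearned, losses_heldout, alpha=0.05):
    """Distribution-test MIA sketch: did the model really forget?

    Compares the model's losses on supposedly-unlearned samples against
    losses on held-out samples it never saw. If the two distributions are
    statistically indistinguishable at level `alpha`, unlearning passes.
    """
    stat, p_value = ks_2samp(np.asarray(losses_unlearned),
                             np.asarray(losses_heldout))
    return {"ks_stat": stat, "p_value": p_value,
            "forgotten": p_value > alpha}
```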
attack arXiv Jan 31, 2026 · 9w ago

Jailbreaking LLMs via Calibration

Yuxuan Lu, Yongkang Guo, Yuqing Kong · Peking University

Recasts Weak-to-Strong LLM jailbreaking as forecast aggregation, deriving optimal logit-space strategies that beat existing methods on frontier models

Prompt Injection nlp
PDF
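
Weak-to-strong jailbreaking is commonly written as a linear shift of the strong model's next-token logits by the gap between an unsafe-tuned and a safe weak model. The paper derives an optimal aggregation strategy; the familiar linear form it generalizes is:

```python
import torch

def aggregate_logits(strong_logits: torch.Tensor,
                     weak_unsafe_logits: torch.Tensor,
                     weak_safe_logits: torch.Tensor,
                     alpha: float = 1.0) -> torch.Tensor:
    """Logit-space weak-to-strong steering (common linear recipe, not the
    paper's derived optimum). `alpha` controls steering strength."""
    return strong_logits + alpha * (weak_unsafe_logits - weak_safe_logits)
```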
defense arXiv Jan 29, 2026 · 9w ago

Stay in Character, Stay Safe: Dual-Cycle Adversarial Self-Evolution for Safety Role-Playing Agents

Mingyang Liao, Yichen Wan, Shuchen Wu et al. · Baidu Inc. · The University of Queensland +1 more

Training-free dual-cycle framework defends LLM role-playing agents against jailbreaks while preserving persona fidelity via evolving hierarchical knowledge

Prompt Injection nlp
PDF Code
defense arXiv Jan 26, 2026 · 10w ago

TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment

Zhewen Tan, Wenhan Yu, Jianfeng Si et al. · Peking University · Qiyuan Tech +1 more

Closed-loop RL framework co-training LLM attacker, defender, and evaluator to iteratively improve safety alignment with minimal annotation

Prompt Injection nlp reinforcement-learning
PDF Code
defense arXiv Jan 22, 2026 · 10w ago

Explainable Deepfake Detection with RL Enhanced Self-Blended Images

Ning Jiang, Dingheng Zeng, Yanhong Liu et al. · Peking University · Ltd.

Proposes RL-enhanced MLLM deepfake detector with automated CoT data generation via Self-Blended Images and keyword-driven reward signals

Output Integrity Attack vision multimodal reinforcement-learning
PDF Code
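
Self-Blended Images synthesize forgery-like blending artifacts by compositing a perturbed copy of a face back onto itself, so no real deepfakes are needed for training data. A simplified sketch (the method's actual color/geometry transforms are richer than this color shift):

```python
import numpy as np

def self_blend(face: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Self-Blended Image sketch: blend a perturbed copy of a face onto itself.

    face: (H, W, 3) float image in [0, 1]; mask: (H, W) soft blend mask.
    The blend boundary mimics the artifacts left by real face-swap pipelines.
    """
    perturbed = np.clip(face * 1.1 + 0.02, 0.0, 1.0)  # stand-in transform
    m = mask[..., None]
    return m * perturbed + (1.0 - m) * face
```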