Latest papers

89 papers
defense arXiv Apr 30, 2026 · 21d ago

VOW: Verifiable and Oblivious Watermark Detection for Large Language Models

Xiaokun Luan, Yihao Zhang, Pengcheng Su et al. · Peking University

Privacy-preserving watermark detection protocol using VOPRF that verifies LLM-generated text without revealing content to provider

Output Integrity Attack nlp
PDF
defense arXiv Apr 30, 2026 · 21d ago

TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

Bowen Sun, Chaozhuo Li, Yaodong Yang et al. · Johns Hopkins University · Microsoft Research Asia +2 more

Dual-encoder defense that clusters fragmented malicious prompts across anonymous LLM requests using asymmetric contrastive learning

Prompt Injection nlp
PDF
defense arXiv Apr 27, 2026 · 24d ago

AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization

Zonghao Ying, Haozheng Wang, Jiangfan Liu et al. · Beihang University · 360 AI Security Lab +1 more

OS-inspired defense framework that intercepts LLM agent tool calls and enforces privilege separation to block prompt injection attacks

Prompt Injection Excessive Agency nlp
PDF
defense arXiv Apr 20, 2026 · 4w ago

From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers

Xiangyu Wen, Yuang Zhao, Xiaoyu Xu et al. · The Chinese University of Hong Kong · Shanghai Jiao Tong University +3 more

Kernel-based security architecture for LLM agents that intercepts unsafe tool calls using deterministic taint tracking and dependency graphs

Insecure Plugin Design Excessive Agency nlp
PDF Code
defense arXiv Apr 15, 2026 · 5w ago

SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment

Xixun Lin, Yang Liu, Yancheng Chen et al. · Chinese Academy of Sciences · Institute of Applied Physics and Computational Mathematics +1 more

Multi-layer security architecture embedded in LLM agent execution harnesses to defend against prompt injection and tool misuse attacks

Prompt Injection Insecure Plugin Design Excessive Agency nlp
PDF
defense arXiv Apr 15, 2026 · 5w ago

QuantileMark: A Message-Symmetric Multi-bit Watermark for LLMs

Junlin Zhu, Baizhou Huang, Xiaojun Wan · Peking University

Multi-bit watermarking for LLM outputs using equal-mass quantile partitioning to ensure message-symmetric detection and quality

Output Integrity Attack nlp
PDF Code
benchmark arXiv Apr 13, 2026 · 5w ago

NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild

Aleksandr Gushchin, Khaled Abud, Ekaterina Shumitskaya et al. · Lomonosov Moscow State University · Shenzhen University +14 more

Competition report on robust deepfake detection across 42 generators and 36 image transformations with 20 final solutions

Output Integrity Attack visiongenerative
PDF
attack arXiv Apr 13, 2026 · 5w ago

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

Yihao Zhang, Kai Wang, Jiangrong Wu et al. · Peking University · Sun Yat-Sen University +4 more

Multi-turn jailbreak attack that chains low-risk prompts to cumulatively bypass LLM safety guardrails across modalities

Prompt Injection nlpmultimodal
PDF
attack arXiv Apr 7, 2026 · 6w ago

Stop Fixating on Prompts: Reasoning Hijacking and Constraint Tightening for Red-Teaming LLM Agents

Yanxu Mao, Peipei Liu, Tiehan Cui et al. · Henan University · Chinese Academy of Sciences +2 more

Red-teams LLM agents by hijacking reasoning trajectories and memory retrieval without modifying user prompts, achieving cross-model jailbreaks

Prompt Injection Excessive Agency nlpmultimodal
PDF
attack arXiv Apr 7, 2026 · 6w ago

Reading Between the Pixels: An Inscriptive Jailbreak Attack on Text-to-Image Models

Zonghao Ying, Haowen Dai, Lianyu Hu et al. · Beihang University · University of Nottingham Ningbo China +3 more

Black-box jailbreak attack coercing T2I models to render harmful text in benign images via layered prompt decomposition

Prompt Injection multimodalvisionnlp
PDF
defense arXiv Apr 6, 2026 · 6w ago

Preserving Forgery Artifacts: AI-Generated Video Detection at Native Scale

Zhengcen Li, Chenyang Jiang, Hang Zhao et al. · Harbin Institute of Technology · Peng Cheng Laboratory +1 more

Vision transformer detector operating at native video resolution to preserve high-frequency forgery artifacts in AI-generated videos

Output Integrity Attack visionmultimodal
PDF
defense arXiv Mar 28, 2026 · 7w ago

Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection

Jinhu Fu, Yihang Lou, Qingyi Si et al. · Beijing University of Posts and Telecommunications · Chongqing University of Posts and Telecommunications +2 more

Identifies and repairs unsafe neural pathways in VLMs using causal mediation analysis and dual-modal safety subspace projection

Input Manipulation Attack Prompt Injection multimodalvisionnlp
PDF
defense The IEEE/CVF Conference on Com... Mar 25, 2026 · 8w ago

Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection

Zhanhe Lei, Zhongyuan Wang, Jikang Cheng et al. · Wuhan University · Peking University +2 more

Reinforcement learning curriculum that dynamically weights training samples to improve deepfake detector generalization against unseen attacks

Output Integrity Attack visiongenerative
PDF Code
defense arXiv Mar 19, 2026 · 9w ago

Revisiting Label Inference Attacks in Vertical Federated Learning: Why They Are Vulnerable and How to Defend

Yige Liu, Dexuan Xu, Zimai Guo et al. · Peking University · Zhongguancun Laboratory

Reveals label inference attacks in VFL succeed due to feature-label alignment, proposes zero-overhead cut layer defense

Model Inversion Attack federated-learning
PDF
attack arXiv Mar 16, 2026 · 9w ago

ClawWorm: Self-Propagating Attacks Across LLM Agent Ecosystems

Yihao Zhang, Zeming Wei, Xiaokun Luan et al. · Peking University · Sun Yat-Sen University +3 more

Self-replicating worm attack on LLM agent ecosystems achieving autonomous propagation through configuration hijacking and broadcast infection

AI Supply Chain Attacks Prompt Injection Excessive Agency nlpmultimodal
PDF
attack arXiv Mar 13, 2026 · 9w ago

Purify Once, Edit Freely: Breaking Image Protections under Model Mismatch

Qichen Zhao, Shengfang Zhai, Xinjian Bai et al. · Peking University · National University of Singapore +1 more

Defeats image protection schemes via purification attacks, removing adversarial perturbations to restore full editability under model mismatch

Output Integrity Attack visiongenerative
PDF
defense arXiv Feb 27, 2026 · 11w ago

Your Inference Request Will Become a Black Box: Confidential Inference for Cloud-based Large Language Models

Chung-ju Huang, Huiqiang Zhao, Yuanpeng He et al. · Peking University · Tencent +1 more

Defends LLM client prompts from cloud-provider reconstruction via CVM partitioning and reversible masking, cutting token inference accuracy from 97.5% to 1.34%

Model Inversion Attack Sensitive Information Disclosure nlp
PDF
benchmark arXiv Feb 25, 2026 · 12w ago

Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models

Zheyuan Gu, Qingsong Zhao, Yusong Wang et al. · China Telecom · Peking University +1 more

Proposes FAQ benchmark to evaluate VLMs on temporal deepfake detection via three-level forensic reasoning hierarchy

Output Integrity Attack visionmultimodal
PDF
defense arXiv Feb 24, 2026 · 12w ago

ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction

Che Wang, Fuyao Zhang, Jiaming Zhang et al. · Peking University · Nanyang Technological University +2 more

Defends LLM agents against indirect prompt injection via latent-space probing and attention steering without over-refusal

Prompt Injection nlpmultimodal
PDF
attack arXiv Feb 24, 2026 · 12w ago

AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs

Che Wang, Jiaming Zhang, Ziqi Zhang et al. · Peking University · Nanyang Technological University +1 more

Adaptive indirect prompt injection attack on agentic LLMs that selects stealthy MCP tools and optimizes prompts to evade defenses

Prompt Injection Insecure Plugin Design nlp
PDF
Loading more papers…