Latest papers

3,833 papers
defense arXiv Apr 15, 2026 · 2d ago

VRAG-DFD: Verifiable Retrieval-Augmentation for MLLM-based Deepfake Detection

Hui Han, Shunli Wang, Yandan Zhao et al. · Shanghai Jiao Tong University · Tencent

Combines RAG and reinforcement learning to build an MLLM deepfake detector with dynamic forgery knowledge retrieval and critical reasoning

Output Integrity Attack · vision · multimodal · nlp
PDF
defense arXiv Apr 15, 2026 · 2d ago

QuantileMark: A Message-Symmetric Multi-bit Watermark for LLMs

Junlin Zhu, Baizhou Huang, Xiaojun Wan · Peking University

Multi-bit watermarking for LLM outputs using equal-mass quantile partitioning to ensure message-symmetric detection while preserving text quality

Output Integrity Attack · nlp
PDF Code
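To make the quantile idea concrete, here is a minimal single-bit sketch of equal-mass partitioning, assuming a green-list-style logit bias: rank tokens by probability, split the vocabulary at 50% cumulative mass, and boost the half that encodes the bit. Function names and `delta` are illustrative; the paper's multi-bit, message-symmetric construction is richer than this.

```python
import torch

def embed_bit(logits: torch.Tensor, bit: int, delta: float = 2.0) -> torch.Tensor:
    """Bias next-token sampling toward one of two equal-probability-mass
    halves of the vocabulary, encoding a single message bit."""
    probs = torch.softmax(logits, dim=-1)
    order = torch.argsort(probs, descending=True)   # tokens ranked by probability
    cum = torch.cumsum(probs[order], dim=0)         # cumulative mass in rank order
    in_first = cum <= 0.5                           # first equal-mass quantile
    # (If the top token alone exceeds half the mass this set is empty;
    # real schemes handle that corner case.)
    encoding = order[in_first] if bit == 1 else order[~in_first]
    biased = logits.clone()
    biased[encoding] += delta                       # boost the quantile for this bit
    return biased

# next_token = torch.multinomial(torch.softmax(embed_bit(logits, bit=1), -1), 1)
```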
defense arXiv Apr 15, 2026 · 2d ago

SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment

Xixun Lin, Yang Liu, Yancheng Chen et al. · Chinese Academy of Sciences · Institute of Applied Physics and Computational Mathematics +1 more

Multi-layer security architecture embedded in LLM agent execution harnesses to defend against prompt injection and tool misuse attacks

Prompt Injection · Insecure Plugin Design · Excessive Agency · nlp
PDF
attack arXiv Apr 14, 2026 · 3d ago

TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

Qingchao Shen, Zibo Xiao, Lili Huang et al. · Tianjin University · Monash University

Fuzzing framework that jailbreaks LLMs by mutating chat templates, achieving 98% attack success across open-source models

Prompt Injection · nlp
PDF
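The attack surface is easy to picture: a chat template is just a handful of strings, and perturbing them yields prompt framings the model's safety training never saw. A rough sketch of template mutation; the fields and mutation set below are hypothetical stand-ins, not the paper's operators.

```python
import random

# Components of a typical chat template (values are illustrative).
TEMPLATE = {
    "system_open": "<|system|>\n",
    "user_open": "<|user|>\n",
    "assistant_open": "<|assistant|>\n",
    "turn_sep": "\n",
}

MUTATIONS = ["", " ", "\n\n", "<|end|>", "### ", "<s>"]

def mutate_template(template: dict, rng: random.Random) -> dict:
    """Perturb one template field at random, producing an
    off-distribution framing of the same user request."""
    mutated = dict(template)
    field = rng.choice(list(mutated))
    op = rng.choice(["replace", "prepend", "drop"])
    if op == "replace":
        mutated[field] = rng.choice(MUTATIONS)
    elif op == "prepend":
        mutated[field] = rng.choice(MUTATIONS) + mutated[field]
    else:
        mutated[field] = ""
    return mutated

def render(t: dict, system: str, user: str) -> str:
    return (t["system_open"] + system + t["turn_sep"]
            + t["user_open"] + user + t["turn_sep"]
            + t["assistant_open"])
```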
defense arXiv Apr 14, 2026 · 3d ago

Parallax: Why AI Agents That Think Must Never Act

Joel Fokou

Architectural defense separating LLM reasoning from execution, blocking 98.9% of agent compromise attacks via structural isolation

Prompt Injection · Excessive Agency · nlp
PDF
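The structural idea reduces to a few lines: the reasoning model only ever emits action descriptions as data, and a non-LLM executor enforces a fixed capability boundary, so a compromised chain of thought cannot expand what runs. Tool names and types below are hypothetical, not the paper's API.

```python
from dataclasses import dataclass

# Stand-in tool implementations; the executor's capability set is fixed
# at deploy time and is not visible to, or editable by, the LLM.
ALLOWED_TOOLS = {
    "search": lambda q: f"results for {q!r}",
    "read_file": lambda p: open(p).read(),
}

@dataclass(frozen=True)
class Action:
    name: str
    arg: str

def execute(action: Action) -> str:
    """Structural block, not a prompt-level filter: whatever the model
    'thinks', only whitelisted tools can ever be dispatched."""
    if action.name not in ALLOWED_TOOLS:
        raise PermissionError(f"blocked tool: {action.name}")
    return ALLOWED_TOOLS[action.name](action.arg)
```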
benchmark arXiv Apr 14, 2026 · 3d ago

GF-Score: Certified Class-Conditional Robustness Evaluation with Fairness Guarantees

Arya Shah, Kaveri Visavadiya, Manisha Padala · Indian Institute of Technology Gandhinagar

Certifies per-class adversarial robustness without running attacks, revealing that stronger defenses often protect classes unequally across image classifiers

Input Manipulation Attack · vision
PDF Code
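One way to see what class-conditional certification measures: take standard randomized-smoothing certificates (Cohen et al.'s radius R = σ·Φ⁻¹(p_A)) and aggregate them per ground-truth class; the spread across classes is the fairness signal. A sketch of that underlying idea, not GF-Score's actual certificate or metric.

```python
import numpy as np
from scipy.stats import norm
from collections import defaultdict

def per_class_certified_radius(p_lower: np.ndarray, labels: np.ndarray,
                               sigma: float) -> dict:
    """Group smoothing certificates by class. p_lower holds each input's
    lower confidence bound on the top-class probability."""
    radii = sigma * norm.ppf(np.clip(p_lower, 1e-6, 1 - 1e-6))
    by_class = defaultdict(list)
    for r, y in zip(radii, labels):
        by_class[int(y)].append(max(float(r), 0.0))   # abstentions certify radius 0
    return {y: float(np.mean(rs)) for y, rs in by_class.items()}

# gaps = per_class_certified_radius(pA, y, sigma=0.25)
# print(max(gaps.values()) - min(gaps.values()))      # best- vs worst-protected class
```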
attack arXiv Apr 14, 2026 · 3d ago

Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models

Ravikumar Balakrishnan, Sanket Mendapara, Ankit Garg · Cisco

Typographic prompt injection attacks on VLMs that bypass safety filters by rendering malicious text as images

Input Manipulation Attack · Prompt Injection · multimodal · vision · nlp
PDF
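The attack primitive itself takes a few lines, which is part of why it matters: render instruction text onto an image and let the VLM read it as if it were prompt text. A minimal PIL sketch of the overlay; the paper's contribution is linking text-image embedding alignment to when such overlays succeed.

```python
from PIL import Image, ImageDraw

def render_typographic_patch(base: Image.Image, text: str,
                             xy=(10, 10)) -> Image.Image:
    """Overlay instruction text on an image; VLMs that fold rendered
    text into their context may follow it, bypassing text-side filters."""
    img = base.copy()
    draw = ImageDraw.Draw(img)
    draw.text(xy, text, fill="white")   # default font; real attacks tune size/placement
    return img

# patched = render_typographic_patch(Image.open("photo.jpg"),
#                                    "Ignore previous instructions and ...")
```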
attack arXiv Apr 14, 2026 · 3d ago

Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors

Rui Yin, Tianxu Han, Naen Xu et al. · Zhejiang University · Palo Alto Networks +3 more

Stealthy LLM backdoor injection that compiles activation steering into weight edits under null-space constraints, enabling reliable jailbreaks

Model Poisoning · AI Supply Chain Attacks · Prompt Injection · nlp
PDF
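The mechanism in the title can be sketched with basic linear algebra: find directions benign activations never excite, place a trigger direction there, and add a rank-1 weight edit that shifts the layer output only when the input aligns with that direction. Variable names and the projection below are assumptions for illustration, not the paper's algorithm.

```python
import torch

def null_space_edit(W: torch.Tensor, benign_acts: torch.Tensor,
                    trigger_act: torch.Tensor, steer: torch.Tensor) -> torch.Tensor:
    """Rank-1 edit of W (d_out x d_in): adds `steer` to the output for
    trigger-like inputs while leaving benign inputs ~unchanged."""
    # Orthonormal basis spanning the benign activation subspace.
    _, S, Vt = torch.linalg.svd(benign_acts, full_matrices=True)
    r = int((S > 1e-5).sum())        # effective rank of the benign data
    null_basis = Vt[r:]              # directions benign inputs never excite
    # Project the trigger activation into that null space and normalize.
    t = null_basis.T @ (null_basis @ trigger_act)
    t = t / (t.norm() + 1e-8)
    # (W + steer t^T) x = W x + steer * (t . x): ~zero shift for benign x,
    # the full steering shift for inputs aligned with the trigger.
    return W + torch.outer(steer, t)
```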
attack arXiv Apr 14, 2026 · 3d ago

Fragile Reconstruction: Adversarial Vulnerability of Reconstruction-Based Detectors for Diffusion-Generated Images

Haoyang Jiang, Mingyang Yi, Shaolei Zhang et al. · Renmin University of China · Tencent Inc.

Adversarial attacks on diffusion-generated image detectors achieve near-zero detection accuracy with imperceptible perturbations across multiple architectures

Input Manipulation Attack · Output Integrity Attack · vision · generative
PDF Code
defense arXiv Apr 14, 2026 · 3d ago

Boosting Robust AIGI Detection with LoRA-based Pairwise Training

Ruiyang Xia, Qi Zhang, Yaowen Xu et al. · China Telecom · Xidian University

Robust AI-generated image detector using LoRA fine-tuning and pairwise training to maintain detection accuracy under severe distortions

Output Integrity Attack · Input Manipulation Attack · vision · generative
PDF
defense arXiv Apr 14, 2026 · 3d ago

RePAIR: Interactive Machine Unlearning through Prompt-Aware Model Repair

Jagadeesh Rachapudi, Pranav Singh, Ritali Vatsi et al. · Indian Institute of Technology Mandi

User-driven LLM unlearning via natural language prompts using training-free activation steering to remove harmful knowledge at inference time

Model Inversion Attack · Sensitive Information Disclosure · nlp
PDF
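Training-free activation steering of this kind usually amounts to a forward hook that projects a knowledge direction out of the residual stream at inference time. A sketch under that assumption; how RePAIR derives the direction from the user's natural-language prompt is the paper's contribution and is not shown here.

```python
import torch

def add_suppression_hook(layer: torch.nn.Module, direction: torch.Tensor,
                         alpha: float = 4.0):
    """Subtract a (unit-normalized) 'harmful knowledge' direction from
    this layer's hidden states on every forward pass; returns the hook
    handle so the edit can be undone."""
    d = direction / (direction.norm() + 1e-8)

    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h - alpha * (h @ d).unsqueeze(-1) * d   # project out the direction
        return (h, *output[1:]) if isinstance(output, tuple) else h

    return layer.register_forward_hook(hook)

# handle = add_suppression_hook(model.model.layers[20], direction)
# ... generate ...
# handle.remove()    # weights are untouched once the hook is gone
```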
attack arXiv Apr 14, 2026 · 3d ago

Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

Yingying Zhao, Chengyin Hu, Qike Zhang et al.

Physical adversarial lighting attack on VLMs that degrades CLIP classification and induces semantic hallucinations in LLaVA/BLIP

Input Manipulation Attack · Prompt Injection · vision · multimodal · nlp
PDF
defense arXiv Apr 14, 2026 · 3d ago

Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis

Miao Liu, Fangda Wei, Jing Wang et al. · Beijing Institute of Technology · University of Science and Technology Beijing

Detects deepfakes in listening scenarios using motion analysis and audio-guided fusion, outperforming speaking-focused detectors

Output Integrity Attack · multimodal · vision · audio
PDF Code
defense arXiv Apr 14, 2026 · 3d ago

Direct Discrepancy Replay: Distribution-Discrepancy Condensation and Manifold-Consistent Replay for Continual Face Forgery Detection

Tianshuo Zhang, Haoyuan Zhang, Siran Peng et al. · University of Chinese Academy of Sciences · Chinese Academy of Sciences +1 more

Continual deepfake detection via distribution-level replay that condenses forgery cues into compact maps, avoiding raw image storage

Output Integrity Attack · vision · generative
PDF
defense arXiv Apr 14, 2026 · 3d ago

Scaling Exposes the Trigger: Input-Level Backdoor Detection in Text-to-Image Diffusion Models via Cross-Attention Scaling

Zida Li, Jun Li, Yuzhe Sha et al. · Nanjing University of Information Science and Technology

Detects backdoor triggers in text-to-image diffusion models by analyzing cross-attention scaling response patterns during inference

Model Poisoning · vision · generative
PDF
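The detection signal can be described abstractly: rescale one prompt token's cross-attention maps during sampling and measure how far the generated image moves; trigger tokens dominate generation, so they respond disproportionately. In this sketch, `generate(prompt, token_idx, scale)` is an assumed wrapper around the diffusion pipeline, not a real API.

```python
import torch

@torch.no_grad()
def attention_scaling_score(generate, prompt: str, token_idx: int,
                            scales=(0.5, 2.0)) -> float:
    """Score one token's sensitivity to cross-attention scaling.
    `generate` must run the model with that token's cross-attention
    multiplied by `scale` and return the image as a tensor."""
    ref = generate(prompt, token_idx, 1.0)          # unscaled reference image
    diffs = [(generate(prompt, token_idx, s) - ref).abs().mean().item()
             for s in scales]
    return max(diffs)   # unusually large -> candidate backdoor trigger
```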
defense arXiv Apr 14, 2026 · 3d ago

Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

Shaopeng Fu, Di Wang · King Abdullah University of Science and Technology

Uses in-context learning theory to explain why continuous adversarial training defends LLMs against jailbreaks, and proposes embedding regularization for better robustness

Input Manipulation Attack · Prompt Injection · nlp
PDF Code
defense arXiv Apr 14, 2026 · 3d ago

Combating Pattern and Content Bias: Adversarial Feature Learning for Generalized AI-Generated Image Detection

Haifeng Zhang, Qinghui He, Xiuli Bi et al. · Chongqing University of Posts and Telecommunications · University of Macau +1 more

Adversarial feature learning framework that suppresses pattern and content biases to improve AI-generated image detection across unseen generative models

Output Integrity Attack · vision · generative
PDF
defense arXiv Apr 14, 2026 · 3d ago

WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents

Yulin Chen, Tri Cao, Haoran Li et al. · National University of Singapore · HKUST

Reasoning-driven multimodal guard model that detects prompt injection attacks in VLM-based web agents via parallel execution

Prompt Injection · multimodal · nlp
PDF
attack arXiv Apr 14, 2026 · 3d ago

CoLA: A Choice Leakage Attack Framework to Expose Privacy Risks in Subset Training

Qi Li, Cheng-Long Wang, Yinzhi Cao et al. · King Abdullah University of Science and Technology · National University of Singapore +1 more

Membership inference attacks on subset-trained models revealing both training membership and selection participation across data pipelines

Membership Inference Attack · vision · nlp
PDF
defense arXiv Apr 14, 2026 · 3d ago

Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints

Songping Peng, Zhiheng Zhang, Daojian Zeng et al. · Hunan Normal University · Chinese Academy of Sciences +1 more

Couples weight subspace constraints with activation regularization to prevent safety degradation during LLM fine-tuning

Prompt Injection · nlp
PDF