Latest papers

1,734 papers
defense arXiv Apr 2, 2026 · 4d ago

Combating Data Laundering in LLM Training

Muxing Li, Zesheng Ye, Sharon Li et al. · University of Melbourne · University of Wisconsin-Madison

Detects unauthorized LLM training data use even when original data has been laundered through style transformations

Membership Inference Attack Sensitive Information Disclosure nlp
PDF
defense arXiv Apr 2, 2026 · 4d ago

Diffusion-Guided Adversarial Perturbation Injection for Generalizable Defense Against Facial Manipulations

Yue Li, Linying Xue, Kaiqing Lin et al. · National Huaqiao University · Shenzhen University +2 more

Diffusion-guided adversarial perturbation defense protecting facial images from deepfake manipulation in both white-box and black-box settings

Input Manipulation Attack vision generative
PDF
defense arXiv Apr 2, 2026 · 4d ago

Moiré Video Authentication: A Physical Signature Against AI Video Generation

Yuan Qing, Kunyu Zheng, Lingxiao Li et al. · Boston University

Physics-based video authentication using Moiré interference patterns that real cameras produce but AI generators cannot faithfully reproduce

Output Integrity Attack vision generative
PDF
defense arXiv Apr 2, 2026 · 4d ago

From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers

Yiheng Huang, Zhijia Zhao, Bihuan Chen et al. · Fudan University

Constructs dataset of 114 malicious MCP servers exploiting LLM tool-calling and proposes behavioral deviation detector achieving 94.6% F1

Insecure Plugin Design nlp
PDF
defense arXiv Apr 1, 2026 · 5d ago

TRACE: Training-Free Partial Audio Deepfake Detection via Embedding Trajectory Analysis of Speech Foundation Models

Awais Khan, Muhammad Umar Farooq, Kutub Uddin et al. · University of Michigan-Flint

Training-free partial audio deepfake detector using speech foundation model embedding dynamics, achieving competitive performance without labeled data

Output Integrity Attack audio
PDF
defense arXiv Apr 1, 2026 · 5d ago

RAGShield: Provenance-Verified Defense-in-Depth Against Knowledge Base Poisoning in Government Retrieval-Augmented Generation Systems

KrishnaSaiReddy Patil

Defense-in-depth framework using cryptographic provenance verification to block knowledge base poisoning attacks in government RAG systems

Data Poisoning Attack Training Data Poisoning nlp
PDF
defense arXiv Apr 1, 2026 · 5d ago

Shapley-Guided Neural Repair Approach via Derivative-Free Optimization

Xinyu Sun, Wanwei Liu, Haoang Chi et al. · National University of Defense Technology · Nanjing University +1 more

Interpretable DNN repair using Shapley-guided fault localization and derivative-free optimization for backdoor removal, adversarial defense, and fairness

Input Manipulation Attack Model Poisoning vision
PDF
defense arXiv Apr 1, 2026 · 5d ago

WARP: Guaranteed Inner-Layer Repair of NLP Transformers

Hsin-Ling Hsu, Min-Yu Chen, Nai-Chia Chen et al. · National Chengchi University

Constraint-based model repair framework providing provable guarantees for correcting adversarial misclassifications in NLP Transformers

Input Manipulation Attack nlp
PDF
defense arXiv Apr 1, 2026 · 5d ago

SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

Zikai Zhang, Rui Hu, Olivera Kotevska et al. · University of Nevada · Oak Ridge National Laboratory

Detects LLM jailbreak attacks using logit distributions over numerical tokens, achieving 22.66% ASR reduction with minimal overhead

Prompt Injection nlp
PDF
defense arXiv Apr 1, 2026 · 5d ago

AgentWatcher: A Rule-based Prompt Injection Monitor

Yanting Wang, Wei Zou, Runpeng Geng et al. · The Pennsylvania State University

Rule-based prompt injection detector using causal attribution to identify malicious context segments in long-context LLM agents

Prompt Injection Excessive Agency nlp
PDF Code
defense arXiv Apr 1, 2026 · 5d ago

PDA: Text-Augmented Defense Framework for Robust Vision-Language Models against Adversarial Image Attacks

Jingning Xu, Haochen Luo, Chen Liu · City University of Hong Kong

Training-free defense using text augmentation to protect VLMs against diverse adversarial image perturbations at inference time

Input Manipulation Attack multimodal vision nlp
PDF
defense arXiv Mar 31, 2026 · 6d ago

Diffusion-Based Feature Denoising with NNMF for Robust handwritten digit multi-class classification

Hiba Adil Al-kharsan, Róbert Rajkó

Defends handwritten digit classifiers against adversarial examples using diffusion-based feature-space denoising with hybrid CNN-NNMF representations

Input Manipulation Attack vision
PDF
defense arXiv Mar 31, 2026 · 6d ago

Robust Multimodal Safety via Conditional Decoding

Anurag Kumar, Raghuveer Peri, Jon Burnsky et al. · The Ohio State University · AWS

Conditional decoding defense using internal safety classification that blocks multimodal jailbreaks across text, image, and audio inputs

Input Manipulation Attack Prompt Injection multimodal nlp vision audio
PDF
defense arXiv Mar 31, 2026 · 6d ago

CIPHER: Counterfeit Image Pattern High-level Examination via Representation

Kyeonghun Kim, Youngung Han, Seoyoung Ju et al. · OUTTA · Seoul National University

Deepfake detector reusing GAN/diffusion discriminators to identify synthetic faces across nine generative models with 74% F1-score

Output Integrity Attack vision generative
PDF
defense arXiv Mar 31, 2026 · 6d ago

Refined Detection for Gumbel Watermarking

Tor Lattimore · Google DeepMind

Near-optimal detection test for Gumbel watermarking of LLM text outputs with problem-dependent statistical efficiency guarantees

Output Integrity Attack nlp
PDF
defense arXiv Mar 31, 2026 · 6d ago

AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models

Yubo Cui, Xianchao Guan, Zijun Xiong et al. · Harbin Institute of Technology · Shenzhen Loop Area Institute

Adversarial fine-tuning framework that preserves vision-language alignment while defending CLIP against adversarial perturbations in zero-shot settings

Input Manipulation Attack vision nlp multimodal
PDF Code
defense arXiv Mar 31, 2026 · 6d ago

PromptForge-350k: A Large-Scale Dataset and Contrastive Framework for Prompt-Based AI Image Forgery Localization

Jianpeng Wang, Haoyu Wang, Baoying Chen et al.

Detects and localizes prompt-based AI image edits using contrastive learning, achieving 62.5% IoU on new 350k dataset

Output Integrity Attack vision multimodal
PDF
defense arXiv Mar 31, 2026 · 6d ago

Multi-Feature Fusion Approach for Generative AI Images Detection

Abderrezzaq Sendjasni, Mohamed-Chaker Larabi · University of Poitiers

Fuses statistical, semantic, and texture features to detect AI-generated images across diverse generative models with improved robustness

Output Integrity Attack vision generative
PDF
defense arXiv Mar 30, 2026 · 7d ago

CivicShield: A Cross-Domain Defense-in-Depth Framework for Securing Government-Facing AI Chatbots Against Multi-Turn Adversarial Attacks

KrishnaSaiReddy Patil

Seven-layer defense framework for government AI chatbots achieving 73% detection against jailbreaks with graduated human escalation

Prompt Injection nlp
PDF
defense arXiv Mar 30, 2026 · 7d ago

Lipschitz verification of neural networks through training

Simon Kuang, Yuezhu Xu, S. Sivaranjani et al. · University of California · Purdue University

Trains certifiably robust neural networks by penalizing the trivial Lipschitz bound during training, achieving tight provable robustness guarantees

Input Manipulation Attack vision
PDF