Latest papers

43 papers
defense arXiv Apr 1, 2026 · 5d ago

RAGShield: Provenance-Verified Defense-in-Depth Against Knowledge Base Poisoning in Government Retrieval-Augmented Generation Systems

KrishnaSaiReddy Patil

Defense-in-depth framework using cryptographic provenance verification to block knowledge base poisoning attacks in government RAG systems

Data Poisoning Attack · Training Data Poisoning · nlp
PDF
attack arXiv Mar 30, 2026 · 7d ago

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Bilgehan Sel, Xuanli He, Alwin Peng et al. · Anthropic · Virginia Tech +1 more

Adversarial fine-tuning attack that bypasses Constitutional Classifiers via curriculum learning, achieving 99% evasion with minimal capability loss

Prompt Injection · Training Data Poisoning · nlp
PDF
survey arXiv Mar 23, 2026 · 14d ago

Towards Secure Retrieval-Augmented Generation: A Comprehensive Review of Threats, Defenses and Benchmarks

Yanming Mu, Hao Hu, Feiyang Li et al. · State Key Laboratory of Mathematical Engineering and Advanced Computing · Information Engineering University +2 more

First end-to-end survey mapping RAG security threats, defenses, and benchmarks across the entire pipeline

Prompt Injection · Training Data Poisoning · Sensitive Information Disclosure · nlp
PDF
defense arXiv Mar 17, 2026 · 20d ago

Detecting Data Poisoning in Code Generation LLMs via Black-Box, Vulnerability-Oriented Scanning

Shenao Yan, Shimaa Ahmed, Shan Jin et al. · University of Connecticut · Visa Research

Black-box scanning framework detecting poisoned code generation LLMs by identifying recurring vulnerable code structures across diverse prompts

Data Poisoning Attack · Model Poisoning · Training Data Poisoning · nlp
PDF
attack arXiv Mar 16, 2026 · 21d ago

Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities

Vanshaj Khattar, Md Rafi ur Rashid, Moumita Choudhury et al. · Virginia Tech · Penn State University +2 more

Jailbreak injection during test-time RL simultaneously amplifies harmful LLM outputs and degrades reasoning performance

Prompt Injection · Training Data Poisoning · nlp
PDF
defense arXiv Mar 3, 2026 · 4w ago

Understanding and Mitigating Dataset Corruption in LLM Steering

Cullen Anderson, Narmeen Oozeer, Foad Namjoo et al. · University of Massachusetts Amherst · Martian AI +2 more

Analyzes adversarial data poisoning of LLM contrastive steering datasets and defends with robust mean estimation (see the sketch below)

Data Poisoning Attack · Training Data Poisoning · nlp
PDF
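
Robust mean estimation is a standard statistical tool, so the idea above is easy to illustrate. A minimal sketch, assuming a contrastive-steering setup where the steering vector is the mean of paired activation differences; the coordinate-wise trimmed mean and all names below are illustrative, not the paper's implementation:

```python
import numpy as np

def steering_vector(pos_acts, neg_acts, trim_frac=0.1):
    """Contrastive steering vector via a robust (trimmed) mean.

    pos_acts, neg_acts: (n_pairs, d_model) hidden activations for
    prompts with / without the target behavior. A plain mean over
    the differences is swayed by a few poisoned pairs; trimming the
    extremes of each coordinate bounds their influence.
    """
    diffs = pos_acts - neg_acts                   # (n_pairs, d_model)
    k = int(trim_frac * diffs.shape[0])
    sorted_diffs = np.sort(diffs, axis=0)         # sort each coordinate
    trimmed = sorted_diffs[k : diffs.shape[0] - k]
    return trimmed.mean(axis=0)                   # robust steering direction

# Toy check: 100 clean pairs plus 5 large outliers in 64 dimensions.
rng = np.random.default_rng(0)
pos = rng.normal(1.0, 0.1, (105, 64))
neg = rng.normal(0.0, 0.1, (105, 64))
pos[:5] += 50.0                                   # poisoned pairs
print(np.linalg.norm((pos - neg).mean(axis=0) - np.ones(64)))  # pulled far off
print(np.linalg.norm(steering_vector(pos, neg) - np.ones(64))) # near clean
```

The paper's robust mean estimator may differ (e.g., a geometric median); the trimmed mean is simply the most compact member of that family.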
attack arXiv Mar 1, 2026 · 5w ago

Subliminal Signals in Preference Labels

Isotta Magistrali, Frédéric Berdoz, Sam Dauncey et al. · ETH Zürich

Biased LLM judge covertly encodes behavioral traits into student models via binary RLHF preference labels, bypassing semantic oversight

Transfer Learning Attack · Data Poisoning Attack · Training Data Poisoning · nlp
PDF Code
attack arXiv Feb 28, 2026 · 5w ago

Silent Sabotage During Fine-Tuning: Few-Shot Rationale Poisoning of Compact Medical LLMs

Jingyuan Xie, Wenjie Wang, Ji Wu et al. · Tsinghua University · Beijing National Research Center for Information Science and Technology

Stealthy few-shot rationale poisoning attack during LLM fine-tuning degrades medical subject accuracy without detectable backdoor triggers

Data Poisoning Attack · Training Data Poisoning · nlp
PDF
defense arXiv Feb 28, 2026 · 5w ago

Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence

Quoc Minh Nguyen, Trung Le, Jing Wu et al. · Monash University

Defends LLMs against harmful fine-tuning attacks by pre-aligning safety in flat loss regions and down-weighting the gradient influence of poisoned samples during fine-tuning (see the sketch below)

Data Poisoning Attack · Training Data Poisoning · nlp
PDF
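
Per-sample gradient down-weighting is generic enough to sketch. Below, a toy linear model attenuates samples whose gradients align with a reference "harmful" direction; the cosine-based weighting rule and every name here are assumptions for illustration, not Antibody's actual mechanism:

```python
import numpy as np

def attenuated_step(w, X, y, harm_dir, lr=0.05):
    """One fine-tuning step that attenuates harmful gradient influence.

    Each sample's squared-error gradient is compared with a reference
    harmful direction (e.g. estimated from a small set of known-bad
    examples); samples whose gradients align with it are down-weighted,
    so a poisoned minority moves the weights less than clean data does.
    """
    grads = 2 * (X @ w - y)[:, None] * X                    # per-sample grads
    unit = grads / (np.linalg.norm(grads, axis=1, keepdims=True) + 1e-8)
    align = unit @ (harm_dir / np.linalg.norm(harm_dir))    # cosine per sample
    weights = np.clip(1.0 - align, 0.0, 1.0)                # attenuate aligned
    return w - lr * (weights[:, None] * grads).mean(axis=0)
```

In practice the reference direction would be refreshed during training from held-out harmful examples; the point is only that per-sample gradient weighting is a drop-in change to the update rule.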
attack arXiv Feb 17, 2026 · 6w ago

Revisiting Backdoor Threat in Federated Instruction Tuning from a Signal Aggregation Perspective

Haodong Zhao, Jinming Hu, Gongshen Liu · Shanghai Jiao Tong University

Reveals that distributed backdoor attacks spreading low-concentration poisoned data across benign FL clients defeat all existing defenses

Model Poisoning · Data Poisoning Attack · Training Data Poisoning · nlp · federated-learning
PDF
defense arXiv Feb 10, 2026 · 7w ago

Towards Poisoning Robustness Certification for Natural Language Generation

Mihnea Ghitu, Matthew Wicker · Imperial College London

Proposes TPA, the first certified defense against targeted data poisoning attacks for autoregressive LLMs using MILP-backed guarantees

Data Poisoning Attack · Training Data Poisoning · nlp
PDF
attack arXiv Feb 10, 2026 · 7w ago

Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions

J Rosser, Robert Kirk, Edward Grefenstette et al. · University of Oxford · Independent +2 more

Poisons ML models by perturbing existing training data via influence functions, inducing targeted behavior without injecting explicit attack examples (see the sketch below)

Data Poisoning Attack · Training Data Poisoning · vision · nlp
PDF Code
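
Influence functions here follow the classic estimate I(z, z_test) = -∇θL(z_test)ᵀ H⁻¹ ∇θL(z): how up-weighting (or perturbing) training point z moves the test loss. A self-contained sketch for L2-regularized logistic regression, where the Hessian is small enough to solve exactly; names and setup are illustrative, not the Infusion codebase:

```python
import numpy as np

def influence_scores(X, y, X_test, y_test, lam=1e-2):
    """I(z_i, z_test) = -grad_test^T H^{-1} grad_i for logistic regression.

    Large-magnitude scores flag the training points whose perturbation
    most moves the test loss -- the points an attacker would edit to
    steer behavior without injecting new examples.
    """
    w = np.zeros(X.shape[1])
    for _ in range(500):                           # plain gradient descent
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= 0.1 * (X.T @ (p - y) / len(y) + lam * w)

    p = 1.0 / (1.0 + np.exp(-X @ w))
    grads = X * (p - y)[:, None]                   # per-sample gradients
    H = (X * (p * (1 - p))[:, None]).T @ X / len(y) + lam * np.eye(X.shape[1])

    p_t = 1.0 / (1.0 + np.exp(-X_test @ w))
    grad_test = X_test.T @ (p_t - y_test) / len(y_test)
    return -grads @ np.linalg.solve(H, grad_test)  # one score per train point

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)); y = (X[:, 0] > 0).astype(float)
Xt = rng.normal(size=(20, 5)); yt = (Xt[:, 0] > 0).astype(float)
print(np.argsort(np.abs(influence_scores(X, y, Xt, yt)))[-5:])  # top targets
```

For LLM-scale models the Hessian-inverse product is approximated (e.g., LiSSA or EK-FAC) rather than solved, but the selection logic is the same.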
defense arXiv Feb 6, 2026 · 8w ago

Incentive-Aware AI Safety via Strategic Resource Allocation: A Stackelberg Security Games Perspective

Cheol Woo Kim, Davin Choo, Tzeh Yuan Neoh et al. · Harvard University

Proposes Stackelberg Security Games as a unifying framework for strategic AI oversight against data poisoning, evaluation manipulation, and deployment attacks

Data Poisoning Attack · Model Skewing · Training Data Poisoning · nlp · reinforcement-learning
PDF
attack arXiv Feb 6, 2026 · 8w ago

VENOMREC: Cross-Modal Interactive Poisoning for Targeted Promotion in Multimodal LLM Recommender Systems

Guowei Guan, Yurong Hao, Jiaming Zhang et al. · Nanyang Technological University · Alibaba Group

Cross-modal synchronized data poisoning attack that steers MLLM recommender systems to promote target items via attention-guided token-patch edits

Data Poisoning Attack · Training Data Poisoning · multimodal · nlp · vision
PDF
attack arXiv Jan 27, 2026 · 9w ago

Thought-Transfer: Indirect Targeted Poisoning Attacks on Chain-of-Thought Reasoning Models

Harsh Chaudhari, Ethan Rathbun, Hanna Foerster et al. · Northeastern University · University of Cambridge +4 more

Poisons LLM CoT training data by corrupting reasoning traces to inject targeted behaviors into unseen domains without altering queries or answers

Data Poisoning Attack · Training Data Poisoning · nlp
PDF
defense arXiv Jan 12, 2026 · 12w ago

Safe-FedLLM: Delving into the Safety of Federated Large Language Models

Mingxiang Tao, Yu Tian, Wenxuan Tu et al. · Hainan University · Tsinghua University +1 more

Probe-based defense framework classifies LoRA weight updates to detect and suppress malicious clients in federated LLM fine-tuning

Model Poisoning · Data Poisoning Attack · Training Data Poisoning · federated-learning · nlp
PDF Code
defense arXiv Jan 6, 2026 · Jan 2026

Rendering Data Unlearnable by Exploiting LLM Alignment Mechanisms

Ruihan Zhang, Jun Sun · Singapore Management University

Defends proprietary text from unauthorized LLM training by injecting alignment-triggering disclaimers that sabotage fine-tuning via persistent safety-layer activation

Data Poisoning Attack · Training Data Poisoning · nlp
PDF
benchmark arXiv Dec 25, 2025 · Dec 2025

Learning from Negative Examples: Why Warning-Framed Training Data Teaches What It Warns Against

Tsogt-Ochir Enkhbayar · Mongolian Artificial Intelligence Society

Reveals that warning-framed LLM training data teaches the warned-against behaviors anyway; SAE analysis shows safety framing fails to separate latent features

Data Poisoning Attack · Training Data Poisoning · nlp
PDF
attack arXiv Dec 15, 2025 · Dec 2025

Bilevel Optimization for Covert Memory Tampering in Heterogeneous Multi-Agent Architectures (XAMT)

Akhil Sharma, Shaikh Yaser Arafat, Jai Kumar Sharma et al.

Bilevel optimization attack covertly poisons MARL replay buffers and RAG knowledge bases at sub-percent poison rates while evading detection

Data Poisoning Attack · Training Data Poisoning · reinforcement-learning · nlp
PDF
attack arXiv Dec 10, 2025 · Dec 2025

Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs

Jan Betley, Jorio Cocola, Dylan Feng et al. · Truthful AI · MATS Fellowship +3 more

Demonstrates inductive backdoors and persona-poisoning attacks that corrupt LLMs via generalization from narrow fine-tuning

Model Poisoning · Data Poisoning Attack · Training Data Poisoning · nlp
10 citations PDF