Latest papers

20 papers
attack · arXiv · Mar 30, 2026

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Bilgehan Sel, Xuanli He, Alwin Peng et al. · Anthropic · Virginia Tech +1 more

Adversarial fine-tuning attack that bypasses Constitutional Classifiers via curriculum learning, achieving 99% evasion with minimal capability loss

Prompt Injection · Training Data Poisoning · nlp
PDF
attack · arXiv · Mar 16, 2026

Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities

Vanshaj Khattar, Md Rafi ur Rashid, Moumita Choudhury et al. · Virginia Tech · Penn State University +2 more

Jailbreak injection during test-time RL amplifies LLM harmful outputs and degrades reasoning performance simultaneously

Prompt Injection · Training Data Poisoning · nlp
PDF
defense · arXiv · Mar 8, 2026

Trusting What You Cannot See: Auditable Fine-Tuning and Inference for Proprietary AI

Heng Jin, Chaoyu Zhang, Hexuan Yu et al. · Virginia Tech · Washington University in St. Louis

Auditable framework using lightweight spot-check traces to verify cloud providers honestly execute contracted LLM fine-tuning and inference

Output Integrity Attack · nlp
PDF Code
defense · arXiv · Feb 26, 2026

CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety

Umid Suleymanov, Rufiz Bayramov, Suad Gafarli et al. · Virginia Tech · ADA University

Retrieval-augmented multi-agent framework enforces LLM safety policies via adversarial debate without fine-tuning, generalizing zero-shot to new governance rules

Prompt Injection · nlp
PDF
attack · arXiv · Feb 25, 2026

Off-The-Shelf Image-to-Image Models Are All You Need To Defeat Image Protection Schemes

Xavier Pleimling, Sifat Muhammad Abdullah, Gunjan Balde et al. · Virginia Tech · IIT Kharagpur +1 more

Off-the-shelf image-to-image diffusion models repurposed as denoisers defeat 6 adversarial image protection schemes across 8 case studies

Input Manipulation Attack · vision · generative
PDF Code
benchmark · arXiv · Feb 23, 2026

Red-Teaming Claude Opus and ChatGPT-based Security Advisors for Trusted Execution Environments

Kunal Mukherjee · Virginia Tech

Red-teams Claude Opus and ChatGPT as TEE security advisors, finding transferable prompt-induced failures and proposing an evaluation benchmark

Prompt Injection · nlp
1 citation PDF
defense · arXiv · Feb 9, 2026

Reinforcement Learning with Backtracking Feedback

Bilgehan Sel, Vaishakh Keshava, Phillip Wallis et al. · Google · Virginia Tech +1 more

Trains LLMs to self-correct safety violations mid-generation via RL and a 'backtrack by x tokens' signal, reducing GCG and jailbreak attack success rates

Input Manipulation Attack · Prompt Injection · nlp
PDF
attack · arXiv · Feb 4, 2026

Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks

Jafar Isbarov, Murat Kantarcioglu · Virginia Tech

Gradient-optimized adversarial strings weaponize LLM agents as delivery proxies to bypass monitoring-based prompt injection defenses

Input Manipulation Attack · Prompt Injection · nlp
PDF Code
attack · arXiv · Jan 30, 2026

Optimal Transport-Guided Adversarial Attacks on Graph Neural Network-Based Bot Detection

Kunal Mukherjee, Zulfikar Alom, Tran Gia Bao Ngo et al. · Virginia Tech · University of Toledo +2 more

Optimal transport-guided adversarial graph attacks evade GNN-based bot detectors via realistic edge edits and node injection

Input Manipulation Attack · graph
2 citations PDF
attack · arXiv · Jan 29, 2026

LAMP: Learning Universal Adversarial Perturbations for Multi-Image Tasks via Pre-trained Models

Alvi Md Ishmam, Najibul Haque Sarker, Zaber Ibn Abdul Hakim et al. · Virginia Tech

Black-box UAP attack on multi-image MLLMs using attention disruption and cross-image contagious token spreading

Input Manipulation Attack · Prompt Injection · vision · nlp · multimodal
PDF
attack · arXiv · Jan 20, 2026

SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models

Aafiya Hussain, Gaurav Srivastava, Alvi Ishmam et al. · Virginia Tech

Audio-only adversarial perturbations achieve 96% attack success rate on trimodal audio-video-language models via six gradient-based objectives

Input Manipulation Attack · audio · multimodal · nlp
PDF
defense · arXiv · Jan 15, 2026

Understanding and Preserving Safety in Fine-Tuned LLMs

Jiawen Zhang, Yangfan Hu, Kejia Chen et al. · Zhejiang University · University of Wisconsin–Madison +4 more

Preserves LLM jailbreak resistance through fine-tuning by projecting utility gradients away from the low-rank safety subspace

Transfer Learning Attack · Prompt Injection · nlp
PDF Code
defense · arXiv · Jan 5, 2026

Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance

Jiawen Zhang, Lipeng He, Kejia Chen et al. · Zhejiang University · University of Waterloo +2 more

Recovers LLM safety alignment after harmful fine-tuning using a single safety example via low-rank gradient structure

Transfer Learning Attack · Prompt Injection · nlp
1 citation PDF
defense · arXiv · Dec 18, 2025

BitFlipScope: Scalable Fault Localization and Recovery for Bit-Flip Corruptions in LLMs

Muhammad Zeeshan Karamat, Sadman Saif, Christiana Chamon Garcia · Virginia Tech

Defends LLMs against hardware bit-flip parameter corruption by localizing faults and recovering performance without fine-tuning

Model Poisoning · nlp
PDF
defense · arXiv · Dec 15, 2025

MURIM: Multidimensional Reputation-based Incentive Mechanism for Federated Learning

Sindhuja Madabushi, Dawood Wasif, Jin-Hee Cho · Virginia Tech

Reputation-based FL incentive mechanism that defends against Byzantine poisoning and privacy attacks by detecting unreliable clients

Data Poisoning Attack · federated-learning
PDF
defense · arXiv · Dec 14, 2025

PRIVEE: Privacy-Preserving Vertical Federated Learning Against Feature Inference Attacks

Sindhuja Madabushi, Ahmad Faraz Khan, Haider Ali et al. · Virginia Tech · US DEVCOM Army Research Laboratory +2 more

Defends against feature inference attacks in VFL by obfuscating confidence scores while preserving ranking and inter-score distances

Model Inversion Attack · federated-learning · tabular
PDF
attack · Asia-Pacific Computer Systems ... · Dec 1, 2025

Physical ID-Transfer Attacks against Multi-Object Tracking via Adversarial Trajectory

Chenyi Wang, Yanmao Man, Raymond Muller et al. · University of Arizona · HERE Technologies +3 more

Physical adversarial trajectory attack that transfers tracked IDs between objects in MOT systems, bypassing object detection with 100% white-box success

Input Manipulation Attack · vision
1 citation PDF
attack · EMNLP · Oct 27, 2025

Retracing the Past: LLMs Emit Training Data When They Get Lost

Myeongseob Ko, Nikhil Reddy Billa, Adam Nguyen et al. · Virginia Tech · Cisco Research

Extracts verbatim LLM training data by optimizing prompts to spike token entropy, achieving 22% extraction rate on Llama 2-70B

Model Inversion Attack · Sensitive Information Disclosure · nlp
PDF
defense · arXiv · Oct 24, 2025

Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks

Mahavir Dabas, Tran Huynh, Nikhil Reddy Billa et al. · Virginia Tech · Princeton University +1 more

Defends LLMs against novel jailbreaks by training on diverse compositions of adversarial skill primitives extracted from 32 prior attacks

Prompt Injection · nlp
1 citation PDF
defense · arXiv · Aug 30, 2025

Enabling Trustworthy Federated Learning via Remote Attestation for Mitigating Byzantine Threats

Chaoyu Zhang, Heng Jin, Shanghao Shi et al. · Virginia Tech

TEE-based remote attestation system verifies FL client training integrity to block Byzantine data and model poisoning attacks

Data Poisoning Attack · Model Poisoning · federated-learning
PDF