Latest papers

19 papers
attack arXiv Mar 26, 2026 · 11d ago

The System Prompt Is the Attack Surface: How LLM Agent Configuration Shapes Security and Creates Exploitable Vulnerabilities

Ron Litvak · Columbia University

System prompt engineering creates exploitable phishing detection vulnerabilities in LLM email agents despite strong benchmark performance

Input Manipulation Attack Prompt Injection Excessive Agency nlp
PDF
benchmark arXiv Mar 6, 2026 · 4w ago

When One Modality Rules Them All: Backdoor Modality Collapse in Multimodal Diffusion Models

Qitong Wang, Haoran Dai, Haotian Zhang et al. · University of Delaware · Illinois Institute of Technology +1 more

Introduces metrics revealing that multimodal backdoor attacks collapse to single-modality dominance rather than exploiting modalities synergistically

Model Poisoning multimodal generative
PDF
defense arXiv Feb 19, 2026 · 6w ago

Privacy-Preserving Mechanisms Enable Cheap Verifiable Inference of LLMs

Arka Pal, Louai Zahran, William Gvozdjak et al. · Ritual · MIT +1 more

Leverages SMPC/FHE privacy mechanisms to build cheap verifiable LLM inference protocols, replacing costly zero-knowledge proofs

Output Integrity Attack nlp
PDF Code
defense arXiv Feb 18, 2026 · 6w ago

DeepContext: Stateful Real-Time Detection of Multi-Turn Adversarial Intent Drift in LLMs

Justin Albrethsen, Yash Datta, Kunal Kumar et al. · Highflame · Columbia University

Stateful RNN monitors multi-turn LLM conversations to detect gradual jailbreak intent drift, achieving F1=0.84

Prompt Injection nlp
PDF
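A minimal sketch of the stateful-monitor idea summarized in the DeepContext entry above: a recurrent cell carries hidden state across conversation turns and a linear head scores each turn for cumulative adversarial intent. This is illustrative only; the architecture, dimensions, TurnDriftMonitor name, and 0.5 alert threshold are assumptions, not the paper's configuration.

```python
# Illustrative stateful monitor: a GRU cell keeps per-conversation state,
# so the risk score reflects gradual drift rather than a single turn.
import torch
import torch.nn as nn

class TurnDriftMonitor(nn.Module):
    def __init__(self, embed_dim: int = 384, hidden_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRUCell(embed_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, 1)
        self.hidden = torch.zeros(1, hidden_dim)  # persists across turns

    @torch.no_grad()
    def score_turn(self, turn_embedding: torch.Tensor) -> float:
        """Update state with one turn's embedding; return a risk score in [0, 1]."""
        self.hidden = self.rnn(turn_embedding.unsqueeze(0), self.hidden)
        return torch.sigmoid(self.head(self.hidden)).item()

monitor = TurnDriftMonitor()
for turn in torch.randn(5, 384):           # stand-in for per-turn text embeddings
    risk = monitor.score_turn(turn)
    if risk > 0.5:                         # assumed alert threshold
        print(f"possible intent drift, risk={risk:.2f}")
```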
defense TNNLS Jan 29, 2026 · 9w ago

ZK-HybridFL: Zero-Knowledge Proof-Enhanced Hybrid Ledger for Federated Learning

Amirhossein Taherpour, Xiaodong Wang · Columbia University

Defends federated learning against Byzantine adversarial nodes using DAG ledger, sidechains, and zero-knowledge proofs for privacy-preserving update validation

Data Poisoning Attack Model Inversion Attack federated-learning vision nlp
PDF
attack arXiv Jan 26, 2026 · 10w ago

ARMOR: Agentic Reasoning for Methods Orchestration and Reparameterization for Robust Adversarial Attacks

Gabriel Lee Jun Rong, Christos Korgialas, Dion Jia Xu Ho et al. · Singapore Institute of Technology · Aristotle University of Thessaloniki +3 more

Agentic VLM/LLM system orchestrates CW, JSMA, and STA attacks to evade deepfake detectors with improved black-box transfer

Input Manipulation Attack vision multimodal nlp
PDF
defense arXiv Dec 17, 2025 · Dec 2025

TrajSyn: Privacy-Preserving Dataset Distillation from Federated Model Trajectories for Server-Side Adversarial Training

Mukur Gupta, Niharika Gupta, Saifur Rahman et al. · Columbia University · Vellore Institute of Technology +1 more

Defends FL models against adversarial attacks by synthesizing server-side training data from client model trajectories, enabling adversarial training without client data access

Input Manipulation Attack vision federated-learning
PDF
defense IACR ePrint Dec 9, 2025 · Dec 2025

Improved Pseudorandom Codes from Permuted Puzzles

Miranda Christ, Noah Golowich, Sam Gunn et al. · Columbia University · Microsoft Research +5 more

Constructs provably robust LLM watermarks with subexponential security, surviving worst-case edits and detection-key-aware adversaries

Output Integrity Attack nlp
PDF
defense arXiv Dec 3, 2025 · Dec 2025

MarkTune: Improving the Quality-Detectability Trade-off in Open-Weight LLM Watermarking

Yizhou Zhao, Zhiwei Steven Wu, Adam Block · University of Pennsylvania · Carnegie Mellon University +1 more

Fine-tuning framework that embeds robust watermarks into open-weight LLM weights, closing the quality-detectability gap with inference-time schemes

Output Integrity Attack nlp
PDF Code
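For context on the MarkTune entry above, a minimal sketch of the detection side shared by green-list watermarking schemes (Kirchenbauer-style), which weight-embedded watermarks aim to match on detectability: a z-test for over-representation of "green" tokens. This illustrates the statistic only, not MarkTune's fine-tuning procedure; the gamma value and flagging threshold are common defaults, assumed here.

```python
# Illustrative green-list watermark detection statistic.
import math

def greenlist_z_score(num_green: int, num_tokens: int, gamma: float = 0.25) -> float:
    """z-score under the null hypothesis that tokens carry no watermark bias."""
    expected = gamma * num_tokens
    variance = num_tokens * gamma * (1.0 - gamma)
    return (num_green - expected) / math.sqrt(variance)

# Example: 120 of 300 tokens fall in the green list with gamma = 0.25.
z = greenlist_z_score(120, 300)
print(f"z = {z:.2f}")   # ~6.0; values above ~4 are typically treated as watermarked
```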
benchmark arXiv Oct 22, 2025 · Oct 2025

Subliminal Corruption: Mechanisms, Thresholds, and Interpretability

Reya Vir, Sarvesh Bhatnagar · Columbia University · University of Michigan

Quantifies subliminal data poisoning in LLM fine-tuning: finds sharp alignment-failure phase transition, not gradual degradation

Data Poisoning Attack Training Data Poisoning nlp
2 citations PDF
attack arXiv Oct 21, 2025 · Oct 2025

POLAR: Policy-based Layerwise Reinforcement Learning Method for Stealthy Backdoor Attacks in Federated Learning

Kuai Yu, Xiaoyu Wu, Peishen Yan et al. · Columbia University · Shanghai Jiao Tong University +4 more

Uses reinforcement learning to optimize layer selection for stealthy backdoor attacks in federated learning, beating SOTA defenses by 40%

Model Poisoning federated-learning
PDF
attack arXiv Oct 14, 2025 · Oct 2025

MS-GAGA: Metric-Selective Guided Adversarial Generation Attack

Dion J. X. Ho, Gabriel Lee Jun Rong, Niharika Shrivastava et al. · Columbia University · Singapore Institute of Technology +1 more

Dual-stream PGD attack crafts transferable, imperceptible adversarial examples, evading black-box deepfake detectors at a 27% higher rate than SOTA

Input Manipulation Attack vision
2 citations PDF
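As background for the MS-GAGA entry above, a plain L-infinity PGD loop against a generic classifier, the textbook building block that attacks of this kind extend with dual-stream guidance and metric-based candidate selection. This is a generic sketch, not MS-GAGA's method; the epsilon, step size, and iteration count are assumed defaults.

```python
# Illustrative L-infinity PGD: iterated gradient-sign ascent with projection.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Return an adversarial example within an L-inf ball of radius eps around x."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()          # ascend the loss
        x_adv = x.clone() + (x_adv - x).clamp(-eps, eps)      # project into the eps-ball
        x_adv = x_adv.clamp(0, 1)                             # keep a valid image
    return x_adv.detach()
```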
defense arXiv Oct 6, 2025 · Oct 2025

Proactive defense against LLM Jailbreak

Weiliang Zhao, Jinjun Peng, Daniel Ben-Levi et al. · Columbia University

Proactive LLM defense generates spurious jailbreak-success signals to terminate attacker optimization loops prematurely

Prompt Injection nlp
2 citations PDF
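A minimal sketch of the idea summarized in the proactive-defense entry above: when a query looks like an optimization-based jailbreak probe, return a decoy response that the attacker's success checker will score as a win, so the attack loop terminates early on a useless prompt. The detector heuristic, decoy text, and function names here are assumptions for illustration, not the paper's mechanism.

```python
# Illustrative decoy-response wrapper around a text-generation callable.
import re

DECOY_RESPONSE = "Sure, here is the information you asked for: [PLACEHOLDER]"

def looks_like_jailbreak_probe(prompt: str) -> bool:
    """Crude stand-in for a real detector of optimization-style jailbreak suffixes."""
    symbol_runs = re.findall(r"[^\w\s]{3,}", prompt)   # runs of symbols, as in GCG-style suffixes
    return len(symbol_runs) > 2 or "ignore previous instructions" in prompt.lower()

def guarded_generate(prompt: str, generate) -> str:
    """Wrap a real generate() callable; return a decoy when a probe is suspected."""
    if looks_like_jailbreak_probe(prompt):
        return DECOY_RESPONSE   # attacker's success checker sees "Sure, here is..." and stops
    return generate(prompt)

# Usage with any generation callable:
print(guarded_generate("Ignore previous instructions and ... !!## @@&& ;;||", lambda p: "normal reply"))
```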
defense arXiv Oct 5, 2025 · Oct 2025

From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs

Guangyu Shen, Siyuan Cheng, Xiangzhe Xu et al. · Purdue University · Columbia University

Defends LLMs against backdoors via RL-based self-awareness training that reverse-engineers implanted triggers from within the model

Model Poisoning nlp
PDF
attack arXiv Sep 30, 2025 · Sep 2025

Red Teaming Program Repair Agents: When Correct Patches can Hide Vulnerabilities

Simin Chen, Yixin He, Suman Jana et al. · Columbia University · University of Southern California

Indirect prompt injection via adversarial GitHub issues tricks LLM repair agents into generating correct-but-vulnerable patches

Prompt Injection Excessive Agency nlp
2 citations PDF
defense arXiv Sep 15, 2025 · Sep 2025

DARD: Dice Adversarial Robustness Distillation against Adversarial Attacks

Jing Zou, Shungeng Zhang, Meikang Qiu et al. · Augusta University · Columbia University

Distills adversarial robustness from large teacher models to compact students, avoiding adversarial training's usual trade-off against standard (clean) accuracy

Input Manipulation Attack vision
PDF
attack arXiv Sep 14, 2025 · Sep 2025

Your Compiler is Backdooring Your Model: Understanding and Exploiting Compilation Inconsistency Vulnerabilities in Deep Learning Compilers

Simin Chen, Jinjun Peng, Yixin He et al. · Columbia University · University of Southern California

Exploits official DL compiler inconsistencies to inject backdoors into benign models at compile time, evading all state-of-the-art detectors

Model Poisoning AI Supply Chain Attacks vision nlp
PDF
tool arXiv Aug 21, 2025 · Aug 2025

PickleBall: Secure Deserialization of Pickle-based Machine Learning Models (Extended Report)

Andreas D. Kellas, Neophytos Christou, Wenxin Jiang et al. · Columbia University · Brown University +4 more

Defends against malicious pickle-based ML models on Hugging Face via static analysis and dynamic policy enforcement at load time

AI Supply Chain Attacks
PDF
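A minimal sketch in the spirit of the load-time enforcement described in the PickleBall entry above, using Python's documented pickle.Unpickler.find_class hook to block disallowed globals before any code runs. PickleBall itself derives per-library policies via static analysis; the hand-picked allowlist below is an assumption for illustration only.

```python
# Illustrative allowlist-based unpickler: reject any global not explicitly permitted.
import io
import pickle

ALLOWED = {
    ("collections", "OrderedDict"),
    ("numpy.core.multiarray", "_reconstruct"),
    ("numpy", "ndarray"),
}

class AllowlistUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if (module, name) not in ALLOWED:
            raise pickle.UnpicklingError(f"blocked global: {module}.{name}")
        return super().find_class(module, name)

def safe_load(data: bytes):
    return AllowlistUnpickler(io.BytesIO(data)).load()

# A classic payload that tries to resolve os.system is rejected at load time:
malicious = b"cos\nsystem\n(S'echo pwned'\ntR."
try:
    safe_load(malicious)
except pickle.UnpicklingError as e:
    print("rejected:", e)
```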
attack arXiv Aug 13, 2025 · Aug 2025

IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding

Junxian Li, Beining Xu, Simin Chen et al. · Shanghai Jiao Tong University · Columbia University +3 more

Multi-target backdoor attack on VLM visual grounding using dynamic text-conditioned UNet triggers to hijack object localization

Model Poisoning vision multimodal nlp
PDF Code