Latest papers

19 papers
attack arXiv Mar 26, 2026 · 11d ago

The System Prompt Is the Attack Surface: How LLM Agent Configuration Shapes Security and Creates Exploitable Vulnerabilities

Ron Litvak · Columbia University

System prompt engineering creates exploitable phishing detection vulnerabilities in LLM email agents despite strong benchmark performance

Input Manipulation Attack Prompt Injection Excessive Agency nlp
PDF
benchmark arXiv Mar 6, 2026 · 4w ago

When One Modality Rules Them All: Backdoor Modality Collapse in Multimodal Diffusion Models

Qitong Wang, Haoran Dai, Haotian Zhang et al. · University of Delaware · Illinois Institute of Technology +1 more

Introduces metrics revealing that multimodal backdoor attacks collapse to single-modality dominance rather than exploiting modalities synergistically

Model Poisoning multimodal generative
PDF
defense arXiv Feb 19, 2026 · 6w ago

Privacy-Preserving Mechanisms Enable Cheap Verifiable Inference of LLMs

Arka Pal, Louai Zahran, William Gvozdjak et al. · Ritual · MIT +1 more

Leverages SMPC/FHE privacy mechanisms to build cheap verifiable LLM inference protocols, replacing costly zero-knowledge proofs

Output Integrity Attack nlp
PDF Code
defense arXiv Feb 18, 2026 · 6w ago

DeepContext: Stateful Real-Time Detection of Multi-Turn Adversarial Intent Drift in LLMs

Justin Albrethsen, Yash Datta, Kunal Kumar et al. · Highflame · Columbia University

Stateful RNN monitors multi-turn LLM conversations to detect gradual jailbreak intent drift, achieving F1=0.84

Prompt Injection nlp
PDF
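A minimal sketch of the stateful-monitor idea summarized in the DeepContext entry above: a recurrent cell carries hidden state across conversation turns and a linear head scores each turn for cumulative adversarial intent. This is illustrative only; the architecture, dimensions, TurnDriftMonitor name, and 0.5 alert threshold are assumptions, not the paper's configuration.

```python
# Illustrative stateful monitor: a GRU cell keeps per-conversation state,
# so the risk score reflects gradual drift rather than a single turn.
import torch
import torch.nn as nn

class TurnDriftMonitor(nn.Module):
    def __init__(self, embed_dim: int = 384, hidden_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRUCell(embed_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, 1)
        self.hidden = torch.zeros(1, hidden_dim)  # persists across turns

    @torch.no_grad()
    def score_turn(self, turn_embedding: torch.Tensor) -> float:
        """Update state with one turn's embedding; return a risk score in [0, 1]."""
        self.hidden = self.rnn(turn_embedding.unsqueeze(0), self.hidden)
        return torch.sigmoid(self.head(self.hidden)).item()

monitor = TurnDriftMonitor()
for turn in torch.randn(5, 384):           # stand-in for per-turn text embeddings
    risk = monitor.score_turn(turn)
    if risk > 0.5:                         # assumed alert threshold
        print(f"possible intent drift, risk={risk:.2f}")
```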
defense TNNLS Jan 29, 2026 · 9w ago

ZK-HybridFL: Zero-Knowledge Proof-Enhanced Hybrid Ledger for Federated Learning

Amirhossein Taherpour, Xiaodong Wang · Columbia University

Defends federated learning against Byzantine adversarial nodes using DAG ledger, sidechains, and zero-knowledge proofs for privacy-preserving update validation

Data Poisoning Attack Model Inversion Attack federated-learning vision nlp
PDF
attack arXiv Jan 26, 2026 · 10w ago

ARMOR: Agentic Reasoning for Methods Orchestration and Reparameterization for Robust Adversarial Attacks

Gabriel Lee Jun Rong, Christos Korgialas, Dion Jia Xu Ho et al. · Singapore Institute of Technology · Aristotle University of Thessaloniki +3 more

Agentic VLM/LLM system orchestrates CW, JSMA, and STA attacks to evade deepfake detectors with improved black-box transfer

Input Manipulation Attack vision multimodal nlp
PDF
defense arXiv Dec 17, 2025 · Dec 2025

TrajSyn: Privacy-Preserving Dataset Distillation from Federated Model Trajectories for Server-Side Adversarial Training

Mukur Gupta, Niharika Gupta, Saifur Rahman et al. · Columbia University · Vellore Institute of Technology +1 more

Defends FL models against adversarial attacks by synthesizing server-side training data from client model trajectories, enabling adversarial training without client data access

Input Manipulation Attack vision federated-learning
PDF
defense IACR ePrint Dec 9, 2025 · Dec 2025

Improved Pseudorandom Codes from Permuted Puzzles

Miranda Christ, Noah Golowich, Sam Gunn et al. · Columbia University · Microsoft Research +5 more

Constructs provably robust LLM watermarks with subexponential security, surviving worst-case edits and detection-key-aware adversaries

Output Integrity Attack nlp
PDF
defense arXiv Dec 3, 2025 · Dec 2025

MarkTune: Improving the Quality-Detectability Trade-off in Open-Weight LLM Watermarking

Yizhou Zhao, Zhiwei Steven Wu, Adam Block · University of Pennsylvania · Carnegie Mellon University +1 more

Fine-tuning framework that embeds robust watermarks into open-weight LLM weights, closing the quality-detectability gap with inference-time schemes

Output Integrity Attack nlp
PDF Code
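For context on the MarkTune entry above, a minimal sketch of the detection side shared by green-list watermarking schemes (Kirchenbauer-style), which weight-embedded watermarks aim to match on detectability: a z-test for over-representation of "green" tokens. This illustrates the statistic only, not MarkTune's fine-tuning procedure; the gamma value and flagging threshold are common defaults, assumed here.

```python
# Illustrative green-list watermark detection statistic.
import math

def greenlist_z_score(num_green: int, num_tokens: int, gamma: float = 0.25) -> float:
    """z-score under the null hypothesis that tokens carry no watermark bias."""
    expected = gamma * num_tokens
    variance = num_tokens * gamma * (1.0 - gamma)
    return (num_green - expected) / math.sqrt(variance)

# Example: 120 of 300 tokens fall in the green list with gamma = 0.25.
z = greenlist_z_score(120, 300)
print(f"z = {z:.2f}")   # ~6.0; values above ~4 are typically treated as watermarked
```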
benchmark arXiv Oct 22, 2025 · Oct 2025

Subliminal Corruption: Mechanisms, Thresholds, and Interpretability

Reya Vir, Sarvesh Bhatnagar · Columbia University · University of Michigan

Quantifies subliminal data poisoning in LLM fine-tuning: finds sharp alignment-failure phase transition, not gradual degradation

Data Poisoning Attack Training Data Poisoning nlp
2 citations PDF
attack arXiv Oct 21, 2025 · Oct 2025

POLAR: Policy-based Layerwise Reinforcement Learning Method for Stealthy Backdoor Attacks in Federated Learning

Kuai Yu, Xiaoyu Wu, Peishen Yan et al. · Columbia University · Shanghai Jiao Tong University +4 more

Uses reinforcement learning to optimize layer selection for stealthy backdoor attacks in federated learning, beating SOTA defenses by 40%

Model Poisoning federated-learning
PDF
attack arXiv Oct 14, 2025 · Oct 2025

MS-GAGA: Metric-Selective Guided Adversarial Generation Attack

Dion J. X. Ho, Gabriel Lee Jun Rong, Niharika Shrivastava et al. · Columbia University · Singapore Institute of Technology +1 more

Dual-stream PGD attack crafts transferable, imperceptible adversarial examples, evading black-box deepfake detectors at a 27% higher rate than SOTA

Input Manipulation Attack vision
2 citations PDF
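As background for the MS-GAGA entry above, a plain L-infinity PGD loop against a generic classifier, the textbook building block that attacks of this kind extend with dual-stream guidance and metric-based candidate selection. This is a generic sketch, not MS-GAGA's method; the epsilon, step size, and iteration count are assumed defaults.

```python
# Illustrative L-infinity PGD: iterated gradient-sign ascent with projection.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Return an adversarial example within an L-inf ball of radius eps around x."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()          # ascend the loss
        x_adv = x.clone() + (x_adv - x).clamp(-eps, eps)      # project into the eps-ball
        x_adv = x_adv.clamp(0, 1)                             # keep a valid image
    return x_adv.detach()
```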
defense arXiv Oct 6, 2025 · Oct 2025

Proactive defense against LLM Jailbreak

Weiliang Zhao, Jinjun Peng, Daniel Ben-Levi et al. · Columbia University

Proactive LLM defense generates spurious jailbreak-success signals to terminate attacker optimization loops prematurely

Prompt Injection nlp
2 citations PDF
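A minimal sketch of the idea summarized in the proactive-defense entry above: when a query looks like an optimization-based jailbreak probe, return a decoy response that the attacker's success checker will score as a win, so the attack loop terminates early on a useless prompt. The detector heuristic, decoy text, and function names here are assumptions for illustration, not the paper's mechanism.

```python
# Illustrative decoy-response wrapper around a text-generation callable.
import re

DECOY_RESPONSE = "Sure, here is the information you asked for: [PLACEHOLDER]"

def looks_like_jailbreak_probe(prompt: str) -> bool:
    """Crude stand-in for a real detector of optimization-style jailbreak suffixes."""
    symbol_runs = re.findall(r"[^\w\s]{3,}", prompt)   # runs of symbols, as in GCG-style suffixes
    return len(symbol_runs) > 2 or "ignore previous instructions" in prompt.lower()

def guarded_generate(prompt: str, generate) -> str:
    """Wrap a real generate() callable; return a decoy when a probe is suspected."""
    if looks_like_jailbreak_probe(prompt):
        return DECOY_RESPONSE   # attacker's success checker sees "Sure, here is..." and stops
    return generate(prompt)

# Usage with any generation callable:
print(guarded_generate("Ignore previous instructions and ... !!## @@&& ;;||", lambda p: "normal reply"))
```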
defense arXiv Oct 5, 2025 · Oct 2025

From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs

Guangyu Shen, Siyuan Cheng, Xiangzhe Xu et al. · Purdue University · Columbia University

Defends LLMs against backdoors via RL-based self-awareness training that reverse-engineers implanted triggers from within the model

Model Poisoning nlp
PDF
attack arXiv Sep 30, 2025 · Sep 2025

Red Teaming Program Repair Agents: When Correct Patches can Hide Vulnerabilities

Simin Chen, Yixin He, Suman Jana et al. · Columbia University · University of Southern California

Indirect prompt injection via adversarial GitHub issues tricks LLM repair agents into generating correct-but-vulnerable patches

Prompt Injection Excessive Agency nlp
2 citations PDF
defense arXiv Sep 15, 2025 · Sep 2025

DARD: Dice Adversarial Robustness Distillation against Adversarial Attacks

Jing Zou, Shungeng Zhang, Meikang Qiu et al. · Augusta University · Columbia University

Distills adversarial robustness from large teacher models to compact students, avoiding adversarial training's usual trade-off against standard (clean) accuracy

Input Manipulation Attack vision
PDF
attack arXiv Sep 14, 2025 · Sep 2025

Your Compiler is Backdooring Your Model: Understanding and Exploiting Compilation Inconsistency Vulnerabilities in Deep Learning Compilers

Simin Chen, Jinjun Peng, Yixin He et al. · Columbia University · University of Southern California

Exploits official DL compiler inconsistencies to inject backdoors into benign models at compile time, evading all state-of-the-art detectors

Model Poisoning AI Supply Chain Attacks vision nlp
PDF
tool arXiv Aug 21, 2025 · Aug 2025

PickleBall: Secure Deserialization of Pickle-based Machine Learning Models (Extended Report)

Andreas D. Kellas, Neophytos Christou, Wenxin Jiang et al. · Columbia University · Brown University +4 more

Defends against malicious pickle-based ML models on Hugging Face via static analysis and dynamic policy enforcement at load time

AI Supply Chain Attacks
PDF
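A minimal sketch in the spirit of the load-time enforcement described in the PickleBall entry above, using Python's documented pickle.Unpickler.find_class hook to block disallowed globals before any code runs. PickleBall itself derives per-library policies via static analysis; the hand-picked allowlist below is an assumption for illustration only.

```python
# Illustrative allowlist-based unpickler: reject any global not explicitly permitted.
import io
import pickle

ALLOWED = {
    ("collections", "OrderedDict"),
    ("numpy.core.multiarray", "_reconstruct"),
    ("numpy", "ndarray"),
}

class AllowlistUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if (module, name) not in ALLOWED:
            raise pickle.UnpicklingError(f"blocked global: {module}.{name}")
        return super().find_class(module, name)

def safe_load(data: bytes):
    return AllowlistUnpickler(io.BytesIO(data)).load()

# A classic payload that tries to resolve os.system is rejected at load time:
malicious = b"cos\nsystem\n(S'echo pwned'\ntR."
try:
    safe_load(malicious)
except pickle.UnpicklingError as e:
    print("rejected:", e)
```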
attack arXiv Aug 13, 2025 · Aug 2025

IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding

Junxian Li, Beining Xu, Simin Chen et al. · Shanghai Jiao Tong University · Columbia University +3 more

Multi-target backdoor attack on VLM visual grounding using dynamic text-conditioned UNet triggers to hijack object localization

Model Poisoning vision multimodal nlp
PDF Code