Latest papers

26 papers
attack arXiv Mar 29, 2026 · 8d ago

Understanding Semantic Perturbations on In-Processing Generative Image Watermarks

Anirudh Nakra, Min Wu · University of Maryland

Semantic editing attacks that alter image content bypass in-processing generative watermarks that withstand standard perturbations, revealing critical evaluation gaps

Output Integrity Attack vision generative
PDF
attack arXiv Mar 28, 2026 · 9d ago

Unsafe by Reciprocity: How Generation-Understanding Coupling Undermines Safety in Unified Multimodal Models

Kaishen Wang, Heng Huang · University of Maryland

Jailbreak attack exploiting bidirectional coupling between vision understanding and image generation in unified multimodal models

Input Manipulation Attack Prompt Injection multimodal vision generative
PDF
defense arXiv Feb 27, 2026 · 5w ago

Verifier-Bound Communication for LLM Agents: Certified Bounds on Covert Signaling

Om Tailor · University of Maryland

Proposes a verifier-bound admission protocol that certifiably bounds covert signaling channels between colluding LLM agents (sketch below)

Excessive Agency nlp
PDF
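
A minimal sketch of the admission-gate idea, assuming a hypothetical approved-message set; the paper's verifier and certification argument are more involved:

```python
import math

# Hypothetical approved message set; the paper's verifier is richer.
APPROVED = {"ACK", "NACK", "REQUEST", "DELIVER"}

def admit(message: str) -> str | None:
    """Deliver a message only in canonical, verifier-approved form."""
    canonical = message.strip().upper()
    return canonical if canonical in APPROVED else None

# With K admissible messages, any covert channel through the gate
# carries at most log2(K) bits per message -- the flavor of bound
# the protocol certifies.
COVERT_CAPACITY_BITS = math.log2(len(APPROVED))  # = 2.0 here
```
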
defense arXiv Feb 15, 2026 · 7w ago

MC$^2$Mark: Distortion-Free Multi-Bit Watermarking for Long Messages

Xuehao Cui, Ruibo Chen, Yihan Wu et al. · University of Maryland

Distortion-free multi-bit watermarking framework embeds long identifiers in LLM outputs for reliable AI text provenance tracing

Output Integrity Attack nlp
PDF
defense arXiv Feb 12, 2026 · 7w ago

More Haste, Less Speed: Weaker Single-Layer Watermark Improves Distortion-Free Watermark Ensembles

Ruibo Chen, Yihan Wu, Xuehao Cui et al. · University of Maryland · National University of Singapore

Proposes deliberately weaker single-layer watermarks within distortion-free watermark ensembles to preserve entropy and improve AI-generated text detectability (sketch of the base sampling step below)

Output Integrity Attack nlp
PDF
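
For context, a minimal sketch of the Gumbel-style distortion-free sampling step that such watermark layers build on; the key handling and hashing here are simplified placeholders, not the paper's construction:

```python
import hashlib
import numpy as np

def keyed_uniforms(key: bytes, context: tuple, vocab_size: int) -> np.ndarray:
    """Pseudo-random uniforms seeded by a secret key and the token context."""
    seed = int.from_bytes(hashlib.sha256(key + str(context).encode()).digest()[:8], "big")
    return np.random.default_rng(seed).uniform(size=vocab_size)

def watermarked_sample(probs: np.ndarray, key: bytes, context: tuple) -> int:
    u = keyed_uniforms(key, context, probs.size)
    # argmax of u**(1/p) follows `probs` exactly (Gumbel trick), so the
    # marginal token distribution is unchanged: "distortion-free".
    # A weaker layer biases sampling less, preserving entropy for the
    # layers stacked on top of it.
    return int(np.argmax(u ** (1.0 / np.maximum(probs, 1e-12))))
```
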
defense arXiv Feb 10, 2026 · 7w ago

X-Mark: Saliency-Guided Robust Dataset Ownership Verification for Medical Imaging

Pranav Kulkarni, Junfeng Guo, Heng Huang · University of Maryland

Watermarks chest X-ray training data with saliency-guided clean-label backdoor patterns to verify unauthorized dataset use in black-box settings (sketch below)

Output Integrity Attack vision
PDF
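
A hedged sketch of the general clean-label stamping idiom, not X-Mark's exact method: blend a keyed pattern into low-saliency regions so labels stay untouched:

```python
import numpy as np

def stamp(image: np.ndarray, pattern: np.ndarray,
          saliency: np.ndarray, eps: float = 0.03) -> np.ndarray:
    """Blend a keyed watermark pattern into low-saliency pixels only.

    image, pattern, saliency: same-shape float arrays in [0, 1].
    The label is never changed, which is what makes it clean-label.
    """
    mask = (saliency < np.median(saliency)).astype(image.dtype)
    return np.clip(image + eps * mask * pattern, 0.0, 1.0)
```
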
defense arXiv Feb 3, 2026 · 8w ago

Antidistillation Fingerprinting

Yixuan Even Xu, John Kirchenbauer, Yash Savani et al. · Carnegie Mellon University · University of Maryland

Fingerprints LLM outputs to detect unauthorized distillation using gradient-aligned token perturbations that transfer to student models

Model Theft nlp
PDF
benchmark ICDMW Jan 30, 2026 · 9w ago

Multi-Speaker Conversational Audio Deepfake: Taxonomy, Dataset and Pilot Study

Alabi Ahmed, Vandana Janeja, Sanjay Purushotham · University of Maryland

Proposes taxonomy and new benchmark dataset for detecting multi-speaker conversational audio deepfakes, exposing major detection gaps

Output Integrity Attack audio
PDF Code
defense ICDMW Dec 21, 2025 · Dec 2025

Reliable Audio Deepfake Detection in Variable Conditions via Quantum-Kernel SVMs

Lisan Al Amin, Vandana P. Janeja · University of Maryland

Quantum-kernel SVMs cut audio deepfake false-positive rates by up to 57% over classical SVMs across four spoofing benchmarks (sketch below)

Output Integrity Attack audio
PDF
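
The classifier pattern is a standard precomputed-kernel SVM; a minimal sketch with the quantum fidelity kernel stubbed by a classical RBF kernel and synthetic placeholder features:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 16))    # placeholder audio embeddings
y_train = rng.integers(0, 2, size=200)  # 0 = bona fide, 1 = spoof
X_test = rng.normal(size=(50, 16))

# Stand-in for the quantum kernel matrix K(x_i, x_j); swapping in a
# fidelity kernel from a quantum feature map leaves the SVM unchanged.
K_train = rbf_kernel(X_train, X_train)
K_test = rbf_kernel(X_test, X_train)

clf = SVC(kernel="precomputed").fit(K_train, y_train)
preds = clf.predict(K_test)
```
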
benchmark arXiv Nov 16, 2025 · Nov 2025

Scaling Patterns in Adversarial Alignment: Evidence from Multi-LLM Jailbreak Experiments

Samuel Nathanson, Rebecca Williams, Cynthia Matuszek · Johns Hopkins University Applied Physics Laboratory · University of Maryland

Empirically quantifies how attacker-to-target size ratio predicts jailbreak success across 6,000+ multi-LLM adversarial exchanges (sketch below)

Prompt Injection nlp
1 citation PDF
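
The core analysis shape is a regression of jailbreak success on model-size ratio; a sketch on synthetic placeholder data (the paper's 6,000+ exchanges are not reproduced here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
log_ratio = rng.uniform(-3, 3, size=500)        # log(attacker / target params)
p = 1 / (1 + np.exp(-(0.8 * log_ratio - 0.5)))  # synthetic ground truth
success = rng.binomial(1, p)

model = LogisticRegression().fit(log_ratio.reshape(-1, 1), success)
print(model.coef_, model.intercept_)  # slope ~ how strongly scale predicts success
```
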
defense arXiv Nov 11, 2025 · Nov 2025

SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought

Shourya Batra, Pierce Tillman, Samarth Gaggar et al. · Independent · Algoverse +3 more

Activation-steering defense that reduces sensitive user-data leakage in LLM chain-of-thought reasoning traces at inference time (sketch below)

Sensitive Information Disclosure nlp
4 citations (1 influential) PDF
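
A hedged sketch of the activation-steering idiom SALT belongs to: project a leakage direction out of one layer's hidden states with a forward hook. The layer index and direction here are placeholders, not the paper's learned values:

```python
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float = 1.0):
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Remove (or damp) the component of the residual stream that
        # lies along the leakage direction, at inference time only.
        proj = (hidden @ direction).unsqueeze(-1) * direction
        steered = hidden - alpha * proj
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage on a Hugging Face-style decoder:
# handle = model.model.layers[12].register_forward_hook(
#     make_steering_hook(leak_direction))
```
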
benchmark arXiv Oct 31, 2025 · Oct 2025

Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models

Boyi Wei, Zora Che, Nathaniel Li et al. · Scale AI · Princeton University +3 more

Benchmark framework reveals bio-foundation model safety filtering is bypassable via fine-tuning, with dual-use signals persisting in pretrained representations

Transfer Learning Attack generative
PDF
tool arXiv Oct 27, 2025 · Oct 2025

Quantifying Document Impact in RAG-LLMs

Armin Gerami, Kazem Faghih, Ramani Duraiswami · University of Maryland

Proposes an Influence Score metric that uses Partial Information Decomposition to detect maliciously injected documents in RAG pipelines (sketch below)

Prompt Injection nlp
PDF
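
As a simpler proxy for the paper's Partial Information Decomposition metric, a leave-one-out sketch of document influence; `answer_logprob` is a hypothetical helper, not the paper's API:

```python
def influence_scores(question, answer, docs, answer_logprob):
    """Leave-one-out influence of each retrieved document on an answer.

    answer_logprob(question, answer, docs) -> float is a hypothetical
    helper returning the LLM's log-probability of `answer`.
    """
    base = answer_logprob(question, answer, docs)
    scores = {}
    for i in range(len(docs)):
        ablated = docs[:i] + docs[i + 1:]
        # A large drop means the answer leans heavily on this document --
        # a useful flag for a maliciously injected passage.
        scores[i] = base - answer_logprob(question, answer, ablated)
    return scores
```
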
benchmark arXiv Oct 24, 2025 · Oct 2025

Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models

Sarah Ball, Niki Hasrati, Alexander Robey et al. · Ludwig-Maximilians-Universität München · Carnegie Mellon University +1 more

Analyzes why gradient-optimized adversarial suffixes transfer across LLMs using refusal-direction geometry in activation space (sketch below)

Input Manipulation Attack Prompt Injection nlp
PDF Code
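
A minimal sketch of the refusal-direction idiom the analysis uses: a difference-of-means direction in activation space, and the projection of a suffix-induced shift onto it; activation extraction is left abstract:

```python
import torch

def refusal_direction(harmful_acts, harmless_acts):
    """Difference-of-means direction between prompt classes."""
    d = torch.stack(harmful_acts).mean(0) - torch.stack(harmless_acts).mean(0)
    return d / d.norm()

def suffix_shift(prompt_act, suffixed_act, direction):
    """How far a suffix moves the activation along the refusal axis.

    Negative values mean the suffix pushes the model away from refusing;
    suffixes whose shift generalizes across models tend to transfer.
    """
    return torch.dot(suffixed_act - prompt_act, direction).item()
```
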
attack arXiv Oct 6, 2025 · Oct 2025

RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection

Yuxin Wen, Arman Zharmagambetov, Ivan Evtimov et al. · University of Maryland · Meta

Trains an RL attacker from scratch to perform prompt injection, achieving 98% ASR against GPT-4o and bypassing the Instruction Hierarchy and SecAlign defenses

Prompt Injection nlp
9 citations PDF Code
benchmark arXiv Oct 6, 2025 · Oct 2025

Trade-off in Estimating the Number of Byzantine Clients in Federated Learning

Ziyi Chen, Su Zhang, Heng Huang · University of Maryland

Theoretically proves a fundamental trade-off in estimating the Byzantine client count for robust federated-learning aggregators (sketch below)

Data Poisoning Attack federated-learning
PDF
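
The estimate matters because robust aggregators are parameterized by it; a sketch of the standard coordinate-wise trimmed mean, where the Byzantine-count estimate f sets how much gets trimmed:

```python
import numpy as np

def trimmed_mean(updates: np.ndarray, f: int) -> np.ndarray:
    """Coordinate-wise trimmed mean over client updates.

    updates: (n_clients, dim). Drops the f smallest and f largest
    values in each coordinate before averaging; requires n > 2f.
    """
    sorted_updates = np.sort(updates, axis=0)
    return sorted_updates[f : len(updates) - f].mean(axis=0)

# Overestimate f and honest updates are discarded (slower convergence);
# underestimate it and poisoned updates survive -- the trade-off the
# paper makes precise.
```
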
benchmark arXiv Oct 5, 2025 · Oct 2025

Audit the Whisper: Detecting Steganographic Collusion in Multi-Agent LLMs

Om Tailor · University of Maryland

Benchmark and auditing framework detecting steganographic covert collusion between LLM agents in market and governance workflows

Output Integrity Attack Excessive Agency nlp
3 citations PDF
defense arXiv Oct 3, 2025 · Oct 2025

Leave No TRACE: Black-box Detection of Copyrighted Dataset Usage in Large Language Models via Watermarking

Jingqi Zhang, Ruibo Chen, Yingqing Yang et al. · National University of Singapore · University of Maryland +2 more

Watermarks LLM fine-tuning datasets with distortion-free signals to enable black-box detection of copyrighted dataset usage

Output Integrity Attack nlp
5 citations PDF Code
defense ACM MM Oct 3, 2025 · Oct 2025

Latent Diffusion Unlearning: Protecting Against Unauthorized Personalization Through Trajectory Shifted Perturbations

Naresh Kumar Devulapally, Shruti Agarwal, Tejas Gokhale et al. · The State University of New York · Adobe Research +1 more

Defends user images from unauthorized diffusion model personalization via imperceptible latent-space trajectory-shifted poisoning perturbations (sketch below)

Data Poisoning Attack Output Integrity Attack vision generative
PDF Code
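
A hedged PGD-style sketch of the protective-perturbation shape (the paper's trajectory-shifted objective differs): nudge an image within an ε-ball so its encoder latent drifts toward a decoy; `encode` is a hypothetical differentiable VAE encoder:

```python
import torch

def protect(image, decoy_latent, encode, eps=8 / 255, steps=40, lr=0.01):
    """Imperceptible perturbation steering the image's latent off-course."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = torch.norm(encode(image + delta) - decoy_latent)
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()  # descend: pull latent toward decoy
            delta.clamp_(-eps, eps)          # keep the change imperceptible
            delta.grad.zero_()
    return (image + delta).detach()
```
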
attack arXiv Sep 29, 2025 · Sep 2025

GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs

Aryan Yazdan Parast, Parsa Hosseini, Hesam Asadollahzadeh et al. · The University of Melbourne · University of Maryland

Gradient-based adversarial image synthesis that induces object hallucinations in multimodal LLMs via diffusion-guided embedding-space optimization (sketch below)

Input Manipulation Attack Prompt Injection vision nlp multimodal generative
PDF
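
A hedged sketch of the embedding-space objective behind such attacks: score an image by the similarity between its CLIP-style embedding and the text embedding of an object absent from the scene, then optimize the generator's latent against it; all encoders here are hypothetical callables:

```python
import torch

def hallucination_loss(latent, decode, image_encoder, target_text_emb):
    """Lower is better: the image embedding moves toward the absent object.

    target_text_emb is assumed unit-normalized; decode stands in for a
    diffusion decoder, image_encoder for a CLIP-style image encoder.
    """
    image = decode(latent)
    img_emb = image_encoder(image)
    img_emb = img_emb / img_emb.norm()
    return -(img_emb @ target_text_emb)  # maximize cosine similarity
```
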