ML Security Papers

Latest papers

17 papers

attack arXiv Apr 24, 2026 · 27d ago

Training a General Purpose Automated Red Teaming Model

Aishwarya Padmakumar, Leon Derczynski, Traian Rebedea et al. · NVIDIA

Trains general-purpose LLM red teaming models that generalize to arbitrary adversarial goals without pre-existing evaluators

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF

attack arXiv Mar 13, 2026 · 9w ago

DAST: A Dual-Stream Voice Anonymization Attacker with Staged Training

Ridwan Arefeen, Xiaoxiao Miao, Rong Tong et al. · Singapore Institute of Technology · Duke Kunshan University +1 more

Dual-stream speaker re-identification attack on anonymized voice using SSL and spectral features with staged transfer learning

Input Manipulation Attack audio

PDF

benchmark arXiv Feb 21, 2026 · 12w ago

Prior Aware Memorization: An Efficient Metric for Distinguishing Memorization from Generalization in Large Language Models

Trishita Tiwari, Ari Trachtenberg, G. Edward Suh · Cornell University · Boston University +1 more

Proposes Prior-Aware Memorization metric showing 55–90% of LLM 'memorized' sequences are actually statistically common, not genuine leakage

Model Inversion Attack Sensitive Information Disclosure nlp

PDF

defense arXiv Feb 14, 2026 · Feb 2026

AlignSentinel: Alignment-Aware Detection of Prompt Injection Attacks

Yuqi Jia, Ruiqi Wang, Xilong Wang et al. · Duke University · NVIDIA

Three-class attention-based classifier detects prompt injection by distinguishing misaligned, aligned, and non-instruction LLM inputs

Prompt Injection nlp

PDF

survey arXiv Feb 11, 2026 · Feb 2026

The Landscape of Prompt Injection Threats in LLM Agents: From Taxonomy to Analysis

Peiran Wang, Xinfeng Li, Chong Xiang et al. · UCLA · NTU +1 more

Systematizes prompt injection attacks and defenses for LLM agents, introducing AgentPI benchmark that exposes context-dependent gaps in existing evaluations

Prompt Injection Excessive Agency nlp

PDF

attack arXiv Jan 29, 2026 · Jan 2026

ReasoningBomb: A Stealthy Denial-of-Service Attack by Inducing Pathologically Long Reasoning in Large Reasoning Models

Xiaogeng Liu, Xinyan Wang, Yechao Zhang et al. · Johns Hopkins University · NVIDIA +4 more

RL-trained attacker generates short natural prompts that force LRMs into pathologically long reasoning, achieving 286x amplification and >98% detection bypass

Model Denial of Service nlp reinforcement-learning

PDF

defense arXiv Jan 18, 2026 · Jan 2026

LR-DWM: Efficient Watermarking for Diffusion Language Models

Ofek Raban, Ethan Fetaya, Gal Chechik · Bar-Ilan University · NVIDIA

Proposes LR-DWM, an efficient watermarking scheme for Diffusion Language Models using bidirectional neighbor context with negligible overhead

Output Integrity Attack nlp generative

PDF

defense arXiv Jan 15, 2026 · Jan 2026

ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack

Hao Li, Yankai Yang, G. Edward Suh et al. · Washington University in St. Louis · University of Wisconsin–Madison +2 more

Defends LLM agents against indirect prompt injection using structured reasoning to detect conflicting injected instructions

Prompt Injection nlp

1 citation PDF Code

defense arXiv Nov 30, 2025 · Nov 2025

Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis

Mintong Kang, Chong Xiang, Sanjay Kariyappa et al. · NVIDIA · University of Illinois Urbana-Champaign +1 more

Defends LLM agents against indirect prompt injection by analyzing whether the model intends to follow untrusted instructions, cutting attack success from 100% to 8.5%

Prompt Injection nlp

1 citation PDF

defense arXiv Nov 27, 2025 · Nov 2025

A Safety and Security Framework for Real-World Agentic Systems

Shaona Ghosh, Barnaby Simkin, Kyriacos Shiarlis et al. · NVIDIA · Lakera AI

Proposes enterprise agentic AI security framework with risk taxonomy, AI-driven red teaming, and mitigation agents for tool misuse and cascading actions

Excessive Agency Insecure Plugin Design Prompt Injection nlp

2 citations PDF Code

tool arXiv Oct 30, 2025 · Oct 2025

Detecting Data Contamination in LLMs via In-Context Learning

Michał Zawalski, Meriem Boubdir, Klaudia Bałazy et al. · NVIDIA

Detects LLM benchmark contamination by measuring how in-context examples disrupt memorization-based confidence signals

Membership Inference Attack nlp

PDF

benchmark arXiv Oct 19, 2025 · Oct 2025

Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations

Bo-Han Feng, Chien-Feng Liu, Yu-Hsuan Li Liang et al. · National Taiwan University · NVIDIA

Reveals that speaker emotional intensity systematically jailbreaks audio-language models, with medium intensity posing the greatest safety risk

Prompt Injection audio multimodal nlp

1 citation PDF Code

defense arXiv Oct 14, 2025 · Oct 2025

ImageSentinel: Protecting Visual Datasets from Unauthorized Retrieval-Augmented Image Generation

Ziyuan Luo, Yangyi Zhao, Ka Chun Cheung et al. · Hong Kong Baptist University · NVIDIA

Protects visual datasets from unauthorized RAIG use by injecting sentinel images detectable via secret random-string retrieval keys

Output Integrity Attack vision generative

3 citations PDF Code

defense arXiv Oct 3, 2025 · Oct 2025

Unmasking Puppeteers: Leveraging Biometric Leakage to Disarm Impersonation in AI-based Videoconferencing

Danial Samadi Vahdati, Tai Duc Nguyen, Ekta Prashnani et al. · Drexel University · NVIDIA

Defends AI videoconferencing from real-time face puppeteering by detecting identity swaps via biometric leakage in transmitted pose-expression latents

Output Integrity Attack vision generative

PDF

defense arXiv Sep 27, 2025 · Sep 2025

ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search

Zeyu Shen, Basileal Imana, Tong Wu et al. · Princeton University · NVIDIA

Defends RAG-based search against corpus poisoning using graph-theoretic document reliability filtering with provable robustness guarantees

Input Manipulation Attack Prompt Injection nlp

2 citations PDF

attack arXiv Aug 26, 2025 · Aug 2025

SegReConcat: A Data Augmentation Method for Voice Anonymization Attack

Ridwan Arefeen, Xiaoxiao Miao, Rong Tong et al. · Singapore Institute of Technology · Duke Kunshan University +1 more

Attacks voice anonymization systems by augmenting ASV training data via word-level segment rearrangement to recover speaker identity

Output Integrity Attack audio

PDF Code

benchmark arXiv Jan 7, 2025 · Jan 2025

Detecting the Undetectable: Assessing the Efficacy of Current Spoof Detection Methods Against Seamless Speech Edits

Sung-Feng Huang, Heng-Cheng Kuo, Zhehuai Chen et al. · NVIDIA · National Taiwan University +1 more

Benchmark dataset (SINE) for seamless AI speech edit detection, revealing gaps in cut-and-paste-trained detectors

Output Integrity Attack audio

PDF
