Latest papers

16 papers
attack arXiv Mar 13, 2026

DAST: A Dual-Stream Voice Anonymization Attacker with Staged Training

Ridwan Arefeen, Xiaoxiao Miao, Rong Tong et al. · Singapore Institute of Technology · Duke Kunshan University +1 more

Dual-stream speaker re-identification attack on anonymized speech, combining SSL and spectral features with staged transfer learning

Input Manipulation Attack audio
PDF
benchmark arXiv Feb 21, 2026

Prior Aware Memorization: An Efficient Metric for Distinguishing Memorization from Generalization in Large Language Models

Trishita Tiwari, Ari Trachtenberg, G. Edward Suh · Cornell University · Boston University +1 more

Proposes a Prior Aware Memorization metric showing that 55–90% of LLM 'memorized' sequences are actually statistically common, not genuine leakage

Model Inversion Attack Sensitive Information Disclosure nlp
PDF
defense arXiv Feb 14, 2026

AlignSentinel: Alignment-Aware Detection of Prompt Injection Attacks

Yuqi Jia, Ruiqi Wang, Xilong Wang et al. · Duke University · NVIDIA

Three-class attention-based classifier detects prompt injection by distinguishing misaligned, aligned, and non-instruction LLM inputs

Prompt Injection nlp
PDF
survey arXiv Feb 11, 2026

The Landscape of Prompt Injection Threats in LLM Agents: From Taxonomy to Analysis

Peiran Wang, Xinfeng Li, Chong Xiang et al. · UCLA · NTU +1 more

Systematizes prompt injection attacks and defenses for LLM agents, introducing AgentPI benchmark that exposes context-dependent gaps in existing evaluations

Prompt Injection Excessive Agency nlp
PDF
attack arXiv Jan 29, 2026

ReasoningBomb: A Stealthy Denial-of-Service Attack by Inducing Pathologically Long Reasoning in Large Reasoning Models

Xiaogeng Liu, Xinyan Wang, Yechao Zhang et al. · Johns Hopkins University · NVIDIA +4 more

RL-trained attacker generates short natural prompts that force LRMs into pathologically long reasoning, achieving 286x output amplification and >98% detection bypass

Model Denial of Service nlp reinforcement-learning
PDF
defense arXiv Jan 18, 2026

LR-DWM: Efficient Watermarking for Diffusion Language Models

Ofek Raban, Ethan Fetaya, Gal Chechik · Bar-Ilan University · NVIDIA

Proposes LR-DWM, an efficient watermarking scheme for Diffusion Language Models using bidirectional neighbor context with negligible overhead

Output Integrity Attack nlp generative
PDF
defense arXiv Jan 15, 2026

ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack

Hao Li, Yankai Yang, G. Edward Suh et al. · Washington University in St. Louis · University of Wisconsin–Madison +2 more

Defends LLM agents against indirect prompt injection using structured reasoning to detect conflicting injected instructions

Prompt Injection nlp
1 citation PDF Code
defense arXiv Nov 30, 2025

Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis

Mintong Kang, Chong Xiang, Sanjay Kariyappa et al. · NVIDIA · University of Illinois Urbana-Champaign +1 more

Defends LLM agents against indirect prompt injection by analyzing whether the model intends to follow untrusted instructions, cutting attack success from 100% to 8.5%

Prompt Injection nlp
1 citation PDF
defense arXiv Nov 27, 2025

A Safety and Security Framework for Real-World Agentic Systems

Shaona Ghosh, Barnaby Simkin, Kyriacos Shiarlis et al. · NVIDIA · Lakera AI

Proposes enterprise agentic AI security framework with risk taxonomy, AI-driven red teaming, and mitigation agents for tool misuse and cascading actions

Excessive Agency Insecure Plugin Design Prompt Injection nlp
2 citations PDF Code
tool arXiv Oct 30, 2025

Detecting Data Contamination in LLMs via In-Context Learning

Michał Zawalski, Meriem Boubdir, Klaudia Bałazy et al. · NVIDIA

Detects LLM benchmark contamination by measuring how in-context examples disrupt memorization-based confidence signals

Membership Inference Attack nlp
PDF
benchmark arXiv Oct 19, 2025

Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations

Bo-Han Feng, Chien-Feng Liu, Yu-Hsuan Li Liang et al. · National Taiwan University · NVIDIA

Reveals that speaker emotional intensity systematically jailbreaks audio-language models, with medium intensity posing the greatest safety risk

Prompt Injection audio multimodal nlp
1 citation PDF Code
defense arXiv Oct 14, 2025

ImageSentinel: Protecting Visual Datasets from Unauthorized Retrieval-Augmented Image Generation

Ziyuan Luo, Yangyi Zhao, Ka Chun Cheung et al. · Hong Kong Baptist University · NVIDIA

Protects visual datasets from unauthorized RAIG use by injecting sentinel images detectable via secret random-string retrieval keys

Output Integrity Attack vision generative
3 citations PDF Code
defense arXiv Oct 3, 2025

Unmasking Puppeteers: Leveraging Biometric Leakage to Disarm Impersonation in AI-based Videoconferencing

Danial Samadi Vahdati, Tai Duc Nguyen, Ekta Prashnani et al. · Drexel University · NVIDIA

Defends AI videoconferencing from real-time face puppeteering by detecting identity swaps via biometric leakage in transmitted pose-expression latents

Output Integrity Attack vision generative
PDF
defense arXiv Sep 27, 2025

ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search

Zeyu Shen, Basileal Imana, Tong Wu et al. · Princeton University · NVIDIA

Defends RAG-based search against corpus poisoning using graph-theoretic document reliability filtering with provable robustness guarantees

Input Manipulation Attack Prompt Injection nlp
2 citations PDF
attack arXiv Aug 26, 2025

SegReConcat: A Data Augmentation Method for Voice Anonymization Attack

Ridwan Arefeen, Xiaoxiao Miao, Rong Tong et al. · Singapore Institute of Technology · Duke Kunshan University +1 more

Attacks voice anonymization systems by augmenting ASV training data via word-level segment rearrangement to recover speaker identity

Output Integrity Attack audio
PDF Code
benchmark arXiv Jan 7, 2025

Detecting the Undetectable: Assessing the Efficacy of Current Spoof Detection Methods Against Seamless Speech Edits

Sung-Feng Huang, Heng-Cheng Kuo, Zhehuai Chen et al. · NVIDIA · National Taiwan University +1 more

Introduces SINE, a benchmark dataset for detecting seamless AI speech edits, revealing gaps in detectors trained on cut-and-paste edits

Output Integrity Attack audio
PDF