Latest papers

10 papers
defense arXiv Mar 10, 2026

GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning

Zhouxiang Fang, Jiawei Zhou, Hanjie Chen · Rice University · Stony Brook University

Defends LLM safety alignment against fine-tuning-induced degradation using generative replay of synthesized safety data (toy batch-mixing sketch below)

Transfer Learning Attack · Prompt Injection · nlp
PDF Code
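A minimal sketch of the replay-mixing idea, assuming GR-SAP interleaves model-synthesized safety examples into every fine-tuning batch; the function name, mixing ratio, and batching scheme are illustrative stand-ins, not the paper's recipe.

```python
import random
from typing import Iterator, List

def replay_mixed_batches(task_data: List[str], safety_data: List[str],
                         batch_size: int = 8, replay_frac: float = 0.25,
                         seed: int = 0) -> Iterator[List[str]]:
    """Yield fine-tuning batches in which roughly `replay_frac` of the
    examples are replayed (synthesized) safety data, so every gradient
    update rehearses the alignment behavior alongside the new task."""
    rng = random.Random(seed)
    n_replay = max(1, int(batch_size * replay_frac))
    n_task = batch_size - n_replay
    for start in range(0, len(task_data), n_task):
        task_chunk = task_data[start:start + n_task]
        if not task_chunk:
            break
        yield task_chunk + rng.sample(safety_data, min(n_replay, len(safety_data)))
```

The generative twist is that `safety_data` is synthesized by the aligned model itself rather than stored, so no original alignment corpus needs to ship with the checkpoint.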
benchmark arXiv Feb 18, 2026

AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks

Tanqiu Jiang, Yuhui Wang, Jiacheng Liang et al. · Stony Brook University

Benchmark evaluating LLM agent susceptibility to five long-horizon attack types across 28 agentic environments and 644 test cases

Prompt Injection · Excessive Agency · nlp
1 citation · PDF Code
attack arXiv Nov 30, 2025

Concept-Guided Backdoor Attack on Vision Language Models

Haoyu Shen, Weimin Lyu, Haotian Xu et al. · Stony Brook University

Proposes concept-level backdoor attacks on VLMs using semantic triggers instead of pixel perturbations, evading image-based defenses

Model Poisoning · vision · nlp · multimodal
PDF
attack arXiv Nov 14, 2025

Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio

Guangke Chen, Yuhui Wang, Shouling Ji et al. · Stony Brook University · Zhejiang University +1 more

Jailbreaks the safety alignment of TTS systems built on large audio-language models (LALMs) via semantic obfuscation and audio-modality injection to generate harmful speech

Prompt Injection · audio · nlp · multimodal
PDF
defense arXiv Nov 3, 2025

Watermarking Discrete Diffusion Language Models

Avi Bagchi, Akhil Bhimaraju, Moulik Choraria et al. · University of Pennsylvania · University of Illinois Urbana–Champaign +1 more

Embeds distortion-free Gumbel-max watermarks in discrete diffusion LM outputs with provably exponential false-positive decay (toy sampler sketch below)

Output Integrity Attack · nlp · generative
PDF
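For orientation, a toy of the distortion-free Gumbel-max scheme in the familiar autoregressive setting; the paper's contribution is carrying this over to discrete diffusion LMs, which this sketch does not attempt. Seeding by (key, context) and the detection statistic follow the standard construction, but every name here is illustrative.

```python
import hashlib
import numpy as np

def pseudo_uniforms(context: tuple, key: bytes, vocab: int) -> np.ndarray:
    """Per-token U(0,1) variates seeded by (key, context)."""
    seed = hashlib.sha256(key + repr(context).encode()).digest()[:8]
    return np.random.default_rng(int.from_bytes(seed, "big")).random(vocab)

def watermarked_sample(probs: np.ndarray, context: tuple, key: bytes) -> int:
    """Gumbel-max trick: argmax_i r_i^(1/p_i) is an exact draw from `probs`
    marginally over keys (distortion-free), yet the chosen token tends to
    have a large r_i that only a key holder can recompute."""
    r = pseudo_uniforms(context, key, len(probs))
    return int(np.argmax(r ** (1.0 / np.maximum(probs, 1e-12))))

def detection_score(tokens, contexts, key: bytes, vocab: int) -> float:
    """Sum of -log(1 - r_token): i.i.d. Exp(1) terms on unwatermarked text,
    so a threshold test's false-positive rate decays exponentially in the
    text length, the kind of guarantee the paper proves."""
    return float(sum(-np.log1p(-pseudo_uniforms(c, key, vocab)[t])
                     for t, c in zip(tokens, contexts)))
```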
defense arXiv Oct 23, 2025

Adversary-Aware Private Inference over Wireless Channels

Mohamed Seif, Malcolm Egan, Andrea J. Goldsmith et al. · Princeton University · INRIA +1 more

Defends against adversarial inversion of ML feature embeddings during wireless transmission using differential privacy and channel-aware encoding (Gaussian-mechanism sketch below)

Model Inversion Attack · vision
PDF
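A minimal sketch, assuming the privacy half of the defense reduces to the standard Gaussian mechanism on clipped embeddings before transmission; the channel-aware encoding that the paper co-designs with the noise is omitted, and the parameter names are illustrative.

```python
import numpy as np

def privatize_embedding(z: np.ndarray, clip: float = 1.0, eps: float = 1.0,
                        delta: float = 1e-5, rng=None) -> np.ndarray:
    """Clip the feature embedding to L2 norm `clip`, then add Gaussian noise
    calibrated for (eps, delta)-DP, bounding what an adversarial inversion
    model can reconstruct from the transmitted signal."""
    rng = rng or np.random.default_rng()
    z = z * min(1.0, clip / (np.linalg.norm(z) + 1e-12))
    sigma = clip * np.sqrt(2.0 * np.log(1.25 / delta)) / eps  # Gaussian mechanism
    return z + rng.normal(0.0, sigma, size=z.shape)
```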
attack arXiv Oct 4, 2025

Cross-Modal Content Optimization for Steering Web Agent Preferences

Tanqiu Jiang, Min Bai, Nikolaos Pappas et al. · Stony Brook University · AWS AI Labs

Black-box attack that jointly optimizes adversarial image perturbations and text to steer the selection preferences of VLM-based web agents

Input Manipulation Attack · Prompt Injection · vision · nlp · multimodal
PDF
defense arXiv Oct 2, 2025

Detecting Post-generation Edits to Watermarked LLM Outputs via Combinatorial Watermarking

Liyan Xie, Muhammad Siddeek, Mohamed Seif et al. · University of Minnesota · Princeton University +2 more

Combinatorial vocabulary-partitioning watermark for LLM text that detects and localizes post-generation edits and spoofing attacks (toy localization sketch below)

Output Integrity Attack · nlp
1 citation · PDF
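A hedged toy of the vocabulary-partitioning idea with localization: the generator would constrain each position's token to a keyed "expected" block, and the detector flags windows where that pattern breaks. The block assignment and windowed test are my illustrative choices, not necessarily the paper's combinatorial design.

```python
import hashlib
import numpy as np

K = 4  # number of vocabulary blocks

def _keyed_block(key: bytes, tag: bytes, value: int) -> int:
    return hashlib.sha256(key + tag + value.to_bytes(4, "big")).digest()[0] % K

def localize_edits(tokens: list, key: bytes,
                   window: int = 10, thresh: float = 0.5) -> list:
    """Return start indices of windows whose tokens too rarely land in the
    block expected at their position. Untampered watermarked text matches
    everywhere; an edited or spoofed span matches only ~1/K of positions,
    so one statistic both detects and localizes the tampering."""
    match = np.array([_keyed_block(key, b"tok", t) == _keyed_block(key, b"pos", i)
                      for i, t in enumerate(tokens)], dtype=float)
    rates = np.convolve(match, np.ones(window) / window, mode="valid")
    return [i for i, r in enumerate(rates) if r < thresh]
```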
defense CCS Sep 26, 2025

You Can't Steal Nothing: Mitigating Prompt Leakages in LLMs via System Vectors

Bochuan Cao, Changjiang Li, Yuanpu Cao et al. · The Pennsylvania State University · Palo Alto Networks +1 more

Extracts system prompts from GPT-4o and Claude, then defends with SysVec, which encodes the system prompt as hidden internal vectors rather than context tokens (rough sketch below)

Sensitive Information Disclosure · nlp
5 citations · 1 influential · PDF
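A rough sketch of the SysVec idea as summarized above: the system prompt enters as learned vectors at the embedding layer rather than as context tokens, so there is no prompt text in the context to extract. The vector count, initialization, and gpt2 placeholder are assumptions; the paper's procedure for training the vectors is not shown.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

# Stand-in for vectors trained offline to reproduce the system prompt's effect.
n_vec, d = 16, model.config.n_embd
sys_vectors = torch.randn(1, n_vec, d) * 0.02

user = tok("Summarize this email:", return_tensors="pt")
user_emb = model.get_input_embeddings()(user.input_ids)

# The "prompt" exists only as embeddings: no system tokens to leak verbatim.
inputs_embeds = torch.cat([sys_vectors, user_emb], dim=1)
mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
out = model.generate(inputs_embeds=inputs_embeds, attention_mask=mask,
                     max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```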
defense arXiv Sep 17, 2025

Defending Diffusion Models Against Membership Inference Attacks via Higher-Order Langevin Dynamics

Benjamin Sterling, Yousef El-Laham, Mónica F. Bugallo · Stony Brook University

Defends diffusion models against membership inference attacks using critically damped higher-order Langevin dynamics to corrupt sensitive training data earlier in the forward diffusion process (toy simulation below)

Membership Inference Attack · generative · audio
PDF
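To make the mechanism concrete, a toy Euler-Maruyama simulation of critically damped (second-order) Langevin dynamics, the family the defense builds on; noise enters through an auxiliary velocity channel and reaches the data coordinate only indirectly, which the paper argues corrupts sensitive training information earlier in the forward process. The Gamma^2 = 4M coupling is the standard critically damped choice; the paper's higher-order chains and noise schedules are not reproduced here.

```python
import numpy as np

def cld_forward(x0: np.ndarray, n_steps: int = 1000, dt: float = 1e-3,
                M: float = 0.25, seed: int = 0):
    """Euler-Maruyama simulation of the forward SDE
        dx = (v / M) dt
        dv = -x dt - (Gamma / M) v dt + sqrt(2 * Gamma) dW,
    with Gamma = 2 * sqrt(M) (critical damping). The Wiener increment dW
    hits only v, so x is corrupted through the velocity coupling rather
    than by directly injected noise."""
    rng = np.random.default_rng(seed)
    Gamma = 2.0 * np.sqrt(M)
    x = x0.astype(float).copy()
    v = np.zeros_like(x)  # velocity initialized at rest
    for _ in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=x.shape)
        x = x + (v / M) * dt
        v = v - x * dt - (Gamma / M) * v * dt + np.sqrt(2.0 * Gamma) * dW
    return x, v
```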