Latest papers

10 papers
defense arXiv Mar 10, 2026

GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning

Zhouxiang Fang, Jiawei Zhou, Hanjie Chen · Rice University · Stony Brook University

Defends LLM safety alignment against fine-tuning-induced degradation using generative replay of synthesized safety data (toy batch-mixing sketch below)

Transfer Learning Attack · Prompt Injection · nlp
PDF Code
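A minimal sketch of the replay-mixing idea, assuming GR-SAP interleaves model-synthesized safety examples into every fine-tuning batch; the function name, mixing ratio, and batching scheme are illustrative stand-ins, not the paper's recipe.

```python
import random
from typing import Iterator, List

def replay_mixed_batches(task_data: List[str], safety_data: List[str],
                         batch_size: int = 8, replay_frac: float = 0.25,
                         seed: int = 0) -> Iterator[List[str]]:
    """Yield fine-tuning batches in which roughly `replay_frac` of the
    examples are replayed (synthesized) safety data, so every gradient
    update rehearses the alignment behavior alongside the new task."""
    rng = random.Random(seed)
    n_replay = max(1, int(batch_size * replay_frac))
    n_task = batch_size - n_replay
    for start in range(0, len(task_data), n_task):
        task_chunk = task_data[start:start + n_task]
        if not task_chunk:
            break
        yield task_chunk + rng.sample(safety_data, min(n_replay, len(safety_data)))
```

The generative twist is that `safety_data` is synthesized by the aligned model itself rather than stored, so no original alignment corpus needs to ship with the checkpoint.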
benchmark arXiv Feb 18, 2026

AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks

Tanqiu Jiang, Yuhui Wang, Jiacheng Liang et al. · Stony Brook University

Benchmark evaluating LLM agent susceptibility to five long-horizon attack types across 28 agentic environments and 644 test cases

Prompt Injection · Excessive Agency · nlp
1 citation · PDF Code
attack arXiv Nov 30, 2025

Concept-Guided Backdoor Attack on Vision Language Models

Haoyu Shen, Weimin Lyu, Haotian Xu et al. · Stony Brook University

Proposes concept-level backdoor attacks on VLMs using semantic triggers instead of pixel perturbations, evading image-based defenses

Model Poisoning · vision · nlp · multimodal
PDF
attack arXiv Nov 14, 2025

Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio

Guangke Chen, Yuhui Wang, Shouling Ji et al. · Stony Brook University · Zhejiang University +1 more

Jailbreaks the safety alignment of TTS systems built on large audio-language models (LALMs) via semantic obfuscation and audio-modality injection to generate harmful speech

Prompt Injection · audio · nlp · multimodal
PDF
defense arXiv Nov 3, 2025

Watermarking Discrete Diffusion Language Models

Avi Bagchi, Akhil Bhimaraju, Moulik Choraria et al. · University of Pennsylvania · University of Illinois Urbana–Champaign +1 more

Embeds distortion-free Gumbel-max watermarks in discrete diffusion LM outputs with provably exponential false-positive decay (toy sampler sketch below)

Output Integrity Attack · nlp · generative
PDF
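For orientation, a toy of the distortion-free Gumbel-max scheme in the familiar autoregressive setting; the paper's contribution is carrying this over to discrete diffusion LMs, which this sketch does not attempt. Seeding by (key, context) and the detection statistic follow the standard construction, but every name here is illustrative.

```python
import hashlib
import numpy as np

def pseudo_uniforms(context: tuple, key: bytes, vocab: int) -> np.ndarray:
    """Per-token U(0,1) variates seeded by (key, context)."""
    seed = hashlib.sha256(key + repr(context).encode()).digest()[:8]
    return np.random.default_rng(int.from_bytes(seed, "big")).random(vocab)

def watermarked_sample(probs: np.ndarray, context: tuple, key: bytes) -> int:
    """Gumbel-max trick: argmax_i r_i^(1/p_i) is an exact draw from `probs`
    marginally over keys (distortion-free), yet the chosen token tends to
    have a large r_i that only a key holder can recompute."""
    r = pseudo_uniforms(context, key, len(probs))
    return int(np.argmax(r ** (1.0 / np.maximum(probs, 1e-12))))

def detection_score(tokens, contexts, key: bytes, vocab: int) -> float:
    """Sum of -log(1 - r_token): i.i.d. Exp(1) terms on unwatermarked text,
    so a threshold test's false-positive rate decays exponentially in the
    text length, the kind of guarantee the paper proves."""
    return float(sum(-np.log1p(-pseudo_uniforms(c, key, vocab)[t])
                     for t, c in zip(tokens, contexts)))
```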
defense arXiv Oct 23, 2025

Adversary-Aware Private Inference over Wireless Channels

Mohamed Seif, Malcolm Egan, Andrea J. Goldsmith et al. · Princeton University · INRIA +1 more

Defends against adversarial inversion of ML feature embeddings during wireless transmission using differential privacy and channel-aware encoding (Gaussian-mechanism sketch below)

Model Inversion Attack · vision
PDF
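A minimal sketch, assuming the privacy half of the defense reduces to the standard Gaussian mechanism on clipped embeddings before transmission; the channel-aware encoding that the paper co-designs with the noise is omitted, and the parameter names are illustrative.

```python
import numpy as np

def privatize_embedding(z: np.ndarray, clip: float = 1.0, eps: float = 1.0,
                        delta: float = 1e-5, rng=None) -> np.ndarray:
    """Clip the feature embedding to L2 norm `clip`, then add Gaussian noise
    calibrated for (eps, delta)-DP, bounding what an adversarial inversion
    model can reconstruct from the transmitted signal."""
    rng = rng or np.random.default_rng()
    z = z * min(1.0, clip / (np.linalg.norm(z) + 1e-12))
    sigma = clip * np.sqrt(2.0 * np.log(1.25 / delta)) / eps  # Gaussian mechanism
    return z + rng.normal(0.0, sigma, size=z.shape)
```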
attack arXiv Oct 4, 2025

Cross-Modal Content Optimization for Steering Web Agent Preferences

Tanqiu Jiang, Min Bai, Nikolaos Pappas et al. · Stony Brook University · AWS AI Labs

Black-box attack that jointly optimizes adversarial image perturbations and text to steer the selection preferences of VLM-based web agents

Input Manipulation Attack · Prompt Injection · vision · nlp · multimodal
PDF
defense arXiv Oct 2, 2025

Detecting Post-generation Edits to Watermarked LLM Outputs via Combinatorial Watermarking

Liyan Xie, Muhammad Siddeek, Mohamed Seif et al. · University of Minnesota · Princeton University +2 more

Combinatorial vocabulary-partitioning watermark for LLM text that detects and localizes post-generation edits and spoofing attacks (toy localization sketch below)

Output Integrity Attack · nlp
1 citation · PDF
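A hedged toy of the vocabulary-partitioning idea with localization: the generator would constrain each position's token to a keyed "expected" block, and the detector flags windows where that pattern breaks. The block assignment and windowed test are my illustrative choices, not necessarily the paper's combinatorial design.

```python
import hashlib
import numpy as np

K = 4  # number of vocabulary blocks

def _keyed_block(key: bytes, tag: bytes, value: int) -> int:
    return hashlib.sha256(key + tag + value.to_bytes(4, "big")).digest()[0] % K

def localize_edits(tokens: list, key: bytes,
                   window: int = 10, thresh: float = 0.5) -> list:
    """Return start indices of windows whose tokens too rarely land in the
    block expected at their position. Untampered watermarked text matches
    everywhere; an edited or spoofed span matches only ~1/K of positions,
    so one statistic both detects and localizes the tampering."""
    match = np.array([_keyed_block(key, b"tok", t) == _keyed_block(key, b"pos", i)
                      for i, t in enumerate(tokens)], dtype=float)
    rates = np.convolve(match, np.ones(window) / window, mode="valid")
    return [i for i, r in enumerate(rates) if r < thresh]
```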
defense CCS Sep 26, 2025

You Can't Steal Nothing: Mitigating Prompt Leakages in LLMs via System Vectors

Bochuan Cao, Changjiang Li, Yuanpu Cao et al. · The Pennsylvania State University · Palo Alto Networks +1 more

Extracts system prompts from GPT-4o and Claude, then defends with SysVec, which encodes the system prompt as hidden internal vectors rather than context tokens (rough sketch below)

Sensitive Information Disclosure · nlp
5 citations · 1 influential · PDF
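A rough sketch of the SysVec idea as summarized above: the system prompt enters as learned vectors at the embedding layer rather than as context tokens, so there is no prompt text in the context to extract. The vector count, initialization, and gpt2 placeholder are assumptions; the paper's procedure for training the vectors is not shown.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

# Stand-in for vectors trained offline to reproduce the system prompt's effect.
n_vec, d = 16, model.config.n_embd
sys_vectors = torch.randn(1, n_vec, d) * 0.02

user = tok("Summarize this email:", return_tensors="pt")
user_emb = model.get_input_embeddings()(user.input_ids)

# The "prompt" exists only as embeddings: no system tokens to leak verbatim.
inputs_embeds = torch.cat([sys_vectors, user_emb], dim=1)
mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
out = model.generate(inputs_embeds=inputs_embeds, attention_mask=mask,
                     max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```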
defense arXiv Sep 17, 2025

Defending Diffusion Models Against Membership Inference Attacks via Higher-Order Langevin Dynamics

Benjamin Sterling, Yousef El-Laham, Mónica F. Bugallo · Stony Brook University

Defends diffusion models against membership inference attacks using critically damped higher-order Langevin dynamics to corrupt sensitive training data earlier in the forward diffusion process (toy simulation below)

Membership Inference Attack · generative · audio
PDF
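To make the mechanism concrete, a toy Euler-Maruyama simulation of critically damped (second-order) Langevin dynamics, the family the defense builds on; noise enters through an auxiliary velocity channel and reaches the data coordinate only indirectly, which the paper argues corrupts sensitive training information earlier in the forward process. The Gamma^2 = 4M coupling is the standard critically damped choice; the paper's higher-order chains and noise schedules are not reproduced here.

```python
import numpy as np

def cld_forward(x0: np.ndarray, n_steps: int = 1000, dt: float = 1e-3,
                M: float = 0.25, seed: int = 0):
    """Euler-Maruyama simulation of the forward SDE
        dx = (v / M) dt
        dv = -x dt - (Gamma / M) v dt + sqrt(2 * Gamma) dW,
    with Gamma = 2 * sqrt(M) (critical damping). The Wiener increment dW
    hits only v, so x is corrupted through the velocity coupling rather
    than by directly injected noise."""
    rng = np.random.default_rng(seed)
    Gamma = 2.0 * np.sqrt(M)
    x = x0.astype(float).copy()
    v = np.zeros_like(x)  # velocity initialized at rest
    for _ in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=x.shape)
        x = x + (v / M) * dt
        v = v - x * dt - (Gamma / M) * v * dt + np.sqrt(2.0 * Gamma) * dW
    return x, v
```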