Latest papers

10 papers
defense arXiv Mar 18, 2026 · 19d ago

Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

Haozheng Luo, Yimin Wang, Jiahao Yu et al. · Northwestern University · University of Michigan +1 more

Aligns reasoning models against jailbreaks by optimizing safety in hidden representation space using contrastive RL

Prompt Injection nlp
PDF
benchmark arXiv Mar 6, 2026 · 4w ago

When One Modality Rules Them All: Backdoor Modality Collapse in Multimodal Diffusion Models

Qitong Wang, Haoran Dai, Haotian Zhang et al. · University of Delaware · Illinois Institute of Technology +1 more

Introduces metrics revealing that multimodal backdoor attacks collapse to single-modality dominance rather than exploiting modalities synergistically

Model Poisoning multimodal generative
PDF
attack arXiv Mar 3, 2026 · 4w ago

On Google's SynthID-Text LLM Watermarking System: Theoretical Analysis and Empirical Validation

Romina Omidi, Yun Dong, Binghui Wang · Illinois Institute of Technology

Theoretically analyzes SynthID-Text LLM watermarking and proposes a layer inflation attack that defeats its mean-score detection scheme

Output Integrity Attack nlp
PDF Code
attack arXiv Jan 20, 2026 · 10w ago

SilentDrift: Exploiting Action Chunking for Stealthy Backdoor Attacks on Vision-Language-Action Models

Bingxin Xu, Yuzhang Shang, Binghui Wang et al. · University of Southern California · University of Central Florida +1 more

Backdoor attack on VLA robotic models exploiting action chunking to inject stealthy malicious trajectories with 93% ASR

Model Poisoning Data Poisoning Attack vision multimodal reinforcement-learning
1 citation PDF
benchmark arXiv Nov 26, 2025 · Nov 2025

Exploring Dynamic Properties of Backdoor Training Through Information Bottleneck

Xinyu Liu, Xu Zhang, Can Chen et al. · Michigan State University · Illinois Institute of Technology +1 more

Uses Information Bottleneck theory to analyze backdoor training dynamics and proposes a model-level stealthiness metric for backdoor attacks

Model Poisoning vision
PDF Code
defense arXiv Oct 22, 2025 · Oct 2025

Towards Strong Certified Defense with Universal Asymmetric Randomization

Hanbin Hong, Ashish Kundu, Ali Payani et al. · University of Connecticut · Cisco Research +1 more

Certified adversarial defense using anisotropic randomized smoothing that outperforms isotropic baselines by up to 182.6% on certified accuracy

Input Manipulation Attack vision
PDF Code
defense arXiv Sep 24, 2025 · Sep 2025

Beyond Sharp Minima: Robust LLM Unlearning via Feedback-Guided Multi-Point Optimization

Wenhan Wu, Zheyuan Liu, Chongyang Gao et al. · Northwestern University · University of Notre Dame +1 more

Hardens LLM unlearning against relearning attacks by steering parameters toward flat loss minima via adversarial neighborhood-aware optimization

Sensitive Information Disclosure Prompt Injection nlp
1 citation PDF
defense arXiv Aug 5, 2025 · Aug 2025

Privacy-Aware Decoding: Mitigating Privacy Leakage of Large Language Models in Retrieval-Augmented Generation

Haoran Wang, Xiongxiao Xu, Baixiang Huang et al. · Emory University · Illinois Institute of Technology

Defends RAG systems against private data extraction by injecting calibrated noise into token logits with formal DP guarantees

Sensitive Information Disclosure nlp
PDF Code
attack arXiv Aug 3, 2025 · Aug 2025

Practical, Generalizable and Robust Backdoor Attacks on Text-to-Image Diffusion Models

Haoran Dai, Jiawen Wang, Ruo Yang et al. · Illinois Institute of Technology · Samsung +2 more

Backdoor attack on text-to-image diffusion models achieving >90% success with only 10 poisoned samples and natural-language triggers

Model Poisoning Data Poisoning Attack vision nlp generative
PDF
defense arXiv Jan 9, 2025 · Jan 2025

Watermarking Graph Neural Networks via Explanations for Ownership Protection

Jane Downer, Ren Wang, Binghui Wang · Illinois Institute of Technology

Embeds ownership watermarks in GNN explanation behavior to prove model IP, surviving fine-tuning and pruning attacks

Model Theft graph
PDF