Latest papers

27 papers
defense arXiv Mar 12, 2026

Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment

Zhiyu Xue, Zimo Qi, Guangliang Liu et al. · University of California · Johns Hopkins University +2 more

Analyzes refusal trigger mechanisms in LLM safety alignment to reduce overrefusal while maintaining jailbreak defenses

Prompt Injection nlp
PDF
benchmark arXiv Feb 10, 2026

Benchmarking Knowledge-Extraction Attack and Defense on Retrieval-Augmented Generation

Zhisheng Qi, Utkarsh Sahu, Li Ma et al. · University of Oregon · Michigan State University +6 more

First systematic benchmark comparing knowledge-extraction attacks and defenses on RAG systems under unified evaluation protocols

Sensitive Information Disclosure nlp
PDF Code
attack arXiv Feb 2, 2026

Exposing Vulnerabilities in Explanation for Time Series Classifiers via Dual-Target Attacks

Bohan Wang, Zewen Liu, Lu Lin et al. · Emory University · The Pennsylvania State University +2 more

Adversarially decouples time series classifier predictions from explanations, enabling targeted misclassification with plausible-looking cover-up explanations

Input Manipulation Attack timeseries
PDF
attack arXiv Jan 29, 2026

Jailbreaks on Vision Language Model via Multimodal Reasoning

Aarush Noheria, Yuguang Yao · Novi High School · Michigan State University

Dual-strategy VLM jailbreak combining Chain-of-Thought prompt manipulation and ReAct-driven adversarial image noising to evade safety filters

Input Manipulation Attack Prompt Injection vision nlp multimodal
PDF
attack arXiv Jan 13, 2026

RAVEN: Erasing Invisible Watermarks via Novel View Synthesis

Fahad Shamshad, Nils Lukas, Karthik Nandakumar · MBZUAI · Michigan State University

Attacks invisible image watermarks by reformulating removal as novel view synthesis using zero-shot diffusion, defeating 15 schemes without detector access

Output Integrity Attack vision generative
PDF
defense arXiv Jan 8, 2026

On the Holistic Approach for Detecting Human Image Forgery

Xiao Guo, Jie Zhu, Anil Jain et al. · Michigan State University

Novel dual-branch deepfake detector unifying face forgery and full-body synthetic human detection using MLLM and frequency-domain analysis

Output Integrity Attack vision multimodal
PDF
defense arXiv Dec 12, 2025

SPDMark: Selective Parameter Displacement for Robust Video Watermarking

Samar Fares, Nurbek Tastan, Karthik Nandakumar · Mohamed bin Zayed University of Artificial Intelligence · Michigan State University

In-generation video watermarking via LoRA parameter displacement to track provenance of diffusion-generated videos

Output Integrity Attack vision generative
1 citation PDF
benchmark arXiv Dec 10, 2025

Impact of Positional Encoding: Clean and Adversarial Rademacher Complexity for Transformers under In-Context Regression

Weiyi He, Yue Xing · Michigan State University

Theoretical analysis showing positional encoding amplifies transformer adversarial vulnerability via adversarial Rademacher complexity bounds on in-context learning

Input Manipulation Attack nlp
PDF
attack arXiv Dec 5, 2025

SPOOF: Simple Pixel Operations for Out-of-Distribution Fooling

Ankit Gupta, Christoph Adami, Emily Dolson · Michigan State University

Greedy black-box attack generates high-confidence fooling images via sparse pixel edits on CNNs and vision transformers (ViTs)

Input Manipulation Attack vision
PDF
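The greedy sparse-pixel search this entry describes is a generic black-box pattern; a minimal sketch, assuming only query access to class confidences (the `score` interface and `toy_score` model are hypothetical stand-ins, not the paper's code):

```python
import numpy as np

def greedy_pixel_fool(score, img, target, steps=200, seed=0):
    """Greedy black-box search in the spirit of sparse-pixel fooling:
    propose one random pixel edit per step and keep it only if the
    model's confidence in `target` increases. `score` is a hypothetical
    query-only interface returning per-class confidences."""
    rng = np.random.default_rng(seed)
    x = img.copy()
    best = score(x)[target]
    for _ in range(steps):
        i, j, c = (rng.integers(s) for s in x.shape)
        old = x[i, j, c]
        x[i, j, c] = rng.integers(256)   # propose a single-pixel edit
        new = score(x)[target]
        if new > best:
            best = new                   # keep improving edits
        else:
            x[i, j, c] = old             # revert otherwise
    return x, best

# Toy "model": confidence in class 0 grows with mean brightness.
def toy_score(x):
    m = x.mean() / 255.0
    return np.array([m, 1.0 - m])

img = np.zeros((8, 8, 3), dtype=np.uint8)
adv, conf = greedy_pixel_fool(toy_score, img, target=0)
```

Because edits are accepted only when confidence rises, the target-class score is monotonically non-decreasing over the search, at the cost of one model query per proposal.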
defense arXiv Dec 3, 2025

Open Set Face Forgery Detection via Dual-Level Evidence Collection

Zhongyi Cai, Bryce Gernon, Wentao Bao et al. · Michigan State University · Rochester Institute of Technology

Proposes dual-level evidential uncertainty estimation to detect novel, unseen face forgery categories in open-set settings

Output Integrity Attack vision
PDF
benchmark arXiv Nov 26, 2025

Exploring Dynamic Properties of Backdoor Training Through Information Bottleneck

Xinyu Liu, Xu Zhang, Can Chen et al. · Michigan State University · Illinois Institute of Technology +1 more

Uses Information Bottleneck theory to analyze backdoor training dynamics and proposes a model-level stealthiness metric for backdoor attacks

Model Poisoning vision
PDF Code
benchmark arXiv Nov 24, 2025

SPQR: A Standardized Benchmark for Modern Safety Alignment Methods in Text-to-Image Diffusion Models

Mohammed Talha Alam, Nada Saadi, Fahad Shamshad et al. · Mohamed bin Zayed University of Artificial Intelligence · Michigan State University +1 more

Benchmarks T2I diffusion safety alignment across safety, utility, quality, and robustness after benign LoRA fine-tuning

Output Integrity Attack Transfer Learning Attack vision generative
PDF
defense arXiv Nov 23, 2025

Ensuring Calibration Robustness in Split Conformal Prediction Under Adversarial Attacks

Xunlei Qian, Yue Xing · Michigan State University

Defends split conformal prediction against test-time adversarial attacks by analyzing coverage monotonicity and using calibration-time perturbations to control guarantees

Input Manipulation Attack vision
PDF
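Split conformal prediction, the procedure being defended here, is standard; a minimal sketch of the clean (non-adversarial) calibration step, omitting the paper's calibration-time perturbations:

```python
import numpy as np

def split_conformal_interval(cal_preds, cal_labels, test_pred, alpha=0.1):
    """Standard split conformal prediction: calibrate absolute-residual
    nonconformity scores on a held-out set, then return an interval with
    1 - alpha marginal coverage for a new point prediction."""
    scores = np.abs(cal_labels - cal_preds)        # nonconformity scores
    n = len(scores)
    # Finite-sample-corrected quantile level for valid marginal coverage.
    level = np.ceil((n + 1) * (1 - alpha)) / n
    q = np.quantile(scores, level, method="higher")
    return test_pred - q, test_pred + q

rng = np.random.default_rng(0)
y = rng.normal(size=500)
preds = y + rng.normal(scale=0.3, size=500)        # noisy point predictor
lo, hi = split_conformal_interval(preds[:250], y[:250], test_pred=0.0)
```

A test-time adversarial perturbation that inflates residuals beyond the calibrated quantile `q` breaks the coverage guarantee, which is what motivates hardening the calibration step.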
defense arXiv Nov 22, 2025

Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning

Xiaohong Liu, Xiufeng Song, Huayu Zheng et al. · Shanghai Jiao Tong University · IEEE +2 more

Novel multimodal detector combining ViT spatio-temporal features and MLLM reasoning to identify diffusion-generated videos

Output Integrity Attack vision multimodal
PDF
benchmark arXiv Nov 7, 2025

Leak@k: Unlearning Does Not Make LLMs Forget Under Probabilistic Decoding

Hadi Reisizadeh, Jiajun Ruan, Yiwei Chen et al. · University of Minnesota · Michigan State University +1 more

Exposes that all major LLM unlearning methods still leak private/hazardous training data under probabilistic sampling; introduces leak@k metric and RULE defense

Model Inversion Attack Sensitive Information Disclosure nlp
1 citation PDF
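One plausible reading of a leak@k-style metric, sketched under assumptions (the `toy_sample` model, the `contains_secret` check, and the per-draw leak probability are illustrative, not the paper's definitions):

```python
import random

def leak_at_k(model_sample, prompts, leaked_by, k=8, seed=0):
    """leak@k as sketched here: a prompt counts as leaking if any of
    k stochastic samples reveals forgotten content; return the
    fraction of leaking prompts. All callables are hypothetical."""
    rng = random.Random(seed)
    leaked = 0
    for p in prompts:
        if any(leaked_by(p, model_sample(p, rng)) for _ in range(k)):
            leaked += 1
    return leaked / len(prompts)

# Toy "unlearned" model: under temperature sampling, the forgotten
# secret still surfaces with small probability on each draw.
secret = "s3cr3t"
def toy_sample(prompt, rng):
    return secret if rng.random() < 0.2 else "I don't know"
def contains_secret(prompt, output):
    return secret in output

rate = leak_at_k(toy_sample, ["q"] * 100, contains_secret, k=8)
```

This illustrates the headline finding: a model that rarely leaks under a single greedy decode (k=1) still leaks on most prompts once k independent samples are drawn, since per-prompt leak probability compounds as 1 - (1 - p)^k.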
attack arXiv Oct 22, 2025

Can You Trust What You See? Alpha Channel No-Box Attacks on Video Object Detection

Ariana Yi, Ce Zhou, Liyang Xiao et al. · Mission San Jose High School · Missouri University of Science and Technology +1 more

No-box adversarial attack exploiting RGBA alpha channel blending in video to fool object detectors and VLMs with 100% success rate

Input Manipulation Attack Prompt Injection vision multimodal nlp
PDF
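The RGBA "over" compositing that such an alpha-channel attack exploits is standard image math; a minimal sketch of why a pipeline that drops the alpha channel processes different content than the viewer sees:

```python
import numpy as np

def alpha_composite(fg_rgba, bg_rgb):
    """Standard 'over' compositing: what a viewer sees depends on the
    background, while a detector reading raw RGB ignores alpha."""
    alpha = fg_rgba[..., 3:4].astype(np.float64) / 255.0
    fg = fg_rgba[..., :3].astype(np.float64)
    bg = bg_rgb.astype(np.float64)
    return (alpha * fg + (1.0 - alpha) * bg).round().astype(np.uint8)

# A fully transparent frame renders as the background, so a model that
# strips alpha sees a payload the viewer never does.
frame = np.zeros((2, 2, 4), dtype=np.uint8)       # alpha = 0 everywhere
frame[..., :3] = 255                               # hidden white RGB payload
background = np.zeros((2, 2, 3), dtype=np.uint8)   # black background
rendered = alpha_composite(frame, background)
print(rendered[0, 0])  # → [0 0 0]: viewer sees black, raw RGB is white
```

The mismatch between `rendered` (what a human reviews) and `frame[..., :3]` (what an alpha-unaware detector consumes) is the attack surface; no model access is needed, hence "no-box".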
attack arXiv Oct 19, 2025

Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning

Bingqi Shang, Yiwei Chen, Yihua Zhang et al. · Michigan State University · National University of Singapore +1 more

Backdoors LLM unlearning via attention sink positions so models appear to forget but covertly restore knowledge when triggered

Model Poisoning nlp
1 citation PDF Code
attack arXiv Oct 12, 2025

One Token Embedding Is Enough to Deadlock Your Large Reasoning Model

Mohan Zhang, Yihua Zhang, Jinghan Jia et al. · University of North Carolina at Chapel Hill · Michigan State University +1 more

Backdoor-implanted attack on large reasoning models forcing perpetual CoT loops, achieving 100% resource exhaustion success rate

Model Poisoning Model Denial of Service nlp
1 citation PDF
benchmark arXiv Oct 8, 2025

LLM Unlearning Under the Microscope: A Full-Stack View on Methods and Metrics

Chongyu Fan, Changsheng Wang, Yancheng Huang et al. · Michigan State University · IBM Research

Benchmarks 12 LLM unlearning methods on effectiveness, utility, and robustness to attacks recovering forgotten harmful behaviors

Prompt Injection nlp
PDF
benchmark arXiv Oct 8, 2025

PEAR: Planner-Executor Agent Robustness Benchmark

Shen Dong, Mingxuan Zhang, Pengfei He et al. · Michigan State University · Purdue University +1 more

Benchmark for evaluating adversarial robustness of LLM planner-executor multi-agent systems across harmful action, privacy, and DoS attacks

Prompt Injection Excessive Agency nlp
PDF Code