ML Security Papers

Latest papers

29 papers

benchmark arXiv Apr 23, 2026 · 28d ago

PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning

Xiaoyi Chen, Haoyuan Wang, Siyuan Tang et al. · Indiana University Bloomington · Independent Researcher +3 more

Evaluation framework exposing weaknesses in LLM privacy unlearning through three-tier attacks: direct retrieval, in-context recovery, and fine-tuning restoration

Model Inversion Attack Sensitive Information Disclosure nlp

PDF

defense arXiv Apr 16, 2026 · 5w ago

Harmonizing Multi-Objective LLM Unlearning via Unified Domain Representation and Bidirectional Logit Distillation

Yisheng Zhong, Sijia Liu, Zhuangdi Zhu · George Mason University · Michigan State University

Multi-objective LLM unlearning framework that removes hazardous knowledge while defending against adversarial probing attacks via bidirectional distillation

Model Inversion Attack Prompt Injection Sensitive Information Disclosure nlp

PDF

defense arXiv Mar 12, 2026 · 10w ago

Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment

Zhiyu Xue, Zimo Qi, Guangliang Liu et al. · University of California · Johns Hopkins University +2 more

Analyzes refusal trigger mechanisms in LLM safety alignment to reduce overrefusal while maintaining jailbreak defenses

Prompt Injection nlp

PDF

benchmark arXiv Feb 10, 2026 · Feb 2026

Benchmarking Knowledge-Extraction Attack and Defense on Retrieval-Augmented Generation

Zhisheng Qi, Utkarsh Sahu, Li Ma et al. · University of Oregon · Michigan State University +6 more

First systematic benchmark comparing knowledge-extraction attacks and defenses on RAG systems under unified evaluation protocols

Sensitive Information Disclosure nlp

PDF Code

attack arXiv Feb 2, 2026 · Feb 2026

Exposing Vulnerabilities in Explanation for Time Series Classifiers via Dual-Target Attacks

Bohan Wang, Zewen Liu, Lu Lin et al. · Emory University · The Pennsylvania State University +2 more

Adversarially decouples time series classifier predictions from explanations, enabling targeted misclassification with plausible-looking cover-up explanations

Input Manipulation Attack timeseries

PDF

attack arXiv Jan 29, 2026 · Jan 2026

Jailbreaks on Vision Language Model via Multimodal Reasoning

Aarush Noheria, Yuguang Yao · Novi High School · Michigan State University

Dual-strategy VLM jailbreak combining Chain-of-Thought prompt manipulation and ReAct-driven adversarial image noising to evade safety filters

Input Manipulation Attack Prompt Injection vision nlp multimodal

PDF

attack arXiv Jan 13, 2026 · Jan 2026

RAVEN: Erasing Invisible Watermarks via Novel View Synthesis

Fahad Shamshad, Nils Lukas, Karthik Nandakumar · MBZUAI · Michigan State University

Attacks invisible image watermarks by reformulating removal as novel view synthesis using zero-shot diffusion, defeating 15 schemes without detector access

Output Integrity Attack vision generative

PDF

defense arXiv Jan 8, 2026 · Jan 2026

On the Holistic Approach for Detecting Human Image Forgery

Xiao Guo, Jie Zhu, Anil Jain et al. · Michigan State University

Novel dual-branch deepfake detector unifying face forgery and full-body synthetic human detection using MLLM and frequency-domain analysis

Output Integrity Attack vision multimodal

PDF

defense arXiv Dec 12, 2025 · Dec 2025

SPDMark: Selective Parameter Displacement for Robust Video Watermarking

Samar Fares, Nurbek Tastan, Karthik Nandakumar · Mohamed bin Zayed University of Artificial Intelligence · Michigan State University

In-generation video watermarking via LoRA parameter displacement to track provenance of diffusion-generated videos

Output Integrity Attack vision generative

1 citation PDF

benchmark arXiv Dec 10, 2025 · Dec 2025

Impact of Positional Encoding: Clean and Adversarial Rademacher Complexity for Transformers under In-Context Regression

Weiyi He, Yue Xing · Michigan State University

Theoretical analysis showing positional encoding amplifies transformer adversarial vulnerability via adversarial Rademacher complexity bounds on in-context learning

Input Manipulation Attack nlp

PDF

attack arXiv Dec 5, 2025 · Dec 2025

SPOOF: Simple Pixel Operations for Out-of-Distribution Fooling

Ankit Gupta, Christoph Adami, Emily Dolson · Michigan State University

Greedy black-box attack generates high-confidence fooling images via sparse pixel edits on CNNs and ViTs

Input Manipulation Attack vision

PDF

defense arXiv Dec 3, 2025 · Dec 2025

Open Set Face Forgery Detection via Dual-Level Evidence Collection

Zhongyi Cai, Bryce Gernon, Wentao Bao et al. · Michigan State University · Rochester Institute of Technology

Proposes dual-level evidential uncertainty estimation to detect novel, unseen face forgery categories in open-set settings

Output Integrity Attack vision

PDF

benchmark arXiv Nov 26, 2025 · Nov 2025

Exploring Dynamic Properties of Backdoor Training Through Information Bottleneck

Xinyu Liu, Xu Zhang, Can Chen et al. · Michigan State University · Illinois Institute of Technology +1 more

Uses Information Bottleneck theory to analyze backdoor training dynamics and proposes a model-level stealthiness metric for backdoor attacks

Model Poisoning vision

PDF Code

benchmark arXiv Nov 24, 2025 · Nov 2025

SPQR: A Standardized Benchmark for Modern Safety Alignment Methods in Text-to-Image Diffusion Models

Mohammed Talha Alam, Nada Saadi, Fahad Shamshad et al. · Mohamed bin Zayed University of Artificial Intelligence · Michigan State University +1 more

Benchmarks T2I diffusion safety alignment across safety, utility, quality, and robustness after benign LoRA fine-tuning

Output Integrity Attack Transfer Learning Attack vision generative

PDF

defense arXiv Nov 23, 2025 · Nov 2025

Ensuring Calibration Robustness in Split Conformal Prediction Under Adversarial Attacks

Xunlei Qian, Yue Xing · Michigan State University

Defends split conformal prediction against test-time adversarial attacks by analyzing coverage monotonicity and using calibration-time perturbations to control guarantees

Input Manipulation Attack vision

PDF

defense arXiv Nov 22, 2025 · Nov 2025

Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning

Xiaohong Liu, Xiufeng Song, Huayu Zheng et al. · Shanghai Jiao Tong University · IEEE +2 more

Novel multimodal detector combining ViT spatio-temporal features and MLLM reasoning to identify diffusion-generated videos

Output Integrity Attack vision multimodal

PDF

benchmark arXiv Nov 7, 2025 · Nov 2025

Leak@$k$: Unlearning Does Not Make LLMs Forget Under Probabilistic Decoding

Hadi Reisizadeh, Jiajun Ruan, Yiwei Chen et al. · University of Minnesota · Michigan State University +1 more

Shows that all major LLM unlearning methods still leak private or hazardous training data under probabilistic sampling; introduces the leak@k metric and the RULE defense

Model Inversion Attack Sensitive Information Disclosure nlp

1 citation PDF

attack arXiv Oct 22, 2025 · Oct 2025

Can You Trust What You See? Alpha Channel No-Box Attacks on Video Object Detection

Ariana Yi, Ce Zhou, Liyang Xiao et al. · Mission San Jose High School · Missouri University of Science and Technology +1 more

No-box adversarial attack exploiting RGBA alpha channel blending in video to fool object detectors and VLMs with 100% success rate

Input Manipulation Attack Prompt Injection vision multimodal nlp

PDF

attack arXiv Oct 19, 2025 · Oct 2025

Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning

Bingqi Shang, Yiwei Chen, Yihua Zhang et al. · Michigan State University · National University of Singapore +1 more

Backdoors LLM unlearning via attention sink positions so models appear to forget but covertly restore knowledge when triggered

Model Poisoning nlp

1 citation PDF Code

attack arXiv Oct 12, 2025 · Oct 2025

One Token Embedding Is Enough to Deadlock Your Large Reasoning Model

Mohan Zhang, Yihua Zhang, Jinghan Jia et al. · University of North Carolina at Chapel Hill · Michigan State University +1 more

Backdoor-implanted attack on large reasoning models forcing perpetual CoT loops, achieving 100% resource exhaustion success rate

Model Poisoning Model Denial of Service nlp

1 citation PDF
