Latest papers

23 papers
defense arXiv Mar 18, 2026

Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

Haozheng Luo, Yimin Wang, Jiahao Yu et al. · Northwestern University · University of Michigan +1 more

Aligns reasoning models against jailbreaks by optimizing safety in hidden representation space using contrastive RL

Prompt Injection nlp
PDF
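
To make the idea concrete, a contrastive objective over pooled hidden states might look like the minimal PyTorch sketch below. The anchor-based hinge formulation, names, and margin are illustrative assumptions, not the authors' method; in an RL setup a term like this would enter the reward alongside the task reward.

```python
import torch
import torch.nn.functional as F

def contrastive_safety_loss(h_safe, h_jail, anchor, margin=0.5):
    """Hinge-style contrastive loss over hidden representations (assumed
    formulation): pull safe reasoning traces toward a safety anchor,
    push jailbroken traces beyond a cosine-distance margin."""
    z_safe = F.normalize(h_safe, dim=-1)    # (B, D) pooled hidden states
    z_jail = F.normalize(h_jail, dim=-1)
    a = F.normalize(anchor, dim=-1)         # (D,) learned safety direction
    d_safe = 1 - z_safe @ a                 # cosine distance to anchor
    d_jail = 1 - z_jail @ a
    return d_safe.mean() + F.relu(margin - d_jail).mean()

# Toy check with random hidden states (B=4, D=16):
h_s, h_j, a = torch.randn(4, 16), torch.randn(4, 16), torch.randn(16)
print(contrastive_safety_loss(h_s, h_j, a))
```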
benchmark arXiv Mar 2, 2026

A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection

Hashim Ali, Nithin Sai Adupa, Surya Subramani et al. · University of Michigan

Benchmarks 20 SSL models for audio deepfake detection across multiple datasets and acoustic degradation conditions

Output Integrity Attack audio
PDF
attack arXiv Jan 21, 2026

Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation

Muhammad Khalifa, Lajanugen Logeswaran, Jaekyeom Kim et al. · University of Michigan · LG AI Research +1 more

Crafted agent chain-of-thought reasoning inflates LLM/VLM judge false positives by up to 90% across 800 web-task trajectories

Prompt Injection nlp multimodal
1 citation PDF
benchmark arXiv Jan 12, 2026

LJ-Spoof: A Generatively Varied Corpus for Audio Anti-Spoofing and Synthesis Source Tracing

Surya Subramani, Hashim Ali, Hafiz Malik · University of Michigan

Benchmark corpus of 3M+ utterances across 500 TTS variants enabling audio deepfake detection and synthesis source tracing

Output Integrity Attack audio generative
PDF
tool arXiv Jan 6, 2026

DeepLeak: Privacy Enhancing Hardening of Model Explanations Against Membership Leakage

Firas Ben Hmida, Zain Sbeih, Philemon Hailemariam et al. · University of Michigan

Audits and hardens ML explanation methods against membership inference attacks, reducing leakage by up to 95%

Membership Inference Attack vision
PDF
defense arXiv Dec 23, 2025

Cost-TrustFL: Cost-Aware Hierarchical Federated Learning with Lightweight Reputation Evaluation across Multi-Cloud

Jixiao Yang, Jinyu Chen, Zixiao Huang et al. · Westcliff University · University of Washington +3 more

Defends federated learning against Byzantine poisoning attacks using Shapley-based reputation scores while minimizing multi-cloud communication costs

Data Poisoning Attack federated-learning vision
PDF
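
A Shapley-based reputation score is typically approximated by Monte Carlo sampling over client join orders; the sketch below illustrates the general estimator with a toy additive utility. The paper's lightweight variant and its multi-cloud cost model are not reproduced here.

```python
import random

def shapley_reputation(clients, utility, rounds=200, seed=0):
    """Monte Carlo Shapley estimate: average each client's marginal
    contribution to coalition utility over random join orders."""
    rng = random.Random(seed)
    scores = {c: 0.0 for c in clients}
    for _ in range(rounds):
        perm = clients[:]
        rng.shuffle(perm)
        coalition, prev_u = set(), utility(set())
        for c in perm:
            coalition.add(c)
            u = utility(coalition)
            scores[c] += u - prev_u     # marginal contribution of c
            prev_u = u
    return {c: s / rounds for c, s in scores.items()}

# Toy utility: honest clients improve validation accuracy, a Byzantine one hurts it.
value = {"c1": 0.30, "c2": 0.25, "byz": -0.40}
print(shapley_reputation(list(value), lambda S: sum(value[c] for c in S)))
# -> byz gets a negative reputation and can be down-weighted or excluded
```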
defense BigData Congress Dec 10, 2025

SCOUT: A Defense Against Data Poisoning Attacks in Fine-Tuned Language Models

Mohamed Afane, Abhishek Satyam, Ke Chen et al. · Fordham University · Zhejiang University +2 more

SCOUT uses token-level saliency analysis to detect contextually blended backdoor triggers in fine-tuned NLP models, including novel domain-specific attacks

Model Poisoning Data Poisoning Attack nlp
PDF Code
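
Token-level saliency is commonly computed from gradients of the prediction with respect to token embeddings. Here is a self-contained toy sketch of that idea; the model and scoring are stand-ins, not SCOUT's pipeline.

```python
import torch
import torch.nn as nn

vocab, dim = 1000, 32
emb = nn.Embedding(vocab, dim)                  # stand-in for an LM's embeddings
head = nn.Linear(dim, 2)                        # stand-in classification head

def token_saliency(token_ids, target=1):
    """Gradient-norm saliency per token: unusually high norms can flag
    tokens that disproportionately drive a (possibly backdoored) label."""
    e = emb(token_ids)                          # (seq, dim)
    e.retain_grad()
    logits = head(e.mean(dim=0))                # mean-pool then classify
    logits[target].backward()
    return e.grad.norm(dim=-1)                  # (seq,) saliency scores

ids = torch.randint(0, vocab, (12,))
print(token_saliency(ids))
```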
benchmark arXiv Nov 28, 2025

Are LLMs Good Safety Agents or a Propaganda Engine?

Neemesh Yadav, Francesco Ortu, Jiarui Liu et al. · Southern Methodist University · University of Trieste +6 more

Benchmarks LLM refusal behaviors using prompt injection attacks to distinguish genuine safety guardrails from political censorship

Prompt Injection nlp
PDF
defense arXiv Nov 26, 2025

TAB-DRW: A DFT-based Robust Watermark for Generative Tabular Data

Yizhou Zhao, Xiang Li, Peter Song et al. · University of Pennsylvania · University of Michigan

DFT-based frequency-domain watermarking for AI-generated tabular data enabling robust provenance tracing against post-processing attacks

Output Integrity Attack tabular generative
PDF Code
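
The frequency-domain idea can be illustrated on a single numeric column: embed a keyed pattern in selected DFT coefficients, then detect it by correlating the spectrum with the same keyed pattern. The sketch below is a toy scheme under those assumptions, not TAB-DRW itself.

```python
import numpy as np

def keyed_indices(n_coeffs, key, frac=0.1):
    """Derive the keyed coefficient subset and +/-1 pattern (toy scheme)."""
    rng = np.random.default_rng(key)
    idx = rng.choice(n_coeffs - 1, size=max(1, int(frac * n_coeffs)), replace=False) + 1
    return idx, rng.choice([-1.0, 1.0], size=idx.size)

def embed(col, key=42, strength=0.5):
    """Push keyed DFT coefficients along the pattern; strength trades
    data fidelity against robustness to post-processing."""
    spec = np.fft.rfft(col)
    idx, pat = keyed_indices(len(spec), key)
    spec[idx] += strength * np.abs(spec).mean() * pat
    return np.fft.irfft(spec, n=len(col))

def detect(col, key=42):
    """Correlate the spectrum with the keyed pattern; high => watermarked."""
    spec = np.fft.rfft(col)
    idx, pat = keyed_indices(len(spec), key)
    return float(np.real(spec[idx] * pat).mean())

data = np.random.default_rng(0).normal(size=1024)
print(detect(embed(data)), detect(data))  # watermarked score >> clean score
```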
defense arXiv Nov 20, 2025

PEPPER: Perception-Guided Perturbation for Robust Backdoor Defense in Text-to-Image Diffusion Models

Oscar Chew, Po-Yi Lu, Jayden Lin et al. · Texas A&M University · National Taiwan University +1 more

Defends T2I diffusion models from backdoor triggers by rewriting prompts to be semantically distant yet visually similar, disrupting trigger tokens at inference time

Model Poisoning vision nlp generative
PDF Code
attack arXiv Nov 16, 2025

GRAPHTEXTACK: A Realistic Black-Box Node Injection Attack on LLM-Enhanced GNNs

Jiaji Ma, Puja Trivedi, Danai Koutra · University of Michigan

Black-box evolutionary attack injects adversarial nodes with crafted text and edges to poison LLM-enhanced GNNs

Data Poisoning Attack graph nlp
PDF
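
The black-box evolutionary search can be sketched generically: mutate an injected node's feature vector and keep whichever candidate best raises a query-only attack score. GRAPHTEXTACK additionally evolves node text and edge connections; everything below is an illustrative stand-in.

```python
import numpy as np

def evolve_injected_node(score, dim, pop=16, gens=50, sigma=0.1, seed=0):
    """(1+lambda) evolutionary search: maximize a black-box score
    (e.g. the victim GNN's error on target nodes) via Gaussian mutation."""
    rng = np.random.default_rng(seed)
    best = rng.normal(size=dim)
    best_s = score(best)
    for _ in range(gens):
        cand = best + sigma * rng.normal(size=(pop, dim))  # mutate
        s = np.array([score(c) for c in cand])
        if s.max() > best_s:                               # elitist selection
            best, best_s = cand[s.argmax()], float(s.max())
    return best, best_s

# Toy black-box: score is alignment with a hidden "vulnerable" direction.
target = np.ones(8) / np.sqrt(8)
feat, s = evolve_injected_node(
    lambda x: float(x @ target / (np.linalg.norm(x) + 1e-12)), dim=8)
print(round(s, 3))  # climbs toward 1.0 without any gradient access
```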
benchmark arXiv Oct 22, 2025

Subliminal Corruption: Mechanisms, Thresholds, and Interpretability

Reya Vir, Sarvesh Bhatnagar · Columbia University · University of Michigan

Quantifies subliminal data poisoning in LLM fine-tuning: finds a sharp alignment-failure phase transition rather than gradual degradation

Data Poisoning Attack Training Data Poisoning nlp
2 citations PDF
attack arXiv Oct 15, 2025

When "Correct" Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?

Yibo Peng, James Song, Lei Li et al. · Carnegie Mellon University · University of Michigan +3 more

Attacks LLM code agents via crafted issues to produce test-passing but security-vulnerable patches across 12 agent-model combinations

Prompt Injection nlp
PDF
defense arXiv Oct 15, 2025

NoisePrints: Distortion-Free Watermarks for Authorship in Private Diffusion Models

Nir Goren, Oren Katzir, Abhinav Nakarmi et al. · Tel Aviv University · University of Michigan

Distortion-free diffusion watermarking exploits seed-output correlation and ZK proofs to verify generated-image authorship without access to model weights

Output Integrity Attack visiongenerative
PDF
benchmark arXiv Oct 6, 2025

SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests

Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj et al. · University of Toronto · Vector Institute +4 more

Benchmarks LLM vulnerability to sociopolitical harm requests across 585 prompts spanning 34 countries, revealing 97–98% attack success rates

Prompt Injection nlp
PDF Code
defense Asia-Pacific Computer Systems ... Sep 30, 2025

DeepProv: Behavioral Characterization and Repair of Neural Networks via Inference Provenance Graph Analysis

Firas Ben Hmida, Abderrahmen Amich, Ata Kaboudi et al. · University of Michigan

Defends DNNs against adversarial examples via Inference Provenance Graph analysis to identify and repair vulnerable nodes/edges

Input Manipulation Attack vision
PDF
defense ICDMW Sep 29, 2025

Lightweight and Robust Federated Data Valuation

Guojun Tang, Jiayu Zhou, Mohammad Mamun et al. · University of Calgary · University of Michigan +1 more

Defends federated learning against adversarial clients using influence-score aggregation, running 450x faster than Shapley-value baselines

Data Poisoning Attack federated-learning vision
PDF
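
Influence-style valuation avoids Shapley's exponential coalition enumeration: each client update is scored by its alignment with a validation descent direction, and aggregation is weighted by that score. The following is a first-order sketch under that assumption, not the paper's estimator.

```python
import numpy as np

def influence_weighted_aggregate(updates, val_dir, temperature=5.0):
    """Score each client update by cosine alignment with the server's
    validation descent direction; softmax the scores into weights so
    misaligned (adversarial) updates contribute almost nothing."""
    v = val_dir / (np.linalg.norm(val_dir) + 1e-12)
    scores = np.array([u @ v / (np.linalg.norm(u) + 1e-12) for u in updates])
    w = np.exp(temperature * scores)
    w /= w.sum()
    return sum(wi * u for wi, u in zip(w, updates)), scores

# Toy: two honest clients track the validation direction, one attacker inverts it.
rng = np.random.default_rng(1)
g = rng.normal(size=100)
upd = [g + 0.1 * rng.normal(size=100), g + 0.1 * rng.normal(size=100), -g]
agg, s = influence_weighted_aggregate(upd, g)
print(np.round(s, 2))  # attacker scores near -1 and is effectively filtered out
```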
attack arXiv Sep 18, 2025

Evil Vizier: Vulnerabilities of LLM-Integrated XR Systems

Yicheng Zhang, Zijian Huang, Sophie Chen et al. · University of California · University of Michigan

Demonstrates indirect prompt injection attacks on XR-LLM systems by manipulating physical/digital environment context to corrupt AI glasses outputs

Prompt Injection Excessive Agency multimodal nlp vision
PDF
benchmark arXiv Sep 8, 2025

Adversarial Attacks on Audio Deepfake Detection: A Benchmark and Comparative Study

Kutub Uddin, Muhammad Umar Farooq, Awais Khan et al. · University of Michigan · University of Michigan-Flint

Benchmarks 12 audio deepfake detectors against statistical and gradient-based adversarial attacks across five large-scale datasets

Input Manipulation Attack Output Integrity Attack audio
PDF
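
The gradient-based side of such a benchmark typically starts from FGSM on the raw waveform. A minimal sketch follows, with a toy stand-in detector; the benchmark's 12 detectors are real SSL/CNN models.

```python
import torch
import torch.nn as nn

def fgsm_waveform(model, wav, label, eps=1e-3):
    """One-step FGSM: perturb the waveform along the sign of the loss
    gradient, keeping the perturbation within an eps L-inf ball."""
    wav = wav.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(wav), label)
    loss.backward()
    return (wav + eps * wav.grad.sign()).clamp(-1, 1).detach()

# Stand-in detector over (batch, channels, samples) raw audio.
model = nn.Sequential(nn.Conv1d(1, 8, 9, stride=4), nn.ReLU(),
                      nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(8, 2))
wav = 0.1 * torch.randn(1, 1, 16000)
adv = fgsm_waveform(model, wav, torch.tensor([0]))
print(float((adv - wav).abs().max()))  # perturbation bounded by eps
```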
attack arXiv Sep 8, 2025

Realism to Deception: Investigating Deepfake Detectors Against Face Enhancement

Muhammad Saad Saeed, Ijaz Ul Haq, Khalid Malik · University of Michigan · University of Michigan-Flint

Uses face-enhancement filters and GANs as anti-forensic attacks to evade deepfake detectors, achieving up to a 75% attack success rate

Output Integrity Attack vision
PDF