Latest papers

22 papers
attack arXiv Mar 30, 2026

Membership Inference Attacks against Large Audio Language Models

Jia-Kai Dong, Yu-Xiang Lin, Hung-Yi Lee · National Taiwan University · NTU Artificial Intelligence Center of Research Excellence

First systematic membership inference attack evaluation of audio language models, revealing cross-modal memorization from speaker-text binding

Membership Inference Attack audio multimodal nlp
PDF
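The generic loss-threshold idea that membership inference attacks build on can be sketched with synthetic numbers (illustrative only, not the paper's audio-specific pipeline): models tend to assign lower loss to training members than to unseen inputs, so thresholding the loss yields a membership guess.

```python
import numpy as np

# Synthetic per-example losses: members (seen in training) score lower
# than non-members. All distributions here are made up for illustration.
rng = np.random.default_rng(0)
member_loss = rng.normal(loc=0.5, scale=0.3, size=1000)     # training set
nonmember_loss = rng.normal(loc=1.5, scale=0.5, size=1000)  # held-out set

def infer_membership(losses, threshold):
    """Predict 'member' when the model's loss is below the threshold."""
    return losses < threshold

threshold = 1.0
tpr = infer_membership(member_loss, threshold).mean()      # true positive rate
fpr = infer_membership(nonmember_loss, threshold).mean()   # false positive rate
attack_accuracy = 0.5 * (tpr + (1 - fpr))                  # balanced accuracy
print(f"TPR={tpr:.2f}  FPR={fpr:.2f}  accuracy={attack_accuracy:.2f}")
```

An accuracy well above 0.5 indicates the model leaks membership signal; real evaluations sweep the threshold and report the full ROC curve.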
defense arXiv Feb 23, 2026

Expanding the Role of Diffusion Models for Robust Classifier Training

Pin-Han Huang, Shang-Tse Chen, Hsuan-Tien Lin · National Taiwan University

Improves adversarial training by aligning classifier representations with diffusion model internals, boosting robustness on CIFAR-10/100 and ImageNet

Input Manipulation Attack vision
PDF
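For background, the adversarial-example step that any adversarial-training loop is built on can be sketched as a one-step FGSM on a toy linear model (the paper's actual contribution, aligning classifier features with diffusion-model internals, is not shown; all values here are illustrative):

```python
import numpy as np

# One FGSM step: perturb the input in the direction that increases the
# loss, then train on the perturbed copy.
w = np.array([1.0, -2.0, 0.5])   # toy linear classifier weights
x = np.array([0.2, 0.1, -0.3])   # clean input
y = 1                            # true label in {-1, +1}

def loss_grad_wrt_x(w, x, y):
    """Gradient of the margin loss -y * (w @ x) with respect to x."""
    return -y * w

eps = 0.1                        # L-infinity perturbation budget
x_adv = x + eps * np.sign(loss_grad_wrt_x(w, x, y))
```

Training on `x_adv` instead of `x` is the standard adversarial-training recipe; multi-step PGD iterates this update with projection back into the epsilon-ball.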
defense arXiv Feb 6, 2026

Concept-Aware Privacy Mechanisms for Defending Embedding Inversion Attacks

Yu-Che Tsai, Hsiang Hsiao, Kuan-Yu Chen et al. · National Taiwan University · National Taiwan University AI Center of Research Excellence

Defends text embeddings against inversion attacks via concept-aware differentiable masking and elliptical DP noise calibrated per dimension

Model Inversion Attack nlp
PDF
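The anisotropic-noise idea can be sketched as a diagonal Gaussian mechanism whose scale grows with a per-dimension sensitivity score; the scores below are made up stand-ins for the paper's learned concept-aware masking and elliptical calibration.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.normal(size=16)          # toy text embedding

# Hypothetical per-dimension sensitivity scores (higher = leaks more
# private content); the paper learns these, here they are fabricated.
sensitivity = np.linspace(0.1, 1.0, 16)

def noisy_embedding(x, sens, base_sigma=0.5):
    """Add Gaussian noise whose scale grows with each dimension's
    sensitivity: an anisotropic (elliptical) rather than spherical ball."""
    return x + rng.normal(scale=base_sigma * sens, size=x.shape)

protected = noisy_embedding(embedding, sensitivity)
```

The design intuition: dimensions that carry little private content get near-zero noise, preserving utility, while sensitive dimensions are drowned out for an inversion attacker.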
benchmark arXiv Feb 2, 2026

Expected Harm: Rethinking Safety Evaluation of (Mis)Aligned LLMs

Yen-Shan Chen, Zhi Rui Tam, Cheng-Kuang Wu et al. · National Taiwan University · Independent Researcher

Reveals LLM safety miscalibration via Expected Harm metric, boosting existing jailbreak success rates by up to 2×

Prompt Injection nlp
PDF
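The shift from a binary jailbreak-success flag to a graded expectation can be sketched in a few lines; the harm table below is a made-up stand-in for a real harm scorer run over sampled model responses.

```python
# Toy "expected harm": average a graded harm score over many sampled
# responses to the same prompt, instead of a single pass/fail flag.
responses = ["refusal", "partial leak", "full harmful answer", "refusal"]
HARM = {"refusal": 0.0, "partial leak": 0.5, "full harmful answer": 1.0}

def expected_harm(samples):
    """Mean harm score over sampled responses."""
    return sum(HARM[s] for s in samples) / len(samples)

print(expected_harm(responses))  # 0.375
```

Under a binary attack-success metric the same prompt would score 1.0 (at least one harmful sample exists); the expectation exposes how often and how severely the model actually misbehaves.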
defense arXiv Jan 7, 2026

RADAR: Retrieval-Augmented Detector with Adversarial Refinement for Robust Fake News Detection

Song-Duo Ma, Yi-Hung Liu, Hsin-Yu Lin et al. · National Taiwan University

Adversarially co-trains a retrieval-augmented fake-news detector against an LLM generator using natural-language critiques to improve robustness

Output Integrity Attack nlp
PDF
benchmark arXiv Dec 28, 2025

M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models

Ju-Hsuan Weng, Jia-Wei Liao, Cheng-Fu Chou et al. · National Taiwan University · Academia Sinica

Benchmarks multimodal concept erasure in diffusion models, showing embedding/latent attacks bypass safety with >90% success; proposes IRECE defense

Input Manipulation Attack vision generative
PDF
benchmark arXiv Dec 11, 2025

TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection

Jian-Yu Jiang-Lin, Kang-Yang Huang, Ling Zou et al. · National Taiwan University · National Yang Ming Chiao Tung University +1 more

Benchmark for evaluating MLLMs on interpretable deepfake detection across perception, detection, and hallucination dimensions

Output Integrity Attack vision audio multimodal nlp
PDF
attack arXiv Dec 1, 2025

The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search

Rongzhe Wei, Peizhi Niu, Xinjie Shen et al. · Georgia Institute of Technology · University of Illinois Urbana-Champaign +4 more

Decomposes harmful requests into innocuous sub-queries via tree search to jailbreak commercial LLM guardrails at 95%+ success

Prompt Injection nlp
1 citation PDF Code
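The search-over-decompositions idea can be sketched generically: score each candidate split of a request by its worst-looking sub-query and keep the best. The `innocuousness` scorer and its word list below are hypothetical stand-ins for a real guardrail check, and this flat loop only hints at the paper's adaptive tree search.

```python
from heapq import heappush, heappop

def innocuousness(query):
    """Stand-in scorer: 1.0 if the sub-query contains no flagged word,
    else 0.0 (the banned-word list is fabricated for illustration)."""
    banned = {"bomb", "exploit"}
    return 0.0 if set(query.lower().split()) & banned else 1.0

def best_decomposition(candidate_splits):
    """Pick the decomposition whose *worst* sub-query looks most innocuous."""
    heap = []
    for i, subqueries in enumerate(candidate_splits):
        score = min(innocuousness(q) for q in subqueries)
        heappush(heap, (-score, i, subqueries))  # max-heap via negated score
    _, _, best = heappop(heap)
    return best

splits = [
    ["how to build a bomb"],                                  # flagged whole
    ["chemistry of oxidizers", "reactions that release energy fast"],
]
print(best_decomposition(splits))  # picks the second, innocuous-looking split
```

The tie-breaking index `i` in the heap tuple also prevents Python from comparing the list payloads when scores are equal.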
defense arXiv Nov 20, 2025

PEPPER: Perception-Guided Perturbation for Robust Backdoor Defense in Text-to-Image Diffusion Models

Oscar Chew, Po-Yi Lu, Jayden Lin et al. · Texas A&M University · National Taiwan University +1 more

Defends T2I diffusion models from backdoor triggers by rewriting prompts to be semantically distant yet visually similar, disrupting trigger tokens at inference time.

Model Poisoning vision nlp generative
PDF Code
benchmark arXiv Oct 19, 2025

Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations

Bo-Han Feng, Chien-Feng Liu, Yu-Hsuan Li Liang et al. · National Taiwan University · NVIDIA

Reveals that speaker emotional intensity systematically jailbreaks audio-language models, with medium intensity posing the greatest safety risk

Prompt Injection audio multimodal nlp
1 citation PDF Code
attack Journal of Network and Compute... Oct 11, 2025

ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test

Guan-Yan Yang, Tzu-Yu Cheng, Ya-Wen Teng et al. · National Taiwan University · GARMIN +2 more

Two-phase black-box jailbreak uses ASCII art encoding to bypass LLM safety alignment, including GPT-4o and Claude 3.7 Sonnet

Prompt Injection nlp
2 citations PDF
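The encoding step behind art-based jailbreaks can be illustrated with a tiny hand-rolled bitmap font: the sensitive token is rendered as character art that keyword filters miss but a capable model can still "read". Real attacks use full figlet-style fonts; this two-letter 3x3 font is made up.

```python
# Minimal 3x3 "font" covering two letters, for illustration only.
FONT = {
    "H": ["# #", "###", "# #"],
    "I": ["###", " # ", "###"],
}

def to_ascii_art(word):
    """Render a word row by row, joining each letter's bitmap rows."""
    rows = ["  ".join(FONT[c][r] for c in word) for r in range(3)]
    return "\n".join(rows)

print(to_ascii_art("HI"))
```

The rendered word no longer contains the original substring, so a surface-level string match on the prompt fails even though the visual content is unchanged.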
defense arXiv Oct 6, 2025

WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection

Xi Xuan, Xuechen Liu, Wenxin Zhang et al. · University of Eastern Finland · National Institute of Informatics +4 more

Novel wavelet prompt-tuning architecture for speech deepfake detection, outperforming SOTA on two benchmarks with far fewer trainable parameters

Output Integrity Attack audio
1 citation PDF Code
attack arXiv Oct 1, 2025

Eyes-on-Me: Scalable RAG Poisoning through Transferable Attention-Steering Attractors

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang et al. · CyCraft · National Taiwan University

Scalable RAG poisoning attack using reusable adversarial Attention Attractors that transfer to black-box LLM systems

Input Manipulation Attack Prompt Injection nlp
PDF
defense arXiv Sep 24, 2025

ThinkFake: Reasoning in Multimodal Large Language Models for AI-Generated Image Detection

Tai-Ming Huang, Wei-Tung Lin, Kai-Lung Hua et al. · National Taiwan University · Academia Sinica +3 more

Detects AI-generated images via MLLM step-by-step reasoning trained with GRPO reinforcement learning, achieving strong zero-shot generalization

Output Integrity Attack vision multimodal
3 citations 1 influential PDF
attack arXiv Sep 15, 2025

DRAG: Data Reconstruction Attack using Guided Diffusion

Wa-Kin Lei, Jun-Cheng Chen, Shang-Tse Chen · National Taiwan University · Academia Sinica

Diffusion-guided data reconstruction attack recovers private images from vision foundation model intermediate representations in split inference

Model Inversion Attack vision
PDF Code
attack arXiv Sep 6, 2025

Yours or Mine? Overwriting Attacks Against Neural Audio Watermarking

Lingfeng Yao, Chenpei Huang, Shengyao Wang et al. · University of Houston · Waseda University +3 more

Overwriting attacks replace legitimate audio watermarks with forged ones, achieving ~100% success across white-, gray-, and black-box threat models

Output Integrity Attack audio generative
PDF
defense arXiv Sep 3, 2025

Enhancing Robustness in Post-Processing Watermarking: An Ensemble Attack Network Using CNNs and Transformers

Tzuhsuan Huang, Cheng Yu Yeo, Tsai-Ling Huang et al. · Academia Sinica · National Yang Ming Chiao Tung University +1 more

Adversarial training with CNN+Transformer ensemble attack networks makes post-processing image watermarks robust against regeneration and distortion attacks

Output Integrity Attack vision generative
PDF Code
defense arXiv Aug 27, 2025

AEGIS: Automated Co-Evolutionary Framework for Guarding Prompt Injections Schema

Ting-Chun Liu, Ching-Yu Hsu, Kuan-Yi Lee et al. · National Taiwan University

Co-evolutionary framework auto-evolves attack and defense prompts to harden LLMs against prompt injection without model fine-tuning

Prompt Injection nlp
PDF
benchmark arXiv Aug 23, 2025

Unveiling the Latent Directions of Reflection in Large Language Models

Fu-Chieh Chang, Yu-Ting Lee, Pei-Yuan Wu · MediaTek Research · National Taiwan University

Activation steering reveals latent reflection directions in LLMs, enabling adversarial suppression for jailbreaks or enhancement as a defense

Prompt Injection nlp
PDF
defense arXiv Aug 12, 2025

Fake-Mamba: Real-Time Speech Deepfake Detection Using Bidirectional Mamba as Self-Attention's Alternative

Xi Xuan, Zimo Zhu, Wenxin Zhang et al. · University of Eastern Finland · University of California Santa Barbara +3 more

Novel bidirectional Mamba encoder architecture for real-time audio deepfake detection, outperforming Conformer-based SOTA on ASVspoof benchmarks

Output Integrity Attack audio
PDF Code