Latest papers

59 papers
defense arXiv Mar 24, 2026 · 15d ago

SafeSeek: Universal Attribution of Safety Circuits in Language Models

Miao Yu, Siyuan Fu, Moayad Aloqaily et al. · University of Science and Technology of China · Squirrel AI Learning +4 more

Mechanistic interpretability framework identifying sparse safety circuits in LLMs for backdoor removal and alignment preservation

Model Poisoning · Input Manipulation Attack · Prompt Injection · nlp
PDF
defense arXiv Mar 23, 2026 · 16d ago

Disentangling Speaker Traits for Deepfake Source Verification via Chebyshev Polynomial and Riemannian Metric Learning

Xi Xuan, Wenxin Zhang, Zhiyu Li et al. · University of Eastern Finland · City University of Hong Kong +3 more

Disentangles speaker traits from deepfake source embeddings using Chebyshev polynomials and Riemannian geometry for robust generator verification

Output Integrity Attack · audio · generative
PDF Code
defense arXiv Mar 23, 2026 · 16d ago

Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models

Xingyu Zhu, Beier Zhu, Shuo Wang et al. · University of Science and Technology of China · National University of Singapore +1 more

Null-space projection defense that blocks VLM jailbreaks while preserving benign performance through theoretically grounded activation steering

Input Manipulation Attack · Prompt Injection · multimodal · vision · nlp
PDF
defense arXiv Mar 17, 2026 · 22d ago

Rotated Robustness: A Training-Free Defense against Bit-Flip Attacks on Large Language Models

Deng Liu, Song Chen · University of Science and Technology of China

Training-free defense using orthogonal transformations to protect quantized LLM weights from hardware bit-flip attacks

Model Poisoning · nlp
PDF
defense arXiv Mar 13, 2026 · 26d ago

Lyapunov Stable Graph Neural Flow

Haoyu Chu, Xiaotong Chen, Wei Zhou et al. · China University of Mining and Technology · University of Science and Technology of China

Defends graph neural networks against adversarial topology and feature perturbations using Lyapunov stability constraints on feature dynamics

Input Manipulation Attack · graph
PDF
tool arXiv Mar 9, 2026 · 4w ago

SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution

Chao Wang, Zijin Yang, Yaofei Wang et al. · University of Science and Technology of China · Hefei University of Technology

Few-shot, training-free video attribution tool traces generated videos to source models via sliding-window reconstruction loss signals

Output Integrity Attack · vision · generative
PDF Code
defense arXiv Feb 27, 2026 · 5w ago

GuardAlign: Test-time Safety Alignment in Multimodal Large Language Models

Xingyu Zhu, Beier Zhu, Junfeng Fang et al. · University of Science and Technology of China · Nanyang Technological University +2 more

Training-free defense for VLMs using optimal-transport patch detection and attention calibration to block visual jailbreaks

Input Manipulation Attack · Prompt Injection · vision · nlp · multimodal
PDF
defense arXiv Feb 27, 2026 · 5w ago

SKeDA: A Generative Watermarking Framework for Text-to-video Diffusion Models

Yang Yang, Xinze Zou, Zehua Ma et al. · Anhui University · University of Science and Technology of China +1 more

Embeds robust watermarks into text-to-video diffusion outputs using shuffle-key sampling and differential attention for provenance tracking

Output Integrity Attack · vision · generative
PDF
defense arXiv Feb 10, 2026 · 8w ago

Omni-Safety under Cross-Modality Conflict: Vulnerabilities, Dynamics Mechanisms and Efficient Alignment

Kun Wang, Zherui Li, Zhenhong Zhou et al. · Nanyang Technological University · Beijing University of Posts and Telecommunications +4 more

Exposes cross-modal jailbreak vulnerabilities in omni-modal LLMs and defends via SVD-guided refusal vector amplification with lightweight adapters

Prompt Injection · multimodal · nlp
PDF Code
defense arXiv Feb 7, 2026 · 8w ago

UTOPIA: Unlearnable Tabular Data via Decoupled Shortcut Embedding

Jiaming He, Fuming Luo, Hongwei Li et al. · University of Electronic Science and Technology of China · Independent Researcher +2 more

Protects private tabular data from unauthorized training by injecting decoupled shortcut perturbations that drive models to near-random performance

Data Poisoning Attack · tabular
PDF
defense arXiv Jan 31, 2026 · 9w ago

Self-Guard: Defending Large Reasoning Models via enhanced self-reflection

Jingnan Zheng, Jingjun Xu, Yanzhen Luo et al. · National University of Singapore · Southern University of Science and Technology +2 more

Defends Large Reasoning Models from jailbreaks by steering hidden-state activations to enforce safety compliance over sycophancy

Prompt Injection · nlp
PDF Code
benchmark arXiv Jan 30, 2026 · 9w ago

Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures

Yanghao Su, Wenbo Zhou, Tianwei Zhang et al. · University of Science and Technology of China · Nanyang Technological University +2 more

Mechanistic study showing that character-disposition fine-tuning creates stronger, transferable LLM misalignment, unifying backdoor triggers and jailbreak susceptibility

Model Poisoning · Prompt Injection · nlp
PDF
attack arXiv Jan 30, 2026 · 9w ago

Rethinking Transferable Adversarial Attacks on Point Clouds from a Compact Subspace Perspective

Keke Tang, Xianheng Liu, Weilong Peng et al. · Guangzhou University · University of Science and Technology of China +2 more

Transfers adversarial perturbations across 3D point cloud architectures via low-rank semantic subspace optimization

Input Manipulation Attack · vision
PDF
benchmark arXiv Jan 29, 2026 · 9w ago

WMVLM: Evaluating Diffusion Model Image Watermarking via Vision-Language Models

Zijin Yang, Yu Sun, Kejiang Chen et al. · University of Science and Technology of China · Anhui Province Key Laboratory of Digital Security +1 more

Proposes a unified VLM-based benchmark for evaluating residual and semantic watermarks in diffusion model image outputs

Output Integrity Attack · vision · generative
PDF
attack arXiv Jan 29, 2026 · 9w ago

ICL-EVADER: Zero-Query Black-Box Evasion Attacks on In-Context Learning and Their Defenses

Ningyuan He, Ronghong Huang, Qianqian Tang et al. · University of Science and Technology of China · Shandong University +1 more

Zero-query black-box text attacks evade LLM-based in-context learning classifiers with 95.3% success, together with a joint defense recipe

Prompt Injection · nlp
PDF Code
defense arXiv Jan 28, 2026 · 10w ago

SemBind: Binding Diffusion Watermarks to Semantics Against Black-Box Forgery Attacks

Xin Zhang, Zijin Yang, Kejiang Chen et al. · University of Science and Technology of China

Defends diffusion model image watermarks from black-box forgery by semantically binding latent signals via contrastive learning

Output Integrity Attack · vision · generative
PDF
defense arXiv Jan 27, 2026 · 10w ago

Contrastive Spectral Rectification: Test-Time Defense towards Zero-shot Adversarial Robustness of CLIP

Sen Nie, Jie Zhang, Zhuo Wang et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +1 more

Test-time defense purifies adversarial inputs to CLIP using spectral-guided contrastive rectification, outperforming SOTA by 18.1% against AutoAttack

Input Manipulation Attack · vision · multimodal

1 citation PDF Code
defense arXiv Jan 20, 2026 · 11w ago

Activation-Space Anchored Access Control for Multi-Class Permission Reasoning in Large Language Models

Zhaopeng Zhang, Pengcheng Sun, Lan Zhang et al. · University of Science and Technology of China · Lenovo

Defends LLMs over knowledge bases from unauthorized data leakage using training-free activation steering to enforce multi-class permissions

Sensitive Information Disclosure · Prompt Injection · nlp
PDF
benchmark arXiv Jan 16, 2026 · 11w ago

Your One-Stop Solution for AI-Generated Video Detection

Long Ma, Zihao Xue, Yan Wang et al. · University of Science and Technology of China · Huzhou University +3 more

Comprehensive benchmark evaluating 33 AI-generated video detectors across 440K+ videos from 31 generative models

Output Integrity Attack · vision · generative

1 citation PDF Code
attack arXiv Jan 12, 2026 · 12w ago

MCP-ITP: An Automated Framework for Implicit Tool Poisoning in MCP

Ruiqi Li, Zhiqiang Wang, Yunhao Yao et al. · University of Science and Technology of China

Automated black-box framework generates stealthy MCP tool poisoning attacks that hijack LLM agents into invoking high-privilege tools with 84.2% success rate

Insecure Plugin Design · Prompt Injection · nlp

1 citation PDF