ML Security Papers

Latest papers

111 papers

tool arXiv Apr 29, 2026 · 22d ago

DSIPA: Detecting LLM-Generated Texts via Sentiment-Invariant Patterns Divergence Analysis

Siyuan Li, Aodu Wulianghai, Guangyan Li et al. · Shanghai Jiao Tong University · Chinese Academy of Sciences

Detects LLM-generated text by analyzing sentiment distribution stability, achieving 49.89% F1 improvement over baselines

Output Integrity Attack nlp

PDF

defense arXiv Apr 27, 2026 · 24d ago

Poster: ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents

Jiaqi Li, Yang Zhao, Bin Sun et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +3 more

Self-play security training framework teaching AI agents to detect prompt injection, memory poisoning, and supply-chain attacks via role alternation

AI Supply Chain Attacks Prompt Injection Excessive Agency Blue-Team Agents nlp

PDF

attack arXiv Apr 26, 2026 · 25d ago

Spore: Efficient and Training-Free Privacy Extraction Attack on LLMs via Inference-Time Hybrid Probing

Yu Cui, Ruiqing Yue, Hang Fu et al. · Beijing Institute of Technology · Chinese Academy of Sciences +3 more

Extracts private information from LLM agent memory via single-query hybrid probing in black-box and gray-box settings

Model Inversion Attack Sensitive Information Disclosure nlp

PDF

defense arXiv Apr 24, 2026 · 27d ago

RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents

Wenjie Xiao, Xuehai Tang, Biyu Zhou et al. · University of Chinese Academy of Sciences · Chinese Academy of Sciences

Detects poisoned LLM agent skills by identifying attention hijacking patterns where malicious instructions redirect model reasoning

Prompt Injection Excessive Agency nlp

PDF

defense arXiv Apr 15, 2026 · 5w ago

SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment

Xixun Lin, Yang Liu, Yancheng Chen et al. · Chinese Academy of Sciences · Institute of Applied Physics and Computational Mathematics +1 more

Multi-layer security architecture embedded in LLM agent execution harnesses to defend against prompt injection and tool misuse attacks

Prompt Injection Insecure Plugin Design Excessive Agency nlp

PDF

attack arXiv Apr 14, 2026 · 5w ago

CIA: Inferring the Communication Topology from LLM-based Multi-Agent Systems

Yongxuan Wu, Xixun Lin, He Zhang et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +2 more

Black-box attack inferring LLM multi-agent system communication topologies via adversarial queries, achieving 99% peak AUC

Model Theft Excessive Agency nlp

PDF Code

defense arXiv Apr 14, 2026 · 5w ago

Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints

Songping Peng, Zhiheng Zhang, Daojian Zeng et al. · Hunan Normal University · Chinese Academy of Sciences +1 more

Couples weight subspace constraints with activation regularization to prevent safety degradation during LLM fine-tuning

Prompt Injection nlp

PDF

defense arXiv Apr 14, 2026 · 5w ago

Direct Discrepancy Replay: Distribution-Discrepancy Condensation and Manifold-Consistent Replay for Continual Face Forgery Detection

Tianshuo Zhang, Haoyuan Zhang, Siran Peng et al. · University of Chinese Academy of Sciences · Chinese Academy of Sciences +1 more

Continual deepfake detection via distribution-level replay that condenses forgery cues into compact maps, avoiding raw image storage

Output Integrity Attack vision generative

PDF

defense arXiv Apr 12, 2026 · 5w ago

Detecting RAG Extraction Attack via Dual-Path Runtime Integrity Game

Yuanbo Xie, Yingjie Zhang, Yulin Li et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +4 more

Runtime defense that embeds canary tokens in RAG-retrieved content to detect knowledge base leakage attacks in real time

Sensitive Information Disclosure Prompt Injection nlp

PDF

benchmark arXiv Apr 9, 2026 · 6w ago

AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan

Yuankun Xie, Haonan Cheng, Jiayi Zhou et al. · Communication University of China · Ant Group +3 more

Benchmark challenge for detecting AI-generated speech, sound, singing, and music across diverse generation methods and real-world conditions

Output Integrity Attack audio multimodal nlp

PDF

defense arXiv Apr 8, 2026 · 6w ago

Towards Robust Content Watermarking Against Removal and Forgery Attacks

Yifan Zhu, Yihan Wang, Xiao-Shan Gao · Chinese Academy of Sciences · University of Chinese Academy of Sciences +1 more

Instance-specific watermarking defense for diffusion models resisting removal and forgery attacks via dynamic injection and two-sided detection

Output Integrity Attack vision generative

PDF

attack arXiv Apr 8, 2026 · 6w ago

Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation

Zhiheng Li, Zongyang Ma, Yuntong Pan et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +4 more

Adversarial attack that encodes harmful content in human-readable visual formats to evade MLLM content moderation systems

Input Manipulation Attack Prompt Injection multimodal vision nlp

PDF Code

attack arXiv Apr 8, 2026 · 6w ago

MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning

Yizhe Zeng, Wei Zhang, Yunpeng Li et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +1 more

Backdoor attack on CoT-reasoning LLMs that produces correct reasoning but wrong final answers, evading process-monitoring defenses

Model Poisoning Training Data Poisoning nlp

PDF

attack arXiv Apr 7, 2026 · 6w ago

Stop Fixating on Prompts: Reasoning Hijacking and Constraint Tightening for Red-Teaming LLM Agents

Yanxu Mao, Peipei Liu, Tiehan Cui et al. · Henan University · Chinese Academy of Sciences +2 more

Red-teams LLM agents by hijacking reasoning trajectories and memory retrieval without modifying user prompts, achieving cross-model jailbreaks

Prompt Injection Excessive Agency nlp multimodal

PDF

attack arXiv Mar 25, 2026 · 8w ago

How Vulnerable Are Edge LLMs?

Ao Ding, Hongzong Li, Zi Liang et al. · China University of Geosciences · Hong Kong University of Science and Technology +4 more

Query-based extraction attack on quantized edge LLMs using clustered instruction queries to steal model behavior efficiently

Model Theft nlp

PDF

defense arXiv Mar 25, 2026 · 8w ago

Beyond Semantic Priors: Mitigating Optimization Collapse for Generalizable Visual Forensics

Jipeng Liu, Haichao Shi, Siyu Xing et al. · Chinese Academy of Sciences · Beihang University

Addresses optimization collapse in VLM-based deepfake detectors through gradient signal enhancement and contrastive regional injection for cross-domain generalization

Output Integrity Attack vision multimodal

PDF

While Vision-Language Models (VLMs) like CLIP have emerged as a dominant paradigm for generalizable deepfake detection, a representational disconnect remains: their semantic-centric pre-training is ill-suited for capturing non-semantic artifacts inherent to hyper-realistic synthesis. In this work, we identify a failure mode termed Optimization Collapse, where detectors trained with Sharpness-Aware Minimization (SAM) degenerate to random guessing on non-semantic forgeries once the perturbation radius exceeds a narrow threshold. To theoretically formalize this collapse, we propose the Critical Optimization Radius (COR) to quantify the geometric stability of the optimization landscape, and leverage the Gradient Signal-to-Noise Ratio (GSNR) to measure generalization potential. We establish a theorem proving that COR increases monotonically with GSNR, thereby revealing that the geometric instability of SAM optimization originates from degraded intrinsic generalization potential. This result identifies the layer-wise attenuation of GSNR as the root cause of Optimization Collapse in detecting non-semantic forgeries. Although naively reducing perturbation radius yields stable convergence under SAM, it merely treats the symptom without mitigating the intrinsic generalization degradation, necessitating enhanced gradient fidelity. Building on this insight, we propose the Contrastive Regional Injection Transformer (CoRIT), which integrates a computationally efficient Contrastive Gradient Proxy (CGP) with three training-free strategies: Region Refinement Mask to suppress CGP variance, Regional Signal Injection to preserve CGP magnitude, and Hierarchical Representation Integration to attain more generalizable representations. Extensive experiments demonstrate that CoRIT mitigates optimization collapse and achieves state-of-the-art generalization across cross-domain and universal forgery benchmarks.

vlm transformer multimodal

attack arXiv Mar 22, 2026 · 8w ago

Can LLMs Fool Graph Learning? Exploring Universal Adversarial Attacks on Text-Attributed Graphs

Zihui Chen, Yuling Wang, Pengfei Jiao et al. · Hangzhou Dianzi University · Beihang University +1 more

LLM-driven universal adversarial attack framework targeting text-attributed graph models across GNN and PLM architectures

Input Manipulation Attack nlp graph

PDF

defense arXiv Mar 19, 2026 · 9w ago

CNT: Safety-oriented Function Reuse across LLMs via Cross-Model Neuron Transfer

Yue Zhao, Yujia Gong, Ruigang Liang et al. · Chinese Academy of Sciences · Beijing University of Posts and Telecommunications +1 more

Transfers safety functionality between LLMs by transplanting minimal neuron subsets, enabling alignment enhancement and jailbreak defense without retraining

Prompt Injection nlp

PDF

defense arXiv Mar 19, 2026 · 9w ago

Complementary Text-Guided Attention for Zero-Shot Adversarial Robustness

Lu Yu, Haiyang Zhang, Changsheng Xu · Tianjin University of Technology · Chinese Academy of Sciences +1 more

Defends CLIP against adversarial examples using complementary text-guided attention to maintain zero-shot generalization while improving robustness

Input Manipulation Attack vision nlp multimodal

PDF Code

defense arXiv Mar 16, 2026 · 9w ago

Rethinking LLM Watermark Detection in Black-Box Settings: A Non-Intrusive Third-Party Framework

Zhuoshang Wang, Yubing Ren, Yanan Cao et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +1 more

Black-box framework for third-party watermark detection in LLM outputs using proxy models and statistical tests

Output Integrity Attack nlp

PDF

