Latest papers

111 papers
tool arXiv Apr 29, 2026 · 22d ago

DSIPA: Detecting LLM-Generated Texts via Sentiment-Invariant Patterns Divergence Analysis

Siyuan Li, Aodu Wulianghai, Guangyan Li et al. · Shanghai Jiao Tong University · Chinese Academy of Sciences

Detects LLM-generated text by analyzing sentiment distribution stability, achieving 49.89% F1 improvement over baselines

Output Integrity Attack nlp
PDF
defense arXiv Apr 27, 2026 · 24d ago

Poster: ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents

Jiaqi Li, Yang Zhao, Bin Sun et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +3 more

Self-play security training framework teaching AI agents to detect prompt injection, memory poisoning, and supply-chain attacks via role alternation

AI Supply Chain Attacks Prompt Injection Excessive Agency Blue-Team Agents nlp
PDF
attack arXiv Apr 26, 2026 · 25d ago

Spore: Efficient and Training-Free Privacy Extraction Attack on LLMs via Inference-Time Hybrid Probing

Yu Cui, Ruiqing Yue, Hang Fu et al. · Beijing Institute of Technology · Chinese Academy of Sciences +3 more

Extracts private information from LLM agent memory via single-query hybrid probing in black-box and gray-box settings

Model Inversion Attack Sensitive Information Disclosure nlp
PDF
defense arXiv Apr 24, 2026 · 27d ago

RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents

Wenjie Xiao, Xuehai Tang, Biyu Zhou et al. · University of Chinese Academy of Sciences · Chinese Academy of Sciences

Detects poisoned LLM agent skills by identifying attention hijacking patterns where malicious instructions redirect model reasoning

Prompt Injection Excessive Agency nlp
PDF
defense arXiv Apr 15, 2026 · 5w ago

SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment

Xixun Lin, Yang Liu, Yancheng Chen et al. · Chinese Academy of Sciences · Institute of Applied Physics and Computational Mathematics +1 more

Multi-layer security architecture embedded in LLM agent execution harnesses to defend against prompt injection and tool misuse attacks

Prompt Injection Insecure Plugin Design Excessive Agency nlp
PDF
attack arXiv Apr 14, 2026 · 5w ago

CIA: Inferring the Communication Topology from LLM-based Multi-Agent Systems

Yongxuan Wu, Xixun Lin, He Zhang et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +2 more

Black-box attack inferring LLM multi-agent system communication topologies via adversarial queries, achieving 99% peak AUC

Model Theft Excessive Agency nlp
PDF Code
defense arXiv Apr 14, 2026 · 5w ago

Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints

Songping Peng, Zhiheng Zhang, Daojian Zeng et al. · Hunan Normal University · Chinese Academy of Sciences +1 more

Couples weight subspace constraints with activation regularization to prevent safety degradation during LLM fine-tuning

Prompt Injection nlp
PDF
defense arXiv Apr 14, 2026 · 5w ago

Direct Discrepancy Replay: Distribution-Discrepancy Condensation and Manifold-Consistent Replay for Continual Face Forgery Detection

Tianshuo Zhang, Haoyuan Zhang, Siran Peng et al. · University of Chinese Academy of Sciences · Chinese Academy of Sciences +1 more

Continual deepfake detection via distribution-level replay that condenses forgery cues into compact maps, avoiding raw image storage

Output Integrity Attack vision generative
PDF
defense arXiv Apr 12, 2026 · 5w ago

Detecting RAG Extraction Attack via Dual-Path Runtime Integrity Game

Yuanbo Xie, Yingjie Zhang, Yulin Li et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +4 more

Runtime defense that embeds canary tokens in RAG-retrieved content to detect knowledge base leakage attacks in real time

Sensitive Information Disclosure Prompt Injection nlp
PDF
benchmark arXiv Apr 9, 2026 · 6w ago

AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan

Yuankun Xie, Haonan Cheng, Jiayi Zhou et al. · Communication University of China · Ant Group +3 more

Benchmark challenge for detecting AI-generated speech, sound, singing, and music across diverse generation methods and real-world conditions

Output Integrity Attack audio multimodal nlp
PDF
defense arXiv Apr 8, 2026 · 6w ago

Towards Robust Content Watermarking Against Removal and Forgery Attacks

Yifan Zhu, Yihan Wang, Xiao-Shan Gao · Chinese Academy of Sciences · University of Chinese Academy of Sciences +1 more

Instance-specific watermarking defense for diffusion models resisting removal and forgery attacks via dynamic injection and two-sided detection

Output Integrity Attack vision generative
PDF
attack arXiv Apr 8, 2026 · 6w ago

Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation

Zhiheng Li, Zongyang Ma, Yuntong Pan et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +4 more

Adversarial attack that encodes harmful content in human-readable visual formats to evade MLLM content moderation systems

Input Manipulation Attack Prompt Injection multimodal vision nlp
PDF Code
attack arXiv Apr 8, 2026 · 6w ago

MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning

Yizhe Zeng, Wei Zhang, Yunpeng Li et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +1 more

Backdoor attack on CoT-reasoning LLMs that produces correct reasoning but wrong final answers, evading process-monitoring defenses

Model Poisoning Training Data Poisoning nlp
PDF
attack arXiv Apr 7, 2026 · 6w ago

Stop Fixating on Prompts: Reasoning Hijacking and Constraint Tightening for Red-Teaming LLM Agents

Yanxu Mao, Peipei Liu, Tiehan Cui et al. · Henan University · Chinese Academy of Sciences +2 more

Red-teams LLM agents by hijacking reasoning trajectories and memory retrieval without modifying user prompts, achieving cross-model jailbreaks

Prompt Injection Excessive Agency nlp multimodal
PDF
attack arXiv Mar 25, 2026 · 8w ago

How Vulnerable Are Edge LLMs?

Ao Ding, Hongzong Li, Zi Liang et al. · China University of Geosciences · Hong Kong University of Science and Technology +4 more

Query-based extraction attack on quantized edge LLMs using clustered instruction queries to steal model behavior efficiently

Model Theft nlp
PDF
defense arXiv Mar 25, 2026 · 8w ago

Beyond Semantic Priors: Mitigating Optimization Collapse for Generalizable Visual Forensics

Jipeng Liu, Haichao Shi, Siyu Xing et al. · Chinese Academy of Sciences · Beihang University

Addresses optimization collapse in VLM-based deepfake detectors through gradient signal enhancement and contrastive regional injection for cross-domain generalization

Output Integrity Attack vision multimodal
PDF
attack arXiv Mar 22, 2026 · 8w ago

Can LLMs Fool Graph Learning? Exploring Universal Adversarial Attacks on Text-Attributed Graphs

Zihui Chen, Yuling Wang, Pengfei Jiao et al. · Hangzhou Dianzi University · Beihang University +1 more

LLM-driven universal adversarial attack framework targeting text-attributed graph models across GNN and PLM architectures

Input Manipulation Attack nlp graph
PDF
defense arXiv Mar 19, 2026 · 9w ago

CNT: Safety-oriented Function Reuse across LLMs via Cross-Model Neuron Transfer

Yue Zhao, Yujia Gong, Ruigang Liang et al. · Chinese Academy of Sciences · Beijing University of Posts and Telecommunications +1 more

Transfers safety functionality between LLMs by transplanting minimal neuron subsets, enabling alignment enhancement and jailbreak defense without retraining

Prompt Injection nlp
PDF
defense arXiv Mar 19, 2026 · 9w ago

Complementary Text-Guided Attention for Zero-Shot Adversarial Robustness

Lu Yu, Haiyang Zhang, Changsheng Xu · Tianjin University of Technology · Chinese Academy of Sciences +1 more

Defends CLIP against adversarial examples using complementary text-guided attention to maintain zero-shot generalization while improving robustness

Input Manipulation Attack vision nlp multimodal
PDF Code
defense arXiv Mar 16, 2026 · 9w ago

Rethinking LLM Watermark Detection in Black-Box Settings: A Non-Intrusive Third-Party Framework

Zhuoshang Wang, Yubing Ren, Yanan Cao et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +1 more

Black-box framework for third-party watermark detection in LLM outputs using proxy models and statistical tests

Output Integrity Attack nlp
PDF