Latest papers

144 papers
defense arXiv Apr 28, 2026 · 23d ago

Adversarial Robustness of NTK Neural Networks

Yuxuan Hou · Qiuzhen College · Tsinghua University

Proves NTK neural networks achieve minimax optimal adversarial robustness with early stopping but fail catastrophically when overfitted

Input Manipulation Attack tabular
PDF
attack arXiv Apr 26, 2026 · 25d ago

Spore: Efficient and Training-Free Privacy Extraction Attack on LLMs via Inference-Time Hybrid Probing

Yu Cui, Ruiqing Yue, Hang Fu et al. · Beijing Institute of Technology · Chinese Academy of Sciences +3 more

Extracts private information from LLM agent memory via single-query hybrid probing in black-box and gray-box settings

Model Inversion Attack Sensitive Information Disclosure nlp
PDF
defense arXiv Apr 23, 2026 · 28d ago

UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

Yanran Zhang, Wenzhao Zheng, Yifei Li et al. · Tsinghua University

Unified framework co-training image generation and AI-generated image detection through adversarial synergy and multimodal attention

Output Integrity Attack vision generative
PDF Code
defense arXiv Apr 20, 2026 · 4w ago

From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers

Xiangyu Wen, Yuang Zhao, Xiaoyu Xu et al. · The Chinese University of Hong Kong · Shanghai Jiao Tong University +3 more

Kernel-based security architecture for LLM agents that intercepts unsafe tool calls using deterministic taint tracking and dependency graphs

Insecure Plugin Design Excessive Agency nlp
PDF Code
defense arXiv Apr 20, 2026 · 4w ago

IncreFA: Breaking the Static Wall of Generative Model Attribution

Haotian Qin, Dongliang Chang, Yueying Gao et al. · Beijing University of Posts and Telecommunications · Tsinghua University

Continual learning framework that adapts generative image attribution to newly released models while maintaining accuracy on previous generators

Output Integrity Attack vision generative
PDF Code
benchmark arXiv Apr 13, 2026 · 5w ago

C-ReD: A Comprehensive Chinese Benchmark for AI-Generated Text Detection Derived from Real-World Prompts

Chenxi Qing, Junxi Wu, Zheng Liu et al. · Tsinghua University · Nankai University +2 more

Chinese benchmark for AI-generated text detection with real-world prompts across nine LLMs and multiple domains

Output Integrity Attack nlp
PDF Code
defense arXiv Apr 13, 2026 · 5w ago

LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck for LLM Safety

Junxiao Yang, Haoran Liu, Jinzhe Tu et al. · Tsinghua University · Alibaba Group

Defends LLMs against cross-lingual jailbreaks by anchoring safety alignment in language-agnostic semantic representations rather than surface text

Prompt Injection nlp
PDF
attack arXiv Apr 13, 2026 · 5w ago

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

Yihao Zhang, Kai Wang, Jiangrong Wu et al. · Peking University · Sun Yat-Sen University +4 more

Multi-turn jailbreak attack that chains low-risk prompts to cumulatively bypass LLM safety guardrails across modalities

Prompt Injection nlp multimodal
PDF
defense arXiv Apr 12, 2026 · 5w ago

Defending against Patch-Based and Texture-Based Adversarial Attacks with Spectral Decomposition

Wei Zhang, Xinyu Chang, Xiao Li et al. · Tsinghua University · University of Science and Technology Beijing

Spectral defense using wavelet decomposition to detect and mitigate both patch-based and texture-based adversarial attacks on vision models

Input Manipulation Attack vision
PDF Code
benchmark arXiv Apr 9, 2026 · 6w ago

ACIArena: Toward Unified Evaluation for Agent Cascading Injection

Hengyu An, Minxi Li, Jinghuai Zhang et al. · Zhejiang University · Tsinghua University +3 more

Benchmark framework for evaluating multi-agent LLM systems against cascading injection attacks across external inputs, profiles, and inter-agent messages

Prompt Injection Excessive Agency nlp multimodal
PDF
attack arXiv Apr 2, 2026 · 7w ago

Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models

Jiawei Chen, Simin Huang, Jiawei Du et al. · East China Normal University · Zhongguancun Academy +3 more

Physically realizable 3D adversarial textures that degrade vision-language-action robot models with 96.7% task failure rates

Input Manipulation Attack vision multimodal nlp
PDF Code
attack arXiv Apr 1, 2026 · 7w ago

Enhancing Gradient Inversion Attacks in Federated Learning via Hierarchical Feature Optimization

Hao Fang, Wenbo Yu, Bin Chen et al. · Tsinghua University · Harbin Institute of Technology

GAN-based gradient inversion attack reconstructing client training data from FL gradients via hierarchical feature optimization

Model Inversion Attack vision federated-learning
PDF
attack arXiv Mar 31, 2026 · 7w ago

Dummy-Aware Weighted Attack (DAWA): Breaking the Safe Sink in Dummy Class Defenses

Yunrui Yu, Xuxiang Feng, Pengda Qin et al. · Tsinghua University · University of Macau +1 more

Novel adversarial attack targeting dummy-class defenses by simultaneously attacking true and dummy labels with adaptive weighting

Input Manipulation Attack vision
PDF
benchmark arXiv Mar 30, 2026 · 7w ago

Evaluating Privilege Usage of Agents on Real-World Tools

Quan Zhang, Lianhang Fu, Lvsi Lian et al. · East China Normal University · Xinjiang University +1 more

Benchmark evaluating LLM agents' privilege control under prompt injection attacks using real-world tools, finding 84.80% attack success

Prompt Injection Insecure Plugin Design Excessive Agency nlp
PDF
attack arXiv Mar 30, 2026 · 7w ago

InkDrop: Invisible Backdoor Attacks Against Dataset Condensation

He Yang, Dongyi Lv, Song Ma et al. · Xi'an Jiaotong University · Tsinghua University

Stealthy backdoor attack on dataset condensation using boundary-proximate samples and imperceptible perturbations to evade detection

Model Poisoning vision
PDF Code
defense arXiv Mar 25, 2026 · 8w ago

Why the Maximum Second Derivative of Activations Matters for Adversarial Robustness

Yunrui Yu, Hang Su, Jun Zhu · Tsinghua University

Discovers optimal adversarial robustness occurs when activation function curvature falls within 4-10, revealing fundamental expressivity-sharpness trade-off

Input Manipulation Attack vision
PDF
defense arXiv Mar 24, 2026 · 8w ago

Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories

Yang Li, Yule Liu, Xinlei He et al. · Tsinghua University · The Hong Kong University of Science and Technology +1 more

Fine-tunes LLMs to generate explicit authorization reasoning chains before responses, defending against unauthorized access and prompt injection

Prompt Injection Sensitive Information Disclosure nlp
PDF
defense arXiv Mar 20, 2026 · 8w ago

Neural Uncertainty Principle: A Unified View of Adversarial Fragility and LLM Hallucination

Dong-Xiao Zhang, Hu Lou, Jun-Jie Zhang et al. · Northwest Institute of Nuclear Technology · Tsinghua University +1 more

Unifies adversarial robustness and LLM hallucination under a geometric uncertainty principle, proposing defenses without adversarial training

Input Manipulation Attack Prompt Injection vision nlp multimodal
PDF
tool arXiv Mar 19, 2026 · 9w ago

ClawTrap: A MITM-Based Red-Teaming Framework for Real-World OpenClaw Security Evaluation

Haochen Zhao, Shaoyang Cui · National University of Singapore · Tsinghua University

MITM-based red-teaming framework that tests autonomous web agent security through real-time network traffic manipulation attacks

Prompt Injection Excessive Agency nlp
PDF Code
attack arXiv Mar 16, 2026 · 9w ago

ClawWorm: Self-Propagating Attacks Across LLM Agent Ecosystems

Yihao Zhang, Zeming Wei, Xiaokun Luan et al. · Peking University · Sun Yat-Sen University +3 more

Self-replicating worm attack on LLM agent ecosystems achieving autonomous propagation through configuration hijacking and broadcast infection

AI Supply Chain Attacks Prompt Injection Excessive Agency nlp multimodal
PDF