ML Security Papers

Latest papers

144 papers

defense arXiv Apr 28, 2026 · 23d ago

Adversarial Robustness of NTK Neural Networks

Yuxuan Hou · Qiuzhen College · Tsinghua University

Proves NTK neural networks achieve minimax optimal adversarial robustness with early stopping but fail catastrophically when overfitted

Input Manipulation Attack tabular

PDF

attack arXiv Apr 26, 2026 · 25d ago

Spore: Efficient and Training-Free Privacy Extraction Attack on LLMs via Inference-Time Hybrid Probing

Yu Cui, Ruiqing Yue, Hang Fu et al. · Beijing Institute of Technology · Chinese Academy of Sciences +3 more

Extracts private information from LLM agent memory via single-query hybrid probing in black-box and gray-box settings

Model Inversion Attack Sensitive Information Disclosure nlp

PDF

defense arXiv Apr 23, 2026 · 28d ago

UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

Yanran Zhang, Wenzhao Zheng, Yifei Li et al. · Tsinghua University

Unified framework co-training image generation and AI-generated image detection through adversarial synergy and multimodal attention

Output Integrity Attack vision generative

PDF Code

defense arXiv Apr 20, 2026 · 4w ago

From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers

Xiangyu Wen, Yuang Zhao, Xiaoyu Xu et al. · The Chinese University of Hong Kong · Shanghai Jiao Tong University +3 more

Kernel-based security architecture for LLM agents that intercepts unsafe tool calls using deterministic taint tracking and dependency graphs

Insecure Plugin Design Excessive Agency nlp

PDF Code

defense arXiv Apr 20, 2026 · 4w ago

IncreFA: Breaking the Static Wall of Generative Model Attribution

Haotian Qin, Dongliang Chang, Yueying Gao et al. · Beijing University of Posts and Telecommunications · Tsinghua University

Continual learning framework that adapts generative image attribution to newly released models while maintaining accuracy on previous generators

Output Integrity Attack vision generative

PDF Code

benchmark arXiv Apr 13, 2026 · 5w ago

C-ReD: A Comprehensive Chinese Benchmark for AI-Generated Text Detection Derived from Real-World Prompts

Chenxi Qing, Junxi Wu, Zheng Liu et al. · Tsinghua University · Nankai University +2 more

Chinese benchmark for AI-generated text detection with real-world prompts across nine LLMs and multiple domains

Output Integrity Attack nlp

PDF Code

defense arXiv Apr 13, 2026 · 5w ago

LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck for LLM Safety

Junxiao Yang, Haoran Liu, Jinzhe Tu et al. · Tsinghua University · Alibaba Group

Defends LLMs against cross-lingual jailbreaks by anchoring safety alignment in language-agnostic semantic representations rather than surface text

Prompt Injection nlp

PDF

attack arXiv Apr 13, 2026 · 5w ago

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

Yihao Zhang, Kai Wang, Jiangrong Wu et al. · Peking University · Sun Yat-Sen University +4 more

Multi-turn jailbreak attack that chains low-risk prompts to cumulatively bypass LLM safety guardrails across modalities

Prompt Injection nlp multimodal

PDF

defense arXiv Apr 12, 2026 · 5w ago

Defending against Patch-Based and Texture-Based Adversarial Attacks with Spectral Decomposition

Wei Zhang, Xinyu Chang, Xiao Li et al. · Tsinghua University · University of Science and Technology Beijing

Spectral defense using wavelet decomposition to detect and mitigate both patch-based and texture-based adversarial attacks on vision models

Input Manipulation Attack vision

PDF Code

benchmark arXiv Apr 9, 2026 · 6w ago

ACIArena: Toward Unified Evaluation for Agent Cascading Injection

Hengyu An, Minxi Li, Jinghuai Zhang et al. · Zhejiang University · Tsinghua University +3 more

Benchmark framework for evaluating multi-agent LLM systems against cascading injection attacks across external inputs, profiles, and inter-agent messages

Prompt Injection Excessive Agency nlp multimodal

PDF

attack arXiv Apr 2, 2026 · 7w ago

Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models

Jiawei Chen, Simin Huang, Jiawei Du et al. · East China Normal University · Zhongguancun Academy +3 more

Physically realizable 3D adversarial textures that degrade vision-language-action robot models with 96.7% task failure rates

Input Manipulation Attack vision multimodal nlp

PDF Code

Vision-language-action (VLA) models have shown strong performance in robotic manipulation, yet their robustness to physically realizable adversarial attacks remains underexplored. Existing studies reveal vulnerabilities through language perturbations and 2D visual attacks, but these attack surfaces are either less representative of real deployment or limited in physical realism. In contrast, adversarial 3D textures pose a more physically plausible and damaging threat, as they are naturally attached to manipulated objects and are easier to deploy in physical environments. Bringing adversarial 3D textures to VLA systems is nevertheless nontrivial. A central obstacle is that standard 3D simulators do not provide a differentiable optimization path from the VLA objective function back to object appearance, which makes end-to-end optimization difficult. To address this, we introduce Foreground-Background Decoupling (FBD), which enables differentiable texture optimization through dual-renderer alignment while preserving the original simulation environment. To further ensure that the attack remains effective across long horizons and diverse viewpoints in the physical world, we propose Trajectory-Aware Adversarial Optimization (TAAO), which prioritizes behaviorally critical frames and stabilizes optimization with a vertex-based parameterization. Built on these designs, we present Tex3D, the first framework for end-to-end optimization of 3D adversarial textures directly within the VLA simulation environment. Experiments in both simulation and real-robot settings show that Tex3D significantly degrades VLA performance across multiple manipulation tasks, achieving task failure rates of up to 96.7%. Our empirical results expose critical vulnerabilities of VLA systems to physically grounded 3D adversarial attacks and highlight the need for robustness-aware training.

vlm multimodal transformer · East China Normal University · Zhongguancun Academy · A*STAR +2 more


attack arXiv Apr 1, 2026 · 7w ago

Enhancing Gradient Inversion Attacks in Federated Learning via Hierarchical Feature Optimization

Hao Fang, Wenbo Yu, Bin Chen et al. · Tsinghua University · Harbin Institute of Technology

GAN-based gradient inversion attack reconstructing client training data from FL gradients via hierarchical feature optimization

Model Inversion Attack vision federated-learning

PDF

attack arXiv Mar 31, 2026 · 7w ago

Dummy-Aware Weighted Attack (DAWA): Breaking the Safe Sink in Dummy Class Defenses

Yunrui Yu, Xuxiang Feng, Pengda Qin et al. · Tsinghua University · University of Macau +1 more

Novel adversarial attack targeting dummy-class defenses by simultaneously attacking true and dummy labels with adaptive weighting

Input Manipulation Attack vision

PDF

benchmark arXiv Mar 30, 2026 · 7w ago

Evaluating Privilege Usage of Agents on Real-World Tools

Quan Zhang, Lianhang Fu, Lvsi Lian et al. · East China Normal University · Xinjiang University +1 more

Benchmark evaluating LLM agents' privilege control under prompt injection attacks using real-world tools, finding 84.80% attack success

Prompt Injection Insecure Plugin Design Excessive Agency nlp

PDF

attack arXiv Mar 30, 2026 · 7w ago

InkDrop: Invisible Backdoor Attacks Against Dataset Condensation

He Yang, Dongyi Lv, Song Ma et al. · Xi'an Jiaotong University · Tsinghua University

Stealthy backdoor attack on dataset condensation using boundary-proximate samples and imperceptible perturbations to evade detection

Model Poisoning vision

PDF Code

defense arXiv Mar 25, 2026 · 8w ago

Why the Maximum Second Derivative of Activations Matters for Adversarial Robustness

Yunrui Yu, Hang Su, Jun Zhu · Tsinghua University

Discovers optimal adversarial robustness occurs when activation function curvature falls within 4-10, revealing fundamental expressivity-sharpness trade-off

Input Manipulation Attack vision

PDF

defense arXiv Mar 24, 2026 · 8w ago

Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories

Yang Li, Yule Liu, Xinlei He et al. · Tsinghua University · The Hong Kong University of Science and Technology +1 more

Fine-tunes LLMs to generate explicit authorization reasoning chains before responses, defending against unauthorized access and prompt injection

Prompt Injection Sensitive Information Disclosure nlp

PDF

defense arXiv Mar 20, 2026 · 8w ago

Neural Uncertainty Principle: A Unified View of Adversarial Fragility and LLM Hallucination

Dong-Xiao Zhang, Hu Lou, Jun-Jie Zhang et al. · Northwest Institute of Nuclear Technology · Tsinghua University +1 more

Unifies adversarial robustness and LLM hallucination under a geometric uncertainty principle, proposing defenses without adversarial training

Input Manipulation Attack Prompt Injection vision nlp multimodal

PDF

tool arXiv Mar 19, 2026 · 9w ago

ClawTrap: A MITM-Based Red-Teaming Framework for Real-World OpenClaw Security Evaluation

Haochen Zhao, Shaoyang Cui · National University of Singapore · Tsinghua University

MITM-based red-teaming framework that tests autonomous web agent security through real-time network traffic manipulation attacks

Prompt Injection Excessive Agency nlp

PDF Code

attack arXiv Mar 16, 2026 · 9w ago

ClawWorm: Self-Propagating Attacks Across LLM Agent Ecosystems

Yihao Zhang, Zeming Wei, Xiaokun Luan et al. · Peking University · Sun Yat-Sen University +3 more

Self-replicating worm attack on LLM agent ecosystems achieving autonomous propagation through configuration hijacking and broadcast infection

AI Supply Chain Attacks Prompt Injection Excessive Agency nlp multimodal

PDF


