Latest papers

6 papers
defense arXiv Mar 12, 2026

BackdoorIDS: Zero-shot Backdoor Detection for Pretrained Vision Encoder

Siquan Huang, Yijiang Li, Ningzhi Gao et al. · South China University of Technology · University of California San Diego +1 more

Zero-shot inference-time backdoor detector for vision encoders using progressive masking and embedding trajectory clustering

Model Poisoning vision multimodal
PDF
attack arXiv Dec 1, 2025

The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search

Rongzhe Wei, Peizhi Niu, Xinjie Shen et al. · Georgia Institute of Technology · University of Illinois Urbana-Champaign +4 more

Decomposes harmful requests into innocuous sub-queries via adaptive tree search, jailbreaking commercial LLM guardrails at a 95%+ success rate

Prompt Injection nlp
1 citation PDF Code
defense arXiv Sep 26, 2025

Backdoor Attribution: Elucidating and Controlling Backdoors in Language Models

Miao Yu, Zhenhong Zhou, Moayad Aloqaily et al. · University of Science and Technology of China · Nanyang Technological University +5 more

Mechanistic interpretability framework identifies backdoor-responsible attention heads in LLMs, enabling surgical neutralization or amplification of backdoor behavior

Model Poisoning nlp
1 citation PDF
defense arXiv Sep 11, 2025

ZORRO: Zero-Knowledge Robustness and Privacy for Split Learning (Full Version)

Nojan Sheybani, Alessandro Pegoraro, Jonathan Knauer et al. · University of California San Diego · Technical University of Darmstadt

Defends Split Learning against backdoor injection using zero-knowledge proofs to verify client-side DCT-based defense execution

Model Poisoning federated-learning vision
PDF
attack arXiv Sep 1, 2025

Web Fraud Attacks Against LLM-Driven Multi-Agent Systems

Dezhang Kong, Hujin Peng, Yilun Zhang et al. · Zhejiang University · Changsha University of Science and Technology +4 more

Attacks LLM multi-agent systems via manipulated web links using homoglyph, subdirectory, and obfuscation techniques

Insecure Plugin Design Excessive Agency nlp
PDF Code
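The homoglyph technique mentioned in the summary above is generic Unicode spoofing rather than anything specific to this paper; as a minimal illustrative sketch (not the paper's code), here is how a homoglyph hostname swap works and a deliberately naive ASCII-based check (`looks_suspicious` is a hypothetical helper; real defenses use IDNA/punycode and confusable tables):

```python
import unicodedata

# Cyrillic "а" (U+0430) is visually near-identical to Latin "a" (U+0061).
legit = "example.com"
spoof = legit.replace("a", "\u0430")  # renders the same, differs in code points

print(legit == spoof)  # False: equality compares code points, not glyphs

def looks_suspicious(host: str) -> bool:
    """Naive check: flag hostnames with non-ASCII chars after NFKC normalization."""
    return not unicodedata.normalize("NFKC", host).isascii()

print(looks_suspicious(legit))  # False
print(looks_suspicious(spoof))  # True
```

This is only a toy countermeasure: it rejects all internationalized domain names, which is why production systems instead rely on punycode display and Unicode confusable-detection rules.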
defense arXiv Aug 26, 2025

ReLATE+: Unified Framework for Adversarial Attack Detection, Classification, and Resilient Model Selection in Time-Series Classification

Cagla Ipek Kocal, Onat Gungor, Tajana Rosing et al. · San Diego State University · University of California San Diego

Defends time-series classifiers by detecting and classifying adversarial attacks, then selecting pre-validated robust models to cut retraining costs by 77%

Input Manipulation Attack timeseries
PDF