Latest papers

6 papers
tool arXiv Jan 27, 2026 · 9w ago

Proactive Hardening of LLM Defenses with HASTE

Henry Chen, Victor Aranda, Samarth Keshari et al. · Palo Alto Networks

Iterative hard-negative mining framework that generates evasive prompts to stress-test and retrain prompt injection detectors

Prompt Injection nlp
PDF
attack arXiv Jan 8, 2026 · 12w ago

Deep Dive into the Abuse of DL APIs To Create Malicious AI Models and How to Detect Them

Mohamed Nabeel, Oleksii Starov · Palo Alto Networks

Demonstrates stealthy malicious model injection via TensorFlow API abuse on HuggingFace and proposes LLM-based semantic scanner to detect it

AI Supply Chain Attacks
PDF
attack arXiv Dec 19, 2025 · Dec 2025

AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens

Tung-Ling Li, Yuhao Wu, Hongliang Liu · Palo Alto Networks

Beam-search adversarial control tokens flip LLM-as-a-Judge binary decisions in RLHF pipelines, enabling reward hacking with low-perplexity sequences

Input Manipulation Attack Prompt Injection nlp
PDF
defense CCS Sep 26, 2025 · Sep 2025

You Can't Steal Nothing: Mitigating Prompt Leakages in LLMs via System Vectors

Bochuan Cao, Changjiang Li, Yuanpu Cao et al. · The Pennsylvania State University · Palo Alto Networks +1 more

Attacks GPT-4o/Claude to extract system prompts, then defends with SysVec encoding prompts as hidden internal vectors

Sensitive Information Disclosure nlp
5 citations 1 influentialPDF
attack arXiv Sep 25, 2025 · Sep 2025

Automatic Red Teaming LLM-based Agents with Model Context Protocol Tools

Ping He, Changjiang Li, Binbin Zhao et al. · Zhejiang University · Palo Alto Networks

Automates generation of malicious MCP tools that manipulate LLM agent behavior while evading current detection mechanisms

Insecure Plugin Design Prompt Injection nlp
6 citations PDF
defense arXiv Aug 21, 2025 · Aug 2025

VideoEraser: Concept Erasure in Text-to-Video Diffusion Models

Naen Xu, Jinghuai Zhang, Changjiang Li et al. · Zhejiang University · University of California +2 more

Training-free concept erasure framework prevents T2V diffusion models from generating harmful, private, or copyrighted content despite adversarial prompts

Output Integrity Attack generativevision
PDF