ML Security Papers

Latest papers

9 papers

tool arXiv Apr 30, 2026 · 21d ago

Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

Hongliang Liu, Tung-Ling Li, Yuhao Wu · Palo Alto Networks

Two-pass perturbation probing identifies 50-neuron safety refusal circuits in aligned LLMs, enabling precision ablation interventions

Prompt Injection nlp

PDF

attack arXiv Apr 14, 2026 · 5w ago

Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors

Rui Yin, Tianxu Han, Naen Xu et al. · Zhejiang University · Palo Alto Networks +3 more

Stealthy LLM backdoor injection via weight editing that compiles activation steering into null-space constraints for reliable jailbreaks

Model Poisoning AI Supply Chain Attacks Prompt Injection nlp

PDF

benchmark arXiv Apr 9, 2026 · 6w ago

ACIArena: Toward Unified Evaluation for Agent Cascading Injection

Hengyu An, Minxi Li, Jinghuai Zhang et al. · Zhejiang University · Tsinghua University +3 more

Benchmark framework for evaluating multi-agent LLM systems against cascading injection attacks across external inputs, profiles, and inter-agent messages

Prompt Injection Excessive Agency nlpmultimodal

PDF

tool arXiv Jan 27, 2026 · Jan 2026

Proactive Hardening of LLM Defenses with HASTE

Henry Chen, Victor Aranda, Samarth Keshari et al. · Palo Alto Networks

Iterative hard-negative mining framework that generates evasive prompts to stress-test and retrain prompt injection detectors

Prompt Injection nlp

PDF

attack arXiv Jan 8, 2026 · Jan 2026

Deep Dive into the Abuse of DL APIs To Create Malicious AI Models and How to Detect Them

Mohamed Nabeel, Oleksii Starov · Palo Alto Networks

Demonstrates stealthy malicious model injection via TensorFlow API abuse on HuggingFace and proposes LLM-based semantic scanner to detect it

AI Supply Chain Attacks

PDF

attack arXiv Dec 19, 2025 · Dec 2025

AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens

Tung-Ling Li, Yuhao Wu, Hongliang Liu · Palo Alto Networks

Beam-search adversarial control tokens flip LLM-as-a-Judge binary decisions in RLHF pipelines, enabling reward hacking with low-perplexity sequences

Input Manipulation Attack Prompt Injection nlp

PDF

defense CCS Sep 26, 2025 · Sep 2025

You Can't Steal Nothing: Mitigating Prompt Leakages in LLMs via System Vectors

Bochuan Cao, Changjiang Li, Yuanpu Cao et al. · The Pennsylvania State University · Palo Alto Networks +1 more

Attacks GPT-4o/Claude to extract system prompts, then defends with SysVec encoding prompts as hidden internal vectors

Sensitive Information Disclosure nlp

5 citations 1 influentialPDF

attack arXiv Sep 25, 2025 · Sep 2025

Automatic Red Teaming LLM-based Agents with Model Context Protocol Tools

Ping He, Changjiang Li, Binbin Zhao et al. · Zhejiang University · Palo Alto Networks

Automates generation of malicious MCP tools that manipulate LLM agent behavior while evading current detection mechanisms

Insecure Plugin Design Prompt Injection Red-Team Agents nlp

6 citations PDF

defense arXiv Aug 21, 2025 · Aug 2025

VideoEraser: Concept Erasure in Text-to-Video Diffusion Models

Naen Xu, Jinghuai Zhang, Changjiang Li et al. · Zhejiang University · University of California +2 more

Training-free concept erasure framework prevents T2V diffusion models from generating harmful, private, or copyrighted content despite adversarial prompts

Output Integrity Attack generativevision

PDF

Latest papers

Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors

ACIArena: Toward Unified Evaluation for Agent Cascading Injection

Proactive Hardening of LLM Defenses with HASTE

Deep Dive into the Abuse of DL APIs To Create Malicious AI Models and How to Detect Them

AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens

You Can't Steal Nothing: Mitigating Prompt Leakages in LLMs via System Vectors

Automatic Red Teaming LLM-based Agents with Model Context Protocol Tools

VideoEraser: Concept Erasure in Text-to-Video Diffusion Models

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue