ML Security Papers

Latest papers

16 papers

attack arXiv Apr 13, 2026 · 5w ago

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

Yihao Zhang, Kai Wang, Jiangrong Wu et al. · Peking University · Sun Yat-Sen University +4 more

Multi-turn jailbreak attack that chains low-risk prompts to cumulatively bypass LLM safety guardrails across modalities

Prompt Injection nlp multimodal

PDF

attack arXiv Apr 6, 2026 · 6w ago

Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

Zijun Wang, Haoqin Tu, Letian Zhang et al. · UC Santa Cruz · National University of Singapore +4 more

Real-world evaluation showing poisoning of agent persistent state (skills, config, memory) increases attack success from 25% to 64-74% across four LLM backbones

Prompt Injection Excessive Agency nlp

PDF Code

attack FLLM Mar 4, 2026 · 11w ago

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions

Neha Nagaraja, Lan Zhang, Zhilong Wang et al. · Northern Arizona University · ByteDance

Black-box attack conceals adversarial text instructions inside natural images to hijack multimodal LLM outputs via visual prompt injection

Input Manipulation Attack Prompt Injection vision nlp multimodal

PDF

defense arXiv Mar 2, 2026 · 11w ago

Towards Privacy-Preserving LLM Inference via Collaborative Obfuscation (Technical Report)

Yu Lin, Qizhi Zhang, Wenqiang Ruan et al. · ByteDance · Nanjing University

Defends user input privacy in cloud LLM inference by obfuscating activations to resist internal state inversion attacks

Model Inversion Attack Sensitive Information Disclosure nlp

PDF

attack arXiv Feb 24, 2026 · 12w ago

OptiLeak: Efficient Prompt Reconstruction via Reinforcement Learning in Multi-tenant LLM Services

Longxiang Wang, Xiang Zheng, Xuhao Zhang et al. · City University of Hong Kong · ByteDance

Attacks multi-tenant LLM services via KV cache side-channels to reconstruct private prompts with 12× efficiency gains

Sensitive Information Disclosure nlp

PDF

benchmark arXiv Dec 6, 2025 · Dec 2025

OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

Xiaojun Jia, Jie Liao, Qi Guo et al. · Nanyang Technological University · BraneMatrix AI +7 more

Unified benchmark and toolbox evaluating 13 attack methods and 15 defenses against multimodal jailbreaks across 18 open- and closed-source MLLMs

Prompt Injection multimodal nlp vision

5 citations PDF Code

attack arXiv Nov 12, 2025 · Nov 2025

Potent but Stealthy: Rethink Profile Pollution against Sequential Recommendation via Bi-level Constrained Reinforcement Paradigm

Jiajie Su, Zihan Nan, Yunshan Ma et al. · Zhejiang University · Peking University +4 more

RL-driven profile pollution attack crafts stealthy input sequence perturbations to hijack sequential recommender predictions

Input Manipulation Attack nlp

PDF

attack arXiv Nov 11, 2025 · Nov 2025

Why does weak-OOD help? A Further Step Towards Understanding Jailbreaking VLMs

Yuxuan Zhou, Yuzhao Peng, Yang Bai et al. · Tsinghua University · ByteDance +4 more

Analyzes why mild OOD image manipulation best jailbreaks VLMs, then proposes JOCR, an OCR-based visual attack outperforming SOTA baselines

Input Manipulation Attack Prompt Injection vision multimodal nlp

PDF

attack arXiv Nov 10, 2025 · Nov 2025

JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework

Yuxuan Zhou, Yang Bai, Kuofeng Gao et al. · Tsinghua University · ByteDance +1 more

Multi-agent framework automates black-box jailbreaking of VLMs via coordinated image-text pair generation, achieving 60%+ ASR on GPT-4o

Prompt Injection multimodal nlp

PDF

defense arXiv Oct 22, 2025 · Oct 2025

FPT-Noise: Dynamic Scene-Aware Counterattack for Test-Time Adversarial Defense in Vision-Language Models

Jia Deng, Jin Li, Zhenhua Zhao et al. · Guangzhou University · ByteDance

Test-time defense for CLIP that dynamically generates image-specific counterattack noise to neutralize adversarial perturbations without retraining

Input Manipulation Attack vision multimodal

2 citations PDF

defense arXiv Oct 20, 2025 · Oct 2025

Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

Jiawei Zhang, Andrew Estornell, David D. Baek et al. · ByteDance · University of Chicago +2 more

Inference-time defense reintroducing alignment tokens mid-generation to block jailbreaks and adversarial prefill attacks in LLMs

Input Manipulation Attack Prompt Injection nlp

PDF

attack arXiv Oct 19, 2025 · Oct 2025

Black-box Optimization of LLM Outputs by Asking for Directions

Jie Zhang, Meng Ding, Yang Liu et al. · ETH Zürich · University at Buffalo +1 more

Exploits LLMs' comparative confidence expressions as black-box optimization signal for adversarial image attacks, jailbreaks, and prompt injections

Input Manipulation Attack Prompt Injection vision nlp multimodal

2 citations PDF Code

benchmark arXiv Oct 14, 2025 · Oct 2025

When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection

Lang Gao, Xuhui Li, Chenxi Wang et al. · MBZUAI · ByteDance +2 more

Benchmarks AI-text detectors on personalized LLM imitations, reveals feature-inversion failure mode, proposes diagnostic probe framework

Output Integrity Attack nlp

1 citation PDF Code

defense SSRN Oct 8, 2025 · Oct 2025

A2AS: Agentic AI Runtime Security and Self-Defense

Eugene Neelou, Ivan Novikov, Max Moroz et al. · A2AS · OWASP +10 more

Proposes A2AS runtime security framework for LLM agents enforcing prompt authentication, behavior boundaries, and in-context defenses

Prompt Injection Excessive Agency nlp

3 citations PDF

defense arXiv Sep 4, 2025 · Sep 2025

Between a Rock and a Hard Place: The Tension Between Ethical Reasoning and Safety Alignment in LLMs

Shei Pern Chua, Zhen Leng Thai, Kai Jun Teh et al. · Tsinghua University · ByteDance +1 more

Multi-turn jailbreak embeds harmful requests in ethical dilemmas to bypass LLM safety; LoRA defense separates analytic from instrumental harmful responses

Prompt Injection nlp

PDF

defense arXiv Aug 2, 2025 · Aug 2025

AgentArmor: Enforcing Program Analysis on Agent Runtime Trace to Defend Against Prompt Injection

Peiran Wang, Yang Liu, Yunfei Lu et al. · ByteDance

Defends LLM agents against prompt injection by converting runtime traces into program dependency graphs with a type-system policy enforcer

Prompt Injection Excessive Agency nlp

PDF
