Latest papers

14 papers
attack FLLM Mar 4, 2026 · 4w ago

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions

Neha Nagaraja, Lan Zhang, Zhilong Wang et al. · Northern Arizona University · ByteDance

Black-box attack conceals adversarial text instructions inside natural images to hijack multimodal LLM outputs via visual prompt injection

Input Manipulation Attack Prompt Injection visionnlpmultimodal
PDF
defense arXiv Mar 2, 2026 · 5w ago

Towards Privacy-Preserving LLM Inference via Collaborative Obfuscation (Technical Report)

Yu Lin, Qizhi Zhang, Wenqiang Ruan et al. · ByteDance · Nanjing University

Defends user input privacy in cloud LLM inference by obfuscating activations to resist internal state inversion attacks

Model Inversion Attack Sensitive Information Disclosure nlp
PDF
attack arXiv Feb 24, 2026 · 5w ago

OptiLeak: Efficient Prompt Reconstruction via Reinforcement Learning in Multi-tenant LLM Services

Longxiang Wang, Xiang Zheng, Xuhao Zhang et al. · City University of Hong Kong · ByteDance

Attacks multi-tenant LLM services via KV cache side-channels to reconstruct private prompts with 12× efficiency gains

Sensitive Information Disclosure nlp
PDF
benchmark arXiv Dec 6, 2025 · Dec 2025

OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

Xiaojun Jia, Jie Liao, Qi Guo et al. · Nanyang Technological University · BraneMatrix AI +7 more

Unified benchmark and toolbox evaluating 13 attack methods and 15 defenses against multimodal jailbreaks across 18 open- and closed-source MLLMs

Prompt Injection multimodalnlpvision
5 citations PDF Code
attack arXiv Nov 12, 2025 · Nov 2025

Potent but Stealthy: Rethink Profile Pollution against Sequential Recommendation via Bi-level Constrained Reinforcement Paradigm

Jiajie Su, Zihan Nan, Yunshan Ma et al. · Zhejiang University · Peking University +4 more

RL-driven profile pollution attack crafts stealthy input sequence perturbations to hijack sequential recommender predictions

Input Manipulation Attack nlp
PDF
attack arXiv Nov 11, 2025 · Nov 2025

Why does weak-OOD help? A Further Step Towards Understanding Jailbreaking VLMs

Yuxuan Zhou, Yuzhao Peng, Yang Bai et al. · Tsinghua University · ByteDance +4 more

Analyzes why mild OOD image manipulation best jailbreaks VLMs, then proposes JOCR, an OCR-based visual attack outperforming SOTA baselines

Input Manipulation Attack Prompt Injection visionmultimodalnlp
PDF
attack arXiv Nov 10, 2025 · Nov 2025

JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework

Yuxuan Zhou, Yang Bai, Kuofeng Gao et al. · Tsinghua University · ByteDance +1 more

Multi-agent framework automates black-box jailbreaking of VLMs via coordinated image-text pair generation, achieving 60%+ ASR on GPT-4o

Prompt Injection multimodalnlp
PDF
defense arXiv Oct 22, 2025 · Oct 2025

FPT-Noise: Dynamic Scene-Aware Counterattack for Test-Time Adversarial Defense in Vision-Language Models

Jia Deng, Jin Li, Zhenhua Zhao et al. · Guangzhou University · ByteDance

Test-time defense for CLIP that dynamically generates image-specific counterattack noise to neutralize adversarial perturbations without retraining

Input Manipulation Attack visionmultimodal
2 citations PDF
defense arXiv Oct 20, 2025 · Oct 2025

Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

Jiawei Zhang, Andrew Estornell, David D. Baek et al. · ByteDance · University of Chicago +2 more

Inference-time defense reintroducing alignment tokens mid-generation to block jailbreaks and adversarial prefill attacks in LLMs

Input Manipulation Attack Prompt Injection nlp
PDF
attack arXiv Oct 19, 2025 · Oct 2025

Black-box Optimization of LLM Outputs by Asking for Directions

Jie Zhang, Meng Ding, Yang Liu et al. · ETH Zürich · University at Buffalo +1 more

Exploits LLMs' comparative confidence expressions as black-box optimization signal for adversarial image attacks, jailbreaks, and prompt injections

Input Manipulation Attack Prompt Injection visionnlpmultimodal
2 citations PDF Code
benchmark arXiv Oct 14, 2025 · Oct 2025

When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection

Lang Gao, Xuhui Li, Chenxi Wang et al. · MBZUAI · ByteDance +2 more

Benchmarks AI-text detectors on personalized LLM imitations, reveals feature-inversion failure mode, proposes diagnostic probe framework

Output Integrity Attack nlp
1 citations PDF Code
defense SSRN Oct 8, 2025 · Oct 2025

A2AS: Agentic AI Runtime Security and Self-Defense

Eugene Neelou, Ivan Novikov, Max Moroz et al. · A2AS · OWASP +10 more

Proposes A2AS runtime security framework for LLM agents enforcing prompt authentication, behavior boundaries, and in-context defenses

Prompt Injection Excessive Agency nlp
3 citations PDF
defense arXiv Sep 4, 2025 · Sep 2025

Between a Rock and a Hard Place: The Tension Between Ethical Reasoning and Safety Alignment in LLMs

Shei Pern Chua, Zhen Leng Thai, Kai Jun Teh et al. · Tsinghua University · ByteDance +1 more

Multi-turn jailbreak embeds harmful requests in ethical dilemmas to bypass LLM safety; LoRA defense separates analytic from instrumental harmful responses

Prompt Injection nlp
PDF
defense arXiv Aug 2, 2025 · Aug 2025

AgentArmor: Enforcing Program Analysis on Agent Runtime Trace to Defend Against Prompt Injection

Peiran Wang, Yang Liu, Yunfei Lu et al. · ByteDance

Defends LLM agents against prompt injection by converting runtime traces into program dependency graphs with a type-system policy enforcer

Prompt Injection Excessive Agency nlp
PDF