Latest papers

21 papers
attack arXiv Mar 17, 2026 · 22d ago

Adversarial Attacks against Modern Vision-Language Models

Alejandro Paredes La Torre · Duke University

Gradient-based adversarial attacks achieve 53-67% success against a LLaVA VLM agent but only 6-15% against Qwen2.5-VL

Input Manipulation Attack Prompt Injection multimodal vision nlp
PDF
tool arXiv Mar 3, 2026 · 5w ago

MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models

Zhongxi Wang, Yueqian Lin, Jingyang Zhang et al. · Duke University · Virtue AI

Open-source platform for red-teaming multimodal LLMs with multi-turn jailbreaks and cross-modal payload switching

Prompt Injection nlp multimodal
PDF
benchmark arXiv Feb 24, 2026 · 6w ago

AIForge-Doc: A Benchmark for Detecting AI-Forged Tampering in Financial and Form Documents

Jiaqi Wu, Yuchen Zhou, Muduo Xu et al. · Duke University · New York University +3 more

Benchmark revealing that all existing detectors fail on diffusion-model-inpainted forgeries in financial documents

Output Integrity Attack vision
1 citation PDF
benchmark arXiv Feb 23, 2026 · 6w ago

Can a Teenager Fool an AI? Evaluating Low-Cost Cosmetic Attacks on Age Estimation Systems

Xingyu Shen, Tommy Duong, Xiaodong An et al. · UC Berkeley · Duke University +4 more

Evaluates cosmetic physical attacks (beard, makeup, wrinkles) that fool age-estimation AI into misclassifying minors as adults, achieving success rates of up to 83%

Input Manipulation Attack vision
PDF
defense arXiv Feb 23, 2026 · 6w ago

Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement

Amirhossein Farzam, Majid Behabahani, Mani Malek et al. · Duke University · Princeton University +3 more

Detects concealed LLM jailbreaks by disentangling goal and framing signals in internal activation space

Prompt Injection nlp
PDF
defense arXiv Feb 23, 2026 · 6w ago

CREDIT: Certified Ownership Verification of Deep Neural Networks Against Model Extraction Attacks

Bolin Shen, Zhan Cheng, Neil Zhenqiang Gong et al. · Florida State University · University of Wisconsin +2 more

Certifies DNN ownership against model extraction using mutual information similarity with theoretical verification guarantees

Model Theft vision nlp
PDF Code
defense arXiv Feb 14, 2026 · 7w ago

AlignSentinel: Alignment-Aware Detection of Prompt Injection Attacks

Yuqi Jia, Ruiqi Wang, Xilong Wang et al. · Duke University · NVIDIA

Three-class attention-based classifier detects prompt injection by distinguishing misaligned, aligned, and non-instruction LLM inputs

Prompt Injection nlp
PDF
benchmark arXiv Feb 12, 2026 · 7w ago

MalTool: Malicious Tool Attacks on LLM Agents

Yuepeng Hu, Yuqi Jia, Mengyuan Li et al. · Duke University · UC Berkeley

Benchmarks malicious-tool code attacks on LLM agents, where coding LLMs generate evasive malware that defeats VirusTotal and agent-specific detectors

AI Supply Chain Attacks Insecure Plugin Design nlp
PDF
defense arXiv Feb 3, 2026 · 9w ago

WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents

Xilong Wang, Yinuo Liu, Zhun Wang et al. · Duke University · UC Berkeley

Defends LLM web agents against indirect prompt injection by detecting and localizing malicious webpage segments

Prompt Injection nlp
PDF Code
attack arXiv Dec 10, 2025 · Dec 2025

ObliInjection: Order-Oblivious Prompt Injection Attack to LLM Agents with Multi-source Data

Reachal Wang, Yuqi Jia, Neil Zhenqiang Gong · Duke University

Gradient-optimized prompt injection attack on multi-source LLM agents that succeeds regardless of segment ordering in the input

Input Manipulation Attack Prompt Injection nlp
2 citations PDF Code
defense arXiv Nov 26, 2025 · Nov 2025

Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs

Dongkyu Derek Cho, Huan Song, Arijit Ghosh Chowdhury et al. · Duke University · AWS Generative AI Innovation Center

Demonstrates that RLVR fine-tuning maintains LLM safety guardrails while improving reasoning, breaking the assumed safety-capability tradeoff

Prompt Injection nlp
1 citation PDF
benchmark arXiv Nov 26, 2025 · Nov 2025

The Double-Edged Nature of the Rashomon Set for Trustworthy Machine Learning

Ethan Hsu, Harry Chen, Chudi Zhong et al. · Duke University · MIT +2 more

Analyzes how Rashomon set diversity improves adversarial robustness but increases training-data leakage, proving a robustness-privacy trade-off

Input Manipulation Attack Model Inversion Attack tabular
PDF
defense arXiv Oct 15, 2025 · Oct 2025

PIShield: Detecting Prompt Injection Attacks via Intrinsic LLM Features

Wei Zou, Yupei Liu, Yanting Wang et al. · Pennsylvania State University · Duke University

Detects prompt injection in LLM applications using residual-stream representations and a lightweight linear classifier

Prompt Injection nlp
PDF
defense arXiv Oct 14, 2025 · Oct 2025

PromptLocate: Localizing Prompt Injection Attacks

Yuqi Jia, Yupei Liu, Zedian Shao et al. · Duke University · The Pennsylvania State University

First prompt injection localization method for LLMs, pinpointing injected instructions and data for post-attack forensics

Prompt Injection nlp
8 citations 1 influential PDF
benchmark arXiv Oct 1, 2025 · Oct 2025

WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents

Yinuo Liu, Ruohan Xu, Xilong Wang et al. · Duke University

Benchmarks prompt injection detection methods for web agents, exposing failures against instruction-free and imperceptible image attacks

Input Manipulation Attack Prompt Injection nlp vision multimodal
4 citations 1 influential PDF Code
defense arXiv Oct 1, 2025 · Oct 2025

EditTrack: Detecting and Attributing AI-assisted Image Editing

Zhengyuan Jiang, Yuyang Zhang, Moyang Guo et al. · Duke University

Proposes EditTrack to detect whether an image was AI-edited from a specific base image and attribute the responsible editing model

Output Integrity Attack vision
1 citation PDF
defense arXiv Sep 29, 2025 · Sep 2025

SecInfer: Preventing Prompt Injection via Inference-time Scaling

Yupei Liu, Yanting Wang, Yuqi Jia et al. · Penn State University · Duke University

Defends LLMs against prompt injection via multi-path sampling and task-guided aggregation at inference time

Prompt Injection nlp
3 citations 1 influential PDF
defense arXiv Sep 29, 2025 · Sep 2025

Fingerprinting LLMs via Prompt Injection

Yuepeng Hu, Zhengyuan Jiang, Mengyuan Li et al. · Duke University · Ant Group

Fingerprints LLMs for provenance detection by optimizing prompt-injection-based probes that survive post-training and quantization

Model Theft nlp
1 citation 1 influential PDF
survey arXiv Aug 20, 2025 · Aug 2025

A Systematic Survey of Model Extraction Attacks and Defenses: State-of-the-Art and Perspectives

Kaixiang Zhao, Lincan Li, Kaize Ding et al. · University of Notre Dame · Florida State University +3 more

Surveys model extraction attacks and defenses across MLaaS platforms, proposing a taxonomy organized by attack mechanism and computing environment

Model Theft vision nlp tabular
PDF Code
attack arXiv Aug 3, 2025 · Aug 2025

Are All Prompt Components Value-Neutral? Understanding the Heterogeneous Adversarial Robustness of Dissected Prompt in Large Language Models

Yujia Zheng, Tianhao Li, Haotian Huang et al. · Duke University · North China University of Technology +7 more

Attacks LLMs via component-wise text perturbations, revealing heterogeneous adversarial robustness across dissected prompt structures

Prompt Injection nlp
PDF Code