ML Security Papers

Latest papers

32 papers

benchmark arXiv Apr 29, 2026 · 22d ago

Indirect Prompt Injection in the Wild: An Empirical Study of Prevalence, Techniques, and Objectives

Soheil Khodayari, Xuenan Zhang, Bhupendra Acharya et al. · Independent Researcher · CISPA Helmholtz Center for Information Security +1 more

Discovers 15.3K real-world indirect prompt injections across 1.2B URLs targeting LLM crawlers, agents, and automation systems

Prompt Injection nlpmultimodal

PDF

benchmark arXiv Apr 23, 2026 · 28d ago

PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning

Xiaoyi Chen, Haoyuan Wang, Siyuan Tang et al. · Indiana University Bloomington · Independent Researcher +3 more

Evaluation framework exposing weaknesses in LLM privacy unlearning through three-tier attacks: direct retrieval, in-context recovery, and fine-tuning restoration

Model Inversion Attack Sensitive Information Disclosure nlp

PDF

defense arXiv Apr 20, 2026 · 4w ago

Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection

Thamilvendhan Munirathinam · Independent Researcher

Seven cross-domain detection techniques for prompt injection that outperform pattern matching and resist adaptive attacks

Prompt Injection nlp

PDF

benchmark arXiv Apr 9, 2026 · 6w ago

SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection

You Hu, Chenzhuo Zhao, Changfa Mo et al. · Zhejiang University · Independent Researcher +1 more

Benchmark dataset and evaluation framework for detecting AI-generated scientific figures across multiple generator sources and degradation scenarios

Output Integrity Attack visionmultimodalnlp

PDF Code

defense arXiv Apr 5, 2026 · 6w ago

Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents

Mohammad Hossein Chinaei · Independent Researcher

Runtime reference monitor that tracks causal provenance across tool calls to block indirect information leakage through denial-feedback patterns in LLM agents

Insecure Plugin Design nlp

PDF

benchmark arXiv Feb 23, 2026 · 12w ago

Agents of Chaos

Natalie Shapira, Chris Wendler, Avery Yen et al. · Northeastern University · Independent Researcher +11 more

Red-teams live autonomous LLM agents over two weeks, documenting 11 case studies of dangerous failures including system takeover, DoS, and sensitive data disclosure

Excessive Agency Prompt Injection Insecure Plugin Design nlp

3 citations PDF

benchmark arXiv Feb 18, 2026 · Feb 2026

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Nivya Talokar, Ayush K Tarun, Murari Mandal et al. · Independent Researcher · EPFL +4 more

Benchmarks multi-turn, multilingual jailbreaking of LLM agents using a step-by-step illicit planning framework and novel time-to-jailbreak metrics

Prompt Injection Excessive Agency Red-Team Agents Benchmarks & Evaluation nlp

PDF

defense arXiv Feb 12, 2026 · Feb 2026

ANML: Attribution-Native Machine Learning with Guaranteed Robustness

Oliver Zahn, Matt Beton, Simran Chana · arXiv · Independent Researcher +1 more

Defends against data poisoning via contributor-reputation-weighted training, outperforming Byzantine-robust baselines under joint credential-faking and gradient-alignment attacks

Data Poisoning Attack tabularfederated-learning

PDF

defense arXiv Feb 11, 2026 · Feb 2026

Peak + Accumulation: A Proxy-Level Scoring Formula for Multi-Turn LLM Attack Detection

J Alex Corll · Independent Researcher

Proposes a proxy-level scoring formula combining peak risk and persistence to detect multi-turn LLM jailbreaks without LLM inference

Prompt Injection nlp

PDF Code

defense arXiv Feb 7, 2026 · Feb 2026

UTOPIA: Unlearnable Tabular Data via Decoupled Shortcut Embedding

Jiaming He, Fuming Luo, Hongwei Li et al. · University of Electronic Science and Technology of China · Independent Researcher +2 more

Protects private tabular data from unauthorized training by injecting decoupled shortcut perturbations that drive models to near-random performance

Data Poisoning Attack tabular

PDF

tool arXiv Feb 7, 2026 · Feb 2026

NAAMSE: Framework for Evolutionary Security Evaluation of Agents

Kunal Pai, Parth Shah, Harshil Patel · University of California · Independent Researcher

Evolutionary framework auto-generates and mutates adversarial prompts to uncover LLM agent jailbreaks missed by static red-teaming

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF Code

benchmark arXiv Feb 2, 2026 · Feb 2026

Expected Harm: Rethinking Safety Evaluation of (Mis)Aligned LLMs

Yen-Shan Chen, Zhi Rui Tam, Cheng-Kuang Wu et al. · National Taiwan University · Independent Researcher

Reveals LLM safety miscalibration via Expected Harm metric, boosting existing jailbreak success rates by up to 2×

Prompt Injection nlp

PDF

defense arXiv Feb 2, 2026 · Feb 2026

Your AI-Generated Image Detector Can Secretly Achieve SOTA Accuracy, If Calibrated

Muli Yang, Gabriel James Goenawan, Henan Wang et al. · Institute for Infocomm Research (I2R) · Independent Researcher +1 more

Post-hoc Bayesian calibration framework fixes systematic bias in AI-generated image detectors under distribution shift without retraining

Output Integrity Attack visiongenerative

PDF Code

benchmark arXiv Jan 30, 2026 · Jan 2026

AI-Generated Image Detectors Overrely on Global Artifacts: Evidence from Inpainting Exchange

Elif Nebioglu, Emirhan Bilgiç, Adrian Popescu · Independent Researcher · Sorbonne University +2 more

Proposes INP-X benchmark revealing AI image detectors rely on global VAE artifacts, crashing accuracy from 91% to chance level

Output Integrity Attack visiongenerative

PDF Code

defense arXiv Jan 29, 2026 · Jan 2026

Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs

Jun Xue, Yi Chai, Yanzhen Ren et al. · Wuhan University · Independent Researcher +3 more

Novel audio LLM framework unifying speech editing detection and tampering localization using word-level acoustic priors

Output Integrity Attack audionlp

1 citations PDF

attack arXiv Jan 19, 2026 · Jan 2026

On the Evidentiary Limits of Membership Inference for Copyright Auditing

Murat Bilgehan Ertan, Emirhan Böge, Min Chen et al. · Centrum Wiskunde & Informatica · Vrije Universiteit Amsterdam +2 more

SAGE paraphrasing framework defeats membership inference attacks on LLMs by rewriting training data to preserve semantics but evade MIA signals

Membership Inference Attack nlp

PDF

defense arXiv Jan 8, 2026 · Jan 2026

Distilling the Thought, Watermarking the Answer: A Principle Semantic Guided Watermark for Large Reasoning Models

Shuliang Liu, Xingyu Li, Hongyi Liu et al. · The Hong Kong University of Science and Technology (Guangzhou) · The Hong Kong University of Science and Technology +1 more

Watermarks reasoning LLM text outputs by separating thinking from answering and adapting strength via semantic vectors

Output Integrity Attack nlp

1 citations PDF Code

benchmark arXiv Jan 7, 2026 · Jan 2026

Analyzing Reasoning Shifts in Audio Deepfake Detection under Adversarial Attacks: The Reasoning Tax versus Shield Bifurcation

Binh Nguyen, Thai Le · Indiana University · Independent Researcher

Benchmarks reasoning robustness of audio deepfake detectors under adversarial attack, revealing a shield-vs-tax bifurcation based on acoustic perception quality

Input Manipulation Attack Output Integrity Attack audionlp

1 citations PDF

defense arXiv Jan 3, 2026 · Jan 2026

Byzantine-Robust Federated Learning Framework with Post-Quantum Secure Aggregation for Real-Time Threat Intelligence Sharing in Critical IoT Infrastructure

Milad Rahmati, Nima Rahmati · Independent Researcher

Defends federated learning against Byzantine poisoning attacks using reputation-based client filtering and post-quantum secure aggregation for IoT IDS

Data Poisoning Attack federated-learning

PDF

attack arXiv Dec 22, 2025 · Dec 2025

Causal-Guided Detoxify Backdoor Attack of Open-Weight LoRA Models

Linzhi Chen, Yang Sun, Hongru Wei et al. · ShanghaiTech University · Independent Researcher

Backdoor attack on open-weight LoRA adapters using causal-guided detoxification, cutting false trigger rates by 50–70%

Model Poisoning Transfer Learning Attack nlp

1 citations PDF

Loading more papers…

Latest papers

Indirect Prompt Injection in the Wild: An Empirical Study of Prevalence, Techniques, and Objectives

PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning

Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection

SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection

Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents

Agents of Chaos

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

ANML: Attribution-Native Machine Learning with Guaranteed Robustness

Peak + Accumulation: A Proxy-Level Scoring Formula for Multi-Turn LLM Attack Detection

UTOPIA: Unlearnable Tabular Data via Decoupled Shortcut Embedding

NAAMSE: Framework for Evolutionary Security Evaluation of Agents

Expected Harm: Rethinking Safety Evaluation of (Mis)Aligned LLMs

Your AI-Generated Image Detector Can Secretly Achieve SOTA Accuracy, If Calibrated

AI-Generated Image Detectors Overrely on Global Artifacts: Evidence from Inpainting Exchange

Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs

On the Evidentiary Limits of Membership Inference for Copyright Auditing

Distilling the Thought, Watermarking the Answer: A Principle Semantic Guided Watermark for Large Reasoning Models

Analyzing Reasoning Shifts in Audio Deepfake Detection under Adversarial Attacks: The Reasoning Tax versus Shield Bifurcation

Byzantine-Robust Federated Learning Framework with Post-Quantum Secure Aggregation for Real-Time Threat Intelligence Sharing in Critical IoT Infrastructure

Causal-Guided Detoxify Backdoor Attack of Open-Weight LoRA Models

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue