ML Security Papers

Latest papers

21 papers

attack arXiv Mar 6, 2026 · 4w ago

Latent Transfer Attack: Adversarial Examples via Generative Latent Spaces

Eitan Shaar, Ariel Shaulov, Yalcin Tur et al. · Yalcin Tur · Tel-Aviv University +4 more

Transfer adversarial attack optimizing in Stable Diffusion VAE latent space for low-frequency, cross-architecture-transferable perturbations

Input Manipulation Attack vision

PDF

benchmark arXiv Feb 24, 2026 · 5w ago

AIForge-Doc: A Benchmark for Detecting AI-Forged Tampering in Financial and Form Documents

Jiaqi Wu, Yuchen Zhou, Muduo Xu et al. · Duke University · New York University +3 more

Benchmark revealing that all existing detectors fail to detect diffusion-model-inpainted forgeries in financial documents

Output Integrity Attack vision

1 citations PDF

survey arXiv Feb 23, 2026 · 6w ago

Agentic AI as a Cybersecurity Attack Surface: Threats, Exploits, and Defenses in Runtime Supply Chains

Xiaochong Jiang, Shiqi Yang, Wenting Yang et al. · Northeastern University · New York University +2 more

Surveys runtime attack surfaces of agentic LLM systems, introducing the Viral Agent Loop self-propagating worm and a Zero-Trust defense architecture

Prompt Injection Insecure Plugin Design Excessive Agency nlp

PDF

attack arXiv Feb 22, 2026 · 6w ago

Understanding Empirical Unlearning with Combinatorial Interpretability

Shingo Kodama, Niv Cohen, Micah Adler et al. · Middlebury College · New York University +2 more

Attacks machine unlearning methods using combinatorial interpretability, showing erased knowledge persists in weights and recovers rapidly via fine-tuning

Model Inversion Attack nlpvision

PDF

attack arXiv Feb 9, 2026 · 8w ago

Data Reconstruction: Identifiability and Optimization with Sample Splitting

Yujie Shen, Zihan Wang, Jian Qian et al. · Tsinghua University · New York University +1 more

Improves training data reconstruction attacks on neural networks via identifiability theory and a sample-splitting optimization algorithm

Model Inversion Attack vision

PDF

defense arXiv Feb 5, 2026 · 8w ago

Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems

Ziyuan Yang, Wenxuan Ding, Shangbin Feng et al. · University of Washington · New York University

Measures malicious third-party models' impact on multi-LLM collaboration systems and proposes supervisor-based defenses recovering 95% performance

AI Supply Chain Attacks Model Poisoning nlp

PDF Code

benchmark arXiv Jan 27, 2026 · 9w ago

Do Images Speak Louder than Words? Investigating the Effect of Textual Misinformation in VLMs

Chi Zhang, Wenxuan Ding, Jiale Liu et al. · The University of Texas at Austin · New York University +3 more

Benchmarks VLM susceptibility to persuasive conflicting text prompts that override visual evidence, finding 48% average accuracy drop

Prompt Injection visionnlpmultimodal

PDF

attack arXiv Jan 6, 2026 · Jan 2026

Multi-Turn Jailbreaking of Aligned LLMs via Lexical Anchor Tree Search

Devang Kulshreshtha, Hang Su, Chinmay Hegde et al. · Amazon · New York University +1 more

Attacker-LLM-free multi-turn jailbreak via lexical anchor injection achieves 97-100% ASR on GPT/Claude/Llama in ~6.4 queries

Prompt Injection nlp

PDF

attack arXiv Jan 4, 2026 · Jan 2026

DiMEx: Breaking the Cold Start Barrier in Data-Free Model Extraction via Latent Diffusion Priors

Yash Thesia, Meera Suthar · New York University

Weaponizes latent diffusion priors to eliminate cold start in data-free model extraction attacks against MLaaS APIs.

Model Theft vision

PDF

defense arXiv Jan 1, 2026 · Jan 2026

PatchBlock: A Lightweight Defense Against Adversarial Patches for Embedded EdgeAI Devices

Nandish Chattopadhyay, Abdul Basit, Amira Guesmi et al. · New York University · Dubai Artificial Intelligence

Lightweight CPU preprocessing defense neutralizes adversarial patches on EdgeAI devices via isolation forest and dimensionality reduction

Input Manipulation Attack vision

PDF

defense arXiv Nov 16, 2025 · Nov 2025

Beyond Pixels: Semantic-aware Typographic Attack for Geo-Privacy Protection

Jiayi Zhu, Yihao Huang, Yue Cao et al. · Xidian University · Ltd +5 more

Defends geo-privacy by embedding semantics-aware deceptive text overlays around images to mislead LVLMs into predicting wrong geolocations.

Input Manipulation Attack Prompt Injection visionmultimodal

PDF

defense arXiv Oct 11, 2025 · Oct 2025

SimKey: A Semantically Aware Key Module for Watermarking Language Models

Shingo Kodama, Haya Diwan, Lucas Rosenblatt et al. · Middlebury College · New York University +1 more

Semantic LSH-based key module makes LLM text watermarks robust to paraphrasing while blocking harmful content false attribution

Output Integrity Attack nlp

1 citations PDF Code

defense EMNLP Sep 29, 2025 · Sep 2025

Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection

Hoang Phan, Victor Li, Qi Lei · New York University

Inference-time jailbreak defense using progressive self-reflection reduces LLM attack success rates from ~80% to under 6%

Prompt Injection nlp

1 citations PDF

defense arXiv Sep 15, 2025 · Sep 2025

Early Approaches to Adversarial Fine-Tuning for Prompt Injection Defense: A 2022 Study of GPT-3 and Contemporary Models

Gustavo Sandoval, Denys Fenchenko, Junyao Chen · New York University

Adversarial fine-tuning defense cuts GPT-3 prompt injection attack success from 31% to near zero

Prompt Injection nlp

PDF

defense arXiv Sep 15, 2025 · Sep 2025

SENTRA: Selected-Next-Token Transformer for LLM Text Detection

Mitchell Plyler, Yilun Zhang, Alexander Tuzhilin et al. · Mozilla Corporation · Ciphero AI +1 more

Novel Transformer detector using selected-next-token probabilities and contrastive pre-training to identify LLM-generated text out-of-domain

Output Integrity Attack nlp

PDF Code

attack arXiv Aug 28, 2025 · Aug 2025

Ransomware 3.0: Self-Composing and LLM-Orchestrated

Md Raz, Meet Udeshi, P.V. Sai Charan et al. · New York University

Demonstrates LLM-orchestrated ransomware that autonomously synthesizes polymorphic malware at runtime from natural language prompts

Excessive Agency nlp

PDF

defense arXiv Aug 16, 2025 · Aug 2025

TriQDef: Disrupting Semantic and Gradient Alignment to Prevent Adversarial Patch Transferability in Quantized Neural Networks

Amira Guesmi, Bassem Ouni, Muhammad Shafique · New York University · Technology Innovation Institute

Defends quantized neural networks against transferable adversarial patches by disrupting semantic and gradient alignment across bit-widths

Input Manipulation Attack vision

PDF

survey arXiv Aug 4, 2025 · Aug 2025

A Survey on Data Security in Large Language Models

Kang Chen, Xiuze Zhou, Yuanguo Lin et al. · Jimei University · Wenzhou-Kean University +3 more

Surveys data-centric security risks in LLMs — data poisoning, prompt injection, PII leakage — and reviews defenses across the model lifecycle

Data Poisoning Attack Prompt Injection Training Data Poisoning Sensitive Information Disclosure nlp

PDF

defense arXiv Aug 4, 2025 · Aug 2025

Knowing When Not to Answer: Lightweight KB-Aligned OOD Detection for Safe RAG

Ilias Triantafyllopoulos, Renyi Qu, Salvatore Giorgi et al. · New York University · Inc. +2 more

PCA-based OOD detection gate blocks adversarial and off-topic queries from reaching RAG-backed LLMs in high-stakes domains

Prompt Injection nlp

PDF Code

defense arXiv Jan 4, 2025 · Jan 2025

AdaMixup: A Dynamic Defense Framework for Membership Inference Attack Mitigation

Ying Chen, Jiajing Chen, Yijie Weng et al. · New York University · University of California +3 more

Defends against membership inference attacks using adaptive mixup training that dynamically adjusts interpolation ratios during training

Membership Inference Attack vision

3 citations PDF

Loading more papers…

Latest papers

Latent Transfer Attack: Adversarial Examples via Generative Latent Spaces

AIForge-Doc: A Benchmark for Detecting AI-Forged Tampering in Financial and Form Documents

Agentic AI as a Cybersecurity Attack Surface: Threats, Exploits, and Defenses in Runtime Supply Chains

Understanding Empirical Unlearning with Combinatorial Interpretability

Data Reconstruction: Identifiability and Optimization with Sample Splitting

Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems

Do Images Speak Louder than Words? Investigating the Effect of Textual Misinformation in VLMs

Multi-Turn Jailbreaking of Aligned LLMs via Lexical Anchor Tree Search

DiMEx: Breaking the Cold Start Barrier in Data-Free Model Extraction via Latent Diffusion Priors

PatchBlock: A Lightweight Defense Against Adversarial Patches for Embedded EdgeAI Devices

Beyond Pixels: Semantic-aware Typographic Attack for Geo-Privacy Protection

SimKey: A Semantically Aware Key Module for Watermarking Language Models

Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection

Early Approaches to Adversarial Fine-Tuning for Prompt Injection Defense: A 2022 Study of GPT-3 and Contemporary Models

SENTRA: Selected-Next-Token Transformer for LLM Text Detection

Ransomware 3.0: Self-Composing and LLM-Orchestrated

TriQDef: Disrupting Semantic and Gradient Alignment to Prevent Adversarial Patch Transferability in Quantized Neural Networks

A Survey on Data Security in Large Language Models

Knowing When Not to Answer: Lightweight KB-Aligned OOD Detection for Safe RAG

AdaMixup: A Dynamic Defense Framework for Membership Inference Attack Mitigation

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue