Latest papers

7 papers
tool arXiv Feb 25, 2026

Adversarial Hubness Detector: Detecting Hubness Poisoning in Retrieval-Augmented Generation Systems

Idan Habler, Vineeth Sai Narajala, Stav Koren et al. · Cisco · OWASP +1 more

Open-source scanner (hubscan) detecting adversarially crafted hub documents injected into RAG vector databases to hijack LLM context

Data Poisoning Attack Prompt Injection nlp multimodal
PDF Code
benchmark arXiv Feb 25, 2026

Manifold of Failure: Behavioral Attraction Basins in Language Models

Sarthak Munshi, Manish Bhatt, Vineeth Sai Narajala et al. · Amazon · Cisco +2 more

Maps LLM safety failure topology using quality-diversity optimization to reveal behavioral attraction basins across three frontier models

Prompt Injection nlp
PDF Code
attack arXiv Jan 27, 2026

Membership Inference Attacks Against Fine-tuned Diffusion Language Models

Yuetian Chen, Kaiyuan Zhang, Yuntao Du et al. · Purdue University · Cisco

Proposes SAMA, a membership inference attack exploiting mask aggregation to expose privacy vulnerabilities in diffusion language models

Membership Inference Attack nlp
PDF
benchmark arXiv Dec 31, 2025

Large Empirical Case Study: Go-Explore adapted for AI Red Team Testing

Manish Bhatt, Adrian Wood, Idan Habler et al. · OWASP · Amazon +3 more

Adapts Go-Explore to red-team LLM tool-using agents, finding that seed variance (an 8x spread) dominates algorithmic choice in prompt injection discovery

Prompt Injection Excessive Agency nlp
PDF Code
benchmark arXiv Nov 5, 2025

Death by a Thousand Prompts: Open Model Vulnerability Analysis

Amy Chang, Nicholas Conley, Harish Santhanalakshmi Ganesan et al. · Cisco

Benchmarks prompt injection and jailbreak resilience of 8 open-weight LLMs; multi-turn attacks reach up to 92.78% success, 2–10x higher than single-turn

Prompt Injection nlp
PDF
defense SSRN Oct 8, 2025

A2AS: Agentic AI Runtime Security and Self-Defense

Eugene Neelou, Ivan Novikov, Max Moroz et al. · A2AS · OWASP +10 more

Proposes A2AS, a runtime security framework for LLM agents that enforces prompt authentication, behavior boundaries, and in-context defenses

Prompt Injection Excessive Agency nlp
3 citations PDF
defense arXiv Sep 13, 2025

MetaSeal: Defending Against Image Attribution Forgery Through Content-Dependent Cryptographic Watermarks

Tong Zhou, Ruyi Ding, Gaowen Liu et al. · Northeastern University · Cisco +1 more

Defends image attribution against forgery by binding cryptographic signatures to image content, replacing detector-based verification

Output Integrity Attack vision generative
PDF Code