Latest papers

16 papers
attack arXiv Feb 19, 2026 · 6w ago

TFL: Targeted Bit-Flip Attack on Large Language Model

Jingkai Guo, Chaitali Chakrabarti, Deliang Fan · Arizona State University

Exploits DRAM bit-flip vulnerabilities to inject targeted backdoor-like behavior into LLMs with fewer than 50 bit flips

Model Poisoning nlp
PDF
defense arXiv Feb 6, 2026 · 8w ago

ArcMark: Multi-bit LLM Watermark via Optimal Transport

Atefeh Gilani, Carol Xuan Long, Sajani Vithana et al. · Arizona State University · Harvard University

Derives information-theoretic capacity of multi-bit LLM watermarking and proposes ArcMark, a capacity-achieving distortion-free scheme via optimal transport

Output Integrity Attack nlp
PDF
attack arXiv Jan 30, 2026 · 9w ago

"Someone Hid It": Query-Agnostic Black-Box Attacks on LLM-Based Retrieval

Jiate Li, Defu Cao, Li Li et al. · University of Southern California · Adobe Research +1 more

Black-box query-agnostic adversarial token injection attack manipulates document rankings in RAG and LLM-based retrieval systems using surrogate LLMs

Input Manipulation Attack Prompt Injection nlp
1 citation PDF
attack arXiv Jan 19, 2026 · 11w ago

ChartAttack: Testing the Vulnerability of LLMs to Malicious Prompting in Chart Generation

Jesus-German Ortiz-Barajas, Jonathan Tonglet, Vivek Gupta et al. · INSAIT · Sofia University +3 more

Jailbreaks MLLMs via adversarial prompting to auto-generate misleading charts, reducing human and MLLM QA accuracy by ~20 points

Prompt Injection multimodal vision nlp
PDF Code
attack arXiv Jan 18, 2026 · 11w ago

Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers

Yixuan Du, Chenxiao Yu, Haoyan Xu et al. · Georgetown University · University of Southern California +2 more

Jointly optimizes adversarial image perturbations and gradient-based text suffixes to manipulate VLM-based product search rankings

Input Manipulation Attack Prompt Injection vision nlp multimodal
PDF Code
defense arXiv Jan 18, 2026 · 11w ago

DoPE: Decoy Oriented Perturbation Encapsulation: Human-Readable, AI-Hostile Documents for Academic Integrity

Ashish Raj Shekhar, Shiven Agarwal, Priyanuj Bordoloi et al. · Arizona State University

Embeds invisible semantic decoys in exam PDFs/HTML to exploit MLLM render-parse gaps, achieving 96.3% prevention and 91.4% detection of AI-assisted cheating

Output Integrity Attack Prompt Injection nlp multimodal
PDF
tool arXiv Jan 16, 2026 · 11w ago

Integrity Shield: A System for Ethical AI Use & Authorship Transparency in Assessments

Ashish Raj Shekhar, Shiven Agarwal, Priyanuj Bordoloi et al. · Arizona State University

Document-layer PDF watermarking blocks MLLMs from solving exams and detects AI authorship via recoverable item-level signatures

Output Integrity Attack Input Manipulation Attack nlp multimodal
PDF
benchmark arXiv Dec 4, 2025 · Dec 2025

Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs

Jinbo Liu, Defu Cao, Yifei Wei et al. · University of Southern California · Florida State University +1 more

Benchmarks PII leakage in multi-agent LLM systems across six topologies, showing dense connectivity and proximity amplify adversarial memory extraction

Sensitive Information Disclosure nlp
1 citation 1 influential PDF
defense arXiv Nov 2, 2025 · Nov 2025

EraseFlow: Learning Concept Erasure Policies via GFlowNet-Driven Alignment

Abhiram Kusumba, Maitreya Patel, Kyle Min et al. · Capital One · Arizona State University +2 more

GFlowNet-based concept erasure for diffusion models, robust to adversarial bypass attacks, without requiring crafted reward models

Output Integrity Attack Input Manipulation Attack vision generative
1 citation 1 influential PDF
defense arXiv Oct 19, 2025 · Oct 2025

Leave It to the Experts: Detecting Knowledge Distillation via MoE Expert Signatures

Pingzhi Li, Morris Yu-Chao Huang, Zhen Tan et al. · UNC-Chapel Hill · Arizona State University +4 more

Detects LLM knowledge distillation (model theft) by fingerprinting MoE expert routing patterns in both white-box and black-box settings

Model Theft nlp
PDF Code
benchmark arXiv Oct 8, 2025 · Oct 2025

Code Agent can be an End-to-end System Hacker: Benchmarking Real-world Threats of Computer-use Agent

Weidi Luo, Qiming Zhang, Tianyu Lu et al. · University of Georgia · University of Wisconsin–Madison +6 more

Benchmarks LLM-powered agents' ability to execute end-to-end enterprise intrusions aligned with MITRE ATT&CK TTPs

Excessive Agency Prompt Injection nlp multimodal
4 citations PDF Code
attack arXiv Oct 8, 2025 · Oct 2025

Are LLMs Reliable Rankers? Rank Manipulation via Two-Stage Token Optimization

Tiancheng Xing, Jerry Li, Yixuan Du et al. · National University of Singapore · University of Southern California +2 more

Gradient-optimized adversarial text attack manipulates LLM rerankers to promote target documents while appearing natural

Input Manipulation Attack Prompt Injection nlp
3 citations 1 influential PDF Code
attack arXiv Sep 26, 2025 · Sep 2025

SBFA: Single Sneaky Bit Flip Attack to Break Large Language Models

Jingkai Guo, Chaitali Chakrabarti, Deliang Fan · Arizona State University

Single bit-flip attack collapses LLM accuracy to random-guess levels by exploiting gradient-guided weight sensitivity across BF16 and INT8 formats

Model Poisoning nlp
4 citations 2 influential PDF
defense arXiv Aug 25, 2025 · Aug 2025

ISACL: Internal State Analyzer for Copyrighted Training Data Leakage

Guangwei Zhang, Qisheng Su, Jiateng Liu et al. · City University of Hong Kong · Microsoft +4 more

Proactive LLM defense inspects internal states pre-generation to intercept copyrighted training data before disclosure

Model Inversion Attack Sensitive Information Disclosure nlp
PDF Code
defense arXiv Jan 12, 2025 · Jan 2025

Modeling Neural Networks with Privacy Using Neural Stochastic Differential Equations

Sanghyun Hong, Fan Wu, Anthony Gruber et al. · Oregon State University · Arizona State University +1 more

Proposes neural stochastic differential equations as a differentially private architecture that resists membership inference with better utility than DP-SGD

Membership Inference Attack vision
PDF
attack arXiv Jan 1, 2025 · Jan 2025

Dynamics of Adversarial Attacks on Large Language Model-Based Search Engines

Xiyang Hu · Arizona State University

Game-theoretic analysis of ranking-manipulation attack dynamics in LLM-based RAG search engines, framed as a Prisoner's Dilemma

Prompt Injection nlp
3 citations PDF