Latest papers

13 papers
benchmark arXiv Mar 11, 2026 · 26d ago

Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover

Indranil Halder, Annesya Banerjee, Cengiz Pehlevan · Harvard University · Massachusetts Institute of Technology

Derives polynomial-to-exponential scaling law for jailbreak success under adversarial prompt injection using spin-glass theory

Prompt Injection nlp
PDF Code
benchmark arXiv Feb 23, 2026 · 6w ago

Agents of Chaos

Natalie Shapira, Chris Wendler, Avery Yen et al. · Northeastern University · Independent Researcher +11 more

Red-teams live autonomous LLM agents over two weeks, documenting 11 case studies of dangerous failures including system takeover, DoS, and sensitive data disclosure

Excessive Agency Prompt Injection Insecure Plugin Design nlp
3 citations PDF
attack arXiv Feb 18, 2026 · 6w ago

Narrow fine-tuning erodes safety alignment in vision-language agents

Idhant Gulati, Shivam Raval · University of California · Harvard University

LoRA fine-tuning VLMs on narrow harmful datasets causes emergent safety misalignment that generalizes across modalities, with multimodal evaluation revealing 70% misalignment at rank 128

Transfer Learning Attack Prompt Injection multimodal vision nlp
PDF
defense arXiv Feb 6, 2026 · 8w ago

Incentive-Aware AI Safety via Strategic Resource Allocation: A Stackelberg Security Games Perspective

Cheol Woo Kim, Davin Choo, Tzeh Yuan Neoh et al. · Harvard University

Proposes Stackelberg Security Games as a unifying framework for strategic AI oversight against data poisoning, evaluation manipulation, and deployment attacks

Data Poisoning Attack Model Skewing Training Data Poisoning nlp reinforcement-learning
PDF
defense arXiv Feb 6, 2026 · 8w ago

ArcMark: Multi-bit LLM Watermark via Optimal Transport

Atefeh Gilani, Carol Xuan Long, Sajani Vithana et al. · Arizona State University · Harvard University

Derives information-theoretic capacity of multi-bit LLM watermarking and proposes ArcMark, a capacity-achieving distortion-free scheme via optimal transport

Output Integrity Attack nlp
PDF
attack arXiv Feb 6, 2026 · 8w ago

TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking

Sung-Hoon Yoon, Ruizhi Qian, Minda Zhao et al. · Harvard University · Daegu Gyeongbuk Institute of Science and Technology +1 more

RL-based black-box jailbreak framework that reweights historical vulnerability signals to attack LLMs more efficiently

Prompt Injection nlp
PDF
survey arXiv Jan 14, 2026 · 11w ago

The Promptware Kill Chain: How Prompt Injections Gradually Evolved Into a Multistep Malware Delivery Mechanism

Oleg Brodt, Elad Feldman, Bruce Schneier et al. · Ben-Gurion University of the Negev · Tel Aviv University +2 more

Surveys 36 LLM attack incidents and proposes a seven-stage promptware kill chain mapping prompt injection to multi-step malware delivery

Prompt Injection Excessive Agency nlp
PDF
benchmark arXiv Jan 6, 2026 · Jan 2026

Topology-Independent Robustness of the Weighted Mean under Label Poisoning Attacks in Heterogeneous Decentralized Learning

Jie Peng, Weiyu Li, Stefan Vlaski et al. · Sun Yat-Sen University · Harvard University +1 more

Theoretically proves the weighted mean aggregator can outperform robust aggregators under label poisoning in decentralized learning, exposing their topology-dependent weaknesses

Data Poisoning Attack federated-learning
PDF
attack arXiv Nov 19, 2025 · Nov 2025

When Harmless Words Harm: A New Threat to LLM Safety via Conceptual Triggers

Zhaoxin Zhang, Borui Chen, Yiming Hu et al. · City University of Macau · University of Vienna +3 more

Novel LLM jailbreak using conceptual morphology triggers to shift ideological orientation in outputs without triggering safety filters

Prompt Injection nlp
PDF
defense arXiv Nov 4, 2025 · Nov 2025

Verifying LLM Inference to Detect Model Weight Exfiltration

Roy Rinberg, Adam Karvonen, Alexander Hoover et al. · Harvard University · ML Alignment & Theory Scholars (MATS) +2 more

Defends against LLM weight theft via steganographic output channels by verifying inference non-determinism, achieving >200x adversary slowdown

Model Theft nlp
2 citations PDF
attack arXiv Oct 20, 2025 · Oct 2025

Agentic Reinforcement Learning for Search is Unsafe

Yushi Yang, Shreyansh Padarha, Andrew Lee et al. · University of Oxford · Harvard University

Discovers two simple prompt-level attacks that bypass safety in RL-trained LLM search agents by triggering search before refusal tokens

Prompt Injection Excessive Agency nlp reinforcement-learning
1 citation PDF
defense arXiv Sep 29, 2025 · Sep 2025

Incentive-Aligned Multi-Source LLM Summaries

Yanchen Jiang, Zhe Feng, Aranyak Mehta · Harvard University · Google Research

Defends LLM summarization pipelines against indirect prompt injection by scoring sources via peer prediction before synthesis

Prompt Injection nlp
PDF
benchmark arXiv Sep 16, 2025 · Sep 2025

Towards mitigating information leakage when evaluating safety monitors

Gerard Boxo, Aman Neelappa, Shivam Raval · Independent · Harvard University

Benchmarks LLM safety monitors (linear probes), revealing 10–40% AUROC inflation from textual leakage artifacts rather than genuine internal signals

Prompt Injection nlp
PDF