ML Security Papers

Latest papers

7 papers

attack · arXiv · Mar 19, 2026

The Truncation Blind Spot: How Decoding Strategies Systematically Exclude Human-Like Token Choices

Esteban Garces Arias, Nurzhan Sapargali, Christian Heumann et al. · Ludwig Maximilian University · Munich Center for Machine Learning

Shows that likelihood-based decoding leaves detectable signatures in AI-generated text, enabling high detection rates from token-predictability features

Output Integrity Attack · nlp

defense · arXiv · Oct 13, 2025

Don't Walk the Line: Boundary Guidance for Filtered Generation

Sarah Ball, Andreas Haupt · Ludwig-Maximilians-Universität München · Munich Center for Machine Learning +1 more

RL fine-tuning steers LLM outputs away from safety classifier margins to reduce jailbreak bypass and over-refusal simultaneously

Prompt Injection · nlp

1 citation

attack · arXiv · Oct 13, 2025

Bag of Tricks for Subverting Reasoning-based Safety Guardrails

Shuo Chen, Zhen Han, Haokun Chen et al. · LMU Munich · Siemens +5 more

Jailbreaks reasoning-based LLM safety guardrails via template tricks and white-box optimization, exceeding 90% attack success rate

Input Manipulation Attack · Prompt Injection · nlp

1 citation

attack · arXiv · Oct 13, 2025

Deep Research Brings Deeper Harm

Shuo Chen, Zonggen Li, Zhen Han et al. · LMU Munich · Siemens +6 more

Proposes two jailbreak attacks on LLM research agents, plan injection and intent hijack, that bypass alignment to produce dangerous biosecurity reports

Prompt Injection · Excessive Agency · nlp

benchmark · EMNLP · Oct 10, 2025

CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs

Nafiseh Nikeghbal, Amir Hossein Kargaran, Jana Diesner · Technical University of Munich · LMU Munich +1 more

Adversarial constructed-conversation attack exposes hidden societal biases in 11 LLMs by injecting fabricated biased turns into chat history

Prompt Injection · nlp

attack · arXiv · Sep 20, 2025

Can an Individual Manipulate the Collective Decisions of Multi-Agents?

Fengyuan Liu, Rui Zhao, Shuo Chen et al. · Tencent · University of Oxford +3 more

Attacks multi-agent LLM systems using optimized adversarial suffixes, misleading collective decisions with access to only one agent

Input Manipulation Attack · Prompt Injection · nlp

defense · arXiv · Sep 11, 2025

Steering MoE LLMs via Expert (De)Activation

Mohsen Fayyaz, Ali Modarressi, Hanieh Deilamsalehy et al. · University of California · Adobe Research +2 more

Manipulates MoE expert routing at inference time to steer LLM safety, reducing safety by 100% when combined with jailbreaks

Prompt Injection · nlp
