Latest papers

6 papers
defense · arXiv · Dec 24, 2025

Robustness Certificates for Neural Networks against Adversarial Attacks

Sara Taheri, Mahalakshmi Sabanayagam, Debarghya Ghoshdastidar et al. · LMU Munich · Technical University of Munich +1 more

Certifies neural network robustness against data poisoning and adversarial attacks using control-theoretic barrier certificates with PAC guarantees

Data Poisoning Attack · Input Manipulation Attack · vision

PDF
attack · arXiv · Oct 13, 2025

Bag of Tricks for Subverting Reasoning-based Safety Guardrails

Shuo Chen, Zhen Han, Haokun Chen et al. · LMU Munich · Siemens +5 more

Jailbreaks reasoning-based LLM safety guardrails via template tricks and white-box optimization, exceeding a 90% attack success rate

Input Manipulation Attack · Prompt Injection · nlp

1 citation · PDF · Code
attack · arXiv · Oct 13, 2025

Deep Research Brings Deeper Harm

Shuo Chen, Zonggen Li, Zhen Han et al. · LMU Munich · Siemens +6 more

Proposes two jailbreak attacks on LLM research agents — plan injection and intent hijack — that bypass alignment to produce dangerous biosecurity reports

Prompt Injection · Excessive Agency · nlp

PDF · Code
benchmark · EMNLP · Oct 10, 2025

CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs

Nafiseh Nikeghbal, Amir Hossein Kargaran, Jana Diesner · Technical University of Munich · LMU Munich +1 more

An adversarial constructed-conversation attack exposes hidden societal biases in 11 LLMs by injecting fabricated biased turns into the chat history

Prompt Injection · nlp

PDF · Code
attack · arXiv · Sep 20, 2025

Can an Individual Manipulate the Collective Decisions of Multi-Agents?

Fengyuan Liu, Rui Zhao, Shuo Chen et al. · Tencent · University of Oxford +3 more

Attacks multi-agent LLM systems using optimized adversarial suffixes, misleading collective decisions despite having access to only a single agent

Input Manipulation Attack · Prompt Injection · nlp

PDF · Code
defense · arXiv · Sep 11, 2025

Steering MoE LLMs via Expert (De)Activation

Mohsen Fayyaz, Ali Modarressi, Hanieh Deilamsalehy et al. · University of California · Adobe Research +2 more

Manipulates MoE expert routing at inference time to steer LLM safety, reducing safety by up to 100% when combined with jailbreaks

Prompt Injection · nlp

PDF · Code