ML Security Papers

Latest papers

10 papers

benchmark arXiv Mar 9, 2026 · 28d ago

The Conundrum of Trustworthy Research on Attacking Personally Identifiable Information Removal Techniques

Sebastian Ochs, Ivan Habernal · Trustworthy Human Language Technologies · Technical University of Darmstadt +2 more

Critiques PII reconstruction attack evaluations, showing data leakage and LLM memorization inflate reported attack success rates

Model Inversion Attack Sensitive Information Disclosure nlp

PDF

defense arXiv Feb 18, 2026 · 6w ago

NeST: Neuron Selective Tuning for LLM Safety

Sasha Behrouzi, Lichao Wu, Mohamadreza Rostami et al. · Technical University of Darmstadt

Neuron-selective LLM safety alignment reduces jailbreak success rate by 90% using 17,310x fewer parameters than full fine-tuning

Prompt Injection nlp

PDF

attack arXiv Feb 9, 2026 · 8w ago

Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing

Jona te Lintelo, Lichao Wu, Stjepan Picek · Radboud University · Technical University of Darmstadt +1 more

Jailbreaks MoE LLMs by silencing safety-critical experts at inference time, boosting attack success from 7.3% to 70.4%

Prompt Injection nlp

PDF

benchmark arXiv Jan 21, 2026 · 10w ago

Auditing Language Model Unlearning via Information Decomposition

Anmol Goel, Alan Ritter, Iryna Gurevych · Technical University of Darmstadt · National Research Center for Applied Cybersecurity ATHENE +1 more

Audits LLM unlearning via Partial Information Decomposition, revealing residual training data remains vulnerable to adversarial reconstruction attacks

Model Inversion Attack Sensitive Information Disclosure nlp

PDF

attack arXiv Dec 24, 2025 · Dec 2025

GateBreaker: Gate-Guided Attacks on Mixture-of-Expert LLMs

Lichao Wu, Sasha Behrouzi, Mohamadreza Rostami et al. · Technical University of Darmstadt · University of Zagreb +1 more

White-box attack disables ~3% of MoE safety neurons to raise LLM jailbreak success from 7% to 65% across eight aligned models

Prompt Injection nlp multimodal

2 citations PDF

defense arXiv Dec 1, 2025 · Dec 2025

No Trust Issues Here: A Technical Report on the Winning Solutions for the Rayan AI Contest

Ali Nafisi, Sina Asghari, Mohammad Saeed Arvenaghi et al. · Bu-Ali Sina University · Iran University of Science and Technology +1 more

Detects hidden backdoor triggers in neural networks at 78% accuracy as part of a trustworthy AI competition

Model Poisoning vision

PDF Code

attack arXiv Sep 15, 2025 · Sep 2025

NeuroStrike: Neuron-Level Attacks on Aligned LLMs

Lichao Wu, Sasha Behrouzi, Mohamadreza Rostami et al. · Technical University of Darmstadt · University of Zagreb +1 more

Bypasses LLM safety alignment by pruning <0.6% of sparse safety neurons, achieving 76.9% ASR across 20+ aligned LLMs

Input Manipulation Attack Prompt Injection nlp multimodal

PDF

defense arXiv Sep 11, 2025 · Sep 2025

ZORRO: Zero-Knowledge Robustness and Privacy for Split Learning (Full Version)

Nojan Sheybani, Alessandro Pegoraro, Jonathan Knauer et al. · University of California San Diego · Technical University of Darmstadt

Defends Split Learning against backdoor injection using zero-knowledge proofs to verify client-side DCT-based defense execution

Model Poisoning federated-learning vision

PDF

defense arXiv Jan 11, 2025 · Jan 2025

SafeSplit: A Novel Defense Against Client-Side Backdoor Attacks in Split Learning (Full Version)

Phillip Rieger, Alessandro Pegoraro, Kavita Kumari et al. · Technical University of Darmstadt

First backdoor defense for Split Learning using frequency-domain and rotational-distance analysis to detect malicious clients

Model Poisoning federated-learning vision

PDF

attack arXiv Jan 3, 2025 · Jan 2025

Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions

Rachneet Sachdeva, Rima Hazra, Iryna Gurevych · Technical University of Darmstadt

Proposes POATE jailbreak using polar-opposite contrastive queries to bypass LLM safety, achieving 44% higher attack success than prior methods

Prompt Injection nlp

PDF
