Latest papers

10 papers
defense arXiv Mar 11, 2026

Backdoor Directions in Vision Transformers

Sengim Karayalcin, Marina Krček, Pin-Yu Chen et al. · Leiden University · Radboud University +2 more

Identifies causal 'trigger directions' in ViT activations to analyze, remove, and detect backdoors via weight-space interventions

Model Poisoning · vision
PDF
attack arXiv Mar 10, 2026

Removing the Trigger, Not the Backdoor: Alternative Triggers and Latent Backdoors

Gorka Abad, Ermes Franch, Stefanos Koffas et al. · University of Bergen · Delft University of Technology +2 more

Proves backdoor-trained models stay exploitable via alternative triggers even after defenses neutralize the original training trigger

Model Poisoning · vision
PDF
attack arXiv Mar 3, 2026

Kraken: Higher-order EM Side-Channel Attacks on DNNs in Near and Far Field

Peter Horvath, Ilia Shumailov, Lukasz Chmielewski et al. · Radboud University · AI Security Company +2 more

Steals DNN and LLM weights from GPU Tensor Cores using electromagnetic side-channel attacks from up to 100 cm away

Model Theft · vision · nlp
PDF
attack arXiv Feb 9, 2026

Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing

Jona te Lintelo, Lichao Wu, Stjepan Picek · Radboud University · Technical University of Darmstadt +1 more

Jailbreaks MoE LLMs by silencing safety-critical experts at inference time, boosting attack success from 7.3% to 70.4%

Prompt Injection · nlp
PDF
attack arXiv Feb 2, 2026

David vs. Goliath: Verifiable Agent-to-Agent Jailbreaking via Reinforcement Learning

Samuel Nellessen, Tal Kachman · Radboud University

RL-trained adversarial agent autonomously discovers jailbreaks that manipulate LLM operators into unauthorized tool execution

Prompt Injection · Excessive Agency · nlp
PDF
survey arXiv Jan 23, 2026

Emerging Threats and Countermeasures in Neuromorphic Systems: A Survey

Pablo Sorrentino, Stjepan Picek, Ihsen Alouani et al. · University of Groningen · University of Zagreb +5 more

Surveys attack methodologies, hardware trojans, side-channel vulnerabilities, and countermeasures across spiking neural networks and neuromorphic hardware

Input Manipulation Attack · Model Poisoning
PDF
attack arXiv Dec 24, 2025

GateBreaker: Gate-Guided Attacks on Mixture-of-Expert LLMs

Lichao Wu, Sasha Behrouzi, Mohamadreza Rostami et al. · Technical University of Darmstadt · University of Zagreb +1 more

White-box attack disables ~3% of MoE safety neurons to raise LLM jailbreak success from 7% to 65% across eight aligned models

Prompt Injection · nlp · multimodal
2 citations PDF
survey arXiv Nov 17, 2025

SoK: The Last Line of Defense: On Backdoor Defense Evaluation

Gorka Abad, Marina Krček, Stefanos Koffas et al. · University of Bergen · Radboud University +3 more

Surveys 183 backdoor-defense papers, revealing critical evaluation inconsistencies and proposing standardized assessment recommendations

Model Poisoning · vision
1 citation PDF
attack arXiv Nov 8, 2025

CatBack: Universal Backdoor Attacks on Tabular Data via Categorical Encoding

Behrad Tajalli, Stefanos Koffas, Stjepan Picek · Radboud University · Delft University of Technology +1 more

Backdoor attack on tabular ML models via categorical-to-float encoding, enabling gradient-based universal triggers with 100% ASR

Model Poisoning · tabular
PDF
attack arXiv Sep 15, 2025

NeuroStrike: Neuron-Level Attacks on Aligned LLMs

Lichao Wu, Sasha Behrouzi, Mohamadreza Rostami et al. · Technical University of Darmstadt · University of Zagreb +1 more

Bypasses LLM safety alignment by pruning <0.6% of sparse safety neurons, achieving 76.9% ASR across 20+ aligned LLMs

Input Manipulation Attack · Prompt Injection · nlp · multimodal
PDF