Latest papers

3 papers
defense · arXiv · Apr 30, 2026

MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

Jona te Lintelo, Lichao Wu, Marina Krček et al. · Radboud University · University of Bristol +2 more

Reconfigures MoE LLM safety behavior by steering expert routing at inference time without retraining, defending against jailbreaks

Prompt Injection · nlp
defense · arXiv · Mar 11, 2026

Backdoor Directions in Vision Transformers

Sengim Karayalcin, Marina Krček, Pin-Yu Chen et al. · Leiden University · Radboud University +2 more

Identifies causal 'trigger directions' in ViT activations to analyze, remove, and detect backdoors via weight-space interventions

Model Poisoning · vision
benchmark · arXiv · Oct 31, 2025

EL-MIA: Quantifying Membership Inference Risks of Sensitive Entities in LLMs

Ali Satvaty, Suzan Verberne, Fatih Turkmen · University of Groningen · Leiden University

Benchmarks entity-level membership inference of PII and sensitive data in LLMs, revealing limits of existing MIA methods

Membership Inference Attack · nlp
1 citation