Michael Backes

h-index: 17 · 996 citations · 71 papers (total)

Papers in Database (5)

defense · ICLR · Jan 3, 2025

SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation

Mingjie Li, Wai Man Si, Michael Backes et al. · CISPA Helmholtz Center for Information Security · Peking University

Defends LLM safety alignment against degradation from LoRA fine-tuning via a fixed safety module and task-specific adapter initialization

Transfer Learning Attack · Prompt Injection · nlp
39 citations · 8 influential · PDF
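SaLoRA builds on the standard LoRA mechanism, in which a frozen pretrained weight W is adapted by a trainable low-rank product B @ A. A minimal numeric sketch of that base mechanism follows; the dimensions, rank, and scaling are illustrative assumptions, and the paper's fixed safety module is not reproduced here.

```python
import numpy as np

# Minimal LoRA sketch: the adapted weight is W + (alpha/r) * B @ A,
# where A and B are low-rank (rank r) trainable matrices.
# All shapes and hyperparameters below are illustrative, not from the paper.
d_out, d_in, r, alpha = 8, 8, 2, 4

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((r, d_in))       # trainable down-projection
B = np.zeros((d_out, r))                 # trainable up-projection (zero-init)

def lora_forward(x):
    # With B initialized to zero, the adapter starts as an exact no-op.
    return (W + (alpha / r) * B @ A) @ x

x = rng.standard_normal(d_in)
assert np.allclose(lora_forward(x), W @ x)  # zero-init adapter changes nothing
```

The zero initialization of B is the standard LoRA convention that makes fine-tuning start from the pretrained behavior; SaLoRA's contribution concerns how such adapters are constrained and initialized so safety alignment survives.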
attack · arXiv · Oct 24, 2025

Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency

Yukun Jiang, Mingjie Li, Michael Backes et al. · CISPA Helmholtz Center for Information Security

Jailbreaks LLMs by interleaving harmful and benign task words, hiding malicious intent from safety guardrails and achieving a 95% attack success rate

Prompt Injection · nlp
9 citations · 1 influential · PDF · Code
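The core idea of task concurrency, interleaving the words of two tasks so neither intent appears contiguously, can be illustrated with a toy word-level merge. This is purely a sketch of the interleaving notion; the paper's actual prompt construction and the tasks it pairs are not reproduced here.

```python
from itertools import zip_longest

def interleave_tasks(task_a, task_b):
    # Toy illustration of "task concurrency": alternate words from two
    # task descriptions into a single prompt, so adjacent words belong
    # to divergent intents. The example tasks below are benign placeholders.
    merged = [w for pair in zip_longest(task_a, task_b) for w in pair if w is not None]
    return " ".join(merged)

prompt = interleave_tasks(
    ["summarize", "this", "article"],
    ["translate", "to", "French", "please"],
)
# prompt == "summarize translate this to article French please"
```

A guardrail scanning for contiguous harmful phrases sees no single coherent intent, which is the intuition the paper's attack exploits.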
tool · arXiv · Oct 31, 2025

From Evidence to Verdict: An Agent-Based Forensic Framework for AI-Generated Image Detection

Mengfei Liang, Yiting Qu, Yukun Jiang et al. · CISPA Helmholtz Center for Information Security

Multi-agent forensic framework with LLM debate and memory module achieves 97% accuracy on AI-generated image detection

Output Integrity Attack · vision · nlp
1 citation · PDF
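The final step of an agent-debate pipeline, reducing several per-agent verdicts to one decision, can be sketched as a simple aggregation. This is only an assumed, minimal stand-in: the paper's debate protocol and memory module are far richer than a majority vote.

```python
from collections import Counter

def aggregate_verdicts(verdicts):
    # Toy aggregation over per-agent verdicts ("real" vs "ai-generated"):
    # return the majority label. Verdict strings are illustrative.
    return Counter(verdicts).most_common(1)[0][0]

verdicts = ["ai-generated", "real", "ai-generated"]
final = aggregate_verdicts(verdicts)
# final == "ai-generated"
```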
tool · arXiv · Nov 24, 2025

AttackPilot: Autonomous Inference Attacks Against ML Services With LLM-Based Agents

Yixin Wu, Rui Wen, Chi Cui et al. · CISPA Helmholtz Center for Information Security · Institute of Science Tokyo

Autonomous LLM agent automates membership inference, model stealing, and data reconstruction attacks on ML services with near-expert accuracy at $0.627/run.

Membership Inference Attack · Model Theft · Model Inversion Attack · nlp
PDF
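One of the attack classes AttackPilot automates, membership inference, has a classic confidence-threshold baseline: samples on which the target model is unusually confident are guessed to be training members. A minimal sketch of that baseline follows; the threshold and confidence values are illustrative assumptions, and the paper's agent-driven pipeline is not shown.

```python
def confidence_mia(confidences, threshold=0.9):
    # Baseline membership inference: predict "member" when the target
    # model's top-class confidence on a sample meets the threshold.
    # The 0.9 threshold is an illustrative assumption.
    return [c >= threshold for c in confidences]

# Illustrative confidences: one likely member, one likely non-member.
guesses = confidence_mia([0.97, 0.41])
# guesses == [True, False]
```

In practice the threshold is calibrated on shadow data; an autonomous agent like the one described would orchestrate queries, calibration, and evaluation end to end.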
attack · arXiv · Feb 9, 2026

Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

Yukun Jiang, Hai Huang, Mingjie Li et al. · CISPA Helmholtz Center for Information Security

Discovers unsafe routing configurations in MoE LLMs that bypass safety alignment, achieving a 0.98 attack success rate (ASR) on AdvBench via router optimization

Prompt Injection · nlp
PDF · Code
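The routing mechanism this paper attacks is, in its standard form, a per-token top-k gate over expert scores. A toy version of that mechanism is sketched below to show what "routing configuration" means; the logits and k are illustrative, and the paper's router optimization is not reproduced.

```python
import numpy as np

def top_k_route(logits, k=2):
    # Toy MoE router: select the k highest-scoring experts for a token
    # and softmax-renormalize their gate weights. Illustrative only.
    idx = np.argsort(logits)[::-1][:k]
    gates = np.exp(logits[idx] - logits[idx].max())
    return idx, gates / gates.sum()

# Illustrative per-expert logits for one token.
experts, weights = top_k_route(np.array([0.1, 2.0, -1.0, 0.5]), k=2)
# experts selects experts 1 and 3; weights sum to 1
```

Because different expert subsets can encode different behaviors, steering which experts fire (the "route") is the attack surface the paper explores.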