Latest papers

3 papers
defense · arXiv · Jan 29, 2026

Making Models Unmergeable via Scaling-Sensitive Loss Landscape

Minwoo Jang, Hoyoung Kim, Jabin Koo et al. · POSTECH

Defends fine-tuned model weights against unauthorized merging by encoding scaling-sensitive loss landscapes that degrade merged mixtures

Model Theft · nlp · vision
defense · arXiv · Jan 7, 2026

Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models

San Kim, Gary Geunbae Lee · POSTECH

Defends instruction-tuned LLMs against backdoor attacks by merging attacker and defensive triggers, then breaking the combined representation via weight recovery

Model Poisoning · nlp
defense · arXiv · Sep 26, 2025

AI Kill Switch for malicious web-based LLM agent

Sechan Lee, Sangdon Park · Sungkyunkwan University · POSTECH

Stops malicious LLM web agents by injecting invisible defensive prompts into a website's DOM to trigger built-in safety mechanisms

Prompt Injection · Excessive Agency · nlp