ML Security Papers

Latest papers

3 papers

defense arXiv Jan 29, 2026

Making Models Unmergeable via Scaling-Sensitive Loss Landscape

Minwoo Jang, Hoyoung Kim, Jabin Koo et al. · POSTECH

Defends fine-tuned model weights against unauthorized merging by encoding scaling-sensitive loss landscapes that degrade merged mixtures

Model Theft nlp vision

defense arXiv Jan 7, 2026

Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models

San Kim, Gary Geunbae Lee · POSTECH

Defends instruction-tuned LLMs against backdoor attacks by merging attacker and defensive triggers, then breaking the combined representation via weight recovery

Model Poisoning nlp

defense arXiv Sep 26, 2025

AI Kill Switch for malicious web-based LLM agent

Sechan Lee, Sangdon Park · Sungkyunkwan University · POSTECH

Stops malicious LLM web agents by injecting invisible defensive prompts into a website's DOM to trigger the agents' built-in safety mechanisms

Prompt Injection Excessive Agency nlp
