Latest papers

86 papers
defense arXiv Mar 10, 2026 · 29d ago

GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning

Zhouxiang Fang, Jiawei Zhou, Hanjie Chen · Rice University · Stony Brook University

Defends LLM safety alignment against fine-tuning-induced degradation using generative replay of synthesized safety data

Transfer Learning Attack Prompt Injection nlp
PDF Code
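A minimal, hypothetical sketch of the generative-replay idea summarized above, not the paper's GR-SAP implementation: sample refusal-style responses from the still-aligned model and interleave them with the downstream fine-tuning data. Names such as `aligned_model`, `harmful_prompts`, and `replay_ratio` are assumptions for illustration.

```python
# Illustrative sketch only (not the paper's code): generative replay mixes
# model-synthesized safety examples into the downstream fine-tuning set.
import random

def build_replay_set(aligned_model, tokenizer, harmful_prompts, max_new_tokens=128):
    """Sample refusal-style responses from the still-aligned model to form replay data."""
    replay = []
    for prompt in harmful_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(aligned_model.device)
        out = aligned_model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
        # Keep only the generated continuation, not the prompt tokens.
        response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        replay.append({"prompt": prompt, "response": response})
    return replay

def mix_batches(task_data, replay_data, replay_ratio=0.1):
    """Interleave replay examples with the downstream task data at a fixed ratio."""
    n_replay = int(replay_ratio * len(task_data))
    mixed = task_data + random.sample(replay_data, min(n_replay, len(replay_data)))
    random.shuffle(mixed)
    return mixed
```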
attack arXiv Mar 9, 2026 · 4w ago

Invisible Safety Threat: Malicious Finetuning for LLM via Steganography

Guangnian Wan, Xinyin Ma, Gongfan Fang et al. · National University of Singapore

Fine-tunes LLMs via API to covertly embed harmful content in steganographic cover responses, bypassing safety classifiers 100% of the time

Transfer Learning Attack Model Poisoning nlp
PDF Code
defense arXiv Mar 8, 2026 · 4w ago

Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning

Guoli Wang, Haonan Shi, Tu Ouyang et al. · Case Western Reserve University

Preserves LLM safety alignment during fine-tuning by regularizing confidence on a small subset of safety-critical tokens only

Transfer Learning Attack Prompt Injection nlp
PDF
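A hedged sketch of the mechanism the entry above describes (constraining confidence on safety-critical tokens), not the authors' code: a KL penalty toward a frozen reference model, applied only at positions marked by an assumed `safety_mask` identified offline.

```python
# Illustrative sketch: keep the fine-tuned model's next-token distributions close
# to a frozen reference model, but only at safety-critical token positions.
import torch
import torch.nn.functional as F

def safety_token_loss(model, ref_model, input_ids, labels, safety_mask, beta=1.0):
    out = model(input_ids, labels=labels)
    with torch.no_grad():
        ref_logits = ref_model(input_ids).logits
    # Per-position KL between fine-tuned and reference distributions: [batch, seq]
    kl = F.kl_div(
        F.log_softmax(out.logits, dim=-1),
        F.softmax(ref_logits, dim=-1),
        reduction="none",
    ).sum(-1)
    # safety_mask (float, 1 at safety-critical positions) restricts the penalty
    # to the small token subset; beta trades off task loss vs. preservation.
    penalty = (kl * safety_mask).sum() / safety_mask.sum().clamp(min=1)
    return out.loss + beta * penalty
```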
attack arXiv Mar 5, 2026 · 4w ago

Osmosis Distillation: Model Hijacking with the Fewest Samples

Yuchen Shi, Huajie Chen, Heng Xu et al. · City University of Macau · Jinan University +1 more

Poisons distilled synthetic datasets to embed hidden hijacking tasks in models fine-tuned via transfer learning

Data Poisoning Attack Transfer Learning Attack vision
PDF
attack arXiv Mar 4, 2026 · 5w ago

Tuning Just Enough: Lightweight Backdoor Attacks on Multi-Encoder Diffusion Models

Ziyuan Chen, Yujin Jeong, Tobias Braun et al. · TU Darmstadt · Hessian Center for Artificial Intelligence

Proposes MELT, a LoRA-based backdoor attack on Stable Diffusion 3 requiring tuning fewer than 0.2% of encoder parameters

Model Poisoning Transfer Learning Attack vision generative multimodal
PDF
attack arXiv Mar 2, 2026 · 5w ago

Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs

Bhanu Pallakonda, Mikkel Hindsbo, Sina Ehsani et al.

Injects temporal backdoors into tool-using LLM agents via LoRA+GRPO, enabling covert malicious tool calls while appearing benign

Model Poisoning Transfer Learning Attack nlp
PDF
defense arXiv Mar 1, 2026 · 5w ago

Token-level Data Selection for Safe LLM Fine-tuning

Yanping Li, Zhening Liu, Zijian Li et al. · Lingnan University · The Hong Kong University of Science and Technology

Defends LLM safety alignment during fine-tuning by scoring and removing unsafe tokens via loss-difference between safety-degraded and utility-oriented reference models

Transfer Learning Attack Prompt Injection nlp
PDF Code
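An illustrative sketch of the token-level selection summarized above, under assumed details rather than the paper's exact recipe: score each training token by the per-token loss gap between a safety-degraded reference model and a utility-oriented reference model, then mask the highest-scoring (likely unsafe) tokens out of the fine-tuning loss. The threshold `tau` is a made-up knob.

```python
# Illustrative sketch: loss-difference scoring between two reference models.
import torch
import torch.nn.functional as F

def per_token_loss(model, input_ids):
    """Next-token cross-entropy for every position: [batch, seq-1]."""
    logits = model(input_ids).logits[:, :-1]
    targets = input_ids[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    ).view(targets.shape)

@torch.no_grad()
def keep_token_mask(degraded_ref, utility_ref, input_ids, tau=0.5):
    # Tokens the safety-degraded model fits much better than the utility-oriented
    # model get a high score, are treated as unsafe, and are dropped from training.
    score = per_token_loss(utility_ref, input_ids) - per_token_loss(degraded_ref, input_ids)
    return (score < tau).float()   # 1 = keep in the loss, 0 = drop
```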
attack arXiv Mar 1, 2026 · 5w ago

Subliminal Signals in Preference Labels

Isotta Magistrali, Frédéric Berdoz, Sam Dauncey et al. · ETH Zürich

Biased LLM judge covertly encodes behavioral traits into student models via binary RLHF preference labels, bypassing semantic oversight

Transfer Learning Attack Data Poisoning Attack Training Data Poisoning nlp
PDF Code
defense arXiv Feb 27, 2026 · 5w ago

PDF: PUF-based DNN Fingerprinting for Knowledge Distillation Traceability

Ning Lyu, Yuntao Liu, Yonghong Bai et al.

Embeds hardware PUF signatures into knowledge distillation logits to trace stolen/cloned student models back to specific devices

Model Theft Transfer Learning Attack vision
PDF
defense arXiv Feb 21, 2026 · 6w ago

Limits of Convergence-Rate Control for Open-Weight Safety

Domenic Rosati, Xijie Zeng, Hong Huang et al. · Dalhousie University · Vector Institute +1 more

Defends open-weight models against harmful fine-tuning via spectral reparameterization, and proves adaptive adversaries can bypass any such defense at a cost linear in model size

Transfer Learning Attack vision nlp
PDF
defense arXiv Feb 19, 2026 · 6w ago

Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

Jyotin Goel, Souvik Maji, Pratik Mazumder · Indian Institute of Technology Jodhpur

Defends LLMs from harmful fine-tuning attacks via adaptive KL regularization guided by a safety critic or activation-based risk predictor

Transfer Learning Attack Prompt Injection nlp
PDF Code
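A rough, assumption-laden sketch of adaptive KL regularization as summarized above (not the authors' implementation): the fine-tuning loss adds a KL pull-back toward the frozen aligned model, weighted per example by a safety critic's risk score. The `critic` here is a hypothetical component returning risks in [0, 1].

```python
# Illustrative sketch: per-example KL penalty whose strength is set by a safety critic.
import torch
import torch.nn.functional as F

def adaptive_kl_loss(model, ref_model, critic, input_ids, labels):
    out = model(input_ids, labels=labels)
    with torch.no_grad():
        ref_logits = ref_model(input_ids).logits
        risk = critic(input_ids)          # [batch], higher = riskier example (assumed)
    # Mean per-example KL to the aligned reference: [batch]
    kl = F.kl_div(
        F.log_softmax(out.logits, dim=-1),
        F.softmax(ref_logits, dim=-1),
        reduction="none",
    ).sum(-1).mean(-1)
    # Risky examples get a stronger pull back toward the aligned model.
    return out.loss + (risk * kl).mean()
```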
attack arXiv Feb 18, 2026 · 7w ago

Narrow fine-tuning erodes safety alignment in vision-language agents

Idhant Gulati, Shivam Raval · University of California · Harvard University

LoRA fine-tuning of VLMs on narrow harmful datasets causes emergent safety misalignment that generalizes across modalities, with multimodal evaluation revealing 70% misalignment at rank 128

Transfer Learning Attack Prompt Injection multimodal vision nlp
PDF
attack arXiv Feb 14, 2026 · 7w ago

Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges

Ruomeng Ding, Yifei Pang, He Sun et al. · University of North Carolina at Chapel Hill · Carnegie Mellon University +2 more

Attacks LLM alignment pipelines by crafting benchmark-compliant rubric edits that systematically bias judge preferences and corrupt RLHF training

Transfer Learning Attack Prompt Injection nlp
PDF Code
benchmark arXiv Feb 6, 2026 · 8w ago

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Saad Hossain, Tom Tseng, Punya Syon Pandey et al. · Critical ML Lab · FAR.AI +6 more

Benchmark framework for evaluating LLM tamper resistance across 9 fine-tuning and weight-space attacks on 21 open-weight models

Transfer Learning Attack Prompt Injection nlp
1 citation PDF Code
defense arXiv Feb 5, 2026 · 8w ago

Surgery: Mitigating Harmful Fine-Tuning for Large Language Models via Attention Sink

Guozhi Liu, Weiwei Lin, Tiansheng Huang et al. · South China University of Technology · Pengcheng Laboratory +1 more

Defends LLM safety alignment during fine-tuning by regularizing attention sink divergence to prevent harmful pattern learning

Transfer Learning Attack nlp
PDF Code
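A speculative sketch in the spirit of the entry above; the actual Surgery method may differ. It penalizes divergence of the attention mass placed on the sink (first) token between the fine-tuned model and a frozen aligned reference, assuming HF-style models that can return attention weights via `output_attentions=True`.

```python
# Illustrative sketch: regularize attention-sink divergence from a frozen reference.
import torch
import torch.nn.functional as F

def sink_divergence_loss(model, ref_model, input_ids, labels, lam=0.1):
    out = model(input_ids, labels=labels, output_attentions=True)
    with torch.no_grad():
        ref = ref_model(input_ids, output_attentions=True)
    div = 0.0
    for attn, ref_attn in zip(out.attentions, ref.attentions):
        # Attention each query places on position 0 (the "sink"): [batch, heads, seq]
        div = div + F.mse_loss(attn[..., 0], ref_attn[..., 0])
    # lam trades off task loss against keeping the sink pattern of the aligned model.
    return out.loss + lam * div / len(out.attentions)
```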
attack arXiv Feb 5, 2026 · 8w ago

GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt

Mark Russinovich, Yanan Cai, Keegan Hines et al. · Microsoft

Uses GRPO reinforcement fine-tuning with a single prompt to strip safety alignment from LLMs and diffusion models, outperforming prior unalignment attacks

Transfer Learning Attack Prompt Injection nlp generative
PDF
defense arXiv Jan 31, 2026 · 9w ago

Towards Building Non-Fine-Tunable Foundation Models

Ziyao Wang, Nizhang Li, Pingzhi Li et al. · College Park · Macau University of Science and Technology +1 more

Defends open-source LLMs against unauthorized fine-tuning by hiding a sparse subnetwork mask, degrading adaptation without the key

Transfer Learning Attack Model Theft nlp
PDF
benchmark arXiv Jan 30, 2026 · 9w ago

Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning

Abhishek Mishra, Mugilan Arulvanan, Reshma Ashok et al. · University of Massachusetts Amherst

Benchmarks domain-level LLM misalignment susceptibility from insecure fine-tuning and backdoor triggers, ranking 11 domains from 0% to 87.67% vulnerability

Transfer Learning Attack Model Poisoning nlp
PDF Code
attack arXiv Jan 27, 2026 · 10w ago

LLMs Can Unlearn Refusal with Only 1,000 Benign Samples

Yangyang Guo, Ziwei Xu, Si Liu et al. · National University of Singapore · Beihang University

Fine-tunes LLMs on 1,000 benign samples with refusal prefixes to erase safety alignment across 16 models including GPT and Gemini

Transfer Learning Attack Prompt Injection nlp
PDF Code
benchmark arXiv Jan 21, 2026 · 11w ago

Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Anmol Goel, Cornelius Emde, Sangdoo Yun et al. · Parameter Lab · TU Darmstadt +3 more

Benign fine-tuning silently breaks contextual privacy in LLMs, causing inappropriate data disclosure undetected by standard safety benchmarks

Transfer Learning Attack Sensitive Information Disclosure nlp
PDF