Spectral Geometry of LoRA Adapters Encodes Training Objective and Predicts Harmful Compliance
Published on arXiv
2604.08844
Transfer Learning Attack
OWASP ML Top 10 — ML07
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Achieves AUC 1.00 for objective classification within DPO; weight-space drift correlates with harmful compliance at ρ = 0.72 (ASR 0.266 vs. 0.112 baseline)
Spectral LoRA fingerprinting
Novel technique introduced
We study whether low-rank spectral summaries of LoRA weight deltas can identify which fine-tuning objective was applied to a language model, and whether that geometric signal predicts downstream behavioral harm. In a pre-registered experiment on Llama-3.2-3B-Instruct, we manufacture 38 LoRA adapters across four categories (healthy SFT baselines, DPO on inverted harmlessness preferences, DPO on inverted helpfulness preferences, and activation-steering-derived adapters) and extract per-layer spectral features: norms, stable rank, singular-value entropy, effective rank, and singular-vector cosine alignment to a healthy centroid. Within a single training method (DPO), a logistic regression classifier achieves AUC 1.00 on binary drift detection and on all six pairwise objective comparisons, with near-perfect ordinal severity ranking (ρ ≥ 0.956). Principal component analysis on flattened weight deltas reveals that training objective is PC1 (AUC 1.00 for objective separation), orthogonal to training duration on PC2. Query-projection weights detect that drift occurred; value-projection weights identify which objective was applied. Cross-method generalization fails completely: a DPO-trained classifier assigns every steering adapter a lower drift score than every DPO adapter (AUC 0.00). In a behavioral evaluation phase, DPO-inverted-harmlessness adapters show elevated harmful compliance on HEx-PHI prompts (mean ASR 0.266 vs. healthy 0.112, Δ = +0.154), with near-perfect dose–response (ρ = 0.986). The geometry-to-behavior rank correlation is ρ = 0.72 across 24 non-steered adapters. These results establish that, within a controlled manufacturing regime, LoRA weight-space geometry carries objective identity, intensity ordering, and a coarse link to harmful compliance, and that cross-method monitoring requires per-method calibration.
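The per-layer spectral features named in the abstract (norms, stable rank, singular-value entropy, effective rank, and cosine alignment to a healthy centroid) can be sketched directly from a LoRA delta ΔW = B @ A. The formulas below are the standard definitions of these quantities; the paper's exact implementation and preprocessing are assumptions here.

```python
import numpy as np

def spectral_features(delta_w, healthy_centroid_v=None):
    """Spectral summary of one layer's LoRA weight delta (ΔW = B @ A).

    Sketch using common definitions: stable rank = ||ΔW||_F^2 / ||ΔW||_2^2,
    effective rank = exp(entropy of normalized singular values). The paper's
    implementation details are not reproduced here.
    """
    # Singular values of the (low-rank) delta; drop numerical zeros.
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    s = s[s > 1e-12]

    fro = float(np.sqrt(np.sum(s**2)))   # Frobenius norm
    spec = float(s[0])                   # spectral (operator) norm
    stable_rank = fro**2 / spec**2

    # Entropy of the normalized singular-value distribution.
    p = s / s.sum()
    sv_entropy = float(-np.sum(p * np.log(p)))
    eff_rank = float(np.exp(sv_entropy))

    feats = {
        "frobenius": fro,
        "spectral": spec,
        "stable_rank": stable_rank,
        "sv_entropy": sv_entropy,
        "effective_rank": eff_rank,
    }
    # Absolute cosine between the top right-singular vector and a
    # "healthy centroid" direction (hypothetical aggregation of baselines).
    if healthy_centroid_v is not None:
        v1 = vt[0]
        feats["centroid_cos"] = float(
            abs(v1 @ healthy_centroid_v)
            / (np.linalg.norm(v1) * np.linalg.norm(healthy_centroid_v))
        )
    return feats
```

Concatenating these per-layer dictionaries across layers yields the feature vector a downstream logistic regression would consume.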
Key Contributions
- Spectral features of LoRA weight deltas achieve AUC 1.00 in classifying fine-tuning objective within DPO method
- Training objective is PC1 in weight-delta space (orthogonal to training duration on PC2); query weights detect drift occurrence, value weights identify objective type
- Weight-space drift probability correlates with HEx-PHI attack success rate at ρ=0.72, establishing geometry-to-behavior link for harmful fine-tuning detection
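The PC1/PC2 decomposition in the second contribution can be sketched as PCA via SVD on the centered, flattened weight deltas. This is the textbook analysis step only; the paper's preprocessing (scaling, layer selection) is not specified here, so treat it as a minimal illustration.

```python
import numpy as np

def pca_scores(flat_deltas, n_components=2):
    """Project flattened weight deltas onto their top principal components.

    flat_deltas: (n_adapters, n_params) array. Returns (n_adapters,
    n_components) scores. If training objective dominates variance, it
    appears along PC1, with a second factor (e.g. duration) on PC2.
    """
    X = flat_deltas - flat_deltas.mean(axis=0, keepdims=True)
    # Rows of Vt are principal axes; U * S are the PC scores.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * S[:n_components]
```

On synthetic data where two adapter groups differ along one parameter direction much more than the noise scale, PC1 separates the groups cleanly, mirroring the paper's objective-is-PC1 observation.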
🛡️ Threat Analysis
The primary focus is detecting and characterizing malicious fine-tuning objectives (DPO on inverted preferences) that exploit the transfer-learning step from base model to LoRA adapter. The paper shows how different fine-tuning methods leave distinct weight-space signatures and how DPO on inverted safety preferences degrades alignment.
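The cross-method failure reported here (AUC 0.00) is a perfectly inverted ranking, not an uninformative one. A minimal pairwise-rank AUC, written from the standard definition rather than the paper's code, makes that distinction concrete:

```python
def auc(scores_neg, scores_pos):
    """Rank-based AUC: probability a positive outscores a negative,
    counting ties as half-wins."""
    wins = sum(p > n for p in scores_pos for n in scores_neg)
    ties = sum(p == n for p in scores_pos for n in scores_neg)
    return (wins + 0.5 * ties) / (len(scores_pos) * len(scores_neg))

# If every steering adapter (positive class, illustrative scores) falls
# below every DPO adapter (negative class), AUC is exactly 0.0: the
# detector's ranking is inverted, so per-method recalibration could in
# principle recover it, whereas a random ranking (AUC ~0.5) could not.
```

This is why the paper concludes that cross-method monitoring requires per-method calibration rather than a single universal threshold.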