
Spectral Geometry of LoRA Adapters Encodes Training Objective and Predicts Harmful Compliance

Roi Paul


Published on arXiv

2604.08844

Transfer Learning Attack

OWASP ML Top 10 — ML07

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves AUC 1.00 for objective classification within DPO; weight-space drift correlates ρ=0.72 with harmful compliance (ASR 0.266 vs 0.112 baseline)

Spectral LoRA fingerprinting

Novel technique introduced


We study whether low-rank spectral summaries of LoRA weight deltas can identify which fine-tuning objective was applied to a language model, and whether that geometric signal predicts downstream behavioral harm. In a pre-registered experiment on Llama-3.2-3B-Instruct, we manufacture 38 LoRA adapters across four categories (healthy SFT baselines, DPO on inverted harmlessness preferences, DPO on inverted helpfulness preferences, and activation-steering-derived adapters) and extract per-layer spectral features (norms, stable rank, singular-value entropy, effective rank, and singular-vector cosine alignment to a healthy centroid). Within a single training method (DPO), a logistic regression classifier achieves AUC 1.00 on binary drift detection, all six pairwise objective comparisons, and near-perfect ordinal severity ranking (ρ ≥ 0.956). Principal component analysis on flattened weight deltas reveals that training objective is PC1 (AUC 1.00 for objective separation), orthogonal to training duration on PC2. Query-projection weights detect that drift occurred; value-projection weights identify which objective. Cross-method generalization fails completely: a DPO-trained classifier assigns every steering adapter a lower drift score than every DPO adapter (AUC 0.00). In a behavioral evaluation phase, DPO-inverted-harmlessness adapters show elevated harmful compliance on HEx-PHI prompts (mean ASR 0.266 vs. healthy 0.112, Δ = +0.154), with near-perfect dose-response (ρ = 0.986). The geometry-to-behavior rank correlation is ρ = 0.72 across 24 non-steered adapters. These results establish that, within a controlled manufacturing regime, LoRA weight-space geometry carries objective identity, intensity ordering, and a coarse link to harmful compliance, and that cross-method monitoring requires per-method calibration.
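The per-layer features named in the abstract can be sketched directly from a LoRA update ΔW = B·A. The following is a minimal illustration, not the paper's code: function and key names are assumed, and the cosine-alignment-to-centroid feature is omitted for brevity.

```python
import numpy as np

def spectral_features(B: np.ndarray, A: np.ndarray, eps: float = 1e-12) -> dict:
    """Summarize a LoRA delta ΔW = B @ A by its singular spectrum.

    Features mirror those listed in the abstract: norms, stable rank,
    singular-value entropy, and effective rank (exp of that entropy).
    """
    delta = B @ A                              # (d_out, d_in) low-rank update
    s = np.linalg.svd(delta, compute_uv=False)
    s = s[s > eps]                             # drop numerically zero values
    fro2 = float(np.sum(s ** 2))               # squared Frobenius norm
    p = (s ** 2) / fro2                        # normalized spectral distribution
    entropy = float(-np.sum(p * np.log(p)))
    return {
        "frobenius_norm": float(np.sqrt(fro2)),
        "spectral_norm": float(s[0]),
        "stable_rank": fro2 / float(s[0] ** 2),
        "sv_entropy": entropy,
        "effective_rank": float(np.exp(entropy)),
    }

# Toy rank-4 adapter on a 64x64 projection layer.
rng = np.random.default_rng(0)
feats = spectral_features(rng.normal(size=(64, 4)), rng.normal(size=(4, 64)))
```

For a rank-r adapter, both stable rank and effective rank are bounded above by r, which is what makes them sensitive to how concentrated the update's energy is in a few directions.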


Key Contributions

  • Spectral features of LoRA weight deltas achieve AUC 1.00 in classifying fine-tuning objective within DPO method
  • Training objective is PC1 in weight-delta space (orthogonal to training duration on PC2); query weights detect drift occurrence, value weights identify objective type
  • Weight-space drift probability correlates with HEx-PHI attack success rate at ρ=0.72, establishing geometry-to-behavior link for harmful fine-tuning detection
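The geometry-to-behavior link in the third contribution is a Spearman rank correlation between per-adapter drift scores and HEx-PHI attack success rates. A minimal sketch of that statistic (illustrative only; assumes no tied values, which full implementations such as `scipy.stats.spearmanr` handle via average ranks):

```python
import numpy as np

def spearman_rho(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman ρ as the Pearson correlation of ranks (no tie handling)."""
    rx = np.argsort(np.argsort(x)).astype(float)  # rank of each x value
    ry = np.argsort(np.argsort(y)).astype(float)  # rank of each y value
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

# Perfectly monotone drift-score / ASR pairs give ρ = 1.0.
rho = spearman_rho(np.array([0.1, 0.3, 0.5, 0.9]),
                   np.array([0.05, 0.11, 0.20, 0.27]))
```

Because ρ is rank-based, the reported 0.72 says drift probability orders adapters by harmful compliance reasonably well without claiming any linear relationship between the two scales.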

🛡️ Threat Analysis

Transfer Learning Attack

The paper's primary focus is detecting and characterizing malicious fine-tuning objectives (DPO on inverted preferences) that survive or exploit the transfer of a base model's alignment into a LoRA adapter. It studies how different fine-tuning methods produce distinct weight-space signatures, and how DPO on inverted safety preferences degrades alignment.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time
Datasets
HEx-PHI, Llama-3.2-3B-Instruct
Applications
fine-tuning safety monitoring, alignment drift detection, malicious adapter identification